How can I compare two columns of Text of different length?


ljttet
Thank you very much for your help!
I am using SPSS 24.
I have two groups of people writing definitions for the same course. The definitions differ in length and use different words, although some keywords may match.
I want to find out how similar each row of one column is to each row of the other column.
If they are similar, do they match on one keyword or three keywords?
Can I get a count of matches for each category, for example how many rows have three matching keywords, or how many rows have five matching keywords?


Re: How can I compare two columns of Text of different length?

Art Kendall
Do you have pairs of responses? Or do you have a single column and two groups of cases?

Please create a small subset of your data so we can better understand how it is set up.
Then create a syntax file that says
DISPLAY DICTIONARY.


Copy that output and paste it into a reply on this list.
Art Kendall
Social Research Consultants

Re: How can I compare two columns of Text of different length?

ljttet
For example, one column says:

Examines the psychological development of individuals moving from their early twenties into old age.

The other says:

Developmental and Child Psychology

Here the word "Psychology" may be counted as a match.
There are over 40,000 cases in both columns.
But we don't know in advance which words appear in each row.


 
 

Re: How can I compare two columns of Text of different length?

David Marso
Administrator
But psychological  and Psychology are *NOT* the same word.
How do you propose to resolve this?
--
Some ideas.

1. SPLIT the strings into two VECTORS (search this archive for Parse).

Two alternatives.
2a. Take these vectors from wide to long using VARSTOCASES.
3a. Do a Cartesian merge of the two vectors (search archives for this).

2-3b.  Compare the two vectors with a nested LOOP.

4ab. Decide whether the various substrings should be considered the same by applying an appropriate distance function (see the archives; this has been discussed).

5ab. Enumerate matches with AGGREGATE and merge to the original file.
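
The steps above can be roughed out as follows. This is only a sketch in Python rather than SPSS syntax, and it assumes the two texts sit side by side in the same row. The function names, the 0.7 threshold, and the use of `difflib.SequenceMatcher` as the distance function are all illustrative assumptions, not something settled in this thread.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def split_words(text):
    """Step 1: split a string into a list of lowercase words."""
    return re.findall(r"[a-z]+", text.lower())


def same_word(a, b, threshold=0.7):
    """Step 4: call two words the same when their similarity ratio
    reaches the (assumed, arbitrary) threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def count_matches(text_a, text_b, threshold=0.7):
    """Steps 2-3b: nested loop over the two word lists.
    Counts the distinct words of text_b matched by some word of text_a."""
    words_a = set(split_words(text_a))
    words_b = set(split_words(text_b))
    matches = 0
    for wb in words_b:
        for wa in words_a:
            if same_word(wa, wb, threshold):
                matches += 1
                break  # count each word of text_b at most once
    return matches


def tabulate(pairs, threshold=0.7):
    """Step 5: how many rows have 0, 1, 2, ... matching words."""
    return Counter(count_matches(a, b, threshold) for a, b in pairs)
```

With the example pair from this thread and the assumed threshold of 0.7, `count_matches` returns 2 (development/developmental and psychological/psychology). Setting `threshold=1.0` restricts it to exact matches. Common words are not filtered here, so they would count as matches when they appear on both sides.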

Sorry for lack of specific detail but I'm slammed.  Maybe this will get the ball rolling in the right direction.
HTH.
--

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Re: How can I compare two columns of Text of different length?

Jon Peck
In reply to this post by ljttet
You should really have full-fledged text analysis software to handle stemming of word forms, but a rough approach could be carried out if you can provide more details. For example, how do you deal with strings where the same word appears more than once? Presumably, you would also want to ignore common words such as a, the, if, and, but, ... and perhaps do something rough about plurals, such as always ignoring a final s. Similarity of words could be an exact match or, say, within a few keystrokes. If you wanted to make lists of word forms for important words, that could also be addressed. Are you always comparing two variables in the same case, or is cross-case comparison required?
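
To make the rough approach concrete, here is a sketch in Python (not SPSS syntax). The stop-word list, the length guard on the plural rule, and the function names are assumptions chosen for illustration. "Within a few keystrokes" is interpreted here as Levenshtein edit distance, which is one possible reading.

```python
import re

# Assumed, deliberately short stop-word list; extend as needed.
STOP_WORDS = {"a", "an", "the", "if", "and", "but", "of", "or", "in", "to"}


def normalize(text):
    """Lowercase, drop stop words, and crudely strip a final 's'."""
    result = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOP_WORDS:
            continue
        # Rough plural rule; the length and 'ss' guards are assumptions
        # added so that short words and words like 'class' are left alone.
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]
        result.append(word)
    return result


def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def words_match(a, b, max_edits=0):
    """Exact match when max_edits is 0, otherwise within max_edits keystrokes."""
    return edit_distance(a, b) <= max_edits
```

Note that "psychology" and "psychological" are four edits apart, so a tolerance of one or two keystrokes would not treat them as the same word. That is where stemming or a hand-made list of word forms would come in.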

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/How-can-I-compare-two-columns-of-Text-of-different-length-tp5734133p5734135.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD



--
Jon K Peck
[hidden email]


Re: How can I compare two columns of Text of different length?

Rich Ulrich

My first thought, too, was "text analysis software." Without that, the first step must be "Fix the spelling." Then I would equate synonyms (unless you care about these distinctions; keeping them distinct will lengthen the list of 'important' words).

I think I would start with VarsToCases, drop [a, and, the, ...], and aggregate to count. I would start with, say, 1000 cases, to keep the first results to a more readable length. That helps to check spelling and synonyms, and might show other ambiguities.

Then I would probably base my comparisons on words or partial words, taking the top 50 or so most relevant words. 100? More? Fewer? Be guided by the counts.

Then: strip each list to its relevant words; cross-compare; compute a coefficient of some sort for the similarity.
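
A small Python sketch of that workflow, for illustration only. The stop-word list and function names are assumptions, "most relevant" is approximated by "most frequent", and the Jaccard index is just one choice for "a coefficient of some sort".

```python
import re
from collections import Counter

# Assumed stop-word list; the thread only names a few of these.
STOP_WORDS = {"a", "an", "and", "the", "of", "or", "in", "to"}


def tokenize(text):
    """Lowercase words with the stop words dropped."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]


def top_words(texts, n=50):
    """Count in how many cases each word occurs and keep the n most common."""
    counts = Counter()
    for text in texts:
        counts.update(set(tokenize(text)))  # each word counted once per case
    return [word for word, _ in counts.most_common(n)]


def jaccard(text_a, text_b, vocabulary):
    """Strip both texts to the relevant words, then compute
    |shared words| / |all words used by either text|."""
    vocab = set(vocabulary)
    a = set(tokenize(text_a)) & vocab
    b = set(tokenize(text_b)) & vocab
    union = a | b
    if not union:
        return 0.0  # neither text uses any relevant word
    return len(a & b) / len(union)
```

Running `top_words` on a subset of about 1000 cases first gives a list short enough to read through for spelling variants and synonyms before it is used as the vocabulary for `jaccard`.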


--

Rich Ulrich



Re: How can I compare two columns of Text of different length?

ljttet
Thank you, Rich!
I will try.
Jun
