basic 'string' question


basic 'string' question

Talma
Hello list members,

I fear that because my question is so basic, I couldn't find any previous
discussion of it on this list.

What I just wanted to do is to count how often a specific term is used
across a number of answers to an open survey question, and to recode it in a
new, numeric variable.

In the SPSS file, the answers are in string format, e.g.:

VAR_X
I like vanilla ice.
I prefer chocolate ice cream.
I love strawberry ice cream and vanilla ice cream.

and so on.

Now I need to check how often the term "vanilla" is used across all answers
and to recode it to a new variable which takes on the value 1 if the term
vanilla is used and zero if not.

I used

Compute VAR_Z=0.
If Var_X = 'vanilla' VAR_Z = 1.
exe.

But this doesn't work.

Any ideas how to solve my problem?

Many thanks!
T.


--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: basic 'string' question

Rick Oliver
Try:

compute var_z=(char.index(lower(var_x), "vanilla"))>0.
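
The original IF compared the whole answer with 'vanilla', and that is only true when the answer consists of nothing but that word. CHAR.INDEX instead searches for the term anywhere inside the string. The same distinction can be sketched outside SPSS in Python; the function name contains_term and the sample answers are only illustrative, not part of the syntax above.

```python
def contains_term(answer, term):
    """Return 1 if term occurs anywhere in answer (ignoring case), else 0."""
    # str.find returns -1 when the term is absent (CHAR.INDEX returns 0 instead)
    return int(answer.lower().find(term.lower()) >= 0)


answers = [
    "I like vanilla ice.",
    "I prefer chocolate ice cream.",
    "I love strawberry ice cream and vanilla ice cream.",
]

# Whole-string equality: never true for these answers
equal_flags = [int(a == "vanilla") for a in answers]

# Substring search: one flag per answer, plus the total number of answers using the term
var_z = [contains_term(a, "vanilla") for a in answers]
total = sum(var_z)
```

Summing the 0/1 flags gives the number of answers that mention the term, which is the count asked for in the original question.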


Re: basic 'string' question

Talma
Dear Rick,

many thanks, your suggestion worked indeed!

However, as a follow-up question, may I ask how I might extend the syntax to
convert multiple words to the numeric value '1' in the new variable var_z?
For example, I might need to identify sentences containing the term
'vanilla', but also those containing the terms 'chocolate' or 'strawberry'?

Best,
Talma




Re: basic 'string' question

Jon Peck
You can generalize Rick's syntax like this.
compute var_z=char.index(lower(var_x), "vanilla") > 0 or char.index(lower(var_x), "chocolate") >0.

But if you have a lot of conditions to check, this gets unwieldy.  It also does not handle words like
creamery, i.e., longer words that merely contain the word you are looking for (a search for cream would match creamery).

A more general framework can easily be accommodated, but more information is needed on the real problem first.
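
Both caveats (many terms, and matches inside longer words) can be illustrated with a short Python sketch. It uses a regular expression with word boundaries; this is one possible approach under that assumption, not the framework referred to above, and the name has_any_term is hypothetical.

```python
import re


def has_any_term(answer, terms):
    """Return 1 if any term occurs in answer as a whole word (ignoring case)."""
    if not terms:
        return 0
    # \b anchors each term at word boundaries, so "cream" does not match "creamery"
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return int(re.search(pattern, answer, flags=re.IGNORECASE) is not None)


flavours = ["vanilla", "chocolate", "strawberry"]
flag_a = has_any_term("I prefer Chocolate ice cream.", flavours)
flag_b = has_any_term("We visited the creamery.", ["cream"])
flag_c = has_any_term("I like ice cream.", ["cream"])
```

The term list can grow without the expression itself becoming harder to read, which is the point where chained CHAR.INDEX conditions get unwieldy.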


Re: basic 'string' question

Talma
Dear Jon,  

many thanks for your example, which was already very useful – and you are
right, the real problem refers to many more terms…

Specifically, I’d like to analyse comments from a social media site using
freely available dictionaries that count certain terms contained in the
comments. These terms are identified with certain emotions.
For example, a post containing the adjective “angry” could be classified as
belonging to the category “anger” and so on (for this illustration, ignore
the multiple problems associated with this approach, such as negations
etc.).

However, such dictionaries (often available in *.txt or *.csv format, which
can be changed) easily contain several thousand terms, and requesting each
term separately would indeed become unwieldy :). For illustration, here’s
a sample (the first 40 words) of a similar dictionary (not just
adjectives) taken from

;   Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews."
;       Proceedings of the ACM SIGKDD International Conference on Knowledge
;       Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle,
;       Washington, USA,

abound
abounds
abundance
abundant
accessable
accessible
acclaim
acclaimed
acclamation
accolade
accolades
accommodative
accomodative
accomplish
accomplished
accomplishment
accomplishments
accurate
accurately
achievable
achievement
achievements
achievible
acumen
adaptable
adaptive
adequate
adjustable
admirable
admirably
admiration
admire
admirer
admiring
admiringly
adorable
adore
adored
adorer
adoring

***

If there are any more general options in SPSS syntax for finding out
whether a string variable contains one of the terms above or not, it would be
extremely helpful if you could share your thoughts here in this already
super-helpful forum...

Many thanks & regards!!
Talma

 





Re: basic 'string' question

Jon Peck
Here is a solution using the SPSSINC TRANS extension command.  It is normally installed with Statistics, but if you don't already have it, you can install it from the Extensions menu or, in older versions, the Utilities menu.

First you define a dataset of words - I called it lookup - and make sure that your main dataset is active.
data list fixed/words(a30).
begin data
abound
abounds
abundance
abundant
accessable
accessible
acclaim
acclaimed
acclamation
accolade
accolades
accommodative
accomodative
accomplish
accomplished
accomplishment
accomplishments
accurate
accurately
achievable
achievement
achievements
achievible
acumen
adaptable
adaptive
adequate
adjustable
admirable
admirably
admiration
admire
admirer
admiring
admiringly
adorable
adore
adored
adorer
adoring
end data
dataset name lookup.

data list fixed/text(a50).
begin data
adorer
adoring
end data
dataset name main.
dataset activate main.

Next you define a Python class for use with SPSSINC TRANS.  It reads the lookup dataset and creates a set containing the words, ignoring case.  It also creates a function, func, that will be called for each case in the main dataset.  func splits the indicated variable's value at each blank and checks whether any of the resulting words appears in the set (ignoring case).  In this example, the strings to check are in a variable named text.

begin program.
import spss  # needed for StartDataStep/Dataset below

class vlookup(object):
    """Check values according to a dictionary specified as an SPSS dataset"""
    def __init__(self, dataset):
        """dataset is the name of a dataset of words

        Lookups are made after trimming any trailing blanks and ignoring case
        The class creates a function named func that can be referenced for lookups"""

        spss.StartDataStep()
        try:
            ds = spss.Dataset(dataset)
            cases = ds.cases
            # Build a set of lowercased dictionary words for fast membership tests
            self.table = set()
            for i in range(len(cases)):
                self.table.add(cases[i, 0][0].rstrip().lower())

            def func(x):
                # Split the text at whitespace; True as soon as one word is in the set
                x = x.rstrip().split()
                for word in x:
                    if word.lower() in self.table:
                        return True
                return False
            self.func = func
        finally:
            spss.EndDataStep()
end program.

This is the call to invoke all this.  It first creates the word set from the named dataset and then processes a variable named text for each case.  The result is a 1 or 0 (true or false) for each case according to whether any word in text is found in the lookup set.

spssinc trans result=hasword
/initial "vlookup('lookup')"
/formula "func(text)".
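
The lookup logic can be tried outside SPSS with a small standalone Python sketch. It also shows one thing to watch for with real comments: splitting at blanks leaves punctuation attached, so "adoring." would not match "adoring". The names build_table and has_word and the strip_punct option are assumptions for illustration; they are not part of SPSSINC TRANS or of the class above.

```python
import string


def build_table(words):
    """Lowercased, blank-trimmed set of dictionary words."""
    return {w.strip().lower() for w in words if w.strip()}


def has_word(text, table, strip_punct=True):
    """True if any blank-separated word of text is in table."""
    for token in text.split():
        token = token.lower()
        if strip_punct:
            # Drop leading/trailing punctuation such as "adoring." or "(adorer)"
            token = token.strip(string.punctuation)
        if token in table:
            return True
    return False


table = build_table(["abound", "adorer", "adoring "])
with_strip = has_word("She is ADORING.", table)
without_strip = has_word("She is ADORING.", table, strip_punct=False)
```

Whether punctuation should be stripped depends on the dictionary being used, so it is left as an option here.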

Regards,
Jon



Re: basic 'string' question

David Marso
Administrator
In reply to this post by Talma
Here is an approach which uses standard SPSS syntax ;-)
--
DATA LIST /word (A30).
BEGIN DATA
abound
abounds
abundance
abundant
accessable
accessible
acclaim
acclaimed
acclamation
accolade
accolades
accommodative
accomodative
accomplish
accomplished
accomplishment
accomplishments
accurate
accurately
achievable
achievement
achievements
achievible
acumen
adaptable
adaptive
adequate
adjustable
admirable
admirably
admiration
admire
admirer
admiring
admiringly
adorable
adore
adored
adorer
adoring
END DATA.
DATASET NAME Lookup.





DATA LIST /phrase (A200).
BEGIN DATA
data to evaluate goes here or GET FILE.....
END DATA.


DATASET NAME rawdata.
COMPUTE LineNumber=$CASENUM.
COMPUTE phrase=CONCAT(LTRIM(LOWER(phrase))," ").
SET MXLOOP=100000.
STRING Word (A30).
LOOP.
COMPUTE #=CHAR.INDEX(phrase," ").
DO IF # GT 0.
COMPUTE Word=CHAR.SUBSTR(phrase,1,#-1).
COMPUTE phrase=CHAR.SUBSTR(phrase,#+1).
XSAVE OUTFILE "C:\TEMP\parsedwords.sav" /KEEP LineNumber Word.
END IF.
END LOOP IF #=0.
EXECUTE.
GET FILE "C:\TEMP\parsedwords.sav".
SORT CASES BY Word.
MATCH FILES /FILE * /TABLE=LOOKUP /IN=InDictionary/BY Word.
AGGREGATE OUTFILE * /BREAK Word /WordCount=SUM(InDictionary).
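
The idea behind the syntax above (one row per word, a table match against the lookup list, then a sum per word) can be sketched in Python as follows. This is only an illustration of the logic, not a translation of the SPSS job; the function name word_counts and the sample phrases are made up.

```python
from collections import Counter


def word_counts(phrases, lookup):
    """Count, per distinct word, how often it occurs as a dictionary word."""
    lookup_set = {w.lower() for w in lookup}
    # One (line_number, word) row per word, like the parsed-words file
    rows = [
        (line_number, word)
        for line_number, phrase in enumerate(phrases, start=1)
        for word in phrase.lower().split()
    ]
    counts = Counter()
    for _, word in rows:
        # Adds 1 for dictionary words and 0 otherwise, like summing an in-dictionary flag
        counts[word] += int(word in lookup_set)
    return dict(counts)


result = word_counts(["adorer adoring", "Adoring fans"], ["adorer", "adoring"])
```

Words that are not in the lookup list still appear in the result with a count of 0, which mirrors summing a 0/1 match flag per word.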












-----
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Re: basic 'string' question

Jon Peck
In reply to this post by Jon Peck
The extendedTransforms.py module has two functions similar to the solution I posted.
vlookup looks up values in a Python dictionary constructed from an SPSS dataset.  It differs from the posted solution in taking a key and returning an associated value.

vlookupinterval is similar but instead of an exact key match, it finds a value in a set of intervals and returns the associated value.
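
The difference between the two kinds of lookup can be sketched in plain Python. The following is not the code or the signatures from extendedTransforms.py; make_lookup and make_interval_lookup are hypothetical helpers, and the emotion and age-band tables are invented purely for illustration.

```python
import bisect


def make_lookup(pairs, default=None):
    """Exact-key lookup: return the value associated with a key."""
    table = dict(pairs)

    def func(key):
        return table.get(key, default)
    return func


def make_interval_lookup(lower_bounds, values, default=None):
    """Interval lookup: values[i] applies from lower_bounds[i] up to the next bound.

    lower_bounds must be sorted in ascending order.
    """
    def func(x):
        # Index of the last lower bound that is <= x, or -1 if x is below all bounds
        i = bisect.bisect_right(lower_bounds, x) - 1
        return values[i] if i >= 0 else default
    return func


emotion = make_lookup([("angry", "anger"), ("adoring", "joy")], default="none")
age_band = make_interval_lookup([0, 18, 65], ["child", "adult", "senior"])
```

For example, emotion("angry") returns "anger", while age_band(18) falls into the interval starting at 18 and returns "adult".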

These functions as well as many others in this module can be used with SPSSINC TRANS.  Here is a list of the contents.

subs:                         replace occurrences of a regular expression pattern with specified values
templatesub:                  substitute values in a template expression
levenshteindistance:          calculate similarity between two strings
soundex:                      calculate the soundex value of a string (a rough phonetic encoding)
nysiis:                       enhanced sound encoding (claimed superior to soundex for surnames)
soundexallwords:              calculate the soundex value for each word in a string and return a blank-separated string
median:                       median of a list of values
mode:                         mode of a list of values
multimode:                    up to n modes of a list of values
matchcount:                   compare value with list of values and count matches using
                                  standard or custom comparison function
strtodatetime:                convert a date/time string to an SPSS datetime value using a pattern
datetimetostr:                convert an SPSS date/time value to a string using a pattern
lookup:                       return a value from a table lookup
vlookup:                      return a value from a table lookup (more convenient than lookup with SPSSINC TRANS)
vlookupinterval:              return a value from a table lookup using intervals
sphDist:                      calculate distance between two points on earth using spherical approximation
ellipseDist:                  calculate distance between two points on earth using ellipsoidal approximation
jaroWinkler                   calculate Jaro-Winkler string similarity measure
extractDummies                extract a set of binary variables from a value coded in powers of 2
packDummies                   pack a sequence of numeric and/or string values into a single float
translatechar                 map characters according to a conversion table
countWkdays                   count number of days between two dates that are not excluded
vlookupgroupinterval          return a value associated with a group and a set of intervals for that group
countDaysWExclusions          count days in interval excluding specified weekdays and other dates
DiceStringSimilarity          compare strings using Dice bigram metric.
Dictdict                      find best match of strings using Dice metric
setRandomSeed                 initialize random number generator
invGaussian                   inverse Gaussian distribution random numbers
triangular                    triangular random numbers

On Mon, May 7, 2018 at 6:33 AM, William Dudley <[hidden email]> wrote:
Jon,

This is terrific.
I have a project for which this method will be very useful.

Bill



Re: basic 'string' question

bdates

David,


When I run your syntax with the words in the file rawdata that Jon supplied, adorer and adoring, I get the following message:

Warning # 10954
The AGGREGATE command has produced an output file which has no cases -
probably as the result of a SELECT IF or WEIGHT command.

The parsedwords file has no words identified.



Brian

From: SPSSX(r) Discussion <[hidden email]> on behalf of Jon Peck <[hidden email]>
Sent: Monday, May 7, 2018 12:54:03 PM
To: [hidden email]
Subject: Re: basic 'string' question
 
The extendedTransforms.py module has two similar functions to the solution I posted.
vlookup looks up values in a Python dictionary constructed from an SPSS dataset.  It differs from the posted solution in taking a key and returning an associated value.

vlookupinterval is similar but instead of an exact key match, it finds a value in a set of intervals and returns the associated value.

These functions as well as many others in this module can be used with SPSSINC TRANS.  Here is a list of the contents.

subs:                         replace occurrences of a regular expression pattern with specified values
templatesub:                  substitue values in a template expression
levenshteindistance:          calculate similarity between two strings
soundex:                      calculate the soundex value of a string (a rough phonetic encoding)
nysiis:                       enhanced sound encoding (claimed superior to soundex for surnames)
soundexallwords:              calculate the soundex value for each word in a string and return a blank-separated string
median:                       median of a list of values
mode:                         mode of a list of values
multimode:                    up to n modes of a list of values
matchcount:                   compare value with list of values and count matches using
                                  standard or custom comparison function
strtodatetime:                convert a date/time string to an SPSS datetime value using a pattern
datetimetostr:                convert an SPSS date/time value to a string using a pattern
lookup:                       return a value from a table lookup
vlookup:                      return a value from a table lookup (more convenient than lookup w SPSSINC TRANS)
vlookupinterval:              return a value from a table lookup using intervals
sphDist:                      calculate distance between two points on earth using spherical approximation
ellipseDist:                  calculate distance between two points on earth using ellipsoidal approximation
jaroWinkler                   calculate Jaro-Winkler string similarity measure
extractDummies                extract a set of binary variables from a value coded in powers of 2
packDummies                   pack a sequence of numeric and/or string values into a single float
translatechar                 map characters according to a conversion table
countWkdays                   count number of days between two dates that are not excluded
vlookupgroupinterval          return a value associated with a group and a set of intervals for that group
countDaysWExclusions          count days in interval exclusing specificied weekdays and other dates
DiceStringSimilarity          compare strings using Dice bigram metric.
Dictdict                      find best match of strings using Dice metric
setRandomSeed                 initialize random number generator
invGaussian                   inverse Gaussian distribution random numbers
triangular                    triangular random numbers

On Mon, May 7, 2018 at 6:33 AM, William Dudley <[hidden email]> wrote:
Jon,

This is terrific.
I have a project for which this method will be very useful.

Bill


On Sun, May 6, 2018 at 3:42 PM, Jon Peck <[hidden email]> wrote:
Here is a solution using the SPSSINC TRANS extension command.  It is normally installed with Statistics, but if you don't already have it you can install it from the Extensions menu or in older versions Utilities.

First you define a dataset of words - I called it lookup - and make sure that your main dataset is active.
data list fixed/words(a30).
begin data
abound
abounds
abundance
abundant
accessable
accessible
acclaim
acclaimed
acclamation
accolade
accolades
accommodative
accomodative
accomplish
accomplished
accomplishment
accomplishments
accurate
accurately
achievable
achievement
achievements
achievible
acumen
adaptable
adaptive
adequate
adjustable
admirable
admirably
admiration
admire
admirer
admiring
admiringly
adorable
adore
adored
adorer
adoring
end data
dataset name lookup.

data list fixed/text(a50).
begin data
adorer
adoring
end data
dataset name main.
dataset activate main.

Next you define a Python class for use with SPSSINC TRANS.  It reads the lookup dataset and creates a set containing the words ignoring case.  It also creates a function, func, that will be called for each case in the main dataset.  func splits the indicated variable's value at each blank and checks whether it appears in the set (ignoring case).  In this example, the strings to check are in a variable named text.

begin program.
import spss

class vlookup(object):
    """Check values according to a dictionary specified as an SPSS dataset"""
    def __init__(self, dataset):
        """dataset is a dataset of words

        Lookups are made after trimming any trailing blanks and ignoring case
        The class creates a function named func that can be referenced for lookups"""

        spss.StartDataStep()
        try:
            ds = spss.Dataset(dataset)
            cases = ds.cases
            # build the lookup set from the first variable of every case
            self.table = set()
            for i in range(len(cases)):
                self.table.add(cases[i, 0][0].rstrip().lower())

            def func(x):
                # split the value at blanks and test each word against the set
                x = x.rstrip().split()
                for word in x:
                    if word.lower() in self.table:
                        return True
                return False
            self.func = func
        finally:
            spss.EndDataStep()
end program.

This is the call to invoke all this.  It first creates the word set from the named dataset and then processes a variable named text for each case.  The result is a 1 or 0 (true or false) for each case according to whether any word in text is found in the lookup set.

spssinc trans result=hasword
/initial "vlookup('lookup')"
/formula "func(text)".
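
For anyone who wants to try the matching logic outside Statistics, here is a minimal sketch in plain Python. It is not the posted class: build_table and has_word are hypothetical names, and the punctuation stripping is an added assumption. The posted func splits only at blanks, so a word followed directly by a period (for example "vanilla.") would not match unless punctuation is removed first.

```python
import string

def build_table(words):
    """Return a set of lowercased words with trailing blanks removed."""
    return {w.rstrip().lower() for w in words}

def has_word(text, table, strip_punct=True):
    """True if any blank-separated word in text is found in table."""
    for word in text.rstrip().split():
        word = word.lower()
        if strip_punct:
            # assumption: drop leading/trailing punctuation such as "." or ","
            word = word.strip(string.punctuation)
        if word in table:
            return True
    return False

table = build_table(["adorer   ", "Adoring", "vanilla"])
print(has_word("I like vanilla.", table))                     # True
print(has_word("I like vanilla.", table, strip_punct=False))  # False
print(has_word("sweetvanilla cream", table))                  # False
```

Because whole words are compared against the set, partial matches such as "sweetvanilla" are not counted.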

Regards,
Jon


On Sun, May 6, 2018 at 2:29 AM, Talma <[hidden email]> wrote:
Dear Jon, 

many thanks for your example, which was already very useful – and you are
right, the real problem refers to many more terms…

Specifically, I’d like to analyse comments from a social media site using
freely available dictionaries that count certain terms contained in the
comments. These terms are identified with certain emotions.
For example, a post containing the adjective “angry” could be classified as
belonging to the category “anger” and so on (for this illustration, ignore
the multiple problems associated with this approach, such as negations
etc.).
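
The dictionary approach described above can be sketched roughly as follows. This is only an illustration: LEXICON is a made-up three-term lexicon and categories is a hypothetical helper name, neither taken from any real dictionary file. Negations and similar problems are ignored, as in the description.

```python
# Hypothetical mini-lexicon: term -> emotion category (illustrative only)
LEXICON = {"angry": "anger", "furious": "anger", "happy": "joy"}

def categories(comment, lexicon=LEXICON):
    """Return the set of categories whose terms occur as whole words."""
    found = set()
    for word in comment.lower().split():
        # strip common punctuation attached to the word
        word = word.strip(".,!?;:\"'()")
        if word in lexicon:
            found.add(lexicon[word])
    return found

print(sorted(categories("I am so angry, but also happy!")))  # ['anger', 'joy']
```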

However, such dictionaries (often available in *.txt or *.csv format, which
can be changed) easily contain several thousand terms…and requesting each
term separately would indeed become unwieldy  : ). For illustration, here’s
a sample (first 40 words) of a similar dictionary (not just
adjectives) taken from

;   Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews."
;       Proceedings of the ACM SIGKDD International Conference on Knowledge
;       Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle,
;       Washington, USA,

abound
abounds
abundance
abundant
accessable
accessible
acclaim
acclaimed
acclamation
accolade
accolades
accommodative
accomodative
accomplish
accomplished
accomplishment
accomplishments
accurate
accurately
achievable
achievement
achievements
achievible
acumen
adaptable
adaptive
adequate
adjustable
admirable
admirably
admiration
admire
admirer
admiring
admiringly
adorable
adore
adored
adorer
adoring

***

In case there are any more general options to use SPSS syntax for finding out
whether a string variable contains one of the terms above or not, it would be
extremely helpful if you could share your thoughts here in this already
superhelpful forum...

Many thanks & regards!!
Talma









--
Jon K Peck
[hidden email]




--
William N. Dudley, PhD
Professor - Public Health Education
The School of Health and Human Sciences
The University of North Carolina at Greensboro
437-L Coleman Building
Greensboro, NC 27402-6170
See my research on
ResearchGate
VOICE 336.256 2475






Re: basic 'string' question

David Marso
Administrator
Good catch Brian.  I forgot that the padding gets trashed when running in
Unicode.
FIXED here.

DATA LIST /phrase (A200).
BEGIN DATA
adorer
adoring
END DATA.
DATASET NAME rawdata.
COMPUTE LineNumber=$CASENUM.
COMPUTE phrase=CONCAT(LTRIM(LOWER(phrase))," ").
SET MXLOOP=100000.
STRING Word (A30).
LOOP.
+  COMPUTE #=CHAR.INDEX(phrase," ").
+  DO IF # GT 0.
+   COMPUTE Word=CHAR.SUBSTR(phrase,1,#-1).
+     COMPUTE phrase=CHAR.SUBSTR(phrase,#+1).
+  ELSE.
+     IF (phrase NE "") Word=phrase.
+  END IF.
+  XSAVE OUTFILE "C:\TEMP\parsedwords.sav" /KEEP LineNumber Word.
END LOOP IF #=0.
EXECUTE.
GET FILE "C:\TEMP\parsedwords.sav".
SORT CASES BY Word.
MATCH FILES /FILE * /TABLE=LOOKUP /IN=InDictionary/BY Word.
AGGREGATE OUTFILE * /BREAK Word /WordCount=SUM(InDictionary).
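
As a cross-check on what the syntax above is meant to produce, here is a rough plain-Python sketch of the same three steps: parse each phrase into words, flag the words that are in the lookup list, and total the flags per word. The function names are hypothetical, and the sketch assumes the lookup words are available as a simple Python list rather than as the LOOKUP dataset.

```python
from collections import Counter

def parse_words(phrases):
    """Yield (line_number, word) pairs, one per blank-separated word."""
    for line_number, phrase in enumerate(phrases, start=1):
        for word in phrase.lower().split():
            yield line_number, word

def count_dictionary_words(phrases, lookup):
    """Per distinct word, count how many occurrences are in lookup."""
    lookup = {w.lower() for w in lookup}
    counts = Counter()
    for _, word in parse_words(phrases):
        # words outside the lookup list are kept with a count of 0
        counts[word] += 1 if word in lookup else 0
    return dict(counts)

result = count_dictionary_words(
    ["adorer", "adoring fans adorer"], ["adorer", "adoring"]
)
print(result)  # {'adorer': 2, 'adoring': 1, 'fans': 0}
```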



bdates wrote

> David,
>
> When I run your syntax with the words in the file rawdata that Jon
> supplied, adorer and adoring, I get the following message:
>
> Warning # 10954
> The AGGREGATE command has produced an output file which has no cases -
> probably as the result of a SELECT IF or WEIGHT command.
>
> The parsedwords file has no words identified.
>
> Brian
> ________________________________
> From: SPSSX(r) Discussion <SPSSX-L@.UGA> on behalf of Jon Peck <jkpeck@>
> Sent: Monday, May 7, 2018 12:54:03 PM
> To: SPSSX-L@.UGA
> Subject: Re: basic 'string' question
>





-----
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Re: basic 'string' question

bdates

David,


Thanks! This syntax is really useful.


Brian

From: SPSSX(r) Discussion <[hidden email]> on behalf of David Marso <[hidden email]>
Sent: Monday, May 7, 2018 1:53:34 PM
To: [hidden email]
Subject: Re: basic 'string' question
 

Re: basic 'string' question

nina
In reply to this post by Jon Peck
Hi
can this syntax be modified in such a way that only complete words are
analyzed? For example,
compute var_z=char.index(lower(var_x), "vanilla") > 0
identifies not only "vanilla", but also "sweetvanilla" or "vanillacream".

Suppose one is only interested in finding the term 'vanilla'. Is there any
subcommand for char.index to achieve this?

Many thanks for your response,
nina




Re: basic 'string' question

Jon Peck
The SPSSINC TRANS solution I posted already eliminates the false partial word matches.

On Sun, May 13, 2018 at 5:06 AM, nina <[hidden email]> wrote:
Hi,
Can this syntax be modified so that only complete words are matched? For
example,
compute var_z=char.index(lower(var_x), "vanilla") > 0.
flags not only "vanilla" but also "sweetvanilla" and "vanillacream".

Suppose one is only interested in finding the standalone word 'vanilla'. Is
there an option for char.index that achieves this?

Many thanks for your response,
nina




--
Jon K Peck
[hidden email]


Re: basic 'string' question

Maguin, Eugene
In reply to this post by nina
So, search for " vanilla ". However, that will fail if vanilla is the last word or, I suspect, the first word of the sentence, or more generally if there is any character other than a " " in either location.
Gene Maguin
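
The failure cases of the blank-delimited search can be sketched in plain Python. This is an illustration on ordinary Python strings only; the function names are hypothetical, and how SPSS pads or trims string values may change the picture there.

```python
def space_delimited_match(text, term):
    """Search for the term with a blank on each side."""
    return (" " + term + " ") in text.lower()

def padded_match(text, term):
    """Same search after adding a blank to both ends of the text."""
    return (" " + term + " ") in (" " + text.lower() + " ")

examples = ["I like vanilla ice.", "vanilla is best", "I like vanilla", "I like vanilla."]
for text in examples:
    print(text, space_delimited_match(text, "vanilla"), padded_match(text, "vanilla"))
```

Padding both ends rescues the first-word and last-word cases, but a term followed by punctuation ("vanilla.") is still missed, which is why splitting the text at word boundaries is the more robust route.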


Re: basic 'string' question

Jon Peck
Here is a more robust version of the SPSSINC TRANS usage. It handles things like 's and a terminal period on a string, in case that matters. As before, it is not fooled by matches within words.

I'm repeating the whole example, but the only substantive change is the way the input string is split.

data list fixed/words(a30).
begin data
abound
abounds
abundance
abundant
accessable
accessible
acclaim
acclaimed
acclamation
accolade
accolades
accommodative
accomodative
accomplish
accomplished
accomplishment
accomplishments
accurate
accurately
achievable
achievement
achievements
achievible
acumen
adaptable
adaptive
adequate
adjustable
admirable
admirably
admiration
admire
admirer
admiring
admiringly
adorable
adore
adored
adorer
adoring
end data.
dataset name lookup.

data list fixed/text(a50).
begin data
adorer
ADORER
adoring
admire.
admirer's
vanilla
inadequate
end data.
dataset name main.
dataset activate main.

begin program.
import spss, re

class vlookup(object):
    """Check values according to a dictionary specified as an SPSS dataset"""
    def __init__(self, dataset):
        """dataset is the name of a dataset whose first variable holds the words

        Lookups are made after breaking at word boundaries and ignoring letter case.
        The class creates a function named func that can be referenced for lookups"""

        spss.StartDataStep()
        try:
            ds = spss.Dataset(dataset)
            cases = ds.cases
            # Build a set of lowercased words with trailing blanks removed
            self.table = set()
            for i in range(len(cases)):
                self.table.add(cases[i, 0][0].rstrip().lower())

            def func(x):
                # Split the text into runs of word characters, so a final
                # period or an apostrophe ends the word before it
                for word in re.findall(r"\w+", x.lower()):
                    if word in self.table:
                        return True
                return False
            self.func = func
        finally:
            spss.EndDataStep()
end program.

spssinc trans result=hasword
/initial "vlookup('lookup')"
/formula "func(text)".
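
For readers who want to try the matching logic outside SPSS, here is a minimal standalone sketch of the same word-splitting lookup. It assumes a small hypothetical subset of the lookup words held in a Python set instead of an SPSS dataset, and the name has_listed_word is made up for illustration; it does not use the spss module.

```python
import re

# Hypothetical subset of the lookup dataset, already lowercased.
LOOKUP_WORDS = {"adorer", "adoring", "admire", "admirer", "adequate"}

def has_listed_word(text, table=LOOKUP_WORDS):
    """Return True if any whole word in text is in the lookup table."""
    # \w+ splits at anything that is not a letter, digit or underscore,
    # so "admire." yields "admire" and "admirer's" yields "admirer" and "s".
    for word in re.findall(r"\w+", text.lower()):
        if word in table:
            return True
    return False

samples = ["adorer", "ADORER", "adoring", "admire.", "admirer's", "vanilla", "inadequate"]
results = {text: has_listed_word(text) for text in samples}
for text, found in results.items():
    print(text, found)
```

With this subset, the first five sample strings match, while "vanilla" (not in the list) and "inadequate" (which only contains "adequate" inside a longer word) do not.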


On Sun, May 13, 2018 at 1:44 PM, Maguin, Eugene <[hidden email]> wrote:
So, search for " vanilla ". However, that will fail if vanilla is the last word or, I suspect, the first word of the sentence, or more generally if there is any character other than a " " in either location.
Gene Maguin


--
Jon K Peck
[hidden email]
