# Jaccard's Coefficient- Data Preparation

12 messages

## Jaccard's Coefficient- Data Preparation

Hi there, I have binary data on certain behaviours that occurred in several series of criminal offences. I'm looking to use Jaccard's coefficient to get a similarity measure for each of the series in my sample. However, I'm not sure how to prepare my data for this. I have a number of variables in each series of cases, so do I need to run the analysis variable by variable? For example, in a series of four offences the offender may have stolen in offence one, murdered in offence two and stolen again in offences three and four. How would I present this data to SPSS? Thanks in advance

## Re: Jaccard's Coefficient- Data Preparation

The only Jaccard coefficient I am familiar with, https://en.wikipedia.org/wiki/Jaccard_index, takes two sets. It is simply the size of the intersection of the sets divided by the size of their union. So I wouldn't know how to get the Jaccard coefficient for your simplified example: you need a second set. It may also be easier to start with how you have the data now and your desired end result. In general I imagine I would use Python and its set functionality to do this, but I would need more info on your data to sketch out a more explicit solution.

Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/
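As a rough illustration of the set-based definition above, here is a minimal Python sketch. The behaviour names and the two example sets are hypothetical and not taken from the poster's data, and returning `None` for two empty sets is just one possible convention.

```python
def jaccard(set_a, set_b):
    """Jaccard index: size of the intersection divided by size of the union."""
    union = set_a | set_b
    if not union:
        # Undefined when both sets are empty; None is a convention chosen here.
        return None
    return len(set_a & set_b) / len(union)


# Hypothetical behaviours recorded in two offences (illustrative only).
offence_1 = {"theft", "weapon", "disguise"}
offence_2 = {"theft", "disguise", "forced_entry"}

similarity = jaccard(offence_1, offence_2)  # 2 shared / 4 distinct behaviours
print(similarity)
```

Each set holds the behaviours that were present in one offence, so absent behaviours never enter the calculation.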

## Re: Jaccard's Coefficient- Data Preparation

"Two sets" is how I see it. For the example cited, I see two distinct types of offense. If that is the starting point, then you would want a list of variables representing the possible offenses, scored as Yes=1, No=0. For two lists, you would count the number of variables that are 1, either for one subject only or for both subjects. Does this define what you need?

To compare subjects, you probably have to FLIP the file. A matrix of correlations would then give one index of similarity for each pair of subjects, which, along with the counts, could be manipulated to get Jaccard's index. Instead of the correlation, the flipped file could probably be picked up by MATRIX, where you could do some simple counting. Is that what you want?

-- Rich Ulrich

## Re: Jaccard's Coefficient- Data Preparation

Thanks so much to you both for getting back to me so quickly. To be a bit clearer, I have about twenty offence variables (I wish there were only two!). They are currently coded as binary, as below:

| Series | Variable 1 | Variable 2 | Variable 3 | Variable 4 |
|--------|------------|------------|------------|------------|
| 1      | 1          | 0          | 1          | 1          |
| 1      | 1          | 1          | 0          | 0          |
| 1      | 0          | 0          | 0          | 0          |
| 1      | 1          | 0          | 0          | 0          |
| 2      | 0          | 1          | 1          | 0          |
| 2      | 1          | 0          | 1          | 0          |
| 2      | 1          | 1          | 0          | 0          |

And so on (I have 70 series containing about 280 offences, plus a matching control group). If I understand Jaccard's coefficient correctly, I have to analyse variable by variable, so if I were looking at V1, Series 1 would be '1,1,0,1' and Series 2 '0,1,1'. I have had a look at how to do the analysis by hand and I think I understand it (for Variable 1 the coefficients would be: Series 1 = 0.33, Series 2 = 0.50? Please tell me if I'm wrong, I'm not a natural mathematician!). However, with such a large number of samples I can't really do it all by hand, variable by variable and series by series. I'm not sure of the best way to arrange the data for SPSS, though I have tried several ways and been unable to make sense of the agglomeration schedule. If anyone can help it would be much appreciated :)
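To make the "variable by variable, series by series" arrangement concrete, here is a small Python sketch that groups the seven rows shown in the post by series and pulls out one variable's sequence. The function name `sequences_by_series` is made up for illustration, and the sketch assumes the rows are already in offence order within each series.

```python
from collections import defaultdict

# Rows from the post: (series, variable 1, variable 2, variable 3, variable 4)
rows = [
    (1, 1, 0, 1, 1),
    (1, 1, 1, 0, 0),
    (1, 0, 0, 0, 0),
    (1, 1, 0, 0, 0),
    (2, 0, 1, 1, 0),
    (2, 1, 0, 1, 0),
    (2, 1, 1, 0, 0),
]


def sequences_by_series(rows, variable_index):
    """Collect one variable's 0/1 values per series, keeping row order."""
    grouped = defaultdict(list)
    for series, *values in rows:
        grouped[series].append(values[variable_index])
    return dict(grouped)


v1 = sequences_by_series(rows, 0)  # variable 1 is at index 0
print(v1)
```

For variable 1 this gives the sequences 1,1,0,1 for Series 1 and 0,1,1 for Series 2, matching the ones written out in the post.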

## Re: Jaccard's Coefficient- Data Preparation

Please give a little more detail. What questions are you using the data to explore? Consistency of coding across coders? Arrests for the same offenses?

What is a series? An individual arrest with offenses coded by several people? What are the variables? Are they 20 offenses that are charged or not charged? Why does series 1 have 4 lines and series 2 have 3 lines?

Jaccard's coefficient (most commonly) can be used to find groups/piles of series (arrests?) that contain the same pattern of offenses.

Please clarify what you mean by "control group". Sometimes people just mean a group used for comparison; technically it means that cases (entities) were randomly assigned to conditions, treatments, etc.

Art Kendall
Social Research Consultants

## Re: Jaccard's Coefficient- Data Preparation

In case the OP/audience might take interest: square matrices of association measures for binary data (those that the PROXIMITIES command offers, and others) are also easily computed with the help of a simple function of mine, !bincnt, found in the "Matrix-End matrix" collection on http://www.spsstools.net/en/KO-spssmacros.

## Re: Jaccard's Coefficient- Data Preparation

Hi,

My research aim is to investigate the intra-series consistency of these offenders. Each series is made up of the crimes that the same offender has committed, with a range of variables for each offence. The series differ in length because some offenders have committed more offences in their series than others.

Basically I only want the consistency measure for each series. I am hoping to come up with a coefficient for the consistency of each variable across all of the series, so that I can say that (for example) this sample is consistent in their choice of approach type across a series of offences.

I was under the impression that the formula for Jaccard is $S_J = a / (a + b + c)$, a being the number of 'joint occurrences' per series and b and c the number of 'single non-joint occurrences' per series.

You're right in saying that the example is only a snippet of the data; I have many more variables that I need to test and they are all in a yes/no format.

Thanks
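For reference, the a / (a + b + c) formula quoted above can be sketched in Python for two binary vectors. This assumes the standard pairwise reading, where a counts positions in which both vectors are 1 and b and c count positions in which only one of them is 1; the function name `jaccard_binary` is made up, and the two vectors are the first two Series 1 rows from the earlier table.

```python
def jaccard_binary(x, y):
    """Jaccard coefficient a / (a + b + c) for two equal-length 0/1 vectors."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)  # joint occurrences (1/1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)  # present in x only (1/0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)  # present in y only (0/1)
    if a + b + c == 0:
        # Only joint non-occurrences (0/0): the coefficient is undefined.
        return None
    return a / (a + b + c)


offence_1 = [1, 0, 1, 1]  # first Series 1 row from the table earlier in the thread
offence_2 = [1, 1, 0, 0]  # second Series 1 row

print(jaccard_binary(offence_1, offence_2))  # a=1, b=2, c=1
```

Joint non-occurrences (0/0) are never counted, which is the property that usually motivates choosing Jaccard over simple matching.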

## Re: Jaccard's Coefficient- Data Preparation

If it helps, I took the analysis from Harbers, Deslauriers-Varin, Beauregard & van der Kemp (2012); extract below:

Statistical analyses

Previous studies investigating consistency for crime linkage purposes have often used the Jaccard’s coefficient (Bennell and Canter, 2002; Bennell and Jones, 2005; Bennell et al., 2009; Tonkin et al., 2008, Woodhams and Toye, 2007; Woodhams et al., 2008). This similarity coefficient (Jaccard, 1908) is suitable because it does not include joint nonoccurrences (0/0) of a specific behaviour in its measurement. This means that a specific behaviour will not automatically be consistent because it does not occur in most of the offences. Jaccard’s coefficient is calculated by dividing the number of behaviours shared by two offences (1/1) by the sum of the numbers of behaviours shared (1/1), and the number of behaviours present in one crime but not in the other (1/0 and 0/1). A value of 1 would mean that there is a total similarity on this particular behaviour across the series, and a value of 0 would indicate no similarity at all across the series.

An important advantage of using the Jaccard’s coefficient to measure consistency is that low frequencies of certain behaviours do not lead to high consistency scores. This is important in this type of research as the absence of a specific behaviour based on police records does not necessarily mean that the behaviour did not occur. However, one of the disadvantages is that the Jaccard’s coefficient is very sensitive to missing data (Bennell and Jones, 2005; Everitt et al., 2001; Woodhams and Toye, 2007).

Consistency of a variable was first measured for each offender within the offender’s series. The consistency score was measured by comparing the variable in each offence with the variable in the previous offence for the full length of the series. If both offences showed the behaviour (1/1), a value of 1 was awarded to the comparison. If the behaviour was present in one of the offences and absent in the other offence (0/1 or 1/0), a value of 0 was awarded to the comparison. If both offences did not show the behaviour (0/0), the comparison was left out of the measurement. To make sure that the results are not biased by the various lengths of the series, the total score was divided by the number of comparisons, the length of the series minus two.

For example, in the case where the offender has been linked to five sexual assaults, the use of a disguise during the assaults was scored as follows: absent, absent, present, present, and absent (00110). The first comparison (0/0) is left out of the measurement. The second comparison is awarded with a value of 0, the third comparison with a value of 1, and the last comparison with a value of 0 again. The score (1) divided by the number of comparisons (3) is 0.33. The consistency score for using a disguise by this offender in his or her series is 0.33. Thus, the information of all crimes within the series is used, whereas the results are not biased by undue weighting because of the length of the series.

Harbers, E., Deslauriers-Varin, N., Beauregard, E., & van der Kemp, J. J. (2012). Testing the behavioural and environmental consistency of serial sex offenders: A signature approach. Journal of Investigative Psychology and Offender Profiling, 9(3), 259-273.
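The consecutive-comparison procedure in the extract can be sketched in a few lines of Python. This is only one reading of the text, not the authors' own code: it divides by the number of comparisons that were kept (those not 0/0), which is what reproduces the 0.33 in the worked example. The function name `consistency_score` and the `None` return for a series with no usable comparisons are assumptions made here.

```python
def consistency_score(sequence):
    """Score one 0/1 variable across a series by comparing each offence
    with the previous one, as described in the quoted extract."""
    scores = []
    for previous, current in zip(sequence, sequence[1:]):
        if previous == 0 and current == 0:
            continue  # joint non-occurrence (0/0): left out of the measurement
        # 1/1 scores 1; 1/0 and 0/1 score 0
        scores.append(1 if previous == 1 and current == 1 else 0)
    if not scores:
        return None  # no usable comparisons; undefined under this reading
    return sum(scores) / len(scores)


print(round(consistency_score([0, 0, 1, 1, 0]), 2))  # disguise example from the extract
print(round(consistency_score([1, 1, 0, 1]), 2))     # Variable 1, Series 1 from the thread
print(round(consistency_score([0, 1, 1]), 2))        # Variable 1, Series 2 from the thread
```

Under this reading the disguise example scores 1/3, and the two Variable 1 sequences posted earlier in the thread score 1/3 and 1/2, which agrees with the hand calculation of 0.33 and 0.50.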

## Re: Jaccard's Coefficient- Data Preparation

Am I correct in understanding that "series" means an individual, that there are several events (arrest occasions?) for each individual, and that 20 behaviors are measured for each event?

My eSPSS reads your post as saying that you want to know to what degree, within an individual, events have similar profiles across the 20 behaviors.

Paste the following into a syntax window and run it. Is this what you are looking for?

    data list list
      / Id (n2) event (a2) Name (a20) ArrestDate (adate10)
        behavior1 behavior2 behavior3 behavior4 (4f1).
    begin data
    1 a 'John Doe' 10/10/1999 1 0 1 1
    1 b 'John Doe' 05/22/2001 0 1 0 0
    1 c 'John Doe' 12/25/2004 1 1 1 1
    1 d 'John Doe' 02/14/2005 0 0 0 0
    2 a 'Mary Poe' 11/11/2010 0 1 1 0
    2 b 'Mary Poe' 04/04/2011 1 1 0 0
    2 c 'Mary Poe' 05/04/2011 0 0 1 1
    end data.
    list.
    * Proximities wants a string variable as ID, here "event".
    * Split file gives one proximity matrix per Id.
    split file by Id.
    proximities behavior1 to behavior4
      /measure=JACCARD
      /id=event.

There are many coefficients for binary input data in PROXIMITIES. Check HELP to see whether additional coefficients would be of help. I cannot recall at this time whether the labels on the proximity matrix can be the arrest number.

Please reply to the list to say whether this is what you are looking for. If it is, perhaps there is a way to remove the case number in the complete file from the labels of the printed proximity matrix.

Art Kendall
Social Research Consultants
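For anyone who wants to check the idea outside SPSS, here is a rough Python analogue of running pairwise Jaccard separately within each Id, using the behavior values from the example data in the post above. It is a sketch of the calculation only and does not reproduce the PROXIMITIES output; the helper names are made up, and returning `None` when neither event shows any behavior is a convention chosen here, not a statement about what SPSS prints in that case.

```python
from itertools import combinations

# Behavior values from the example data above, keyed by Id and then by event.
events = {
    1: {"a": [1, 0, 1, 1], "b": [0, 1, 0, 0], "c": [1, 1, 1, 1], "d": [0, 0, 0, 0]},
    2: {"a": [0, 1, 1, 0], "b": [1, 1, 0, 0], "c": [0, 0, 1, 1]},
}


def jaccard_binary(x, y):
    """Positions where both are 1, divided by positions where at least one is 1."""
    both = sum(1 for p, q in zip(x, y) if p and q)
    either = sum(1 for p, q in zip(x, y) if p or q)
    return both / either if either else None


def pairwise_jaccard(series_events):
    """Jaccard coefficient for every pair of events within one series."""
    return {
        (first, second): jaccard_binary(series_events[first], series_events[second])
        for first, second in combinations(sorted(series_events), 2)
    }


result = {series_id: pairwise_jaccard(ev) for series_id, ev in events.items()}
for series_id, pairs in result.items():
    for pair, value in pairs.items():
        print(series_id, pair, value)
```

Keeping the comparison inside each Id mirrors the role of `split file by Id`: events are only ever compared with other events from the same individual.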