# IF statement problem or maybe a compute problem

13 messages

## IF statement problem or maybe a compute problem

 I want to carry out a few data validation checks on my data. I have 7 validation checks so far. It's check number E2 that I find weird. Any advice would be welcome.

```
string E1 E2 E3 E4 E5 E6 E7 (A1).
compute TFN = number(e416, F12.0).
if missing(n384) or (n384 eq 0) E1 = '1'.
if ((n384 - n381) NE n558) E2 = '2'.
if (n558 gt n384) E3 = '3'.
if any(e490, '401', '410') and ((missing(n558)) or (n558 EQ 0)) E4 = '4'.
if any(e490, '401', '410') and (e488 EQ 'ZZZZZZZZZZ') E5 = '5'.
if any(e490, '401', '410') and missing(TFN) E6 = '6'.
if any(e490, '501', '510') and (n558 GT 0) E7 = '7'.
select if (E1 EQ '1') OR (E2 EQ '2') OR (E3 EQ '3') OR (E4 EQ '4') OR (E5 EQ '5') OR (E6 EQ '6') OR (E7 EQ '7').
list vars = year sem E313 E307 e490 n384 n381 n529 n558 tfn e488 E1 E2 E3 E4 E5 E6 E7.
```

The output (abbreviated to the relevant columns):

```
e490     n384     n381     n529     n558  E1 E2 E3 E4 E5 E6 E7
230   2252.01  2000.00    63.00   252.01      2
410     38.00      .00      .00    38.00                  6
410     76.00      .00      .00    76.00                  6
410    190.00      .00      .00   190.00                  6
410     76.00      .00      .00    76.00                  6
410     95.00    38.00      .00    57.00                  6
```

What the output is saying in regard to E2 is that case 1 is in error and case 6 is fine. Sure, record 6 is good: 95 - 38 = 57, and the value of n558 is also 57, therefore no error is flagged. But look at case 1: 2252.01 - 2000 = 252.01, and the value of n558 is also 252.01 (i.e. identical values), yet E2 shows it as an error by displaying '2' against the record.

The following conversions were used because the data had been held as text, even though the values were numbers:

```
compute n384 = number(e384, F12.2).
compute n381 = number(e381, F12.2).
compute n529 = number(e529, F12.2).
compute n558 = number(e558, F12.2).
```

I don't know if that's relevant to the issue.
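For readers outside SPSS, the anomaly can be reproduced in plain Python (values hard-coded from the listing above): case 6 compares equal exactly, while case 1 leaves a tiny nonzero residual.

```python
# Case 6 from the listing: 95, 38, and 57 are all exactly representable
# as binary floating point, so the subtraction is exact.
assert (95.00 - 38.00) == 57.00

# Case 1: 2252.01 - 2000 "equals" 252.01 on paper, but neither 2252.01
# nor 252.01 has an exact binary representation, so the residual is
# tiny but nonzero and an exact NE test fires.
residual = (2252.01 - 2000.00) - 252.01
print(residual)        # a value on the order of 1e-13, not 0.0
assert residual != 0.0
```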

## Re: IF statement problem or maybe a compute problem

 Calculate the actual difference. Due to the nature of floating point arithmetic, numbers with a decimal part often do not have an exact representation in floating point. You would need a very small fuzz factor in the test.

-- Jon K Peck

=====================
To manage your subscription to SPSSX-L, send a message to LISTSERV@... (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L. For a list of commands to manage subscriptions, send the command INFO REFCARD.
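The advice above (calculate the actual difference rather than testing exact equality) looks like this in Python, using the values from the original post:

```python
n384, n381, n558 = 2252.01, 2000.00, 252.01

diff = (n384 - n381) - n558
print(diff)                      # tiny but nonzero, roughly 2.3e-13

# A small "fuzz factor" absorbs the representation noise:
FUZZ = 1e-9
is_error = abs(diff) > FUZZ      # False: the record is actually fine
assert not is_error
```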

## Re: IF statement problem or maybe a compute problem

 Jon, I've tried to find how the floating point numbers are represented internally in SPSS but couldn't find anything. Wouldn't it be a good idea to declare a print format for n384, n381, and n558 of, say, F20.15 and print out the case with e490 = 230? If what you say is true, we'd expect some number(s) beyond the 2nd decimal place, and printing more decimal values should reveal it. Though this seems reasonable to me, I think the error may be elsewhere, but it is hard to determine where because (a) there is distracting junk in the code (for example, in LIST: year, sem, tfn; there are no values for these), and/or (b) there may be syntax elsewhere that is the source of the problem.

By the way, why are there seven E vars instead of one E with seven values? The variables E1-E7 are going to be sparse unless a case can have more than one E variable set. It certainly confuses the output (even after putting it in Notepad; there is only one column).

-Mike Palij
New York University

## Re: IF statement problem or maybe a compute problem

 Got it licked! Trying to trick the system is tricky. My calculated value of zero actually equals 0.0000000000291, which is not the same as 0. Multiplying the values by 100 to remove the decimal component didn't really help: as you might surmise, even though 2252.01 x 100 = 225201, in SPSS it is represented as 225201.00000000003. However, when I added TRUNC everything worked out just fine. The following works:

```
if ((TRUNC(n384*100) - TRUNC(n381*100) - TRUNC(n558*100)) ne 0) E2 = '2'.
```
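The scale-and-truncate workaround can be mimicked in Python; here `math.trunc` plays the role of SPSS's TRUNC (note that SPSS's TRUNC also applies a fuzzbits adjustment, which plain `math.trunc` does not):

```python
import math

n384, n381, n558 = 2252.01, 2000.00, 252.01

# 2252.01 * 100 does not come out as exactly 225201:
print(n384 * 100)    # 225201.00000000003 on a typical IEEE 754 double
assert n384 * 100 != 225201

# Truncating the scaled values discards the stray low-order bits,
# so the cents-level comparison succeeds:
check = math.trunc(n384 * 100) - math.trunc(n381 * 100) - math.trunc(n558 * 100)
assert check == 0
```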

## Re: IF statement problem or maybe a compute problem

 Administrator In the long run you will be better off with something like:

```
IF (ABS(n384 - n381 - n558) GT .00000000001) E2 = '2'.
```

Please reply to the list and not to my personal email. Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
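The tolerance test above can be sketched as a Python function (the name `e2_error` and the default tolerance are illustrative, not SPSS code). Applied to the two cases from the listing, it flags neither, which is the desired behavior, while a genuinely inconsistent record is still caught:

```python
def e2_error(n384, n381, n558, tol=1e-11):
    """Flag an E2 error only when the absolute difference exceeds the tolerance."""
    return abs(n384 - n381 - n558) > tol

assert not e2_error(2252.01, 2000.00, 252.01)   # case 1: residual ~2.3e-13, not an error
assert not e2_error(95.00, 38.00, 57.00)        # case 6: exact
assert e2_error(100.00, 50.00, 40.00)           # truly inconsistent record is flagged
```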

## Re: IF statement problem or maybe a compute problem

 Administrator It might just as easily have been represented as 225200.999999999999999, and then you would be screwed: "2252.01 x 100 = 225201 in SPSS it is represented as 225201.00000000003".
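The warning above is easy to demonstrate in Python: for some inputs the scaled value lands just *below* the integer boundary, and plain truncation (without SPSS's fuzzbits adjustment) loses a whole unit.

```python
import math

# 0.29 is stored slightly below its decimal value, so scaling by 100
# lands just under the boundary instead of on it:
scaled = 0.29 * 100
print(scaled)                       # 28.999999999999996
assert math.trunc(scaled) == 28     # truncation drops to 28, off by one
```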

## Re: IF statement problem or maybe a compute problem

 In reply to this post by Ron0z A few points on this.

- The variable format has no effect on the actual internal value of a number. It just says how to display the number or, on input, how to interpret the digits.

- SPSS, like other statistical software, stores numbers internally as double precision floating point values. That means 53 bits of precision plus an exponent, which gives approximately 16 significant figures. The numbers are expressed in terms of powers of 2.

Running this code shows exactly what is happening here. Using scientific notation exposes the actual values, although technically there is about half an additional digit not shown. When 2252.01 is converted to binary, the stored value is 2.25201000000000022e+03 (shown with a little extra precision, as calculated in Python code).

```
data list list/n384 n381 n558(3f8.2).
begin data
2252.01  2000.00 252.01
end data
compute diff = n384 - n381.
compute test = diff - n558.
format n384 n381 n558 diff test(e23.16).
list.
```

```
                  n384                    n381                    n558                    diff                    test
2.252010000000000E+003  2.000000000000000E+003  2.520100000000000E+002  2.520100000000002E+002  2.273736754432321E-013
```

This link goes into all the gory details of floating point and the IEEE 754 standard.

Many years ago, we considered adding some fuzz to the equality test in COMPUTE et al. to reduce confusion, but anything we could do just moved the seemingly paradoxical behavior elsewhere in the floating point number system.

-- Jon K Peck
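The SPSS demonstration above has a direct Python analogue: `decimal.Decimal` exposes the exact binary value stored for a double, and the same tiny residual appears.

```python
from decimal import Decimal

# The exact value stored for the literal 2252.01 is not 2252.01:
print(Decimal(2252.01))             # 2252.0100000000002182...
assert Decimal(2252.01) != Decimal("2252.01")

# The same tiny residual that SPSS listed as 2.273736754432321E-013:
test = (2252.01 - 2000.00) - 252.01
assert 0 < test < 1e-12
```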

## Re: IF statement problem or maybe a compute problem

 Jon, I notice that in the options table there is an entry for 'fuzzbits', and it is adjustable. It says it is for the RND and TRUNC functions. I'm curious how RND and TRUNC are functionally different from (a /= b).

Gene Maguin

## Re: IF statement problem or maybe a compute problem

 In reply to this post by Jon Peck Boy, does this bring back memories. When I was an undergrad in the early 1970s I was a psych major and had a minor in bio and computer science -- running punched cards on an IBM mainframe (which took hours to get through the system, meaning it took hours for me to find the errors I had committed and rerun the program, which was madness). Later in the 1970s I was in grad school in exp/cog psych and was running SPSS and BMDP on a Sperry UNIVAC 1100, which used 7-bit bytes for ASCII instead of the 8-bit EBCDIC bytes of the IBM. Wilkinson & Dallal published their American Statistician paper in 1977 on "Accuracy of Sample Moments Calculations etc.", which pointed out that most statistical software packages gave incorrect results because of (a) the algorithm used, and (b) the use of single precision 32-bit representation. This ties into the presentation that Jon Peck makes for double precision.

Quoting from W&D:

|Ling (1974) listed five algorithms for calculating the sum of
|squares about the sample mean. The first of these is the
|desk calculator:

NOTE: Greek symbols don't paste properly into the mail client I use, but the desk calculator formula involves the sum of x^2.

|Of the five algorithms, it is the least reliable for use on a
|computer with a small word size due to its susceptibility
|to truncation error. Anscombe (1967) has shown that for a
|computer like the IBM 370 with a four-byte, 32-bit word,
|24 bits of which are used to store a mantissa (approximately
|7.2 decimal digits), S sub 11 [standard deviation] cannot be
|computed with any accuracy for the numbers 9001, 9002, 9003.
|Following the example through,
|
|  X sub i      X sub i ^2     X sub i ^2 (first 24 binary digits)
|  9001         81018001       81018000
|  9002         81036004       81036000
|  9003         81054009       81054008
|
|Therefore, to the first 24 binary digits,
|sum of X^2 = 243108000.
|Again, to the first 24 binary digits,
|  3 X-bar^2 = 243108000, so
|  sum of X^2 - 3 X-bar^2 = 0.

[NOTE: see the original article for the actual formulas.]

|The same result can be obtained in decimal arithmetic by ignoring
|any digit beyond the seventh. The three consecutive integers need
|not be as large as 9000 for complete loss of accuracy; numbers
|smaller than 3000 will do.

Because different software packages used different algorithms as well as different precision (double precision helps minimize the problem), whether the results one got were correct or wrong depended on the algorithm the package used and the precision used (and whether it could be changed from single to double).

BMDP used single precision (32-bit) representation but used the mean-deviation formula to calculate the sum of squares, so it avoided the truncation errors that would produce incorrect results; BMDP was the only package to provide correct results with single precision. However, BMDP required reading the data twice: first to calculate the appropriate statistics like the mean, and second to subtract the mean from each value, which meant the processing took longer.

SPSS 1977 (IBM mainframe version) could calculate the mean and standard deviation correctly but not correlations. As W&D point out, this was due to SPSS using single precision for some calculations and double precision for others. At this time SPSS didn't provide an option to do double precision calculations throughout.

Other packages (DATATEXT, OSIRIS) which provided an option to use double precision were able to get all of the statistics correct when this option was used. Subsequent to this article, SPSS would use double precision in its calculations on IBM machines, but I'm not sure what year or version of SPSS this was. Not sure if they changed the algorithm.
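The Anscombe/Ling failure that W&D describe can be reproduced today by forcing single precision. This NumPy sketch (an editorial illustration, not code from the thread) shows the desk-calculator formula collapsing to zero while the two-pass mean-deviation formula gets the right answer:

```python
import numpy as np

# Three values whose squares exceed float32's 24-bit mantissa:
x = np.array([9001.0, 9002.0, 9003.0], dtype=np.float32)

mean = x.sum() / np.float32(3.0)        # 9002, exact even in float32

# Desk-calculator formula: sum(x^2) - n * mean^2.
# Each square loses its low-order digits at 24 bits of mantissa, and
# subtracting two nearly equal large numbers cancels completely.
naive = (x * x).sum() - np.float32(3.0) * (mean * mean)
assert float(naive) == 0.0              # complete loss of accuracy

# Two-pass (mean-deviation) formula: subtract the mean first.
twopass = ((x - mean) ** 2).sum()
assert float(twopass) == 2.0            # the correct sum of squares
```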
 W&D note the Control Data Corp (CDC) computers used 60-bit words, and the desk calculator algorithm would provide correct results there (I assume that all packages running on CDC machines made use of the 60-bit representation). NYU used to have a CDC machine in addition to its IBM mainframes in the early 1980s, and it is telling that Elazar Pedhazur used both IBM and CDC machines for analyses -- Pedhazur did his Ph.D. under Fred Kerlinger and would become part of the Ed psych faculty in NYU's School of Ed. In Pedhazur's early books on multiple regression he would show results from both CDC and IBM versions of software.

Sidenote: in the 1940s the clinical psychology program was located in the NYU School of Ed, not in the NYU School of Arts & Science, and that is where Jack Cohen got his Ph.D. in clinical psychology as well as his quantitative training. After getting his Ph.D., Jack did clinical work and research in one of the Veterans hospitals until the late '50s/early '60s, when he became part of the NYU Arts & Science psychology dept as well as doing a consulting gig at the New York State Psychiatric Institute (NYSPI, or just PI), which was run by Columbia med school's dept of psychiatry. I believe he was affiliated with the Biometrics Dept there, which is where his second wife Patricia would also work and does so still today.

Anyway, Jon's presentation helps to make sense of W&D's presentation, but in the context of the 32-bit word. I believe that the IEEE 754 standard would help standardize operations across statistical packages.

-Mike Palij
New York University

## Re: IF statement problem or maybe a compute problem

 In reply to this post by Maguin, Eugene Fuzzbits is pretty esoteric, and generally one should leave it alone. I have copied the help below, but what might not be clear from it is that the setting does not affect ordinary equality calculations; they are based on the exact bits. It also does not affect the front end in places, say in pivot tables, where a format is applied. The setting is designed to deal with noise in the least significant bits of a calculation, as in the present case. As you can see from the help, it has had a chequered history. The setting was exposed in V17 because a few customers had code that was very sensitive to the rounding/truncation threshold, and they wanted to ensure compatibility.

Rounding and Truncation of Numeric Values. For the RND and TRUNC functions, this setting controls the default threshold for rounding up values that are very close to a rounding boundary. The setting is specified as a number of bits and is set to 6 at install time, which should be sufficient for most applications. Setting the number of bits to 0 produces the same results as in release 10. Setting the number of bits to 10 produces the same results as in releases 11 and 12.

For the RND function, this setting specifies the number of least-significant bits by which the value to be rounded may fall short of the threshold for rounding up but still be rounded up. For example, when rounding a value between 1.0 and 2.0 to the nearest integer, this setting specifies how much the value can fall short of 1.5 (the threshold for rounding up to 2.0) and still be rounded up to 2.0.

For the TRUNC function, this setting specifies the number of least-significant bits by which the value to be truncated may fall short of the nearest rounding boundary and be rounded up before truncating. For example, when truncating a value between 1.0 and 2.0 to the nearest integer, this setting specifies how much the value can fall short of 2.0 and be rounded up to 2.0.

-- Jon K Peck
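The behavior the help describes can be sketched in Python (a hypothetical `fuzzy_trunc`, not SPSS's actual implementation, and limited to non-negative values): if the value falls short of the next integer by no more than `fuzzbits` least-significant bits, round it up before truncating. Requires Python 3.9+ for `math.ulp`.

```python
import math

def fuzzy_trunc(x, fuzzbits=6):
    """Truncate x toward zero, first rounding up to the next integer
    when x falls within `fuzzbits` low-order bits of it.
    Hypothetical sketch for non-negative x, not SPSS's actual code."""
    fuzz = math.ulp(x) * (2 ** fuzzbits)
    nxt = math.ceil(x)
    if nxt - x <= fuzz:
        return nxt
    return math.trunc(x)

# 0.29 * 100 is stored as 28.999999999999996; plain truncation loses a
# unit, while the fuzzed version recovers the intended 29:
assert math.trunc(0.29 * 100) == 28
assert fuzzy_trunc(0.29 * 100) == 29
# A value genuinely short of the boundary is still truncated:
assert fuzzy_trunc(28.5) == 28
```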

## Re: IF statement problem or maybe a compute problem

In reply to this post by Mike

Interesting history, Mike. I don't know when SPSS became fully double precision, but it probably would have been no later than the introduction of SPSS-X, which was around 1981-82. That's before I joined SPSS.

At least as big an issue as single vs. double precision is how the arithmetic calculations are done. Even something as apparently simple as a sum can run amok with the obvious method if the numbers vary widely in magnitude. Here is a link to the Kahan algorithm used by Statistics.

On Thu, Oct 27, 2016 at 11:05 AM, Mike Palij wrote:

Boy, does this bring back memories. When I was an undergrad in the early 1970s I was a psych major with a minor in bio and computer science, running punched cards on an IBM mainframe (which took hours to get through the system, meaning it took hours for me to find the errors I had committed and rerun the program, which was madness). Later in the 1970s I was in grad school in exp/cog psych and was running SPSS and BMDP on a Sperry UNIVAC 1100, which used 7-bit bytes for ASCII instead of the 8-bit EBCDIC bytes of the IBM. Wilkinson & Dallal published their American Statistician paper in 1977 on "Accuracy of Sample Moments Calculations etc.", which pointed out that most statistical software packages gave incorrect results because of (a) the algorithm used and (b) the use of single-precision 32-bit representation. This ties into the presentation on double precision that Jon Peck makes down below.

Quoting from W&D:

| Ling (1974) listed five algorithms for calculating the sum of
| squares about the sample mean. The first of these is the
| desk calculator:

NOTE: Greek symbols don't paste properly into the mail client I use, but the desk-calculator formula involves the sum of x^2.

| Of the five algorithms, it is the least reliable for use on a
| computer with a small word size due to its susceptibility
| to truncation error.
| Anscombe (1967) has shown that for a
| computer like the IBM 370 with a four-byte, 32-bit word,
| 24 bits of which are used to store a mantissa (approximately
| 7.2 decimal digits), S sub 11 [standard deviation] cannot be
| computed with any accuracy for the numbers 9001, 9002, 9003.
| Following the example through,
|
|   X sub 1i    X^2 sub 1    X^2 sub 1 (first 24 binary digits)
|   9001        81018001     81018000
|   9002        81036004     81036000
|   9003        81054009     81054008
|
| Therefore, to the first 24 binary digits,
| sum of X^2 sub 1 = 243108000.
| Again, to the first 24 binary digits,
| 3 X-bar^2 = 243108000, so
| sum of X^2 sub 1 - 3 X-bar^2 = 0.

[NOTE: see the original article for the actual formulas.]

| The same result can be obtained in decimal arithmetic by ignoring
| any digit beyond the seventh. The three consecutive integers need
| not be as large as 9000 for complete loss of accuracy; numbers
| smaller than 3000 will do.

Because different software packages used different algorithms as well as different precision (double precision helps minimize the problem), whether the results one got were correct or wrong depended on the algorithm the package used and the precision used (and whether it could be changed from single to double).

BMDP used single-precision (32-bit) representation but used the mean-deviation formula to calculate the sum of squares, so it avoided the truncation errors that would produce incorrect results. BMDP was the only package to provide correct results with single precision. However, BMDP required reading the data twice: first to calculate the appropriate statistics like the mean, and second to subtract the mean from each value, which meant the processing took longer.

SPSS 1977 (IBM mainframe version) could calculate the mean and standard deviation correctly but not correlations. As W&D point out, this was due to SPSS using single precision for some calculations and double precision for others.
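The Anscombe 9001/9002/9003 example can still be reproduced today by forcing every intermediate result down to IEEE 754 single precision (a sketch using Python's stdlib `struct` round-trip to emulate 32-bit arithmetic; the helper name `f32` is mine):

```python
import struct

def f32(x):
    # Round a Python double to the nearest IEEE 754 single (32-bit) value.
    return struct.unpack('f', struct.pack('f', x))[0]

xs = [9001.0, 9002.0, 9003.0]
n = len(xs)
mean = f32(sum(xs) / n)  # 9002, exact even in single precision

# "Desk calculator" formula, every operation in 32-bit arithmetic:
ss = 0.0
for x in xs:
    ss = f32(ss + f32(x * x))     # each x*x loses its low-order digits
desk = f32(ss - f32(n * f32(mean * mean)))
print(desk)  # 0.0 -- all significance cancelled, as W&D describe

# Two-pass mean-deviation formula (BMDP's approach), still 32-bit:
two_pass = 0.0
for x in xs:
    d = f32(x - mean)
    two_pass = f32(two_pass + f32(d * d))
print(two_pass)  # 2.0 -- the correct sum of squared deviations
```

The desk-calculator formula subtracts two nearly equal large numbers, so the answer lives entirely in the digits that single precision has already discarded; the two-pass formula subtracts the mean first, while the values are still small and exact.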
At this time SPSS didn't provide an option to do double-precision calculations throughout. Other packages (DATATEXT, OSIRIS) which provided an option to use double precision were able to get all of the statistics correct when that option was used. Subsequent to this article, SPSS would use double precision in its calculations on IBM machines, but I'm not sure what year or version of SPSS this was. Not sure if they changed the algorithm.

W&D note that Control Data Corp (CDC) computers used 60-bit words, and the desk-calculator algorithm would provide correct results there (I assume that all packages running on CDC machines made use of the 60-bit representation). NYU used to have a CDC machine in addition to its IBM mainframes in the early 1980s, and it is telling that Elazar Pedhazur used both IBM and CDC machines for analyses. Pedhazur did his Ph.D. under Fred Kerlinger and would become part of the Ed psych faculty in NYU's School of Ed. In Pedhazur's early books on multiple regression he would show results from both CDC and IBM versions of software.

Sidenote: in the 1940s the clinical psychology program was located in the NYU School of Ed, not in the NYU School of Arts & Science, and that is where Jack Cohen got his Ph.D. in clinical psychology as well as his quantitative training. After getting his Ph.D., Jack did clinical work and research in one of the Veterans hospitals until the late '50s/early '60s, when he became part of the NYU Arts & Science psychology dept while also doing a consulting gig at the New York State Psychiatric Institute (NYSPI, or just PI), which was run by Columbia's med school's dept of psychiatry. I believe he was affiliated with the Biometrics Dept there, which is where his second wife Patricia would also work and does so still today.

Anyway, Jon's presentation below helps to make sense of the W&D presentation, but in the context of the 32-bit word.
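The Kahan compensated-summation algorithm Jon mentions above can be sketched as follows (a minimal illustration of the technique, not the actual Statistics implementation):

```python
def kahan_sum(values):
    total = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for x in values:
        y = x - c            # apply the compensation first
        t = total + y        # low-order bits of y may be lost here...
        c = (t - total) - y  # ...so recover exactly what was lost
        total = t
    return total

# Many small values added to one huge value: a naive left-to-right
# sum drops every one of the small addends.
data = [1e16] + [1.0] * 1000
print(sum(data) - 1e16)        # 0.0  -- the thousand 1.0s vanished
print(kahan_sum(data) - 1e16)  # 1000.0 -- compensation preserved them
```

Each pass recovers the rounding error of the previous addition and feeds it back in, so the running total stays accurate even when the addends differ in magnitude by many orders, which is exactly the failure mode of the naive sum.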
I believe that the IEEE 754 standard would help standardize operations across statistical packages.

-Mike Palij
New York University
[hidden email]

--
Jon K Peck
[hidden email]