
IF statement problem or maybe a compute problem

Ron0z
I want to carry out a few data validation checks on my data.  I have 7 validation checks so far.  It’s check number E2 that I find weird. Any advice would be welcome.

string E1 E2 E3 E4 E5 E6 E7 (A1).
compute TFN = number(e416, F12.0).

if missing(n384) or (n384 eq 0) E1 = '1'.
if ((n384 - n381) NE n558) E2 = '2'.
if (n558 gt n384) E3 ='3'.
if any(e490, '401', '410') and ((missing(n558)) or (n558 EQ 0)) E4 = '4'.
if any(e490, '401', '410') and (e488 EQ 'ZZZZZZZZZZ') E5 = '5'.
if any(e490, '401', '410') and missing(TFN) E6 ='6'.
if any(e490, '501', '510') and (n558 GT 0) E7 ='7'.

select if (E1 EQ '1') OR (E2 EQ '2') OR  (E3 EQ '3') OR  (E4 EQ '4') OR  (E5 EQ '5') OR  (E6 EQ '6') OR (E7 EQ '7').
list vars = year sem E313 E307 e490 n384 n381 n529 n558 tfn e488 E1 E2 E3 E4 E5 E6 E7.

e490     n384     n381     n529     n558  E1 E2 E3 E4 E5 E6 E7
 230  2252.01  2000.00    63.00   252.01      2
 410    38.00      .00      .00    38.00                  6
 410    76.00      .00      .00    76.00                  6
 410   190.00      .00      .00   190.00                  6
 410    76.00      .00      .00    76.00                  6
 410    95.00    38.00      .00    57.00                  6

What the output says in regard to E2 is that case 1 is in error and case 6 is fine.

Sure, record 6 is good: 95 - 38 = 57, and the value of n558 is also 57, so no error is flagged. But look at case 1: 2252.01 - 2000 = 252.01, and the value of n558 is also 252.01 (i.e., identical values), yet E2 flags the record as an error by displaying '2' against it.

The following were used because the data had been held as text, even though the values were numbers:
compute n384 = number(e384, F12.2).
compute n381 = number(e381, F12.2).
compute n529 = number(e529, F12.2).
compute n558 = number(e558, F12.2).
I don't know if that's relevant to the issue.
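For what it's worth, the text-to-number conversion is unlikely to be relevant: parsing a decimal string yields exactly the same double-precision value as the corresponding literal, as a quick Python sketch suggests (Python uses the same IEEE 754 doubles as SPSS):

```python
# Parsing a decimal string produces the same IEEE 754 double as the literal,
# so a NUMBER(e558, F12.2)-style conversion adds no error of its own.
parsed = float("252.01")
literal = 252.01
print(parsed == literal)  # True: bit-identical doubles

# The mismatch therefore comes from the arithmetic, not the conversion:
print(float("2252.01") - float("2000.00") == float("252.01"))  # False
```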

Re: IF statement problem or maybe a compute problem

Jon Peck
Calculate the actual difference.  Due to the nature of floating point arithmetic, numbers with a decimal part often do not have an exact representation in floating point.  You would need a very small fuzz factor in the test.
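The behavior is reproducible in any IEEE 754 double-precision environment, not just SPSS. A Python sketch of the failing exact test and a fuzz-factor version (the tolerance 1e-9 is an arbitrary illustrative choice):

```python
n384, n381, n558 = 2252.01, 2000.00, 252.01

diff = n384 - n381
print(diff == n558)   # False: exact equality fails
print(diff - n558)    # a tiny nonzero residue, on the order of 1e-13

# Fuzz-factor version of the E2 check: flag an error only when the
# discrepancy exceeds a small tolerance.
tol = 1e-9
e2_error = abs(n384 - n381 - n558) > tol
print(e2_error)       # False: the record is now accepted as consistent
```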

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/IF-statement-problem-or-maybe-a-compute-problem-tp5733397.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
LISTSERV@LISTSERV.UGA.EDU (not to SPSSX-L), with no body text except the
command. To leave the list, send the command SIGNOFF SPSSX-L.
For a list of commands to manage subscriptions, send the command INFO REFCARD.


--
Jon K Peck
[hidden email]



Re: IF statement problem or maybe a compute problem

Mike

Jon, I've tried to find how floating point numbers are represented internally in SPSS but couldn't find anything. Wouldn't it be a good idea to declare a print format of, say, F20.15 for n384, n381, and n558 and print out the case with e490 = 230? If what you say below is true, we'd expect some nonzero digits beyond the 2nd decimal place, and printing more decimals should reveal them. Though this seems reasonable to me, I think the error may be elsewhere, but it is hard to determine where because (a) there is distracting junk in the code (for example, the LIST includes year, sem, and tfn, yet no values appear for them), and/or (b) there may be syntax elsewhere that is the source of the problem.

By the way, why are there seven E variables instead of one E variable with seven values? The variables E1-E7 are going to be sparse unless a case can have more than one error flag. It certainly confuses the output below (even after pasting it into Notepad, there is only one flag column).
 
-Mike Palij
New York University
 
----- Original Message -----
Sent: Wednesday, October 26, 2016 8:50 PM
Subject: Re: IF statement problem or maybe a compute problem

Calculate the actual difference.  Due to the nature of floating point arithmetic, numbers with a decimal part often do not have an exact representation in floating point.  You would need a very small fuzz factor in the test.

On Wednesday, October 26, 2016, Ron0z <[hidden email]> wrote:
I want to carry out a few data validation checks on my data.  I have 7
validation checks so far.  It’s check number E2 that I find weird. Any
advice would be welcome.

string E1 E2 E3 E4 E5 E6 E7 (A1).
compute TFN = number(e416, F12.0).

if missing(n384) or (n384 eq 0) E1 = '1'.
if ((n384 - n381) NE n558) E2 = '2'.
if (n558 gt n384) E3 ='3'.
if any(e490, '401', '410') and ((missing(n558)) or (n558 EQ 0)) E4 = '4'.
if any(e490, '401', '410') and (e488 EQ 'ZZZZZZZZZZ') E5 = '5'.
if any(e490, '401', '410') and missing(TFN) E6 ='6'.
if any(e490, '501', '510') and (n558 GT 0) E7 ='7'.

select if (E1 EQ '1') OR (E2 EQ '2') OR  (E3 EQ '3') OR  (E4 EQ '4') OR  (E5
EQ '5') OR  (E6 EQ '6') OR (E7 EQ '7').
list vars = year sem E313 E307 e490 n384 n381 n529 n558 tfn e488 E1 E2 E3 E4
E5 E6 E7.

e490     n384     n381     n529     n558 E1 E2 E3 E4 E5 E6 E7

230   2252.01  2000.00    63.00   252.01    2
410     38.00      .00      .00    38.00                6
410     76.00      .00      .00    76.00                6
410    190.00      .00      .00   190.00                6
410     76.00      .00      .00    76.00                6
410     95.00    38.00      .00    57.00                6

The alignment is not too good here, but what the output is saying in regard
to E2 is that case 1 is in error and case 6 is fine. (If you copy it to a
text editor it may look better.)

Sure, record 6 is good: 95 - 38 = 57 and the value of n558 is also 57
therefore no error and no error is flagged. But look at case 1: 2252.01 -
2000 = 252.01 and the value of n558 is also 252.01 (ie. identical values),
but E2 shows it as an error by displaying '2' against the record.

The following were used because the data had been held as text, even though
the values were numbers:
compute n384 = number(e384, F12.2).
compute n381 = number(e381, F12.2).
compute n529 = number(e529, F12.2).
compute n558 = number(e558, F12.2).
I don't know if that's relevant to the issue.



--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/IF-statement-problem-or-maybe-a-compute-problem-tp5733397.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
<A onclick="_e(event, 'cvml', 'LISTSERV@LISTSERV.UGA.EDU')" href="javascript:;">LISTSERV@... (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD


--
Jon K Peck
[hidden email]


===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Re: IF statement problem or maybe a compute problem

Ron0z
Got it licked!

Trying to trick the system is tricky.  My calculated value of zero actually equals 0.0000000000291, which is not the same as 0.

Multiplying the values by 100 to remove the decimal component didn't really help on its own: as you might surmise, even though 2252.01 x 100 = 225201, in SPSS it is represented as 225201.00000000003.  However, when I added TRUNC, everything worked out just fine.

The following works:
if  ((TRUNC(n384*100) - TRUNC(n381*100) -  TRUNC(n558*100) ) ne 0) E2 = '2'.
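Ron's scale-and-truncate trick can be checked outside SPSS as well; the Python sketch below mirrors the expression, with Python's math.trunc standing in for SPSS TRUNC:

```python
import math

n384, n381, n558 = 2252.01, 2000.00, 252.01

# Scaling by 100 does not give an exact integer...
print(repr(n384 * 100))  # 225201.00000000003, as Ron observed, not 225201.0

# ...but truncating the scaled values makes the cents-level check pass,
# because the representation error here is far smaller than one cent.
e2 = math.trunc(n384 * 100) - math.trunc(n381 * 100) - math.trunc(n558 * 100)
print(e2 == 0)           # True: no error flagged
```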



Re: IF statement problem or maybe a compute problem

David Marso
Administrator
In the long run you will be better off with something like the following, flagging the record when the difference exceeds a small tolerance:

IF (ABS(n384 - n381 - n558) GT .00000000001)  E2='2'.

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Re: IF statement problem or maybe a compute problem

David Marso
Administrator

"2252.01 x 100 = 225201 in spss it is represented as 225201.00000000003"

It might just as easily have been represented as 225200.999999999999999; then you would be screwed.
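David's caveat is real: for some inputs the scaled value lands just below the integer, and truncation then loses a whole unit. A Python illustration (0.29 is a convenient textbook value, not one from Ron's data):

```python
import math

# 0.29 is stored slightly below 0.29, so scaling lands just under 29...
print(repr(0.29 * 100))        # 28.999999999999996
# ...and truncation drops a full unit:
print(math.trunc(0.29 * 100))  # 28

# Rounding to the nearest integer instead of truncating avoids this trap.
print(round(0.29 * 100))       # 29
```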


Re: IF statement problem or maybe a compute problem

Jon Peck
In reply to this post by Ron0z
A few points on this.

- The variable format has no effect on the actual internal value of a number.  It just says how to display the number or, on input, how to interpret the digits.

- SPSS, like other statistical software, stores numbers internally as double precision floating point values.  That means 53 bits of precision plus an exponent which gives approximately 16 significant figures.  The numbers are expressed in terms of powers of 2.

Running this code shows exactly what is happening here.  Using scientific notation exposes the actual values although technically there is about half an additional digit not shown.  When 2252.01 is converted to binary, the stored value is
2.25201000000000022e+03 (shown with a little extra precision as calculated in Python code).
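The exact stored values can be inspected with Python's decimal module, which expands a binary double to its full decimal form:

```python
from decimal import Decimal

# Decimal(float) shows the exact value of the IEEE 754 double; every binary
# fraction has a finite, if long, decimal expansion.
print(Decimal(2252.01))  # 2252.0100000000002182787...: slightly high
print(Decimal(252.01))   # 252.0099999999999909050...: slightly low

# The two representation errors differ, so the subtraction cannot come out
# exact; the residue is 2**-42, the 2.27e-13 shown in the LIST output.
print((2252.01 - 2000.00) - 252.01 == 2 ** -42)
```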

data list list/n384 n381 n558(3f8.2).
begin data
2252.01  2000.00 252.01
end data
compute diff = n384 - n381.
compute test = diff - n558.
format n384 n381 n558 diff test(e23.16).
list.

                   n384                    n381                    n558                    diff                    test

 2.252010000000000E+003  2.000000000000000E+003  2.520100000000000E+002  2.520100000000002E+002  2.273736754432321E-013

This link goes into all the gory details of floating point and the IEEE 754 standard.

Many years ago, we considered adding some fuzz to the equality test in COMPUTE et al to reduce confusion, but anything we could do just moved the seemingly paradoxical behavior to elsewhere in the floating point number system.



Re: IF statement problem or maybe a compute problem

Maguin, Eugene

Jon,

I notice that in the options table there is an entry for 'fuzzbits', and it is adjustable. It says it is for the RND and TRUNC functions. I'm curious how RND and TRUNC are functionally different from comparisons such as (a < b), (a > b), or (a = b). Gene Maguin

 


Re: IF statement problem or maybe a compute problem

Mike
In reply to this post by Jon Peck

Boy, does this bring back memories.  When I was an undergrad in the
early 1970s, I was a psych major with minors in bio and computer
science, running punched cards on an IBM mainframe. Jobs took
hours to get through the system, meaning it took hours for me to find
the errors I had committed and rerun the program, which was madness.
Later in the 1970s I was in grad school in exp/cog psych and was
running SPSS and BMDP on a Sperry UNIVAC 1100, which used
7-bit ASCII bytes instead of the IBM's 8-bit EBCDIC bytes.
Wilkinson & Dallal published their 1977 American Statistician
paper on "Accuracy of Sample Moments Calculations etc.",
which pointed out that most statistical software packages gave
incorrect results because of (a) the algorithm used and (b) the
use of single precision 32-bit representation.  This ties into the
presentation that Jon Peck makes for double precision down
below.
 
Quoting from W&D:

|Ling (1974) listed five algorithms for calculating the sum of
|squares about the sample mean. The first of these is the
|desk calculator:

NOTE: Greek symbols don't paste properly into the mail
client I use, but the desk-calculator formula involves the sum of X^2.

|Of the five algorithms, it is the least reliable for use on a
|computer with a small word size due to its susceptibility
|to truncation error. Anscombe (1967) has shown that for a
|computer like the IBM 370 with a four-byte, 32-bit word,
|24 of which are used to store a mantissa (approximately
|7.2 decimal digits), S_11 [standard deviation] cannot be
|computed with any accuracy for the numbers 9001, 9002, 9003.
|Following the example through,
|
|  X_i      X_i^2       X_i^2 (first 24 binary digits)
|  9001     81018001    81018000
|  9002     81036004    81036000
|  9003     81054009    81054008
|
|Therefore, to the first 24 binary digits,
|sum of X_i^2 = 243108000.
|Again, to the first 24 binary digits,
|3 * X-bar^2 = 243108000, so
|sum of X_i^2 - 3 * X-bar^2 = 0.

[NOTE: see the original article for the actual formulas.]
 
|The same result can be obtained in decimal arithmetic by ignoring
|any digit beyond the seventh. The three consecutive integers need
|not be as large as 9000 for complete loss of accuracy; numbers
|smaller than 3000 will do.
 
Because different software packages used different algorithms
as well as different precision (double precision helps minimize the
problem), whether one's results were correct depended upon
the algorithm the package used and the precision employed (and
whether it could be changed from single to double).
 
BMDP used single precision (32-bit) representation but used
the mean-deviation formula to calculate the sum of squares,
so it avoided the truncation errors that would produce
incorrect results. BMDP was the only package to provide
correct results with single precision. However, BMDP required
reading the data twice: first to calculate the appropriate
statistics such as the mean, and second to subtract the mean
from each value, which meant the processing took longer.
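The W&D example can be replayed today by forcing every intermediate result down to IEEE 754 single precision; the Python sketch below emulates 32-bit arithmetic with struct (an emulation of the old machines, not any package's actual code):

```python
import struct

def f32(x):
    """Round a Python double to the nearest IEEE 754 single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]

xs = [9001.0, 9002.0, 9003.0]

# Desk-calculator algorithm: sum of X^2 minus n * mean^2, all in 32 bits.
sum_sq = total = 0.0
for v in xs:
    sum_sq = f32(sum_sq + f32(v * v))  # the squares lose their last digits
    total = f32(total + v)
mean = f32(total / len(xs))
desk = f32(sum_sq - f32(len(xs) * f32(mean * mean)))

# Mean-deviation (two-pass) algorithm, as BMDP used: subtract the mean first.
two_pass = 0.0
for v in xs:
    d = f32(v - mean)
    two_pass = f32(two_pass + f32(d * d))

print(desk)      # 0.0: complete loss of accuracy, exactly as W&D describe
print(two_pass)  # 2.0: the correct sum of squares about the mean
```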
 
SPSS in 1977 (the IBM mainframe version) could calculate the
mean and standard deviation correctly but not correlations.
As W&D point out, this was due to SPSS using single
precision for some calculations and double precision
for others. At that time SPSS didn't provide an option
to do double precision calculations throughout.
 
Other packages (DATATEXT, OSIRIS), which provided
an option to use double precision, were able to get all of
the statistics correct when that option was used.  Subsequent
to this article, SPSS would use double precision in its
calculations on IBM machines, but I'm not sure in what year
or version of SPSS this happened, nor whether the
algorithm was changed.
 
W&D note that Control Data Corp (CDC) computers
used 60-bit words, and the desk-calculator algorithm would
provide correct results there (I assume that all packages running
on CDC machines made use of the 60-bit representation).
NYU used to have a CDC machine in addition to its IBM
mainframes in the early 1980s, and it is telling that Elazar Pedhazur
used both IBM and CDC machines for analyses. Pedhazur
did his Ph.D. under Fred Kerlinger and would become
part of the Ed psych faculty in NYU's School of Ed.
In his early books on multiple regression he would
show results from both CDC and IBM versions of software.
 
Sidenote: in the 1940s the clinical psychology program
was located in the NYU School of Ed, not in the NYU
School of Arts & Science, and that is where Jack Cohen
got his Ph.D. in clinical psychology as well as his quantitative
training.  After getting his Ph.D., Jack did clinical
work and research in one of the Veterans hospitals until
the late '50s/early '60s, when he became part of the
NYU Arts & Science psychology dept while also doing
a consulting gig at the New York State Psychiatric Institute
(NYSPI, or just PI), which was run by Columbia's med
school's dept of psychiatry. I believe he was affiliated
with the Biometrics Dept there, which is where his second
wife Patricia would also work, as she still does today.
 
Anyway, Jon's presentation below helps make sense of the
W&D presentation, though W&D were writing in the context
of the 32-bit word. I believe the IEEE 754 standard helped
standardize floating point operations across statistical packages.
 
-Mike Palij
New York University
 
----- Original Message -----
Sent: Thursday, October 27, 2016 9:48 AM
Subject: Re: IF statement problem or maybe a compute problem


Re: IF statement problem or maybe a compute problem

Jon Peck
In reply to this post by Maguin, Eugene
Fuzzbits is pretty esoteric, and generally one should leave it alone.  I have copied the help below, but what might not be clear from it is that the setting does not affect ordinary equality calculations; they are based on the exact bits.  It also does not affect the front end in places, such as pivot tables, where a format is applied.

The setting is designed to deal with noise in the least significant bits in a calculation as in the present case.  As you can see from the help, it has had a chequered history.  The setting was exposed in V17 because of a few customers who had code that was very sensitive to the rounding/truncation threshold, and they wanted to ensure compatibility.

Rounding and Truncation of Numeric Values. For the RND and TRUNC functions, this setting controls the default threshold for rounding up values that are very close to a rounding boundary. The setting is specified as a number of bits and is set to 6 at install time, which should be sufficient for most applications. Setting the number of bits to 0 produces the same results as in release 10. Setting the number of bits to 10 produces the same results as in releases 11 and 12.

  • For the RND function, this setting specifies the number of least-significant bits by which the value to be rounded may fall short of the threshold for rounding up but still be rounded up. For example, when rounding a value between 1.0 and 2.0 to the nearest integer this setting specifies how much the value can fall short of 1.5 (the threshold for rounding up to 2.0) and still be rounded up to 2.0.
  • For the TRUNC function, this setting specifies the number of least-significant bits by which the value to be truncated may fall short of the nearest rounding boundary and be rounded up before truncating. For example, when truncating a value between 1.0 and 2.0 to the nearest integer this setting specifies how much the value can fall short of 2.0 and be rounded up to 2.0.
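As a rough illustration only (a guess at the mechanism from the help text, not SPSS's actual implementation), the documented RND behavior can be mimicked in Python by widening the round-up threshold by a given number of least-significant bits:

```python
import math

def fuzzy_rnd(x, fuzzbits=6):
    # Hypothetical emulation of the documented setting: treat values that
    # fall within 2**fuzzbits ulps below the rounding threshold as if they
    # had reached it.
    fuzz = (2 ** fuzzbits) * math.ulp(x)
    return math.floor(x + 0.5 + fuzz)

# A value 8 ulps short of 1.5, e.g. noise left over from a calculation:
x = 1.5 - 8 * math.ulp(1.5)
print(math.floor(x + 0.5))  # 1: plain rounding leaves it at 1...
print(fuzzy_rnd(x))         # 2: ...but a 6-bit (64-ulp) fuzz rounds it up
```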

On Thu, Oct 27, 2016 at 8:00 AM, Maguin, Eugene <[hidden email]> wrote:

Jon,

I notice that in the options table, there is an entry for ‘fuzzbits’ and it is adjustable. It says it is for the rnd and trunc functions. I’m curious how rnd and trunc is functionally different from (a </>/= b). Gene Maguin

 

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Jon Peck
Sent: Thursday, October 27, 2016 9:49 AM
To: [hidden email]
Subject: Re: IF statement problem or maybe a compute problem

 

A few points on this.

 

- The variable format has no effect on the actual internal value of a number.  It just says how to display the number or, on input, how to interpret the digits.

 

- SPSS, like other statistical software, stores numbers internally as double precision floating point values.  That means 53 bits of precision plus an exponent which gives approximately 16 significant figures.  The numbers are expressed in terms of powers of 2.

 

Running this code shows exactly what is happening here.  Using scientific notation exposes the actual values although technically there is about half an additional digit not shown.  When 2252.01 is converted to binary, the stored value is

2.25201000000000022e+03 (shown with a little extra precision as calculated in Python code).

 

data list list/n384 n381 n558(3f8.2).

begin data

2252.01  2000.00 252.01

end data

compute diff = n384 - n381.

compute test = diff - n558.

format n384 n381 n558 diff test(e23.16).

list.

 

                   n384                    n381                    n558                    diff                    test 

 

 2.252010000000000E+003  2.000000000000000E+003  2.520100000000000E+002  2.520100000000002E+002  2.273736754432321E-013

 

This link goes into all the gory details of floating point and the IEEE 754 standard.

 

Many years ago, we considered adding some fuzz to the equality test in COMPUTE et al to reduce confusion, but anything we could do just moved the seemingly paradoxical behavior to elsewhere in the floating point number system.
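This is why general-purpose languages leave the tolerance to the user rather than building fuzz into equality. A minimal sketch of the usual remedy, using Python's `math.isclose` (default relative tolerance 1e-09):

```python
import math

diff = 2252.01 - 2000.0
print(diff == 252.01)              # False: exact equality trips on the residue
print(math.isclose(diff, 252.01))  # True: comparison within tolerance
```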

 

 

On Wed, Oct 26, 2016 at 9:28 PM, Ron0z <[hidden email]> wrote:

Got it licked!

Trying to trick the system is tricky.  My calculated value of zero actually
equals 0.0000000000291 which is not the same as 0.

Multiplying the values by 100 to remove the decimal component didn’t really
help: as you might surmise, even though 2252.01 x 100 = 225201, in SPSS it is
represented as 225201.00000000003.  However, when I added TRUNC everything
worked out just fine.

The following works:
if  ((TRUNC(n384*100) - TRUNC(n381*100) -  TRUNC(n558*100) ) ne 0) E2 = '2'.
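Transliterated to Python (variable names taken from the syntax above), the check behaves the same way. Rounding to the nearest cent is arguably a little safer than truncating, since a scaled value can land just *below* the intended integer as well as just above it:

```python
import math

n384, n381, n558 = 2252.01, 2000.00, 252.01

# The TRUNC fix: scale to cents, truncate, then compare.
trunc_check = math.trunc(n384 * 100) - math.trunc(n381 * 100) - math.trunc(n558 * 100)
print(trunc_check)  # 0, so E2 would no longer be flagged

# Rounding instead of truncating guards against products like 25200.999...
round_check = round(n384 * 100) - round(n381 * 100) - round(n558 * 100)
print(round_check)  # 0
```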






--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/IF-statement-problem-or-maybe-a-compute-problem-tp5733397p5733400.html

Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD



 

--

Jon K Peck
[hidden email]


Re: IF statement problem or maybe a compute problem

Jon Peck
In reply to this post by Mike
Interesting history, Mike.  I don't know when SPSS became fully double precision, but it probably would have been no later than the introduction of SPSSX, which was around 1981-2.  That's before I joined SPSS.

At least as big an issue as single/double precision is how the arithmetic calculations are done.  Even something as apparently simple as a sum can run amok with the obvious method if the numbers vary widely in magnitude.  Here is a link to the Kahan algorithm used by Statistics.
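The algorithm behind that link is short enough to sketch. This is a textbook Kahan compensated summation, not Statistics' actual implementation:

```python
def kahan_sum(values):
    """Compensated (Kahan) summation: a running correction term
    carries the low-order bits that plain addition drops."""
    total = 0.0
    compensation = 0.0
    for x in values:
        y = x - compensation            # re-inject the bits lost last time
        t = total + y                   # big + small: low bits of y are lost
        compensation = (t - total) - y  # algebraically 0; recovers the loss
        total = t
    return total

data = [0.1] * 10
print(sum(data))        # 0.9999999999999999 -- naive left-to-right addition
print(kahan_sum(data))  # 1.0
```

(Python's built-in `math.fsum` already provides an exactly rounded sum; the sketch only illustrates the compensation idea.)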

On Thu, Oct 27, 2016 at 11:05 AM, Mike Palij <[hidden email]> wrote:
Boy, does this bring back memories.  When I was an undergrad in the
early 1970s I was a psych major and had minor in bio and computer
science -- running punched cards on an IBM mainframe (which took
hours to get through the system, meaning it took hours for me to find
the errors I committed and rerun the program which was madness).
Later in the 1970s I was in grad school in exp/cog psych and was
running SPSS and BMDP on a Sperry UNIVAC 1100 which used
7-bit bytes for ASCII instead of the 8-bit bytes of EBCDIC of the
IBM.  Wilkinson & Dallal published their American Statistician
paper in 1977, "Accuracy of Sample Moments Calculations, etc.",
which pointed out that most statistical software packages gave
incorrect results because of (a) the algorithm used, and (b) the
use of single precision 32-bit representation.  This ties into the
presentation that Jon Peck makes for double precision down
below.
 
Quoting from W&D.
 
|Ling (1974) listed five algorithms for calculating the sum of
|squares about the sample mean. The first of these is the
|desk calculator:
 
NOTE: Greek symbols don't paste properly into the mail
client I use, but the desk calculator formula involves the sum of x^2.
 
|Of the five algorithms,  it is the least reliable for use on a
|computer with a small word size due to its susceptibility
|to truncation error. Anscombe (1967) has shown that for a
|computer like the IBM 370 with a four-byte, 32-bit word,
|24 of which are used to store a mantissa (approximately
|7.2 decimal digits), S sub 11 [standard deviation] cannot be
|computed with any accuracy for the numbers 9001, 9002, 9003.
|Following the example through,
|  X sub 1i      X^2 sub 1i      X^2 sub 1i (first 24 binary digits)
|      9001      81018001        81018000
|      9002      81036004        81036000
|      9003      81054009        81054008.
|
|Therefore, to the first 24 binary digits,
|sum of X^2 sub 1i = 243108000.
|Again, to the first 24 binary digits,
|3X-bar^2 = 243108000, so
|  sum of X^2 sub 1i - 3X-bar^2 = 0.
 
[NOTE: see original article for actual formulas]
 
|The same result can be obtained in decimal arithmetic by ignoring
|any digit beyond the seventh. The three consecutive integers need
|not be as large as 9000 for complete loss of accuracy; numbers
|smaller than 3000 will do.
 
Because different software packages used different algorithms
as well as precision (double precision helps minimize the problem)
whether the results one got were correct or wrong depended upon
the algorithm the package used and the precision used (and whether
it could be changed from single to double).
 
BMDP used single precision (32-bit) representation but used
the mean deviation formula to calculate the sum of squares,
so it avoided overflow/underflow errors that would produce
truncated numbers and, hence, incorrect results. BMDP
was the only package to provide correct results with single
precision. However, BMDP required reading the data twice:
first to calculate the appropriate statistics like the mean,
and second to subtract the mean from each value, which
meant the processing took longer.
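W&D's collapse can be reproduced without a 32-bit machine: in ordinary 64-bit doubles, the desk-calculator formula loses everything once the values are large enough, while the BMDP-style two-pass formula stays exact. The values below are my own illustration, not W&D's:

```python
# True sum of squares about the mean for these data is exactly 2.0.
data = [1e8 + 1, 1e8 + 2, 1e8 + 3]
n = len(data)
mean = sum(data) / n                 # 100000002.0, exact

# "Desk calculator": one pass over the data, but the two large terms
# cancel catastrophically.
desk = sum(x * x for x in data) - n * mean * mean

# BMDP-style two-pass: subtract the mean first, then square.
two_pass = sum((x - mean) ** 2 for x in data)

print(desk)      # 0.0 -- all accuracy lost to cancellation
print(two_pass)  # 2.0 -- exact
```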
 
SPSS 1977 (IBM mainframe version) could calculate the
mean and standard deviation correctly but not correlations.
As W&D point out, this was due to SPSS using single
precision for some calculations and double precision
for others. At this time SPSS didn't provide an option
to do double precision calculations throughout.
 
Other packages (DATATEXT, OSIRIS) which provided
an option to use double precision were able to get all of
the statistics correct when this option was used.  Subsequent
to this article, SPSS would use double precision in its
calculations on IBM machines but I'm not sure what year
or version of SPSS this was. Not sure if they changed
the algorithm.
 
W&D note that Control Data Corp (CDC) computers
used 60-bit words, and the desk calculator algorithm would
provide correct results there (I assume that all packages running
on CDC machines made use of the 60-bit representation).
NYU used to have a CDC machine in addition to its IBM
mainframes in the early 1980s and it is telling that Elazar Pedhazur
used both IBM and CDC machines for analyses -- Pedhazur
did his Ph.D. under Fred Kerlinger and would become
part of the Ed psych faculty in NYU's School of Ed.
In Pedhazur's early books on multiple regression he would
show results from both CDC and IBM versions of software.
 
Sidenote: in the 1940s the clinical psychology program
was located in the NYU School of Ed, not in the NYU
School of Arts & Science, and that is where Jack Cohen
got his Ph.D. in clinical psychology as well as his quantitative
training.  After getting his Ph.D., Jack would do clinical
work and research in one of the Veterans hospitals until
the late '50s/early '60s when he became part of the 
NYU Arts & Science psychology dept as well as doing
a consulting gig at the New York State Psychiatric Institute
(NYSPI or just PI) which was run by Columbia's med
school's dept of psychiatry. I believe he was affiliated 
with the Biometrics Dept there, which is where his second
wife Patricia would also work and does so still today. 
 
Anyway, Jon's presentation below helps to make sense of the
W&D presentation, but in the context of the 32-bit word.
I believe that the IEEE 754 standard would help standardize
operations across statistical packages.   
 
-Mike Palij
New York University
 

Re: IF statement problem or maybe a compute problem

John F Hall
In reply to this post by Mike
Mike

Punched cards?  At least you could see the text along the top!  When writing the Salford Survey Suite in the mid-1960s the computer was an English Electric KDF9 (64K RAM and as big as a small house: input and output on 8-hole paper tape).  I first encountered cards at LSE in 1970, when data, programs and JCL were needed on actual cards to feed into SDTAB (a program for Survey Data Tabulation) and MUTOS (a program for spreading out data supplied by fieldwork agencies on multi-punched cards).  Other software we tried included BMDP and OSIRIS, but local programs abounded elsewhere.  For survey processing before SPSS (and personal computers) see slide-show
http://surveyresearch.weebly.com/uploads/2/9/9/8/2998485/slides_1_-_history_and_background.ppt 

SPSS was only implemented in the UK in 1970 after some political infighting at Edinburgh in which an in-house proposal was defeated.  
(see: http://surveyresearch.weebly.com/uploads/2/9/9/8/2998485/appendix_1_how_spss_came_to_the_uk.pdf)  Edinburgh was one of a handful of universities then allowed to buy American computers.

My first encounter with SPSS was a completely disastrous blind date (http://surveyresearch.weebly.com/my-first-time-in-at-the-spss-deep-end-1972.html) but perseverance paid off.  

It was a revelation and spread like wildfire among university social science departments, not least because the manual was available in academic book-stores and the command language was not Fortran, but English.  At last we were free of constraint by the mathematicians, statisticians and programmers guarding their precious machines, many of whom criticised SPSS for their own reasons, including inefficient programming and statistical inaccuracy.  We just ignored them.

I have used it ever since, starting at LSE in 1972 (data and setups still on actual 80-column cards, data later on magnetic tape) using the CDC 2000 and, when SPSS was available for other manufacturers, the ICL 2900.  From 1976 (still on cards: card-punch on another campus a mile away) on ICL 1900, then from 1982 (when my unit acquired a single card-punch) and from the mid-1980s (Oh joy! A new lab with 16 VDUs, 4 fast servers and 2 lineprinters: no more cards!) on Dec-10, Dec-20 and finally a Vax cluster (but still with JCL in the form of EDT and VMS).  Even then the social scientists weren't consulted about their needs, but were dependent on what Librarianship and Information Sciences had ordered.

SPSS for Windows offers ultimate freedom from all constraints except personal incompetence.

NB: the links above are taken from my 2006 presentation to ASSESS (SPSS users in Europe)
http://surveyresearch.weebly.com/old-dog-old-tricks-using-spss-syntax-to-beat-the-mouse-trap.html


Re: IF statement problem or maybe a compute problem

Ron0z
John F Hall wrote
and finally a Vax cluster
That's a term from the past.  I spent several years using a VAX and grew to love it with an affection no other system has come close to equaling. (Tough hardware too.)  A mainframe where SPSS batch jobs could be scheduled (many tasks running simultaneously); VAXMate, which was their version of a PC that could see the mainframe drive as a virtual drive (huge disk space); a programmable editor (TPU?); and my greatest love - DCL, which permitted so many office automation processes to be created with an elegance that makes DOS look like a toy.