SPSS-Stats question regarding outliers

Robert Marshall-7
Hi list,

I have an SPSS-Stats question. SPSS shows outliers using the BOX PLOT feature. These outliers look to me to be based on the data's frequency distribution. SPSS also shows outliers in the form of residuals. In my work, I removed any data points with residuals greater than 3 SD. There were only three outliers according to an analysis of residuals, but when I look at the box plot, there are about twice as many outliers. I don't really know the practical difference between these outliers.

Which way should I be examining outliers, by frequency distribution or by residuals?

Any help greatly appreciated.

Robert
Re: SPSS-Stats question regarding outliers

Hector Maletta
Robert,
Nothing weird is happening. It is just that the new outliers are computed relative to the new standard deviation. They are still more than 3 SD from the mean; it is only that the new standard deviation is smaller than before. If your data have a near-normal distribution, with more and more cases as you approach the mean, then the more outliers you exclude, the more new outliers will appear, because the >3 SD tail of the distribution becomes more densely populated as the SD diminishes in size.
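
A minimal sketch of this effect in syntax, assuming a single variable (the name score is a placeholder; DESCRIPTIVES /SAVE creates the z-score variable Zscore):

* Pass 1: save z-scores and flag cases beyond 3 SD of the full-sample mean.
DESCRIPTIVES VARIABLES=score /SAVE.
COMPUTE flag1 = (ABS(Zscore) > 3).
EXECUTE.
* Pass 2: on the remaining cases the SD is smaller, so some cases that
* survived pass 1 will now lie beyond 3 of the new, smaller SDs.
TEMPORARY.
SELECT IF flag1 = 0.
DESCRIPTIVES VARIABLES=score /STATISTICS=MEAN STDDEV MIN MAX.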

Hector

Re: SPSS-Stats question regarding outliers

Robert Marshall-7
In reply to this post by Robert Marshall-7
Thank you so much Hector.   Now I get it.   :-)

Re: SPSS-Stats question regarding outliers

Ken Belzer
In reply to this post by Robert Marshall-7
Hi,

I just had two brief follow-up questions to Robert's concerning outliers in SPSS. First, Robert mentioned "SPSS also shows outliers in the form of residuals." Where is this found in SPSS? Is it derived from DESCRIPTIVES, EXPLORE, or a REGRESSION procedure?

Second, is more than 3 standard deviations the commonly accepted (or default) definition of an outlier in SPSS, particularly for the boxplots provided in the EXPLORE procedure?

Thanks very much in advance.

Regards,
Ken

Re: SPSS-Stats question regarding outliers

Peck, Jon
Large regression residuals can be tabulated with the /CASEWISE OUTLIERS(n) syntax in REGRESSION, and, of course, residuals can be saved and analyzed with EXAMINE/EXPLORE and other procedures.

I don't think, as a general matter, though, that automatically removing residuals larger than 3 SD is a good idea. They are evidence against your model and ought to be carefully considered. If they have large leverage in the regression, this is especially important.

I would look at leverage and residual size, but also at plots such as residuals vs. fitted values, to see whether there is a pattern that can be discerned. If you do remove cases, you might want to require a higher significance level or, at least, document this when reporting results. Ultimately this is a judgment call that is often not easy.
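
For concreteness, a hedged sketch of that workflow in syntax (y, x1 and x2 are placeholder names, not anything from Robert's data):

REGRESSION
  /DEPENDENT y
  /METHOD=ENTER x1 x2
  /CASEWISE PLOT(ZRESID) OUTLIERS(3)
  /SAVE PRED(pred) ZRESID(zres) SDRESID(sdres) LEVER(lev).
* Residuals versus fitted values, to look for any pattern.
GRAPH /SCATTERPLOT(BIVAR)=pred WITH zres.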

Regards,
Jon Peck

Re: SPSS-Stats question regarding outliers

Hector Maletta
I agree with Jon's point that removing outliers is not generally a sound idea. The only case where I would go for it is when the outlier is manifestly a wrong datum (say, age = 542), in which case it should be recoded to some missing value (or, in some particular instances, replaced by an imputed value, as the case might be).

Hector

Re: SPSS-Stats question regarding outliers

F. Gabarrot
Hello List,

I agree that detecting outliers simply as cases more than 3 SD from the mean is not a good idea, and I also agree that outliers deserve a closer look and that automatic suppression is unwise. However, an outlier is likely to alter both the Type I and Type II error rates. According to McClelland (2000), linear models are quite robust to violations of many of their assumptions, but he considers the most "dangerous" violation to be the "thick tails" of a distribution caused by outliers.

There are many ways to detect outliers, depending on the dimension on which they differ from the expected distribution. An observation can be extreme on a continuous criterion, on continuous predictors, or on both. For instance, if you expect a positive relation between the criterion and the predictor, an observation can be very high on both and still fit the model, in which case it would not be a real outlier; yet the M +/- 3 SD rule is likely to flag it as one. Conversely, an observation can be near the mean on the criterion but very high (or very low) on the predictor, so it would not be flagged by the M +/- 3 SD rule, even though it truly is an outlier. Another problem with using z-scores is that an outlier may distort the estimated mean and standard deviation so that the outlier no longer looks extreme, as Hector highlighted.

One solution is to leave out an observation, recalculate the mean and standard deviation of the remaining observations, and then compute the z-score. McClelland recommends using the Studentized Deleted Residual (SDR) as an indicator of outliers. The SDR compares the distribution based on all observations with the distribution based on all observations minus one (the observation for which SPSS reports the SDR value). The SDR can be compared with a t-distribution with (n - 2) degrees of freedom. However, since using the SDR implies making n tests, you should adjust your alpha level with a Bonferroni correction.

Concerning the closer look to give to outliers, McClelland makes no recommendation. In my opinion, you can treat outliers much as you would missing values. I would recommend recoding outliers into a variable with values 0 (not an outlier) and 1 (outlier) and then regressing this new variable on your model, in order to assess whether your outliers are randomly distributed. If they are, I see no problem with deleting them. If they are not, then, as Jon pointed out, you should be more careful in interpreting your results.

My opinion is also that you can replace outlier values as you would missing values (as Hector proposed; see also Tabachnick and Fidell, 2004, or Cohen, Cohen, West, & Aiken, 2003), or use this 0/1 variable as a predictor in your model. This prevents the loss of power caused by deleting observations.
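
To make that concrete, a hedged sketch in syntax (y, x1, x2 and the cutoff are placeholders; the actual cutoff should be the Bonferroni-adjusted t value for your own n and df):

REGRESSION
  /DEPENDENT y
  /METHOD=ENTER x1 x2
  /SAVE SDRESID(sdres).
* Flag cases whose studentized deleted residual exceeds the chosen critical
* value (3.5 below is only an illustrative stand-in; IDF.T can be used to
* obtain the Bonferroni-adjusted value for a given n and df).
COMPUTE outflag = (ABS(sdres) > 3.5).
EXECUTE.
* Keep all cases and enter the flag as an additional predictor, or examine
* how the flagged cases relate to the rest of the model.
REGRESSION
  /DEPENDENT y
  /METHOD=ENTER x1 x2 outflag.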

Re: SPSS-Stats question regarding outliers

Tom Werner
In reply to this post by Robert Marshall-7
My understanding of SPSS's Boxplot feature is that it produces a boxplot by
calculating the Interquartile Range (the middle half of the sample around
the median).

The Interquartile Range is the "box" in an SPSS boxplot.

SPSS then identifies outliers as data points that are more than
one-and-a-half box-lengths from each end of the box.

(See the bottom of the page at
http://www.maths.murdoch.edu.au/units/statsnotes/samplestats/boxplot.html.)

So, in drawing a boxplot, SPSS is using median and interquartile range
(rather than mean and standard deviation).
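
For reference, a minimal EXAMINE call that produces such a boxplot (the variable name resid is just a placeholder, e.g. a saved residual):

EXAMINE VARIABLES=resid
  /PLOT=BOXPLOT
  /STATISTICS=DESCRIPTIVES
  /PERCENTILES(25,75)=HAVERAGE.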


Am I right about this? Do others have the same understanding?


Tom Werner



Re: SPSS-Stats question regarding outliers

Peck, Jon
You can find the exact details in the Help/Algorithms/Examine/Plots/Boxplots topic.

Outliers are indeed based on 1.5 IQR.

IQR = Q3 - Q1
STEP = 1.5 * IQR
Outlier (high side) if Q3 + STEP <= y(i) < Q3 + 2*STEP
Extreme if further out than that.
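
For example, with hypothetical quartiles Q1 = 10 and Q3 = 20, IQR = 10 and STEP = 15, so high-side outliers fall in [35, 50) and extremes at 50 or beyond. As a sketch in syntax, assuming q1 and q3 have already been obtained (say from EXAMINE's percentiles) and stored alongside the plotted variable y:

COMPUTE iqr    = q3 - q1.
COMPUTE step   = 1.5 * iqr.
COMPUTE hi_out = (y GE q3 + step AND y LT q3 + 2*step).
COMPUTE hi_ext = (y GE q3 + 2*step).
COMPUTE lo_out = (y LE q1 - step AND y GT q1 - 2*step).
COMPUTE lo_ext = (y LE q1 - 2*step).
EXECUTE.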

Regression outliers use a moment-based calculation and will generally give different results, but both are useful.  Of course, the boxplot does not know that the values are residuals, so it does not make adjustments for that.

-Jon Peck

Re: SPSS-Stats question regarding outliers

Ken Belzer
In reply to this post by Robert Marshall-7
Thanks very much to Robert for raising this issue, and to those who responded to my follow-up questions. I had been using a simple rule of thumb from Tabachnick & Fidell (1996 - probably a bit outdated) of excluding outliers from the model/analysis if their z-scores relative to their distribution were equal to or greater than +/-3.28, primarily for univariate procedures.

Clearly, there's quite a bit more to consider, and many more ways to examine the nature and impact of outliers before simply excluding them. I've saved these responses for future reference. Thanks again.

Kind regards,
Ken



Re: SPSS-Stats question regarding outliers

Hector Maletta
About the exclusion of outliers outside +/- 3 SD, a big "if" concerns the distribution of the variable. In biological variables, which mostly have a normal distribution, cases outside the -3 to +3 range are rare, and more so outside +/- 4 or 5. Moreover, they are often the result of data entry errors or sample flukes. But in other kinds of variables it ain't necessarily so. Income, for instance, is clearly skewed, and excluding cases above +3 SD (even when using log income) may mean leaving the rich, and with them a big chunk of aggregate income, outside the analysis.

As a general rule, do not exclude any valid datum.

Hector



Re: SPSS-Stats question regarding outliers

Art Kendall-2
<soapbox>

"Outliers" is a very problematic concept.  There are a wide variety of
meanings ascribed to the term.

Extreme values may be valid. They may be the most important values.
Arbitrary treatment of values as outliers should rarely if ever be done.
Leverage stats,  etc.  only identify potential or suspected outliers.

Based on consulting on statistics and methodology for over 30 years, I believe the usual explanation when there are suspicious values is a failure of the quality assurance procedure. I think of a *potential* outlier as a surprising or suspicious value for a variable (including residuals). In my experience, in the vast majority of instances they indicate data gathering or data entry errors, i.e., insufficient attention to quality assurance in data gathering or data entry. In my experience, rechecking QA typically eliminates over 80% of suspicious data values. This is one reason I advocate thorough exploration of a set of data before doing the analysis. By thorough exploration I mean things like frequencies, multi-way crosstabs, scatterplots, box plots, rechecking scale keys and reliability, etc.

Derived variables, such as residuals and rates, should be subjected to the same thorough examination and understanding-seeking as raw variables.

In cluster analysis there are sometimes singleton clusters; e.g., Los Angeles county is distinct from the other counties in the western US states. Sometimes there are 500 lb persons. There might be a rose growing in a cornfield.


The first thing to do with outliers is to *prevent* them, by careful quality assurance procedures in data gathering and data entry.

A thorough search for suspect data values, and potentially treating them as outliers in analysis, is an important part of data quality assurance. Values for a variable are suspect and in need of further review when they are unusual given the subject matter area, are outside the legitimate range of the response scale, show up as isolated points on scattergrams, have subjectively extreme residuals, when the data show very high-order interactions in ANOVA analyses, when they result in a case being extremely influential in a regression, etc. Recall that researchers consider Murphy a Pollyanna.


The detection of odd/peculiar/suspicious values late in the data analysis process is one reason to ensure that you can go all the way back and redo the process. Keeping all of the data gathering instruments, and preserving the syntax for all data transformations, are important parts of going back and checking on "outliers". The occurrence of many outliers suggests that the data entry was sloppy, and there are likely to be incorrectly entered values that are not "outliers". Although it is painful, another round of data entry and verification may be in order.


*Correcting the data.*

Sometimes you can actually go back and redo the measurements. (Is there really a 500-pound 9-year-old?) You should always have all the paper from which data were transcribed.
On the rare occasions when there are very good reasons, you might modify the value for a particular case, e.g., percent correct entered as 1000% ==> 100%.


*Modifying the data.*
Values of variables should be trimmed or recoded to "missing" only when there is a clear rationale, and then only when it is not possible to redo the measurement process. (Maybe there really is a six-year-old who weighs 400 lbs. Go back and look if possible.)

If suspected outliers are recoded or trimmed, the analysis should be done both as-is and as modified, to see what the effect of the modification is. Changing the values of suspected outliers frequently leads to misleading results. These procedures should be used very sparingly.

Mathematical criteria can identify suspects. There should be a trial before there is a verdict, and the presumption should be against outlier status for a value.


I don't recommend undesirable practices such as cavalierly trimming to 3 SDs. Having a value beyond 3 SD can be a reason to examine a case more thoroughly.

It is advisable to consult with a statistician before changing the
values of suspected outliers.

*Multiple analyses.*

If you have re-entered the data, or re-run the experiment, and done a very thorough exploration of the data, you are stuck, as a last resort, with doing multiple analyses: including vs. excluding the case(s); changing the values for the case(s) to hot-deck values, to some central-tendency value, or to the max or min on the response scale (e.g., for achievement, personality, or attitude measures); etc.

In the small minority of occasions where the data cannot be cleaned up, the analysis should be done in three or more ways (include the outliers as-is, trim the values, treat the values as missing, transform to ranks, include in the model variables that flag those cases, or ...). The reporting becomes much more complex. Consider yourself very lucky if the conclusions do not vary substantially.
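
A hedged sketch of what such multiple analyses might look like in syntax (y, x1, x2, and the 0/1 flag variable suspect are placeholders):

* 1. As-is.
REGRESSION /DEPENDENT y /METHOD=ENTER x1 x2.
* 2. Excluding the suspect cases (TEMPORARY restricts the SELECT IF to the
*    next procedure only).
TEMPORARY.
SELECT IF suspect = 0.
REGRESSION /DEPENDENT y /METHOD=ENTER x1 x2.
* 3. On ranks instead of raw values.
RANK VARIABLES=y x1 x2 (A)
  /RANK INTO ry rx1 rx2.
REGRESSION /DEPENDENT ry /METHOD=ENTER rx1 rx2.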

</soapbox>

Art Kendall
Social Research Consultants

Re: SPSS-Stats question regarding outliers

Tom Werner
This is a most interesting discussion.

I might suggest that one area in which it may be important to identify (and
perhaps remove) outliers is ratings by judges.

If we were to have panels of judges judging entries by rating them on
numerical scales (such as in an awards program, skating/gymnastics judging,
or applicant judging), we would need to identify and manage inter-rater
reliability/agreement.

One way to achieve appropriate inter-rater reliability would be to identify
and remove outlier ratings (overly strict or overly lenient scores relative
to other judges' scores on the same entry).

It strikes me that if this isn't done, the awards program (or sports contest
or application process) risks having variation that is due more to the
judges than to the entries.

(Other steps could also be taken to increase inter-rater reliability, such
as training of the judges. But it seems that identifying and removing
outlier scores would always be a worthwhile additional step.)

I'd be grateful for any thoughts on this.


Tom Werner



Re: SPSS-Stats question regarding outliers

Peck, Jon
In reply to this post by Art Kendall-2
When you can check the data and eliminate outliers that way, it's a pure win.  Many times you can't, though.  And sometimes the outliers are what really tell the story.  If what you observe is actually a mixture of two different processes, it may be outliers that allow you to separate them.

At the same time, extreme values tend to be the high leverage values, so their treatment is most important.  If your model isn't too sensitive to the outliers, then keep them.

In the presence of unrejectable and influential outliers, you may want to use robust methods instead of typical least squares methods (starting with medians instead of means), but looking at the outlier pattern may reveal a model misspecification that can be fixed and will eliminate them.

When we benchmark our software, we often see some extreme time differences for a few runs that obscure the effect of some change we have made.  In those cases, we can generally assume that other things happening in the computer and not measured or controlled can be blamed and the case eliminated without much worry.  But I always advocate nonparametrics as the right way to summarize benchmark runs.
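
A hedged illustration of that nonparametric approach in syntax (runtime and build are hypothetical names for the timing measure and the two builds being compared):

* Robust location estimates and boxplots per build.
EXAMINE VARIABLES=runtime BY build
  /PLOT=BOXPLOT
  /STATISTICS=DESCRIPTIVES
  /MESTIMATORS HUBER(1.339).
* Compare the two builds without assuming normality.
NPAR TESTS
  /M-W= runtime BY build(1,2).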

-Jon Peck
