Fwd: A basic question about outliers

Fwd: A basic question about outliers

fjmenendez

I'd appreciate it if somebody could answer a novice question.

Box plots mark a series of observations as outliers. That is clear in a normal distribution: cases that lie more than 1.5 interquartile ranges (IQR) above P75 or below P25 are considered outliers (some authors say 2.2 IQR instead of 1.5 IQR). That makes sense to me.

But I don't know how to treat outliers in skewed distributions. The term "outlier" comes from "lying outside": we are trying to judge whether observations belong to a distribution.

But in a skewed distribution many observations lie more than 1.5, 2.2, or even more interquartile ranges beyond the quartiles and still belong to the distribution... I feel confused. Does it make any sense to talk about outliers in skewed distributions? How can they be identified?
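
As a rough illustration of the rule discussed in this post, the Python sketch below computes the box-plot fences (P25 - k*IQR and P75 + k*IQR) and compares the share of points flagged in a simulated normal sample with the share flagged in a simulated right-skewed (lognormal) sample. The function names, sample sizes, and seed are assumptions made for this example, and quartile conventions differ between packages, so the counts need not match what SPSS reports.

```python
import random
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def fraction_flagged(values, k=1.5):
    """Share of observations lying outside the fences."""
    lower, upper = tukey_fences(values, k)
    flagged = [v for v in values if v < lower or v > upper]
    return len(flagged) / len(values)


rng = random.Random(2018)  # fixed seed so the run is repeatable
normal_sample = [rng.gauss(0, 1) for _ in range(10_000)]
skewed_sample = [rng.lognormvariate(0, 1) for _ in range(10_000)]

print("normal:", fraction_flagged(normal_sample))
print("skewed:", fraction_flagged(skewed_sample))
```

In theory the 1.5 IQR rule flags roughly 0.7% of a normal population but close to 8% of a lognormal one with these parameters, even though every simulated point genuinely belongs to its distribution. That is the situation the question describes.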

I'd appreciate any help. Thanks in advance.
Florentino Menéndez.

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Re: Fwd: A basic question about outliers

Art Kendall
The question of "outliers" has come up many times over the last few decades
on this discussion list.

Take a look at those discussions in the archives.

On the first page of this list, on the right side, see "search" and then
"advanced search".

A better term might be "suspicious values" or "values that should be
checked".

Remember that distributional assumptions are about the residuals, not the
raw data.

The "anomalous values" tool can help identify mis-keyed or unreasonable
values.

Most rules of thumb are highly questionable when blindly applied.



-----
Art Kendall
Social Research Consultants

Re: A basic question about outliers

Rich Ulrich
In reply to this post by fjmenendez

There is testing, and there is model-fitting. We usually do want tests on the models, so we almost always want to meet the conditions for testing.

Measures with extreme skewness need to be transformed for least-squares statistics (ANOVA) or fitted with a non-linear maximum-likelihood model.

Remember that the assumption for least-squares testing is that equal intervals of the scale should be equal in their influence (or in being influenced) regardless of where they fall on the scale, be it the middle or an extreme. ("Equal interval" describes a /relationship/, not the character of a single measure.)

Tukey gave a rule of thumb: if the largest value of a natural (non-negative) measurement is 20 times the smallest, you almost always should use a transformation. IIRC, "10 times" the smallest suggests that you should consider one. What you want to look at first in choosing a transformation, however, is not the skewness but the mechanism that generates the numbers. As first choices, counts imply square roots; intensities imply logs (or logistic transforms); distances imply reciprocals.
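
The rule of thumb and the first-choice transformations above can be sketched in Python as follows. This is only an illustration: the function names, the reading of "20 times" as a "greater than or equal to" threshold, the requirement that the smallest value be strictly positive, and the data are all assumptions made for the example, not part of the rule as stated.

```python
import math


def spread_ratio(values):
    """Largest value divided by the smallest (values must be positive)."""
    smallest = min(values)
    if smallest <= 0:
        raise ValueError("ratio is only defined for strictly positive values")
    return max(values) / smallest


def transformation_advice(values):
    """Apply the max/min rule of thumb as recalled in the post above."""
    ratio = spread_ratio(values)
    if ratio >= 20:
        return "almost always transform"
    if ratio >= 10:
        return "consider a transformation"
    return "no strong signal from the ratio"


# First-choice transformations, keyed by how the numbers were generated.
FIRST_CHOICE = {
    "count": math.sqrt,
    "intensity": math.log,
    "distance": lambda x: 1.0 / x,
}

intensities = [1.8, 2.4, 3.1, 4.5, 9.0, 42.0]  # made-up measurements
print(transformation_advice(intensities))
print([round(FIRST_CHOICE["intensity"](x), 2) for x in intensities])
```

The lookup is keyed by the generating mechanism rather than by any skewness statistic, which mirrors the advice to look at the mechanism first.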


--

Rich Ulrich



Re: A basic question about outliers

Anthony Babinec
In reply to this post by fjmenendez

Florentino

When a data distribution is skewed, you might summarize it through percentiles such as the 75th, 90th, 95th, 99th, 99.5th, etc.
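
A minimal Python sketch of such a percentile summary is given below. The helper name `upper_percentiles`, the "inclusive" interpolation method, and the placeholder data are assumptions for illustration; statistical packages use several different percentile definitions, so values can differ slightly from SPSS output.

```python
import statistics


def upper_percentiles(values, levels=(75, 90, 95, 99, 99.5)):
    """Return {level: percentile value} for the requested levels.

    Levels must be multiples of 0.5 strictly between 0 and 100.
    """
    # 199 cut points at 0.5% steps; cut point number 2*p sits at percentile p.
    cuts = statistics.quantiles(values, n=200, method="inclusive")
    result = {}
    for level in levels:
        index = round(level * 2) - 1
        if not 0 <= index < len(cuts) or index + 1 != level * 2:
            raise ValueError(f"unsupported level: {level}")
        result[level] = cuts[index]
    return result


values = list(range(201))  # placeholder data: 0, 1, ..., 200
for level, value in upper_percentiles(values).items():
    print(f"P{level}: {value}")
```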

 

Tony Babinec


Re: A basic question about outliers

fjmenendez
In reply to this post by Rich Ulrich
Thanks Art, thanks Rich for your kindness and your knowledge :)

I read the posts about outliers in the list, and I have benefited from them. The idea of thinking of them as suspicious values that need additional checking before any decision is made makes a lot of sense to me. Perhaps I should think about this topic using different words.
I know the anomalous values tool only superficially. Perhaps it is a good idea to read about it again.

Transformations also deserve attention. I feel a little shy about them because of problems of interpretation.

Again, thanks Art, thanks Rich :)


Re: A basic question about outliers

Rich Ulrich

You are right: when using transformations, "interpretation" is the main snag. Sometimes you can report the medians for groups, or percentiles (as someone suggested).

Sometimes the original means are still meaningful, and you can use those. By the way, when the means do NOT seem like appropriate measures for a group, that is a sure sign that ANOVA is not appropriate.

 - You can back-transform to get the so-called "geometric mean" after a log transformation.
 - Some versions of the reciprocal make sense when you invert the descriptive units. For instance, in the USA we talk about MPG, miles per gallon, whereas analyses are often better scaled by the European convention of liters per 100 kilometers.
 - I was impressed with an analysis of track-and-field records that gained both better scaling and a unified presentation across distances by using meters per second instead of the very different times for different distances, such as "9.80 seconds for the 100 meter dash."
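
The first two points in the list above can be sketched in Python as follows. The function names and the example values are assumptions for illustration; the conversion constants are the standard definitions of the US gallon and the international mile.

```python
import math
import statistics

LITERS_PER_US_GALLON = 3.785411784
KM_PER_MILE = 1.609344


def geometric_mean_via_logs(values):
    """Mean on the log scale, back-transformed to the original units."""
    mean_log = statistics.fmean([math.log(v) for v in values])
    return math.exp(mean_log)


def mpg_to_liters_per_100km(mpg):
    """Convert US miles per gallon to liters per 100 km (a reciprocal scale)."""
    km_per_liter = mpg * KM_PER_MILE / LITERS_PER_US_GALLON
    return 100.0 / km_per_liter


values = [1.0, 10.0, 100.0]  # made-up, strongly right-skewed
print(geometric_mean_via_logs(values))  # geometric mean
print(statistics.fmean(values))         # arithmetic mean, for comparison
print(mpg_to_liters_per_100km(25.0))
```

For these three made-up values the geometric mean is 10 while the arithmetic mean is 37, which shows how far apart the two summaries can sit when the data are skewed.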

There are a couple of other skewed-data situations where transformation is the second consideration.
 - When there are a large number of zeros, it is sometimes /logically/ appropriate to break the measure into two variables, e.g., AnyIncome (yes/no) and AveIncome, the latter analyzed perhaps only in the subset with income. The non-zero data might or might not have notable skew.
 - When the measures, as collected, represent counts or amounts, it is proper to ask whether there should be a denominator to make rates (ratios). So we analyze crime rates, birth rates, etc., instead of "total crimes" or "total births" (highly skewed data) across cities or countries of different sizes.
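
Both ideas can be sketched briefly in Python. The variable names echo AnyIncome and AveIncome from the text; the function names, the "per 100,000" scaling, the city names, and all numbers are made up for illustration.

```python
import statistics


def split_zero_inflated(incomes):
    """Return (any_income flags, list of the non-zero incomes)."""
    any_income = [income > 0 for income in incomes]
    positive_incomes = [income for income in incomes if income > 0]
    return any_income, positive_incomes


def rate_per_100k(count, population):
    """Events per 100,000 inhabitants."""
    return count * 100_000 / population


incomes = [0, 0, 0, 1200, 1500, 0, 9800]  # made-up values
any_income, positive_incomes = split_zero_inflated(incomes)
print(sum(any_income) / len(any_income))   # share with any income
print(statistics.fmean(positive_incomes))  # AveIncome among earners only

# Made-up cities of very different size with the same underlying rate.
cities = {"Smallville": (50, 10_000), "Metropolis": (5_000, 1_000_000)}
for name, (crimes, population) in cities.items():
    print(name, rate_per_100k(crimes, population))
```

The raw crime totals differ by a factor of 100 here, yet the rates are identical, so the skew in the totals reflects only city size.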

--
Rich Ulrich



Re: A basic question about outliers

Bruce Weaver
Administrator
In reply to this post by fjmenendez
I think that generalized linear models with appropriate error distributions &
link functions can often yield results that are more interpretable.  (I
think this is what Rich was getting at when he mentioned "non-linear maximum
likelihood models".)  Here's an example for the case where the outcome
variable is positive and positively skewed:

http://rstudio-pubs-static.s3.amazonaws.com/5691_192685385fc445c9b3fb1619960a20e2.html

Notice especially the "Differences and Similarities" section, where the
author says this:

"Thus, if the outcome is log transformed before entering the linear
regression model, the inference [is] about the geometric mean. In contrast,
the generalized linear model approach allows inference about the arithmetic
mean on the original scale."
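
The distinction in that quotation can be checked by hand in the simplest possible case, a model with an intercept and no predictors. The Python sketch below is not a fitting routine and does not reproduce the linked page; it relies on the fact that, with no predictors, the likelihood equation of a Gamma model with log link reduces to sum(y_i - mu) = 0, so the fitted mean is the sample mean. The function names and data are assumptions for illustration, and models with predictors need GENLIN or a GLM library.

```python
import math
import statistics


def log_ols_intercept_only(y):
    """Intercept-only least squares on log(y), back-transformed.

    The fitted intercept is mean(log y), so exp() of it is the
    geometric mean of y.
    """
    return math.exp(statistics.fmean([math.log(v) for v in y]))


def gamma_log_link_intercept_only(y):
    """Intercept-only Gamma GLM with log link, solved by hand.

    With no predictors the likelihood equation is sum(y_i - mu) = 0,
    so the fitted mean is the arithmetic mean of y.
    """
    return statistics.fmean(y)


y = [1.0, 2.0, 4.0, 8.0, 64.0]  # made-up, positive and right-skewed
print(log_ols_intercept_only(y))         # geometric mean
print(gamma_log_link_intercept_only(y))  # arithmetic mean
```

For right-skewed positive data the first number is smaller than the second, so the two approaches answer different questions about the "typical" value.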

Finally, the models estimated on that page using R can also be estimated
using GENLIN, as it allows one to select a Gamma error distribution.

https://www.ibm.com/support/knowledgecenter/en/SSLVMB_25.0.0/statistics_reference_project_ddita/spss/advanced/syn_genlin_model.html

HTH.



-----
--
Bruce Weaver
[hidden email]
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

NOTE: My Hotmail account is not monitored regularly.
To send me an e-mail, please use the address shown above.
