

Dear All,
I would be grateful if you could help me with the following. I am getting desperate, as I have to present my data on Wednesday.
I checked the linearity assumption for my only continuous variable, and it is violated.
I transformed the variable with the natural logarithm. How can I check whether the assumption holds now?
Thank you in advance,
Dimitrios

Administrator

One straightforward way to get an idea about the functional relationship between a continuous explanatory variable and the log-odds of an "event" (with "event" being defined as Outcome variable = 1) is as follows:
1. For exploratory purposes only, recode the continuous variable into some number of categories (e.g., quintiles).
2. Estimate a model with the categorical variable in place of the continuous variable, and save the predicted probabilities.
3. Convert the predicted probabilities to predicted log-odds.
4. Make a scatterplot with X = the original continuous variable and Y = predicted log-odds.
Here's an example from something I helped a colleague with a while ago.
* Model 1: Exploratory with categorical Age variable.
LOGISTIC REGRESSION VARIABLES Admission_status2
/METHOD=ENTER AgeGroup Sex ED_only locum
/CONTRAST (AgeGroup)=Indicator(1)
/PRINT=CI(95)
/SAVE pred(PP1)
/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
COMPUTE LogOdds1 = ln(PP1 / (1 - PP1)).
VARIABLE LABELS LogOdds1 "Log-odds of outcome (Model 1)".
DESCRIPTIVES PP1 LogOdds1.
GRAPH /SCATTERPLOT(BIVAR)=Age WITH LogOdds1.
* That scatterplot shows a clear quadratic (U-shaped) relationship.
* Therefore, when we use Age as a continuous variable in Model 2,
* we'll want to include Age-squared as well.
* Model 2: Treat Age as a continuous variable,
* and include Age-squared.
COMPUTE AgeSq = Age**2.
LOGISTIC REGRESSION VARIABLES Admission_status2
/METHOD=ENTER Age AgeSq Sex ED_only locum
/PRINT=CI(95)
/SAVE pred(PP2)
/CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).
COMPUTE LogOdds2 = ln(PP2 / (1 - PP2)).
VARIABLE LABELS LogOdds2 "Log-odds of outcome (Model 2)".
HTH.
dimitrios wrote
Dear All,
I would be grateful if you could help me with the following. I am getting desperate, as I have to present my data on Wednesday.
I checked the linearity assumption for my only continuous variable, and it is violated.
I transformed the variable with the natural logarithm. How can I check whether the assumption holds now?
Thank you in advance,
Dimitrios


Thank you for your reply.
Is it acceptable to transform a continuous variable into a categorical one for the logistic regression, since my variable is not linear, or is it advisable to go through the transformation instead?
Thank you in advance,
Dimitrios

Administrator

I don't entirely understand your question, but will offer these comments.
1. Generally speaking, it is preferable to treat continuous variables as continuous. E.g., see the Streiner article "Breaking up is hard to do" (link given below). But if the functional relationship between that continuous variable and the outcome is not linear, you'll have to take that into account somehow (e.g., by including higher-order polynomial terms, regression splines, etc.).
http://isites.harvard.edu/fs/docs/icb.topic477909.files/dichotomizing_continuous.pdf
2. In the example I gave earlier in the thread, I carved age into categories for a *preliminary*, *exploratory* analysis that was carried out to provide information about the shape of the functional relationship between age and the log-odds of the outcome variable being = 1. A plot of the fitted log-odds as a function of age showed a clear U-shaped functional relationship. Therefore, when I reverted to treating age as a continuous variable (in my final model), I knew I had to include both Age and Age-squared as explanatory variables. Including Age-squared allowed the functional relationship to be U-shaped.
I hope this clarifies things somewhat.
dimitrios wrote
Thank you for your reply.
Is it acceptable to transform a continuous variable into a categorical one for the logistic regression, since my variable is not linear, or is it advisable to go through the transformation instead?
Thank you in advance,
Dimitrios


At 02:34 PM 4/16/2014, Bruce Weaver wrote:
>when I reverted to treating age as a continuous variable (in my
>final model), I knew I had to include both Age and Age-squared as
>explanatory variables. Including Age-squared allowed the functional
>relationship to be U-shaped.
Bruce is far more the methodologist than I, but it's worth adding
that, for variables (like age) with strictly positive values, the
linear and squared terms tend to be highly correlated, leading to the
usual difficulties when estimating using correlated independent variables.
One can mean-center the age before estimating, to avoid this. Or, it
works pretty well to choose an age near the middle of the range you
have, and use the square of the difference from that age. (It's fine
to use the plain age, rather than mean-centered age, as the linear term.)
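Richard's point is easy to verify numerically. Here is a quick sketch in Python rather than SPSS syntax (the ages are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical, evenly spaced ages from 20 to 80 (strictly positive)
age = np.arange(20, 81, dtype=float)

def pearson(x, y):
    # Plain Pearson correlation coefficient
    xd, yd = x - x.mean(), y - y.mean()
    return float(xd @ yd / np.sqrt((xd @ xd) * (yd @ yd)))

r_raw = pearson(age, age ** 2)              # linear vs. squared term, raw ages
centered = age - age.mean()                 # mean-centered age
r_centered = pearson(centered, centered ** 2)

print(round(r_raw, 3))                      # close to 1: nearly collinear
print(round(abs(r_centered), 3))            # essentially 0 for this range
```

For an evenly spaced range that is symmetric about its mean, centering removes the linear/quadratic correlation entirely; in real data it won't be exactly zero, but it will typically be far smaller than for the raw variable.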
=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Administrator

Hi Richard. Just a quick off the cuff response here, because it's time to get off home for the Easter weekend.
I would argue that the collinearity of X and X-squared is "illusory", meaning that it is completely non-problematic. (I know there is a published article somewhere making this argument, but I can't lay my hands on it right now.) Here's one reason for thinking that: If you run the model with and without centering, and save the fitted values of Y (or the predicted probabilities, in the case of logistic regression), those fitted values (or predicted probabilities) will be identical. And the R-squared (for OLS models) or -2LL values (for models fit via MLE) will be identical too. So it's the same model, regardless of whether you center or not.
Having said that, I often do center the variables. But I do so simply to make (some of) the coefficients more interpretable. And rather than center on the mean, I often center on a convenient value near the minimum. Part of the reason I do that is to emphasize the point that it is nowhere written in stone that thou shalt center on the mean! (Even if one does want to mean-center, it is better practice, I think, to center on a value near the mean, and to center on the same value each time if one is conducting multiple studies. After all, the sample means will not all be the same; so centering on the same value each time makes the results more comparable across studies.)
Cheers!
Bruce
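Bruce's identical-fitted-values claim can be demonstrated in a few lines. A minimal sketch in Python (simulated data, and OLS rather than logistic regression, but the same column-space argument applies to both):

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(20, 80, size=200)
# Simulated U-shaped outcome plus noise
y = 0.5 + 0.002 * (age - 50) ** 2 + rng.normal(0, 1, size=200)

def fitted(x):
    # OLS fit of y on {1, x, x^2} via least squares
    X = np.column_stack([np.ones_like(x), x, x ** 2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

# Raw vs. mean-centered parameterization: same span, same projection
same = np.allclose(fitted(age), fitted(age - age.mean()))
print(same)  # centering re-parameterizes but does not change the model
```

The columns {1, x, x^2} and {1, x - m, (x - m)^2} span the same space, so the least-squares projection of y (and hence every fitted value and fit statistic) is identical either way.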
Richard Ristow wrote
> [snip, previous]


http://m.orm.sagepub.com/content/15/3/339.abstract

On Apr 17, 2014, at 7:01 PM, Bruce Weaver <[hidden email]> wrote:
> [snip, previous]

Administrator

That's the one I was thinking of. Thanks Ryan.
Ryan Black wrote
> [snip, previous]


I haven't read it yet but I have what appears to be a pretty similar article in my to read bucket list:
Shieh, G. (2011). Clarifying the role of mean centring in multicollinearity of interaction effects. British Journal of Mathematical and Statistical Psychology, 64(3), 462-477. (No preprint PDF I'm afraid, doi here.)
I would note that if the variable has a mean far away from zero, you can have numerical instability in inverting the design matrix for squared or higher polynomial terms. E.g., in this post for illustration I had polynomial terms of years starting in 1985. If I remember correctly, SPSS would drop the squared year term when I estimated a linear regression equation, let alone the regression with both the squared and the cubed term.
Also, FYI, I wrote a macro to estimate restricted cubic spline bases, a popular alternative to polynomial terms. I guess I will do the next blog post on how you can use them in logistic regression, as I got a comment asking about that as well.
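Andy's instability point about raw years can be illustrated with the condition number of the design matrix. A small Python sketch (the 1985-2010 range mirrors his example; the thresholds in the comments are just illustrative):

```python
import numpy as np

years = np.arange(1985, 2011, dtype=float)

def cond_quadratic(x):
    # Condition number of the design matrix {1, x, x^2};
    # large values mean (X'X) is numerically hard to invert
    X = np.column_stack([np.ones_like(x), x, x ** 2])
    return np.linalg.cond(X)

c_raw = cond_quadratic(years)            # enormous: columns span ~1 to ~4e6
c_shifted = cond_quadratic(years - 1985) # far smaller after shifting the origin
print(c_raw > 1e6, c_shifted < 1e5)
```

Shifting the origin (here to 1985, the minimum, in the spirit of Bruce's "center near the minimum" habit) is enough to tame the numerics without changing the model being fit.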


Amen, Bruce. I see this misconception
repeated all the time on this list and elsewhere. No matter how many
times I assert that computationally this makes no difference, it doesn't
seem to get through, and the results are exactly equivalent up to
a very high level of numerical exactness. Maybe people
will believe it when you say it.
Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621
From: Bruce Weaver <[hidden email]>
Date: 04/17/2014 05:02 PM
Subject: Re: [SPSSX-L] logistic regression assumption

> [snip, previous]


Hi Bruce,
I just posted the link to that article without comment before because I was preoccupied, but now that I have a moment I'd like to chime in here. First and foremost, I agree with you entirely. I have not encountered a situation in which centering a variable resulted in any change in the actual model being fit. I have, on occasion, encountered challenges achieving convergence when fitting random-effects models via Bayesian estimation in WinBUGS and SAS without mean-centering [due to high autocorrelation, an issue with Bayesian estimation I care not to delve into at the moment].
Knowing that (1) generally, regression models do not change by mean-centering variables, and (2) I can utilize the coefficient matrix L to obtain parameter estimates/contrasts at whatever values of the variables I desire by utilizing subcommands of various procedures (e.g., LMATRIX in GLM, TEST in MIXED), I virtually never mean-center before fitting models.
Best,
Ryan


Naturally, I have to agree with the mathematics. If you want to say that the difference is "illusory", that's okay, too, for certain values of the word "illusory". I have to say, here, please keep in mind that "illusions" can serve a useful function. Twenty frames per second of fixed images showing moving figures gives the human viewer the illusion of perceived motion; that makes possible flipbooks and movies.
Have you ever had to show your results to someone else? I assure you, it is easier to discuss two regression coefficients (their sizes and tests) when they are not highly correlated. I try to avoid modeling with such terms, period. For two highly correlated variables among the IVs, I suggest to consultees that they be modeled by some (relatively uncorrelated) composites for the sum and difference, or the sum and difference of the logarithms. Putting in two highly correlated terms is something we should do only when it is unavoidable, that is, when we *want* to puzzle over their confounding after the fact. What you can say about the correlated ones most often comes down to, "Ignore these numbers; take my word that it means what I say." My own consultees have been happier with the illusion presented by values and tests for separate terms. And it *does* tell them about the relative impact of the terms, fairly concisely and precisely.
But I learned to center for the other purpose that was mentioned: the *occasional* failure of a program to get an answer because of a near-collinearity error (convergence, or otherwise). That purpose is not illusory. It seems like sloppy practice to wait for the error to happen when it can be prevented.
-- Rich Ulrich
Date: Thu, 17 Apr 2014 19:41:54 -0600
From: [hidden email]
Subject: Re: logistic regression assumption
To: [hidden email]

> [snip, previous]
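Rich's sum-and-difference suggestion is easy to illustrate numerically. A Python sketch with two simulated, highly correlated predictors (the data-generating setup is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
shared = rng.normal(size=500)
# Two IVs that mostly measure the same underlying quantity
x1 = shared + 0.2 * rng.normal(size=500)
x2 = shared + 0.2 * rng.normal(size=500)

r_pair = np.corrcoef(x1, x2)[0, 1]                  # highly correlated pair
r_composites = np.corrcoef(x1 + x2, x1 - x2)[0, 1]  # sum and difference

print(r_pair > 0.9)                # the raw pair is highly correlated
print(abs(r_composites) < 0.3)     # the composites are nearly uncorrelated
```

When the two variables have roughly equal variances, the sum and difference are uncorrelated in expectation, so the composites can be entered together without the usual collinearity headaches.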


Might I also add that one could perform a Likelihood Ratio Test (LRT) to test whether including the AgeSq term significantly improves model fit in Bruce's example. Although untested, I'm fairly certain the following adjustment to Bruce's syntax will provide the LRT in the Omnibus Tests of Model Coefficients Table:
LOGISTIC REGRESSION VARIABLES Admission_status2
/METHOD=ENTER Age Sex ED_only locum
/METHOD=ENTER Age AgeSq Sex ED_only locum.
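For reference, the LRT itself is just the drop in -2LL between the two blocks, referred to a chi-square distribution with df equal to the number of added terms. A stdlib Python sketch (the -2LL values 412.7 and 401.2 are made up purely for illustration):

```python
import math

def lrt_1df(neg2ll_reduced, neg2ll_full):
    """LR test for one added term (df = 1): stat = drop in -2LL.
    For chi-square with 1 df, P(X > s) = erfc(sqrt(s / 2))."""
    stat = neg2ll_reduced - neg2ll_full
    return stat, math.erfc(math.sqrt(stat / 2.0))

# Hypothetical -2LL values for the blocks without and with AgeSq
stat, p = lrt_1df(412.7, 401.2)
print(round(stat, 1))   # 11.5
print(p < 0.001)        # a drop of 11.5 on 1 df is highly significant
```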


While I agree that mean-centered variables are easier to interpret, please add a chart if you want to substantively talk about them! I can do the derivatives in my head, although I suspect much of any audience won't go to that trouble. I also do not have a good mental model of the steepness of the parabola from just the estimated parameters, nor do I have a good mental model of how large or small the estimates get towards the reasonable values of the explanatory variable in question. (This is important, as polynomial terms often behave badly in the tails, which is one of the reasons to use restricted cubic splines.) My mental model of these things gets worse if you include a cubed term.
So please, graph your effect estimates! All the things of interest (inflection point, how fast the curve rises or falls, how extreme the tails are) are immediately visible in a graph. You can also add confidence intervals or prediction intervals to the graph.
This advice extends to any set of functionally related explanatory variables.


Just curious. It seems that some people post directly and only to nabble, and some post the same to this list. When I was looking at the logistic regression discussion this morning, I noticed that one of the posts "had not been accepted by the list", which I think Bruce, David or Andy have noted before. What is the functional relationship between nabble and this list? And is that relationship bidirectional or unidirectional only? Also, why the delay?
Gene Maguin
Original Message
From: SPSSX(r) Discussion [mailto: [hidden email]] On Behalf Of Andy W
Sent: Thursday, April 17, 2014 9:05 PM
To: [hidden email]
Subject: Re: logistic regression assumption
[snip, previous]


At 07:01 PM 4/17/2014, Bruce Weaver wrote:
>I would argue that the collinearity of X and X-squared is
>"illusory", meaning that it is completely non-problematic. (I know
>there is a published article somewhere making this argument, but I
>can't lay my hands on it right now.) Here's one reason for thinking
>that: If you run the model with and without centering, and save the
>fitted values of Y (or the predicted probabilities, in the case of
>logistic regression), those fitted values (or predicted
>probabilities) will be identical.
Whatever the collinearity is, it isn't illusory; it's there, and
readily calculable and displayable in the usual fashions.
What you, and others, are arguing is that re-parameterizing the
model as I've suggested doesn't change the subspace of possible
models (defining a 'model' as a set of predicted values), which is
correct; that, therefore, it doesn't change the best-fitting model,
which is also correct; and that, therefore, it doesn't matter, which
I disagree with.
The two reasons I advocate re-parameterizing are, first, that it
makes the resulting coefficients much more interpretable, as others
have noted (the linear term becomes the predicted DV change per
unit IV change in a central part of the range); and second, that
keeping the original, near-collinear parameterization greatly
inflates the SEs and confidence intervals of the estimated
coefficients. Among other things, that makes using t- or F-tests for
whether nonlinear terms belong in the model very insensitive. (It
may be argued that using ANY test to exclude terms from a model
results in overstating the F-based significance of the model; but
that argument applies equally to choosing whether to include
higher-order terms on the basis of a graph.)
It's been noted that collinear predictors also make the estimation
more difficult, numerically, though with modern hardware and software
that's a lesser issue.
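Richard's SE-inflation point can be quantified directly. A Python sketch (simulated data; a quadratic term is included even though the simulated truth is linear, just to mimic the modeling situation under discussion):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(40, 60, size=100)             # strictly positive, far from 0
y = 1.0 + 0.1 * x + rng.normal(0, 1, size=100)

def se_linear(xv):
    # OLS of y on {1, x, x^2}; return the SE of the linear coefficient
    X = np.column_stack([np.ones_like(xv), xv, xv ** 2])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return float(np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1]))

ratio = se_linear(x) / se_linear(x - x.mean())
print(ratio > 5)   # the raw parameterization greatly inflates the SE
```

The fitted values are the same either way (as Bruce noted), but the SE attached to the linear coefficient is many times larger in the raw parameterization, which is exactly what makes the individual t-tests so insensitive.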

Administrator

Good morning Richard. :)
For the record, I want to clarify that I did not intend to advocate NOT centering variables. As I said...
"... I often do center the variables. But I do so simply to make (some of) the coefficients more interpretable. And rather than center on the mean, I often center on a convenient value near the minimum. Part of the reason I do that is to emphasize the point that it is nowhere written in stone that thou shalt center on the mean! (Even if one does want to mean-center, it is better practice, I think, to center on a value near the mean, and to center on the same value each time if one is conducting multiple studies. After all, the sample means will not all be the same; so centering on the same value each time makes the results more comparable across studies.)"
Upon reflection, one change I would make in that (off the cuff) paragraph is to change "simply" to "mainly" in the second sentence, i.e., "I do so MAINLY to make (some of) the coefficients more interpretable".
The main point I was *trying* to make is that I disagree with those authors who say that one MUST (mean-)center their variables when the model includes product terms or higher-order polynomial terms (which are really product terms too: Xsq = X*X, for example).
But... having read some of the other posts in the thread, I will concede that even with modern computing power & software, one may sometimes run into computational difficulties that can be alleviated by centering on some reasonable, in-the-observed-range value (not necessarily the mean).
By the way, I also strongly agree with Andy W on the importance of plotting fitted values for models that include product terms. Looking at such plots is FAR more illuminating than looking at tables of coefficients. (Even if one does wish to interpret the coefficients, it is much easier to do so having looked at plots of fitted values, in my experience.)
Cheers!
Bruce
Richard Ristow wrote
At 07:01 PM 4/17/2014, Bruce Weaver wrote:
>I would argue that the collinearity of X and Xsquared is
>"illusory", meaning that it is completely nonproblematic. (I know
>there is an published article somewhere making this argument, but I
>can't lay my hands on it right now.) Here's one reason for thinking
>that: If you run the model with and without centering, and save the
>fitted values of Y (or the predicted probabilities, in the case of
>logistic regression), those fitted values (or predicted
>probabilities) will be identical.
Whatever the collinearity is, it isn't illusory; it's there, and
readily calculable and displayable in the usual fashions.
What you, and others, are arguing is that reparameterizing the
model as I've suggested doesn't change the subspace of possible
models (defining a 'model' as a set of predicted values), which is
correct; that, therefore, it doesn't change the best-fitting model,
which is also correct; and that, therefore, it doesn't matter, which
I disagree with.
The two reasons I advocate reparameterizing are, first, that it
makes the resulting coefficients much more interpretable, as others
have noted: the linear term becomes the predicted DV change per
unit IV change in a central part of the range; and second, that
keeping the original, near-collinear parameterization greatly
inflates the standard errors and confidence intervals of the estimated
coefficients. Among other things, that makes using t- or F-tests for
whether nonlinear terms belong in the model very insensitive. (It
may be argued that using ANY test to exclude terms from a model
results in overstating the F-based significance of the model; but
that argument applies equally to choosing whether to include
higher-order terms on the basis of a graph.)
It's been noted that collinear predictors also make the estimation
more difficult, numerically, though with modern hardware and software
that's a lesser issue.
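[Editor's note: both claims in this exchange are easy to check numerically. The sketch below is in Python/NumPy rather than SPSS, with a seed, sample size, and variable names invented for illustration. It fits y on x and x-squared with and without mean-centering x: the fitted values are identical (Bruce's point), while the standard error of the linear coefficient is several times larger in the raw, near-collinear parameterization (Richard's point).]

```python
import numpy as np

rng = np.random.default_rng(1234)
n = 1000
x = rng.normal(2.0, 1.0, n)        # mean well away from 0, so x and x^2 are highly correlated
y = 0.5 + 1.5 * x + 2.0 * x**2 + rng.normal(0.0, 1.0, n)

def ols(X, y):
    """Return OLS coefficients, fitted values, and coefficient standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, fitted, se

ones = np.ones(n)
xc = x - x.mean()

b_raw, fit_raw, se_raw = ols(np.column_stack([ones, x, x**2]), y)
b_cen, fit_cen, se_cen = ols(np.column_stack([ones, xc, xc**2]), y)

# 1. Reparameterizing does not change the fitted values...
print(np.allclose(fit_raw, fit_cen))   # True

# 2. ...but it does shrink the SE of the linear coefficient substantially
#    (roughly 3x here, since corr(x, x^2) is about 0.94 in these data).
print(se_raw[1] / se_cen[1])
```

So the two positions are not in conflict: the model (as a set of predicted values) is unchanged, but the precision attached to the individual linear coefficient is very different.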


Bruce and others,
Suppose the population regression model is:
Y = 0.5 + 1.5*x + 2.0*(x^2) + Epsilon
Further, suppose we randomly select 10,000 subjects and collect data on both y and x for each subject. BELOW my name is a simulation experiment which shows that we can obtain estimated parameters (intercept and main effect), standard errors, t-statistics, and p-values from a model employed on non-centered data that are *identical* to those from a model employed on centered data. The TEST statements of the MIXED procedure provide proof of what I claim, at least for this simulation example.
To construct those TEST statements, all I needed to do was to recognize the relationship between the non-centered and centered equations.
With the exception of numerical instability due to various unknown factors (which I very rarely encounter, and certainly did not encounter with this simulation experiment), I continue to assert that there is no need to mean-center the model provided above with respect to accurately estimating model fit, parameters, standard errors/confidence intervals, test statistics, p-values, etc.
Ryan 
* Generate Data.
SET SEED 1234.
NEW FILE.
INPUT PROGRAM.
LOOP ID = 1 TO 10000.
COMPUTE x = RV.NORMAL(2,1).
COMPUTE y = 0.5 + 1.5*x + 2.0*(x**2) + RV.NORMAL(0,1).
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.
COMPUTE x_squared = x*x.
EXECUTE.
* OLS Regression without mean centering.
REGRESSION
 /STATISTICS COEFF OUTS R ANOVA
 /DEPENDENT y
 /METHOD=ENTER x x_squared.
COMPUTE x_mean_centered = x - 1.9797462214653716.
COMPUTE x_mean_centered_sqrd = x_mean_centered**2.
* OLS Regression with mean centering.
REGRESSION
 /STATISTICS COEFF OUTS R ANOVA
 /DEPENDENT y
 /METHOD=ENTER x_mean_centered x_mean_centered_sqrd.
* REML Regression without mean centering.
* Note: Used TEST subcommands to recover the intercept and main effect
* tests from the OLS Regression with mean centering.
MIXED y WITH x x_squared
 /FIXED=x x_squared | SSTYPE(3)
 /PRINT=SOLUTION
 /METHOD=REML
 /TEST 'intercept @ x=0' intercept 1 x 0 x_squared 0
 /TEST 'main eff @ x=0' intercept 0 x 1 x_squared 0
 /TEST 'intercept @ x=mean' intercept 1 x 1.9797462214653716 x_squared 3.919395101406414
 /TEST 'main eff @ x=mean' intercept 0 x 1 x_squared 3.959492442930742.
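[Editor's note: the algebra behind those TEST statements can also be verified outside SPSS. Expanding the centered equation c0 + c1*(x - m) + c2*(x - m)^2 and matching powers of x gives c0 = b0 + b1*m + b2*m^2, c1 = b1 + 2*m*b2, and c2 = b2, which is exactly what the TEST contrasts encode (note 3.9193... = m^2 and 3.9594... = 2*m). A sketch in Python/NumPy, with my own seed and variable names, confirms it numerically:]

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
x = rng.normal(2.0, 1.0, n)
y = 0.5 + 1.5 * x + 2.0 * x**2 + rng.normal(0.0, 1.0, n)

m = x.mean()   # the centering constant
X_raw = np.column_stack([np.ones(n), x, x**2])
X_cen = np.column_stack([np.ones(n), x - m, (x - m)**2])

b0, b1, b2 = np.linalg.lstsq(X_raw, y, rcond=None)[0]   # non-centered fit
c0, c1, c2 = np.linalg.lstsq(X_cen, y, rcond=None)[0]   # centered fit

# Matching powers of x after expanding c0 + c1*(x-m) + c2*(x-m)^2:
print(np.isclose(c0, b0 + b1 * m + b2 * m**2))  # centered intercept = fit at x = m
print(np.isclose(c1, b1 + 2 * m * b2))          # centered slope = slope of the curve at x = m
print(np.isclose(c2, b2))                       # quadratic term is unchanged
```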


Thank you all for your input.
I am rather naive in stats so I would like to clarify this:
When I use age as a continuous variable, its relationship with the outcome is not linear, and therefore I cannot use it as is.
If I use age as ordinal (18-40, 41-60, 61-80), I guess I do not need to worry about linearity.
Results come back similar and, from a practical point of view, it does not change a lot.
I may miss the information that a continuous variable offers (e.g., HR per year), but I still get valuable information about the impact of age.
Is this considered acceptable?
I am grateful to you for all your input, but I am a little concerned to use advanced stats (at least for me) since I may make a significant mistake, without even realizing it.
Thank you in advance,

Administrator

I'll repeat something I noted earlier in the thread, and expand on it.
Here's the repeated bit:
1. Generally speaking, it is preferable to treat continuous variables as continuous. E.g., see the Streiner article "Breaking up is hard to do" (link given below). But if the functional relationship between that continuous variable and the outcome is not linear, you'll have to take that into account somehow (e.g., by including higher order polynomial terms, or regression splines, etc).
http://isites.harvard.edu/fs/docs/icb.topic477909.files/dichotomizing_continuous.pdf
And here is the expansion.
With the age groups you list below:
1. Everyone within an age group will have exactly the same fitted value, despite differing in age by up to about 20 years for those at the extremes.
2. Two people just on either side of the age group cutpoints can have very different fitted values, despite tiny differences in age.
3. The age-group cutpoints are probably arbitrary, and the fitted values for individuals near the cutpoints will likely change fairly substantially if you change the cutpoints.
These are some of the reasons why it is usually preferable (if at all possible) to model continuous variables (like Age) as continuous.
HTH.
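[Editor's note: points 1 and 2 above can be made concrete with a toy simulation. The Python sketch below is illustrative only; the U-shaped data-generating model, the hand-rolled Newton/IRLS logistic fit, and all names are invented, and the 18-40 / 41-60 / 61-80 bands match those in the question. Two people one year apart, on opposite sides of the 40/41 cutpoint, get fitted log-odds separated by the full between-band jump under the categorical model, but change only smoothly under a continuous quadratic-in-age model.]

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000
age = rng.uniform(18, 80, n)
# Hypothetical U-shaped log-odds in age (toy coefficients, for illustration).
true_logit = 0.002 * (age - 50)**2 - 1.0
event = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# Continuous model: logistic regression on age and age^2 via Newton/IRLS.
X = np.column_stack([np.ones(n), age, age**2])
beta = np.zeros(3)
for _ in range(25):
    mu = 1 / (1 + np.exp(-(X @ beta)))
    W = mu * (1 - mu)
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (event - mu))

def fitted_logodds(a):
    return beta[0] + beta[1] * a + beta[2] * a**2

# Categorical model: everyone in a band gets the band's observed log-odds.
bands = np.digitize(age, [41, 61])   # 0: 18-40, 1: 41-60, 2: 61-80
band_logodds = [np.log(event[bands == k].mean() / (1 - event[bands == k].mean()))
                for k in range(3)]

# Two people one year apart, straddling the 40/41 cutpoint:
print(abs(fitted_logodds(41.0) - fitted_logodds(40.0)))  # small, smooth change
print(abs(band_logodds[1] - band_logodds[0]))            # the full between-band jump
```

The continuous fit also recovers the U shape (positive quadratic coefficient), whereas the three-band model can only approximate it with a step function.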

