# Importance of Indep vars in Classification trees


## Importance of Indep vars in Classification trees

In the CRT growing method it's possible to rank the independent predictors by importance to the model, but the top-ranked predictor is NOT necessarily the same as the first split predictor. I'm new to trees and would appreciate an explanation for this.

Is importance determined on the whole model rather than by the order of the splits? If this reasoning is correct, then one can't comment on the importance of predictors when other growing methods are used. So if CHAID is used, it's not correct to say that the first splitting variable is the most important/influential.

Can someone clear this up for me?

Regards
Mark

## Re: Importance of Indep vars in Classification trees

Hi Mark,

You can request an "Importance to Model" table and chart for independent variables (CRT only) from the Output sub-dialog box.

The measure of importance M(X) of an independent variable X in relation to the final tree T is defined as the weighted sum, across all nodes in the tree, of the improvements that X yields when it is used as a primary or surrogate splitter. The weight at each split depends on whether the variable was the primary splitter (the variable on which the parent node was split, where the weight is 1) or a surrogate (where the weight depends on the variable's ranking as a surrogate). If 5 surrogates are allowed, the first surrogate for a split has the highest weight among the surrogates and the fifth has the lowest. This weighted sum is reported in the Importance column of the "Independent Variable Importance" table in the Tree output. Because the sum runs over every node, the top-ranked predictor need not be the variable used for the first split.

The "Normalized Importance" for a variable is defined as

Normalised M(X) = 100 * M(X) / Maximum Importance

Thus the most important predictor has a normalised importance of 100. The measure is taken from Breiman et al. (1984).

You can't compare this to CHAID, since CHAID doesn't use surrogate variables (variables used when you have a missing value for the actual split variable). CHAID simply treats missing values as an extra category.

Hth
Antro.
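To make the definition concrete, here is a minimal Python sketch of the weighted sum and its normalisation. It is not SPSS's implementation: the tree, the improvement values, and the variable names are all made up, and the linear surrogate weighting in `surrogate_weight` is purely an assumption, since the reply only says that the weight falls with surrogate rank.

```python
from collections import defaultdict

# Hypothetical tree: one list per split node, each entry is
# (variable, rank, improvement). rank 0 = primary splitter,
# rank 1 = first surrogate, rank 2 = second surrogate, and so on.
NODES = [
    # Root node: split on "age", with "income" as a close surrogate.
    [("age", 0, 0.050), ("income", 1, 0.045)],
    # Two lower nodes, both split on "income".
    [("income", 0, 0.030), ("region", 1, 0.010)],
    [("income", 0, 0.020), ("age", 1, 0.005)],
]


def surrogate_weight(rank, n_surrogates=5):
    """ASSUMED weights: 1 for the primary splitter, then linearly
    decreasing with surrogate rank (the real scheme is not given here)."""
    if rank == 0:
        return 1.0
    return (n_surrogates - rank + 1) / (n_surrogates + 1)


def importance(nodes, n_surrogates=5):
    """M(X): weighted sum of improvements over all nodes in the tree."""
    totals = defaultdict(float)
    for node in nodes:
        for variable, rank, improvement in node:
            totals[variable] += surrogate_weight(rank, n_surrogates) * improvement
    return dict(totals)


def normalized_importance(raw):
    """Normalised M(X) = 100 * M(X) / maximum importance."""
    top = max(raw.values())
    return {variable: 100.0 * (m / top) for variable, m in raw.items()}


raw = importance(NODES)
norm = normalized_importance(raw)
for variable in sorted(norm, key=norm.get, reverse=True):
    print(f"{variable:8s} M(X)={raw[variable]:.4f}  normalised={norm[variable]:.1f}")
```

With these made-up numbers, `age` is the root splitter, yet `income` ends up with the largest total because it contributes at several nodes, both as a primary splitter and as a surrogate.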

## Re: Importance of Indep vars in Classification trees

In reply to the message from "Antro, Mark" sent Friday, September 15, 2006 9:56 AM:

Mark, thanks for the reply. Can I ask a few more questions?

I work with market research data that doesn't have many missing values, so this issue is low on my priorities, but thanks for the pointers. What I'm after is what some call Key Driver Analysis: which variables are the drivers/predictors?

In the Notes section of the output, the input independent variables are listed as well as those used in the model. Can I read anything into this? I'm assuming the ones used in the model are the significant ones, but does the order imply anything? It's not necessarily the same as the input order.

Thanks

## Re: Importance of Indep vars in Classification trees

Hi Mark,

As far as I know, there is no order to the list, in the Model Summary table, of the variables used in the model. It merely lists those that are significant and that have been used to build the tree.

Whilst the CRT method gives you the Importance to Model table, there is no equivalent for CHAID (or Exhaustive CHAID). You could pick the variables used in the top levels of the tree and use regression to assess their importance from the standardised beta values. You could use the Enter or Stepwise methods.

Rgds,
Antro.
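As a rough illustration of the regression suggestion, the sketch below computes standardised betas in plain Python by z-scoring every variable and solving the normal equations on the correlation matrix. It assumes a scale (continuous) outcome and forced entry of all predictors, which is only one reading of "use regression". The data and the names `price`, `service`, and `satisfaction` are invented for the example.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardise a list to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def standardized_betas(predictors, outcome):
    """Standardised coefficients: solve R_xx * beta = r_xy."""
    names = list(predictors)
    z = {name: zscores(predictors[name]) for name in names}
    zy = zscores(outcome)
    n = len(outcome)
    # A correlation is the mean product of two z-scored variables.
    r_xx = [[sum(p * q for p, q in zip(z[i], z[j])) / n for j in names]
            for i in names]
    r_xy = [sum(p * q for p, q in zip(z[i], zy)) / n for i in names]
    return dict(zip(names, solve(r_xx, r_xy)))


# Hypothetical data: two predictors taken from the top levels of a tree.
predictors = {
    "price": [1, 2, 3, 4, 5, 6],
    "service": [2, 1, 4, 3, 6, 5],
}
# Outcome constructed as 2 * price + 0.5 * service, purely for illustration.
satisfaction = [3.0, 4.5, 8.0, 9.5, 13.0, 14.5]

betas = standardized_betas(predictors, satisfaction)
for name in sorted(betas, key=lambda k: abs(betas[k]), reverse=True):
    print(f"{name:8s} standardised beta = {betas[name]:.3f}")
```

Ranking by the absolute size of the standardised betas gives an importance ordering that does not depend on the units of each predictor. When predictors are strongly correlated, the betas become unstable, so treat the ordering as indicative only.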

## Re: Importance of Indep vars in Classification trees

In reply to Mark Webb's post of Friday, September 15, 2006 4:43 AM:

Yes, you can read something into the Model Summary (not Notes) table, because the list of variables used in the model indicates the variables that provided a significant contribution. I think you can interpret the order in which they're listed as a general indication of relative importance, although I now see that this does not always exactly match the "normalized" importance.