Hi everybody:
This is more a theoretical question than an SPSS-related one (skip it safely if you are not interested in CIs for R-square).

There is a program for computing confidence intervals for R2 (called R2.exe):

http://www.interchange.ubc.ca/steiger/r2.htm

BTW, the download link inside that page is wrong (it points to a non-existent FTP site); the correct one is:

http://www.interchange.ubc.ca/steiger/r2.zip

Since my students are kind of allergic to DOS programs, I have been trying to write code to compute a CI for R2 using the bootstrap. To my surprise, the 95% bootstrap CI is quite different from the one the Steiger & Fouladi program gives (narrower, and the lower limit is higher than it should be). I have been trying to adjust the limits of the bootstrap CI to make it match (more or less) the one R2.exe gives. I had to replace percentile 2.5 by percentile 0.7 in order to adjust the lower limit (see an example dataset and code below).

My question is: should I leave the percentiles as they are (2.5 & 97.5) and simply discuss with my students that bootstrapped CIs will be narrower than the ones given by R2.exe, or should I replace percentile 2.5 by 0.7 and still call the bootstrap results a "95% CI"?

--
Regards,
Marta Garcia-Granero, PhD
Statistician

_______________________________________________________________________
CODE:

* Sample dataset (from 'Statistics at Square One', BMJ on-line book) *.
DATA LIST LIST/height deadsp (2 F8.0).
BEGIN DATA
110 44
116 31
124 43
129 45
131 56
138 79
142 57
150 56
153 58
155 92
156 78
159 64
164 88
168 112
174 101
END DATA.
VARIABLE LABEL height 'Height (cm)' /deadsp 'Dead space (ml)'.

REGRESSION
 /STATISTICS COEFF OUTS R ANOVA
 /DEPENDENT deadsp
 /METHOD=ENTER height .

* Using R2.exe: 95%CI Lower Limit: 0.33029; Upper Limit: 0.88794 *.

PRESERVE.
SET SEED=29147290.
SET MXLOOPS=1000.
MATRIX.
PRINT /TITLE='Confidence Interval for R-square in Simple Linear Regression'.
GET data /VAR=height deadsp /MISSING OMIT.
* Sample statistics *.
COMPUTE n=NROW(data).
COMPUTE mean=CSUM(data)/n.
COMPUTE variance=(CSSQ(data)-n&*(mean&**2))/(n-1).
COMPUTE x=data(:,1).
COMPUTE y=data(:,2).
COMPUTE covxy=((T(x)*y)-n*mean(1)*mean(2))/(n-1).
COMPUTE r=covxy/SQRT(variance(1)*variance(2)).
PRINT r**2
 /FORMAT='F8.3'
 /TITLE='Sample R-square'.
COMPUTE b=covxy/variance(1).
COMPUTE a=mean(2)-b*mean(1).
PRINT {a,b}
 /FORMAT='F8.3'
 /CLABEL='a','b'
 /TITLE='Regression line'.
* Bootstrapping R2 *.
COMPUTE lowersum=0.
COMPUTE uppersum=0.
COMPUTE k=1000. /* Number of bootsamples (if you increase it, increase MXLOOPS too) *.
COMPUTE bootR2=MAKE(k,1,0).
COMPUTE bootsamp=MAKE(n,2,0).
LOOP h=1 TO 10.
- LOOP i=1 TO k. /* Extracting bootstrap samples *.
- LOOP j=1 TO n.
- COMPUTE flipcoin=1+TRUNC(n*UNIFORM(1,1)).
- COMPUTE bootsamp(j,:)=data(flipcoin,:).
- END LOOP.
- COMPUTE mean=CSUM(bootsamp)/n.
- COMPUTE variance=(CSSQ(bootsamp)-n&*(mean&**2))/(n-1).
- COMPUTE x=bootsamp(:,1).
- COMPUTE y=bootsamp(:,2).
- COMPUTE covxy=((T(x)*y)-n*mean(1)*mean(2))/(n-1).
- COMPUTE r=covxy/SQRT(variance(1)*variance(2)).
- COMPUTE bootR2(i)=r&**2.
- END LOOP.
* Ordered array: sorting algorithm by R Ristow & J Peck *.
- COMPUTE sortedR2=bootR2.
- COMPUTE sortedR2(GRADE(bootR2))=bootR2.
* NP confidence interval *.
- COMPUTE lower=sortedR2(k*0.007).
- COMPUTE upper=sortedR2(1+k*0.975).
- COMPUTE lowersum=lowersum+lower.
- COMPUTE uppersum=uppersum+upper.
END LOOP.
PRINT {lowersum/10,uppersum/10}
 /FORMAT='F8.4'
 /TITLE='R2 CI: Mean of 10 bootstrap runs (1000 reps. each)'
 /CLABEL='LowerCI','UpperCI'.
END MATRIX.
RESTORE.
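For readers who want to try the same percentile-bootstrap scheme outside SPSS, here is a rough Python sketch of the algorithm in the MATRIX code above (function names and defaults are illustrative, not from the thread):

```python
import random

def r_squared(pairs):
    """Sample R^2 for a simple linear regression of y on x."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    return sxy * sxy / (sxx * syy)

def bootstrap_ci_r2(pairs, k=1000, alpha=0.05, seed=29147290):
    """Percentile bootstrap: resample cases with replacement, compute R^2
    each time, and cut the sorted replicates at alpha/2 and 1 - alpha/2."""
    rng = random.Random(seed)
    n = len(pairs)
    reps = sorted(
        r_squared([rng.choice(pairs) for _ in range(n)]) for _ in range(k)
    )
    return reps[int(k * alpha / 2)], reps[int(k * (1 - alpha / 2))]
```

With the height/dead-space data above this makes it easy to experiment with k and alpha and compare against the MATRIX output.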
Hi Marta,
I'm not sophisticated enough with MATRIX commands to follow all that you've done. However, I don't see where you centered the bootstrap distribution on R^2-Adjusted (what Steiger calls P^2). It appears that the distribution is centered on R^2 itself.

The population parameter estimate, P^2, is 1 - [(1 - R^2)(N - 1)]/(N - K), where K is the number of variables in the equation (i.e., 2 for your example). Could this be the difficulty?

Greg

| -----Original Message-----
| From: SPSSX(r) Discussion [mailto:[hidden email]]
| On Behalf Of Marta García-Granero
| Sent: Monday, June 19, 2006 4:56 AM
| To: [hidden email]
| Subject: Confidence Interval for R-square: R2.exe
| (Steiger&Fouladi) vs Bootstrapping
|
| Hi everybody:
|
| This is more a theoretical question than an SPSS-related one
| (skip it safely if you are not interested in CIs for R-square).
|
| There is a program for computing confidence intervals for R2 (called
| R2.exe):
|
| http://www.interchange.ubc.ca/steiger/r2.htm
|
| BTW, the link to the program inside the page is wrong (points to a
| non-existent FTP site), and the correct one is:
|
| http://www.interchange.ubc.ca/steiger/r2.zip
|
| Since my students are kind of allergic to DOS programs, I have been
| trying to write code to compute a CI for R2 using the bootstrap. To my
| surprise, the 95% bootstrap CI is quite different from the one the
| Steiger & Fouladi program gives (narrower, and the lower limit is
| higher than it should be). I have been trying to adjust the limits of
| the bootstrap CI to make it match (more or less) the one R2.exe gives.
| I had to replace percentile 2.5 by percentile 0.7 in order to adjust
| the lower limit (see an example dataset and code below).
|
| My question is: should I leave the percentiles as they are (2.5 &
| 97.5) and simply discuss with my students that bootstrapped CIs will
| be narrower than the ones given by R2.exe, or should I replace
| percentile 2.5 by 0.7 and still call the bootstrap results a "95% CI"?
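The adjustment Greg describes is a one-liner; here is a hedged Python helper (the function name is mine, the formula is the one quoted above):

```python
def adjusted_r2(r2, n, k):
    """Shrink a sample R^2 toward the population value:
    P^2 = 1 - (1 - R^2) * (N - 1) / (N - K),
    where K is the number of variables in the equation
    (K = 2 for simple linear regression)."""
    return 1 - (1 - r2) * (n - 1) / (n - k)
```

For the thread's dataset (R^2 = .716, N = 15, K = 2) this gives P^2 ≈ .694, the adjusted value Marta prints in her revised code.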
Hi Gregory,
Although it is true that I had forgotten to bootstrap adjusted R2 (Rho-square) instead of sample R2 (thanks!), this doesn't explain the discrepancies between the two methods of estimating a CI for Rho-square. I still have to modify the percentiles (1.2 instead of 2.5 & 98 instead of 97.5) to get results more consistent with those R2.exe gives (see below). Therefore, my question is still the same: should I just point out the differences or adjust the percentiles?...

* Sample dataset (from 'Statistics at Square One', BMJ on-line book) *.
DATA LIST FREE/height deadsp (2 F8.0).
BEGIN DATA
110 44 116 31 124 43 129 45 131 56
138 79 142 57 150 56 153 58 155 92
156 78 159 64 164 88 168 112 174 101
END DATA.
VARIABLE LABEL height 'Height (cm)' /deadsp 'Dead space (ml)'.

REGRESSION
 /STATISTICS COEFF OUTS R ANOVA
 /DEPENDENT deadsp
 /METHOD=ENTER height .

* Using R2.exe: 95%CI Lower Limit: 0.33029; Upper Limit: 0.88794 *.

PRESERVE.
SET SEED=29147290.
SET MXLOOPS=1000.
MATRIX.
PRINT /TITLE='Confidence Interval for Rho-square in Simple Linear Regression'.
GET data /VAR=height deadsp /MISSING OMIT.
* Sample statistics *.
COMPUTE n=NROW(data).
COMPUTE mean=CSUM(data)/n.
COMPUTE variance=(CSSQ(data)-n&*(mean&**2))/(n-1).
COMPUTE x=data(:,1).
COMPUTE y=data(:,2).
COMPUTE covxy=((T(x)*y)-n*mean(1)*mean(2))/(n-1).
COMPUTE r=covxy/SQRT(variance(1)*variance(2)).
COMPUTE b=covxy/variance(1).
COMPUTE a=mean(2)-b*mean(1).
PRINT {a,b}
 /FORMAT='F8.3'
 /CLABEL='a','b'
 /TITLE='Regression line'.
PRINT {r**2,1-(1-r**2)*(n-1)/(n-2)}
 /FORMAT='F8.3'
 /CLABEL='R-Square','P-Square'
 /TITLE='Sample & Population (Adjusted) R-square'.
* Bootstrapping P2 *.
COMPUTE k=1000.
COMPUTE bootR2 =MAKE(k,1,0).
COMPUTE bootsamp=MAKE(n,2,0).
COMPUTE lowersum=0.
COMPUTE uppersum=0.
LOOP h=1 TO 10. /* 10 runs (to average them later) *.
- LOOP i=1 TO k. /* Extracting k bootstrap samples *.
- LOOP j=1 TO n. /* with sample size n *.
- COMPUTE flipcoin=1+TRUNC(n*UNIFORM(1,1)).
- COMPUTE bootsamp(j,:)=data(flipcoin,:).
- END LOOP.
- COMPUTE mean=CSUM(bootsamp)/n.
- COMPUTE variance=(CSSQ(bootsamp)-n&*(mean&**2))/(n-1).
- COMPUTE x=bootsamp(:,1).
- COMPUTE y=bootsamp(:,2).
- COMPUTE covxy=((T(x)*y)-n*mean(1)*mean(2))/(n-1).
- COMPUTE r=covxy/SQRT(variance(1)*variance(2)).
- COMPUTE bootR2(i)=1-(1-r**2)*(n-1)/(n-2).
- END LOOP.
* Ordered array: sorting algorithm by R Ristow & J Peck *.
- COMPUTE sortedR2=bootR2.
- COMPUTE sortedR2(GRADE(bootR2))=bootR2.
* NP confidence interval *.
- COMPUTE lower=sortedR2(k*0.012).
- COMPUTE upper=sortedR2(1+k*0.98).
- COMPUTE lowersum=lowersum+lower.
- COMPUTE uppersum=uppersum+upper.
END LOOP.
PRINT {lowersum/10,uppersum/10}
 /FORMAT='F8.4'
 /TITLE='Rho-Square CI: Mean of 10 bootstrap runs (1000 reps. each)'
 /CLABEL='LowerCI','UpperCI'.
END MATRIX.
RESTORE.

MGJ> I'm not sophisticated enough with MATRIX commands to follow
MGJ> all that you've done. However, I don't see where you centered the
MGJ> bootstrap distribution on R^2-Adjusted (what Steiger calls P^2).
MGJ> It appears that the distribution is centered on R^2 itself.

MGJ> The population parameter estimate, P^2, is 1 - [(1 - R^2)(N
MGJ> - 1)]/(N - K), where K is the number of variables in the equation
MGJ> (i.e., 2 for your example).

Regards

Marta
Hi again Marta,
Interesting problem. I tinkered with your syntax and replaced the SET SEED=29147290 command with SET SEED=Random. Then I ran 10 iterations of your bootstrap. Here's what I found.

As you reported, using the R2.exe program, the 95% CI limits should be:

 Lower:  Upper:
 .3303   .8879

What we get from your syntax is:

 Lower:  Upper:
 .3306   .8861
 .3377   .8906
 .3358   .8888
 .3412   .8875
 .3356   .8911
 .3169   .8898
 .3288   .8886
 .3316   .8867
 .3396   .8809
 .3363   .8861
 .3391   .8885

 Ave:
 .3339   .8877
 SD:
 .0069   .0028

It seems that what appeared to be fixed deviations from the R2.exe program in your bootstrap was really the consequence of having your random number seed fixed at a specific value. Using a fluctuating random seed, it becomes clear that despite the large bootstrap samples there's still a good amount of variability in the CI estimates from one run to the next. The R2.exe findings are within less than one SD of the mean of the 10 estimates from your bootstrap, so it looks like both programs are targeting the same parameters.

The Steiger & Fouladi program allows two options for computing the CI: one quicker and less exact, the other slower but more precise. If the R2.exe results are based on the latter, perhaps it would be useful to increase your bootstrap parameters to something like h = 100 and k = 10000. On the other hand, it may be that your program is more precise and theirs more variable. I don't recall the parameters they used to generate their estimates. Hope this helps.

Greg

| -----Original Message-----
| From: SPSSX(r) Discussion [mailto:[hidden email]]
| On Behalf Of Marta García-Granero
| Sent: Monday, June 19, 2006 12:28 PM
| To: [hidden email]
| Subject: Re: Confidence Interval for R-square: R2.exe
| (Steiger&Fouladi) vs Bootstrapping
|
| Hi Gregory,
|
| Although it is true I had forgotten to bootstrap adjusted R2
| (Rho-square) instead of sample R2 (thanks!), this doesn't explain the
| discrepancies between both methods of estimating CI for Rho-square.
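Greg's diagnosis (that part of the apparent bias is Monte Carlo noise hidden by a fixed seed) is easy to reproduce: rerun the same percentile bootstrap with a different seed each time and look at the spread of the endpoints. A rough Python sketch of that experiment follows; the function names and run counts are illustrative, not from the thread:

```python
import random
import statistics

def boot_endpoints(pairs, k, rng):
    """One bootstrap run: the ~2.5th and ~97.5th percentiles of k
    resampled R^2 values (mirroring the MATRIX code's cut points)."""
    n = len(pairs)

    def r2(sample):
        mx = sum(x for x, _ in sample) / n
        my = sum(y for _, y in sample) / n
        sxx = sum((x - mx) ** 2 for x, _ in sample)
        syy = sum((y - my) ** 2 for _, y in sample)
        sxy = sum((x - mx) * (y - my) for x, y in sample)
        return sxy * sxy / (sxx * syy)

    reps = sorted(r2([rng.choice(pairs) for _ in range(n)]) for _ in range(k))
    return reps[k // 40], reps[k - k // 40]  # k//40 equals k*0.025 when 40 divides k

def endpoint_spread(pairs, runs=10, k=1000):
    """Repeat the bootstrap with a fresh seed per run and summarize the
    run-to-run spread of the lower endpoint."""
    lows = [boot_endpoints(pairs, k, random.Random(seed))[0] for seed in range(runs)]
    return statistics.mean(lows), statistics.stdev(lows)
```

Any nonzero spread across seeds is exactly the run-to-run variability Greg tabulates above.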
In reply to this post by Marta García-Granero
Marta, just a quick clarification. The first pair of values listed from your program was obtained with the seed you specified; the following 10 pairs were based on a random seed. So the mean and SD were based on 11 runs, not 10. Hope that wasn't confusing.
Greg

| -----Original Message-----
| From: Meyer, Gregory J
| Sent: Monday, June 19, 2006 6:40 PM
| To: 'Marta García-Granero'; '[hidden email]'
| Subject: RE: Re: Confidence Interval for R-square: R2.exe
| (Steiger&Fouladi) vs Bootstrapping
In reply to this post by Meyer, Gregory J
Hi Greg
MGJ> Interesting problem. I tinkered with your syntax and
MGJ> replaced the SET SEED=29147290 command with SET SEED=Random.

The reason for having a fixed seed is that my students need to get all the same answer (they will have to select the "correct" answer as part of a multiple-choice exam; I know this is not the best way, but they have really simple minds, you know). I plan to explain the difference between a random and a fixed seed, and make them modify the code and see what happens, but I need "standardized" results in order to examine them.

MGJ> Then I ran 10 iterations of your bootstrap. Here's what I found.

I did that too (random seed and several runs).

MGJ> As you reported, using the R2.exe program, the 95% CI limits should be:
MGJ>  Lower:  Upper:
MGJ>  .3303   .8879
MGJ>
MGJ> What we get from your syntax is:
MGJ>  Lower:  Upper:
MGJ>  .3306   .8861
MGJ>  .3377   .8906
MGJ>  .3358   .8888
MGJ>  .3412   .8875
MGJ>  .3356   .8911
MGJ>  .3169   .8898
MGJ>  .3288   .8886
MGJ>  .3316   .8867
MGJ>  .3396   .8809
MGJ>  .3363   .8861
MGJ>  .3391   .8885
MGJ>  Ave:
MGJ>  .3339   .8877
MGJ>  SD:
MGJ>  .0069   .0028

MGJ> It seems that what appeared to be fixed deviations from the
MGJ> R2.exe program in your bootstrap was really the consequence of
MGJ> having your random number seed fixed at a specific value.

If you have taken a look at the end of the matrix program, you'll have seen that the lower & upper limits are in fact computed using percentiles 1.2 (instead of 2.5) and 98 (instead of 97.5):

- COMPUTE lower=sortedR2(k*0.012).
- COMPUTE upper=sortedR2(1+k*0.98).

If I set the lower & upper limits to the theoretically correct values, I get very different results:

- COMPUTE lower=sortedR2(k*0.025).
- COMPUTE upper=sortedR2(1+k*0.975).

Adding an extra nested loop to get 10 different CIs (each based on the average of 10 runs of 1000 bootstrapped samples), I get the following:

Sample & Population (Adjusted) R-square
 R-Square P-Square
     .716     .694

Rho-Square CI: 10 Means of 10 bootstrap runs (1000 reps. each)
 LowerCI  UpperCI
  .3972    .8806
  .4072    .8808
  .4120    .8803
  .4057    .8806
  .4013    .8836
  .4065    .8825
  .4107    .8782
  .4109    .8796
  .3998    .8796
  .4150    .8816
 Mean
  .4066    .8807
 SD
  .0058    .0015
 Min
  .3972    .8782
 Max
  .4150    .8836

As you can see, the upper limit is systematically slightly lower than the one reported by Steiger & Fouladi, and the lower limit is always much higher than the one reported by S&F. That's what I was talking about when I said that my bootstrap CIs were biased.

MGJ> The Steiger & Fouladi program allows 2 options for
MGJ> computing the CI; one quicker and less exact and the other slower
MGJ> but more precise. If the R2.exe results are based on the latter,
MGJ> perhaps it would be useful to increase your bootstrap parameters
MGJ> to something like h = 100 and k = 10000.

Ouch! That will take really long; I can't do that with my students (the University has a lot -really a lot- of computers, but the configuration is quite basic: only 256 Mb of RAM, and some are a bit old, with low-speed processors). Nonetheless, I'll try that at home, with my own computer (perhaps while I'm cooking, or something like that).

I had been using the speed method in R2.exe. Using the algorithm that maximizes accuracy, the results are these:

Lower Limit: 0.33036
Upper Limit: 0.88852

Quite close to the ones obtained with the speed method:

Lower Limit: 0.33029
Upper Limit: 0.88794

MGJ> On the other hand, it may be that your program is more precise
MGJ> and theirs more variable. I don't recall the parameters they used
MGJ> to generate their estimates.

Michael Healy suggested that I test different datasets, since it could be due to this particular dataset (small sample size and perhaps some characteristics of its distribution). I'm going to do that now; also, I think I'll leave the limits in their theoretically correct places (percentiles 2.5 & 97.5).

Thanks a lot for your interest in this little problem.

Marta
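One way to see why hand-tuning the cut points is problematic: in a percentile bootstrap, the nominal confidence level is simply the distance between the two cut points, so moving them changes the label, not just the numbers. A small illustrative Python helper (not from the thread; the rounding convention differs from the MATRIX code by at most one order statistic):

```python
def percentile_ci(sorted_stats, lo_p, hi_p):
    """Endpoints of a percentile interval from an ascending list of
    bootstrap replicates; nominal coverage is hi_p - lo_p."""
    k = len(sorted_stats)
    lo = sorted_stats[max(int(round(k * lo_p)) - 1, 0)]
    hi = sorted_stats[min(int(round(k * hi_p)) - 1, k - 1)]
    return lo, hi

# Stand-in for 1000 sorted bootstrap statistics:
reps = list(range(1, 1001))
print(percentile_ci(reps, 0.025, 0.975))  # (25, 975): a 95% interval
print(percentile_ci(reps, 0.012, 0.980))  # (12, 980): nominally 96.8%, not 95%
```

Cutting at 1.2% and 98% spans 96.8% of the bootstrap distribution, so calling the result a "95% CI" would mislabel it; that is the heart of Marta's original question.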
In reply to this post by Meyer, Gregory J
Hi everybody:
This message closes (hopefully) this thread. Conclusions:

1) The first dataset was *partially* responsible for the discrepancies between bootstrapping & R2.exe. Thanks to Michael Healy for pointing out that possibility. Anyway, with a different dataset (bigger, residuals normally distributed, and the linear relationship OK), I still find a small discrepancy at the lower limit (not important, but persistent across different runs). I can live with that.

2) Eliminating spurious precision helps a bit (thanks, Art, for that private comment).

3) Changing the bootstrapping conditions a bit (increasing k, the number of repetitions) improved the results (thanks Greg). I have also eliminated the 10 runs and the averaging of their 10 lower & upper limits (since I get more stable results after increasing k to 10000). I now use a random seed, since under these conditions the differences are always below the second decimal place. The problem is that the code now takes around 15 seconds to run (and my computer has a faster processor and more RAM than the average computer at the University). I daren't think what will happen when I write code for bootstrapping Rho-square from multiple regression models...

This is the final code, with a new dataset:

* Sample dataset (random sample of 50 cases from Rosner's dataset FEV&Smoke):
  Residuals follow a normal distribution; there is also no evidence of
  lack of linear relationship *.
DATA LIST FREE/fev(F8.3) hgt(F8.1).
BEGIN DATA
1.987 58.5 1.735 54.0 2.604 61.5 2.980 60.0 3.000 65.5
1.776 51.0 2.531 58.0 3.842 69.0 1.751 58.0 1.698 54.5
1.697 59.0 2.288 61.5 3.114 64.5 2.135 58.5 1.759 53.0
2.048 64.5 1.658 53.0 1.789 52.0 3.004 64.0 2.503 63.0
2.316 59.5 1.704 51.0 1.606 57.5 1.165 47.0 2.164 60.0
2.639 63.0 1.728 56.5 2.303 57.0 2.382 62.0 1.535 55.0
1.514 52.0 2.524 64.0 3.490 67.0 2.292 63.0 2.889 64.0
2.957 64.5 2.250 58.0 2.633 62.0 2.417 62.5 4.273 72.5
2.751 63.0 3.774 67.0 3.169 64.0 2.704 61.0 3.255 66.0
2.901 59.5 3.680 67.0 3.022 61.5 3.780 70.0 2.822 69.5
END DATA.
VARIABLE LABEL hgt 'Height (inches)' /fev 'FEV (l)'.

REGRESSION
 /STATISTICS COEFF OUTS R ANOVA
 /DEPENDENT fev
 /METHOD=ENTER hgt .

* Using R2.exe: 95%CI Lower Limit: 0.63; Upper Limit: 0.86 *.
* Using Bootstrap: 95%CI Lower Limit: 0.65; Upper Limit: 0.86 *.
* Bootstrapping conditions: k=10000; random seed *.

PRESERVE.
SET MXLOOPS=10000.
MATRIX.
PRINT /TITLE='Confidence Interval for Rho-square in Simple Linear Regression'.
GET data /VAR=hgt fev /MISSING OMIT.
* Sample statistics *.
COMPUTE n=NROW(data).
COMPUTE mean=CSUM(data)/n.
COMPUTE variance=(CSSQ(data)-n&*(mean&**2))/(n-1).
COMPUTE x=data(:,1).
COMPUTE y=data(:,2).
COMPUTE covxy=((T(x)*y)-n*mean(1)*mean(2))/(n-1).
COMPUTE r=covxy/SQRT(variance(1)*variance(2)).
COMPUTE b=covxy/variance(1).
COMPUTE a=mean(2)-b*mean(1).
PRINT {a,b}
 /FORMAT='F8.3'
 /CLABEL='a','b'
 /TITLE='Regression line'.
PRINT {r**2,1-(1-r**2)*(n-1)/(n-2)}
 /FORMAT='F8.2'
 /CLABEL='R-Square','P-Square'
 /TITLE='Sample R-square & Adjusted (Population) Rho-square'.
* Bootstrapping conditions *.
COMPUTE k=10000. /* Nr of bootsamples (if you increase it, increase MXLOOPS too) *.
PRINT k
 /FORMAT='F8.0'
 /TITLE='Bootstrapping conditions: k (Nr. reps.)'.
* Bootstrapping P2 (Rho-square) *.
COMPUTE bootP2=MAKE(k,1,0).
COMPUTE bootsamp=MAKE(n,2,0).
LOOP i=1 TO k. /* Extracting k bootstrap samples *.
- LOOP j=1 TO n. /* of sample size n (sampling with replacement) *.
- COMPUTE flipcoin=1+TRUNC(n*UNIFORM(1,1)). /* Random number between 1 & n *.
- COMPUTE bootsamp(j,:)=data(flipcoin,:). /* A case is selected at random *.
- END LOOP. /* Until n cases are selected *.
* Statistics for every bootsample *.
- COMPUTE mean=CSUM(bootsamp)/n.
- COMPUTE variance=(CSSQ(bootsamp)-n&*(mean&**2))/(n-1).
- COMPUTE x=bootsamp(:,1).
- COMPUTE y=bootsamp(:,2).
- COMPUTE covxy=((T(x)*y)-n*mean(1)*mean(2))/(n-1).
- COMPUTE r=covxy/SQRT(variance(1)*variance(2)).
- COMPUTE bootP2(i)=1-(1-r**2)*(n-1)/(n-2).
END LOOP.
* Ordered array: sorting algorithm by R Ristow & J Peck *.
COMPUTE sortedP2=bootP2.
COMPUTE sortedP2(GRADE(bootP2))=bootP2.
* NP confidence interval (percentiles 2.5 & 97.5) *.
COMPUTE lower=sortedP2(k*0.025).
COMPUTE upper=sortedP2(1+k*0.975).
PRINT {lower,upper}
 /FORMAT='F8.2'
 /TITLE='Rho-Square Bootstrapped CI:'
 /CLABEL='LowerCI','UpperCI'.
END MATRIX.
RESTORE.
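For completeness, the final procedure (a single run, k replicates of adjusted R^2, cuts at 2.5%/97.5%) can be mirrored in Python roughly as follows. This is an illustrative sketch, not the thread's code; the function name, default k, and seed handling are the editor's assumptions:

```python
import random

def boot_adj_r2_ci(pairs, k=10000, seed=None):
    """Percentile bootstrap CI for adjusted R^2 (Rho-square) in simple
    linear regression: resample n cases with replacement, k times."""
    rng = random.Random(seed)
    n = len(pairs)

    def adj_r2(sample):
        mx = sum(x for x, _ in sample) / n
        my = sum(y for _, y in sample) / n
        sxx = sum((x - mx) ** 2 for x, _ in sample)
        syy = sum((y - my) ** 2 for _, y in sample)
        sxy = sum((x - mx) * (y - my) for x, y in sample)
        r2 = sxy * sxy / (sxx * syy)
        return 1 - (1 - r2) * (n - 1) / (n - 2)

    reps = sorted(adj_r2([rng.choice(pairs) for _ in range(n)]) for _ in range(k))
    return reps[int(k * 0.025)], reps[int(k * 0.975)]
```

Keeping the cuts at 0.025/0.975, as Marta concludes, preserves the nominal 95% label of the interval.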