Concurrent validity for nominal/categorical items


Concurrent validity for nominal/categorical items

Benjamin Spivak (Med)

Hi everyone,

I am trying to determine my options for statistics to assess the concurrent validity of a set of items completed by different raters against a "gold standard" set of items completed by a professional. From what I know, in cases with scale/continuous variables, this is typically done by performing a Pearson's correlation and an associated significance test. However, I have a set of categorical items with no order. I have 66 participants that I would like to compare against one rater. Each item contains three possible answers. Is anyone aware of a statistic that would be suitable for comparing a group of raters against a single rater in these circumstances?

Any help would be greatly appreciated.

Kind regards.

 

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Re: Concurrent validity for nominal/categorical items

Rich Ulrich

Your overall information for each participant is "rate of agreement" with the gold standard -- like grading a multiple-choice test.

(For information on "difficulty", or toward seeing if the pro got it right, also check how many errors there were for each choice.)

If your categories are separately interesting, you could score them separately for the agreements with answers 1, 2, and 3; a step further might look separately at "sensitivity" and "specificity" for answers 1, 2, and 3. Are the answers such that it is interesting whether an answer is overused?

For an overall "agreement" statistic that confounds the different sorts of differences, you could use kappa for the pro versus each participant.
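That per-rater approach is easy to sketch; for instance, in pure Python (the item codes and ratings below are invented, purely for illustration):

```python
# Sketch: percent agreement and Cohen's kappa for one rater
# against the gold-standard rater. Data are made up.
from collections import Counter

def percent_agreement(gold, rater):
    """Fraction of items on which the rater matches the gold standard."""
    return sum(g == r for g, r in zip(gold, rater)) / len(gold)

def cohens_kappa(gold, rater):
    """Cohen's kappa between one rater and the gold standard."""
    n = len(gold)
    p_obs = percent_agreement(gold, rater)
    g_freq, r_freq = Counter(gold), Counter(rater)
    # Expected agreement under independent marginals
    p_exp = sum(g_freq[c] * r_freq[c] for c in set(gold) | set(rater)) / n**2
    return (p_obs - p_exp) / (1 - p_exp)

gold  = ["yes", "no", "yes", "not_scored", "no", "yes"]
rater = ["yes", "no", "no", "not_scored", "no", "yes"]
print(percent_agreement(gold, rater))  # observed agreement
print(cohens_kappa(gold, rater))       # chance-corrected agreement
```

Running this over all raters gives one agreement rate and one kappa per rater against the professional.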


--

Rich Ulrich





Re: Concurrent validity for nominal/categorical items

Art Kendall
In reply to this post by Benjamin Spivak (Med)
Please explain what you mean by "each item has three possible answers". Is it the same three for all items? If so, what are the possible answers?

If the three are different for each item, please provide some examples of items and responses.
Art Kendall
Social Research Consultants

Re: Concurrent validity for nominal/categorical items

Benjamin Spivak (Med)

It is the same for all items: yes, no, not scored.

 

Thanks.


 


Re: Concurrent validity for nominal/categorical items

Benjamin Spivak (Med)
In reply to this post by Rich Ulrich

 

Hi Rich,

 

I went with calculating Fleiss' kappa after transforming scores to agree vs. disagree with the gold standard. I don't want to calculate individual agreement between pairs because I have a relatively large number of raters (>70). However, my overall kappa is quite low, likely because I have a disproportionate number of agreements compared with disagreements. Is there a way to calculate the maximum kappa for the Fleiss variant for n > 2 raters? This might help interpretation. The problem is that I can't find any references where this sort of thing is calculated.

 

Thanks.
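For reference, Fleiss' kappa on that kind of agree/disagree recoding can be sketched in a few lines of Python. This is just the standard Fleiss formula, not a maximum-kappa calculation, and the ratings matrix below is invented:

```python
# Sketch of Fleiss' kappa for N items rated by n raters, after
# recoding each rating to 1 = agrees with the gold standard,
# 0 = disagrees. Illustrative data only.
def fleiss_kappa(ratings, categories=(0, 1)):
    """ratings: list of per-item lists, one rating per rater."""
    N, n = len(ratings), len(ratings[0])
    # n_ij: count of raters assigning category j on item i
    counts = [[item.count(c) for c in categories] for item in ratings]
    # Mean observed per-item agreement
    p_bar = sum(sum(c * c for c in row) - n for row in counts) / (N * n * (n - 1))
    # Chance agreement from overall category proportions
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# 4 items, 3 raters; 1 = the rater agreed with the gold standard
ratings = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [1, 0, 1]]
print(fleiss_kappa(ratings))
```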


Re: Concurrent validity for nominal/categorical items

Rich Ulrich

First, you need to figure out what it is that you want to know.  I assumed that you would want information about individual raters, no matter how many.

Are you interested in evaluating and reporting on items?  raters?  the TOTAL score for items?


Here is a good starting article which is relatively brief:  http://www.john-uebersax.com/stat/agree.htm

He does mention "simple", like, frequency of agreements -- which is what I emphasized.


Second, you probably don't want kappa for multiple raters because it has no provision for a gold standard.

Third, I would consider "not marked" as (usually) falling between Yes and No. Ordinal.
 - If the gold-standard pro did use Not Marked, you have correlations.
 - If the gold-standard pro did not use it, then a kappa between pro and rater should probably be 2x2, using Yes/No; and every Not Marked would be recoded to whichever choice is wrong on the item.

But handle the "First", first: What is it that you want to know?

--
Rich Ulrich



Re: Concurrent validity for nominal/categorical items

Mike
I may sound like I'm coming out of left field here, but the type of problem/situation being discussed (judgments against a gold standard or "true state") sounds a lot like a signal detection theory (SDT) problem. Rich's link to John Uebersax's site doesn't really connect to SDT analysis, but if you go to the related page for Raw Agreement Indices, the basic ideas are laid out even though Uebersax does not refer to SDT analyses.
 
SDT has a theoretical/mathematical basis and a bunch of assumptions (though the analysis can be modified to take specific assumptions into account), but one of the key results is represented in the Receiver Operating Characteristic (ROC) curve, which has true positives (or sensitivity) on the y-axis and false positives (or 1 - specificity) on the x-axis. See the Wikipedia entry on ROC curves for more detail.
 
A number of statistics can be calculated in this situation, but perhaps the best known is the Area Under the Curve (AUC) or, in SDT terms, A' (A-prime). The ROC plot is a unit square because the x and y axes are probabilities ranging from 0 to 1. The diagonal from (0,0) to (1,1) cuts the area inside the square in half, leaving .50 of the area below it. This is also called the "chance diagonal" because along it the probability of a false positive equals the probability of a true positive, meaning the responses being made are random. A single rater provides one pair of values, a false positive rate and a true positive rate, which either falls on the chance diagonal (meaning random performance or no discrimination), above the diagonal (to the upper left), which means the person performs better than chance, or below the diagonal (toward the lower right corner), which indicates systematically BAD performance (worse than chance). With a pair of false positive and true positive values, one can calculate the AUC for a person, and "good performance" will be some value between 0.50 and 1.00 (the closer to 1, the better). This is used in radiology (reading x-rays or scans) and many other medical areas.
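A minimal sketch of that AUC-from-one-point computation (assumption: a single rater yields one operating point, so the "curve" is just the segments (0,0)-(FPR,TPR)-(1,1); the counts below are invented):

```python
# With one (FPR, TPR) operating point, the trapezoidal AUC reduces
# to the average of sensitivity and specificity.
def rates(tp, fp, fn, tn):
    tpr = tp / (tp + fn)   # sensitivity (true positive rate)
    fpr = fp / (fp + tn)   # 1 - specificity (false positive rate)
    return fpr, tpr

def single_point_auc(fpr, tpr):
    # Area of the trapezoids under (0,0)->(fpr,tpr)->(1,1)
    return fpr * tpr / 2 + (1 - fpr) * (tpr + 1) / 2

fpr, tpr = rates(tp=40, fp=10, fn=10, tn=40)
print(single_point_auc(fpr, tpr))  # equals (tpr + (1 - fpr)) / 2
```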
 
As the Wiki entry points out, other statistics can be calculated in this situation, and Cohen's kappa and Fleiss' kappa are among them (see refs 3 and 22 of the entry). There is a large literature on this, starting with SDT's origination in psychophysics in the early 1950s through to its application to diagnostic issues starting in the 1970s-80s.
 
Just something to think about.
 
-Mike Palij
New York University
 
 

Re: Concurrent validity for nominal/categorical items

Andy W
For those suggesting ROC curves, can you show an example of how to go from the OP's data to an ROC curve? I don't understand how you go from the three categorical inputs to a continuous score necessary to compute the sensitivity and specificity.
Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

Re: Concurrent validity for nominal/categorical items

Art Kendall


It would help the OP and those of us trying to respond if there were much more context, e.g., describing the task as presented to the 'expert(s?)' and to the respondents.
Were there subgroups of items? Of respondents?

The OP mentioned 'not scored'. This could have many meanings, e.g., it could mean 'does not apply' or 'Respondent skipped but went on' or 'Respondent stopped responding', etc.
Art Kendall
Social Research Consultants

Re: Concurrent validity for nominal/categorical items

bdates
I agree with Art to the extent that we still need more data. First, if I understand correctly, there is a gold standard. If so, it needs to be treated as a separate rater, and then a series of kappas generated between each rater and the standard. If all the raters are included, then the resultant kappa is their collective agreement and is inseparable from each rater's level of agreement with the gold standard. Additionally, Fleiss' kappa and Cohen's kappa are the same for two raters, so it's really Cohen and not Fleiss that's being carried out.

At this point, there would need to be an average of all the two-rater kappas produced. This is the same as Light's kappa, an extension of Cohen's kappa to more than two raters, so it's allowable, at least from a literature and research background.
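A minimal sketch of that averaging in Python (illustrative data only: each rater's Cohen's kappa against the gold standard, then the mean, à la Light's kappa):

```python
# Average of per-rater Cohen's kappas versus the gold standard.
# Ratings below are invented for illustration.
from collections import Counter

def cohens_kappa(a, b):
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    fa, fb = Counter(a), Counter(b)
    p_exp = sum(fa[c] * fb[c] for c in set(a) | set(b)) / n**2
    return (p_obs - p_exp) / (1 - p_exp)

gold = ["yes", "no", "yes", "no", "yes", "no"]
raters = [
    ["yes", "no", "yes", "no", "no", "no"],
    ["yes", "yes", "yes", "no", "yes", "no"],
]
kappas = [cohens_kappa(gold, r) for r in raters]
print(sum(kappas) / len(kappas))  # mean kappa, à la Light's kappa
```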

Finally, there's the matter of including 'not scored' as a category. If this sheds light on the 'difficulty' of an item to rate, then maybe it should be included as a category. If it can be interpreted in a number of ways, as Art suggests, then it probably ought to be treated as missing data.

BTW, Gwet has developed a version of his AC1 statistic to include a gold standard. The difficulty is that his solutions are in SAS, not SPSS, so unless the OP wants to translate SAS to SPSS syntax, it's probably not doable.


Brian Dates, M.A.
Director of Evaluation and Research | Evaluation & Research | Southwest Counseling Solutions
Southwest Solutions
1906 25th Street, Detroit, MI 48216
313-297-1391 office | 313-849-2702 fax
[hidden email] | www.swsol.org



Re: Concurrent validity for nominal/categorical items

Mike
In reply to this post by Andy W
On Monday, November 28, 2016 11:40 AM, Andy W wrote:
> For those suggesting ROC curves, can you show an example of how to go
> from
> the OP's data to an ROC curve? I don't understand how you go from the
> three
> categorical inputs to a continuous score necessary to compute the
> sensitivity and specificity.

The real issue is whether the three categories are ordered --
in which case this can be treated like a rating scale (0 = not scored,
1 = No, 2 = Yes) and the procedures have long been worked out
for this case -- or whether they are unordered categories. It seems to
me something of a stretch to claim that the responses are
ordinal. Things would be a lot simpler if the "not scored" response
could be ignored (e.g., treated as missing data), but we'll have
to wait on the OP to provide more information.

So, we're left with the situation where one has three categories
or, to use current parlance, classes, and we have a multiclass
classifier problem. Multiclass ROC analysis has been under development
since the late 1990s. One source on this is the following article:

Tom Fawcett, An introduction to ROC analysis, Pattern
Recognition Letters, Volume 27, Issue 8, June 2006,
Pages 861-874.
http://dx.doi.org/10.1016/j.patrec.2005.10.010  .

See section 9, "Decision problems with more than two classes" (p. 872).

With more than two classes, the math becomes hairy, and
one suggestion is to break down the analysis into pairwise
comparisons; see:

Thomas C.W. Landgrebe, Robert P.W. Duin, Approximating
the multiclass ROC by pairwise analysis, Pattern Recognition
Letters, Volume 28, Issue 13, 1 October 2007, Pages 1747-1758,
http://dx.doi.org/10.1016/j.patrec.2007.05.001 .

There appears to be a good-sized literature on this case
as a classification problem (in contrast to a discrimination
problem), and one recent publication provides some sense
of where this area is now and where it is going; see:

Simon Bernard, Clément Chatelain, Sébastien Adam,
Robert Sabourin, The Multiclass ROC Front method for
cost-sensitive classification, Pattern Recognition,
Volume 52, April 2016, Pages 46-60,
http://dx.doi.org/10.1016/j.patcog.2015.10.010

That being said, the bad news is that the analyses provided
in the above articles can't be easily done in SPSS (if at all).
Stata appears to have several procedures for doing ROC analysis
(including one with a "gold standard": rocgold), but these all appear
to be for binary responses. I'm not a Stata person, so I don't
know whether these have been extended to the multiclass case, but
the Stata fora should be able to provide answers.

-Mike Palij
New York University
[hidden email]

Re: Concurrent validity for nominal/categorical items

Andy W
Mike, it isn't obvious to me how you apply any of those papers to this situation.

Multi-class ROC curves are for "predicting" multi-classes, not for using categorical "independent" variables. That is a red herring as far as I can tell.

If you have a gold standard, you want to see if the extra raters match the gold standard. So the outcome is "Predicted Right" or "Predicted Wrong" - still a plain old binary outcome is it not? Even if that is not the case, just pretend like it is for a moment - how do you get an ROC curve for only three input guesses? For which you agree there is no natural ordering.

Rich originally suggested coding along an ordinal scale and calculating sensitivity/specificity. That would give you an ROC curve with only one point. (So it would not be a curve at all.) What is the point of the area-under-the-curve statistic in that situation? It is a bit facile, since the inputs don't allow interpolation along the line to alter predictions. You can only have three potential predictions if you only have three potential inputs.

Are you suggesting an ROC curve for every rater? Or for every item? Or do you just get one ROC curve? I'm still really confused how you turn the OP's data into an ROC curve. Just sketch out a simple example, even if you don't know how to do it in software, to help me understand.
Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

Re: Concurrent validity for nominal/categorical items

Bruce Weaver
Administrator
Even if it were a situation where everyone agreed on how to generate the ROC curve, some critics are beginning to cast doubt on the usefulness of AUC as a measure of screening test quality.  E.g., the authors of the following article suggest that it would generally be more useful to report sensitivity for a desired level of specificity (e.g., when good specificity is needed), or specificity for a given level of sensitivity (when good sensitivity is needed).  

  https://www.ncbi.nlm.nih.gov/pubmed/24407586

HTH.


--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

NOTE: My Hotmail account is not monitored regularly.
To send me an e-mail, please use the address shown above.

Re: Concurrent validity for nominal/categorical items

Andy W
Y'all are killing me slowly with this advice I can't make sense of. I can understand being critical of just one number, like AUC, but Bruce how is just reporting single values of sensitivity and specificity any better than plotting the actual ROC curve - which shows the entire range of sensitivity per specificity? That is the whole motivation for the ROC plot to begin with!
Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

Re: Concurrent validity for nominal/categorical items

Bruce Weaver
Administrator
Hi Andy.  I didn't (intend to) say that one should not plot ROC curves.  All I was suggesting was that AUC is probably not nearly as useful a measure as one might think, given how frequently it is reported.  

But now that you've got me started, I would suggest that if one must report AUC, they should consider reporting the Gini coefficient too (or instead).  Gini coefficient =  2*AUC-1.  Conceptually, it has been described as a chance-corrected AUC.  Given how frequently Cohen's kappa (which is described as a chance-corrected measure of agreement) is touted as being far superior to raw percent agreement, it surprises me that the Gini coefficient has not caught on more.
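The conversion Bruce describes is a one-liner; e.g., in Python:

```python
# Gini coefficient as a chance-corrected AUC: Gini = 2*AUC - 1,
# so chance-level AUC (0.5) maps to 0 and perfect AUC (1.0) maps to 1.
def gini(auc):
    return 2 * auc - 1

print(gini(0.5))   # 0.0 (chance)
print(gini(0.75))  # 0.5
print(gini(1.0))   # 1.0 (perfect)
```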

Regarding the other point about reporting sensitivity for a given specificity (or vice-versa), I was just suggesting that in many uses of diagnostic tests, it is important to achieve very high sensitivity (e.g., when ruling out disease) or very high specificity (when ruling in disease).  In such cases, one would surely choose a cut-point that guarantees the needed level of sensitivity or specificity, and then report the other test property at that cut-point.  If 99% specificity is required for a given test in a given situation, I don't really care what the sensitivity would be at another cut-point that yields 75% specificity.  

I fear this is veering way too far from the OP's question, so will stop there.  ;-)

Cheers,
Bruce


--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

NOTE: My Hotmail account is not monitored regularly.
To send me an e-mail, please use the address shown above.