At 01:33 AM 1/26/2013, Blankdots wrote:

>[I have] 200 cases and roughly 50000 variables.

>

>Each variable is a gene with 'expression data' which is a decimal number.

>So each patient has a value. e.g. patient 1 - 1.11, patient 2. 1.21.

>And so forth.

As you state below, this isn't a complete list of variables. You have

those 50,000 gene expression values for each patient; but you also

have a variable for survival or not, and one for survival time. I

hope you also have an identifying variable for each patient, so you

can trace your SPSS data back to the source.

>1) Firstly patients are divided into two groups, those who have died

>- 1 and those who are still alive - 0. This is needed for KM. Also needed, is

>overall survival time as this is another variable needed in KM analysis.

>2) We want to divide the 200 patients into two groups, with the cutoff

>between groups based on the mean value of a gene expression. These groups

>are 1 for low expression and 2 for high expression.

>3) We then run KM with logrank. This will spit out a p value which

>tells us how significant the difference is between the high expression and low

>expression groups. The better the p value, the more interested in it we are.

>4) Of course survival plots are important, but for initial analysis we are

>simply interested in pvalue.

>

>So my plan is to run KM log rank and record p value for all 50000

>variables then sort ascending. I don't see how varstocases would

>help in this scenario, but if it does could someone please explain it?

OK, here's where understanding how SPSS 'thinks' is important.

As I wrote before, SPSS loops very easily through cases, and through

groups of cases; but loops through sets of variables only awkwardly,

often requiring something like Python.

Now, your data is like this:

Patient_ID LiveOrDead Time  Gene01  Gene02  Gene03
Alpha               0   13 12.3456 78.9012 34.5678
Beta                1   10 98.7654 32.1098 76.5432
Gamma               1   15 24.6890 13.5791 36.9036

and you have to run your analysis separately for each of Gene01,
Gene02, Gene03 -- except that you have 50,000 of them instead of three.

Here's what you get from VARSTOCASES and SORT -- this is an actual

run, in SPSS 14:

VARSTOCASES
   /MAKE Express FROM Gene01 Gene02 Gene03
   /INDEX = Gene(Express)
   /KEEP  = Patient_ID LiveOrDead Time
   /NULL  = DROP.

SORT CASES BY Gene Patient_ID.
LIST.

List
|-----------------------------|---------------------------|
|Output Created               |28-JAN-2013 16:22:30       |
|-----------------------------|---------------------------|
Patient_ID LiveOrDead Time Gene    Express

Alpha               0   13 Gene01  12.3456
Beta                1   10 Gene01  98.7654
Gamma               1   15 Gene01  24.6890
Alpha               0   13 Gene02  78.9012
Beta                1   10 Gene02  32.1098
Gamma               1   15 Gene02  13.5791
Alpha               0   13 Gene03  34.5678
Beta                1   10 Gene03  76.5432
Gamma               1   15 Gene03  36.9036

Number of cases read:  9    Number of cases listed:  9

Notice you now have three groups of cases, corresponding to the three

genes; in your data, you'll have 50,000 groups. In each group there

is a case for each patient; three per group in this demo, 200 per

group in yours. (So the whole file will have ten million records

-- big, but well within SPSS's capacity, especially when each record

is this short.)

Now you don't have to write a separate KM statement to analyze each

gene. If you issue the command

SPLIT FILE BY Gene.

data in each group will be analyzed separately, and *one* KM

statement will analyze all 50,000 genes. You'll use OMS to capture

the results as an SPSS dataset; see my previous posting in this thread.

Further, at 01:33 AM 1/26/2013, you wrote:

>2) We want to divide the 200 patients into two groups, with the cutoff

>between groups based on the mean value of a gene expression. These groups

>are 1 for low expression and 2 for high expression.

You've had a lot of advice on *whether* to do this. As for *how* to

do this, I'm looking at your thread "Basic Recode syntax not creating

new Variable", where it looks like you're having Python generate a

RECODE statement for each of your 50,000 variables, inserting the

mean value. (Where do you get the means? Compute them in Excel? in Python?)

But when you have the data transformed, the operation takes only a

few lines of basic SPSS for *all* the genes. This, again, is from an

actual run:

*  Dichotomizing by high/low gene expression .
*  AGGREGATE attaches each gene's mean expression to every case for that gene .

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
   /BREAK=Gene
   /MeanVal=MEAN(Express).

FORMATS MeanVal (F8.4).

NUMERIC HiLo (F2).
VALUE LABELS HiLo
   1 'Low expression'
   2 'Hi expression'.

DO IF    Express GE MeanVal.
.  COMPUTE HiLo = 2.
ELSE.
.  COMPUTE HiLo = 1.
END IF.

LIST.

List
|-----------------------------|---------------------------|
|Output Created               |28-JAN-2013 17:15:30       |
|-----------------------------|---------------------------|
Patient_ID LiveOrDead Time Gene    Express  MeanVal HiLo

Alpha               0   13 Gene01  12.3456  45.2667    1
Beta                1   10 Gene01  98.7654  45.2667    2
Gamma               1   15 Gene01  24.6890  45.2667    1
Alpha               0   13 Gene02  78.9012  41.5300    2
Beta                1   10 Gene02  32.1098  41.5300    1
Gamma               1   15 Gene02  13.5791  41.5300    1
Alpha               0   13 Gene03  34.5678  49.3382    1
Beta                1   10 Gene03  76.5432  49.3382    2
Gamma               1   15 Gene03  36.9036  49.3382    1

Number of cases read:  9    Number of cases listed:  9
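
With HiLo in place, the split-file KM run and the OMS capture described earlier might look roughly like the sketch below. It assumes the long file built above, with LiveOrDead = 1 marking a death. The output path is hypothetical. The OMS command and subtype labels ('KM' and 'Overall Comparisons') are my assumption of what SPSS calls this table, so check them in the OMS Identifiers dialog of your version before relying on this.

```
*  Sketch only: the OUTFILE path is hypothetical .
*  The OMS command and subtype labels are assumptions to verify .

SORT CASES BY Gene Patient_ID.
SPLIT FILE BY Gene.

OMS
   /SELECT TABLES
   /IF COMMANDS=['KM'] SUBTYPES=['Overall Comparisons']
   /DESTINATION FORMAT=SAV OUTFILE='C:\temp\km_logrank.sav'.

KM Time BY HiLo
   /STATUS=LiveOrDead(1)
   /PRINT=NONE
   /TEST=LOGRANK
   /COMPARE=OVERALL POOLED.

OMSEND.
SPLIT FILE OFF.
```

If the identifiers are right, the saved file should hold one log-rank result per gene. Open it, see what OMS named the significance column, and sort ascending on that column to get the ranking you describe. /PRINT=NONE suppresses the survival tables, which you don't want 50,000 times over.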

->It helps to learn thoroughly what SPSS, itself, can do; it can save

a lot of clumsy wrestling with other tools <-

=============================

APPENDIX: Test data, and code

=============================

*  C:\Documents and Settings\Richard\My Documents .
*    \Technical\spssx-l\Z-2013\ .
*    2013-01-23 Blankdots- .
*      Help retrieving and saving SPSS output via xpath.SPS .
*  In response to posting .
*  Date:    Wed, 23 Jan 2013 16:27:52 -0800 .
*  From:    Blankdots <[hidden email]> .
*  Subject: Help retrieving and saving SPSS output via xpath .
*  To:      [hidden email] .
*  This code illustrates wide-to-long data restructuring, .
*  and its advantages. .

DATA LIST LIST/
   Patient_ID LiveOrDead Time Gene01 Gene02 Gene03
   (A8,       F2,        F3,  F8.4,  F8.4,  F8.4).
BEGIN DATA
Alpha 0 13 12.3456 78.9012 34.5678
Beta  1 10 98.7654 32.1098 76.5432
Gamma 1 15 24.6890 13.5791 36.9036
END DATA.

LIST.

VARSTOCASES
   /MAKE Express FROM Gene01 Gene02 Gene03
   /INDEX = Gene(Express)
   /KEEP  = Patient_ID LiveOrDead Time
   /NULL  = DROP.

SORT CASES BY Gene Patient_ID.
LIST.

*  Dichotomizing by high/low gene expression .
*  AGGREGATE attaches each gene's mean expression to every case for that gene .

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
   /BREAK=Gene
   /MeanVal=MEAN(Express).

FORMATS MeanVal (F8.4).

NUMERIC HiLo (F2).
VALUE LABELS HiLo
   1 'Low expression'
   2 'Hi expression'.

DO IF    Express GE MeanVal.
.  COMPUTE HiLo = 2.
ELSE.
.  COMPUTE HiLo = 1.
END IF.

LIST.
