SAS9

Optimizing PROC TTEST on thousands of variables and structuring your data

Simon

Gene expression data analysis often presents a major challenge: the multiplicity of variables. When it comes to comparing the expression of thousands of genes between two groups (e.g., "affected" vs "unaffected" subjects), the naive approach often involves using macros to loop over each variable.

However, this method can run into performance issues and, more insidiously, data structure errors that leave a test with no usable observations ($n=0$).


The trap of the PAIRED statement and missing data

Imagine a dataset where each row represents a subject. You have columns for the genes of healthy individuals (unaffected_gene_x) and others for sick individuals (affected_gene_x).

If you try to use the PAIRED statement in PROC TTEST:

paired unaffected_gene_1 * affected_gene_1;

SAS will automatically exclude any row with a missing value in either of the two variables. If your data is structured so that each subject is either in the healthy group or in the sick group (and therefore has a missing value for the other condition), no row has both values, so SAS excludes every observation. Result: the log shows 0 observations used. A paired test is also the wrong design here, since the two groups contain different subjects rather than two measurements on the same subject.
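To make the problem concrete, here is a minimal sketch on toy data. The data set name `demo_wide`, the `subject` column, the counter `n_complete`, and the values themselves are made up for illustration; only the two gene column names follow the example above. The second step counts the rows that a PAIRED analysis could actually use, that is, rows where both variables are non-missing.

```sas
/* Hypothetical toy data: each subject has a value in only one of the two columns */
data demo_wide;
  input subject unaffected_gene_1 affected_gene_1;
  datalines;
1 5.2 .
2 4.8 .
3 . 6.1
4 . 5.9
;
run;

/* Count the rows where both variables are present.
   NMISS returns the number of missing arguments, so 0 means a complete pair. */
data _null_;
  set demo_wide end=last;
  n_complete + (nmiss(unaffected_gene_1, affected_gene_1) = 0);
  if last then put "Complete pairs: " n_complete;
run;
```

In this toy data no row has a value in both columns, so there is nothing for a paired comparison to work with.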

The Solution: Restructure the data ("Wide to Long")

Instead of creating a complex macro that runs PROC TTEST 2500 times (which is inefficient), the best practice is to restructure the data to have only a single value column and an identification column (Group/Status).

Here's how to transform your "wide" data into "long" data with a DATA step and arrays (ARRAY):

DATA analyse_genes;
  SET donnees_source;
  /* Declare arrays over your existing columns */
  /* Adjust the dimension (10 here) to the actual number of genes (e.g., 2500) */
  array u unaffected_gene_1 - unaffected_gene_10;
  array a affected_gene_1 - affected_gene_10;

  /* Loop over the genes to pivot the data */
  DO gene_id = 1 to dim(u);
    /* 'Unaffected' case */
    IF u[gene_id] ne . THEN DO;
      STATUS = 'Unaffected';
      value = u[gene_id];
      OUTPUT;
    END;

    /* 'Affected' case */
    IF a[gene_id] ne . THEN DO;
      STATUS = 'Affected';
      value = a[gene_id];
      OUTPUT;
    END;
  END;
  keep gene_id STATUS value;
RUN;

/* Sorting is required for BY-group processing */
PROC SORT DATA=analyse_genes;
  BY gene_id;
RUN;

This transformation turns a sparse structure into one suited to group-wise statistical analysis: one row per measured value, with the gene and the group recorded in their own columns.

Running the T-Test Efficiently

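Once the data is restructured, there is no longer a need for a macro. A single PROC TTEST call is sufficient, using the BY statement to process each gene separately and CLASS to define the groups. The BY statement expects the data to be sorted by gene_id, which is why the PROC SORT step above is required.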

PROC TTEST DATA=analyse_genes;
  BY gene_id;   /* Runs the test for each gene */
  class STATUS; /* Defines the groups (Affected / Unaffected) */
  var value;    /* The variable holding the measurement */
RUN;

This method is significantly faster than invoking the procedure thousands of times, because the procedure starts once and reads the data set in a single pass instead of once per gene.

Automating the Retrieval of Significant Results

With thousands of tests, reading the standard output is impossible. The trick is to use ODS OUTPUT to retrieve the statistics (t-values and p-values) into a table, and then filter this table.

/* Route the results to a SAS data set */
ods OUTPUT ttests = resultats_ttests;

PROC TTEST DATA=analyse_genes;
  BY gene_id;
  class STATUS;
  var value;
RUN;

/* Keep only the significant differences */
DATA genes_significatifs;
  SET resultats_ttests;
  /* The table has one row per method (Pooled and Satterthwaite); */
  /* which one to read depends on whether the variances are equal */
  /* A missing p-value compares as less than any number, so exclude it explicitly */
  where not missing(Probt) and Probt < 0.05;
RUN;

The resultats_ttests table will contain variables like tValue (t-statistic) and Probt (the p-value), allowing you to immediately identify the genes of interest among thousands.
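As a variation on the steps above, the sketch below suppresses the displayed output while still capturing the results, then keeps a single row per gene. It assumes equal variances, so it reads the Pooled row; this is an illustrative choice, not a recommendation. The output data set name `genes_pooled` is hypothetical. `ods exclude all` hides the printed tables but does not stop ODS OUTPUT from writing its data set.

```sas
ods graphics off;   /* no plots for each BY group */
ods exclude all;    /* hide displayed tables; ODS OUTPUT still captures them */
ods output ttests = resultats_ttests;

proc ttest data=analyse_genes;
  by gene_id;
  class status;
  var value;
run;

ods exclude none;   /* restore normal displayed output */

/* Keep one row per gene: the Pooled test (assumes equal variances) */
data genes_pooled;  /* hypothetical data set name */
  set resultats_ttests;
  where Method = 'Pooled'
    and not missing(Probt)
    and Probt < 0.05;
run;
```

If ODS Graphics was enabled before this code ran, re-enable it afterwards with `ods graphics on;`. To use the unequal-variance test instead, filter on `Method = 'Satterthwaite'`.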