SAS9

Optimizing PROC TTEST on thousands of variables and structuring your data

Simon

Difficulty level: Beginner

Published: 29/01/2022

Expert tip

Stéphanie

Avoid the 'macro loop' trap: running a procedure thousands of times is a common performance bottleneck in SAS Viya, and it bypasses the parallel processing power of the CAS engine. By restructuring your data from wide to long, you can use the BY statement to process all variables in a single pass, then use ODS OUTPUT to turn thousands of printed reports into one clean results table.


Gene expression data analysis often presents a major challenge: the multiplicity of variables. When it comes to comparing the expression of thousands of genes between two groups (e.g., "affected" vs "unaffected" subjects), the naive approach often involves using macros to loop over each variable.

However, this method can run into performance issues and, more insidiously, data-structure errors that leave the test with no usable observations ($n = 0$).


The trap of the PAIRED statement and missing data

Imagine a dataset where each row represents a subject. You have columns for the genes of healthy individuals (unaffected_gene_x) and others for sick individuals (affected_gene_x).

If you try to use the PAIRED statement in PROC TTEST:

paired unaffected_gene_1 * affected_gene_1;

SAS automatically excludes any row with a missing value in either of the two variables. If each subject belongs either to the healthy group or to the sick group (and therefore has a missing value for the other condition), every row is excluded and the procedure reports 0 observations used. The PAIRED statement is also the wrong model here: it compares two measurements taken on the same subject, whereas these are two independent groups.
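The problem can be reproduced on a tiny example. This is a minimal sketch, assuming a hypothetical toy table `demo_wide` (its name, the `subject` column, and the values are invented for illustration and are not part of the original data):

```sas
/* Hypothetical toy data: each subject has a value in only one of the two columns */
data demo_wide;
    input subject unaffected_gene_1 affected_gene_1;
    datalines;
1 5.1 .
2 4.8 .
3 . 6.3
4 . 6.9
;
run;

/* Every row has a missing value in one of the two paired variables,
   so no complete pair is left for the test */
proc ttest data=demo_wide;
    paired unaffected_gene_1 * affected_gene_1;
run;
```

Because no row holds both values, the procedure has no complete pair to analyze, whatever the number of subjects.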

The Solution: Restructure the data ("Wide to Long")

Instead of writing a complex macro that runs PROC TTEST 2500 times (which is inefficient), the best practice is to restructure the data so that it has a single value column, a group column (status), and a gene identifier.

Here's how to transform your "wide" data into "long" data with a DATA step and arrays (ARRAY):

data analyse_genes;
    set donnees_source;
    /* Declare arrays over the existing columns */
    /* Adjust the upper bound (10 here) to the real number of genes (e.g. 2500) */
    array u unaffected_gene_1 - unaffected_gene_10;
    array a affected_gene_1 - affected_gene_10;
    length status $10; /* fixed length so that neither label can be truncated */

    /* Loop over the genes to pivot the data */
    do gene_id = 1 to dim(u);
        /* 'Unaffected' case */
        if u[gene_id] ne . then do;
            status = 'Unaffected';
            value = u[gene_id];
            output;
        end;

        /* 'Affected' case */
        if a[gene_id] ne . then do;
            status = 'Affected';
            value = a[gene_id];
            output;
        end;
    end;
    keep gene_id status value;
run;

/* Sorting is required for BY-group processing */
proc sort data=analyse_genes;
    by gene_id;
run;

This transformation replaces the sparse wide layout with a long one that has one row per non-missing measurement, which is the shape BY-group statistical analysis expects.
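Before running any test, it can help to confirm that the pivot produced values in both groups for every gene. The following is a sketch only, assuming the sorted `analyse_genes` table built above; the output table `group_counts`, the variable `n_values`, and the threshold of 2 are hypothetical choices made for illustration:

```sas
/* Count the non-missing values per gene and per group in the long table */
proc means data=analyse_genes n noprint nway;
    by gene_id;
    class status;
    var value;
    output out=group_counts (drop=_type_ _freq_) n=n_values;
run;

/* Hypothetical check: list the gene/group cells that hold fewer than 2 values */
proc print data=group_counts;
    where n_values < 2;
run;
```

A group that is completely absent for a gene produces no row in `group_counts` at all, so a gene that appears only once in that table also deserves a closer look.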

Running the T-Test Efficiently

Once the data is restructured, the macro is no longer needed. A single PROC TTEST call is enough: the BY statement runs the test for each gene separately, and the CLASS statement defines the two groups.

proc ttest data=analyse_genes;
    by gene_id;    /* Runs the test for each gene */
    class status;  /* Defines the groups (Affected / Unaffected) */
    var value;     /* The variable holding the measurement */
run;

This approach is significantly faster than invoking the procedure thousands of times, because the procedure starts once and reads the sorted data in a single pass.

Automating the Retrieval of Significant Results

With thousands of tests, reading the standard output is impractical. The trick is to use ODS OUTPUT to write the statistics (t-values and p-values) to a table, and then filter that table.

/* Route the t-test results to a SAS table */
ods output ttests = resultats_ttests;

proc ttest data=analyse_genes;
    by gene_id;
    class status;
    var value;
run;

/* Keep only the significant differences */
data genes_significatifs;
    set resultats_ttests;
    /* One often filters on the Pooled or Satterthwaite method, depending on whether the variances are equal */
    /* A missing p-value compares as lower than any number in SAS, so exclude it explicitly */
    where Probt is not missing and Probt < 0.05;
run;

The resultats_ttests table contains one row per gene and method (Pooled and Satterthwaite), with variables such as tValue (the t statistic) and Probt (the p-value), so you can immediately identify the genes of interest among thousands.
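Choosing between the Pooled and Satterthwaite rows can also be automated. The sketch below is one possible approach, not the only one: it assumes the `Equality` ODS table of PROC TTEST (folded F test, with its p-value in `ProbF`) and the `Variances` column of the `TTests` table; the table names `resultats_variances` and `genes_significatifs_methode` and the 0.05 cut-off for the variance test are hypothetical choices:

```sas
/* Suppress the printed output for thousands of BY groups; ODS OUTPUT still captures the tables */
ods exclude all;
ods output ttests   = resultats_ttests
           equality = resultats_variances;

proc ttest data=analyse_genes;
    by gene_id;
    class status;
    var value;
run;

ods exclude none;   /* restore normal output */

/* Keep one row per gene: Pooled when the variances look equal, Satterthwaite otherwise */
data genes_significatifs_methode;
    merge resultats_ttests
          resultats_variances (keep=gene_id ProbF);
    by gene_id;
    if (ProbF >= 0.05 and Variances = 'Equal')
       or (. < ProbF < 0.05 and Variances = 'Unequal');
    /* Drop missing p-values and keep the significant tests only */
    if . < Probt < 0.05;
run;
```

Genes for which the variance test could not be computed (missing `ProbF`) are dropped by this filter; adjust the rule if you prefer to keep the Satterthwaite row in that case.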
