Performance Test on a High-Volume Genomic Data Aggregation

Business Context

A bioinformatics research institute is processing massive genomic datasets. A researcher initiates a heavy aggregation (`simple.summary`) on a table with hundreds of millions of rows. The process is expected to run for several hours. The goal is to ensure that `batchresults` can successfully detach such a long-running, resource-intensive job and that the job continues even if the original client session is terminated.

About the Action Set: session

Management of the CAS session state.

Data Preparation

Create a very large table 'genomic_variants' with 100 million rows to simulate a resource-intensive, long-running task.

DATA casuser.genomic_variants;
    call streaminit(456); /* Fixed seed so the generated data is reproducible */
    DO i = 1 to 100000000;
        /* strip() removes the leading blank that the 2. format adds to single-digit values */
        chromosome = 'chr' || strip(put(ceil(rand('UNIFORM')*22), 2.));
        position = ceil(rand('UNIFORM')*10000000);
        quality_score = rand('NORMAL', 50, 5);
        IF (mod(i, 1000000) = 0) THEN put i=; /* Log progress */
        OUTPUT;
    END;
RUN;

Implementation Steps

Start a named session 'research_session'. With the large dataset from the data preparation step already in place, launch the `simple.summary` action asynchronously and print the session UUID, which the next step needs.

cas research_session name='research_session';
PROC CAS;
    SESSION research_session;
    /* Assumes the data preparation code has already been run */
    ACTION SIMPLE.summary RESULT=summary_job /
        TABLE='genomic_variants',
        async='genomic_summary_job';
    PRINT 'Research Session UUID: ' || SESSION.sessionId();
RUN;

From a 'supervisor_session', immediately detach the long-running job using its session UUID.

cas supervisor_session name='supervisor_session';
PROC CAS;
    SESSION supervisor_session;
    /* Replace 'uuid-from-research-session' with the actual UUID */
    ACTION SESSION.batchresults / uuid='uuid-from-research-session';
RUN;

To test robustness, terminate the original 'research_session'. The server-side job should continue running.

PROC CAS;
    SESSION research_session;
    ACTION SESSION.endSession;
RUN;

Much later, from a new session, check the job status and fetch the results once completed.

cas results_session;
PROC CAS;
    SESSION results_session;
    /* Check status periodically */
    ACTION SESSION.actionstatus / name='genomic_summary_job';
    /* Fetch when complete */
    ACTION SESSION.fetchresult / name='genomic_summary_job';
RUN;
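
The "check status periodically" comment above leaves the polling logic open. The following is a minimal Python sketch of one way to structure it, not part of the scenario itself. It assumes a hypothetical `check_status` callable that wraps whatever `actionstatus` returns and reduces it to a string such as 'completed'; the function name `wait_for_job`, the polling interval, and the six-hour ceiling are illustrative assumptions, and no real CAS client API is called here.

```python
import time
from typing import Callable


def wait_for_job(
    check_status: Callable[[], str],
    poll_seconds: float = 60.0,
    max_wait_seconds: float = 6 * 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until check_status() reports 'completed' or the deadline passes.

    check_status is a hypothetical wrapper around the actionstatus call;
    sleep and clock are injectable so the loop can be exercised without waiting.
    Returns True if the job completed, False if the wait timed out.
    """
    deadline = clock() + max_wait_seconds
    while True:
        if check_status().lower() == "completed":
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

A caller would fetch the results only when `wait_for_job` returns True. For a job expected to run for several hours, a polling interval of a minute or more keeps the status checks from adding noticeable load.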

Expected Result

The `batchresults` action should execute quickly, detaching the 'genomic_summary_job'. The `endSession` action on 'research_session' should succeed without affecting the server-side job. After a significant amount of time, `actionstatus` will show the job as 'completed', and `fetchresult` will successfully retrieve the summary statistics for the 100 million row table, proving the job ran to completion independently of the client session.
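
To judge whether the fetched summary is plausible, it helps to know what the data preparation step implies. The Python sketch below derives the theoretical values from the distributions used in the DATA step and cross-checks them against a small simulated sample. It is an illustration under stated assumptions: the function names are hypothetical, and Python's random number generator differs from SAS's, so the sample only approximates what the server would report.

```python
import math
import random
import statistics

N_ROWS = 100_000_000        # loop bound in the DATA step
MAX_POSITION = 10_000_000   # multiplier applied to rand('UNIFORM')


def expected_summary() -> dict:
    """Theoretical values implied by the distributions in the DATA step."""
    return {
        "n": N_ROWS,
        "quality_score_mean": 50.0,   # rand('NORMAL', 50, 5)
        "quality_score_std": 5.0,
        # ceil(U * 10,000,000) is uniform on the integers 1..10,000,000
        "position_mean": (1 + MAX_POSITION) / 2,
        "position_std": math.sqrt((MAX_POSITION ** 2 - 1) / 12),
    }


def simulate_sample(n: int = 100_000, seed: int = 456) -> dict:
    """Draw a small sample with Python's RNG (not SAS's) as a rough cross-check."""
    rng = random.Random(seed)
    scores = [rng.gauss(50, 5) for _ in range(n)]
    # max(1, ...) guards the rare case where random() returns exactly 0.0
    positions = [max(1, math.ceil(rng.random() * MAX_POSITION)) for _ in range(n)]
    return {
        "n": n,
        "quality_score_mean": statistics.fmean(scores),
        "quality_score_std": statistics.stdev(scores),
        "position_mean": statistics.fmean(positions),
        "position_std": statistics.stdev(positions),
    }
```

With 100 million rows, the reported mean of `quality_score` should sit very close to 50 and its standard deviation close to 5, while `position` should average roughly 5,000,000. A summary that departs noticeably from these values would suggest the job ran on incomplete or different data.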