
PROC DATAMETRICS: Advanced Data Quality Analysis

Awaiting validation
The DATAMETRICS procedure computes data quality metrics for the variables of a table. This functional analysis walks through its main parameters: the median statistic ('median'), the most frequent values ('frequencies'), the extreme values ('minmax'), and the use of variable formats in the analysis. It also covers the Quality Knowledge Base (QKB), which the 'identities' statement uses for identification analysis, and the 'threads' parameter for multithreaded execution. The last example runs the procedure on an in-memory table in the SAS® Viya CAS environment.
Data Analysis

Type: CREATION_INTERNE (internally created data)


The examples use internally generated data via DATA steps with DATALINES, ensuring their autonomy and reproducibility.

1 Code Block
DATA / PROC DATAMETRICS Data
Explanation :
This example initializes a 'my_data' dataset with dummy information (name, address, city, state). Then, the PROC DATAMETRICS procedure is executed with the minimum required parameters to calculate data quality metrics for the 'name' and 'address' variables. The results are stored in the 'basic_metrics' table.
/* The values are quoted because they contain spaces. DSD with a blank */
/* delimiter lets list input read each quoted string as one field.     */
DATA work.my_data;
   LENGTH name $30 address $50 city $20 state $2;
   INFILE DATALINES dsd dlm=' ';
   INPUT name $ address $ city $ state $;
   DATALINES;
"John Doe" "123 Main St" "Anytown" "NY"
"Jane Smith" "456 Oak Ave" "Anycity" "CA"
"John Doe" "123 Main St" "Anytown" "NY"
"Peter Jones" "789 Pine Ln" "Otherville" "TX"
"Alice Brown" "101 Maple Dr" "Anytown" "NY"
"Bob White" "202 Elm St" "Otherville" "TX"
"Charlie Green" "303 Cedar Rd" "Anycity" "CA"
"David Black" "404 Birch Ct" "Anytown" "NY"
;
RUN;

/* Compute the default metrics for two variables */
PROC DATAMETRICS DATA=work.my_data out=work.basic_metrics;
   variables name address;
RUN;

PROC PRINT DATA=work.basic_metrics;
   title "Basic Metrics for Name and Address";
RUN;
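To sanity-check the frequency metrics, the same counts can be derived independently with PROC FREQ. The following is a minimal sketch, assuming 'work.my_data' was created as in this example and that each quoted value was read as a single field. It is only a cross-check and not part of PROC DATAMETRICS.

```sas
/* Cross-check: count occurrences of each name, most frequent first */
proc freq data=work.my_data order=freq;
   tables name / nocum nopercent;
   title "Name Frequencies (Cross-Check)";
run;

/* Clear the title so it does not carry over to later output */
title;
```

The sample data contains one duplicated row ("John Doe"), so that name should be listed first with a count of 2.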
2 Code Block
PROC DATAMETRICS
Explanation :
This example reuses 'work.my_data' from the previous example and adds common options on the PROC statement: 'frequencies=10' reports the 10 most frequent values, 'minmax=5' reports the 5 minimum and 5 maximum values, and 'median' adds the median. The 'identities' statement points to a Quality Knowledge Base (QKB) and names the 'ENUSA' locale and the 'Field Content' definition to use for identification analysis. The QKB path shown ('/sas/dqc/QKBLoc') must match the QKB location of your installation.
/* Make sure work.my_data has already been created in Example 1 */

PROC DATAMETRICS DATA=work.my_data out=work.common_metrics frequencies=10
      minmax=5 median;
   identities qkb='/sas/dqc/QKBLoc' locale='ENUSA' def='Field Content';
   variables name address city;
RUN;

PROC PRINT DATA=work.common_metrics;
   title "Metrics with Frequencies, Min/Max, Median, and QKB";
RUN;
3 Code Block
PROC FORMAT / DATA / PROC DATAMETRICS
Explanation :
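This example defines a custom format for the 'state' variable and attaches it in a new dataset, 'formatted_data'. The FORMAT statement in the DATA step only associates the format with the variable; the stored values remain the two-letter codes. PROC DATAMETRICS is then run on this table with 'frequencies=20' and 'minmax=10' to set how many frequent and extreme values are reported, the 'format' option so that the analysis takes the attached formats into account, and 'threads=4' to request four threads for parallel processing. The 'multiidentity' option in the 'identities' statement allows more than one data quality identity to be analyzed for the specified variables.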
/* Make sure work.my_data has already been created in Example 1 */

/* Map state codes to full names */
PROC FORMAT;
   value $statefmt
      'NY'='New York'
      'CA'='California'
      'TX'='Texas'
      other='Other';
RUN;

/* Attach the format; the stored values are unchanged */
DATA work.formatted_data;
   SET work.my_data;
   FORMAT state $statefmt.;
RUN;

PROC DATAMETRICS DATA=work.formatted_data out=work.advanced_metrics
      frequencies=20 minmax=10 threads=4 FORMAT;
   identities qkb='/sas/dqc/QKBLoc' locale='ENUSA'
      def='Field Content' multiidentity;
   variables name address city state;
RUN;

PROC PRINT DATA=work.advanced_metrics;
   title "Advanced Metrics with Formats, Threads, and Multi-Identities";
RUN;
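To confirm that the format is attached to 'state' and to preview the labels it produces, a quick check with PROC CONTENTS and PROC FREQ can help. This is a sketch that assumes 'work.formatted_data' and the '$statefmt' format from this example exist in the current session; it does not involve PROC DATAMETRICS.

```sas
/* The Format column should show $STATEFMT. for the state variable */
proc contents data=work.formatted_data;
run;

/* PROC FREQ groups by formatted value, so labels appear instead of codes */
proc freq data=work.formatted_data;
   tables state / nocum nopercent;
run;
```

With the eight sample rows, the frequency table should show New York with 4 rows and California and Texas with 2 rows each.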
4 Code Block
CASLIB / PROC CASUTIL / PROC DATAMETRICS
Explanation :
This example runs the procedure in SAS Viya Cloud Analytic Services (CAS). It requires an active CAS session, because 'caslib _all_ assign' assigns librefs for the caslibs of the current session. PROC CASUTIL first loads 'work.my_data' into the 'casuser' caslib as the in-memory table 'my_cas_data'. PROC DATAMETRICS then reads that table directly, with 'frequencies=5' and 'minmax=3' setting how many frequent and extreme values are reported and 'threads=2' setting the thread count. The results are written to another CAS table, 'casuser.cas_metrics'.
/* Make sure work.my_data has already been created in Example 1 */
/* An active CAS session is required before running this code   */

/* Assign librefs for all caslibs in the session */
caslib _all_ assign;

/* Load the WORK table into CAS as an in-memory table */
PROC CASUTIL;
   load DATA=work.my_data outcaslib='casuser' casout='my_cas_data' replace;
RUN;

PROC DATAMETRICS DATA=casuser.my_cas_data out=casuser.cas_metrics
      frequencies=5 minmax=3 threads=2;
   identities qkb='/sas/dqc/QKBLoc' locale='ENUSA' def='Field Content';
   variables name address city;
RUN;

PROC PRINT DATA=casuser.cas_metrics;
   title "Metrics via DATAMETRICS on CAS";
RUN;
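After the run, it can be useful to confirm that both tables are in memory and to drop them when they are no longer needed. The following sketch uses PROC CASUTIL and assumes the same active CAS session and the table names from this example; the cleanup step is optional and is not part of the original example.

```sas
proc casutil;
   /* List the in-memory tables in the casuser caslib */
   list tables incaslib='casuser';

   /* Optional cleanup: drop the tables created by this example.   */
   /* QUIET suppresses the error if a table is not found.          */
   droptable casdata='my_cas_data' incaslib='casuser' quiet;
   droptable casdata='cas_metrics' incaslib='casuser' quiet;
quit;
```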
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.