INDEX Function with CHAR and VARCHAR Strings

This functional analysis shows how the INDEX function behaves differently depending on the variable type (CHAR or VARCHAR), particularly when the string contains multi-byte characters. VARCHAR variables use character-length semantics, so INDEX returns a position counted in characters. CHAR variables use byte-length semantics, so INDEX returns a position counted in bytes. The two results agree for single-byte text and diverge as soon as a multi-byte character precedes the match. This matters in SAS Viya and CAS environments, where data is UTF-8 encoded and the distinction changes the results of string manipulation functions.
Data Analysis

Type: CREATION_INTERNE (data created internally)


Examples use generated data (datalines) or direct creation of CAS tables via DATA steps.

1 Code Block
DATA STEP Data
Explanation:
This example initializes two variables, one VARCHAR and one CHAR, with the same single-byte string. The INDEX function searches for the character 'c'. Because every character occupies one byte, the character position and the byte position coincide, and INDEX returns 3 for both variable types. This is the baseline behavior.
/* Assumes an active CAS session */
LIBNAME mycas cas;

DATA mycas.chaine_basique;
  LENGTH x varchar(10);
  LENGTH y $10;
  x = 'abcde';
  y = 'abcde';
  xi = index(x,'c');   /* position counted in characters */
  yi = index(y,'c');   /* position counted in bytes */
  put 'VARCHAR_pos_c = ' xi;
  put 'CHAR_pos_c = ' yi;
RUN;

PROC PRINT DATA=mycas.chaine_basique;
  title 'Basic indexing results';
RUN;

/* Drop the table from the active caslib, which is where the mycas libref points */
PROC CASUTIL;
  droptable casdata='chaine_basique' quiet;
QUIT;
2 Code Block
DATA STEP Data
Explanation:
This example uses multi-byte Chinese characters. The VARCHAR(10) variable stores '你好世界' as 4 characters, so searching for '世' (the third character) returns 3. In UTF-8 each of these characters occupies 3 bytes, so the CHAR variable must be at least 12 bytes long to hold the whole string; with a shorter length the last character is truncated, although the position of '世' is unaffected. For CHAR, '世' starts at byte 7 (counting from 1), so INDEX returns 7. The pair of results, 3 and 7, shows the difference between character-length (VARCHAR) and byte-length (CHAR) semantics.
/* Assumes an active CAS session */
LIBNAME mycas cas;

DATA mycas.chaine_multioctet;
  LENGTH x varchar(10);
  LENGTH y $12;        /* 12 bytes: $10 would truncate the 4th character */
  x = '你好世界';      /* "Hello world" in Chinese: 4 characters, 12 bytes in UTF-8 */
  y = '你好世界';
  xi = index(x,'世');  /* search for the 3rd character */
  yi = index(y,'世');
  put 'VARCHAR_pos_shi = ' xi;
  put 'CHAR_pos_shi = ' yi;
RUN;

PROC PRINT DATA=mycas.chaine_multioctet;
  title 'Multi-byte indexing results';
RUN;

/* Drop the table from the active caslib, which is where the mycas libref points */
PROC CASUTIL;
  droptable casdata='chaine_multioctet' quiet;
QUIT;
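When a column has to stay CHAR, SAS also provides character-based K functions such as KINDEX and KLENGTH. The sketch below is not part of the original examples. It assumes a UTF-8 session encoding and runs as an ordinary DATA step with no CAS table involved; whether K functions are available inside a DATA step that executes in CAS should be checked against the documentation for your release.

```sas
/* Sketch only: assumes a UTF-8 session (check with %put &=SYSENCODING;) */
data _null_;
  length y $12;
  y = '你好世界';
  byte_pos = index(y, '世');    /* byte-based position */
  char_pos = kindex(y, '世');   /* character-based position */
  byte_len = length(y);         /* length in bytes */
  char_len = klength(y);        /* length in characters */
  put byte_pos= char_pos= byte_len= char_len=;
run;
```

Comparing byte_pos with char_pos in the log gives a quick way to spot columns where byte and character positions diverge.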
3 Code Block
DATA STEP Data
Explanation:
This example searches for a longer substring ('monde') in CHAR and VARCHAR variables. Because the text contains only single-byte characters, both calls return the same position (4). It also shows what happens when the substring is not found: INDEX returns 0, a case that string-handling code has to account for.
/* Assumes an active CAS session */
LIBNAME mycas cas;

DATA mycas.chaine_avancee;
  LENGTH phrase_varchar varchar(50);
  LENGTH phrase_char $50;
  phrase_varchar = 'Le monde est beau, la vie est courte.';
  phrase_char = 'Le monde est beau, la vie est courte.';

  /* Substring that exists in the text */
  pos_monde_varchar = index(phrase_varchar,'monde');
  pos_monde_char = index(phrase_char,'monde');

  /* Substring that does not exist: INDEX returns 0 */
  pos_non_trouve_varchar = index(phrase_varchar,'inexistant');
  pos_non_trouve_char = index(phrase_char,'inexistant');

  put 'VARCHAR "monde" at position: ' pos_monde_varchar;
  put 'CHAR "monde" at position: ' pos_monde_char;
  put 'VARCHAR "inexistant" at position: ' pos_non_trouve_varchar;
  put 'CHAR "inexistant" at position: ' pos_non_trouve_char;
RUN;

PROC PRINT DATA=mycas.chaine_avancee;
  title 'Advanced substring search results';
RUN;

/* Drop the table from the active caslib, which is where the mycas libref points */
PROC CASUTIL;
  droptable casdata='chaine_avancee' quiet;
QUIT;
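Because INDEX returns 0 when nothing is found, its result is usually checked before being passed to another function such as SUBSTR. A minimal sketch, assuming single-byte text; the variable names `pos` and `reste` are illustrative and do not come from the original examples:

```sas
/* Sketch only: guard on the INDEX result before calling SUBSTR */
data _null_;
  length phrase $50 reste $50;
  phrase = 'Le monde est beau, la vie est courte.';
  pos = index(phrase, 'monde');
  if pos > 0 then reste = substr(phrase, pos);  /* text from the match onward */
  else reste = ' ';                             /* not found: leave blank */
  put pos= reste=;
run;
```

Without the guard, a position of 0 passed to SUBSTR produces an invalid-argument note in the log.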
4 Code Block
DATA STEP Data
Explanation:
This example applies the same length semantics to the SUBSTR function in a CAS environment. With multi-byte characters, SUBSTR on a VARCHAR variable extracts by character position, while on a CHAR variable it extracts by byte position. Extracting a single byte from a CHAR variable therefore returns a fragment of a multi-byte character, which is not valid text on its own. The same happens whenever the start position or the length falls in the middle of a multi-byte character. Errors also arise when positions or lengths computed in characters are applied to a CHAR variable, or when a CHAR length is sized in characters rather than bytes.
/* Assumes an active CAS session */
LIBNAME mycas cas;

DATA mycas.chaine_substr_cas;
  LENGTH var_char $10;
  LENGTH var_varchar varchar(10);

  /* String of 3 multi-byte (Chinese) characters: 3 characters, 9 bytes in UTF-8 */
  var_char = '你好世';
  var_varchar = '你好世';

  /* 2nd character (VARCHAR) versus 2nd byte (CHAR) */
  sub_varchar_char = substr(var_varchar, 2, 1);
  sub_char_byte = substr(var_char, 2, 1);

  /* Byte 7 is the first byte of the 3rd character: a 1-byte extract is an incomplete character */
  sub_char_byte_erreur = substr(var_char, 7, 1);

  put 'VARCHAR (character 2) : ' sub_varchar_char;
  put 'CHAR (byte 2) : ' sub_char_byte;
  put 'CHAR (byte 7, start of 3rd character) : ' sub_char_byte_erreur;
RUN;

PROC PRINT DATA=mycas.chaine_substr_cas;
  title 'SUBSTR comparison with CHAR and VARCHAR in CAS';
RUN;

/* Drop the table from the active caslib, which is where the mycas libref points */
PROC CASUTIL;
  droptable casdata='chaine_substr_cas' quiet;
QUIT;
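To pull a whole character out of a CHAR variable, both the start position and the length have to be expressed in bytes, or a character-based function such as KSUBSTR can be used instead. The sketch below is an illustration rather than part of the original examples. It assumes a UTF-8 session in which each of these characters takes 3 bytes, and it runs as an ordinary DATA step with no CAS table involved; the names `whole_by_bytes`, `whole_by_k`, and `same` are hypothetical.

```sas
/* Sketch only: assumes a UTF-8 session, 3 bytes per character here */
data _null_;
  length var_char $9 whole_by_bytes $3 whole_by_k $3;
  var_char = '你好世';
  whole_by_bytes = substr(var_char, 4, 3);   /* bytes 4 to 6 hold the 2nd character */
  whole_by_k = ksubstr(var_char, 2, 1);      /* 2nd character, counted in characters */
  same = (whole_by_bytes = whole_by_k);      /* 1 when both extracts match */
  put whole_by_bytes= whole_by_k= same=;
run;
```

Hard-coded byte offsets like 4 and 3 only work when every preceding character has a known width, which is why character-based extraction is the safer choice for mixed text.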
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
Copyright Info : Copyright © SAS Institute Inc. All Rights Reserved.


Expert Advice
Expert
Michael
Viya infrastructure manager.
"When migrating legacy SAS code to Viya, review all INDEX, SCAN, and SUBSTR calls. If you convert your table columns to VARCHAR during the load to CAS, your standard string functions will become 'encoding-aware' automatically, often solving truncation and indexing bugs without changing a single line of logic."
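One way to try this advice is to copy a CHAR column into a VARCHAR column and compare what INDEX returns. The sketch below is a hedged illustration, not a migration recipe: the table names `mycas.legacy` and `mycas.legacy_vc` and the variables `libelle`, `libelle_vc`, `pos_char`, and `pos_varchar` are hypothetical, and it assumes an active CAS session and a UTF-8 session encoding.

```sas
/* Sketch only: hypothetical table and variable names */
libname mycas cas;

/* A small table with a legacy CHAR column */
data mycas.legacy;
  length libelle $12;
  libelle = '你好世界';
run;

/* Copy the CHAR column into a VARCHAR column */
data mycas.legacy_vc;
  length libelle_vc varchar(12);
  set mycas.legacy;
  libelle_vc = trim(libelle);            /* intended to drop the CHAR padding */
  pos_char = index(libelle, '世');       /* byte-based */
  pos_varchar = index(libelle_vc, '世'); /* character-based */
  drop libelle;
run;

proc print data=mycas.legacy_vc;
run;
```

Keeping both positions side by side during a migration makes it easier to find downstream logic that still expects byte offsets.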