When querying Impala tables stored as Parquet files via SAS©, an insidious issue can arise: data appears in the wrong order, or columns contain incorrect values.
This phenomenon is often due to the way Impala reads metadata from Parquet files. By default, if the physical file structure differs slightly from the table definition (e.g., a different column order), Impala may try to match columns by their position (index) rather than by their name.
To fix this directly in Impala, the option PARQUET_FALLBACK_SCHEMA_RESOLUTION=name is used. But how can this specific configuration be applied when using SQL Pass-Through in SAS©?
When you try to extract a table via SAS© without this option, you may find that the values do not match the correct columns. This happens because Impala falls back to the column order.
A common attempt is to try to integrate this option directly into the connection string (connect to impala (&impala.)) or to execute the SET command in the wrong place (for example, when creating an intermediate view), which often has no effect.
The key to solving this problem is to understand that the Pass-Through session must receive the configuration statement before executing the data selection query, but within the same connection block.
You must use the EXECUTE BY IMPALA statement to send the SET command to the database just before running the SELECT.
Open the connection to Impala.
Execute the configuration option to force resolution by name.
Retrieve the data via the established connection.
Corrected Code Example
Here is the functional SAS© PROC SQL syntax for this use case:
Why does this work?
By separating the SET statement, you modify the environment of the active Impala session. When the next query (SELECT *) is sent via connection to impala, it benefits from this setting and correctly maps the Parquet file columns by their name, ensuring the integrity of the data retrieved in SAS©.