
Training a Gradient Boosting Model for Salary Prediction

The script starts a CAS session, loads the 'sashelp.baseball' dataset into CAS memory under the 'casuser' CASLIB, and then loads the 'decisionTree' action set. It then calls the 'gbtreeTrain' action to build a Gradient Boosting model, configured with options such as a POISSON distribution, early stopping based on the LOGLOSS metric, and variable importance calculation. The trained model is saved as the CAS table 'GRADBOOST3'.
Data Analysis

Type : SASHELP


The source data comes from SAS's built-in 'sashelp.baseball' dataset, which is then loaded and processed in CAS memory under the 'casuser' CASLIB.
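As a quick preview before copying the data into CAS, you can inspect the source table directly in SAS. The sketch below is an optional addition (not part of the original script); the variables listed are ones that exist in 'sashelp.baseball', with 'logSalary' being the log-transformed 'Salary' used later as the training target.

/* Optional preview of the source data (addition, not in the original script). */
PROC CONTENTS DATA=sashelp.baseball;
RUN;

PROC PRINT DATA=sashelp.baseball(OBS=5);
    VAR nAtBat nHits nHome nRuns Salary logSalary;
RUN;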

Code Block 1
DATA STEP Data
Explanation :
This code block initializes a CAS session and makes all CASLIBs available. A DATA STEP is then used to load the 'sashelp.baseball' dataset into CAS memory under the 'casuser' CASLIB, thereby creating a working copy of the 'baseball' table in CAS memory.
cas;
caslib _all_ assign;

DATA casuser.baseball;
    SET sashelp.baseball;
RUN;
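To confirm that the table was actually created in CAS memory, an optional check along these lines should work; PROC CASUTIL and its LIST TABLES statement are standard SAS Viya tools, but this step is an addition to the original script.

/* Optional check (addition): list in-memory tables in the casuser CASLIB */
/* to confirm that casuser.baseball was loaded by the DATA step above.    */
PROC CASUTIL;
    LIST TABLES INCASLIB="casuser";
QUIT;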
Code Block 2
PROC CAS
Explanation :
This block uses PROC CAS to load the 'decisionTree' action set. This action set provides the necessary actions for building and training decision tree models, including the Gradient Boosting action that will be used later.
PROC CAS;
    LOADACTIONSET 'decisionTree';
QUIT;
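If you want to verify that the action set is available before training, a short optional check (not part of the original script) can list the loaded action sets and the actions that 'decisionTree' provides:

/* Optional check (addition): confirm 'decisionTree' is loaded and show   */
/* the actions it provides, including gbtreeTrain.                        */
PROC CAS;
    builtins.actionSetInfo;                     /* loaded action sets */
    builtins.help / actionSet="decisionTree";   /* actions in the set */
QUIT;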
Code Block 3
PROC CAS Data
Explanation :
This block sets the default CASLIB to 'casuser' and then uses the 'gbtreeTrain' action from the 'decisionTree' action set via PROC CAS. This action trains a Gradient Boosting model on the 'baseball' table (casuser.baseball), targeting the 'logSalary' variable. It specifies a list of numerical and nominal input variables, uses a POISSON distribution, and includes options for early stopping (based on LOGLOSS), name encoding, greedy selection, handling missing values, Lasso regularization, leaf size, learning rate, and variable importance calculation. The trained model is saved in a new CAS table named 'GRADBOOST3'.
options caslib=casuser;

PROC CAS;
    decisionTree.gbtreeTrain /
        TABLE={name="baseball"}
        target="logSalary"
        casOut={name="GRADBOOST3", replace=true}
        inputs={"nAtBat",
                "nHits",
                "nHome",
                "nRuns",
                "nRBI",
                "nBB",
                "YrMajor",
                "CrAtBat",
                "CrHits",
                "CrHome",
                "CrRuns",
                "CrRbi",
                "CrBB",
                "nOuts",
                "nAssts",
                "nError",
                "Division",
                "League",
                "Position"}
        nominals={"Division", "League", "Position"}
        distribution="POISSON"
        earlyStop={metric="LOGLOSS"}
        encodeName=TRUE
        greedy=TRUE
        includeMissing=TRUE
        lasso=1
        leafSize=5
        learningRate=.1
        m=5
        varImp=TRUE
        ;
QUIT;
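Once training finishes, the 'GRADBOOST3' table holds the model and can be used for scoring. The sketch below is a hypothetical follow-up (not part of the original script) that uses the 'gbtreeScore' action from the same action set; for illustration it scores the training table back onto itself and keeps 'logSalary' alongside the predictions. The output table name 'baseball_scored' is an assumption for this example.

/* Hypothetical follow-up (addition): score the training table with the   */
/* stored gradient boosting model and keep the target for comparison.     */
PROC CAS;
    decisionTree.gbtreeScore /
        table={name="baseball"}
        modelTable={name="GRADBOOST3"}
        casOut={name="baseball_scored", replace=true}
        copyVars={"logSalary"}
        ;
QUIT;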
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
Copyright Info : Copyright © 2021, SAS Institute Inc., Cary, NC, USA. All Rights Reserved. SPDX-License-Identifier: Apache-2.0