Seeing the Trees for the Forest | A gentle introduction to tree-based methods Part 1
Background & Motivation
As the title of this post suggests, I think there is great value in understanding, at a very low level, how decision trees are built. I hope to present a simple introduction to this process, and then build it out to full trees and forests in future posts. But first, a little analogy that hopefully sets some intuition for how this works.
Do you remember playing the game “20 Questions” to pass the time during a long car ride or sleepless nights as a kid? For those who missed out, the game involves two people, a Questioner (Q) and an Answerer (A). First, A thinks of an object and keeps this selection private. Then, Q has the opportunity to ask 20 Yes/No questions in an effort to identify A’s secret object. If you are unfamiliar and curious, try it out here (excuse the clunky UI, and note that you will be playing against a computer Questioner).
When I first learned of the game, I remember being amazed (and skeptical) that someone could hone in on a specific object from the vast sea of possible words in a mere 20 questions. However, I soon learned that 20 questions was more than enough to identify most objects (at least, objects familiar to a 10-year-old). Additionally, the particular questions I asked affected how quickly and efficiently I could hone in on the target object (rather than just asking the same predefined 20 questions every game, for example).
Although at the time I was not thinking of my gameplay as selecting “optimal questions” to efficiently “partition the object space”, in hindsight I realize I had internalized (as, I would guess, most kids who learn the game do) some of the same rules that, at an abstract level, govern how decision trees learn a problem.
The takeaway was: even simple yes/no questions, when chosen properly, can efficiently partition even a very complex space.
Decision Trees: Selecting Splits
Generating Candidate Splits
Similar to the questions in a game of 20 Questions, decision trees ask a series of Yes/No questions about your data. At each step, we first gather a list of all potential questions about our features: for a categorical feature, we can ask “Is Feature I equal to X?” for each category X of Feature I, and for a continuous feature, we can ask “Is Feature I greater than X?” for each observed value X of Feature I (or some discretization of the observed values).
Suppose we have the following toy dataset of salaries by age and geographic region:
Age | Region | Salary |
---|---|---|
25 | West | 53 |
55 | West | 98 |
19 | Midwest | 50 |
49 | Midwest | 110 |
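For reference, here is that toy table as a pandas DataFrame, named toy_data to match the snippets that follow:

import pandas as pd

toy_data = pd.DataFrame({
    'Age': [25, 55, 19, 49],
    'Region': ['West', 'West', 'Midwest', 'Midwest'],
    'Salary': [53, 98, 50, 110],
})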
Given such a table, we can generate all candidate splits with the following code:
import numpy as np

def get_candidate_splits(df, y):
    """
    df: pandas DataFrame to search for possible splits
    y: name of the target column (omitted from candidate splits)
    """
    candidate_splits = []
    for col, dtype in df.dtypes.items():
        if col == y:
            # We don't want to split on the target variable
            continue
        if np.issubdtype(dtype, np.number):
            # If numeric, discretize cutpoints to be along the deciles (not a general rule, my simplification here)
            if df[col].nunique() < 5:
                candidate_splits += [(col, v) for v in sorted(df[col].unique())[1:]]
            else:
                candidate_splits += [(col, v) for v in set(np.percentile(df[col].dropna(), range(10, 100, 10)))]
        else:
            # If categorical, split on the 10 most frequently occurring values (again my simplification)
            if df[col].nunique() < 10:
                candidate_splits += [(col, v) for v in df[col].unique()]
            else:
                candidate_splits += [(col, v) for v in df[col].value_counts().sort_values(ascending=False).index[:10]]
    return candidate_splits
Applying this to the toy dataset gives:
candidate_splits = get_candidate_splits(toy_data, 'Salary')
print(candidate_splits)
[('Age', 25),
('Age', 49),
('Age', 55),
('Region', 'Midwest'),
('Region', 'West')]
This tells us we can ask the following Yes/No questions of the dataset:
- Is the age < 25?
- Is the age < 49?
- Is the age < 55?
- Is the Region == Midwest?
- Is the Region == West?
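Asking any one of these questions amounts to a boolean filter on the dataframe. For example, the second question partitions the toy data like so:

toy_data[toy_data['Age'] < 49]     # "Yes" group: the rows with Ages 25 and 19
toy_data[toy_data['Age'] >= 49]    # "No" group: the rows with Ages 55 and 49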
Now that we have generated the set of candidate questions, we can turn to the issue of which question to ask in order to gain the most information about salary.
Selecting Optimal Split
Once we have generated all possible splits of our data, we need a decision rule to select the “best” split. This rule can take different forms, but in general we are trying to minimize some cost function of our data. In the classification context, the cost function is often the entropy (or Gini impurity) of the resulting groups, weighted by their sizes. For regression, which we are exploring here, we can instead minimize the variance of the target variable within groups. Either way, the goal is to split the data such that the resulting groups are as “similar” as possible with regard to the target variable.
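Since this post sticks to regression, the code below uses variance, but for a rough sense of the classification case, a weighted child-node entropy could be computed along these lines (the function names here are my own illustrative choices, not something from the post's code or a library):

def entropy(labels):
    # Shannon entropy of a vector of class labels
    probs = labels.value_counts(normalize=True)
    return -(probs * np.log2(probs)).sum()

def weighted_child_entropy(X, mask, y):
    # Entropy of each child node, weighted by the share of rows it receives
    left, right = X.loc[mask, y], X.loc[~mask, y]
    return (entropy(left) * len(left) + entropy(right) * len(right)) / X.shape[0]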
Suppose we start with some group X and variable y. Then, we can generate various splits of X according to each of the candidate splits identified in the previous section, to get resulting groups X_i and X_j. For each of these splits, we can calculate a new weighted variance as follows:
def get_split_variance(X, feature, val, y):
    """Weighted within-group variance of y after splitting X on (feature, val)."""
    if np.issubdtype(X[feature].dtype, np.number):
        # Numeric feature: split on a threshold
        X_i = X.loc[X[feature] < val]
        X_j = X.loc[X[feature] >= val]
    else:
        # Categorical feature: split on membership in a single category
        X_i = X.loc[X[feature] == val]
        X_j = X.loc[X[feature] != val]
    # Variance of each group (np.var, i.e. ddof=0), weighted by the group's size
    return ((np.var(X_i[y]) * X_i.shape[0]) + (np.var(X_j[y]) * X_j.shape[0])) / X.shape[0]
This is simply a function of the dataframe (X), split information (feature and value), and target variable (y). Thus, we can easily calculate the resulting weighted variance of every candidate split, and then select the split that results in the lowest weighted variance (equivalently, the split with greatest variance reduction from the original X).
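As a quick sanity check, consider splitting the toy data at Age = 49: the rows with Age < 49 have salaries 53 and 50 (variance 2.25), and the rows with Age >= 49 have salaries 98 and 110 (variance 36), so the weighted variance is (2 * 2.25 + 2 * 36) / 4 = 19.125, which matches the top row of the results table further below:

get_split_variance(toy_data, 'Age', 49, 'Salary')  # 19.125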
Applying this to our toy dataset and candidate splits, we get the following variance reduction for each split:
def get_variance_reduction(df, candidate_splits, y):
    variance_reductions = []
    for feature, value in candidate_splits:
        variance_reductions.append((feature, value, get_split_variance(df, feature, value, y)))
    var_df = pd.DataFrame(variance_reductions, columns=['feature', 'value', 'variance'])
    # Reduction relative to the unsplit parent's variance (pandas .var() defaults to ddof=1)
    var_df['variance_reduction'] = df[y].var() - var_df['variance']
    return var_df.sort_values('variance_reduction', ascending=False)
get_variance_reduction(toy_data, candidate_splits, 'Salary')
This results in the following split stats:
feature | value | variance | variance_reduction |
---|---|---|---|
Age | 49 | 19.125 | 925.125 |
Age | 25 | 451.5 | 492.75 |
Age | 55 | 571.5 | 372.75 |
Region | Midwest | 703.125 | 241.125 |
Region | West | 703.125 | 241.125 |
This tells us that our first split should be on Age >= 49, as that will lead to the largest reduction in variance of the Salary variable. You can probably see that this split results in two new datasets, X_i and X_j, each of which has its own set of candidate splits and associated variance reductions. In this way, we can recursively apply the algorithm to the subgroups until some stopping criterion is met (commonly a maximum “depth” or a minimum required number of samples per node). The resulting model has a tree structure, and predicting the target for a new sample involves traversing the Yes/No questions until reaching a terminal node, at which point the prediction is the average (or modal, in the case of classification) target value of the training data that ended up in that node.
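To make that recursion concrete, here is one minimal sketch of how the helpers above could be wired into a full tree. The nested-dict node representation, the names build_tree and predict_row, and the particular stopping rules are my own illustrative choices rather than anything from a library:

def build_tree(df, y, depth=0, max_depth=3, min_samples=2):
    # Stop if we hit the depth limit, run low on rows, or the node is already pure
    if depth >= max_depth or df.shape[0] < min_samples or df[y].nunique() == 1:
        return {'prediction': df[y].mean()}
    splits = get_candidate_splits(df, y)
    if not splits:
        return {'prediction': df[y].mean()}
    best = get_variance_reduction(df, splits, y).iloc[0]
    if best['variance_reduction'] <= 0:
        return {'prediction': df[y].mean()}
    feature, value = best['feature'], best['value']
    if np.issubdtype(df[feature].dtype, np.number):
        left, right = df[df[feature] < value], df[df[feature] >= value]
    else:
        left, right = df[df[feature] == value], df[df[feature] != value]
    if left.empty or right.empty:
        return {'prediction': df[y].mean()}
    return {'feature': feature, 'value': value,
            'left': build_tree(left, y, depth + 1, max_depth, min_samples),
            'right': build_tree(right, y, depth + 1, max_depth, min_samples)}

def predict_row(tree, row):
    # Walk the Yes/No questions until we reach a terminal node
    while 'prediction' not in tree:
        if isinstance(tree['value'], str):   # categorical split
            go_left = row[tree['feature']] == tree['value']
        else:                                # numeric split
            go_left = row[tree['feature']] < tree['value']
        tree = tree['left'] if go_left else tree['right']
    return tree['prediction']

On the toy data, build_tree(toy_data, 'Salary') chooses Age = 49 as its first split, exactly as the table above suggests, and predict_row then averages the training salaries in whichever terminal node a new row lands.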
The following image provides a visualization of the resulting model structure (note that the particular split points are slightly different, based on how scikit-learn chooses split points vs. how I did here, but they produce the same groups for our training data; e.g. Age <= 37.5 results in the same split as Age < 49 / Age >= 49 in our data).
As a final example, I will demonstrate the flexibility of the approach on a more realistic dataset, so the results are perhaps more interesting and compelling.
male_wages = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/plm/Males.csv').drop(['Unnamed: 0', 'nr'], axis=1)
male_wages.head()
year | school | exper | union | ethn | married | health | wage | industry | occupation | residence |
---|---|---|---|---|---|---|---|---|---|---|
1980 | 14 | 1 | no | other | no | no | 1.19754 | Business_and_Repair_Service | Service_Workers | north_east |
1981 | 14 | 2 | yes | other | no | no | 1.85306 | Personal_Service | Service_Workers | north_east |
1982 | 14 | 3 | no | other | no | no | 1.34446 | Business_and_Repair_Service | Service_Workers | north_east |
1983 | 14 | 4 | no | other | no | no | 1.43321 | Business_and_Repair_Service | Service_Workers | north_east |
1984 | 14 | 5 | no | other | no | no | 1.56813 | Personal_Service | Craftsmen, Foremen_and_kindred | north_east |
Looking at a sample of candidate splits:
pd.DataFrame(get_candidate_splits(male_wages, 'wage')).drop_duplicates(0).values
['year' 1983.5]
['school' 10.0]
['exper' 8.0]
['union' 'no']
['ethn' 'other']
['married' 'no']
['health' 'no']
['industry' 'Manufacturing']
['occupation' 'Craftsmen, Foremen_and_kindred']
['residence' 'south']
Sorting splits by resulting variance reduction:
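Using the helpers defined above, this can be computed with a call along the following lines (the head(10) is my addition, just to match the ten rows shown):

get_variance_reduction(male_wages, get_candidate_splits(male_wages, 'wage'), 'wage').head(10)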
feature | value | variance | variance_reduction |
---|---|---|---|
school | 12.0 | 0.267075 | 0.0165982 |
year | 1983.5 | 0.268026 | 0.0156472 |
year | 1985.0 | 0.269613 | 0.0140595 |
year | 1982.0 | 0.270806 | 0.0128672 |
school | 11.0 | 0.271068 | 0.012605 |
married | no | 0.271655 | 0.0120179 |
married | yes | 0.271655 | 0.0120179 |
exper | 5.0 | 0.27271 | 0.0109627 |
exper | 6.0 | 0.273995 | 0.00967737 |
school | 14.0 | 0.274754 | 0.00891904 |
Thus we would expect the first split to be on whether schooling is greater than or equal to 12. Sure enough, we can fit this data with scikit-learn and inspect the resulting tree, whose first split separates the group into those with 12 or more years of schooling and those with fewer.
from sklearn.tree import DecisionTreeRegressor, export_graphviz

# Not shown in the original: one plausible way to build the dummy-encoded frame used below
cat_cols = male_wages.select_dtypes(include='object').columns.tolist()
male_wages_w_dummies = pd.concat([male_wages, pd.get_dummies(male_wages[cat_cols])], axis=1)
out_f = 'tree.dot'  # output path for the graphviz file (my placeholder)

X = male_wages_w_dummies.drop(cat_cols + ['wage'], axis=1)
dtr = DecisionTreeRegressor().fit(X, male_wages_w_dummies['wage'])
export_graphviz(dtr, out_file=out_f,
                feature_names=X.columns.tolist(),
                max_depth=3,
                rounded=True)
Next Steps
In a future post, I will go into more detail on topics such as how we decide on a stopping criterion, limitations of the decision tree framework, and how extensions like bagging and random forests can address some of these shortcomings. In this post, I wanted to lay the groundwork for thinking about how a tree is constructed at the most micro level. Breaking the process down into components like “candidate split search” and “optimal split selection” also lays a good foundation for reasoning about model training and parameter tuning. For example, as we will see when we move on to random forests, limiting the candidate split search to a random subset of the features at each split can significantly improve model test performance and generalizability. Or we may want to modify the selection step; for certain applications, our notion of “best split” may involve another metric, such as MAE, that is less sensitive to outliers.
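As a small teaser of that feature-subsampling idea, here is one way it might be bolted onto the candidate-split search from earlier (get_candidate_splits_rf and n_features are hypothetical names of mine, not from any library):

import random

def get_candidate_splits_rf(df, y, n_features=None):
    # Only consider a random subset of the features when generating candidate splits
    features = [c for c in df.columns if c != y]
    if n_features is not None:
        features = random.sample(features, min(n_features, len(features)))
    return get_candidate_splits(df[features + [y]], y)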