Background & Motivation

As the title of this post suggests, I think there is great value in understanding, at a very low level, how decision trees are built. I hope to present a simple introduction of this process, and then build this out to trees and forests in future posts. But first, a little analogy that hopefully sets some intuition for how this works.

Do you remember playing the game “20 Questions” to pass the time during a long car ride or sleepless nights as a kid? For those who missed out, the game involves two people, a Questioner (Q) and an Answerer (A). First, A thinks of an object and keeps this selection private. Then, Q has the opportunity to ask 20 Yes/No questions in an effort to identify A’s secret object. If you are unfamiliar and curious, try it out here (excuse the clunky UI, and note that you will be playing against a computer Questioner).

When I first learned of the game, I remember being amazed (and skeptical) that someone could home in on a specific object from the vast sea of possible words in a mere 20 questions. However, I soon learned that 20 questions was more than enough to identify most objects (at least, objects familiar to a 10-year-old). I also noticed that the particular questions I asked affected how quickly and efficiently I could home in on the target object (as opposed to, say, asking the same predefined 20 questions every game).

Although at the time I was not thinking of my gameplay as selecting “optimal questions” to efficiently “partition the object space”, in hindsight I realize that I, like most kids who learn the game, had internalized some of the same rules that, at an abstract level, govern how decision trees learn a problem.

The takeaway was that even simple Yes/No questions, when chosen properly, can efficiently partition even a very complex space; after all, 20 well-chosen binary questions can in principle distinguish among 2^20 (roughly one million) possibilities.

Decision Trees: Selecting Splits

Generating Candidate Splits

Similar to the questions in a game of 20 Questions, decision trees ask a series of Yes/No questions about your data. At each step, we first gather a list of all potential questions about our features: for a categorical feature, we can ask “Is Feature I equal to X?” for each category X of Feature I, and for a continuous feature, we can ask “Is Feature I less than X?” for each observed value X of Feature I (or some discretization of the observed values).

Suppose we have the following toy dataset of salaries by age and geographic region:

Age Region Salary
25 West 53
55 West 98
19 Midwest 50
49 Midwest 110
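
The code below refers to this table as toy_data; here is the DataFrame, built directly from the rows above:

import numpy as np
import pandas as pd

toy_data = pd.DataFrame({
    'Age': [25, 55, 19, 49],
    'Region': ['West', 'West', 'Midwest', 'Midwest'],
    'Salary': [53, 98, 50, 110],
})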

Given such a table, we can generate all candidate splits with the following code:

def get_candidate_splits(df, y):
    """
    df: pandas dataframe, to search for possible splits
    y: name of the target variable we are trying to predict (so omit it from candidate splits)
    """
    candidate_splits = []

    for col, dtype in df.dtypes.items():
        if col == y:
            # We don't want to split on the target variable
            continue
        if np.issubdtype(dtype, np.number):
            # If numeric, discretize cutpoints along the deciles (not a general rule, my simplification here)
            if df[col].nunique() < 5:
                candidate_splits += [(col, v) for v in sorted(df[col].unique())[1:]]
            else:
                candidate_splits += [(col, v) for v in set(np.percentile(df[col].dropna(), range(10, 100, 10)))]
        else:
            # If categorical, split on the 10 most frequently occurring values (again my simplification)
            if df[col].nunique() < 10:
                candidate_splits += [(col, v) for v in df[col].unique()]
            else:
                candidate_splits += [(col, v) for v in df[col].value_counts().sort_values(ascending=False).index[:10]]

    return candidate_splits
    

Applying this to the toy dataset gives:

candidate_splits = get_candidate_splits(toy_data, 'Salary')
print(candidate_splits)
[('Age', 25),
 ('Age', 49),
 ('Age', 55),
 ('Region', 'Midwest'),
 ('Region', 'West')]

This tells us we can ask the following Yes/No questions of the dataset:

  • Is the age < 25?
  • Is the age < 49?
  • Is the age < 55?
  • Is the Region == Midwest?
  • Is the Region == West?

Now that we have generated the set of candidate questions, we can turn to the issue of which question to ask in order to gain the most information about salary.

Selecting the Optimal Split

Once we have generated all possible splits of our data, we need a decision rule to select the “best” split. This rule can take different forms, but in general we are trying to minimize some cost function of our data. In the classification context, the cost function is often the weighted average entropy (or Gini impurity) of the resulting groups. For regression, which we are exploring here, we can try to minimize the variance of the target variable within the resulting groups. In either case, the goal is to split the data in such a way that the resulting groups are as “similar” as possible with regard to the specified target variable.
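
Concretely, for a regression split that divides a group X into subgroups X_i and X_j, the quantity to minimize is the size-weighted variance of the target y within the subgroups:

$$ \text{weighted variance} = \frac{|X_i| \cdot \mathrm{Var}\left(y_{X_i}\right) + |X_j| \cdot \mathrm{Var}\left(y_{X_j}\right)}{|X_i| + |X_j|} $$

and the “variance reduction” of a split is simply Var(y_X) minus this quantity.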

Suppose we start with some group X and variable y. Then, we can generate various splits of X according to each of the candidate splits identified in the previous section, to get resulting groups X_i and X_j. For each of these splits, we can calculate a new weighted variance as follows:

def get_split_variance(X, feature, val, y):
    """
    Weighted variance of target y after splitting X on (feature, val).
    Numeric features split on "< val"; categorical features split on "== val".
    (Note: np.var computes the population variance, i.e. ddof=0.)
    """
    if np.issubdtype(X[feature].dtype, np.number):
        X_i = X.loc[X[feature] < val]
        X_j = X.loc[X[feature] >= val]
    else:
        X_i = X.loc[X[feature] == val]
        X_j = X.loc[X[feature] != val]

    # Variance within each subgroup, weighted by subgroup size
    return ((np.var(X_i[y]) * X_i.shape[0]) + (np.var(X_j[y]) * X_j.shape[0])) / X.shape[0]

This is simply a function of the dataframe (X), split information (feature and value), and target variable (y). Thus, we can easily calculate the resulting weighted variance of every candidate split, and then select the split that results in the lowest weighted variance (equivalently, the split with greatest variance reduction from the original X).
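
For example, evaluating the Age/49 split on the toy data reproduces the 19.125 weighted variance that appears in the split table below:

get_split_variance(toy_data, 'Age', 49, 'Salary')  # -> 19.125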

Applying this to our toy dataset and candidate splits, we get the following variance reduction for each split:

def get_variance_reduction(df, candidate_splits, y):
    variance_reductions = []
    for split in candidate_splits:
        # Weighted variance of y that would result from applying this split
        variance_reductions.append((split[0], split[1], get_split_variance(df, split[0], split[1], y)))

    var_df = pd.DataFrame(variance_reductions, columns=['feature', 'value', 'variance'])
    # Reduction relative to the unsplit data (note: pandas .var() uses ddof=1 while np.var uses ddof=0;
    # that offset is constant across splits, so it does not affect the ranking)
    var_df['variance_reduction'] = df[y].var() - var_df['variance']

    return var_df.sort_values('variance_reduction', ascending=False)
    
get_variance_reduction(toy_data, candidate_splits, 'Salary')

This results in the following split stats:

feature value variance variance_reduction
Age 49 19.125 925.125
Age 25 451.5 492.75
Age 55 571.5 372.75
Region Midwest 703.125 241.125
Region West 703.125 241.125

This tells us that our first split should be on Age >= 49, as that leads to the largest reduction in variance of the Salary variable. You can probably see that this split results in two new datasets X_i and X_j, each of which has its own set of candidate splits and associated variance reductions. In this way, we can recursively apply the algorithm to the subgroups until some stopping criterion is met (commonly a maximum “depth” of the tree or a minimum required number of samples per node). The resulting model has a tree structure, and predicting the target for a new sample involves traversing the Yes/No questions until reaching a terminal node, at which point the prediction is the average (or, in the case of classification, the modal) target value of the training data that ended up in that node.
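
To make the recursion concrete, here is a minimal sketch of that build/predict loop, reusing the two helpers above. The function names, dictionary representation, and the max_depth/min_samples stopping criteria are my own illustrative choices, not a prescription:

def build_tree(X, y, depth=0, max_depth=3, min_samples=2):
    # Stop and return a leaf (predicting the mean) once the node is deep or small enough
    if depth >= max_depth or X.shape[0] < min_samples:
        return {'leaf': True, 'prediction': X[y].mean()}

    # Score every candidate split by its weighted variance and keep the best one
    splits = get_candidate_splits(X, y)
    if not splits:
        return {'leaf': True, 'prediction': X[y].mean()}
    feature, val = min(splits, key=lambda s: get_split_variance(X, s[0], s[1], y))

    # Partition the node the same way get_split_variance does
    if np.issubdtype(X[feature].dtype, np.number):
        X_i, X_j = X.loc[X[feature] < val], X.loc[X[feature] >= val]
    else:
        X_i, X_j = X.loc[X[feature] == val], X.loc[X[feature] != val]
    if X_i.empty or X_j.empty:
        return {'leaf': True, 'prediction': X[y].mean()}

    return {'leaf': False, 'feature': feature, 'value': val,
            'yes': build_tree(X_i, y, depth + 1, max_depth, min_samples),
            'no': build_tree(X_j, y, depth + 1, max_depth, min_samples)}

def predict(tree, row):
    # Traverse the Yes/No questions until reaching a leaf, then return its average
    while not tree['leaf']:
        if isinstance(tree['value'], str):
            answer = row[tree['feature']] == tree['value']
        else:
            answer = row[tree['feature']] < tree['value']
        tree = tree['yes'] if answer else tree['no']
    return tree['prediction']

On the toy data, tree = build_tree(toy_data, 'Salary') puts the Age/49 question at the root (matching the table above), and predict(tree, toy_data.iloc[0]) then walks the first row down to its leaf.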

The following image provides a visualization of the resulting model structure (note that the particular split points differ slightly, based on how scikit-learn chooses split points vs. how I did, but they produce the same groups for our training data; e.g. Age <= 37.5 yields the same split as Age < 49 / Age >= 49 does on our data).

Decision Tree Structure
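
For reference, a figure like this can be produced by fitting scikit-learn's DecisionTreeRegressor on the toy data and exporting the fitted tree with graphviz (a sketch: scikit-learn needs Region one-hot encoded first, and the output filename is arbitrary):

from sklearn.tree import DecisionTreeRegressor, export_graphviz

toy_X = pd.get_dummies(toy_data.drop('Salary', axis=1))  # Age, Region_Midwest, Region_West
toy_tree = DecisionTreeRegressor().fit(toy_X, toy_data['Salary'])
export_graphviz(toy_tree, out_file='toy_tree.dot',
                feature_names=toy_X.columns.tolist(), rounded=True)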

As a final example, I will demonstrate the flexibility of the approach on a more realistic dataset, so the results are perhaps more interesting and compelling.

male_wages = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/plm/Males.csv').drop(['Unnamed: 0', 'nr'], axis=1)
male_wages.head()
year school exper union ethn married health wage industry occupation residence
1980 14 1 no other no no 1.19754 Business_and_Repair_Service Service_Workers north_east
1981 14 2 yes other no no 1.85306 Personal_Service Service_Workers north_east
1982 14 3 no other no no 1.34446 Business_and_Repair_Service Service_Workers north_east
1983 14 4 no other no no 1.43321 Business_and_Repair_Service Service_Workers north_east
1984 14 5 no other no no 1.56813 Personal_Service Craftsmen, Foremen_and_kindred north_east

Looking at a sample of candidate splits:

pd.DataFrame(get_candidate_splits(male_wages, 'wage')).drop_duplicates(0).values
 
 ['year' 1983.5]
 ['school' 10.0]
 ['exper' 8.0]
 ['union' 'no']
 ['ethn' 'other']
 ['married' 'no']
 ['health' 'no']
 ['industry' 'Manufacturing']
 ['occupation' 'Craftsmen, Foremen_and_kindred']
 ['residence' 'south']

Sorting splits by resulting variance reduction:

feature value variance variance_reduction
school 12.0 0.267075 0.0165982
year 1983.5 0.268026 0.0156472
year 1985.0 0.269613 0.0140595
year 1982.0 0.270806 0.0128672
school 11.0 0.271068 0.012605
married no 0.271655 0.0120179
married yes 0.271655 0.0120179
exper 5.0 0.27271 0.0109627
exper 6.0 0.273995 0.00967737
school 14.0 0.274754 0.00891904
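
(For reference, this table can be reproduced by reusing the same helpers on the wages data; the .head(10) simply limits the display to the ten best splits:)

get_variance_reduction(male_wages, get_candidate_splits(male_wages, 'wage'), 'wage').head(10)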

Thus we would expect the first split to be on whether schooling is greater than or equal to 12 years. Sure enough, we can fit this data with scikit-learn and inspect the resulting tree, where the first split divides the group into those with 12 or more years of schooling and those with fewer, respectively.

from sklearn.tree import DecisionTreeRegressor, export_graphviz

# male_wages_w_dummies is male_wages with one-hot (dummy) columns added for the categorical features (cat_cols)
X = male_wages_w_dummies.drop(cat_cols + ['wage'], axis=1)
dtr = DecisionTreeRegressor().fit(X, male_wages_w_dummies['wage'])

export_graphviz(dtr, out_file=out_f,  # out_f: path for the output .dot file
                feature_names=X.columns.tolist(),
                max_depth=3,
                rounded=True)

Wages Decision Tree Structure

Next Steps

In a future post, I will go into more detail on topics such as how we decide on a stopping criterion, limitations of the decision tree framework, and how extensions like bagging and random forests can address some of these shortcomings. In this post, I wanted to lay the groundwork for thinking about how a tree is constructed at the most micro level. Breaking the process down into components like “candidate split search” and “optimal split selection” also lays a good foundation for reasoning about model training and parameter tuning. For example, as we will see when we move on to random forests, limiting the candidate split search to a random subset of the features at each split can significantly improve model test performance and generalizability. Or we may want to modify the selection step; for certain applications, our notion of the “best split” may involve another metric, such as MAE, that is less sensitive to outliers.
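
As a tiny preview of that first modification, restricting the candidate search to a random subset of features requires only a small wrapper around get_candidate_splits (a sketch; the helper name and default subset size are just illustrative):

import random

def get_candidate_splits_random_subset(df, y, n_features=3):
    # Only consider a random subset of the non-target features at this split
    features = [c for c in df.columns if c != y]
    subset = random.sample(features, min(n_features, len(features)))
    return get_candidate_splits(df[subset + [y]], y)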