Implementing a Decision Tree Algorithm from Scratch

machine learning, python

In this notebook I implement a decision tree algorithm from scratch and compare it to the algorithm from Lesson 6 of the fastai course.

Author: Vishal Bakshi
Published: September 28, 2023

Background

In this blog post I’ll work through the second exercise of the “Further Research” section in Chapter 9 of the fastai textbook:

Implement the decision tree algorithm in this chapter from scratch yourself, and try it on the dataset you used in the first exercise.

Here is the decision tree algorithm as the textbook describes it for the Blue Book for Bulldozers Kaggle competition:

  1. Loop through each column of the dataset in turn.
  2. For each column, loop through each possible level of that column in turn.
  3. Try splitting the data into two groups, based on whether they are greater than or less than that value (or if it is a categorical variable, based on whether they are equal to or not equal to that level of that categorical variable).
  4. Find the average sale price for each of those two groups, and see how close that is to the actual sale price of each of the items of equipment in that group. Treat this as a very simple “model” in which our predictions are simply the average sale price of the item’s group.
  5. After looping through all of the columns and all the possible levels for each, pick the split point that gave the best predictions using that simple model.
  6. We now have two groups for our data, based on this selected split. Treat each group as a separate dataset, and find the best split for each by going back to step 1 for each group.
  7. Continue this process recursively, until you have reached some stopping criterion for each group–for instance, stop splitting a group further when it has only 20 items in it.
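
Before implementing anything, here is a compact sketch of what steps 1 through 7 look like as code. This is just my rough outline (not the chapter’s implementation), and find_best_split is a hypothetical helper standing in for steps 1 through 5:

def build_tree(xs, y, min_samples=20):
    # Step 7: stop splitting once a group is small enough, and store the
    # group's mean sale price as the leaf's prediction.
    if len(y) <= min_samples:
        return {"leaf": True, "pred": y.mean()}
    # Steps 1-5: find_best_split is a hypothetical helper that loops over
    # every column and every candidate value and returns the best split.
    col, val = find_best_split(xs, y)
    mask = xs[col] <= val  # step 3 (numeric case; a categorical column would use ==)
    # Step 6: treat each side as its own dataset and recurse on it.
    return {"leaf": False, "col": col, "val": val,
            "lhs": build_tree(xs[mask], y[mask], min_samples),
            "rhs": build_tree(xs[~mask], y[~mask], min_samples)}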

I’ll implement the algorithm on my own, then compare it with the algorithm Jeremy implemented in the Lesson 6 video and with sklearn.DecisionTreeRegressor.

Load the Data

I’ll follow the same steps as the textbook to load and prepare the training and validation datasets from the Blue Book for Bulldozers Kaggle Competition:

!pip install dtreeviz

from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
from fastai.tabular.all import *
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from dtreeviz.trees import *
from IPython.display import Image, display_svg, SVG

pd.options.display.max_rows = 20
pd.options.display.max_columns = 8
from pathlib import Path

creds = ''  # paste the contents of your kaggle.json API token here

cred_path = Path("~/.kaggle/kaggle.json").expanduser()
if not cred_path.exists():
  cred_path.parent.mkdir(exist_ok=True)
  cred_path.write_text(creds)
  cred_path.chmod(0o600)
import zipfile,kaggle

path = Path('bluebook-for-bulldozers')
if not path.exists():
  kaggle.api.competition_download_cli(str(path))
  zipfile.ZipFile(f'{path}.zip').extractall(path)
Downloading bluebook-for-bulldozers.zip to /content
100%|██████████| 48.4M/48.4M [00:00<00:00, 53.5MB/s]
path.ls(file_type='text')
(#7) [Path('bluebook-for-bulldozers/Test.csv'),Path('bluebook-for-bulldozers/TrainAndValid.csv'),Path('bluebook-for-bulldozers/random_forest_benchmark_test.csv'),Path('bluebook-for-bulldozers/Machine_Appendix.csv'),Path('bluebook-for-bulldozers/median_benchmark.csv'),Path('bluebook-for-bulldozers/ValidSolution.csv'),Path('bluebook-for-bulldozers/Valid.csv')]
df = pd.read_csv(path/'TrainAndValid.csv', low_memory=False)
len(df.columns)
53
sizes = 'Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact'
df['ProductSize'] = df['ProductSize'].astype('category')
# set_categories(..., inplace=True) is deprecated, so assign the result back instead
df['ProductSize'] = df['ProductSize'].cat.set_categories(sizes, ordered=True)
dep_var = 'SalePrice'
df[dep_var] = np.log(df[dep_var])
df = add_datepart(df, 'saledate')
df_test = pd.read_csv(path/'Test.csv', low_memory=False)
df_test = add_datepart(df_test, 'saledate')
' '.join(o for o in df.columns if o.startswith('sale'))
'saleYear saleMonth saleWeek saleDay saleDayofweek saleDayofyear saleIs_month_end saleIs_month_start saleIs_quarter_end saleIs_quarter_start saleIs_year_end saleIs_year_start saleElapsed'
procs = [Categorify, FillMissing]
cond = (df.saleYear<2011) | (df.saleMonth<10)
train_idx = np.where(cond)[0]
valid_idx = np.where(~cond)[0]
splits = (list(train_idx), list(valid_idx))
cont,cat = cont_cat_split(df, 1, dep_var=dep_var)
to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)
len(to.train), len(to.valid)
(404710, 7988)
to.show(3)
UsageBand fiModelDesc fiBaseModel fiSecondaryDesc fiModelSeries fiModelDescriptor ProductSize fiProductClassDesc state ProductGroup ProductGroupDesc Drive_System Enclosure Forks Pad_Type Ride_Control Stick Transmission Turbocharged Blade_Extension Blade_Width Enclosure_Type Engine_Horsepower Hydraulics Pushblock Ripper Scarifier Tip_Control Tire_Size Coupler Coupler_System Grouser_Tracks Hydraulics_Flow Track_Type Undercarriage_Pad_Width Stick_Length Thumb Pattern_Changer Grouser_Type Backhoe_Mounting Blade_Type Travel_Controls Differential_Type Steering_Controls saleIs_month_end saleIs_month_start saleIs_quarter_end saleIs_quarter_start saleIs_year_end saleIs_year_start auctioneerID_na MachineHoursCurrentMeter_na SalesID MachineID ModelID datasource auctioneerID YearMade MachineHoursCurrentMeter saleYear saleMonth saleWeek saleDay saleDayofweek saleDayofyear saleElapsed SalePrice
0 Low 521D 521 D #na# #na# #na# Wheel Loader - 110.0 to 120.0 Horsepower Alabama WL Wheel Loader #na# EROPS w AC None or Unspecified #na# None or Unspecified #na# #na# #na# #na# #na# #na# #na# 2 Valve #na# #na# #na# #na# None or Unspecified None or Unspecified #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# Standard Conventional False False False False False False False False 1139246 999089 3157 121 3.0 2004 68.0 2006 11 46 16 3 320 1.163635e+09 11.097410
1 Low 950FII 950 F II #na# Medium Wheel Loader - 150.0 to 175.0 Horsepower North Carolina WL Wheel Loader #na# EROPS w AC None or Unspecified #na# None or Unspecified #na# #na# #na# #na# #na# #na# #na# 2 Valve #na# #na# #na# #na# 23.5 None or Unspecified #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# Standard Conventional False False False False False False False False 1139248 117657 77 121 3.0 1996 4640.0 2004 3 13 26 4 86 1.080259e+09 10.950807
2 High 226 226 #na# #na# #na# #na# Skid Steer Loader - 1351.0 to 1601.0 Lb Operating Capacity New York SSL Skid Steer Loaders #na# OROPS None or Unspecified #na# #na# #na# #na# #na# #na# #na# #na# #na# Auxiliary #na# #na# #na# #na# #na# None or Unspecified None or Unspecified None or Unspecified Standard #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# False False False False False False False False 1139249 434808 7009 121 3.0 2001 2838.0 2004 2 9 26 3 57 1.077754e+09 9.210340
save_pickle('to.pkl', to)  # save the processed TabularPandas object so it can be reloaded later
to = load_pickle('to.pkl')
xs,y = to.train.xs, to.train.y
valid_xs, valid_y = to.valid.xs, to.valid.y

Decision Tree Algorithm: Before Watching Lesson 6

I haven’t watched the Lesson 6 video yet, and am assuming that Jeremy walks through how to build a decision tree from scratch in that video, so I’m trying it out first on my own. To be honest, I’m a bit embarrassed to publish this part of my learning process, since my approach is not very elegant, but I’d like to show what my current thinking is and then compare it with what I understand after learning the formal solution to this algorithm.

I’ll start by creating a mean squared error function to calculate at each split:

def mse(pred, y): return ((pred-y)**2).mean()
# MSE and prediction (the mean sale price) at the root node, before any splits
mse(y, y.mean()), y.mean()
(0.48205692, 10.104347)

I’ll walk through the algorithm step-by-step manually for a couple of columns before I create any functions or loops. The first column in the training dataset is the categorical variable UsageBand, which takes the values 0, 1, 2, or 3.

xs.columns
Index(['UsageBand', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc',
       'fiModelSeries', 'fiModelDescriptor', 'ProductSize',
       'fiProductClassDesc', 'state', 'ProductGroup', 'ProductGroupDesc',
       'Drive_System', 'Enclosure', 'Forks', 'Pad_Type', 'Ride_Control',
       'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension',
       'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics',
       'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size',
       'Coupler', 'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow',
       'Track_Type', 'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb',
       'Pattern_Changer', 'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type',
       'Travel_Controls', 'Differential_Type', 'Steering_Controls',
       'saleIs_month_end', 'saleIs_month_start', 'saleIs_quarter_end',
       'saleIs_quarter_start', 'saleIs_year_end', 'saleIs_year_start',
       'auctioneerID_na', 'MachineHoursCurrentMeter_na', 'SalesID',
       'MachineID', 'ModelID', 'datasource', 'auctioneerID', 'YearMade',
       'MachineHoursCurrentMeter', 'saleYear', 'saleMonth', 'saleWeek',
       'saleDay', 'saleDayofweek', 'saleDayofyear', 'saleElapsed'],
      dtype='object')
xs.UsageBand.unique()
array([2, 1, 3, 0], dtype=int8)

The first split I’ll choose is between rows where the UsageBand value is 0 or not:

mask = xs.UsageBand == 0
is_zero_xs = xs[mask]
is_zero_y = y[mask]
len(is_zero_xs), len(is_zero_y)
(334164, 334164)
isnt_zero_xs = xs[~mask]
isnt_zero_y = y[~mask]
len(isnt_zero_xs), len(isnt_zero_y)
(70546, 70546)

I’ll calculate the average sale price and MSE for each group:

is_zero_y.mean(), mse(is_zero_y, is_zero_y.mean())
(10.080211, 0.46968707)
isnt_zero_y.mean(), mse(isnt_zero_y, isnt_zero_y.mean())
(10.218686, 0.5248172)

I’ll put this code into a routine so I can apply it to all splits of UsageBand:

def get_cat_splits(xs, y, column):
  splits = []
  for el in xs[column].unique():
    mask = xs[column] == el
    is_el_y = y[mask]
    isnt_el_y = y[~mask]
    is_el_mse = mse(is_el_y, is_el_y.mean())
    isnt_el_mse = mse(isnt_el_y, isnt_el_y.mean())

    split = {
        str(el): is_el_mse,
        "!" + str(el): isnt_el_mse
    }

    splits.append(split)
  return splits
splits = get_cat_splits(xs, y, "UsageBand")
splits
[{'2': 0.46612886, '!2': 0.4824535},
 {'1': 0.5223263, '!1': 0.476563},
 {'3': 0.50998193, '!3': 0.47640917},
 {'0': 0.46968707, '!0': 0.5248172}]

Great! My function calculates the MSE for each split in a categorical column. However, I don’t need to store all of the splits for the column, just the best one (the one with the lowest MSE). I’ll modify my function so that it returns only the best split:

def get_cat_best_split(xs, y, column):
  best_split = []
  lowest_mse = 1000000

  for el in xs[column].unique():
    mask = xs[column] == el

    # ignore splits where either group has 0 or 1 row
    if sum(mask) == 0: continue
    if sum(mask) == 1: continue
    if sum(~mask) == 1: continue

    is_el_y = y[mask]
    isnt_el_y = y[~mask]

    is_el_mse = mse(is_el_y, is_el_y.mean())
    isnt_el_mse = mse(isnt_el_y, isnt_el_y.mean())

    if is_el_mse < lowest_mse or isnt_el_mse < lowest_mse:
      best_split = [el, [is_el_mse, isnt_el_mse]]

    lowest_mse = min(is_el_mse, isnt_el_mse, lowest_mse)
  print(column)
  return best_split
get_cat_best_split(xs, y, "UsageBand")
UsageBand
[2, [0.46612886, 0.4824535]]

I now have the category on which the split was made (is 2, isn’t 2) and the corresponding MSE for each group.

Next, I’ll find the best split for one of the continuous columns, YearMade:

xs.YearMade.unique()
array([2004, 1996, 2001, 2007, 1993, 2008, 1000, 1998, 1999, 2003, 1991,
       2000, 2005, 1995, 2006, 2002, 1984, 1988, 1980, 1992, 1987, 1997,
       1971, 1978, 1989, 1985, 1979, 1976, 1994, 1982, 1990, 1974, 1968,
       1966, 1983, 1986, 1981, 1970, 1977, 1975, 1973, 1965, 1967, 2009,
       2010, 1969, 1972, 1964, 1957, 1958, 1963, 1919, 1920, 1950, 1948,
       1952, 1942, 1956, 1954, 1953, 1955, 1959, 1960, 1961, 1962, 1951,
       1937, 1949, 1947, 2012, 2011, 2014], dtype=int16)
mask = xs.YearMade <= 2004
lte_2004_xs = xs[mask]
lte_2004_y = y[mask]
len(lte_2004_xs), len(lte_2004_y)
(364685, 364685)
gt_2004_xs = xs[~mask]
gt_2004_y = y[~mask]
len(gt_2004_xs), len(gt_2004_y)
(40025, 40025)

So far the process is pretty similar to what I did for categorical variables. I’ll calculate the average sale price and MSE for each split next:

lte_2004_y.mean(), mse(lte_2004_y, lte_2004_y.mean())
(10.072348, 0.46514994)
gt_2004_y.mean(), mse(gt_2004_y, gt_2004_y.mean())
(10.395911, 0.5417633)

Great! I’ll wrap this process into a function:

def get_cont_best_split(xs, y, column):
  best_split = []
  lowest_mse = 1000000

  for el in xs[column].unique():
    mask = xs[column] <= el

    # ignore splits where either group has 0 or 1 row
    if sum(mask) == 0: continue
    if sum(mask) == 1: continue
    if sum(~mask) == 1: continue

    lte_el_y = y[mask]
    gt_el_y = y[~mask]

    lte_el_mse = mse(lte_el_y, lte_el_y.mean())
    gt_el_mse = mse(gt_el_y, gt_el_y.mean())

    if lte_el_mse < lowest_mse or gt_el_mse < lowest_mse:
      best_split = [el, [lte_el_mse, gt_el_mse]]

    lowest_mse = min(lte_el_mse, gt_el_mse, lowest_mse)
  print(column)
  return best_split
get_cont_best_split(xs, y, "YearMade")
YearMade
[2012, [0.4820588, 0.00022279842]]

Now that I have functions to calculate the best split (by MSE) for categorical and continuous variables, I can loop through each column, apply these functions and then determine the overall best split. I’ll use a small dataset for this procedure since the loops require too much time (more than an hour) when I use the full dataset:

xs_sub = xs.sample(1000, random_state=42)
y_sub = y[xs_sub.index]
best_splits = []
for column in xs_sub.columns:
  if column in to.cat_names:
    best_splits.append([column, get_cat_best_split(xs_sub, y_sub, column)])
  if column in to.cont_names:
    best_splits.append([column, get_cont_best_split(xs_sub, y_sub, column)])
best_splits
[['UsageBand', [0, [0.4289803, 0.5418442]]],
 ['fiModelDesc', [1082, [0.0, 0.45956165]]],
 ['fiBaseModel', [72, [0.0, 0.45933586]]],
 ['fiSecondaryDesc', [139, [0.0010413192, 0.45947686]]],
 ['fiModelSeries', [22, [0.0018927432, 0.45957345]]],
 ['fiModelDescriptor', [138, [0.0011120219, 0.45861065]]],
 ['ProductSize', [6, [0.17651369, 0.45974725]]],
 ['fiProductClassDesc', [59, [0.0028960933, 0.4587686]]],
 ['state', [15, [0.024394397, 0.45930827]]],
 ['ProductGroup', [3, [0.10852089, 0.39364752]]],
 ['ProductGroupDesc', [3, [0.10852089, 0.39364752]]],
 ['Drive_System', [2, [0.0733325, 0.49420977]]],
 ['Enclosure', [3, [0.30207962, 0.37955183]]],
 ['Forks', [2, [0.28580874, 0.4628031]]],
 ['Pad_Type', [4, [0.01935081, 0.46220613]]],
 ['Ride_Control', [1, [0.10788533, 0.53090215]]],
 ['Stick', [1, [0.08875317, 0.49070317]]],
 ['Transmission', [3, [0.037590597, 0.4595029]]],
 ['Turbocharged', [2, [0.043661047, 0.46277064]]],
 ['Blade_Extension', [1, [0.6090261, 0.43294448]]],
 ['Blade_Width', [2, [0.21063821, 0.45938253]]],
 ['Enclosure_Type', [1, [0.09410498, 0.45690852]]],
 ['Engine_Horsepower', [2, [0.08047946, 0.45704415]]],
 ['Hydraulics', [7, [0.00222363, 0.4558903]]],
 ['Pushblock', [2, [0.17637664, 0.4491221]]],
 ['Ripper', [1, [0.3071829, 0.4566584]]],
 ['Scarifier', [0, [0.43313345, 0.59920347]]],
 ['Tip_Control', [0, [0.43313345, 0.59920347]]],
 ['Tire_Size', [6, [0.003916449, 0.45774794]]],
 ['Coupler', [0, [0.35867354, 0.54620147]]],
 ['Coupler_System', [2, [0.10629387, 0.45753148]]],
 ['Grouser_Tracks', [2, [0.033831615, 0.45868516]]],
 ['Hydraulics_Flow', [0, [0.39364752, 0.10852089]]],
 ['Track_Type', [1, [0.2580127, 0.45796755]]],
 ['Undercarriage_Pad_Width', [4, [0.08790448, 0.459086]]],
 ['Stick_Length', [21, [0.03558779, 0.45946068]]],
 ['Thumb', [2, [0.23120114, 0.4605685]]],
 ['Pattern_Changer', [2, [0.4319223, 0.45872542]]],
 ['Grouser_Type', [1, [0.4383151, 0.45913604]]],
 ['Backhoe_Mounting', [0, [0.46665692, 0.3895407]]],
 ['Blade_Type', [10, [0.0015298143, 0.4583235]]],
 ['Travel_Controls', [4, [0.0022710091, 0.45842946]]],
 ['Differential_Type', [3, [0.008903191, 0.4579678]]],
 ['Steering_Controls', [2, [0.39570603, 0.46438074]]],
 ['saleIs_month_end', [1, [0.46104407, 0.36129642]]],
 ['saleIs_month_start', [1, [0.46031126, 0.39876112]]],
 ['saleIs_quarter_end', [1, [0.4600184, 0.33042803]]],
 ['saleIs_quarter_start', [1, [0.45814824, 0.48365197]]],
 ['saleIs_year_end', [1, [0.45866725, nan]]],
 ['saleIs_year_start', [1, [0.45866725, nan]]],
 ['auctioneerID_na', [1, [0.45774823, 0.46984664]]],
 ['MachineHoursCurrentMeter_na', [2, [0.41661143, 0.51914454]]],
 ['SalesID', [1144455, [0.008310286, 0.45859787]]],
 ['MachineID', [17776, [0.003757841, 0.4595788]]],
 ['ModelID', [36033, [0.45857352, 0.0010936307]]],
 ['datasource', [132, [0.43562403, 0.49841937]]],
 ['auctioneerID', [0.0, [0.34691957, 0.45884383]]],
 ['YearMade', [1974, [0.3009956, 0.46007067]]],
 ['MachineHoursCurrentMeter', [14856.0, [0.459368, 0.2509487]]],
 ['saleYear', [1989, [0.09763114, 0.45981508]]],
 ['saleMonth', [8, [0.44693333, 0.48317593]]],
 ['saleWeek', [50, [0.45991018, 0.35147443]]],
 ['saleDay', [30, [0.4608025, 0.2767068]]],
 ['saleDayofweek', [4, [0.4637312, 0.39367497]]],
 ['saleDayofyear', [10, [0.0038693824, 0.45864925]]],
 ['saleElapsed', [607910400.0, [0.0004528304, 0.45936278]]]]
lowest_mse = 10000
for split in best_splits:
  if len(split[1]) > 0:
    if min(split[1][1]) < lowest_mse:
      lowest_mse = min(split[1][1])
      best_split_column = split[0]
      best_split_value = split[1][0]
best_split_column, lowest_mse, best_split_value
('fiModelDesc', 0.0, 1082)

There were numerous splits with an MSE of 0.0, so I’ll just pick the first one, which splits on fiModelDesc == 1082.

mask = xs_sub.fiModelDesc == best_split_value
left_xs = xs_sub[mask]
right_xs = xs_sub[~mask]

left_y = y_sub[mask]
right_y = y_sub[~mask]

len(left_xs), len(right_xs)
(2, 998)
left_y.mean(), right_y.mean()
(9.998797, 10.110093)
mse(left_y, left_y.mean()), mse(right_y, right_y.mean())
(0.0, 0.45956165)
left_xs
UsageBand fiModelDesc fiBaseModel fiSecondaryDesc ... saleDay saleDayofweek saleDayofyear saleElapsed
292246 0 1082 326 93 ... 25 2 84 1.237939e+09
292242 0 1082 326 93 ... 25 2 84 1.237939e+09

2 rows × 66 columns

I’ll leave the left_xs split as a leaf, since it has only two rows. I’ll continue to split right_xs:

best_splits = []
for column in right_xs.columns:
  if column in to.cat_names and column != 'fiModelDesc':
    best_splits.append([column, get_cat_best_split(right_xs, right_y, column)])
  if column in to.cont_names:
    best_splits.append([column, get_cont_best_split(right_xs, right_y, column)])
lowest_mse = 10000
for split in best_splits:
  if len(split[1]) > 0:
    if min(split[1][1]) < lowest_mse:
      lowest_mse = min(split[1][1])
      best_split_column = split[0]
      best_split_value = split[1][0]
best_split_column, lowest_mse, best_split_value
('fiBaseModel', 0.0, 72)

The next best split is for fiBaseModel:

mask = right_xs.fiBaseModel == best_split_value
left_xs = right_xs[mask]
right_xs = right_xs[~mask]

left_y = right_y[mask]
right_y = right_y[~mask]

len(left_xs), len(right_xs)
(2, 996)
left_xs
UsageBand fiModelDesc fiBaseModel fiSecondaryDesc ... saleDay saleDayofweek saleDayofyear saleElapsed
372510 0 1030 313 50 ... 17 2 321 1.289952e+09
366529 0 2214 698 31 ... 12 1 102 1.302566e+09

2 rows × 66 columns

left_y.mean(), right_y.mean()
(10.463103, 10.109385)
mse(left_y, left_y.mean()), mse(right_y, right_y.mean())
(0.0, 0.4602337)

Again, one of the splits has only two rows, so I’ll keep that as the second leaf and continue to split the other group:

best_splits = []
for column in right_xs.columns:
  if column in to.cat_names and column not in ['fiModelDesc', 'fiBaseModel']:
    best_splits.append([column, get_cat_best_split(right_xs, right_y, column)])
  if column in to.cont_names:
    best_splits.append([column, get_cont_best_split(right_xs, right_y, column)])
lowest_mse = 10000
for split in best_splits:
  if len(split[1]) > 0:
    if min(split[1][1]) < lowest_mse:
      lowest_mse = min(split[1][1])
      best_split_column = split[0]
      best_split_value = split[1][0]
best_split_column, lowest_mse, best_split_value
('saleElapsed', 0.0004528304, 607910400.0)

The final split (to reach four leaf nodes like the textbook) is on saleElapsed.

mask = right_xs.saleElapsed <= best_split_value
left_xs = right_xs[mask]
right_xs = right_xs[~mask]

left_y = right_y[mask]
right_y = right_y[~mask]

mse(left_y, left_y.mean()), mse(right_y, right_y.mean())
(0.0004528304, 0.4609359)
left_y.mean(), right_y.mean()
(9.776848, 10.110054)
len(left_xs), len(right_xs)
(2, 994)

Plotting the decision tree (manually) to visualize the splits:

[Figure: My Decision Tree Algorithm]

My model seems to be making incremental splits (with 2 rows taken off at each split) based on very small sample sizes. One question I have is whether the decision tree is always supposed to select the split with the minimum error, or whether it should choose randomly from the many columns with similarly small MSE values. Another question: am I supposed to exclude a column from future splits once a split has been made on it (which is what I’m doing currently)? I’ll be looking for these answers when watching the Lesson 6 video.

Decision Tree Algorithm: After Watching Lesson 6

The algorithm presented in the Lesson 6 video (from Jeremy’s notebook How random forests really work) evaluates a split based on a weighted standard deviation calculation.

For a given condition (for example, Age <= 6), Jeremy calculates the following:

  • A “side score” for the rows that satisfy the condition: the standard deviation of the dependent variable multiplied by the number of rows on that side.
  • A side score for the rows that don’t satisfy the condition.
  • The sum of both side scores divided by the total number of rows in the dependent variable column.

Then, for each column, he calculates the score for a split at each unique value of the column and takes the minimum score (and its corresponding split value). The column (and value) whose split has the smallest score is chosen as the best split.

Conceptually, the column (and value) with the smallest weighted standard deviation is the one where splitting on that value creates two groups in which the dependent variable values are relatively close to each other within each group.

I’ll start by coding the _side_score function Jeremy defined, which multiplies the standard deviation of the dependent variable by the number of rows on the given side of the split:

def _side_score(side, y):
    tot = side.sum()
    if tot<=1: return 0
    return y[side].std()*tot

Next, the score function, which calculates the sum of each side’s _side_score divided by the total number of rows, or in other words, the weighted average of each side’s standard deviation:

def score(col, y, split):
    lhs = col<=split
    return (_side_score(lhs,y) + _side_score(~lhs,y))/len(y)

I’ll reload the data and take the same subset as above so I can compare the splits this algorithm makes with the ones made previously:

to = load_pickle("to.pkl")
xs,y = to.train.xs, to.train.y
valid_xs, valid_y = to.valid.xs, to.valid.y
xs_sub = xs.sample(1000, random_state=42)
y_sub = y[xs_sub.index]
mse(y_sub, y_sub.mean()), y_sub.mean()
(0.45866725, 10.109871)
score(xs['UsageBand'], y, 2)
0.6922498415264088

I’ll walk through each step of the score calculation manually to visualize it better.

First we create a boolean array of rows that satisfy or don’t satisfy the condition:

lhs = xs['UsageBand'] <= 2
lhs
0          True
1          True
2          True
3          True
4         False
          ...  
412693     True
412694     True
412695     True
412696     True
412697     True
Name: UsageBand, Length: 404710, dtype: bool

Next: for how many rows is the condition met?

tot = lhs.sum()
tot
370444

What is the standard deviation of the dependent variable for rows where the condition is met?

y[lhs].std()
0.69022495

Finally, what is the side score for this condition?

side_score_lhs = y[lhs].std()*tot
side_score_lhs
255689.68972754478

We ask the same questions for the rows where the condition is NOT met:

tot = (~lhs).sum()
tot
34266
y[~lhs].std()
0.71414065
side_score_rhs = y[~lhs].std()*tot
side_score_rhs
24470.743636608124

Finally, we add the two side scores together and divide by the number of total rows:

(side_score_lhs + side_score_rhs) / len(y)
0.6922498415264088

Thankfully, this is equal to the output of the score function! I’ll continue with the code that calculates this score for a split at each of the unique values in a column:

def min_col(xs, nm, y):
    col = xs[nm]
    unq = col.dropna().unique()
    scores = np.array([score(col, y, o) for o in unq if not np.isnan(o)])
    idx = scores.argmin()
    return unq[idx],scores[idx]

Let’s test it out on UsageBand:

min_col(xs, 'UsageBand', y)
(0, 0.6921555579861831)

Again, I’ll go line-by-line through the code of min_col to make sure I understand what’s going on:

col = xs['UsageBand']
col
0         2
1         2
2         1
3         1
4         3
         ..
412693    0
412694    0
412695    0
412696    0
412697    0
Name: UsageBand, Length: 404710, dtype: int8

What are the unique values in this column?

unq = col.dropna().unique()
unq
array([2, 1, 3, 0], dtype=int8)

If we split the data on each of the unique values, what is the score for each split?

scores = np.array([score(col, y, o) for o in unq if not np.isnan(o)])
scores
array([0.69224984, 0.69378292, 0.69430405, 0.69215556])

What is the minimum score, and what value does it correspond to?

idx = scores.argmin()
idx
3
unq[idx], scores[idx]
(0, 0.6921555579861831)

Next, I’ll iterate through all of the columns in the data and calculate the minimum score (and corresponding value) for each one. I’ll use the smaller subset of the data since the dictionary comprehension was taking more than an hour to compute on the full dataset.

col_scores = {o: min_col(xs_sub, o, y_sub) for o in xs_sub.columns}
col_scores
{'UsageBand': (0, 0.6710787683725357),
 'fiModelDesc': (150, 0.6684566705226899),
 'fiBaseModel': (30, 0.66775039935112),
 'fiSecondaryDesc': (15, 0.6461849688887596),
 'fiModelSeries': (60, 0.6612360656261445),
 'fiModelDescriptor': (9, 0.6385975532531738),
 'ProductSize': (0, 0.649804151058197),
 'fiProductClassDesc': (7, 0.6468342208862304),
 'state': (52, 0.6769107410311699),
 'ProductGroup': (3, 0.6413102557659149),
 'ProductGroupDesc': (3, 0.6413102557659149),
 'Drive_System': (3, 0.6595019570589066),
 'Enclosure': (3, 0.6443507042527199),
 'Forks': (0, 0.6468727025985718),
 'Pad_Type': (0, 0.6484307520389557),
 'Ride_Control': (0, 0.6693848296999931),
 'Stick': (0, 0.6484307520389557),
 'Transmission': (6, 0.6698495596647263),
 'Turbocharged': (0, 0.6484307520389557),
 'Blade_Extension': (0, 0.6660390158891678),
 'Blade_Width': (2, 0.665115752696991),
 'Enclosure_Type': (0, 0.6660390158891678),
 'Engine_Horsepower': (0, 0.6660390158891678),
 'Hydraulics': (0, 0.6497658799290658),
 'Pushblock': (0, 0.6660390158891678),
 'Ripper': (0, 0.6618279729485512),
 'Scarifier': (0, 0.6660390158891678),
 'Tip_Control': (0, 0.6660390158891678),
 'Tire_Size': (3, 0.6607923060655594),
 'Coupler': (2, 0.6719830477237702),
 'Coupler_System': (0, 0.5992864975929261),
 'Grouser_Tracks': (0, 0.5992864975929261),
 'Hydraulics_Flow': (0, 0.5992864975929261),
 'Track_Type': (1, 0.6633642191886902),
 'Undercarriage_Pad_Width': (5, 0.6722126321792603),
 'Stick_Length': (1, 0.6726506268978119),
 'Thumb': (0, 0.6727646491527557),
 'Pattern_Changer': (0, 0.6727646491527557),
 'Grouser_Type': (0, 0.6727646491527557),
 'Backhoe_Mounting': (0, 0.6719208754301071),
 'Blade_Type': (5, 0.6712375184893609),
 'Travel_Controls': (2, 0.6719208754301071),
 'Differential_Type': (0, 0.6718101676702499),
 'Steering_Controls': (0, 0.6718101676702499),
 'saleIs_month_end': (1, 0.6775472568273544),
 'saleIs_month_start': (1, 0.6773856681585312),
 'saleIs_quarter_end': (1, 0.6774418947696685),
 'saleIs_quarter_start': (2, 0.6775886416435242),
 'saleIs_year_end': (1, 0.6775886416435242),
 'saleIs_year_start': (1, 0.6775886416435242),
 'auctioneerID_na': (2, 0.6775886416435242),
 'MachineHoursCurrentMeter_na': (1, 0.6731610369682312),
 'SalesID': (1614635, 0.6669842964410782),
 'MachineID': (1010754, 0.6537257554531097),
 'ModelID': (4336, 0.6421609473228455),
 'datasource': (132, 0.674677339553833),
 'auctioneerID': (1.0, 0.674844207584858),
 'YearMade': (1974, 0.6595817529559136),
 'MachineHoursCurrentMeter': (3360.0, 0.6672791358232498),
 'saleYear': (1991, 0.6751361751556396),
 'saleMonth': (4, 0.6770030355453491),
 'saleWeek': (17, 0.6766363668441773),
 'saleDay': (10, 0.6758146023750305),
 'saleDayofweek': (4, 0.6748345303535461),
 'saleDayofyear': (10, 0.6763967939466238),
 'saleElapsed': (688348800.0, 0.6746526069641113)}

I’ll then find the split across all columns with the lowest score:

min_col_score = min(col_scores.values())
min_col_names = [key for key in col_scores if col_scores[key] == min_col_score]
min_col_score, min_col_names
((0, 0.5992864975929261),
 ['Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow'])

The lowest score came from the <= 0 split for three columns which, interestingly enough, were found to be redundant features in the textbook chapter. I’ll pick Coupler_System since that’s the split used in the textbook.

xs_sub.Coupler_System.unique()
array([0, 1, 2], dtype=int8)

To answer one of my earlier questions: I think we only remove a column from future splits if it is a binary column (such as the Sex column in the Titanic dataset), since there are only two possible groups, and once the data has been split into those two groups there’s nothing left to split on in that column.

mask = xs_sub.Coupler_System <= 0
lhs = xs_sub[mask]
rhs = xs_sub[~mask]

lhs_y = y_sub[mask]
rhs_y = y_sub[~mask]

len(lhs), len(rhs)
(904, 96)
lhs_y.mean(), rhs_y.mean()
(10.208925, 9.177121)
mse(lhs_y, lhs_y.mean()), mse(rhs_y, rhs_y.mean())
(0.39364752, 0.10852089)

The left-hand side has many more rows, so I’ll continue splitting that dataset and keep rhs as a leaf node.

Coupler_System only has one value, 0, in this group, so I’ll remove it from future splits.

lhs.Coupler_System.unique()
array([0], dtype=int8)
lhs = lhs.drop("Coupler_System", axis=1)
'Coupler_System' in lhs.columns
False
col_scores = {o: min_col(lhs, o, lhs_y) for o in lhs.columns}
min_col_score = min(col_scores.values())
min_col_names = [key for key in col_scores if col_scores[key] == min_col_score]
min_col_score, min_col_names
((0, 0.5869915567140663), ['Pad_Type', 'Stick', 'Turbocharged'])

Again, we have three columns with the same split score. I’ll choose Pad_Type as the column to split on.

lhs.Pad_Type.unique()
array([2, 0, 3, 4], dtype=int8)
mask = lhs.Pad_Type <= 0
rhs = lhs[~mask]
lhs = lhs[mask]

rhs_y = lhs_y[~mask]
lhs_y = lhs_y[mask]

len(lhs), len(rhs)
(700, 204)
lhs_y.mean(), rhs_y.mean()
(10.30318, 9.885499)
mse(lhs_y, lhs_y.mean()), mse(rhs_y, rhs_y.mean())
(0.43736917, 0.108533934)

Again, the left-hand side has more rows, so I’ll continue to split it and keep rhs as the second leaf node. I’ll also remove Pad_Type from lhs since it now has the single value 0.

lhs = lhs.drop("Pad_Type", axis=1)
'Pad_Type' in lhs.columns
False
col_scores = {o: min_col(lhs, o, lhs_y) for o in lhs.columns}
min_col_score = min(col_scores.values())
min_col_names = [key for key in col_scores if col_scores[key] == min_col_score]
min_col_score, min_col_names
((0, 0.6552193513938359),
 ['Drive_System',
  'Blade_Extension',
  'Enclosure_Type',
  'Engine_Horsepower',
  'Scarifier',
  'Tip_Control'])

I now have six columns with the same split score. I’ll choose the first one, Drive_System, to make my final split and get my last two leaf nodes.

mask = lhs.Drive_System <= 0
rhs = lhs[~mask]
lhs = lhs[mask]

rhs_y = lhs_y[~mask]
lhs_y = lhs_y[mask]

len(lhs), len(rhs)
(638, 62)
lhs_y.mean(), rhs_y.mean()
(10.275307, 10.590003)
mse(lhs_y, lhs_y.mean()), mse(rhs_y, rhs_y.mean())
(0.41287065, 0.59920347)

Here is a manually drawn visualization of this four-leaf-node decision tree:

[Figure: Jeremy’s Decision Tree]

Decision Tree Algorithm: Using sklearn

This is the last decision tree I’ll create, for comparison with the other two approaches.

m = DecisionTreeRegressor(max_leaf_nodes=4).fit(xs_sub, y_sub);
from sklearn.tree import export_graphviz

import graphviz

def draw_tree(t, df, size=10, ratio=0.6, precision=2, **kwargs):
    s=export_graphviz(t, out_file=None, feature_names=df.columns, filled=True, rounded=True,
                      special_characters=True, rotate=False, precision=precision, **kwargs)
    return graphviz.Source(re.sub('Tree {', f'Tree {{ size={size}; ratio={ratio}', s))
draw_tree(m, xs_sub, size=10)

Final Thoughts

In this exercise, I tried three different algorithms to create a four-leaf-node decision tree on a 1000-row subset of the Blue Book for Bulldozers Kaggle competition dataset:

  • An algorithm I created where the model made splits based on the lowest MSE.
  • Jeremy’s “side score” algorithm which found the split with the lowest weighted standard deviation.
  • sklearn.DecisionTreeRegressor, which also uses MSE (squared error) to decide splits.

[Figure: Decision Tree Comparison]

Here are some key observations:

  • My algorithm made splits based on very small groups of rows. To improve this, I would either explicitly ignore splits that leave too few rows on one side, or penalize such splits by weighting each side by its number of rows, like Jeremy does in _side_score (see the first sketch after this list).
  • The sklearn.DecisionTreeRegressor had the same initial split as Jeremy’s (on Coupler_System) but then chose two completely different columns to split on for the remaining leaf nodes. I wonder if this difference in column selection is due to the difference in how a split is scored between those two algorithms.
  • A future improvement I could make to this experiment is to compare the MSE on a validation set for each decision tree.
  • Another future improvement would be to make my algorithm and Jeremy’s algorithm run faster, so that I can test them on the full dataset in a reasonable amount of time (see the second sketch after this list).
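
To illustrate the first improvement, here is a minimal sketch (my own, not code from the chapter or lesson) of an MSE-based split score that skips tiny groups and weights each side by its row count, similar in spirit to Jeremy’s _side_score. Lower is better, and the min_samples threshold is an arbitrary choice:

def weighted_mse_score(col, y, split_val, min_samples=20):
    lhs = col <= split_val
    n_lhs, n_rhs = lhs.sum(), (~lhs).sum()
    # Ignore splits that leave too few rows on either side.
    if n_lhs < min_samples or n_rhs < min_samples: return float('inf')
    # Weight each side's MSE by its row count so a tiny, "lucky" group
    # can no longer dominate the choice of split.
    lhs_mse = mse(y[lhs], y[lhs].mean())
    rhs_mse = mse(y[~lhs], y[~lhs].mean())
    return (lhs_mse*n_lhs + rhs_mse*n_rhs) / len(y)

With this score, my get_cont_best_split loop (and, with == in place of <=, get_cat_best_split) could simply keep the split with the lowest weighted_mse_score instead of comparing each side’s MSE separately.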
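
For the second improvement, the weighted standard deviation score can be computed for every candidate threshold of a numeric column in a single sorted pass, using running sums of y and y**2 instead of re-slicing the data once per unique value. This is just a sketch of the idea (it uses the population standard deviation, so its scores will differ slightly from the pandas .std() used in _side_score):

def fast_min_col(col, y):
    col, y = col.to_numpy(dtype=float), y.to_numpy(dtype=float)
    order = np.argsort(col, kind='stable')
    col, y = col[order], y[order]
    n = len(y)
    csum, csum2 = np.cumsum(y), np.cumsum(y**2)  # running sums of y and y**2
    best_val, best_score = None, float('inf')
    for i in range(1, n):
        if col[i] == col[i-1]: continue  # only consider splits between distinct values
        n_lhs, n_rhs = i, n-i
        sum_lhs, sum_rhs = csum[i-1], csum[-1]-csum[i-1]
        sq_lhs, sq_rhs = csum2[i-1], csum2[-1]-csum2[i-1]
        # variance = E[y**2] - E[y]**2 for each side, computed from the running sums
        var_lhs = max(sq_lhs/n_lhs - (sum_lhs/n_lhs)**2, 0.0)
        var_rhs = max(sq_rhs/n_rhs - (sum_rhs/n_rhs)**2, 0.0)
        score = (np.sqrt(var_lhs)*n_lhs + np.sqrt(var_rhs)*n_rhs) / n
        if score < best_score: best_val, best_score = col[i-1], score
    return best_val, best_score

Sorting costs O(n log n) per column and each candidate threshold is then O(1), so this should scale to the full training set much better than recomputing .std() on two slices for every unique value.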

This was another challenging and enjoyable “Further Research” exercise provided by the fastai textbook. I hope you enjoyed this blog post!