Implementing a Decision Tree Algorithm from Scratch

machine learning, python

In this notebook I implement a decision tree algorithm from scratch and compare it to the algorithm from Lesson 6 of the fastai course.

Author: Vishal Bakshi
Published: September 28, 2023

Background

In this blog post I’ll work through the second exercise of the “Further Research” section in Chapter 9 of the fastai textbook:

Implement the decision tree algorithm in this chapter from scratch yourself, and try it on the dataset you used in the first exercise.

Here is the decision tree algorithm as the textbook describes it for the Blue Book for Bulldozers Kaggle competition:

  1. Loop through each column of the dataset in turn.
  2. For each column, loop through each possible level of that column in turn.
  3. Try splitting the data into two groups, based on whether they are greater than or less than that value (or if it is a categorical variable, based on whether they are equal to or not equal to that level of that categorical variable).
  4. Find the average sale price for each of those two groups, and see how close that is to the actual sale price of each of the items of equipment in that group. Treat this as a very simple “model” in which our predictions are simply the average sale price of the item’s group.
  5. After looping through all of the columns and all the possible levels for each, pick the split point that gave the best predictions using that simple model.
  6. We now have two groups for our data, based on this selected split. Treat each group as a separate dataset, and find the best split for each by going back to step 1 for each group.
  7. Continue this process recursively, until you have reached some stopping criterion for each group–for instance, stop splitting a group further when it has only 20 items in it.
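
Before implementing anything, here is a compact sketch of what steps 1 through 7 look like as code. This is just my rough outline (not the chapter’s implementation), and find_best_split is a hypothetical helper standing in for steps 1 through 5:

def build_tree(xs, y, min_samples=20):
    # Step 7: stop splitting once a group is small enough, and store the
    # group's mean sale price as the leaf's prediction.
    if len(y) <= min_samples:
        return {"leaf": True, "pred": y.mean()}
    # Steps 1-5: find_best_split is a hypothetical helper that loops over
    # every column and every candidate value and returns the best split.
    col, val = find_best_split(xs, y)
    mask = xs[col] <= val  # step 3 (numeric case; a categorical column would use ==)
    # Step 6: treat each side as its own dataset and recurse on it.
    return {"leaf": False, "col": col, "val": val,
            "lhs": build_tree(xs[mask], y[mask], min_samples),
            "rhs": build_tree(xs[~mask], y[~mask], min_samples)}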

I’ll implement the algorithm on my own, then compare it with the algorithm Jeremy implemented in the Lesson 6 video and with sklearn.DecisionTreeRegressor.

Load the Data

I’ll follow the same steps as the textbook to load and prepare the training and validation datasets from the Blue Book for Bulldozers Kaggle Competition:

!pip install dtreeviz

from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
from fastai.tabular.all import *
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from dtreeviz.trees import *
from IPython.display import Image, display_svg, SVG

pd.options.display.max_rows = 20
pd.options.display.max_columns = 8
from pathlib import Path

creds = ''  # paste the contents of your kaggle.json API token here

cred_path = Path("~/.kaggle/kaggle.json").expanduser()
if not cred_path.exists():
  cred_path.parent.mkdir(exist_ok=True)
  cred_path.write_text(creds)
  cred_path.chmod(0o600)
import zipfile,kaggle

path = Path('bluebook-for-bulldozers')
if not path.exists():
  kaggle.api.competition_download_cli(str(path))
  zipfile.ZipFile(f'{path}.zip').extractall(path)
Downloading bluebook-for-bulldozers.zip to /content
100%|██████████| 48.4M/48.4M [00:00<00:00, 53.5MB/s]
path.ls(file_type='text')
(#7) [Path('bluebook-for-bulldozers/Test.csv'),Path('bluebook-for-bulldozers/TrainAndValid.csv'),Path('bluebook-for-bulldozers/random_forest_benchmark_test.csv'),Path('bluebook-for-bulldozers/Machine_Appendix.csv'),Path('bluebook-for-bulldozers/median_benchmark.csv'),Path('bluebook-for-bulldozers/ValidSolution.csv'),Path('bluebook-for-bulldozers/Valid.csv')]
df = pd.read_csv(path/'TrainAndValid.csv', low_memory=False)
len(df.columns)
53
sizes = 'Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact'
df['ProductSize'] = df['ProductSize'].astype('category')
# set_categories(..., inplace=True) is deprecated, so assign the result back instead
df['ProductSize'] = df['ProductSize'].cat.set_categories(sizes, ordered=True)
dep_var = 'SalePrice'
df[dep_var] = np.log(df[dep_var])
df = add_datepart(df, 'saledate')
df_test = pd.read_csv(path/'Test.csv', low_memory=False)
df_test = add_datepart(df_test, 'saledate')
' '.join(o for o in df.columns if o.startswith('sale'))
'saleYear saleMonth saleWeek saleDay saleDayofweek saleDayofyear saleIs_month_end saleIs_month_start saleIs_quarter_end saleIs_quarter_start saleIs_year_end saleIs_year_start saleElapsed'
procs = [Categorify, FillMissing]
cond = (df.saleYear<2011) | (df.saleMonth<10)
train_idx = np.where(cond)[0]
valid_idx = np.where(~cond)[0]
splits = (list(train_idx), list(valid_idx))
cont,cat = cont_cat_split(df, 1, dep_var=dep_var)
to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)
len(to.train), len(to.valid)
(404710, 7988)
to.show(3)
UsageBand fiModelDesc fiBaseModel fiSecondaryDesc fiModelSeries fiModelDescriptor ProductSize fiProductClassDesc state ProductGroup ProductGroupDesc Drive_System Enclosure Forks Pad_Type Ride_Control Stick Transmission Turbocharged Blade_Extension Blade_Width Enclosure_Type Engine_Horsepower Hydraulics Pushblock Ripper Scarifier Tip_Control Tire_Size Coupler Coupler_System Grouser_Tracks Hydraulics_Flow Track_Type Undercarriage_Pad_Width Stick_Length Thumb Pattern_Changer Grouser_Type Backhoe_Mounting Blade_Type Travel_Controls Differential_Type Steering_Controls saleIs_month_end saleIs_month_start saleIs_quarter_end saleIs_quarter_start saleIs_year_end saleIs_year_start auctioneerID_na MachineHoursCurrentMeter_na SalesID MachineID ModelID datasource auctioneerID YearMade MachineHoursCurrentMeter saleYear saleMonth saleWeek saleDay saleDayofweek saleDayofyear saleElapsed SalePrice
0 Low 521D 521 D #na# #na# #na# Wheel Loader - 110.0 to 120.0 Horsepower Alabama WL Wheel Loader #na# EROPS w AC None or Unspecified #na# None or Unspecified #na# #na# #na# #na# #na# #na# #na# 2 Valve #na# #na# #na# #na# None or Unspecified None or Unspecified #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# Standard Conventional False False False False False False False False 1139246 999089 3157 121 3.0 2004 68.0 2006 11 46 16 3 320 1.163635e+09 11.097410
1 Low 950FII 950 F II #na# Medium Wheel Loader - 150.0 to 175.0 Horsepower North Carolina WL Wheel Loader #na# EROPS w AC None or Unspecified #na# None or Unspecified #na# #na# #na# #na# #na# #na# #na# 2 Valve #na# #na# #na# #na# 23.5 None or Unspecified #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# Standard Conventional False False False False False False False False 1139248 117657 77 121 3.0 1996 4640.0 2004 3 13 26 4 86 1.080259e+09 10.950807
2 High 226 226 #na# #na# #na# #na# Skid Steer Loader - 1351.0 to 1601.0 Lb Operating Capacity New York SSL Skid Steer Loaders #na# OROPS None or Unspecified #na# #na# #na# #na# #na# #na# #na# #na# #na# Auxiliary #na# #na# #na# #na# #na# None or Unspecified None or Unspecified None or Unspecified Standard #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# False False False False False False False False 1139249 434808 7009 121 3.0 2001 2838.0 2004 2 9 26 3 57 1.077754e+09 9.210340
save_pickle('to.pkl', to)  # save the processed TabularPandas object so it can be reloaded later
to = load_pickle('to.pkl')
xs,y = to.train.xs, to.train.y
valid_xs, valid_y = to.valid.xs, to.valid.y

Decision Tree Algorithm: Before Watching Lesson 6

I haven’t watched the Lesson 6 video yet, and am assuming that Jeremy walks through how to build a decision tree from scratch in that video, so I’m trying it out first on my own. To be honest, I’m a bit embarrassed to publish this part of my learning process, since my approach is not very elegant, but I’d like to show what my current thinking is and then compare it with what I understand after learning the formal solution to this algorithm.

I’ll start by creating a mean squared error function to calculate at each split:

def mse(pred, y): return ((pred-y)**2).mean()
# MSE and prediction (the mean sale price) at the root node, before any splits
mse(y, y.mean()), y.mean()
(0.48205692, 10.104347)

I’ll walk through the algorithm step-by-step manually for a couple of columns before I create any functions or loops. The first column in the training dataset is the categorical variable UsageBand, which takes the values 0, 1, 2, or 3.

xs.columns
Index(['UsageBand', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc',
       'fiModelSeries', 'fiModelDescriptor', 'ProductSize',
       'fiProductClassDesc', 'state', 'ProductGroup', 'ProductGroupDesc',
       'Drive_System', 'Enclosure', 'Forks', 'Pad_Type', 'Ride_Control',
       'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension',
       'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics',
       'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size',
       'Coupler', 'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow',
       'Track_Type', 'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb',
       'Pattern_Changer', 'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type',
       'Travel_Controls', 'Differential_Type', 'Steering_Controls',
       'saleIs_month_end', 'saleIs_month_start', 'saleIs_quarter_end',
       'saleIs_quarter_start', 'saleIs_year_end', 'saleIs_year_start',
       'auctioneerID_na', 'MachineHoursCurrentMeter_na', 'SalesID',
       'MachineID', 'ModelID', 'datasource', 'auctioneerID', 'YearMade',
       'MachineHoursCurrentMeter', 'saleYear', 'saleMonth', 'saleWeek',
       'saleDay', 'saleDayofweek', 'saleDayofyear', 'saleElapsed'],
      dtype='object')
xs.UsageBand.unique()
array([2, 1, 3, 0], dtype=int8)

The first split I’ll choose is between rows where the UsageBand value is 0 or not:

mask = xs.UsageBand == 0
is_zero_xs = xs[mask]
is_zero_y = y[mask]
len(is_zero_xs), len(is_zero_y)
(334164, 334164)
isnt_zero_xs = xs[~mask]
isnt_zero_y = y[~mask]
len(isnt_zero_xs), len(isnt_zero_y)
(70546, 70546)

I’ll calculate the average sale price and MSE for each group:

is_zero_y.mean(), mse(is_zero_y, is_zero_y.mean())
(10.080211, 0.46968707)
isnt_zero_y.mean(), mse(isnt_zero_y, isnt_zero_y.mean())
(10.218686, 0.5248172)

I’ll put this code into a routine so I can apply it to all splits of UsageBand:

def get_cat_splits(xs, y, column):
  splits = []
  for el in xs[column].unique():
    mask = xs[column] == el
    is_el_y = y[mask]
    isnt_el_y = y[~mask]
    is_el_mse = mse(is_el_y, is_el_y.mean())
    isnt_el_mse = mse(isnt_el_y, isnt_el_y.mean())

    split = {
        str(el): is_el_mse,
        "!" + str(el): isnt_el_mse
    }

    splits.append(split)
  return splits
splits = get_cat_splits(xs, y, "UsageBand")
splits
[{'2': 0.46612886, '!2': 0.4824535},
 {'1': 0.5223263, '!1': 0.476563},
 {'3': 0.50998193, '!3': 0.47640917},
 {'0': 0.46968707, '!0': 0.5248172}]

Great! My function calculates the MSE for each split in a categorical column. However, I don’t need to store all of the splits for the column, just the best one (the one with the lowest MSE). I’ll modify my function so that it returns only the best split:

def get_cat_best_split(xs, y, column):
  best_split = []
  lowest_mse = 1000000

  for el in xs[column].unique():
    mask = xs[column] == el

    # ignore splits where either group has 0 or 1 row
    if sum(mask) == 0: continue
    if sum(mask) == 1: continue
    if sum(~mask) == 1: continue

    is_el_y = y[mask]
    isnt_el_y = y[~mask]

    is_el_mse = mse(is_el_y, is_el_y.mean())
    isnt_el_mse = mse(isnt_el_y, isnt_el_y.mean())

    if is_el_mse < lowest_mse or isnt_el_mse < lowest_mse:
      best_split = [el, [is_el_mse, isnt_el_mse]]

    lowest_mse = min(is_el_mse, isnt_el_mse, lowest_mse)
  print(column)
  return best_split
get_cat_best_split(xs, y, "UsageBand")
UsageBand
[2, [0.46612886, 0.4824535]]

I now have the category on which the split was made (is 2, isn’t 2) and the corresponding MSE for each group.

Next, I’ll find the best split for one of the continuous columns, YearMade:

xs.YearMade.unique()
array([2004, 1996, 2001, 2007, 1993, 2008, 1000, 1998, 1999, 2003, 1991,
       2000, 2005, 1995, 2006, 2002, 1984, 1988, 1980, 1992, 1987, 1997,
       1971, 1978, 1989, 1985, 1979, 1976, 1994, 1982, 1990, 1974, 1968,
       1966, 1983, 1986, 1981, 1970, 1977, 1975, 1973, 1965, 1967, 2009,
       2010, 1969, 1972, 1964, 1957, 1958, 1963, 1919, 1920, 1950, 1948,
       1952, 1942, 1956, 1954, 1953, 1955, 1959, 1960, 1961, 1962, 1951,
       1937, 1949, 1947, 2012, 2011, 2014], dtype=int16)
mask = xs.YearMade <= 2004
lte_2004_xs = xs[mask]
lte_2004_y = y[mask]
len(lte_2004_xs), len(lte_2004_y)
(364685, 364685)
gt_2004_xs = xs[~mask]
gt_2004_y = y[~mask]
len(gt_2004_xs), len(gt_2004_y)
(40025, 40025)

So far the process is pretty similar to what I did for categorical variables. I’ll calculate the average sale price and MSE for each split next:

lte_2004_y.mean(), mse(lte_2004_y, lte_2004_y.mean())
(10.072348, 0.46514994)
gt_2004_y.mean(), mse(gt_2004_y, gt_2004_y.mean())
(10.395911, 0.5417633)

Great! I’ll wrap this process into a function:

def get_cont_best_split(xs, y, column):
  best_split = []
  lowest_mse = 1000000

  for el in xs[column].unique():
    mask = xs[column] <= el

    # ignore splits where either group has 0 or 1 row
    if sum(mask) == 0: continue
    if sum(mask) == 1: continue
    if sum(~mask) == 1: continue

    lte_el_y = y[mask]
    gt_el_y = y[~mask]

    lte_el_mse = mse(lte_el_y, lte_el_y.mean())
    gt_el_mse = mse(gt_el_y, gt_el_y.mean())

    if lte_el_mse < lowest_mse or gt_el_mse < lowest_mse:
      best_split = [el, [lte_el_mse, gt_el_mse]]

    lowest_mse = min(lte_el_mse, gt_el_mse, lowest_mse)
  print(column)
  return best_split
get_cont_best_split(xs, y, "YearMade")
YearMade
[2012, [0.4820588, 0.00022279842]]

Now that I have functions to calculate the best split (by MSE) for categorical and continuous variables, I can loop through each column, apply these functions and then determine the overall best split. I’ll use a small dataset for this procedure since the loops require too much time (more than an hour) when I use the full dataset:

xs_sub = xs.sample(1000, random_state=42)
y_sub = y[xs_sub.index]
best_splits = []
for column in xs_sub.columns:
  if column in to.cat_names:
    best_splits.append([column, get_cat_best_split(xs_sub, y_sub, column)])
  if column in to.cont_names:
    best_splits.append([column, get_cont_best_split(xs_sub, y_sub, column)])
best_splits
[['UsageBand', [0, [0.4289803, 0.5418442]]],
 ['fiModelDesc', [1082, [0.0, 0.45956165]]],
 ['fiBaseModel', [72, [0.0, 0.45933586]]],
 ['fiSecondaryDesc', [139, [0.0010413192, 0.45947686]]],
 ['fiModelSeries', [22, [0.0018927432, 0.45957345]]],
 ['fiModelDescriptor', [138, [0.0011120219, 0.45861065]]],
 ['ProductSize', [6, [0.17651369, 0.45974725]]],
 ['fiProductClassDesc', [59, [0.0028960933, 0.4587686]]],
 ['state', [15, [0.024394397, 0.45930827]]],
 ['ProductGroup', [3, [0.10852089, 0.39364752]]],
 ['ProductGroupDesc', [3, [0.10852089, 0.39364752]]],
 ['Drive_System', [2, [0.0733325, 0.49420977]]],
 ['Enclosure', [3, [0.30207962, 0.37955183]]],
 ['Forks', [2, [0.28580874, 0.4628031]]],
 ['Pad_Type', [4, [0.01935081, 0.46220613]]],
 ['Ride_Control', [1, [0.10788533, 0.53090215]]],
 ['Stick', [1, [0.08875317, 0.49070317]]],
 ['Transmission', [3, [0.037590597, 0.4595029]]],
 ['Turbocharged', [2, [0.043661047, 0.46277064]]],
 ['Blade_Extension', [1, [0.6090261, 0.43294448]]],
 ['Blade_Width', [2, [0.21063821, 0.45938253]]],
 ['Enclosure_Type', [1, [0.09410498, 0.45690852]]],
 ['Engine_Horsepower', [2, [0.08047946, 0.45704415]]],
 ['Hydraulics', [7, [0.00222363, 0.4558903]]],
 ['Pushblock', [2, [0.17637664, 0.4491221]]],
 ['Ripper', [1, [0.3071829, 0.4566584]]],
 ['Scarifier', [0, [0.43313345, 0.59920347]]],
 ['Tip_Control', [0, [0.43313345, 0.59920347]]],
 ['Tire_Size', [6, [0.003916449, 0.45774794]]],
 ['Coupler', [0, [0.35867354, 0.54620147]]],
 ['Coupler_System', [2, [0.10629387, 0.45753148]]],
 ['Grouser_Tracks', [2, [0.033831615, 0.45868516]]],
 ['Hydraulics_Flow', [0, [0.39364752, 0.10852089]]],
 ['Track_Type', [1, [0.2580127, 0.45796755]]],
 ['Undercarriage_Pad_Width', [4, [0.08790448, 0.459086]]],
 ['Stick_Length', [21, [0.03558779, 0.45946068]]],
 ['Thumb', [2, [0.23120114, 0.4605685]]],
 ['Pattern_Changer', [2, [0.4319223, 0.45872542]]],
 ['Grouser_Type', [1, [0.4383151, 0.45913604]]],
 ['Backhoe_Mounting', [0, [0.46665692, 0.3895407]]],
 ['Blade_Type', [10, [0.0015298143, 0.4583235]]],
 ['Travel_Controls', [4, [0.0022710091, 0.45842946]]],
 ['Differential_Type', [3, [0.008903191, 0.4579678]]],
 ['Steering_Controls', [2, [0.39570603, 0.46438074]]],
 ['saleIs_month_end', [1, [0.46104407, 0.36129642]]],
 ['saleIs_month_start', [1, [0.46031126, 0.39876112]]],
 ['saleIs_quarter_end', [1, [0.4600184, 0.33042803]]],
 ['saleIs_quarter_start', [1, [0.45814824, 0.48365197]]],
 ['saleIs_year_end', [1, [0.45866725, nan]]],
 ['saleIs_year_start', [1, [0.45866725, nan]]],
 ['auctioneerID_na', [1, [0.45774823, 0.46984664]]],
 ['MachineHoursCurrentMeter_na', [2, [0.41661143, 0.51914454]]],
 ['SalesID', [1144455, [0.008310286, 0.45859787]]],
 ['MachineID', [17776, [0.003757841, 0.4595788]]],
 ['ModelID', [36033, [0.45857352, 0.0010936307]]],
 ['datasource', [132, [0.43562403, 0.49841937]]],
 ['auctioneerID', [0.0, [0.34691957, 0.45884383]]],
 ['YearMade', [1974, [0.3009956, 0.46007067]]],
 ['MachineHoursCurrentMeter', [14856.0, [0.459368, 0.2509487]]],
 ['saleYear', [1989, [0.09763114, 0.45981508]]],
 ['saleMonth', [8, [0.44693333, 0.48317593]]],
 ['saleWeek', [50, [0.45991018, 0.35147443]]],
 ['saleDay', [30, [0.4608025, 0.2767068]]],
 ['saleDayofweek', [4, [0.4637312, 0.39367497]]],
 ['saleDayofyear', [10, [0.0038693824, 0.45864925]]],
 ['saleElapsed', [607910400.0, [0.0004528304, 0.45936278]]]]
lowest_mse = 10000
for split in best_splits:
  if len(split[1]) > 0:
    if min(split[1][1]) < lowest_mse:
      lowest_mse = min(split[1][1])
      best_split_column = split[0]
      best_split_value = split[1][0]
best_split_column, lowest_mse, best_split_value
('fiModelDesc', 0.0, 1082)

There were numerous splits with an MSE of 0.0, so I’ll just pick the first one, which splits on fiModelDesc == 1082.

mask = xs_sub.fiModelDesc == best_split_value
left_xs = xs_sub[mask]
right_xs = xs_sub[~mask]

left_y = y_sub[mask]
right_y = y_sub[~mask]

len(left_xs), len(right_xs)
(2, 998)
left_y.mean(), right_y.mean()
(9.998797, 10.110093)
mse(left_y, left_y.mean()), mse(right_y, right_y.mean())
(0.0, 0.45956165)
left_xs
UsageBand fiModelDesc fiBaseModel fiSecondaryDesc ... saleDay saleDayofweek saleDayofyear saleElapsed
292246 0 1082 326 93 ... 25 2 84 1.237939e+09
292242 0 1082 326 93 ... 25 2 84 1.237939e+09

2 rows × 66 columns

I’ll leave the left_xs split as a leaf, since it has only two rows. I’ll continue to split right_xs:

best_splits = []
for column in right_xs.columns:
  if column in to.cat_names and column != 'fiModelDesc':
    best_splits.append([column, get_cat_best_split(right_xs, right_y, column)])
  if column in to.cont_names:
    best_splits.append([column, get_cont_best_split(right_xs, right_y, column)])
lowest_mse = 10000
for split in best_splits:
  if len(split[1]) > 0:
    if min(split[1][1]) < lowest_mse:
      lowest_mse = min(split[1][1])
      best_split_column = split[0]
      best_split_value = split[1][0]
best_split_column, lowest_mse, best_split_value
('fiBaseModel', 0.0, 72)

The next best split is for fiBaseModel:

mask = right_xs.fiBaseModel == best_split_value
left_xs = right_xs[mask]
right_xs = right_xs[~mask]

left_y = right_y[mask]
right_y = right_y[~mask]

len(left_xs), len(right_xs)
(2, 996)
left_xs
UsageBand fiModelDesc fiBaseModel fiSecondaryDesc ... saleDay saleDayofweek saleDayofyear saleElapsed
372510 0 1030 313 50 ... 17 2 321 1.289952e+09
366529 0 2214 698 31 ... 12 1 102 1.302566e+09

2 rows × 66 columns

left_y.mean(), right_y.mean()
(10.463103, 10.109385)
mse(left_y, left_y.mean()), mse(right_y, right_y.mean())
(0.0, 0.4602337)

Again, one of the splits has only two rows, so I’ll keep that as the second leaf and continue to split the other group:

best_splits = []
for column in right_xs.columns:
  if column in to.cat_names and column not in ['fiModelDesc', 'fiBaseModel']:
    best_splits.append([column, get_cat_best_split(right_xs, right_y, column)])
  if column in to.cont_names:
    best_splits.append([column, get_cont_best_split(right_xs, right_y, column)])
lowest_mse = 10000
for split in best_splits:
  if len(split[1]) > 0:
    if min(split[1][1]) < lowest_mse:
      lowest_mse = min(split[1][1])
      best_split_column = split[0]
      best_split_value = split[1][0]
best_split_column, lowest_mse, best_split_value
('saleElapsed', 0.0004528304, 607910400.0)

The final split (to reach four leaf nodes like the textbook) is on saleElapsed.

mask = right_xs.saleElapsed <= best_split_value
left_xs = right_xs[mask]
right_xs = right_xs[~mask]

left_y = right_y[mask]
right_y = right_y[~mask]

mse(left_y, left_y.mean()), mse(right_y, right_y.mean())
(0.0004528304, 0.4609359)
left_y.mean(), right_y.mean()
(9.776848, 10.110054)
len(left_xs), len(right_xs)
(2, 994)

Plotting the decision tree (manually) to visualize the splits:

[Figure: My Decision Tree Algorithm]

My model seems to be making incremental splits (with 2 rows taken off at each split) based on very small sample sizes. One question I have is whether the decision tree is always supposed to select the split with the minimum error, or whether it should choose randomly from the many columns with similarly small MSE values. Another question: am I supposed to exclude a column from future splits once a split has been made on it (which is what I’m doing currently)? I’ll be looking for these answers when watching the Lesson 6 video.

Decision Tree Algorithm: After Watching Lesson 6

The algorithm presented in the Lesson 6 video (from Jeremy’s notebook How random forests really work) evaluates a split based on a weighted standard deviation calculation.

For a given condition (for example, Age <= 6), Jeremy calculates the following:

  • A “side score” for the rows that satisfy the condition: the standard deviation of the dependent variable multiplied by the number of rows on that side.
  • A side score for the rows that don’t satisfy the condition.
  • The sum of both side scores divided by the total number of rows in the dependent variable column.

Then, for each column, he calculates the score for a split at each unique value of the column and takes the minimum score (and its corresponding split value). The column (and value) whose split has the smallest score is chosen as the best split.

Conceptually, the column (and value) with the smallest weighted standard deviation is the one where splitting on that value creates two groups in which the dependent variable values are relatively close to each other within each group.

I’ll start by coding the _side_score function Jeremy defined, which multiplies the standard deviation of the dependent variable by the number of rows on the given side of the split:

def _side_score(side, y):
    tot = side.sum()
    if tot<=1: return 0
    return y[side].std()*tot

Next, the score function, which calculates the sum of each side’s _side_score divided by the total number of rows, or in other words, the weighted average of each side’s standard deviation:

def score(col, y, split):
    lhs = col<=split
    return (_side_score(lhs,y) + _side_score(~lhs,y))/len(y)

I’ll reload the data and take the same subset as above so I can compare the splits this algorithm makes with the ones made previously:

to = load_pickle("to.pkl")
xs,y = to.train.xs, to.train.y
valid_xs, valid_y = to.valid.xs, to.valid.y
xs_sub = xs.sample(1000, random_state=42)
y_sub = y[xs_sub.index]
mse(y_sub, y_sub.mean()), y_sub.mean()
(0.45866725, 10.109871)
score(xs['UsageBand'], y, 2)
0.6922498415264088

I’ll walk through each step of the score calculation manually to visualize it better.

First we create a boolean array of rows that satisfy or don’t satisfy the condition:

lhs = xs['UsageBand'] <= 2
lhs
0          True
1          True
2          True
3          True
4         False
          ...  
412693     True
412694     True
412695     True
412696     True
412697     True
Name: UsageBand, Length: 404710, dtype: bool

Next: for how many rows is the condition met?

tot = lhs.sum()
tot
370444

What is the standard deviation of the dependent variable for rows where the condition is met?

y[lhs].std()
0.69022495

Finally, what is the side score for this condition?

side_score_lhs = y[lhs].std()*tot
side_score_lhs
255689.68972754478

We ask the same questions for the rows where the condition is NOT met:

tot = (~lhs).sum()
tot
34266
y[~lhs].std()
0.71414065
side_score_rhs = y[~lhs].std()*tot
side_score_rhs
24470.743636608124

Finally, we add the two side scores together and divide by the number of total rows:

(side_score_lhs + side_score_rhs) / len(y)
0.6922498415264088

Thankfully, this is equal to the output of the score function! I’ll continue with the code that calculates this score for a split at each of the unique values in a column:

def min_col(xs, nm, y):
    col = xs[nm]
    unq = col.dropna().unique()
    scores = np.array([score(col, y, o) for o in unq if not np.isnan(o)])
    idx = scores.argmin()
    return unq[idx],scores[idx]

Let’s test it out on UsageBand:

min_col(xs, 'UsageBand', y)
(0, 0.6921555579861831)

Again, I’ll go line-by-line through the code of min_col to make sure I understand what’s going on:

col = xs['UsageBand']
col
0         2
1         2
2         1
3         1
4         3
         ..
412693    0
412694    0
412695    0
412696    0
412697    0
Name: UsageBand, Length: 404710, dtype: int8

What are the unique values in this column?

unq = col.dropna().unique()
unq
array([2, 1, 3, 0], dtype=int8)

If we split the data on each of the unique values, what is the score for each split?

scores = np.array([score(col, y, o) for o in unq if not np.isnan(o)])
scores
array([0.69224984, 0.69378292, 0.69430405, 0.69215556])

What is the minimum score, and what value does it correspond to?

idx = scores.argmin()
idx
3
unq[idx], scores[idx]
(0, 0.6921555579861831)

Next, I’ll iterate through all of the columns in the data and calculate the minimum score (and corresponding value) for each one. I’ll use the smaller subset of the data since the dictionary comprehension was taking more than an hour to compute on the full dataset.

col_scores = {o: min_col(xs_sub, o, y_sub) for o in xs_sub.columns}
col_scores
{'UsageBand': (0, 0.6710787683725357),
 'fiModelDesc': (150, 0.6684566705226899),
 'fiBaseModel': (30, 0.66775039935112),
 'fiSecondaryDesc': (15, 0.6461849688887596),
 'fiModelSeries': (60, 0.6612360656261445),
 'fiModelDescriptor': (9, 0.6385975532531738),
 'ProductSize': (0, 0.649804151058197),
 'fiProductClassDesc': (7, 0.6468342208862304),
 'state': (52, 0.6769107410311699),
 'ProductGroup': (3, 0.6413102557659149),
 'ProductGroupDesc': (3, 0.6413102557659149),
 'Drive_System': (3, 0.6595019570589066),
 'Enclosure': (3, 0.6443507042527199),
 'Forks': (0, 0.6468727025985718),
 'Pad_Type': (0, 0.6484307520389557),
 'Ride_Control': (0, 0.6693848296999931),
 'Stick': (0, 0.6484307520389557),
 'Transmission': (6, 0.6698495596647263),
 'Turbocharged': (0, 0.6484307520389557),
 'Blade_Extension': (0, 0.6660390158891678),
 'Blade_Width': (2, 0.665115752696991),
 'Enclosure_Type': (0, 0.6660390158891678),
 'Engine_Horsepower': (0, 0.6660390158891678),
 'Hydraulics': (0, 0.6497658799290658),
 'Pushblock': (0, 0.6660390158891678),
 'Ripper': (0, 0.6618279729485512),
 'Scarifier': (0, 0.6660390158891678),
 'Tip_Control': (0, 0.6660390158891678),
 'Tire_Size': (3, 0.6607923060655594),
 'Coupler': (2, 0.6719830477237702),
 'Coupler_System': (0, 0.5992864975929261),
 'Grouser_Tracks': (0, 0.5992864975929261),
 'Hydraulics_Flow': (0, 0.5992864975929261),
 'Track_Type': (1, 0.6633642191886902),
 'Undercarriage_Pad_Width': (5, 0.6722126321792603),
 'Stick_Length': (1, 0.6726506268978119),
 'Thumb': (0, 0.6727646491527557),
 'Pattern_Changer': (0, 0.6727646491527557),
 'Grouser_Type': (0, 0.6727646491527557),
 'Backhoe_Mounting': (0, 0.6719208754301071),
 'Blade_Type': (5, 0.6712375184893609),
 'Travel_Controls': (2, 0.6719208754301071),
 'Differential_Type': (0, 0.6718101676702499),
 'Steering_Controls': (0, 0.6718101676702499),
 'saleIs_month_end': (1, 0.6775472568273544),
 'saleIs_month_start': (1, 0.6773856681585312),
 'saleIs_quarter_end': (1, 0.6774418947696685),
 'saleIs_quarter_start': (2, 0.6775886416435242),
 'saleIs_year_end': (1, 0.6775886416435242),
 'saleIs_year_start': (1, 0.6775886416435242),
 'auctioneerID_na': (2, 0.6775886416435242),
 'MachineHoursCurrentMeter_na': (1, 0.6731610369682312),
 'SalesID': (1614635, 0.6669842964410782),
 'MachineID': (1010754, 0.6537257554531097),
 'ModelID': (4336, 0.6421609473228455),
 'datasource': (132, 0.674677339553833),
 'auctioneerID': (1.0, 0.674844207584858),
 'YearMade': (1974, 0.6595817529559136),
 'MachineHoursCurrentMeter': (3360.0, 0.6672791358232498),
 'saleYear': (1991, 0.6751361751556396),
 'saleMonth': (4, 0.6770030355453491),
 'saleWeek': (17, 0.6766363668441773),
 'saleDay': (10, 0.6758146023750305),
 'saleDayofweek': (4, 0.6748345303535461),
 'saleDayofyear': (10, 0.6763967939466238),
 'saleElapsed': (688348800.0, 0.6746526069641113)}

I’ll then find the split across all columns with the lowest score:

min_col_score = min(col_scores.values())
min_col_names = [key for key in col_scores if col_scores[key] == min_col_score]
min_col_score, min_col_names
((0, 0.5992864975929261),
 ['Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow'])

The lowest score came from the <= 0 split for three columns which, interestingly enough, were found to be redundant features in the textbook chapter. I’ll pick Coupler_System since that’s the split used in the textbook.

xs_sub.Coupler_System.unique()
array([0, 1, 2], dtype=int8)

To answer one of my earlier questions: I think we only remove a column from future splits if it is a binary column (such as the Sex column in the Titanic dataset), since there are only two possible groups, and once the data has been split into those two groups there’s nothing left to split on in that column.

mask = xs_sub.Coupler_System <= 0
lhs = xs_sub[mask]
rhs = xs_sub[~mask]

lhs_y = y_sub[mask]
rhs_y = y_sub[~mask]

len(lhs), len(rhs)
(904, 96)
lhs_y.mean(), rhs_y.mean()
(10.208925, 9.177121)
mse(lhs_y, lhs_y.mean()), mse(rhs_y, rhs_y.mean())
(0.39364752, 0.10852089)

The left-hand side has many more rows, so I’ll continue splitting that dataset and keep rhs as a leaf node.

Coupler_System only has one value, 0, in this group, so I’ll remove it from future splits.

lhs.Coupler_System.unique()
array([0], dtype=int8)
lhs = lhs.drop("Coupler_System", axis=1)
'Coupler_System' in lhs.columns
False
col_scores = {o: min_col(lhs, o, lhs_y) for o in lhs.columns}
min_col_score = min(col_scores.values())
min_col_names = [key for key in col_scores if col_scores[key] == min_col_score]
min_col_score, min_col_names
((0, 0.5869915567140663), ['Pad_Type', 'Stick', 'Turbocharged'])

Again, we have three columns with the same split score. I’ll choose Pad_Type as the column to split on.

lhs.Pad_Type.unique()
array([2, 0, 3, 4], dtype=int8)
mask = lhs.Pad_Type <= 0
rhs = lhs[~mask]
lhs = lhs[mask]

rhs_y = lhs_y[~mask]
lhs_y = lhs_y[mask]

len(lhs), len(rhs)
(700, 204)
lhs_y.mean(), rhs_y.mean()
(10.30318, 9.885499)
mse(lhs_y, lhs_y.mean()), mse(rhs_y, rhs_y.mean())
(0.43736917, 0.108533934)

Again, the left-hand side has more rows, so I’ll continue to split it and keep rhs as the second leaf node. I’ll also remove Pad_Type from lhs since it now has the single value 0.

lhs = lhs.drop("Pad_Type", axis=1)
'Pad_Type' in lhs.columns
False
col_scores = {o: min_col(lhs, o, lhs_y) for o in lhs.columns}
min_col_score = min(col_scores.values())
min_col_names = [key for key in col_scores if col_scores[key] == min_col_score]
min_col_score, min_col_names
((0, 0.6552193513938359),
 ['Drive_System',
  'Blade_Extension',
  'Enclosure_Type',
  'Engine_Horsepower',
  'Scarifier',
  'Tip_Control'])

I now have six columns with the same split score. I’ll choose the first one, Drive_System, to make my final split and get my last two leaf nodes.

mask = lhs.Drive_System <= 0
rhs = lhs[~mask]
lhs = lhs[mask]

rhs_y = lhs_y[~mask]
lhs_y = lhs_y[mask]

len(lhs), len(rhs)
(638, 62)
lhs_y.mean(), rhs_y.mean()
(10.275307, 10.590003)
mse(lhs_y, lhs_y.mean()), mse(rhs_y, rhs_y.mean())
(0.41287065, 0.59920347)

Here is a manually drawn visualization of this four-leaf-node decision tree:

[Figure: Jeremy’s Decision Tree]

Decision Tree Algorithm: Using sklearn

This is the last decision tree I’ll create, for comparison with the other two approaches.

m = DecisionTreeRegressor(max_leaf_nodes=4).fit(xs_sub, y_sub);
from sklearn.tree import export_graphviz

import graphviz

def draw_tree(t, df, size=10, ratio=0.6, precision=2, **kwargs):
    s=export_graphviz(t, out_file=None, feature_names=df.columns, filled=True, rounded=True,
                      special_characters=True, rotate=False, precision=precision, **kwargs)
    return graphviz.Source(re.sub('Tree {', f'Tree {{ size={size}; ratio={ratio}', s))
draw_tree(m, xs_sub, size=10)

Final Thoughts

In this exercise, I tried three different algorithms to create a four-leaf-node decision tree on a 1000-row subset of the Blue Book for Bulldozers Kaggle competition dataset:

  • An algorithm I created where the model made splits based on the lowest MSE.
  • Jeremy’s “side score” algorithm which found the split with the lowest weighted standard deviation.
  • sklearn.DecisionTreeRegressor, which also uses MSE (squared error) to decide splits.

[Figure: Decision Tree Comparison]

Here are some key observations:

  • My algorithm made splits based on very small groups of rows. To improve this, I would either explicitly ignore splits that leave too few rows on one side, or penalize such splits by weighting each side by its number of rows, like Jeremy does in _side_score (see the first sketch after this list).
  • The sklearn.DecisionTreeRegressor had the same initial split as Jeremy’s (on Coupler_System) but then chose two completely different columns to split on for the remaining leaf nodes. I wonder if this difference in column selection is due to the difference in how a split is scored between those two algorithms.
  • A future improvement I could make to this experiment is to compare the MSE on a validation set for each decision tree.
  • Another future improvement would be to make my algorithm and Jeremy’s algorithm run faster, so that I can test them on the full dataset in a reasonable amount of time (see the second sketch after this list).
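
To illustrate the first improvement, here is a minimal sketch (my own, not code from the chapter or lesson) of an MSE-based split score that skips tiny groups and weights each side by its row count, similar in spirit to Jeremy’s _side_score. Lower is better, and the min_samples threshold is an arbitrary choice:

def weighted_mse_score(col, y, split_val, min_samples=20):
    lhs = col <= split_val
    n_lhs, n_rhs = lhs.sum(), (~lhs).sum()
    # Ignore splits that leave too few rows on either side.
    if n_lhs < min_samples or n_rhs < min_samples: return float('inf')
    # Weight each side's MSE by its row count so a tiny, "lucky" group
    # can no longer dominate the choice of split.
    lhs_mse = mse(y[lhs], y[lhs].mean())
    rhs_mse = mse(y[~lhs], y[~lhs].mean())
    return (lhs_mse*n_lhs + rhs_mse*n_rhs) / len(y)

With this score, my get_cont_best_split loop (and, with == in place of <=, get_cat_best_split) could simply keep the split with the lowest weighted_mse_score instead of comparing each side’s MSE separately.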
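
For the second improvement, the weighted standard deviation score can be computed for every candidate threshold of a numeric column in a single sorted pass, using running sums of y and y**2 instead of re-slicing the data once per unique value. This is just a sketch of the idea (it uses the population standard deviation, so its scores will differ slightly from the pandas .std() used in _side_score):

def fast_min_col(col, y):
    col, y = col.to_numpy(dtype=float), y.to_numpy(dtype=float)
    order = np.argsort(col, kind='stable')
    col, y = col[order], y[order]
    n = len(y)
    csum, csum2 = np.cumsum(y), np.cumsum(y**2)  # running sums of y and y**2
    best_val, best_score = None, float('inf')
    for i in range(1, n):
        if col[i] == col[i-1]: continue  # only consider splits between distinct values
        n_lhs, n_rhs = i, n-i
        sum_lhs, sum_rhs = csum[i-1], csum[-1]-csum[i-1]
        sq_lhs, sq_rhs = csum2[i-1], csum2[-1]-csum2[i-1]
        # variance = E[y**2] - E[y]**2 for each side, computed from the running sums
        var_lhs = max(sq_lhs/n_lhs - (sum_lhs/n_lhs)**2, 0.0)
        var_rhs = max(sq_rhs/n_rhs - (sum_rhs/n_rhs)**2, 0.0)
        score = (np.sqrt(var_lhs)*n_lhs + np.sqrt(var_rhs)*n_rhs) / n
        if score < best_score: best_val, best_score = col[i-1], score
    return best_val, best_score

Sorting costs O(n log n) per column and each candidate threshold is then O(1), so this should scale to the full training set much better than recomputing .std() on two slices for every unique value.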

This was another challenging and enjoyable “Further Research” exercise provided by the fastai textbook. I hope you enjoyed this blog post!