!pip install dtreeviz
from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
from fastai.tabular.all import *
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from dtreeviz.trees import *
from IPython.display import Image, display_svg, SVG
pd.options.display.max_rows = 20
pd.options.display.max_columns = 8

Implementing a Decision Tree Algorithm from Scratch
Background
In this blog post I’ll work through the second exercise of the “Further Research” section in Chapter 9 of the fastai textbook:
Implement the decision tree algorithm in this chapter from scratch yourself, and try it on the dataset you used in the first exercise.
The decision tree algorithm, as described in the textbook for the Blue Book for Bulldozers Kaggle competition (a rough code sketch of this recursion follows the list):
- Loop through each column of the dataset in turn.
- For each column, loop through each possible level of that column in turn.
- Try splitting the data into two groups, based on whether they are greater than or less than that value (or if it is a categorical variable, based on whether they are equal to or not equal to that level of that categorical variable).
- Find the average sale price for each of those two groups, and see how close that is to the actual sale price of each of the items of equipment in that group. Treat this as a very simple “model” in which our predictions are simply the average sale price of the item’s group.
- After looping through all of the columns and all the possible levels for each, pick the split point that gave the best predictions using that simple model.
- We now have two groups for our data, based on this selected split. Treat each group as a separate dataset, and find the best split for each by going back to step 1 for each group.
- Continue this process recursively, until you have reached some stopping criterion for each group–for instance, stop splitting a group further when it has only 20 items in it.
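Before trying it myself, here is a rough sketch of how that recursion could look in code. This is my own rendering, not the textbook’s implementation: the helper names group_mse, best_split, and build_tree are mine, every (already integer-encoded) column is treated as orderable so a single <= comparison covers both cases, and the split is scored with a row-weighted MSE, which is one reasonable reading of step 4.
import numpy as np

def group_mse(y):
    # MSE of a group's values against the group mean (the "simple model" above)
    return ((y - y.mean())**2).mean()

def best_split(xs, y):
    # Scan every column and every unique value; return the (column, value) pair
    # whose two groups give the lowest row-weighted MSE.
    best_col, best_val, best_score = None, None, np.inf
    for col in xs.columns:
        for val in xs[col].unique():
            mask = xs[col] <= val
            if mask.sum() == 0 or (~mask).sum() == 0: continue
            s = (group_mse(y[mask])*mask.sum() + group_mse(y[~mask])*(~mask).sum()) / len(y)
            if s < best_score: best_col, best_val, best_score = col, val, s
    return best_col, best_val

def build_tree(xs, y, min_samples=20):
    # Stop splitting once a group is small enough; otherwise recurse on both sides.
    if len(xs) <= min_samples: return {'pred': y.mean()}
    col, val = best_split(xs, y)
    if col is None: return {'pred': y.mean()}
    mask = xs[col] <= val
    return {'col': col, 'val': val,
            'left':  build_tree(xs[mask],  y[mask],  min_samples),
            'right': build_tree(xs[~mask], y[~mask], min_samples)}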
I’ll implement the algorithm on my own, then compare it with the algorithm Jeremy implemented in the Lesson 6 video, and the sklearn.DecisionTreeRegressor.
Load the Data
I’ll follow the same steps as the textbook to load and prepare the training and validation datasets from the Blue Book for Bulldozers Kaggle Competition:
from pathlib import Path
cred_path = Path("~/.kaggle/kaggle.json").expanduser()
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)

import zipfile,kaggle
path = Path('bluebook-for-bulldozers')
if not path.exists():
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)

Downloading bluebook-for-bulldozers.zip to /content
100%|██████████| 48.4M/48.4M [00:00<00:00, 53.5MB/s]
path.ls(file_type='text')
(#7) [Path('bluebook-for-bulldozers/Test.csv'),Path('bluebook-for-bulldozers/TrainAndValid.csv'),Path('bluebook-for-bulldozers/random_forest_benchmark_test.csv'),Path('bluebook-for-bulldozers/Machine_Appendix.csv'),Path('bluebook-for-bulldozers/median_benchmark.csv'),Path('bluebook-for-bulldozers/ValidSolution.csv'),Path('bluebook-for-bulldozers/Valid.csv')]
df = pd.read_csv(path/'TrainAndValid.csv', low_memory=False)
len(df.columns)
53
sizes = 'Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact'
df['ProductSize'] = df['ProductSize'].astype('category')
df['ProductSize'].cat.set_categories(sizes, ordered=True, inplace=True)
FutureWarning: The `inplace` parameter in pandas.Categorical.set_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
dep_var = 'SalePrice'
df[dep_var] = np.log(df[dep_var])
df = add_datepart(df, 'saledate')
df_test = pd.read_csv(path/'Test.csv', low_memory=False)
df_test = add_datepart(df_test, 'saledate')
' '.join(o for o in df.columns if o.startswith('sale'))
'saleYear saleMonth saleWeek saleDay saleDayofweek saleDayofyear saleIs_month_end saleIs_month_start saleIs_quarter_end saleIs_quarter_start saleIs_year_end saleIs_year_start saleElapsed'
procs = [Categorify, FillMissing]
cond = (df.saleYear<2011) | (df.saleMonth<10)
train_idx = np.where( cond)[0]
valid_idx = np.where(~cond)[0]
splits = (list(train_idx), list(valid_idx))
cont,cat = cont_cat_split(df, 1, dep_var=dep_var)
to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)
len(to.train), len(to.valid)
(404710, 7988)
to.show(3)
| UsageBand | fiModelDesc | fiBaseModel | fiSecondaryDesc | fiModelSeries | fiModelDescriptor | ProductSize | fiProductClassDesc | state | ProductGroup | ProductGroupDesc | Drive_System | Enclosure | Forks | Pad_Type | Ride_Control | Stick | Transmission | Turbocharged | Blade_Extension | Blade_Width | Enclosure_Type | Engine_Horsepower | Hydraulics | Pushblock | Ripper | Scarifier | Tip_Control | Tire_Size | Coupler | Coupler_System | Grouser_Tracks | Hydraulics_Flow | Track_Type | Undercarriage_Pad_Width | Stick_Length | Thumb | Pattern_Changer | Grouser_Type | Backhoe_Mounting | Blade_Type | Travel_Controls | Differential_Type | Steering_Controls | saleIs_month_end | saleIs_month_start | saleIs_quarter_end | saleIs_quarter_start | saleIs_year_end | saleIs_year_start | auctioneerID_na | MachineHoursCurrentMeter_na | SalesID | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | saleYear | saleMonth | saleWeek | saleDay | saleDayofweek | saleDayofyear | saleElapsed | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Low | 521D | 521 | D | #na# | #na# | #na# | Wheel Loader - 110.0 to 120.0 Horsepower | Alabama | WL | Wheel Loader | #na# | EROPS w AC | None or Unspecified | #na# | None or Unspecified | #na# | #na# | #na# | #na# | #na# | #na# | #na# | 2 Valve | #na# | #na# | #na# | #na# | None or Unspecified | None or Unspecified | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | Standard | Conventional | False | False | False | False | False | False | False | False | 1139246 | 999089 | 3157 | 121 | 3.0 | 2004 | 68.0 | 2006 | 11 | 46 | 16 | 3 | 320 | 1.163635e+09 | 11.097410 |
| 1 | Low | 950FII | 950 | F | II | #na# | Medium | Wheel Loader - 150.0 to 175.0 Horsepower | North Carolina | WL | Wheel Loader | #na# | EROPS w AC | None or Unspecified | #na# | None or Unspecified | #na# | #na# | #na# | #na# | #na# | #na# | #na# | 2 Valve | #na# | #na# | #na# | #na# | 23.5 | None or Unspecified | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | Standard | Conventional | False | False | False | False | False | False | False | False | 1139248 | 117657 | 77 | 121 | 3.0 | 1996 | 4640.0 | 2004 | 3 | 13 | 26 | 4 | 86 | 1.080259e+09 | 10.950807 |
| 2 | High | 226 | 226 | #na# | #na# | #na# | #na# | Skid Steer Loader - 1351.0 to 1601.0 Lb Operating Capacity | New York | SSL | Skid Steer Loaders | #na# | OROPS | None or Unspecified | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | Auxiliary | #na# | #na# | #na# | #na# | #na# | None or Unspecified | None or Unspecified | None or Unspecified | Standard | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | False | False | False | False | False | False | False | False | 1139249 | 434808 | 7009 | 121 | 3.0 | 2001 | 2838.0 | 2004 | 2 | 9 | 26 | 3 | 57 | 1.077754e+09 | 9.210340 |
to = load_pickle('to.pkl')
xs,y = to.train.xs, to.train.y
valid_xs, valid_y = to.valid.xs, to.valid.y

Decision Tree Algorithm: Before Watching Lesson 6
I haven’t watched the Lesson 6 video yet, and am assuming that Jeremy walks through how to build a decision tree from scratch in that video, so I’m trying it out first on my own. To be honest, I’m a bit embarrassed to publish this part of my learning process, since my approach is not very elegant, but I’d like to show what my current thinking is and then compare it with what I understand after learning the formal solution to this algorithm.
I’ll start by creating a mean squared error function to calculate at each split:
def mse(pred, y): return ((pred-y)**2).mean()

# root mse and value
mse(y, y.mean()), y.mean()
(0.48205692, 10.104347)
I’ll walk through the algorithm step-by-step manually for a couple of columns before I create any functions or loops. The first column in the training dataset is the categorical variable UsageBand, which takes the values 0, 1, 2, and 3.
xs.columns
Index(['UsageBand', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc',
'fiModelSeries', 'fiModelDescriptor', 'ProductSize',
'fiProductClassDesc', 'state', 'ProductGroup', 'ProductGroupDesc',
'Drive_System', 'Enclosure', 'Forks', 'Pad_Type', 'Ride_Control',
'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension',
'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics',
'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size',
'Coupler', 'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow',
'Track_Type', 'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb',
'Pattern_Changer', 'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type',
'Travel_Controls', 'Differential_Type', 'Steering_Controls',
'saleIs_month_end', 'saleIs_month_start', 'saleIs_quarter_end',
'saleIs_quarter_start', 'saleIs_year_end', 'saleIs_year_start',
'auctioneerID_na', 'MachineHoursCurrentMeter_na', 'SalesID',
'MachineID', 'ModelID', 'datasource', 'auctioneerID', 'YearMade',
'MachineHoursCurrentMeter', 'saleYear', 'saleMonth', 'saleWeek',
'saleDay', 'saleDayofweek', 'saleDayofyear', 'saleElapsed'],
dtype='object')
xs.UsageBand.unique()
array([2, 1, 3, 0], dtype=int8)
The first split I’ll choose is between rows where the UsageBand value is 0 or not:
mask = xs.UsageBand == 0
is_zero_xs = xs[mask]
is_zero_y = y[mask]
len(is_zero_xs), len(is_zero_y)
(334164, 334164)
isnt_zero_xs = xs[~mask]
isnt_zero_y = y[~mask]
len(isnt_zero_xs), len(isnt_zero_y)
(70546, 70546)
I’ll calculate the average sale price and MSE for each group:
is_zero_y.mean(), mse(is_zero_y, is_zero_y.mean())
(10.080211, 0.46968707)
isnt_zero_y.mean(), mse(isnt_zero_y, isnt_zero_y.mean())
(10.218686, 0.5248172)
I’ll put this code into a routine so I can apply it to all splits of UsageBand:
def get_cat_splits(xs, y, column):
    splits = []
    for el in xs[column].unique():
        mask = xs[column] == el
        is_el_y = y[mask]
        isnt_el_y = y[~mask]
        is_el_mse = mse(is_el_y, is_el_y.mean())
        isnt_el_mse = mse(isnt_el_y, isnt_el_y.mean())
        split = {
            str(el): is_el_mse,
            "!" + str(el): isnt_el_mse
        }
        splits.append(split)
    return splits

splits = get_cat_splits(xs, y, "UsageBand")
splits
[{'2': 0.46612886, '!2': 0.4824535},
{'1': 0.5223263, '!1': 0.476563},
{'3': 0.50998193, '!3': 0.47640917},
{'0': 0.46968707, '!0': 0.5248172}]
Great! My function calculates the MSE for each split in a categorical column. However, I don’t need to store all of the splits for the column, just the best one (the one with the lowest MSE). I’ll modify my function so that it returns only the best split:
def get_cat_best_split(xs, y, column):
    best_split = []
    lowest_mse = 1000000
    for el in xs[column].unique():
        mask = xs[column] == el
        # ignore splits where either group has 0 or 1 row
        if sum(mask) == 0: continue
        if sum(mask) == 1: continue
        if sum(~mask) == 1: continue
        is_el_y = y[mask]
        isnt_el_y = y[~mask]
        is_el_mse = mse(is_el_y, is_el_y.mean())
        isnt_el_mse = mse(isnt_el_y, isnt_el_y.mean())
        if is_el_mse < lowest_mse or isnt_el_mse < lowest_mse:
            best_split = [el, [is_el_mse, isnt_el_mse]]
            lowest_mse = min(is_el_mse, isnt_el_mse, lowest_mse)
    print(column)
    return best_split

get_cat_best_split(xs, y, "UsageBand")
UsageBand
[2, [0.46612886, 0.4824535]]
I now have the category on which the split was made (is 2, isn’t 2) and the corresponding MSE for each side.
Next, I’ll find the best split for one of the continuous columns, YearMade:
xs.YearMade.unique()
array([2004, 1996, 2001, 2007, 1993, 2008, 1000, 1998, 1999, 2003, 1991,
2000, 2005, 1995, 2006, 2002, 1984, 1988, 1980, 1992, 1987, 1997,
1971, 1978, 1989, 1985, 1979, 1976, 1994, 1982, 1990, 1974, 1968,
1966, 1983, 1986, 1981, 1970, 1977, 1975, 1973, 1965, 1967, 2009,
2010, 1969, 1972, 1964, 1957, 1958, 1963, 1919, 1920, 1950, 1948,
1952, 1942, 1956, 1954, 1953, 1955, 1959, 1960, 1961, 1962, 1951,
1937, 1949, 1947, 2012, 2011, 2014], dtype=int16)
mask = xs.YearMade <= 2004
lte_2004_xs = xs[mask]
lte_2004_y = y[mask]
len(lte_2004_xs), len(lte_2004_y)
(364685, 364685)
gt_2004_xs = xs[~mask]
gt_2004_y = y[~mask]
len(gt_2004_xs), len(gt_2004_y)
(40025, 40025)
So far the process is pretty similar to what I did for categorical variables. I’ll calculate the average sale price and MSE for each split next:
lte_2004_y.mean(), mse(lte_2004_y, lte_2004_y.mean())
(10.072348, 0.46514994)
gt_2004_y.mean(), mse(gt_2004_y, gt_2004_y.mean())
(10.395911, 0.5417633)
Great! I’ll wrap this process into a function:
def get_cont_best_split(xs, y, column):
    best_split = []
    lowest_mse = 1000000
    for el in xs[column].unique():
        mask = xs[column] <= el
        # ignore splits where either group has 0 or 1 row
        if sum(mask) == 0: continue
        if sum(mask) == 1: continue
        if sum(~mask) == 1: continue
        lte_el_y = y[mask]
        gt_el_y = y[~mask]
        lte_el_mse = mse(lte_el_y, lte_el_y.mean())
        gt_el_mse = mse(gt_el_y, gt_el_y.mean())
        if lte_el_mse < lowest_mse or gt_el_mse < lowest_mse:
            best_split = [el, [lte_el_mse, gt_el_mse]]
            lowest_mse = min(lte_el_mse, gt_el_mse, lowest_mse)
    print(column)
    return best_split

get_cont_best_split(xs, y, "YearMade")
YearMade
[2012, [0.4820588, 0.00022279842]]
Now that I have functions to calculate the best split (by MSE) for categorical and continuous variables, I can loop through each column, apply these functions and then determine the overall best split. I’ll use a small subset of the data for this procedure since the loops take too long (more than an hour) on the full dataset:
xs_sub = xs.sample(1000, random_state=42)
y_sub = y[xs_sub.index]

best_splits = []
for column in xs_sub.columns:
    if column in to.cat_names:
        best_splits.append([column, get_cat_best_split(xs_sub, y_sub, column)])
    if column in to.cont_names:
        best_splits.append([column, get_cont_best_split(xs_sub, y_sub, column)])

best_splits
[['UsageBand', [0, [0.4289803, 0.5418442]]],
['fiModelDesc', [1082, [0.0, 0.45956165]]],
['fiBaseModel', [72, [0.0, 0.45933586]]],
['fiSecondaryDesc', [139, [0.0010413192, 0.45947686]]],
['fiModelSeries', [22, [0.0018927432, 0.45957345]]],
['fiModelDescriptor', [138, [0.0011120219, 0.45861065]]],
['ProductSize', [6, [0.17651369, 0.45974725]]],
['fiProductClassDesc', [59, [0.0028960933, 0.4587686]]],
['state', [15, [0.024394397, 0.45930827]]],
['ProductGroup', [3, [0.10852089, 0.39364752]]],
['ProductGroupDesc', [3, [0.10852089, 0.39364752]]],
['Drive_System', [2, [0.0733325, 0.49420977]]],
['Enclosure', [3, [0.30207962, 0.37955183]]],
['Forks', [2, [0.28580874, 0.4628031]]],
['Pad_Type', [4, [0.01935081, 0.46220613]]],
['Ride_Control', [1, [0.10788533, 0.53090215]]],
['Stick', [1, [0.08875317, 0.49070317]]],
['Transmission', [3, [0.037590597, 0.4595029]]],
['Turbocharged', [2, [0.043661047, 0.46277064]]],
['Blade_Extension', [1, [0.6090261, 0.43294448]]],
['Blade_Width', [2, [0.21063821, 0.45938253]]],
['Enclosure_Type', [1, [0.09410498, 0.45690852]]],
['Engine_Horsepower', [2, [0.08047946, 0.45704415]]],
['Hydraulics', [7, [0.00222363, 0.4558903]]],
['Pushblock', [2, [0.17637664, 0.4491221]]],
['Ripper', [1, [0.3071829, 0.4566584]]],
['Scarifier', [0, [0.43313345, 0.59920347]]],
['Tip_Control', [0, [0.43313345, 0.59920347]]],
['Tire_Size', [6, [0.003916449, 0.45774794]]],
['Coupler', [0, [0.35867354, 0.54620147]]],
['Coupler_System', [2, [0.10629387, 0.45753148]]],
['Grouser_Tracks', [2, [0.033831615, 0.45868516]]],
['Hydraulics_Flow', [0, [0.39364752, 0.10852089]]],
['Track_Type', [1, [0.2580127, 0.45796755]]],
['Undercarriage_Pad_Width', [4, [0.08790448, 0.459086]]],
['Stick_Length', [21, [0.03558779, 0.45946068]]],
['Thumb', [2, [0.23120114, 0.4605685]]],
['Pattern_Changer', [2, [0.4319223, 0.45872542]]],
['Grouser_Type', [1, [0.4383151, 0.45913604]]],
['Backhoe_Mounting', [0, [0.46665692, 0.3895407]]],
['Blade_Type', [10, [0.0015298143, 0.4583235]]],
['Travel_Controls', [4, [0.0022710091, 0.45842946]]],
['Differential_Type', [3, [0.008903191, 0.4579678]]],
['Steering_Controls', [2, [0.39570603, 0.46438074]]],
['saleIs_month_end', [1, [0.46104407, 0.36129642]]],
['saleIs_month_start', [1, [0.46031126, 0.39876112]]],
['saleIs_quarter_end', [1, [0.4600184, 0.33042803]]],
['saleIs_quarter_start', [1, [0.45814824, 0.48365197]]],
['saleIs_year_end', [1, [0.45866725, nan]]],
['saleIs_year_start', [1, [0.45866725, nan]]],
['auctioneerID_na', [1, [0.45774823, 0.46984664]]],
['MachineHoursCurrentMeter_na', [2, [0.41661143, 0.51914454]]],
['SalesID', [1144455, [0.008310286, 0.45859787]]],
['MachineID', [17776, [0.003757841, 0.4595788]]],
['ModelID', [36033, [0.45857352, 0.0010936307]]],
['datasource', [132, [0.43562403, 0.49841937]]],
['auctioneerID', [0.0, [0.34691957, 0.45884383]]],
['YearMade', [1974, [0.3009956, 0.46007067]]],
['MachineHoursCurrentMeter', [14856.0, [0.459368, 0.2509487]]],
['saleYear', [1989, [0.09763114, 0.45981508]]],
['saleMonth', [8, [0.44693333, 0.48317593]]],
['saleWeek', [50, [0.45991018, 0.35147443]]],
['saleDay', [30, [0.4608025, 0.2767068]]],
['saleDayofweek', [4, [0.4637312, 0.39367497]]],
['saleDayofyear', [10, [0.0038693824, 0.45864925]]],
['saleElapsed', [607910400.0, [0.0004528304, 0.45936278]]]]
lowest_mse = 10000
for split in best_splits:
    if len(split[1]) > 0:
        if min(split[1][1]) < lowest_mse:
            lowest_mse = min(split[1][1])
            best_split_column = split[0]
            best_split_value = split[1][0]
best_split_column, lowest_mse, best_split_value
('fiModelDesc', 0.0, 1082)
There were numerous splits with an MSE of 0.0, so I’ll just pick the first one, which is the split where fiModelDesc is 1082.
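Out of curiosity, here’s a quick way (my own addition, reusing the best_splits structure built above) to list every column whose best split reached an MSE of 0.0 on either side:
# (column, split value) pairs whose best split had a 0.0 MSE on one of its sides
zero_mse_splits = [(col, best[0]) for col, best in best_splits
                   if len(best) > 0 and min(best[1]) == 0.0]
zero_mse_splits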
mask = xs_sub.fiModelDesc == best_split_value
left_xs = xs_sub[mask]
right_xs = xs_sub[~mask]
left_y = y_sub[mask]
right_y = y_sub[~mask]
len(left_xs), len(right_xs)
(2, 998)
left_y.mean(), right_y.mean()
(9.998797, 10.110093)
mse(left_y, left_y.mean()), mse(right_y, right_y.mean())
(0.0, 0.45956165)
left_xs
| UsageBand | fiModelDesc | fiBaseModel | fiSecondaryDesc | ... | saleDay | saleDayofweek | saleDayofyear | saleElapsed | |
|---|---|---|---|---|---|---|---|---|---|
| 292246 | 0 | 1082 | 326 | 93 | ... | 25 | 2 | 84 | 1.237939e+09 |
| 292242 | 0 | 1082 | 326 | 93 | ... | 25 | 2 | 84 | 1.237939e+09 |
2 rows × 66 columns
I’ll leave the left_xs split as a leaf, since it has only two rows. I’ll continue to split right_xs:
best_splits = []
for column in right_xs.columns:
    if column in to.cat_names and column != 'fiModelDesc':
        best_splits.append([column, get_cat_best_split(right_xs, right_y, column)])
    if column in to.cont_names:
        best_splits.append([column, get_cont_best_split(right_xs, right_y, column)])

lowest_mse = 10000
for split in best_splits:
    if len(split[1]) > 0:
        if min(split[1][1]) < lowest_mse:
            lowest_mse = min(split[1][1])
            best_split_column = split[0]
            best_split_value = split[1][0]
best_split_column, lowest_mse, best_split_value
('fiBaseModel', 0.0, 72)
The next best split is for fiBaseModel:
mask = right_xs.fiBaseModel == best_split_value
left_xs = right_xs[mask]
right_xs = right_xs[~mask]
left_y = right_y[mask]
right_y = right_y[~mask]
len(left_xs), len(right_xs)
(2, 996)
left_xs
| UsageBand | fiModelDesc | fiBaseModel | fiSecondaryDesc | ... | saleDay | saleDayofweek | saleDayofyear | saleElapsed | |
|---|---|---|---|---|---|---|---|---|---|
| 372510 | 0 | 1030 | 313 | 50 | ... | 17 | 2 | 321 | 1.289952e+09 |
| 366529 | 0 | 2214 | 698 | 31 | ... | 12 | 1 | 102 | 1.302566e+09 |
2 rows × 66 columns
left_y.mean(), right_y.mean()
(10.463103, 10.109385)
mse(left_y, left_y.mean()), mse(right_y, right_y.mean())
(0.0, 0.4602337)
Again, one of the splits has only two rows, so I’ll keep that as the second leaf and continue to split the other group:
best_splits = []
for column in right_xs.columns:
    if column in to.cat_names and column not in ['fiModelDesc', 'fiBaseModel']:
        best_splits.append([column, get_cat_best_split(right_xs, right_y, column)])
    if column in to.cont_names:
        best_splits.append([column, get_cont_best_split(right_xs, right_y, column)])

lowest_mse = 10000
for split in best_splits:
    if len(split[1]) > 0:
        if min(split[1][1]) < lowest_mse:
            lowest_mse = min(split[1][1])
            best_split_column = split[0]
            best_split_value = split[1][0]
best_split_column, lowest_mse, best_split_value
('saleElapsed', 0.0004528304, 607910400.0)
The final split (to reach four leaf nodes like the textbook) is on saleElapsed.
mask = right_xs.saleElapsed <= best_split_value
left_xs = right_xs[mask]
right_xs = right_xs[~mask]
left_y = right_y[mask]
right_y = right_y[~mask]
mse(left_y, left_y.mean()), mse(right_y, right_y.mean())
(0.0004528304, 0.4609359)
left_y.mean(), right_y.mean()
(9.776848, 10.110054)
len(left_xs), len(right_xs)
(2, 994)
Plotting the decision tree (manually) to visualize the splits:
My model seems to be making incremental splits (peeling off 2 rows at each split) based on very small sample sizes. One question I have is whether the decision tree is always supposed to select the minimum-MSE split, or whether it should choose among the many columns with similarly small MSE values. Another question: am I supposed to exclude a column from future splits once a split has been made on that column (which is what I’m doing currently)? I’ll be looking for these answers when watching the Lesson 6 video.
Decision Tree Algorithm: After Watching Lesson 6
The algorithm presented in the Lesson 6 video (from Jeremy’s notebook How random forests really work) evaluates a split based on a weighted standard deviation calculation.
For a given condition (for example Age <= 6) Jeremy calculates the following:
- A “side score,” which is the product of the standard deviation of the dependent variable and the number of rows, computed over the rows that satisfy the condition.
- A side score for the rows that don’t satisfy the condition.
- The sum of both side scores divided by the number of total rows in the dependent variable column.
Then, for each column, he calculates the score for a split at each unique value of the column and takes the minimum score (and the corresponding split value). The column (and value) whose split has the smallest score is chosen as the best split.
Conceptually, the column (and value) with the smallest weighted standard deviation is the one where splitting creates two groups whose dependent-variable values are each relatively close together.
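In formula form (my own rendering of the description above, not Jeremy’s notation), the score for splitting a column at a value v is:

$$\text{score}(v) = \frac{\sigma(y_{\text{lhs}}) \cdot n_{\text{lhs}} + \sigma(y_{\text{rhs}}) \cdot n_{\text{rhs}}}{n_{\text{lhs}} + n_{\text{rhs}}}$$

where lhs holds the rows with values less than or equal to v, rhs holds the rest, σ is the standard deviation of the dependent variable within a side, and n is that side’s row count (a side with one or zero rows contributes 0 to the numerator).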
I’ll start by coding the _side_score function Jeremy defined, which calculates the product of the standard deviation of the dependent variable and the number of rows in the given split side:
def _side_score(side, y):
    tot = side.sum()
    if tot<=1: return 0
    return y[side].std()*tot

Next, the function score, which calculates the sum of each side’s _side_score divided by the total number of rows, or in other words, the weighted average of each side’s standard deviation:
def score(col, y, split):
    lhs = col<=split
    return (_side_score(lhs,y) + _side_score(~lhs,y))/len(y)

I’ll reload the data and take the same subset as above so I can compare the splits this algorithm makes to the ones made previously:
to = load_pickle("to.pkl")
xs,y = to.train.xs, to.train.y
valid_xs, valid_y = to.valid.xs, to.valid.y
xs_sub = xs.sample(1000, random_state=42)
y_sub = y[xs_sub.index]
mse(y_sub, y_sub.mean()), y_sub.mean()
(0.45866725, 10.109871)
score(xs['UsageBand'], y, 2)
0.6922498415264088
I’ll walk through each step of the score calculation manually to visualize it better.
First we create a boolean array of rows that satisfy or don’t satisfy the condition:
lhs = xs['UsageBand'] <= 2
lhs
0 True
1 True
2 True
3 True
4 False
...
412693 True
412694 True
412695 True
412696 True
412697 True
Name: UsageBand, Length: 404710, dtype: bool
Next: for how many rows is the condition met?
tot = lhs.sum()
tot
370444
What is the standard deviation of the dependent variable for rows where the condition is met?
y[lhs].std()
0.69022495
Finally, what is the side score for this condition?
side_score_lhs = y[lhs].std()*tot
side_score_lhs
255689.68972754478
We ask the same questions for the rows where the condition is NOT met:
tot = (~lhs).sum()
tot
34266
y[~lhs].std()
0.71414065
side_score_rhs = y[~lhs].std()*tot
side_score_rhs
24470.743636608124
Finally, we add the two side scores together and divide by the number of total rows:
(side_score_lhs + side_score_rhs) / len(y)
0.6922498415264088
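As a quick sanity check (my own addition), the hand-computed value should match the score function called earlier:
# The manual calculation should equal score()'s output for the same split; expect True
np.isclose((side_score_lhs + side_score_rhs) / len(y), score(xs['UsageBand'], y, 2))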
Thankfully, this is equal to the output from the score function! I’ll continue with the min_col function, which calculates this score for a split at each of the unique values in a column and returns the best one:
def min_col(xs, nm, y):
    col = xs[nm]
    unq = col.dropna().unique()
    scores = np.array([score(col, y, o) for o in unq if not np.isnan(o)])
    idx = scores.argmin()
    return unq[idx],scores[idx]

Let’s test it out on UsageBand:
min_col(xs, 'UsageBand', y)
(0, 0.6921555579861831)
Again, I’ll go line-by-line through the code of min_col to make sure I understand what’s going on:
col = xs['UsageBand']
col
0 2
1 2
2 1
3 1
4 3
..
412693 0
412694 0
412695 0
412696 0
412697 0
Name: UsageBand, Length: 404710, dtype: int8
What are the unique values in this column?
unq = col.dropna().unique()
unq
array([2, 1, 3, 0], dtype=int8)
If we split the data on each of the unique values, what is the score for each split?
scores = np.array([score(col, y, o) for o in unq if not np.isnan(o)])
scores
array([0.69224984, 0.69378292, 0.69430405, 0.69215556])
What is the minimum score, and what value does it correspond to?
idx = scores.argmin()
idx
3
unq[idx], scores[idx]
(0, 0.6921555579861831)
Next, I’ll iterate through all of the columns in the data and calculate the minimum score (and corresponding value) for each one. I’ll use the smaller subset of the data since the dictionary comprehension was taking more than an hour to compute on the full dataset.
col_scores = {o: min_col(xs_sub, o, y_sub) for o in xs_sub.columns}
col_scores
{'UsageBand': (0, 0.6710787683725357),
'fiModelDesc': (150, 0.6684566705226899),
'fiBaseModel': (30, 0.66775039935112),
'fiSecondaryDesc': (15, 0.6461849688887596),
'fiModelSeries': (60, 0.6612360656261445),
'fiModelDescriptor': (9, 0.6385975532531738),
'ProductSize': (0, 0.649804151058197),
'fiProductClassDesc': (7, 0.6468342208862304),
'state': (52, 0.6769107410311699),
'ProductGroup': (3, 0.6413102557659149),
'ProductGroupDesc': (3, 0.6413102557659149),
'Drive_System': (3, 0.6595019570589066),
'Enclosure': (3, 0.6443507042527199),
'Forks': (0, 0.6468727025985718),
'Pad_Type': (0, 0.6484307520389557),
'Ride_Control': (0, 0.6693848296999931),
'Stick': (0, 0.6484307520389557),
'Transmission': (6, 0.6698495596647263),
'Turbocharged': (0, 0.6484307520389557),
'Blade_Extension': (0, 0.6660390158891678),
'Blade_Width': (2, 0.665115752696991),
'Enclosure_Type': (0, 0.6660390158891678),
'Engine_Horsepower': (0, 0.6660390158891678),
'Hydraulics': (0, 0.6497658799290658),
'Pushblock': (0, 0.6660390158891678),
'Ripper': (0, 0.6618279729485512),
'Scarifier': (0, 0.6660390158891678),
'Tip_Control': (0, 0.6660390158891678),
'Tire_Size': (3, 0.6607923060655594),
'Coupler': (2, 0.6719830477237702),
'Coupler_System': (0, 0.5992864975929261),
'Grouser_Tracks': (0, 0.5992864975929261),
'Hydraulics_Flow': (0, 0.5992864975929261),
'Track_Type': (1, 0.6633642191886902),
'Undercarriage_Pad_Width': (5, 0.6722126321792603),
'Stick_Length': (1, 0.6726506268978119),
'Thumb': (0, 0.6727646491527557),
'Pattern_Changer': (0, 0.6727646491527557),
'Grouser_Type': (0, 0.6727646491527557),
'Backhoe_Mounting': (0, 0.6719208754301071),
'Blade_Type': (5, 0.6712375184893609),
'Travel_Controls': (2, 0.6719208754301071),
'Differential_Type': (0, 0.6718101676702499),
'Steering_Controls': (0, 0.6718101676702499),
'saleIs_month_end': (1, 0.6775472568273544),
'saleIs_month_start': (1, 0.6773856681585312),
'saleIs_quarter_end': (1, 0.6774418947696685),
'saleIs_quarter_start': (2, 0.6775886416435242),
'saleIs_year_end': (1, 0.6775886416435242),
'saleIs_year_start': (1, 0.6775886416435242),
'auctioneerID_na': (2, 0.6775886416435242),
'MachineHoursCurrentMeter_na': (1, 0.6731610369682312),
'SalesID': (1614635, 0.6669842964410782),
'MachineID': (1010754, 0.6537257554531097),
'ModelID': (4336, 0.6421609473228455),
'datasource': (132, 0.674677339553833),
'auctioneerID': (1.0, 0.674844207584858),
'YearMade': (1974, 0.6595817529559136),
'MachineHoursCurrentMeter': (3360.0, 0.6672791358232498),
'saleYear': (1991, 0.6751361751556396),
'saleMonth': (4, 0.6770030355453491),
'saleWeek': (17, 0.6766363668441773),
'saleDay': (10, 0.6758146023750305),
'saleDayofweek': (4, 0.6748345303535461),
'saleDayofyear': (10, 0.6763967939466238),
'saleElapsed': (688348800.0, 0.6746526069641113)}
I’ll then find the split across all columns with the lowest score:
min_col_score = min(col_scores.values())
min_col_names = [key for key in col_scores if col_scores[key] == min_col_score]
min_col_score, min_col_names
((0, 0.5992864975929261),
['Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow'])
The lowest score came from a <=0 split for three columns which, interestingly enough, were found to be redundant features in the textbook chapter. I’ll pick Coupler_System since that’s the split used in the textbook.
xs_sub.Coupler_System.unique()
array([0, 1, 2], dtype=int8)
To answer one of my earlier questions: I think we only remove a column from future splits if it is a binary column (such as the Sex column in the Titanic dataset), since there are only two possible groups and, once the data has been split into those two groups, there’s nothing left to split on in that column.
mask = xs_sub.Coupler_System <= 0
lhs = xs_sub[mask]
rhs = xs_sub[~mask]
lhs_y = y_sub[mask]
rhs_y = y_sub[~mask]
len(lhs), len(rhs)
(904, 96)
lhs_y.mean(), rhs_y.mean()
(10.208925, 9.177121)
mse(lhs_y, lhs_y.mean()), mse(rhs_y, rhs_y.mean())
(0.39364752, 0.10852089)
The left-hand side has many more rows, so I’ll continue splitting that dataset and keep rhs as a leaf node.
Coupler_System only has one value, 0, in this group so I’ll remove it from future splits.
lhs.Coupler_System.unique()
array([0], dtype=int8)
lhs = lhs.drop("Coupler_System", axis=1)
'Coupler_System' in lhs.columns
False
col_scores = {o: min_col(lhs, o, lhs_y) for o in lhs.columns}
min_col_score = min(col_scores.values())
min_col_names = [key for key in col_scores if col_scores[key] == min_col_score]
min_col_score, min_col_names
((0, 0.5869915567140663), ['Pad_Type', 'Stick', 'Turbocharged'])
Again we have three columns with the same split score. I’ll choose Pad_Type as the column to split on.
lhs.Pad_Type.unique()
array([2, 0, 3, 4], dtype=int8)
mask = lhs.Pad_Type <= 0
rhs = lhs[~mask]
lhs = lhs[mask]
rhs_y = lhs_y[~mask]
lhs_y = lhs_y[mask]
len(lhs), len(rhs)
(700, 204)
lhs_y.mean(), rhs_y.mean()
(10.30318, 9.885499)
mse(lhs_y, lhs_y.mean()), mse(rhs_y, rhs_y.mean())
(0.43736917, 0.108533934)
Again, the left-hand side has more rows, so I’ll continue to split it and keep rhs as the second leaf node. I’ll also remove Pad_Type from lhs since it has the single value 0.
lhs = lhs.drop("Pad_Type", axis=1)
'Pad_Type' in lhs.columns
False
col_scores = {o: min_col(lhs, o, lhs_y) for o in lhs.columns}
min_col_score = min(col_scores.values())
min_col_names = [key for key in col_scores if col_scores[key] == min_col_score]
min_col_score, min_col_names
((0, 0.6552193513938359),
['Drive_System',
'Blade_Extension',
'Enclosure_Type',
'Engine_Horsepower',
'Scarifier',
'Tip_Control'])
I now have six columns that have the same split score. I’ll choose the first one, Drive_System, to make my final split and get my last two leaf nodes.
mask = lhs.Drive_System <= 0
rhs = lhs[~mask]
lhs = lhs[mask]
rhs_y = lhs_y[~mask]
lhs_y = lhs_y[mask]
len(lhs), len(rhs)
(638, 62)
lhs_y.mean(), rhs_y.mean()
(10.275307, 10.590003)
mse(lhs_y, lhs_y.mean()), mse(rhs_y, rhs_y.mean())
(0.41287065, 0.59920347)
Here is a manually made visualization of this four-leaf-node decision tree:
Decision Tree Algorithm: Using sklearn
The last decision tree I’ll create, for comparison with the other two approaches, uses sklearn’s DecisionTreeRegressor limited to four leaf nodes.
m = DecisionTreeRegressor(max_leaf_nodes=4).fit(xs_sub, y_sub);
from sklearn.tree import export_graphviz
import graphviz
def draw_tree(t, df, size=10, ratio=0.6, precision=2, **kwargs):
    s=export_graphviz(t, out_file=None, feature_names=df.columns, filled=True, rounded=True,
                      special_characters=True, rotate=False, precision=precision, **kwargs)
    return graphviz.Source(re.sub('Tree {', f'Tree {{ size={size}; ratio={ratio}', s))

draw_tree(m, xs_sub, size=10)

Final Thoughts
In this exercise, I tried three different algorithms to create a four-leaf-node decision tree on a 1000-sample subset of the Blue Book for Bulldozers Kaggle competition dataset:
- An algorithm I created where the model made splits based on the lowest MSE.
- Jeremy’s “side score” algorithm which found the split with the lowest weighted standard deviation.
- sklearn.DecisionTreeRegressor, which also uses MSE to decide splits.
Here are some key observations:
- My algorithm made splits with very small samples. To improve this, I would either explicitly ignore splits with few samples or penalize such splits by weighting by the number of rows like Jeremy did in _side_score.
- The sklearn.DecisionTreeRegressor had the same initial split as Jeremy’s (on Coupler_System) but then chose two completely different columns to split on for the remainder of the leaf nodes. I wonder if this difference in column selection is due to the difference in how a split is scored between those two algorithms.
- A future improvement I could make to this experiment is to compare the MSE on a validation set for each decision tree (a rough sketch of this comparison for the sklearn tree follows this list).
- Another future improvement would be to make my algorithm and Jeremy’s algorithm run faster so that I can test them on the full dataset in a reasonable amount of time.
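As a minimal sketch of that validation comparison (my own addition, shown only for the sklearn tree m fit above, with an r_mse helper similar to the textbook’s):
# Root-mean-squared error of the sklearn tree (trained on the 1,000-row subset)
# on the held-out validation set built earlier.
def r_mse(pred, y): return round(np.sqrt(((pred - y)**2).mean()), 6)
r_mse(m.predict(valid_xs), valid_y)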
This was another challenging and enjoyable “Further Research” exercise provided by the fastai textbook. I hope you enjoyed this blog post!