Work in progress

Book version 0.3.1

6 Categorically Speaking

Terence Parr and Jeremy Howard

Copyright © 2018-2019 Terence Parr. All rights reserved.
Please don't replicate on web or redistribute in any way.
This book generated from markup+markdown+python+latex source with Bookish.



“Coming up with features is difficult, time-consuming, requires expert domain knowledge. When working applications of learning, we spend a lot of time tuning the features.” (Andrew Ng in a Stanford plenary talk on AI, 2011)

Creating a good model is more about feature engineering than it is about choosing the right model; well, assuming your go-to model is a good one like Random Forest. Feature engineering means improving, acquiring, and even synthesizing features that are strong predictors of your model's target variable. Synthesizing features means deriving new features from existing features or injecting features from other data sources. For example, we could synthesize the name of an apartment's New York City neighborhood from its latitude and longitude. It doesn't matter how sophisticated our model is if we don't give it something useful to chew on. If there is no relationship to discover, because the features are not predictive, no machine learning model is going to give accurate predictions.

So far we've dropped nonnumeric features, such as apartment description strings and manager ID categorical values, because machine learning models can only learn from numeric values. But, there is potentially predictive power that we could exploit in these string values. In this chapter, we're going to learn the basics of feature engineering in order to squeeze a bit more juice out of the nonnumeric features found in the New York City apartment rent data set.

Most of the predictive power for rent price comes from apartment location, number of bedrooms, and number of bathrooms, so we shouldn't expect a massive boost in model performance. The primary goal of this chapter is to learn the techniques for use on real problems in your daily work. Or, if you've entered a Kaggle competition, the difference between the top spot and position 1000 is often razor thin, so even a small advantage from feature engineering could be useful in that context.

6.1 Getting a baseline

In order to measure any improvements in model accuracy, let's get a baseline using just the cleaned up numeric features from rent.csv in the data directory underneath where we started Jupyter. (See Section 3.2.1 Loading and sniffing the training data for instructions on downloading JSON rent data from Kaggle and creating the CSV files.) As we did in Chapter 5 Exploring and Denoising Your Data Set, let's load the rent data, strip extreme prices, and remove apartments not in New York City:

import pandas as pd

df = pd.read_csv("data/rent.csv", parse_dates=['created'])
df_clean = df[(df.price>1_000) & (df.price<10_000)]
df_clean = df_clean[(df_clean.longitude!=0) | (df_clean.latitude!=0)]
df_clean = df_clean[(df_clean['latitude']>40.55) &
                    (df_clean['latitude']<40.94) &
                    (df_clean['longitude']>-74.1) &
                    (df_clean['longitude']<-73.67)]
df = df_clean

Now train an RF using just the numeric features and print the out-of-bag (OOB) score:

from sklearn.ensemble import RandomForestRegressor

numfeatures = ['bathrooms', 'bedrooms', 'longitude', 'latitude']
X, y = df[numfeatures], df['price']
rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True)
rf.fit(X, y)
oob_baseline = rf.oob_score_
print(oob_baseline)


The 0.868 OOB score is pretty good as it's close to 1.0. (Recall that a score of 0 indicates our model does no better than simply guessing the average rent price for all apartments and 1.0 means a perfect predictor of rent price.) Let's also get an idea of how much work the RF has to do in order to capture the relationship between those features and rent price. The rfpimp package provides a few simple measures we can use: the total number of nodes in all decision trees of the forest and the height (in nodes) of the typical tree.

print(f"{rfnnodes(rf):,d} tree nodes and {np.median(rfmaxdepths(rf))} median tree height")

2,433,290 tree nodes and 35.0 median tree height

The tree height matters because that is the path taken by the RF prediction mechanism and so tree height affects prediction speed.
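To make the score scale mentioned above concrete, here is a minimal sketch of the R^2 computation in plain Python. The function name r_squared and the rent prices are made up for illustration; sklearn computes the same quantity for us in practice.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical rent prices, made up for illustration
prices = [2000, 2500, 3000, 4500]
avg = sum(prices) / len(prices)

always_avg = r_squared(prices, [avg] * len(prices))  # guess the mean every time
perfect = r_squared(prices, prices)                  # predict every price exactly
print(always_avg, perfect)
```

Guessing the average for every apartment scores 0 and a perfect predictor scores 1, which is why we read 0.868 as "close to perfect" rather than as a percentage of correct answers.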

Let's also get a baseline for feature importances so that, as we introduce new features, we can get a sense of their predictive power. Longitude and latitude are really one meta-feature called “location,” so we can combine those two using the features parameter to the importances() function. We're going to ask for feature importances a lot in this chapter so let's make a handy function:

def showimp(rf, X, y):
    features = list(X.columns)
    features.remove('latitude')
    features.remove('longitude')
    features += [['latitude','longitude']]
    I = importances(rf, X, y, features=features)
    plot_importances(I, color='#4575b4')

showimp(rf, X, y)

Now that we have a good benchmark for model and feature performance, let's try to improve our model by converting some existing nonnumeric features into numeric features.

6.2 Encoding categorical variables

The interest_level feature is a categorical variable that seems to encode interest in an apartment, no doubt taken from webpage activity logs. It's a categorical variable because it takes on values from a finite set of choices: low, medium, and high. More specifically, interest_level is an ordinal categorical variable, which means that the values can be ordered even if they are not actual numbers. Looking at the count of each ordinal value is a good way to get an overview of an ordinal feature:


low       33270
medium    11203
high       3827
Name: interest_level, dtype: int64

Ordinal variables are the easiest to convert from strings to numbers because we can simply assign a different integer for each possible ordinal value. One way to do the conversion is to use the Pandas map() function and a dictionary argument:

df['interest_level'] = df['interest_level'].map({'low':1,'medium':2,'high':3})
print(df['interest_level'].value_counts())

1    33270
2    11203
3     3827
Name: interest_level, dtype: int64

The key here is to ensure that the numbers we use for each category maintain the same ordering so, for example, medium's value of 2 is bigger than low's value of 1. We could also have chosen {'low':10,'medium':20,'high':30} as the encoding because RFs care about the order and not the scale of the features.
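As a quick check of the order-versus-scale claim, the following sketch encodes a made-up sample of interest levels both ways and compares how the rows rank. It also shows a pitfall worth knowing about: map() silently produces NaN for any value missing from the dictionary.

```python
import pandas as pd

levels = pd.Series(['low', 'high', 'medium', 'low'])  # made-up sample

enc_small = levels.map({'low': 1, 'medium': 2, 'high': 3})
enc_large = levels.map({'low': 10, 'medium': 20, 'high': 30})

# Both encodings rank the rows identically, so any threshold split a
# decision tree can make on one has an equivalent split on the other.
same_order = enc_small.rank().equals(enc_large.rank())
print(same_order)

# A value missing from the dictionary silently becomes NaN, so it is
# worth checking for missing values after calling map().
unmapped = pd.Series(['low', 'unknown']).map({'low': 1, 'medium': 2, 'high': 3})
print(unmapped.isna().tolist())
```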

Let's see how an RF model performs using the numeric features and this newly converted interest_level feature:

def test(X, y):
    rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True)
    rf.fit(X, y)
    oob = rf.oob_score_
    n = rfnnodes(rf)
    h = np.median(rfmaxdepths(rf))
    print(f"OOB R^2 {oob:.5f} using {n:,d} tree nodes with {h} median tree height")
    return rf, oob

X, y = df[['interest_level']+numfeatures], df['price']
rf, oob = test(X, y)

OOB R^2 0.87057 using 3,024,590 tree nodes with 35.0 median tree height

That 0.871 score is only a little bit better than our baseline of 0.868, but it's still useful. As you approach an R^2 of 1.0, it gets harder and harder to nudge model performance. The difference in scores actually represents a few percentage points of the remaining accuracy, so we can be happy with this bump. Also, remember that using a single OOB score is a fairly blunt metric that aggregates the performance of the model on roughly 50,000 records. It's possible that a certain type of apartment got a really big accuracy boost. The only cost to this accuracy boost is an increase in the number of tree nodes, but the typical decision tree height in the forest remains the same as our baseline. Prediction using this model should be just as fast as the baseline.

The easy way to remember the difference between ordinal and nominal variables is that ordinal variables have order and nominal comes from the word for “name” in Latin (nomen) or French (nom).

Ordinal categorical variables are simple to encode numerically. Unfortunately, there's another kind of categorical variable called a nominal variable for which there is no meaningful order between the category values. For example, in this data set, columns manager_id, building_id, and display_address are nominal features. Without an order between categories, it's hard to encode nominal variables as numbers in a meaningful way, particularly when there are very many category values:

print(len(df['manager_id'].unique()), len(df['building_id'].unique()), len(df['display_address'].unique()))

(3409, 7417, 8692)

So-called high-cardinality (many-valued) categorical variables such as manager_id, building_id, and display_address are best handled using more advanced techniques such as embeddings or mean encoding as we'll see in Section 6.5 Target encoding categorical variables, but we'll introduce two simple encoding approaches here that are often useful.

The first technique is called label encoding and simply converts each category to a numeric value, as we did before, while ignoring the fact that the categories are not really ordered. Sometimes an RF can get some predictive power from features encoded in this way but typically requires larger trees in the forest.

To label encode any categorical variable, convert the column to an ordered categorical variable and then convert the strings to the associated categorical code (which is computed automatically by Pandas):

df['display_address_cat'] = df['display_address'].astype('category').cat.as_ordered()
df['display_address_cat'] = df['display_address_cat'].cat.codes + 1

The category codes start with 0 and the code for “not a number” (nan) is -1. To bring everything into the range 0 or above, we add one to the category code. (Sklearn has equivalent functionality in its OrdinalEncoder transformer but, at the time of writing, it can't handle object columns with both integers and strings, and it raises an error for missing values represented as np.nan.)
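Here is the same label-encoding recipe applied to a tiny, made-up column so the codes are easy to inspect. Pandas sorts the category values, so 'Broadway' gets code 0 and 'W 4th St' gets code 1 before we add one.

```python
import numpy as np
import pandas as pd

addresses = pd.Series(['W 4th St', np.nan, 'Broadway', 'W 4th St'])  # made-up

cat = addresses.astype('category').cat.as_ordered()
codes = cat.cat.codes + 1   # missing (-1) becomes 0; real categories start at 1
print(codes.tolist())
```

After the shift, 0 is reserved for missing values and every real category has a positive code.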

Let's see if this new feature improves performance over the baseline model:

X, y = df[['display_address_cat']+numfeatures], df['price']
rf, oob = test(X, y)

OOB R^2 0.86572 using 3,121,640 tree nodes with 38.0 median tree height

Unfortunately, the score is roughly the same as the baseline's score and it increases the tree height. Not a good trade.

Let's try another encoding approach called frequency encoding and apply it to the manager_id feature. Frequency encoding converts categories to the frequencies with which they appear in the training set. For example, here are the category value counts for the top 5 manager_ids:


e6472c7237327dd3903b3d6f6a94515a    2509
6e5c10246156ae5bdcd9b487ca99d96a     695
8f5a9c893f6d602f4953fcc0b8e6e9b4     404
62b685cc0d876c3a1a51d63a0d6a8082     396
cb87dadbca78fad02b388dc9e8f25a5b     370
Name: manager_id, dtype: int64

The idea behind this transformation is that there might be predictive power in the number of apartments managed by a particular manager.

Function value_counts() gives us the encoding and from there we can use map() to transform the manager_id into a new column called mgr_apt_count:

managers_count = df['manager_id'].value_counts()
df['mgr_apt_count'] = df['manager_id'].map(managers_count)

Again, we could actually divide the counts by len(df), but that just scales the column and won't affect predictive power.
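One detail to keep in mind when frequency encoding is what happens to categories that appear only in a validation or test set. The sketch below uses made-up manager IDs and treats an unseen manager as having zero listings. That fill value is an assumption chosen for illustration, not the only reasonable choice.

```python
import pandas as pd

train = pd.DataFrame({'manager_id': ['a', 'a', 'b', 'a', 'c']})
valid = pd.DataFrame({'manager_id': ['a', 'c', 'z']})  # 'z' never seen in training

# Counts come from the training set only
counts = train['manager_id'].value_counts()
train['mgr_apt_count'] = train['manager_id'].map(counts)

# Unseen categories map to NaN; here we treat them as zero listings
valid['mgr_apt_count'] = valid['manager_id'].map(counts).fillna(0).astype(int)
print(valid)
```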

Let's see how a model does with both of these categorical variables encoded:

X, y = df[['display_address_cat','mgr_apt_count']+numfeatures], df['price']
rf, oob = test(X, y)

OOB R^2 0.86374 using 4,572,060 tree nodes with 40.0 median tree height

The model accuracy does not improve and the complexity of the trees goes up, so we should avoid introducing these features. The feature importance plot confirms that these features are unimportant.

We can conclude that, for this data set, label encoding and frequency encoding of high-cardinality categorical variables are not helpful. These encoding techniques could, however, be useful on other data sets so it's worth learning them. We'll learn about another categorical variable encoding technique called “one-hot encoding” in Chapter 8 Dealing with Missing Data, but we avoid one-hot encoding here because it would add thousands of new columns to our data set.

6.3 Extracting features from strings

The apartment data set also has some variables that are both nonnumeric and noncategorical, description and features. Such arbitrary strings have no obvious numerical encoding, but we can extract bits of information from them to create new features. For example, an apartment with parking, doorman, dishwasher, and so on might fetch a higher price so let's synthesize some Boolean features derived from string features.

First let's normalize the string columns to be lowercase and convert any missing values (represented as “not a number” np.nan values) to be empty strings; otherwise, applying split() to the columns will fail because split() does not apply to floating-point numbers (np.nan).

df['description'] = df['description'].fillna('')
df['description'] = df['description'].str.lower() # normalize to lower case
df['features'] = df['features'].fillna('')        # fill missing w/blanks
df['features'] = df['features'].str.lower()       # normalize to lower case

Now we can create new columns by applying contains() to the string columns, once for each new column:

# has apartment been renovated?
df['renov'] = df['description'].str.contains("renov")

for w in ['doorman', 'parking', 'garage', 'laundry',
          'Elevator', 'fitness center', 'dishwasher']:
    # features was lowercased above, so search for the lowercased word
    df[w] = df['features'].str.contains(w.lower())

df[['doorman', 'parking', 'garage', 'laundry']].head(5)

Another trick is to count the number of words in a string:

df["num_desc_words"] = df["description"].apply(lambda x: len(x.split()))
df["num_features"] = df["features"].apply(lambda x: len(x.split(",")))

There's not much we can do with the list of photo URLs in the photos string column, but the number of photos might be slightly predictive of price, so let's create a feature for that as well:

df["num_photos"] = df["photos"].apply(lambda x: len(x.split(",")))
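One subtlety with counting via split(","): splitting an empty string yields a list containing one empty string, so a record with no features or photos is counted as having one. An RF only cares about order, so this shift is probably harmless here, but if we want true counts, a small helper like the following sketch can treat blanks as zero. The name count_items is made up for illustration, and the sketch assumes the items are separated by commas.

```python
def count_items(s, sep=","):
    """Count separator-delimited items, treating blank strings as zero items."""
    return len([item for item in s.split(sep) if item.strip()])

naive = len("".split(","))      # "".split(",") is [""], so this counts one item
empty = count_items("")         # blank string counts as zero items
three = count_items("doorman, elevator,laundry")
print(naive, empty, three)
```

We could then write df["features"].apply(count_items) in place of the lambda.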

Let's see how our new features affect performance above the baseline numerical features:

textfeatures = [
    'num_photos', 'num_desc_words', 'num_features',
    'doorman', 'parking', 'garage', 'laundry',
    'Elevator', 'fitness center', 'dishwasher',
    'renov'
]
X, y = df[textfeatures+numfeatures], df['price']
rf, oob = test(X, y)

OOB R^2 0.86265 using 4,770,356 tree nodes with 44.0 median tree height

showimp(rf, X, y)

It's disappointing that the accuracy does not improve with these features and that the complexity of the model is higher, but this does provide useful marketing information. Apartment features such as parking, laundry, and dishwashers are attractive but there's little evidence that people are willing to pay more for them (except maybe for having a doorman).

6.4 Synthesizing numeric features

If you grew up with lots of siblings, you're likely familiar with the notion of waiting to use the bathroom in the morning. Perhaps there is some predictive power in the ratio of bedrooms to bathrooms, so let's try synthesizing a new column that is the ratio of two numeric columns:

df["beds_to_baths"] = df["bedrooms"]/(df["bathrooms"]+1) # avoid div by 0
X, y = df[['beds_to_baths']+numfeatures], df['price']
rf, oob = test(X, y)

OOB R^2 0.86848 using 2,433,550 tree nodes with 35.0 median tree height

Combining numerical columns is a useful technique to keep in mind, but unfortunately this combination doesn't affect our model significantly here. Perhaps we'd have better luck combining a numeric column with the target, such as the ratio of bedrooms to price:

df["beds_per_price"] = df["bedrooms"] / df["price"]
X, y = df[['beds_per_price']+numfeatures], df['price']
rf, oob = test(X, y)

OOB R^2 0.98635 using 1,316,466 tree nodes with 31.0 median tree height

Wow! That's almost a perfect score, which should trigger an alarm that it's too good to be true. In fact, we do have an error, but it's not a code bug. We have effectively copied the price column into the feature set, kind of like studying for a quiz by looking at the answers. This is a form of data leakage, which is a general term for the use of features that directly or indirectly hint at the target variable.
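To see why this counts as copying the answers, note that the model also has the bedrooms column, so price can be recovered with a single division. A tiny sketch with made-up rents:

```python
bedrooms = [1, 2, 3]
prices = [2400, 3000, 4500]   # made-up rents

# The leaky feature, computed the same way as in the chapter
beds_per_price = [b / p for b, p in zip(bedrooms, prices)]

# Inverting the "feature" recovers the target whenever bedrooms > 0
recovered = [b / r for b, r in zip(bedrooms, beds_per_price)]
print(recovered)
```

The one exception is a studio apartment with zero bedrooms: its ratio is 0 regardless of price, so nothing leaks for those records.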

To illustrate how this leakage causes overfitting, let's split out 20% of the data as a validation set and compute the beds_per_price feature for the training set:

from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.20)
df_train = df_train.copy()
df_train['beds_per_price'] = df_train['bedrooms'] / df_train["price"]
df_train[['beds_per_price','bedrooms']].head(5)

As a general principle, you can't compute features for the validation set using information from the validation set. Only information computed directly from the training set can be used during feature engineering.

While we do have price information for the validation set in df_test, we can't use it to compute beds_per_price for the validation set. In production, the model will not have the validation set prices and so df['beds_per_price'] must be created only from the training set. (If we had apartment prices in practice, we wouldn't need the model.) To avoid this common pitfall, it's helpful to imagine computing features for and sending validation records to the model one by one, rather than all at once.

Once we have the beds_per_price feature for the training set, we can compute a dictionary mapping bedrooms to the beds_per_price feature. Then, we can synthesize the beds_per_price in the validation set using map() on the bedrooms column:

bpmap = dict(zip(df_train["bedrooms"],df_train["beds_per_price"]))
df_test = df_test.copy()
df_test["beds_per_price"] = df_test["bedrooms"].map(bpmap)
avg = np.mean(df_test['beds_per_price'])
# assign the result rather than filling in place on a column selection
df_test['beds_per_price'] = df_test['beds_per_price'].fillna(avg)
print(df_test['beds_per_price'].head(5))

15200    0.000714
48176    0.000000
16152    0.001111
33336    0.000295
35903    0.000000
Name: beds_per_price, dtype: float64

The fillna() code deals with the situation where the validation set has a number of bedrooms that is not in the training set; for example, sometimes the validation set has an apartment with 7 bedrooms, but there is no bedroom key equal to 7 in the bpmap.

Now that we have beds_per_price for both training and validation sets, we can train an RF model using just the training set and evaluate its performance using just the validation set:

X_train, y_train = df_train[['beds_per_price']+numfeatures], df_train['price']
X_test, y_test = df_test[['beds_per_price']+numfeatures], df_test['price']
rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
rf.fit(X_train, y_train)
oob_overfit = rf.score(X_test, y_test) # validation R^2; don't test training set
print(f"OOB R^2 {oob_overfit:.5f}")
print(f"{rfnnodes(rf):,d} nodes, {np.median(rfmaxdepths(rf))} median height")

OOB R^2 0.05721
1,154,512 nodes, 31.0 median height

An R^2 of 0.057 on the validation set is terrible and indicates that the model lacks generality. In this situation, overfitting means that the model has put too much emphasis on the beds_per_price feature, which is strongly predictive of price in the training set but not in the validation set. (The feature was computed using just data from the training set.) We get a whiff of overfitting even in the complexity of the model, which has about half as many nodes as a model trained just on the numeric features. It's not always an error to derive features from the target; we just have to be more careful.

6.5 Target encoding categorical variables

Creating features that incorporate information about the target variable is called target encoding and is often used to derive features from categorical variables to great effect. One of the most common target encodings is called mean encoding, which replaces each category value with the average target value associated with that category. For example, building managers in our apartment data set might rent apartments in certain price ranges. The manager IDs by themselves carry little predictive power but converting IDs to the average rent price for apartments they manage could be predictive. Similarly, certain buildings might have more expensive apartments than others. The average rent price per manager is easy enough to get with Pandas by grouping the data by manager_id and asking for the mean.
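As a minimal sketch of that grouping step, here is a naive mean encoding on a made-up data frame. The frame df_toy and the column mgr_mean_price are names invented for this illustration.

```python
import pandas as pd

# Made-up listings: three managers with different price ranges
df_toy = pd.DataFrame({
    'manager_id': ['a', 'a', 'b', 'b', 'c'],
    'price':      [2000, 3000, 5000, 7000, 4000],
})

# Average rent per manager, then map each row's manager to that average
mean_price = df_toy.groupby('manager_id')['price'].mean()
df_toy['mgr_mean_price'] = df_toy['manager_id'].map(mean_price)
print(df_toy)
```

Notice that manager c's encoded value is exactly that apartment's own price, because the manager has only one listing. This naive version lets each record's target leak into its own feature.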


Unfortunately, as we saw in the last section, it's easy to overfit models when incorporating target information. Preventing overfitting is nontrivial and it's best to rely on a vetted library, such as the category_encoders package from the scikit-learn-contrib project. (To prevent overfitting, the idea is to compute the mean from a subset of the training data targets for each category.) You can install category_encoders with pip on the command line:

pip install category_encoders
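To illustrate the idea of computing each category mean from a subset of the training targets, here is a sketch of one generic approach, often called out-of-fold mean encoding: each record's encoded value is computed only from records in other folds. This is not how category_encoders implements TargetEncoder; the function oof_mean_encode and its parameters are made up for illustration.

```python
import numpy as np
import pandas as pd

def oof_mean_encode(df, col, target, n_folds=5, seed=0):
    """Encode df[col] with per-category target means computed out of fold."""
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, n_folds, size=len(df))   # random fold id per row
    global_mean = df[target].mean()
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for k in range(n_folds):
        in_fold = fold == k
        # category means from every row NOT in fold k
        means = df.loc[~in_fold].groupby(col)[target].mean()
        encoded.loc[in_fold] = df.loc[in_fold, col].map(means)
    # categories never seen outside a row's fold fall back to the global mean
    return encoded.fillna(global_mean)

# Made-up data: building 'a' has 20 listings, building 'z' has just one
toy = pd.DataFrame({'building_id': ['a'] * 20 + ['z'],
                    'price': [1000] * 20 + [9999]})
enc = oof_mean_encode(toy, 'building_id', 'price')
print(enc.tail(3))
```

The single listing in building z can no longer see its own price; it falls back to the overall average instead, which is the behavior we want for rare categories.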

{TODO: Maybe show my mean encoder, which is much faster and worth explaining somewhere; see rent/mean-encoder.ipynb. Seems like we need an alpha parameter that pulls rare categories towards the average, which is supposed to work better.}

Here's how to use the TargetEncoder object to encode a categorical variable from the data set, building_id, and get an OOB score:

from category_encoders.target_encoder import TargetEncoder

df = df.reset_index() # not sure why TargetEncoder needs this but it does
targetfeatures = ['building_id']
encoder = TargetEncoder(cols=targetfeatures)
encoder.fit(df, df['price'])
df_encoded = encoder.transform(df, df['price'])
X, y = df_encoded[targetfeatures+numfeatures], df['price']
rf, oob = test(X, y)

OOB R^2 0.87191 using 2,746,422 tree nodes with 39.0 median tree height

That score is a bit better than the baseline of 0.868 for numeric-only features. Given the tendency to overfit the model with target encoding, however, it's a good idea to test the model with a validation set. Let's split out 20% as a validation set and get a baseline for numeric features:

df_train, df_test = train_test_split(df, test_size=0.20)

# TargetEncoder needs the resets, not sure why
df_train = df_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

X_train = df_train[numfeatures]
y_train = df_train['price']
X_test = df_test[numfeatures]
y_test = df_test['price']

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
rf.fit(X_train, y_train)
s_validation = rf.score(X_test, y_test)
print(f"{s_validation:4f} score {rfnnodes(rf):,d} tree nodes and {np.median(rfmaxdepths(rf))} median tree height")

0.868370 score 2,129,230 tree nodes and 35.0 median tree height

The validation score and the OOB score are very similar, confirming that OOB scores are an excellent approximation to validation scores. With that baseline, let's see what happens when we properly target encode the validation set, that is, using only data from the training set (warning: this takes several minutes to execute):

enc = TargetEncoder(cols=targetfeatures)
enc.fit(df_train, df_train['price'])   # learn the encoding from training data only
df_train_encoded = enc.transform(df_train, df_train['price'])
df_test_encoded = enc.transform(df_test)

X_train = df_train_encoded[targetfeatures+numfeatures]
y_train = df_train_encoded['price']
X_test = df_test_encoded[targetfeatures+numfeatures]
y_test = df_test_encoded['price']

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
rf.fit(X_train, y_train)
s_tenc_validation = rf.score(X_test, y_test)
print(f"{s_tenc_validation:.4f} score {rfnnodes(rf):,d} tree nodes and {np.median(rfmaxdepths(rf))} median tree height")

0.8663 score 2,383,668 tree nodes and 38.5 median tree height

The validation score for numeric plus target-encoded feature (0.866) is less than the validation score for numeric-only features (0.868). The model finds the target-encoded feature strongly predictive of the training set prices, causing it to overfit by overemphasizing this feature. This loss of generality explains the drop in validation scores. The feature importance graph provides evidence of this overemphasis because it shows the target-encoded building_id feature as the most important.

While not beneficial for this data set, target encoding is reported to be useful by many practitioners and Kaggle competition winners. It's worth knowing about this technique and learning to apply it properly (computing validation set features using data only from the training set).

When we've exhausted our bag of tricks deriving features from a given data set, sometimes it's fruitful to inject features derived from external data sources.

6.6 Injecting external neighborhood info

Our data set has longitude and latitude coordinates, but a more obvious price predictor would be a categorical variable identifying the neighborhood because some neighborhoods are more desirable than others. Given the trouble we've seen with categorical variables above, though, a numeric feature would be more useful. Instead of identifying the neighborhood, let's use proximity to highly desirable neighborhoods as a numeric feature. Forbes magazine has an article, The Top 10 New York City Neighborhoods to Live In, According to the Locals, from which we can get neighborhood names. Then, using a mapping website, we can estimate the longitude and latitude of those neighborhoods and record them like this:

hoods = {
    "hells" : [40.7622, -73.9924],
    "astoria" : [40.7796684, -73.9215888],
    "Evillage" : [40.723163774, -73.984829394],
    "Wvillage" : [40.73578, -74.00357],
    "LowerEast" : [40.715033, -73.9842724],
    "UpperEast" : [40.768163594, -73.959329496],
    "ParkSlope" : [40.672404, -73.977063],
    "Prospect Park" : [40.93704, -74.17431],
    "Crown Heights" : [40.657830702, -73.940162906],
    "financial" : [40.703830518, -74.005666644],
    "brooklynheights" : [40.7022621909, -73.9871760513],
    "gowanus" : [40.673, -73.997]
}

To synthesize new features, we compute the so-called Manhattan distance (also called L1 distance) from each apartment to each neighborhood center, which measures the number of blocks one must walk to reach the neighborhood. In contrast, the Euclidean distance would measure distance as the crow flies, cutting through the middle of buildings.

for hood,loc in hoods.items():
    # compute manhattan distance
    df[hood] = np.abs(df.latitude - loc[0]) + np.abs(df.longitude - loc[1])
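To compare the two distance measures on a single pair of points, here is a small sketch in plain Python. The coordinates are made up, and the helper names manhattan and euclidean are just for illustration.

```python
import math

def manhattan(p, q):
    """L1 distance: sum of absolute coordinate differences."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def euclidean(p, q):
    """L2 distance: straight-line ("as the crow flies") distance."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

apt = (40.75, -73.98)    # made-up apartment (latitude, longitude)
hood = (40.72, -73.94)   # made-up neighborhood center

d_l1 = manhattan(apt, hood)   # about 0.03 + 0.04 = 0.07 degrees
d_l2 = euclidean(apt, hood)   # about 0.05 degrees
print(d_l1, d_l2)
```

Both distances are in degrees rather than blocks or miles, and a degree of longitude covers less ground than a degree of latitude at New York's latitude. Since an RF cares about order and not scale, this rough proxy is likely good enough for our purposes.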

Training a model on the numeric features and these new neighborhood features gives us a decent bump in performance:

hoodfeatures = list(hoods.keys())
X, y = df[numfeatures+hoodfeatures], df['price']
rf, oob_hood = test(X, y)

OOB R^2 0.87276 using 2,413,238 tree nodes with 43.0 median tree height

An OOB score of 0.873 is noticeably better than the baseline 0.868 for numeric features. The median tree height increased significantly (from 35 to 43), but the number of nodes is about the same. The model works a little bit harder to associate similar apartments with similar rent prices, but it's worth the extra complexity for the performance boost.

The new proximity features effectively triangulate an apartment relative to desirable neighborhoods, which means we probably don't need apartment longitude and latitude anymore. Dropping longitude and latitude and retraining a model shows a similar OOB score and shallower trees:

X = X.drop(['longitude','latitude'],axis=1)
rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True)
rf.fit(X, y)
print(f"{rf.oob_score_:.4f} score {rfnnodes(rf):,d} tree nodes and {np.median(rfmaxdepths(rf))} median tree height")

0.8705 score 2,422,502 tree nodes and 40.0 median tree height

Injecting outside data is definitely worth trying as a general rule. As another example, consider predicting sales at a store or website. We've found that injecting columns indicating paydays or national holidays often helps predict fluctuations in sales volume that are not explained well with existing features.

6.7 Our final model

We've explored a lot of common feature engineering techniques for categorical variables in this chapter, so let's put them all together to see how a combined model performs:

X = df[['interest_level']+textfeatures+hoodfeatures+numfeatures]
rf, oob_combined = test(X, y)

OOB R^2 0.87869 using 4,840,370 tree nodes with 44.0 median tree height

While OOB score 0.879 is not too much larger than our numeric-only model baseline of 0.868 in absolute terms, it means we've squeezed an extra 7.966% relative accuracy out of our data (computed using ((oob_combined-oob_baseline) / (1-oob_baseline))*100).
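The relative-accuracy formula is easy to wrap in a helper so we can apply it to any pair of scores. The sketch below uses the rounded scores printed in this chapter, so its results differ slightly from the figure above, presumably because that figure was computed from the unrounded oob_baseline. The function name relative_gain is made up for illustration.

```python
def relative_gain(score, baseline):
    """Percent of the baseline's remaining error (1 - baseline) that was removed."""
    return (score - baseline) / (1 - baseline) * 100

gain_interest = relative_gain(0.87057, 0.868)   # interest_level model vs. baseline
gain_combined = relative_gain(0.87869, 0.868)   # combined model vs. baseline
print(f"{gain_interest:.1f}% and {gain_combined:.1f}%")
```

Framing the improvement this way makes small absolute differences near 1.0 easier to compare across experiments.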

Something to notice is that the RF model does not get confused with a combination of all of these features, which is one of the reasons we recommend RFs. RFs simply ignore features without much predictive power. The feature importance graph to the right indicates what the model finds predictive. For model accuracy reasons, it's a good idea to keep all features, but we might want to remove unimportant features to simplify the model for interpretation purposes (when explaining model behavior).

6.8 Summary of categorical feature engineering

{TODO: Summarize techniques from this chapter}