Top 5 Dummies of all time

In a previous post I showed you how to make quick and dirty dummy variables with list comprehensions. I called that the dumb way because you have to hard code what goes in the column(s). This post is about a smarter way, where the data itself dictates which dummy variables get made. All examples use the Titanic dataset from Kaggle.

In [6]:
df.head()
Out[6]:
             Survived  Pclass  Name                                               Sex     Age
PassengerId
1            0         3       Braund, Mr. Owen Harris                            male    22.0
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0
3            1         3       Heikkinen, Miss. Laina                             female  26.0
4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0
5            0         3       Allen, Mr. William Henry                           male    35.0

This method is useful for data sets with categorical data that takes many different values. As an example, let's use the last names of the passengers.

In [7]:
df['family_name'] = [x.split(',')[0] for x in df['Name']]

This should be somewhat familiar from the previous post. The list comprehension splits each Name value where it finds a comma, which returns a list of the pieces. This code keeps only the first piece, which in this data set is the passenger's last name.
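If you want to try the split on its own, here is a minimal sketch. The three-row `sample` frame is a made-up stand-in for the Titanic data, and the `family_name_str` column is only there to compare against pandas' built-in string accessor, which should give the same result without a list comprehension.

```python
import pandas as pd

# Hypothetical three-row stand-in for the Titanic data
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Allen, Mr. William Henry",
    ]
})

# List comprehension, as in the post: keep the text before the first comma
sample["family_name"] = [x.split(",")[0] for x in sample["Name"]]

# Vectorized alternative using the pandas string accessor
sample["family_name_str"] = sample["Name"].str.split(",").str[0]

print(sample[["family_name", "family_name_str"]])
```

One difference worth knowing: the `.str` version passes missing names through as NaN, while the list comprehension raises an error on them.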

In [9]:
df.head()
Out[9]:
             Survived  Pclass  Name                                               Sex     Age   family_name
PassengerId
1            0         3       Braund, Mr. Owen Harris                            male    22.0  Braund
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  Cumings
3            1         3       Heikkinen, Miss. Laina                             female  26.0  Heikkinen
4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  Futrelle
5            0         3       Allen, Mr. William Henry                           male    35.0  Allen

This is how we will get the top five family names. Let's look at it and go through it piece by piece.

In [21]:
df['family_name'].value_counts()[:5].index.values

This code counts how many times each value appears in the column, sorted from most common to least common
df['family_name'].value_counts()

Then this keeps the first five rows of that result
[:5]

which is a pandas Series, as you can see here:

In [20]:
df['family_name'].value_counts()[:5]
Out[20]:
Andersson    9
Sage         7
Carter       6
Goodwin      6
Skoog        6
Name: family_name, dtype: int64

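To get only the index of this series (the family names themselves), use
.index

Adding .values on the end turns that index into a plain NumPy array.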

Altogether it adds up to:

In [13]:
df['family_name'].value_counts()[:5].index.values
Out[13]:
array(['Andersson', 'Sage', 'Carter', 'Goodwin', 'Skoog'], dtype=object)
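Here is a small sketch of the same chain, one step at a time, so you can check what each piece returns. The `names` series is made up, with deliberately different counts so there are no ties to worry about.

```python
import pandas as pd

# Hypothetical family names with distinct counts so the order is unambiguous
names = pd.Series(["Sage"] * 3 + ["Carter"] * 2 + ["Allen"], name="family_name")

counts = names.value_counts()    # pandas Series, most common value first
top2 = counts[:2]                # slicing keeps the first two rows
top2_names = top2.index.values   # NumPy array holding just the labels

print(type(counts).__name__)
print(list(top2_names))
```

`counts.head(2)` selects the same rows as `counts[:2]`, if you find that easier to read.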

Maybe this explains why there aren’t very many Skoogs in the United States.

Anyway, we don’t want to know this information ourselves; we want our model to know it. To keep from copy/pasting values into a list comprehension, we can loop over these values and make the dummies at the same time.

In [22]:
Top5Fam = df['family_name'].value_counts()[:5].index # store the top five names (a pandas Index)

# make a column in the data frame for each name, True where the row has that family name
for fam_name in Top5Fam:
    df[fam_name] = [x == fam_name for x in df['family_name']]

In [23]:
df.head()
Out[23]:
             Survived  Pclass  Name                                               Sex     Age   family_name  Andersson  Sage   Carter  Goodwin  Skoog
PassengerId
1            0         3       Braund, Mr. Owen Harris                            male    22.0  Braund       False      False  False   False    False
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  Cumings      False      False  False   False    False
3            1         3       Heikkinen, Miss. Laina                             female  26.0  Heikkinen    False      False  False   False    False
4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  Futrelle     False      False  False   False    False
5            0         3       Allen, Mr. William Henry                           male    35.0  Allen        False      False  False   False    False

Ta Da!

This method is useful when you want to be able to prep multiple data sets, or when you want to make a whole lot of dummy variables.
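For the multiple data sets case, one way to package the idea is sketched below. The helper names `top_values` and `add_dummies` are my own inventions, and the tiny `train` and `test` frames are made up for illustration. The point is that the top names are picked once and then reused, so every data set ends up with the same dummy columns.

```python
import pandas as pd

def top_values(series, n=5):
    """Return the n most common values in a Series as a list."""
    return series.value_counts().head(n).index.tolist()

def add_dummies(df, column, values):
    """Return a copy of df with one boolean column per entry in values."""
    out = df.copy()
    for value in values:
        out[value] = out[column] == value
    return out

# Hypothetical train/test frames standing in for real data files
train = pd.DataFrame(
    {"family_name": ["Sage", "Sage", "Sage", "Carter", "Carter", "Allen"]}
)
test = pd.DataFrame({"family_name": ["Carter", "Braund"]})

top = top_values(train["family_name"], n=2)  # chosen from train only
train_prepped = add_dummies(train, "family_name", top)
test_prepped = add_dummies(test, "family_name", top)

print(test_prepped)
```

Picking the names from the training data only is an assumption about how you would use this with a model. It keeps the columns identical across data sets even when a name never shows up in the second one.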