In a previous post I showed you how to make quick and dirty dummy variables with list comprehensions. I called that the dumb way because you have to hard-code what goes in the column(s). This post is about how to do it a smarter way, where the data itself dictates what dummy variables get made. All examples use the Titanic dataset from Kaggle.
df.head()
This method is useful for data sets with categorical columns that have many distinct values. As an example, let's use the last names of the passengers.
df['family_name'] = [x.split(',')[0] for x in df['Name']]
This should be somewhat familiar from the previous post. The list comprehension splits the Name data where it finds a comma, and returns a list of items found. This code only keeps the first item, which in this data set is passenger last name.
df.head()
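If you don't have the Titanic file handy, here is a minimal sketch of the same split on a toy data frame. The rows are made up (they only mimic the "Last, Title. First" format), and the `.str` version is an alternative I'm adding for comparison, not something from the previous post.

```python
import pandas as pd

# Made-up rows in the same "Last, Title. First" format as the Name column.
toy = pd.DataFrame({
    "Name": [
        "Smith, Mr. John",
        "Smith, Mrs. Jane",
        "Jones, Miss. Anna",
    ]
})

# List-comprehension version, as used in this post:
# split on the comma and keep the first piece.
toy["family_name"] = [x.split(",")[0] for x in toy["Name"]]

# Equivalent using the pandas string accessor.
toy["family_name_str"] = toy["Name"].str.split(",").str[0]

print(toy[["family_name", "family_name_str"]])
```

Both columns should hold the same last names; the list comprehension is just the more explicit of the two.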
This is how we will get the top five family names. Let's look at it and go through it piece by piece.
df['family_name'].value_counts()[:5].index.values
This code counts how many of each value are in the column and sorts the result from most to least frequent:
df['family_name'].value_counts()
Then keeps the first five from that list
[:5]
which gives a pandas Series, as you can see here:
df['family_name'].value_counts()[:5]
To get only the index of this Series, which holds the family names themselves, use
.index
Adding .values returns the underlying array of names. Altogether it adds up to:
df['family_name'].value_counts()[:5].index.values
Maybe this explains why there aren’t very many Skoogs in the United States.
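To watch each piece of that chain on something small, here is a rough sketch with a made-up list of names. The names and the choice of a top two instead of a top five are my own assumptions for the demo. I use `.iloc[:2]` here, which is the explicit positional form of the `[:5]` slice above.

```python
import pandas as pd

# Made-up family names: Smith x3, Jones x2, Lee x1.
names = pd.Series(
    ["Smith", "Jones", "Smith", "Lee", "Smith", "Jones"],
    name="family_name",
)

counts = names.value_counts()   # Series of counts, most frequent first
top2 = counts.iloc[:2]          # keep the first two rows by position
top2_names = top2.index.values  # just the labels, without the counts

print(counts)
print(type(top2))
print(top2_names)
```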
Anyway, we don’t want to know this information ourselves; we want our model to know it. To avoid copy/pasting values into a list comprehension, we can loop over these values and make the dummies at the same time.
Top5Fam = df['family_name'].value_counts()[:5].index  # store the top five family names
# make a column in the data frame for each value, and fill it with the dummy info
for fam_name in Top5Fam:
    df[fam_name] = [x == fam_name for x in df['family_name']]
df.head()
Ta Da!
This method is useful when you want to be able to prep multiple data sets, or when you want to make a whole lot of dummy variables.
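For the multiple-data-sets case, one way to package the loop is a small helper. This is only a sketch: the function name `add_top_n_dummies`, its parameters, and the train/test framing are all hypothetical, and it swaps the list comprehension for a vectorized `==` comparison. The idea is to pick the top values once and reuse that same list on the second data set so both end up with the same columns.

```python
import pandas as pd

def add_top_n_dummies(df, column, n=5, categories=None):
    """Return (copy of df with one boolean column per value, list of values).

    Hypothetical helper: if `categories` is None, the n most frequent values
    of `column` are used; otherwise the given values are used as-is.
    """
    if categories is None:
        categories = df[column].value_counts().index[:n]
    out = df.copy()
    for value in categories:
        out[value] = out[column] == value  # True where the row matches
    return out, list(categories)

# Made-up example data.
train = pd.DataFrame({"family_name": ["Smith", "Smith", "Jones", "Lee", "Jones", "Smith"]})
test = pd.DataFrame({"family_name": ["Jones", "Brown"]})

train_d, top = add_top_n_dummies(train, "family_name", n=2)
test_d, _ = add_top_n_dummies(test, "family_name", categories=top)

print(train_d)
print(test_d)
```

Passing `categories=top` to the second call is the design choice that matters here: the second data set gets columns for the values chosen from the first, even if its own most common names are different.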