I work with Dummies

To make good models you often have to make a lot of dummy variables. I’ve learned there are smart ways to create them and dumb ways to create them. FYI, all the examples here use the pandas Python package. The dumbest way, as any programmer will tell you, is to hard-code it. This is true, but in some situations you just want to make a column of dummy variables quickly and not look back. For that I’m a big fan of list comprehensions. As you will see, these list comprehensions are easy to read: first you get the data frame name and the column being created, then the logic that is applied to every item, then the data frame and column each item was drawn from. Also, because each one assigns to a named column, running it more than once overwrites that column rather than creating duplicates.

In [2]:
# The sample data frame
animals
Out[2]:
    name     species       land_sea_air  age  sex  owner               owner_type
id
1   Arthur   Aardvark      L             2    M    Atlanta zoo         1
2   Bubbles  Badger        L             4    F    Bob Saget           2
3   Chaucer  Chimpanzee    L             8    M    Curie Institute     1
4   Daisy    Dog           L             3    F    Daniel Craig        2
5   Elwood   Electric Eel  S             11   M    El Dorado Aquarium  1
6   Felix    Ferret        L             5    M    Will Ferrell        2
7   Gerry    Gecko         L             5    M    North Georgia Zoo   1
8   Hellen   Humming Bird  A             3    F    Halley Berry        2
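
If you want to follow along, here is a minimal sketch that rebuilds the sample data frame. The values are copied from the printed table above; how the original frame was actually loaded is not shown, so the constructor call and the integer types for age and owner_type are my assumptions.

```python
import pandas as pd

# Rebuild the sample frame from the printed table.
# Assumption: age and owner_type are integers and "id" is the index.
animals = pd.DataFrame(
    {
        "name": ["Arthur", "Bubbles", "Chaucer", "Daisy",
                 "Elwood", "Felix", "Gerry", "Hellen"],
        "species": ["Aardvark", "Badger", "Chimpanzee", "Dog",
                    "Electric Eel", "Ferret", "Gecko", "Humming Bird"],
        "land_sea_air": ["L", "L", "L", "L", "S", "L", "L", "A"],
        "age": [2, 4, 8, 3, 11, 5, 5, 3],
        "sex": ["M", "F", "M", "F", "M", "M", "M", "F"],
        "owner": ["Atlanta zoo", "Bob Saget", "Curie Institute", "Daniel Craig",
                  "El Dorado Aquarium", "Will Ferrell", "North Georgia Zoo",
                  "Halley Berry"],
        "owner_type": [1, 2, 1, 2, 1, 2, 1, 2],
    },
    index=pd.Index(range(1, 9), name="id"),
)

print(animals)
```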

This list comprehension checks for one value in a column and creates a boolean column. The most obvious use of this is making a column to represent sex.

In [3]:
animals['is_male'] = [x == 'M' for x in animals['sex']]
animals.head(3)
Out[3]:
    name     species       land_sea_air  age  sex  owner               owner_type  is_male
id
1   Arthur   Aardvark      L             2    M    Atlanta zoo         1           True
2   Bubbles  Badger        L             4    F    Bob Saget           2           False
3   Chaucer  Chimpanzee    L             8    M    Curie Institute     1           True

Here is a version of that list comprehension that cleans off extra spaces and forces the original value to upper case, then returns one for true and zero for false.

In [ ]:
animals['is_male'] = [int(x.strip().upper() == 'M')
                      for x 
                      in animals['sex']]

In this way you can deal with a lot of little issues in one fast line of code. For those of you with big data, see this Stack Exchange article about the fastest way to do .strip().upper().
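
One alternative worth knowing is the pandas .str accessor, which does the same cleanup without a Python-level loop and does not choke on missing values (calling x.strip() on a NaN raises an error). This is only a sketch: the messy sex column below is made up for illustration, and whether this is actually faster depends on your data, so time it before you commit.

```python
import numpy as np
import pandas as pd

# Hypothetical messy column: stray spaces, mixed case, and one missing value.
sex = pd.Series([" m", "F ", "M", np.nan, "f"], name="sex")

# Same logic as int(x.strip().upper() == 'M'), applied to the whole column.
# A missing value compares as False, so it ends up as 0.
is_male = (sex.str.strip().str.upper() == "M").astype(int)

print(is_male.tolist())
```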

This is also useful for creating dummy variables that show whether a value falls in a range, such as ages over 5.

In [4]:
animals['is_old'] = [x > 5 for x in animals['age']]
animals.head(3)
Out[4]:
    name     species       land_sea_air  age  sex  owner               owner_type  is_male  is_old
id
1   Arthur   Aardvark      L             2    M    Atlanta zoo         1           True     False
2   Bubbles  Badger        L             4    F    Bob Saget           2           False    False
3   Chaucer  Chimpanzee    L             8    M    Curie Institute     1           True     True
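
The check above only has a lower bound. If you need both ends of a range, a chained comparison fits in the same pattern. In this sketch the column name is_mid_age and the bounds 3 and 5 are made up for illustration; Series.between is the built-in pandas equivalent and includes both endpoints by default.

```python
import pandas as pd

# Only the age column from the sample frame is needed here.
animals = pd.DataFrame({"age": [2, 4, 8, 3, 11, 5, 5, 3]})

# Chained comparison: True when 3 <= age <= 5 (hypothetical bounds).
animals["is_mid_age"] = [3 <= x <= 5 for x in animals["age"]]

# Built-in equivalent; both endpoints are included by default.
same_result = animals["age"].between(3, 5)

print(animals["is_mid_age"].tolist())
```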

But what if you want to check whether something is in a set of categorical values? For example, what if we wanted to make a dummy variable for sea and air animals?

In [5]:
animals['non_land'] = [x in ['S','A']
                       for x
                       in animals['land_sea_air']]
animals.head(3)
Out[5]:
    name     species       land_sea_air  age  sex  owner               owner_type  is_male  is_old  non_land
id
1   Arthur   Aardvark      L             2    M    Atlanta zoo         1           True     False   False
2   Bubbles  Badger        L             4    F    Bob Saget           2           False    False   False
3   Chaucer  Chimpanzee    L             8    M    Curie Institute     1           True     True    False
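
The first three rows are all land animals, so the output above never shows a True. Here is a small sketch over the full land_sea_air column that does, along with Series.isin, which is the built-in pandas way to run the same membership test. The variable names non_land_codes and via_isin are mine.

```python
import pandas as pd

# Only the land_sea_air column from the sample frame is needed here.
animals = pd.DataFrame({"land_sea_air": ["L", "L", "L", "L", "S", "L", "L", "A"]})

# Keeping the codes in a set gives the lookup a name and makes it easy to extend.
non_land_codes = {"S", "A"}
animals["non_land"] = [x in non_land_codes for x in animals["land_sea_air"]]

# Built-in equivalent of the membership test.
via_isin = animals["land_sea_air"].isin(["S", "A"])

print(animals["non_land"].tolist())
```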

In this way you have a lot of control over dummy variables beyond making one column for each value. FYI, if you do want to make a column for every value, see this great article by Chris Albon about pd.get_dummies().
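
For a quick taste of the one-column-per-value approach, here is a minimal sketch of pd.get_dummies() on the land_sea_air column. The prefix "lsa" and the name with_dummies are my own choices, and the dummy columns come back as booleans or small integers depending on your pandas version.

```python
import pandas as pd

# Only the land_sea_air column from the sample frame is needed here.
animals = pd.DataFrame({"land_sea_air": ["L", "L", "L", "L", "S", "L", "L", "A"]})

# One dummy column per distinct value, named <prefix>_<value>.
dummies = pd.get_dummies(animals["land_sea_air"], prefix="lsa")

# Join into a new frame so rerunning the cell does not collide with existing columns.
with_dummies = animals.join(dummies)

print(with_dummies.columns.tolist())
```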