Brice Wolfgang
a Data scientist

Welcome to

Business Casual

By Start Bootstrap

SQL EDA Looper

This is a function I used in my General Assembly data science Immersive capstone project! The project used NYC cab data which is available from Google BigQuery.

As with any data science endeavor you have to start by getting the shape of the data. This function does that for several columns over several tables using SQL. The trick here is it puts the output in one easy to read pandas data frame.

Here is the function. It uses dictionaries so that you can decide what the short hand name of each column and table will be. Its optimized for readability! The function takes a dictionary for tables and columns. Setting up the dictionaries is shown below.

def sql_eda(tbl_dict, cols_dict, verbose=False):
  '''Iterates overe talbes and columns returns max and min values in a df'''
  sql_string = 'SELECT {0}({1}) FROM `{2}`'# This is the base sql query
  eda_df = pd.DataFrame() # This will hold infromation about each table
  for tn,tbl in tbl_dict.iteritems():
    if verbose:print; print 'Table: ', tbl

    # add table name to sql string
    tbl_sql = sql_string.format('{0}','{1}',tbl)

    for cn,col in cols_dict.iteritems():

      # add column to sql string
      col_sql = tbl_sql.format('{0}',col)

      # Feel free to add more column wise
      # SQL summry functions to this list! 
      for func in ['MAX','MIN']:

        # add summary function to sql string
        func_sql = col_sql.format(func)

        # make the name for the output df

        if verbose: print cname; print func_sql

        # the results 
        # Tables and rows are checked in a random way since they are drawn from
        #    dictionaries. This cleans it up
        eda_df = eda_df.reindex_axis(sorted(eda_df.columns), axis=1)
        eda_df = eda_df.reindex_axis(sorted(eda_df.index), axis=0)
  return eda_df

Bikeshare tabe

A simple example that looks at NYC bike share data

# This is the human readable name and Big Query name of the table of interest
bike_table = {'bike':'citibike_trips_copy'} 

# These are the columns of interest 
'e_lat':'end_station_latitude', 'e_lon':'end_station_longitude'}
sql_eda(bike_table, bike_cols, )
e_lat_max 40.8042
e_lat_min 0
e_lon_max 0
e_lon_min -74.0836
e_time_max 2016-10-04 09:10:01
e_time_min 2013-07-01 00:04:02
s_lat_max 40.8042
s_lat_min 40.4453
s_lon_max -73.9285
s_lon_min -74.0254
s_time_max 2016-09-30 23:59:51
s_time_min 2013-07-01 00:00:00

You can already see errors in the geographic data that would bias any analysis towards the west coast of Africa. a.k.a. there are points with a Lat Lon of (0,0) and there are no bike share stations there.