PPOL564 - Data Science I: Foundations

Lecture 10

Data Exploration

Concepts Covered Today:

  • Initial Exploration of a Dataset
  • Basics of Visualization
  • Approaches to Data Exploration
    • Numeric/Categorical
    • Univariate/Bivariate

Setup

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # for plotting
import seaborn as sns # for plotting
import scipy.stats as stats # for calculating the quantiles for a QQ plot
import requests

# Print all columns from the Pandas DataFrame
pd.set_option('display.max_columns', None) 

# Suppress all warnings (used here mainly to hide Seaborn's future update warnings)
import warnings
warnings.filterwarnings("ignore")
In [ ]:
def download_data(git_loc, dest_name):
    '''
    Download data from GitHub and save it to the notebook's working directory.
    '''
    req = requests.get(git_loc)
    req.raise_for_status()  # stop early if the download failed
    # Write the whole response body at once rather than one character at a time
    with open(dest_name, "w") as file:
        file.write(req.text)

download_data('https://raw.githubusercontent.com/edunford/ppol564/master/lectures/lecture_10/country_data.csv',
              "country_data.csv")
In [2]:
# Read in Data 
dat = pd.read_csv("country_data.csv")

Data Exploration

Dimensions

In [3]:
dat.shape
Out[3]:
(13855, 12)
In [4]:
dat.columns
Out[4]:
Index(['country', 'ccode', 'year', 'polity', 'gdppc', 'pop', 'continent',
       'regime_type', 'infant_mort', 'life_exp', 'life_exp_female',
       'life_exp_male'],
      dtype='object')
In [5]:
dat.index
Out[5]:
RangeIndex(start=0, stop=13855, step=1)

Determine the unit of observation

In [6]:
dat.head()
Out[6]:
country ccode year polity gdppc pop continent regime_type infant_mort life_exp life_exp_female life_exp_male
0 Afghanistan 700 1800.0 -6.0 211.276682 1.564573e+07 Asia authoritarian NaN NaN NaN NaN
1 Afghanistan 700 1801.0 -6.0 211.276682 1.564573e+07 Asia authoritarian NaN NaN NaN NaN
2 Afghanistan 700 1802.0 -6.0 211.276682 1.564573e+07 Asia authoritarian NaN NaN NaN NaN
3 Afghanistan 700 1803.0 -6.0 211.276682 1.564573e+07 Asia authoritarian NaN NaN NaN NaN
4 Afghanistan 700 1804.0 -6.0 211.276682 1.564573e+07 Asia authoritarian NaN NaN NaN NaN
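
The rows above suggest that the unit of observation is the country-year. One way to check such a guess is to test whether the candidate key identifies every row uniquely. The following is a minimal sketch on a toy DataFrame; the values and the names `toy`, `key`, and `dupes` are made up for illustration and are not part of the lecture data.

```python
import pandas as pd

# Toy stand-in for the lecture data (values are made up for illustration).
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B"],
    "year": [1800, 1801, 1800, 1801, 1801],
    "polity": [-6.0, -6.0, 3.0, 3.0, 4.0],
})

# Candidate unit of observation: one row per country-year.
key = ["country", "year"]

# duplicated(keep=False) flags every row that shares its key with another row.
dupes = toy[toy.duplicated(subset=key, keep=False)]
is_unique = dupes.empty

print(f"Rows uniquely identified by {key}: {is_unique}")
print(dupes)
```

If `dupes` is empty, the key is a plausible unit of observation; otherwise the printed rows show where it breaks down. On the real data the same check would be run with `dat` in place of `toy`.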

Explore the data types

  • What are the data types in the data?
  • Are they correct?
In [7]:
dat.dtypes
Out[7]:
country             object
ccode                int64
year               float64
polity             float64
gdppc              float64
pop                float64
continent           object
regime_type         object
infant_mort        float64
life_exp           float64
life_exp_female    float64
life_exp_male      float64
dtype: object

Clean:

  • year to integer type
  • country to categorical type
  • continent to categorical type
  • regime_type to categorical type
In [8]:
dat.year = dat.year.astype("int")
dat.country = dat.country.astype("category")
dat.continent = dat.continent.astype("category")
dat.regime_type = dat.regime_type.astype("category")
In [9]:
dat.dtypes
Out[9]:
country            category
ccode                 int64
year                  int64
polity              float64
gdppc               float64
pop                 float64
continent          category
regime_type        category
infant_mort         float64
life_exp            float64
life_exp_female     float64
life_exp_male       float64
dtype: object
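
Casting a float column such as year to an integer type only works cleanly when the column has no fractional values, and a plain integer type cannot hold missing values. A minimal sketch of a check before casting, on a made-up Series (the name `year` and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Toy column (made-up values): whole-number years stored as floats, one missing.
year = pd.Series([1800.0, 1801.0, np.nan, 1803.0])

# Check the two conditions that matter for an integer cast.
has_missing = year.isna().any()
all_whole = (year.dropna() % 1 == 0).all()

if not all_whole:
    raise ValueError("year has fractional values; inspect them before casting")

# A plain int64 cannot represent missing values; pandas' nullable
# "Int64" dtype stores them as <NA> instead.
year_int = year.astype("Int64" if has_missing else "int64")

print(year_int.dtype)
print(year_int)
```

In the lecture data the cast to `"int"` succeeds, which implies that year has no missing values there; the nullable type is only needed when that is not the case.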

Categorical variables are similar to factor variables in R: each value is stored as an integer code that maps to a category label.

In [10]:
dat.continent.unique()
Out[10]:
[Asia, Europe, Africa, Americas, Oceania]
Categories (5, object): [Asia, Europe, Africa, Americas, Oceania]
In [11]:
dat.continent.cat.codes.unique()
Out[11]:
array([2, 3, 0, 1, 4])
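
The codes are positions in the list of categories, which pandas sorts alphabetically by default when strings are converted with `astype("category")`. A small sketch that recovers the code-to-label mapping, using a made-up Series rather than the lecture data:

```python
import pandas as pd

# Toy column with the same five labels, in the order they appear in the data.
continent = pd.Series(
    ["Asia", "Europe", "Africa", "Americas", "Oceania", "Asia"]
).astype("category")

# Each code is the index of the label within .cat.categories.
code_to_label = dict(enumerate(continent.cat.categories))

print(code_to_label)
print(continent.cat.codes.tolist())
```

This is why Asia, the first value in the data, has code 2: it is third in the alphabetical list of categories.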

Coverage

Depending on the unit of analysis:

  • what's the temporal coverage of the data?
  • what's the spatial coverage of the data?
In [12]:
min_year = dat.year.min()
max_year = dat.year.max()
print(f"The data ranges from {min_year} to {max_year}.")
The data ranges from 1800 to 2016.
In [13]:
dat.country.unique()
Out[13]:
[Afghanistan, Albania, Algeria, Angola, Argentina, ..., Uruguay, United States, Venezuela, Zambia, Zimbabwe]
Length: 122
Categories (122, object): [Afghanistan, Albania, Algeria, Angola, ..., United States, Venezuela, Zambia, Zimbabwe]

There are 122 countries in the data.

Are all countries in the data for the same years? A simple way to explore this is to plot the spatial unit (if it is fixed and not too large) against the temporal unit.

In [22]:
sns.set_context("notebook", font_scale=2)
g = sns.relplot(x="year", y="country",
                hue = "continent",
                kind="scatter",
                height=30,s=200,
                data=dat.sort_values('continent'))
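
A tabular complement to the plot is to summarize, for each country, its first and last year and whether any years are missing in between. The sketch below uses a made-up DataFrame; the column names `first_year`, `last_year`, `n_years`, `expected`, and `has_gaps` are illustrative choices, not part of the lecture data.

```python
import pandas as pd

# Toy panel (made-up values): country B is missing the year 1801.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "year": [1800, 1801, 1802, 1800, 1802],
})

# One row per country: first year, last year, and number of distinct years.
coverage = toy.groupby("country")["year"].agg(
    first_year="min", last_year="max", n_years="nunique"
)

# Number of years expected if coverage were continuous from first to last.
coverage["expected"] = coverage["last_year"] - coverage["first_year"] + 1
coverage["has_gaps"] = coverage["n_years"] < coverage["expected"]

print(coverage)
```

Countries with different `first_year` or `last_year` values enter or leave the data at different times, and `has_gaps` flags interrupted series that can be hard to spot in a dense plot.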