Chapter 1

Percentiles and Boxplots
Frequency Table and Histograms

This blog post is my attempt to 1) learn the material in the book Practical Statistics for Data Scientists by Bruce and Bruce and 2) better learn the different Python packages needed for statistics and data science.

I have made every attempt to convert the code supplied in the book from R to Python correctly. Any mistakes and errors must be assumed to be mine and mine alone.

Covers pages 20 - 23

Objectives

  • Compute percentiles and create boxplots
  • Create a frequency table and corresponding histogram plot

Data set

  • state.csv

Packages

  • pandas
  • matplotlib
  • NumPy (will be installed with pandas)

The first thing to do is import all of the packages we need. Note the %matplotlib inline. This is so that Jupyter will display our plots within the notebook. It also needs to be the first line (see the 1st answer on Stack Overflow).

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

Read in the data as we did before. If you need to download the data see this post: (PSDS) Data Download

In [2]:
state = pd.read_csv('../data/state.csv')

Define a list that contains the desired quantiles

In [3]:
q = [0.05, 0.25, 0.5, 0.75, 0.95]
In [4]:
# See Table 1-4 on page 20
state["Murder.Rate"].quantile(q)
Out[4]:
0.05    1.600
0.25    2.425
0.50    4.000
0.75    5.550
0.95    6.510
Name: Murder.Rate, dtype: float64
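
To get a feel for what quantile is doing, here is a minimal sketch on a tiny made-up Series (the values are illustrative and not taken from state.csv). By default pandas interpolates linearly between the two nearest data points, which is also NumPy's default for percentile. The main difference to watch for is that pandas expects fractions between 0 and 1 while NumPy expects percentages between 0 and 100.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from state.csv
rates = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

q = [0.05, 0.25, 0.5, 0.75, 0.95]

# pandas works with fractions between 0 and 1
pandas_result = rates.quantile(q)

# NumPy's percentile works with percentages between 0 and 100
numpy_result = np.percentile(rates, [100 * x for x in q])

print(pandas_result)
print(numpy_result)
```

Both calls should agree, since they use the same linear interpolation by default.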

Boxplot

Pandas has integrated matplotlib almost seamlessly. Here I am using pandas' built-in boxplot method to create the plot.

In [5]:
ax = state.boxplot(column='Population', figsize=(3, 6), fontsize=14)
ax.set_ylabel('Population (millions)', fontsize=14)
# note 1e7 at top of box plot which dictates the significant numbers of the y axis
# 1e7 = 10,000,000
# 3.5 * 1e7 = 35,000,000 (i.e. 35 million)
Out[5]:
<matplotlib.text.Text at 0x7f1c210b57f0>
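
The axis label above says millions while the tick labels are raw counts with a 1e7 multiplier. One way to make the two agree, sketched here with a handful of made-up rows standing in for state.csv, is to divide the column by 1,000,000 before plotting. The Population_millions column name is my own invention for this example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for state.csv (only a few illustrative rows)
demo = pd.DataFrame({"Population": [710231, 2915918, 4779736, 6392017, 37253956]})

# Express the population in millions so the tick labels match the axis label
demo["Population_millions"] = demo["Population"] / 1_000_000

ax = demo.boxplot(column="Population_millions", figsize=(3, 6), fontsize=14)
ax.set_ylabel("Population (millions)", fontsize=14)
```

With the scaled column the y axis reads directly in millions, so the 1e7 note is no longer needed.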

You could use plain ol' matplotlib to generate the boxplot like this:

In [6]:
plt.boxplot(state.Population)
Out[6]:
{'boxes': [<matplotlib.lines.Line2D at 0x7f1c20f8ca58>],
 'caps': [<matplotlib.lines.Line2D at 0x7f1c20f98dd8>,
  <matplotlib.lines.Line2D at 0x7f1c20f9ec50>],
 'fliers': [<matplotlib.lines.Line2D at 0x7f1c20fa7cc0>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x7f1c20f9ee10>],
 'whiskers': [<matplotlib.lines.Line2D at 0x7f1c20f8cc18>,
  <matplotlib.lines.Line2D at 0x7f1c20f98c18>]}

As you can see, using plain matplotlib instead of the pandas boxplot method leaves something to be desired. It would take a few more commands to get the matplotlib boxplot to look like the one pandas generates out of the box (pun intended!)
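
For the curious, here is a rough sketch of what those few more commands might look like. It uses made-up numbers in place of state.csv, and it only approximates the pandas look (a column name on the x axis, larger fonts, and grid lines), so treat it as a starting point rather than an exact match.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the Population column of state.csv
population = pd.Series([710231, 2915918, 4779736, 6392017, 37253956])

fig, ax = plt.subplots(figsize=(3, 6))
ax.boxplot(population)

# Extras that pandas adds for us: the column name as the x tick label,
# a larger font, and grid lines
ax.set_xticklabels(["Population"], fontsize=14)
ax.tick_params(axis="y", labelsize=14)
ax.set_ylabel("Population", fontsize=14)
ax.grid(True)
```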

API for creating boxplots with pandas: pandas.DataFrame.boxplot
Box plot demos: Matplotlib.org

One other thing to mention, notice the ugly dictionary listing above the boxplot. Let's get rid of that. It is simple. Just add a semi-colon ';' after the command. Like this:

In [7]:
plt.boxplot(state.Population);

Frequency tables

I started down the wrong path when trying to recreate the frequency table, until I came across the pandas cut method in Wes McKinney's book Python for Data Analysis. If you don't know who Wes McKinney is, he is the creator of pandas.

In [8]:
state['bin_range'] = pd.cut(state.Population, 10, precision=0, include_lowest=True)
state.head(5)
Out[8]:
        State  Population  Murder.Rate Abbreviation                 bin_range
0     Alabama     4779736          5.7           AL    (4232659.0, 7901692.0]
1      Alaska      710231          5.6           AK     (526935.0, 4232659.0]
2     Arizona     6392017          4.7           AZ    (4232659.0, 7901692.0]
3    Arkansas     2915918          5.6           AR     (526935.0, 4232659.0]
4  California    37253956          4.4           CA  (33584923.0, 37253956.0]
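
To see how cut decides on the bins, here is a small sketch with made-up values chosen so the edges are easy to check by hand. Passing an integer asks for that many equal-width bins across the range of the data. pandas nudges the lowest edge slightly below the minimum so that the smallest value is not left out.

```python
import pandas as pd

# Hypothetical values chosen so the bin edges are easy to check by hand
values = pd.Series([0, 1, 2, 5, 9, 10])

# Two equal-width bins over the range 0 to 10
binned = pd.cut(values, 2, include_lowest=True)

# retbins=True additionally returns the bin edges as a NumPy array
_, edges = pd.cut(values, 2, include_lowest=True, retbins=True)

print(binned)
print(edges)
```

The middle edge lands at 5 and the top edge at 10, so 0, 1, 2 and 5 fall in the first bin and 9 and 10 in the second.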

Next I create a pandas Series to store the frequency of the bin ranges for the population.

In [9]:
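# observed=False keeps the bins that contain no states (needed on recent pandas versions)
bin_freq = state.State.groupby(state.bin_range, observed=False).count()
bin_freq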
Out[9]:
bin_range
(526935.0, 4232659.0]       24
(4232659.0, 7901692.0]      14
(7901692.0, 11570725.0]      6
(11570725.0, 15239758.0]     2
(15239758.0, 18908791.0]     1
(18908791.0, 22577824.0]     1
(22577824.0, 26246857.0]     1
(26246857.0, 29915890.0]     0
(29915890.0, 33584923.0]     0
(33584923.0, 37253956.0]     1
Name: State, dtype: int64
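
As an aside, the same kind of frequency count can be had without groupby by calling value_counts on the binned column. This is only a sketch on made-up numbers, not the approach used in the rest of this post.

```python
import pandas as pd

# Hypothetical populations standing in for state.csv
population = pd.Series([1, 2, 3, 4, 20])

bins = pd.cut(population, 4, include_lowest=True)

# sort=False keeps the bins in their natural order instead of sorting by count;
# empty bins are kept because the result covers every category
counts = bins.value_counts(sort=False)
print(counts)
```

Note that the resulting Series is indexed by the bins themselves, just like the groupby result above.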

I then create a new DataFrame using the results from above and rename the column so it is more meaningful. The inplace=True tells pandas to modify the current DataFrame rather than return a renamed copy of it.

In [10]:
bin_freq_df = pd.DataFrame(bin_freq)
bin_freq_df.rename(columns={'State': 'count'}, inplace=True)
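
If you would rather not modify the DataFrame in place, rename also works by returning a new DataFrame that you assign to a name. A minimal sketch, using a made-up frame in place of bin_freq_df:

```python
import pandas as pd

# Hypothetical one-column frame mirroring bin_freq_df
df = pd.DataFrame({"State": [24, 14, 6]})

# Without inplace=True, rename returns a new DataFrame and leaves df untouched
renamed = df.rename(columns={"State": "count"})

print(list(df.columns))       # ['State']
print(list(renamed.columns))  # ['count']
```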

Again we use pandas' plotting abilities for a fairly decent plot without a lot of fuss. Again, note the semi-colon at the end of the command to suppress all the extraneous output.

In [11]:
bin_freq_df.plot(kind='bar');

I was having difficulty recreating Table 1-5 on page 22. I kept getting errors and warnings about assigning values to a DataFrame column. Finally, I settled on a brute force method. This probably isn't a "Pythonic" way of doing things, but for our purposes it gets the job done.

I don't espouse writing crappy code, and I don't think the code below is necessarily crappy, but I am sure there is a more elegant way to write it using Python and pandas. However, I have read several articles whose authors advocate getting the job done first, with correct results of course, and then coming back to refactor the code. That is the approach I am taking here.

In [12]:
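# Build a dict that maps each bin to a space-separated string of state abbreviations
temp_dict = {}
for index, st in state.iterrows():
    try:
        state_str = temp_dict[st.bin_range]
        temp_dict[st.bin_range] = state_str + ' ' + st.Abbreviation
    except KeyError:
        # first state seen for this bin
        temp_dict[st.bin_range] = st.Abbreviation

# DataFrame.set_value was removed in pandas 1.0, so build the column from the
# dict instead, using '' for bins that contain no states
bin_freq_df['states'] = [temp_dict.get(key, '') for key in bin_freq_df.index]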
In [13]:
bin_freq_df.head(10)
Out[13]:
                          count  states
bin_range
(526935.0, 4232659.0]        24  AK AR CT DE HI ID IA KS ME MS MT NE NV NH NM N...
(4232659.0, 7901692.0]       14  AL AZ CO IN KY LA MD MA MN MO SC TN WA WI
(7901692.0, 11570725.0]       6  GA MI NJ NC OH VA
(11570725.0, 15239758.0]      2  IL PA
(15239758.0, 18908791.0]      1  FL
(18908791.0, 22577824.0]      1  NY
(22577824.0, 26246857.0]      1  TX
(26246857.0, 29915890.0]      0
(29915890.0, 33584923.0]      0
(33584923.0, 37253956.0]      1  CA
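
For when I come back to refactor, here is one possible sketch of a more pandas-flavoured version. It runs on a made-up four-row miniature of the state DataFrame, and the demo, grouped and table names are mine. The idea is to let groupby collect the abbreviations per bin and join them with a space, instead of looping over rows.

```python
import pandas as pd

# Hypothetical miniature of the state DataFrame
demo = pd.DataFrame({
    "Abbreviation": ["AK", "AR", "AL", "CA"],
    "Population": [710231, 2915918, 4779736, 37253956],
})
demo["bin_range"] = pd.cut(demo.Population, 3, include_lowest=True)

# observed=False keeps bins that contain no states
grouped = demo.groupby("bin_range", observed=False)["Abbreviation"]

table = grouped.count().to_frame("count")
# Join the abbreviations in each bin; fillna('') covers empty bins
# in case pandas reports them as missing
table["states"] = grouped.agg(" ".join).fillna("")
print(table)
```

I have not run this against the full state.csv, so the layout may need a little tweaking there.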