Chapter 1 - Location Estimates and Estimates of Variability

This blog post is my attempt to 1) learn the material in the book Practical Statistics for Data Scientists by Bruce and Bruce and 2) get more familiar with the Python packages needed for statistics and data science.

I have done my best to convert the code supplied in the book from R to Python correctly. Any mistakes are mine alone.

Covers pages 13 - 19

Objectives

  • Compute mean, trimmed mean, weighted mean, and median
  • Compute standard deviation, IQR, percentiles

Data set

  • state.csv

Packages

  • pandas
  • scipy (for trim_mean)
  • NumPy (installed as a dependency of pandas)
In [9]:
import pandas as pd
from scipy import stats
import numpy as np
In [ ]:
state = pd.read_csv('../data/state.csv')
In [3]:
# First 8 rows of the data; see Table 1-2 on p. 12 of the book
state.head(8)
Out[3]:
State Population Murder.Rate Abbreviation
0 Alabama 4779736 5.7 AL
1 Alaska 710231 5.6 AK
2 Arizona 6392017 4.7 AZ
3 Arkansas 2915918 5.6 AR
4 California 37253956 4.4 CA
5 Colorado 5029196 2.8 CO
6 Connecticut 3574097 2.4 CT
7 Delaware 897934 5.8 DE
In [4]:
state["Population"].mean()
Out[4]:
6162876.2999999998
In [7]:
# Trimmed mean: sort the values, drop the lowest 10% and the highest 10%, then average the rest
stats.trim_mean(state["Population"], 0.1)
Out[7]:
4783697.125
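
To make the trimming step concrete, here is a minimal pure-Python sketch of the idea behind stats.trim_mean. The helper name trimmed_mean and the sample values are made up for illustration, and the sketch assumes the number of values dropped from each end is the cut proportion times the sample size, rounded down.

```python
# Hypothetical helper for illustration only; it is not part of scipy.
def trimmed_mean(values, proportion):
    ordered = sorted(values)
    # Number of values dropped from EACH end (rounded down).
    k = int(proportion * len(ordered))
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Made-up data with one extreme value at the top.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]

plain = sum(sample) / len(sample)    # pulled up by the outlier
trimmed = trimmed_mean(sample, 0.1)  # drops 1 and 1000 before averaging

print(plain, trimmed)
```

The single extreme value dominates the plain mean but has no effect on the trimmed mean, which is why the trimmed mean of the state populations above is so much lower than the plain mean.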
In [8]:
state["Population"].median()
Out[8]:
4436369.5
In [16]:
# Weighted mean: murder rate weighted by state population, using NumPy's average function
np.average(state["Murder.Rate"], weights=state["Population"])
Out[16]:
4.4458339811233927
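
The weighted mean is the sum of each value times its weight, divided by the sum of the weights. The sketch below spells that out in pure Python; the function name weighted_mean and the three made-up "states" are assumptions for illustration, not data from state.csv.

```python
# Hypothetical helper for illustration; np.average(values, weights=...) does the same job.
def weighted_mean(values, weights):
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Made-up rates and populations for three imaginary states.
rates = [2.0, 4.0, 6.0]
populations = [1_000_000, 1_000_000, 8_000_000]

unweighted = sum(rates) / len(rates)
weighted = weighted_mean(rates, populations)

print(unweighted, weighted)
```

Because the most populous imaginary state also has the highest rate, the weighted mean lands closer to its value than the unweighted mean does.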

Standard Deviation

In [17]:
# Sample standard deviation (pandas divides by n - 1 by default)
state["Population"].std()
Out[17]:
6848235.3474011421
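
One thing to watch for: the pandas std method defaults to ddof=1 (divide by n - 1), while NumPy's np.std defaults to ddof=0 (divide by n), so the two give slightly different answers on the same data. The pure-Python sketch below shows the difference on a small made-up sample; the helper name std_dev is hypothetical.

```python
import math

# Hypothetical helper for illustration: standard deviation with a chosen ddof.
def std_dev(values, ddof):
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_deviations / (n - ddof))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample

sample_sd = std_dev(data, ddof=1)      # matches the pandas default
population_sd = std_dev(data, ddof=0)  # matches the NumPy default

print(sample_sd, population_sd)
```

The sample version is always a little larger, and the gap shrinks as the number of observations grows.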

Percentiles and IQR

In [19]:
# The first quartile (Q1) is the 0.25 quantile, i.e. the 25th percentile
Q1 = state["Population"].quantile(0.25)
Q1
Out[19]:
1833004.25
In [21]:
# The third quartile (Q3) is the 0.75 quantile, i.e. the 75th percentile
Q3 = state["Population"].quantile(0.75)
Q3
Out[21]:
6680312.25
In [23]:
# pandas does not have an IQR function but it is easy to compute
# it is just the difference between Q3 and Q1
IQR = Q3 - Q1
IQR
Out[23]:
4847308.0
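
The quartiles above are not always actual data points. By default, the pandas quantile method interpolates linearly between the two nearest sorted values. Here is a rough pure-Python sketch of that rule, assuming the position of quantile q in n sorted values is q * (n - 1). The helper name linear_quantile and the sample data are made up for illustration.

```python
import math

# Hypothetical helper for illustration: quantile with linear interpolation.
def linear_quantile(values, q):
    ordered = sorted(values)
    position = q * (len(ordered) - 1)  # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

data = [1, 2, 3, 4, 5, 6, 7, 8]  # made-up sample

q1 = linear_quantile(data, 0.25)
q3 = linear_quantile(data, 0.75)
iqr = q3 - q1

print(q1, q3, iqr)
```

If you would rather not subtract the quartiles by hand, scipy also provides stats.iqr, which computes the interquartile range directly.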