(Many thanks to Evimaria Terzi and Mark Crovella for their code and examples)
In this tutorial we will look at the Python Data Analysis library (Pandas), as well as other libraries useful for data processing.
I assume that you all know Python. A brief introduction to Python basics can be found in this notebook from last year (ipynb, html). Here we will only review the use of list comprehensions.
Recall the mathematical notation:
$$L_1 = \left\{x^2 : x \in \{0\ldots 9\}\right\}$$
$$L_2 = \left(1, 2, 4, 8,\ldots, 2^{12}\right)$$
$$M = \left\{x \mid x \in L_1 \text{ and } x \text{ is even}\right\}$$
L1 = [x**2 for x in range(10)] # range(n): returns an iterator over the numbers 0,...,n-1
L2 = [2**i for i in range(13)]
L3 = [x for x in L1 if x % 2 == 0]
print (L1)
print (L2)
print (L3)
[x for x in [x**2 for x in range(10)] if x % 2 == 0]
words = 'The quick brown fox jumps over the lazy dog'.split()
print(words)
upper = [w.upper() for w in words]
print(upper)
stuff = [[w.upper(), w.lower(), len(w)] for w in words]
print(stuff)
s = input('Give numbers separated by comma: ')
x = [int(n) for n in s.split(',')]
print(x)
#create a vector of all 10 zeros
z = [0 for i in range(10)]
print(z)
#create a 10x10 matrix with all 0s
M = [[0 for i in range(10)] for j in range(10)]
#set the diagonal to 1
for i in range(10): M[i][i] = 1
M
#create a list of random integers in [0,99]
import random
R = [random.choice(range(100)) for i in range(10)]
print(R)
# Removing elements from a list while you iterate it can lead to problems
L = [1,2,4,5,6,8]
for x in L:
    if x%2 == 0:
        L.remove(x)
print(L)
#Another way to do this:
L = [1,2,4,5,6,8]
L = [x for x in L if x%2 == 1] #creates a new list and rebinds L to it
#L[:] = [x for x in L if x%2 == 1] #or modify the existing list in place
print(L)
L = [1,2,4,5,6,8]
R =[y for y in L if y%2 == 0]
for x in R: L.remove(x)
print(L)
Pandas is the Python Data Analysis Library.
Pandas is an extremely versatile tool for manipulating datasets, mostly tabular data. You can think of Pandas as an evolution of Excel spreadsheets, with more capabilities for coding, and support for SQL-like operations such as joins and group-by.
It also produces high quality plots with matplotlib, and integrates nicely with other libraries that expect NumPy arrays.
You can find more details here
The most important tool provided by Pandas is the data frame.
A data frame is a table in which each row and column is given a label. It is very similar to a spreadsheet or a SQL table.
Pandas DataFrames are documented at: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html
import pandas as pd #The pandas library
import pandas_datareader.data as web # For accessing web data
from pandas import Series, DataFrame #Main pandas data structures
import matplotlib.pyplot as plt #main plotting tool for python
import matplotlib as mpl
import seaborn as sns #a fancier plotting library
from datetime import datetime #For handling dates
import scipy as sp #library for scientific computations
from scipy import stats #The statistics part of the library
#pd.__version__
#For presenting plots inline
%matplotlib inline
For demonstration purposes, we'll use a library built-in to Pandas that fetches data from standard online sources. More information on what types of data you can fetch is at: https://pandas-datareader.readthedocs.io/en/latest/remote_data.html
We will use stock quotes from IEX. To make use of these you need to first create an account and obtain an API key. Then you set the environment variable IEX_API_KEY to the value of the key, as shown below.
import os
os.environ["IEX_API_KEY"] = "pk_******************************"
stocks = 'FB'
data_source = 'iex'
start = datetime(2018,1,1)
end = datetime(2018,12,31)
stocks_data = web.DataReader(stocks, data_source, start, end)
#If you want to load only some of the attributes:
#stocks_data = web.DataReader(stocks, data_source, start, end)[['open','close']]
stocks_data.info()
stocks_data.head()
Note that the date is the index of the rows, not a regular column.
#trying to access the date column will give an error
stocks_data.date
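Since the date serves as the index, we can access the dates through the index property instead:
stocks_data.index[:5] #the first five dates, taken from the index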
Use describe to get some basic statistics for the data
stocks_data.describe()
stocks_data.to_csv('stocks_data.csv')
for x in open('stocks_data.csv').readlines()[0:10]:
    print(x.strip())
The index values are also written to the file, together with the column name.
df = pd.read_csv('stocks_data.csv')
df.head()
Note that in the new dataframe, there is now a date column, while the index values are numbers 0,1,...
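If we want to restore the original structure when reading the file back, we can ask read_csv to use the date column as the index right away; a quick sketch using the standard index_col and parse_dates parameters:
df2 = pd.read_csv('stocks_data.csv', index_col='date', parse_dates=True) #use the date column as the index, parsed as datetimes
df2.head()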
The number of rows in the DataFrame:
len(df)
Getting the attribute characteristics
df.info()
d = {'A':[1., 2., 3., 4.],
     'B':[4., 3., 2., 1.]}
ddf = pd.DataFrame(d)
ddf.to_csv('test.csv')
print(open('test.csv').read())
d = [[1,2,3],[4,5,6]]
test_df = pd.DataFrame(d)
print(test_df)
test_df.columns= ['A','B','C']
test_df.index = ['one','two']
test_df
The columns or "features" in your data
df.columns
We can also assign a list to the columns property in order to change the attribute names.
Alternatively, you can change the name of an attribute using rename:
df = df.rename(columns = {'volume':'vol'})
#or equivalently, assign all the names at once: df.columns = ['date', 'open', 'high', 'low', 'close', 'vol']
df.head()
Selecting a single column from your data
df['open']
Another way of selecting a single column from your data
df.open
df[['open','close']].head()
df.date.head(10)
df.date.tail(10)
We can use the values property to obtain the values of one or more attributes. It returns a numpy array. You can transform it into a list by applying the list() operator.
df.open.values
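For example, to turn the array into a plain Python list:
list(df.open.values)[:5] #the first five values as a list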
A DataFrame object has many useful methods.
df.mean()
df.std()
df.median()
df.open.mean()
df.high.mean()
df.high.plot()
df.low.plot(label='low')
plt.legend(loc='best') #puts the legend in the best possible position
df.close.hist(bins=50)
sns.distplot(df.close,bins=50)
Methods like sum() and std() work on entire columns.
We can run our own functions across all values in a column (or row) using apply().
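As a minimal sketch (the dollars function below is just an illustrative helper, not part of the dataset), apply runs a function on every value of a column:
def dollars(x):
    return '${:.2f}'.format(x) #format a price as a dollar string

df.open.apply(dollars).head()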
df.date.head()
The values property of the column returns the column values as a numpy array. Inspecting the first value reveals that these are strings with a particular format.
first_date = df.date.values[0]
first_date
#returns a string
datetime.strptime(first_date, "%Y-%m-%d")
We will now make use of two operations:
The apply method takes a function as input and applies it to all the entries of the data frame (in the case below, to just one column).
The lambda expression allows us to define an anonymous function that takes some parameters (d) and uses them to compute some expression.
Using a lambda function with apply, we can apply the function to all the entries of the data frame (in this case the column values).
df.date = df.date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d"))
df.date.head()
Each row in a DataFrame is associated with an index, which is a label that uniquely identifies a row.
The row indices so far have been auto-generated by pandas, and are simply integers starting from 0.
From now on we will use dates instead of integers for indices -- the benefits of this will show later.
Overwriting the index is as easy as assigning to the index property of the DataFrame.
df.index = df.date
df.head()
Now that we have made an index based on date, we can drop the original date column.
We will not do it in this example, since we will use the date column again later on.
#df = df.drop(['date'],axis=1) #axis = 0 means rows, axis = 1 means columns
df.info()
So far we've seen how to access a column of the DataFrame. To access a row we use a different notation.
To access a row by its index value, use the .loc[] indexer.
df.loc[datetime(2018,5,7)]
To access a row by its sequence number (i.e., like an array index), use .iloc[] ('integer location')
df.iloc[10:30] #dataframe with rows 10 up to (but not including) 30
df.iloc[0:2,[1,3]] #dataframe with rows 0:2, and the second and fourth columns
We can iterate over the rows of a DataFrame using .iterrows().
num_positive_days = 0
for idx, row in df.iterrows(): #returns the index label and the row
    if row.close > row.open:
        num_positive_days += 1
print("The total number of positive-gain days is {}.".format(num_positive_days))
You can also do it this way:
num_positive_days = 0
for i in range(len(df)):
    row = df.iloc[i]
    if row.close > row.open:
        num_positive_days += 1
print("The total number of positive-gain days is {}.".format(num_positive_days))
Or this way:
pos_days = [idx for (idx,row) in df.iterrows() if row.close > row.open]
print("The total number of positive-gain days is "+str(len(pos_days)))
It is very easy to select interesting rows from the data.
All the operations below return a new Series or DataFrame, which itself can be treated the same way as all DataFrames we have seen so far.
tmp_high = df.high > 170
tmp_high.head()
Summing a Boolean array is the same as counting the number of True
values.
sum(tmp_high)
Now, let's select only the rows of df that correspond to tmp_high
df[tmp_high].head()
Putting it all together, we have the following commonly-used patterns:
positive_days = df[df.close > df.open]
positive_days.head()
very_positive_days = df[df.close-df.open > 5]
very_positive_days.head()
To create a new column, simply assign values to it. Think of the columns as a dictionary:
df['profit'] = (df.close - df.open)
df.head()
df.profit[df.profit>0].describe()
for idx, row in df.iterrows():
    if row.close < row.open:
        df.loc[idx,'gain']='negative'
    elif (row.close - row.open) < 1:
        df.loc[idx,'gain']='small_gain'
    elif (row.close - row.open) < 3:
        df.loc[idx,'gain']='medium_gain'
    else:
        df.loc[idx,'gain']='large_gain'
df.head()
Here is another, more "functional", way to accomplish the same thing.
Define a function that classifies rows, and apply it to each row.
def gainrow(row):
    if row.close < row.open:
        return 'negative'
    elif (row.close - row.open) < 1:
        return 'small_gain'
    elif (row.close - row.open) < 3:
        return 'medium_gain'
    else:
        return 'large_gain'
df['test_column'] = df.apply(gainrow, axis = 1)
#axis = 1 applies the function to each row; axis = 0 would apply it to each column
df.head()
OK, point made, let's get rid of that extraneous test_column:
df = df.drop('test_column', axis = 1)
df.head()
An extremely powerful DataFrame method is groupby(). This is entirely analogous to GROUP BY in SQL.
It will group the rows of a DataFrame by the values in one (or more) columns, and let you iterate through each group.
Here we will look at the average gain among the categories of gains (negative, small, medium and large) we defined above and stored in the column gain.
gain_groups = df.groupby('gain')
type(gain_groups)
Essentially, gain_groups behaves like a dictionary keyed on the values of the gain column:
for gain, gain_data in gain_groups:
    print(gain)
    print(gain_data.head())
    print('=============================')
We can obtain the dataframe that corresponds to a specific group by using the get_group method of the groupby object
sm = gain_groups.get_group('small_gain')
sm.head()
for gain, gain_data in df.groupby("gain"):
    print('The average closing value for the {} group is {}'.format(gain, gain_data.close.mean()))
for gain, gain_data in df.groupby("gain"):
    print('The median volume value for the {} group is {}'.format(gain, gain_data.vol.median()))
We often want to do a typical SQL-like group by, where we group by one or more attributes and aggregate the values of some other attributes. For example, group by gain and take the average of the values for open, high, low, close, and volume. You can also use other aggregators such as count, sum, median, max, min. Pandas returns a new dataframe indexed by the values of the group-by attribute(s), with the other attributes as columns.
gdf= df[['open','low','high','close','vol','gain']].groupby('gain').mean()
type(gdf)
#reset_index can be used to remove the hierarchical index, if necessary
gdf = df[['open','low','high','close','vol','gain']].groupby('gain').mean().reset_index()
gdf.set_index('gain')
gdf
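We can also compute several aggregates at once using the agg method of the groupby object; a short sketch on the same data:
df[['vol','profit','gain']].groupby('gain').agg(['mean','median','count'])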
Are these differences statistically significant? We can test that using the Student t-test. The Student t-test gives us a value for the difference between the means in units of standard error, and a p-value that says how significant this difference is. Usually we require the p-value to be less than 0.05 (or 0.01 if we want to be more strict). Note that for the test we will need to use all the values in each group.
The t-test value is:
$$t = \frac{\bar{x}_1-\bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}}$$
where $\bar x_i$ is the mean value of the $i$-th dataset, $\sigma_i^2$ is its variance, and $n_i$ is its size.
#Test statistical significance of the difference in the mean volume numbers
sm = gain_groups.get_group('small_gain').vol
lg = gain_groups.get_group('large_gain').vol
med = gain_groups.get_group('medium_gain').vol
neg = gain_groups.get_group('negative').vol
print(stats.ttest_ind(sm,neg,equal_var = False))
print(stats.ttest_ind(sm,med, equal_var = False))
print(stats.ttest_ind(sm,lg, equal_var = False))
print(stats.ttest_ind(neg,med,equal_var = False))
print(stats.ttest_ind(neg,lg,equal_var = False))
print(stats.ttest_ind(med,lg, equal_var = False))
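As a sanity check, we can also compute the Welch t-statistic directly from the formula above and compare it with the value scipy reports (a minimal sketch reusing the groups defined above):
import numpy as np
def welch_t(a, b): #the t-statistic from the formula, with unbiased variance estimates
    return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1)/len(a) + b.var(ddof=1)/len(b))

print(welch_t(sm, neg)) #hand-computed t
print(stats.ttest_ind(sm, neg, equal_var = False).statistic) #scipy's t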
We can compute the standard error of the mean using the stats.sem method of scipy, which is also available as a method of the data frame (and of its columns).
print(sm.sem())
print(neg.sem())
print(stats.sem(med))
print(stats.sem(lg))
We can also visualize the mean and the standard error in a bar-plot, using the barplot function of seaborn. Note that we need to apply this to the original data. The averaging is done automatically.
sns.barplot(x='gain',y='vol', data = df)
sns.boxplot(x='gain',y='vol', data = df)
#Removing outliers
sns.boxplot(x='gain',y='vol', data = df, showfliers = False)
Plot the average volume over the different months
def get_month(row):
    return row.date.month
df['month'] = df.apply(get_month,axis = 1)
sns.lineplot(x='month', y = 'vol', data = df)
df['positive_profit'] = (df.profit>0)
sns.lineplot(x='month', y = 'vol', hue='positive_profit', data = df)
We can join data frames in a similar way to how we do joins in SQL.
data_source = 'iex'
start = datetime(2018,1,1)
end = datetime(2018,12,31)
dfb = web.DataReader('FB', data_source, start, end)
dgoog = web.DataReader('GOOG', data_source, start, end)
print(dfb.head())
print(dgoog.head())
Perform join on the date (the index value)
common_dates = pd.merge(dfb,dgoog,on='date')
common_dates.head()
Compute gain and perform the join on the date AND the gain
dfb['gain'] = dfb.apply(gainrow, axis = 1)
dgoog['gain'] = dgoog.apply(gainrow, axis = 1)
dfb['profit'] = dfb.close-dfb.open
dgoog['profit'] = dgoog.close-dgoog.open
common_gain_dates = pd.merge(dfb, dgoog, on=['date','gain'])
common_gain_dates.head()
More join examples, including left outer join
left = pd.DataFrame({'key': ['foo', 'foo', 'boo'], 'lval': [1, 2,3]})
print(left)
right = pd.DataFrame({'key': ['foo', 'hoo'], 'rval': [4, 5]})
print(right)
dfm = pd.merge(left, right, on='key') #keeps only the common key 'foo'
print(dfm)
Left outer join
dfm = pd.merge(left, right, on='key', how='left') #keeps all the keys from the left and puts NaN for missing values
print(dfm)
dfm = dfm.fillna(0) #fills the NaN values with specified value
dfm
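Alternatively, the rows with missing values can be dropped instead of filled, using dropna:
dfm = pd.merge(left, right, on='key', how='left')
print(dfm.dropna()) #keeps only the rows without NaN values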
A DataFrame is essentially an annotated 2-D array.
Pandas also has annotated versions of 1-D and 3-D arrays.
A 1-D array in Pandas is called a Series.
A 3-D array in Pandas is called a Panel (note that Panels have been deprecated in recent versions of Pandas).
To use these, read the documentation!
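A quick Series example, a 1-D labeled array (we skip Panels, given the deprecation):
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b']) #access by label
print(s.mean()) #the usual aggregate methods also work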
As a last task, we will use the experience we obtained so far -- and learn some new things -- in order to compare the performance of the different stocks we obtained from IEX.
stocks = ['FB','GOOG','TSLA', 'MSFT','NFLX']
attr = 'close'
df = web.DataReader(stocks,
                    data_source,
                    start=datetime(2018, 1, 1),
                    end=datetime(2018, 12, 31))[attr]
df.head()
df.FB.plot(label = 'facebook')
df.GOOG.plot(label = 'google')
df.TSLA.plot(label = 'tesla')
df.MSFT.plot(label = 'microsoft')
df.NFLX.plot(label = 'netflix')
_ = plt.legend(loc='best')
Next, we will calculate returns over a period of length $T$, defined as:
$$r(t) = \frac{f(t)-f(t-T)}{f(t-T)}$$
The returns can be computed with the simple DataFrame method pct_change(). Note that for the first $T$ timesteps, this value is not defined (of course):
rets = df.pct_change(30)
rets.iloc[25:35]
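As a quick check, pct_change(30) is the same as dividing by the values 30 steps earlier:
manual = df/df.shift(30) - 1 #the same computation, using shift
print((manual - rets).abs().max()) #the differences should be (essentially) zero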
Now we'll plot the timeseries of the returns of the different stocks.
Notice that the NaN values are gracefully dropped by the plotting function.
rets.FB.plot(label = 'facebook')
rets.GOOG.plot(label = 'google')
rets.TSLA.plot(label = 'tesla')
rets.MSFT.plot(label = 'microsoft')
rets.NFLX.plot(label = 'netflix')
_ = plt.legend(loc='best')
plt.scatter(rets.TSLA, rets.GOOG)
plt.xlabel('TESLA 30-day returns')
_ = plt.ylabel('GOOGLE 30-day returns')
We can also use the seaborn library for doing the scatterplot. Note that this method returns an object which we can use to set different parameters of the plot. In the example below we use it to set the x and y labels of the plot. Read online for more options.
#Also using seaborn; scatterplot returns a matplotlib Axes object
ax = sns.scatterplot(dfb.profit, dgoog.profit)
ax.set_xlabel('FB profit')
ax.set_ylabel('GOOG profit')
Get all pairwise correlations in a single plot
sns.pairplot(rets.iloc[30:])
There appears to be some (fairly strong) correlation between the movement of the TSLA and NFLX stocks. Let's measure this.
The correlation coefficient between variables $X$ and $Y$ is defined as follows:
$$\text{Corr}(X,Y) = \frac{E\left[(X-\mu_X)(Y-\mu_Y)\right]}{\sigma_X\sigma_Y}$$
Pandas provides a DataFrame method to compute the correlation coefficient of all pairs of columns: corr().
rets.corr()
rets.corr(method='spearman')
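As a sanity check, we can compute one Pearson coefficient directly from the definition above (dropping the first 30 rows, which are NaN):
r = rets.iloc[30:]
x, y = r.FB, r.GOOG
print(((x - x.mean())*(y - y.mean())).mean()/(x.std(ddof=0)*y.std(ddof=0))) #from the definition
print(x.corr(y)) #pandas' built-in computation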
It takes a bit of time to examine that table and draw conclusions.
To speed that process up it helps to visualize the table.
We will learn more about visualization later, but for now this is a simple example.
_ = sns.heatmap(rets.corr(), annot=True)
Use the scipy.stats library to obtain the p-values for the Pearson and Spearman rank correlations
print(stats.pearsonr(rets.iloc[30:].NFLX, rets.iloc[30:].TSLA))
print(stats.spearmanr(rets.iloc[30:].NFLX, rets.iloc[30:].TSLA))
print(stats.pearsonr(rets.iloc[30:].GOOG, rets.iloc[30:].FB))
print(stats.spearmanr(rets.iloc[30:].GOOG, rets.iloc[30:].FB))
print(stats.pearsonr(dfb.profit, dgoog.profit))
print(stats.spearmanr(dfb.profit, dgoog.profit))
Finally, it is important to know that the plotting performed by Pandas is just a layer on top of matplotlib (i.e., the plt package).
So Pandas' plots can (and should) be replaced or improved by using additional functions from matplotlib.
For example, suppose we want to know both the returns as well as the standard deviation of the returns of a stock (i.e., its risk).
Here is a visualization of the result of such an analysis, where we construct the plot using only functions from matplotlib.
_ = plt.scatter(rets.mean(), rets.std())
plt.xlabel('Expected returns')
plt.ylabel('Standard Deviation (Risk)')
for label, x, y in zip(rets.columns, rets.mean(), rets.std()):
    plt.annotate(
        label,
        xy = (x, y), xytext = (20, -20),
        textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
To understand what these functions are doing (especially the annotate function), you will need to consult the online documentation for matplotlib. Just use Google to find it.
df = pd.read_csv('distributions_short.csv',
                 names=list('ABCD'))
dfs = df.sort_values(by='A', ascending = True) #Sorting in data frames
Plot column B against A
The plt.figure() command creates a new figure for each plot
plt.figure(); dfs.plot(x = 'A', y = 'B');
Plot both columns B and C against A.
Clearly they are different functions
plt.figure(); dfs.plot(x = 'A', y = ['B','C']);
Plot column B against A in log-log scale.
We observe a straight line, so B is a polynomial (power) function of A
plt.figure(); dfs.plot(x = 'A', y = 'B', loglog=True);
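Since a straight line in log-log scale means that $\log B$ is linear in $\log A$, we can estimate the exponent with a least-squares fit; a quick sketch that assumes the values of A and B are positive:
import numpy as np
slope, intercept = np.polyfit(np.log(dfs.A), np.log(dfs.B), 1) #fit a line in log-log space
print(slope) #the estimated exponent of the polynomial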
Plot both columns B and C against A in log scale
plt.figure(); dfs.plot(x = 'A', y = ['B','C'], loglog=True);
Plot B and C against A, with log scale only on y-axis.
The plot of C becomes a line, indicating that C is an exponential function of A
plt.figure(); dfs.plot(x = 'A', y = ['B','C'], logy=True);
Plotting using MATLAB-style notation
Also, how to put two plots in a 1x2 grid
plt.figure(figsize = (15,5)) #defines the size of figure
plt.subplot(121) #plot with 1 row, 2 columns, 1st plot
plt.plot(dfs['A'],dfs['B'],'bo-',dfs['A'],dfs['C'],'g*-')
plt.subplot(122) #plot with 1 row, 2 columns, 2nd plot
plt.loglog(dfs['A'],dfs['B'],'bo-',dfs['A'],dfs['C'],'g*-')
Using seaborn
sns.lineplot(x= 'A', y='B',data = dfs,marker='o')
The same plots as scatter plots using the dataframe functions
fig, ax = plt.subplots(1, 2, figsize=(15,5))
df.plot(kind ='scatter', x='A', y='B', ax = ax[0])
df.plot(kind ='scatter', x='A', y='B', loglog = True,ax = ax[1])
plt.scatter(df.A, df.B)
Putting many scatter plots into the same plot
t = df.plot(kind='scatter', x='A', y='B', color='DarkBlue', label='B curve', loglog=True);
df.plot(kind='scatter', x='A', y='C',color='DarkGreen', label='C curve', ax=t, loglog = True);
Using seaborn
sns.scatterplot(x='A',y='B', data = df)
In log-log scale (for some reason it seems to throw away small values)
splot = sns.scatterplot(x='A',y='B', data = df)
splot.set(xscale="log", yscale="log")