Module 8 Assignment

A few things you should keep in mind when working on assignments:

  1. Make sure you fill in every place that says YOUR CODE HERE. Do not write your answer anywhere other than where it says YOUR CODE HERE. Anything you write elsewhere will be removed or overwritten by the autograder.

  2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select Kernel, then restart the kernel and run all cells (Restart & Run All).

  3. Do not change the title (i.e. file name) of this notebook.

  4. Make sure that you save your work (in the menubar, select File, then Save and Checkpoint).


In [ ]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

from sklearn.neighbors import KernelDensity

from nose.tools import assert_equal, assert_is_instance, assert_is_not, assert_almost_equal

The problems will use data from the Dow Jones Index.

In [ ]:
#Load the data and see what it looks like
df = pd.read_csv('./dow_jones_index.data')
df.head()

Problem 1: Making a histogram

Write a function called histogram_plotter that takes in a data frame, a column name from that data frame, and a number of bins and then plots a histogram of the data in that column.

Furthermore:

  1. Set the y axis label to "Counts"

  2. Set the x axis label to the name of the column being plotted
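
As a rough illustration of the requirements above, the following sketch draws a labeled histogram with matplotlib. The toy dataframe and its `price` column are hypothetical stand-ins, not the assignment data, and this is only one of several ways to produce such a plot.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data; not the Dow Jones file used in this assignment.
rng = np.random.RandomState(0)
toy = pd.DataFrame({"price": rng.normal(loc=50, scale=10, size=200)})

fig, ax = plt.subplots()
# hist returns the per-bin counts, the bin edges, and the drawn patches
counts, edges, _ = ax.hist(toy["price"], bins=15)
ax.set_xlabel("price")   # x label matches the plotted column
ax.set_ylabel("Counts")  # raw counts, not a normalized density
```

Because `ax` is an ordinary matplotlib Axes object, its labels can be read back with `ax.get_xlabel()` and `ax.get_ylabel()`, which is how the checks in this notebook inspect a plot.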

In [ ]:
def histogram_plotter(df, column, num_bins):
    """
    Input
    ------
    df: a pandas dataframe that contains the column we want to plot
    
    column: a string that is the name of the column to be plotted
    
    num_bins: an integer, the number of bins to use
    
    Output
    ------
    
    ax: a matplotlib.axes._subplots.AxesSubplot object
    """
        
    ### YOUR CODE HERE
    
    return ax
In [ ]:
my_plot = histogram_plotter(df, 'open', 20)
In [ ]:
assert_equal(my_plot.get_xlabel(), 'open')
assert_is_instance(my_plot, mpl.axes.Axes)
assert_almost_equal(my_plot.get_ylim()[1], 100.8)
assert_equal(len(my_plot.get_xticks()), 11)
assert_equal(my_plot.get_ylabel(), 'Counts')

Problem 2: Kernel Density Estimation

Write a function called kde_plotter that takes in a data frame, a column name from that data frame, and a number of bins and then plots a histogram along with a kernel density estimate of the data in that column, using seaborn.

Furthermore:

  1. Set the y axis label to "Density"

  2. Set the x axis label to the name of the column being plotted
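
To make the "Density" label concrete, here is a minimal NumPy-only sketch of what a one-dimensional Gaussian kernel density estimate computes: one Gaussian bump per observation, averaged and scaled by the bandwidth. The helper `gaussian_kde_1d`, the toy data, and the bandwidth of 3.0 are all assumptions made for illustration. seaborn chooses its own bandwidth, so its curve will generally differ from this one.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian KDE on `grid` (hypothetical helper for illustration)."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Scaled distance from every grid point to every sample:
    # shape (len(grid), len(samples))
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    # Average the kernels and divide by the bandwidth so the curve integrates to ~1
    return kernel.mean(axis=1) / bandwidth

# Hypothetical toy data; not the assignment data.
rng = np.random.RandomState(0)
toy = rng.normal(loc=50, scale=10, size=200)

grid = np.linspace(0, 100, 501)
density = gaussian_kde_1d(toy, grid, bandwidth=3.0)

# Riemann-sum approximation of the area under the estimated curve
area = density.sum() * (grid[1] - grid[0])
```

The area under the curve is approximately 1, which is why a KDE is drawn on a density scale rather than a count scale.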

In [ ]:
def kde_plotter(df, column, num_bins):
    """
    Input
    ------
    df: a pandas dataframe that contains the column we want to plot
    
    column: a string that is the name of the column to be plotted
    
    num_bins: an integer, the number of bins to use
    
    
    Output
    ------
    
    ax: a matplotlib.axes._subplots.AxesSubplot object
    """
    
    ### YOUR CODE HERE
    
    return ax
In [ ]:
my_kde = kde_plotter(df, 'open', 20)
In [ ]:
x, y = my_kde.get_lines()[0].get_data()
assert_almost_equal(0.00159, y[10], places=3)
assert_almost_equal(17.617908, x[20], places=3)
assert_equal(my_kde.get_xlabel(), 'open')
assert_is_instance(my_kde,mpl.axes.Axes)
assert_equal(my_kde.get_ylabel(), 'Density')

Problem 3: Create a 2D KDE

For this problem, complete the mv_kde function so that it creates a 2D KDE with percent_change_price on the x axis and high on the y axis. Both of these variables are in the dataframe that is passed into mv_kde.

In [ ]:
def mv_kde(df):
    '''
    df: dataframe with data from the Dow Jones Index
    returns a seaborn JointGrid object
    '''
    
    ### YOUR CODE HERE
    
    return ax
In [ ]:
pcp_h = mv_kde(df)
In [ ]:
assert_is_instance(pcp_h, sns.axisgrid.JointGrid, msg='Return a JointGrid object; you can do this by using the JointGrid function in seaborn.')
assert_equal(np.array_equal(pcp_h.x, df.percent_change_price.values), True, msg='Percent change price should be used for the x-axis')
assert_equal(np.array_equal(pcp_h.y, df.high.values), True, msg='High should be used for the y-axis')

Problem 4: Generating More Stock Data

We have taken a subset of the Dow Jones dataset and stored it in a variable called X, which is displayed below. Using the data in X, we want to generate more stock data by fitting a KDE and sampling from its distribution.

Your task is to complete the function gen_stock_data. This function takes in X (the data), n_samples (the number of samples to produce), and random_state (which controls the generator state used for random sampling).

For this function:

  • Create a KernelDensity using sklearn's library (use the default parameters of KernelDensity)
  • Fit the KernelDensity on X
  • Sample from the KernelDensity using n_samples and the random_state
  • Lastly, return the sample
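
The fit-then-sample workflow listed above can be sketched on hypothetical toy data as follows. The array `toy` and the seed 42 are illustrative assumptions. With default parameters, `KernelDensity` uses a Gaussian kernel, which supports sampling.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical two-column toy data; not the assignment data.
rng = np.random.RandomState(0)
toy = rng.normal(loc=[10.0, 20.0], scale=[1.0, 2.0], size=(50, 2))

kde = KernelDensity()  # default parameters
kde.fit(toy)

# sample returns an array with one row per draw and one column per feature
draws = kde.sample(n_samples=5, random_state=42)
# Reusing the same random_state reproduces the same draws
again = kde.sample(n_samples=5, random_state=42)
```

Fixing `random_state` makes the sampled array reproducible, which is what allows a generated dataset to be compared exactly against a reference.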
In [ ]:
X = df[['open', 'high', 'low', 'close']]
X.head()
In [ ]:
def gen_stock_data(X, n_samples=100, random_state=0):
    '''
    X - dataset containing a subset of the Dow Jones data
    n_samples - integer which tells us how many samples to return
    random_state - controls generator state for random sampling
    '''
    
    ### YOUR CODE HERE

Let us visually compare the original data against the sampled data.

In [ ]:
sd1 = gen_stock_data(X, n_samples=1000, random_state=0)
fig, ((ax1_orig, ax2_orig, ax3_orig, ax4_orig),
      (ax1_samp ,ax2_samp, ax3_samp, ax4_samp)) = plt.subplots(2, 4, figsize=(10, 5))

# density=True replaces the normed=True argument, which newer matplotlib releases no longer accept
ax1_orig.hist(X.open, alpha=0.5, color=sns.xkcd_rgb["denim blue"], density=True, label='')
ax2_orig.hist(X.high, alpha=0.5, color=sns.xkcd_rgb["denim blue"], density=True, label='')
ax3_orig.hist(X.low, alpha=0.5, color=sns.xkcd_rgb["denim blue"], density=True, label='')
ax4_orig.hist(X.close, alpha=0.5, color=sns.xkcd_rgb["denim blue"], density=True, label='')

def column(matrix, i):
    return [row[i] for row in matrix]

ax1_samp.hist(column(sd1, 0), alpha=0.5, color=sns.xkcd_rgb["denim blue"], density=True, label='')
ax2_samp.hist(column(sd1, 1), alpha=0.5, color=sns.xkcd_rgb["denim blue"], density=True, label='')
ax3_samp.hist(column(sd1, 2), alpha=0.5, color=sns.xkcd_rgb["denim blue"], density=True, label='')
ax4_samp.hist(column(sd1, 3), alpha=0.5, color=sns.xkcd_rgb["denim blue"], density=True, label='')

for i in [ax1_orig, ax2_orig, ax3_orig, ax4_orig, ax1_samp, ax2_samp, ax3_samp, ax4_samp]:
    # Keep y ticks only on the first panel of each row
    if i is not ax1_orig and i is not ax1_samp:
        i.set_yticks([])

ax1_orig.set_title('open', fontsize=14)
ax2_orig.set_title('high', fontsize=14)
ax3_orig.set_title('low', fontsize=14)
ax4_orig.set_title('close', fontsize=14)
ax1_orig.set_ylabel('Original Data', fontsize=14)
ax1_samp.set_ylabel('Sampled Data', fontsize=14)

Check Your Solution

In [ ]:
from helper import gsd

assert_is_instance(sd1, np.ndarray, msg='Your function does not return a numpy array.')
assert_equal(len(sd1), 1000, msg='Your function should use the n_samples parameter. The array should have 1000 rows; it currently has {0}.'.format(len(sd1)))
assert_equal(np.array_equal(sd1, gsd(X, n_samples=1000, random_state=0)), True, msg='The generated data does not match the solution')

© 2017: Robert J. Brunner at the University of Illinois.

This notebook is released under the Creative Commons license CC BY-NC-SA 4.0. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.