AI For Trading: Project 7: Combine Signals for Enhanced Alpha

Instructions

Each problem consists of a function to implement and instructions on how to implement the function. The parts of the function that need to be implemented are marked with a # TODO comment. After implementing the function, run the cell to test it against the unit tests we've provided. For each problem, we provide one or more unit tests from our project_tests package. These unit tests won't tell you if your answer is correct, but will warn you of any major errors. Your code will be checked for the correct solution when you submit it to Udacity.

Packages

When you implement the functions, you'll only need to use the packages you've used in the classroom, like Pandas and NumPy. These packages will be imported for you. We recommend you don't add any import statements; otherwise, the grader might not be able to run your code.

The other packages that we're importing are project_helper and project_tests. These are custom packages built to help you solve the problems. The project_helper module contains utility functions and graph functions. The project_tests module contains the unit tests for all the problems.

Install Packages

import sys
!{sys.executable} -m pip install -r requirements.txt
...

Collecting alphalens==0.3.2 (from -r requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/a5/dc/2f9cd107d0d4cf6223d37d81ddfbbdbf0d703d03669b83810fa6b97f32e5/alphalens-0.3.2.tar.gz (18.9MB)
    100% |████████████████████████████████| 18.9MB 1.7MB/s
...
numpy-1.13.3 pandas-0.18.1 pandas-datareader-0.5.0 python-editor-1.0.4 requests-file-1.4.3 requests-ftp-0.3.1 scipy-1.0.0 sortedcontainers-2.1.0 tables-3.3.0 tqdm-4.19.5 zipline-1.2.0

Load Packages

import project_helper
import project_tests

import numpy as np
import pandas as pd
from tqdm import tqdm
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (14, 8)

Data Pipeline

Data Bundle

We'll be using Zipline to handle our data. We've created an end-of-day data bundle for this project. Run the cell below to register this data bundle in Zipline.

import os
from zipline.data import bundles

os.environ['ZIPLINE_ROOT'] = os.path.join(os.getcwd(), '..', '..', 'data', 'project_7_eod')

ingest_func = bundles.csvdir.csvdir_equities(['daily'], project_helper.EOD_BUNDLE_NAME)
bundles.register(project_helper.EOD_BUNDLE_NAME, ingest_func)

print('Data Registered')
Data Registered

Build Pipeline Engine

We'll be using Zipline's pipeline package to access our data for this project. To use it, we must build a pipeline engine. Run the cell below to build the engine.

from zipline.pipeline import Pipeline
from zipline.pipeline.factors import AverageDollarVolume
from zipline.utils.calendars import get_calendar

universe = AverageDollarVolume(window_length=120).top(500) 
trading_calendar = get_calendar('NYSE') 
bundle_data = bundles.load(project_helper.EOD_BUNDLE_NAME)
engine = project_helper.build_pipeline_engine(bundle_data, trading_calendar)

View Data

With the pipeline engine built, let's get the stocks at the end of the period in the universe we're using.

universe_end_date = pd.Timestamp('2016-01-05', tz='UTC')

universe_tickers = engine\
    .run_pipeline(
        Pipeline(screen=universe),
        universe_end_date,
        universe_end_date)\
    .index.get_level_values(1)\
    .values.tolist()

universe_tickers[:5]
[Equity(0 [A]),
 Equity(1 [AAL]),
 Equity(2 [AAP]),
 Equity(3 [AAPL]),
 Equity(4 [ABBV])]

Get Returns

Now that we have our pipeline built, let's access the returns data. We'll start by building a data portal.

from zipline.data.data_portal import DataPortal

data_portal = DataPortal(
    bundle_data.asset_finder,
    trading_calendar=trading_calendar,
    first_trading_day=bundle_data.equity_daily_bar_reader.first_trading_day,
    equity_minute_reader=None,
    equity_daily_reader=bundle_data.equity_daily_bar_reader,
    adjustment_reader=bundle_data.adjustment_reader)

To make the code easier to read, we've built the helper function get_pricing to get the pricing from the data portal.

def get_pricing(data_portal, trading_calendar, assets, start_date, end_date, field='close'):
    end_dt = pd.Timestamp(end_date.strftime('%Y-%m-%d'), tz='UTC', offset='C')
    start_dt = pd.Timestamp(start_date.strftime('%Y-%m-%d'), tz='UTC', offset='C')

    end_loc = trading_calendar.closes.index.get_loc(end_dt)
    start_loc = trading_calendar.closes.index.get_loc(start_dt)

    return data_portal.get_history_window(
        assets=assets,
        end_dt=end_dt,
        bar_count=end_loc - start_loc,
        frequency='1d',
        field=field,
        data_frequency='daily')
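
As a quick illustration, here's a hypothetical call (it assumes the objects defined above and that 2015-01-05 is a valid NYSE trading day) pulling closing prices for the first few tickers in the universe:

prices = get_pricing(
    data_portal,
    trading_calendar,
    universe_tickers[:5],
    pd.Timestamp('2015-01-05', tz='UTC'),
    universe_end_date)
prices.head()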

Alpha Factors

It's time to start working on the alpha factors. In this project, we'll use the following factors:

  • Momentum 1 Year Factor
  • Mean Reversion 5 Day Sector Neutral Smoothed Factor
  • Overnight Sentiment Smoothed Factor

from zipline.pipeline.factors import CustomFactor, DailyReturns, Returns, SimpleMovingAverage, AnnualizedVolatility
from zipline.pipeline.data import USEquityPricing

factor_start_date = universe_end_date - pd.DateOffset(years=3, days=2)
sector = project_helper.Sector()

def momentum_1yr(window_length, universe, sector):
    return Returns(window_length=window_length, mask=universe) \
        .demean(groupby=sector) \
        .rank() \
        .zscore()

def mean_reversion_5day_sector_neutral_smoothed(window_length, universe, sector):
    unsmoothed_factor = -Returns(window_length=window_length, mask=universe) \
        .demean(groupby=sector) \
        .rank() \
        .zscore()
    return SimpleMovingAverage(inputs=[unsmoothed_factor], window_length=window_length) \
        .rank() \
        .zscore()

class CTO(Returns):
    """
    Computes the overnight return, per hypothesis from
    https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2554010
    """
    inputs = [USEquityPricing.open, USEquityPricing.close]

    def compute(self, today, assets, out, opens, closes):
        """
        The opens and closes matrix is 2 rows x N assets, with the most recent at the bottom.
        As such, opens[-1] is the most recent open, and closes[0] is the earlier close
        """
        out[:] = (opens[-1] - closes[0]) / closes[0]

class TrailingOvernightReturns(Returns):
    """
    Sum of trailing overnight (O/N) returns over the lookback window
    """
    window_safe = True

    def compute(self, today, asset_ids, out, cto):
        out[:] = np.nansum(cto, axis=0)

def overnight_sentiment_smoothed(cto_window_length, trail_overnight_returns_window_length, universe):
    cto_out = CTO(mask=universe, window_length=cto_window_length)
    unsmoothed_factor = TrailingOvernightReturns(inputs=[cto_out], window_length=trail_overnight_returns_window_length) \
        .rank() \
        .zscore()
    return SimpleMovingAverage(inputs=[unsmoothed_factor], window_length=trail_overnight_returns_window_length) \
        .rank() \
        .zscore()
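
To sanity-check the close-to-open arithmetic in CTO.compute, here's a standalone NumPy sketch with made-up prices (for intuition only; it is not pipeline code):

import numpy as np

# Two-day window, three assets: row 0 = yesterday, row 1 = today.
opens  = np.array([[10.0, 20.0, 30.0],
                   [10.5, 19.0, 30.0]])
closes = np.array([[10.2, 19.5, 29.0],
                   [10.4, 19.2, 31.0]])

# Overnight return: today's open relative to yesterday's close.
overnight = (opens[-1] - closes[0]) / closes[0]
print(overnight)  # [ 0.0294 -0.0256  0.0345]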

Combine the Factors into a Single Pipeline

Let's add the factors to a pipeline.

universe = AverageDollarVolume(window_length=120).top(500)
sector = project_helper.Sector()

pipeline = Pipeline(screen=universe)
pipeline.add(
    momentum_1yr(252, universe, sector),
    'Momentum_1YR')
pipeline.add(
    mean_reversion_5day_sector_neutral_smoothed(20, universe, sector),
    'Mean_Reversion_Sector_Neutral_Smoothed')
pipeline.add(
    overnight_sentiment_smoothed(2, 10, universe),
    'Overnight_Sentiment_Smoothed')

Features and Labels

Let's create some features that we think will help the model make predictions.

"Universal" Quant Features

To capture the universe, we'll use the following as features:

  • Stock Volatility 20d, 120d
  • Stock Dollar Volume 20d, 120d
  • Sector

pipeline.add(AnnualizedVolatility(window_length=20, mask=universe).rank().zscore(), 'volatility_20d')
pipeline.add(AnnualizedVolatility(window_length=120, mask=universe).rank().zscore(), 'volatility_120d')
pipeline.add(AverageDollarVolume(window_length=20, mask=universe).rank().zscore(), 'adv_20d')
pipeline.add(AverageDollarVolume(window_length=120, mask=universe).rank().zscore(), 'adv_120d')
pipeline.add(sector, 'sector_code')
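
Several of these features end with .rank().zscore(). As a minimal pandas sketch of what that normalization does to a single day's cross-section (an approximation for intuition, not the Zipline implementation):

import pandas as pd

# Toy cross-section for four assets: ranking first removes the influence
# of outliers; z-scoring then centers and rescales the ranks.
raw = pd.Series([0.40, 0.02, -0.01, -0.30], index=list('ABCD'))
ranked = raw.rank()  # 4.0, 3.0, 2.0, 1.0
print((ranked - ranked.mean()) / ranked.std(ddof=0))  # evenly spaced scores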

Regime Features

We are going to try to capture market-wide regimes. To do that, we'll use the following features:

  • High and low volatility 20d, 120d
  • High and low dispersion 20d, 120d

class MarketDispersion(CustomFactor):
    inputs = [DailyReturns()]
    window_length = 1
    window_safe = True

    def compute(self, today, assets, out, returns):
        # returns are days in rows, assets across columns
        out[:] = np.sqrt(np.nanmean((returns - np.nanmean(returns))**2))

pipeline.add(SimpleMovingAverage(inputs=[MarketDispersion(mask=universe)], window_length=20), 'dispersion_20d')
pipeline.add(SimpleMovingAverage(inputs=[MarketDispersion(mask=universe)], window_length=120), 'dispersion_120d')

class MarketVolatility(CustomFactor):
    inputs = [DailyReturns()]
    window_length = 1
    window_safe = True

    def compute(self, today, assets, out, returns):
        # Equal-weight market return each day; annualize the variance with ~260 trading days
        mkt_returns = np.nanmean(returns, axis=1)
        out[:] = np.sqrt(260. * np.nanmean((mkt_returns - np.nanmean(mkt_returns))**2))

pipeline.add(MarketVolatility(window_length=20), 'market_vol_20d')
pipeline.add(MarketVolatility(window_length=120), 'market_vol_120d')

Target

Let's try to predict the go-forward 1-week return. When doing this, it's important to quantize the target. The factor we create here is the trailing 5-day return, binned into quantiles; we'll shift it later (in the Shift Target section) so it becomes a forward-looking label.

pipeline.add(Returns(window_length=5, mask=universe).quantiles(2), 'return_5d')
pipeline.add(Returns(window_length=5, mask=universe).quantiles(25), 'return_5d_p')
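
For intuition, pd.qcut behaves much like the pipeline's .quantiles() on a single day's cross-section (a stand-in sketch with toy data, not the Zipline call):

import pandas as pd

# Toy 5-day returns for six assets: two quantiles give a binary label,
# bottom half -> 0, top half -> 1.
returns_5d = pd.Series([-0.04, -0.01, 0.00, 0.01, 0.02, 0.05])
print(list(pd.qcut(returns_5d, 2, labels=False)))  # [0, 0, 0, 1, 1, 1]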

Date Features

Let's make columns for the trees to split on that might capture trader/investor behavior due to calendar anomalies.

all_factors = engine.run_pipeline(pipeline, factor_start_date, universe_end_date)

all_factors['is_January'] = all_factors.index.get_level_values(0).month == 1
all_factors['is_December'] = all_factors.index.get_level_values(0).month == 12
all_factors['weekday'] = all_factors.index.get_level_values(0).weekday
all_factors['quarter'] = all_factors.index.get_level_values(0).quarter
all_factors['qtr_yr'] = all_factors.quarter.astype('str') + '_' + all_factors.index.get_level_values(0).year.astype('str')
all_factors['month_end'] = all_factors.index.get_level_values(0).isin(pd.date_range(start=factor_start_date, end=universe_end_date, freq='BM'))
all_factors['month_start'] = all_factors.index.get_level_values(0).isin(pd.date_range(start=factor_start_date, end=universe_end_date, freq='BMS'))
all_factors['qtr_end'] = all_factors.index.get_level_values(0).isin(pd.date_range(start=factor_start_date, end=universe_end_date, freq='BQ'))
all_factors['qtr_start'] = all_factors.index.get_level_values(0).isin(pd.date_range(start=factor_start_date, end=universe_end_date, freq='BQS'))

all_factors.head()
Mean_Reversion_Sector_Neutral_Smoothed Momentum_1YR Overnight_Sentiment_Smoothed adv_120d adv_20d dispersion_120d dispersion_20d market_vol_120d market_vol_20d return_5d ... volatility_20d is_January is_December weekday quarter qtr_yr month_end month_start qtr_end qtr_start
2013-01-03 00:00:00+00:00 Equity(0 [A]) -0.26276899 -1.20797813 -1.48566901 1.33857307 1.39741144 0.01326964 0.01117804 0.12966421 0.13758558 0 ... -1.21980876 True False 3 1 1_2013 False False False False

5 rows × 23 columns

One Hot Encode Sectors

For the model to better understand the sector data, we'll one hot encode this data.

sector_lookup = pd.read_csv(
    os.path.join(os.getcwd(), '..', '..', 'data', 'project_7_sector', 'labels.csv'),
    index_col='Sector_i')['Sector'].to_dict()
sector_lookup

sector_columns = []
for sector_i, sector_name in sector_lookup.items():
    sector_column = 'sector_{}'.format(sector_name)
    sector_columns.append(sector_column)
    all_factors[sector_column] = (all_factors['sector_code'] == sector_i)

all_factors[sector_columns].head()
sector_Healthcare sector_Technology sector_Consumer Defensive sector_Industrials sector_Utilities sector_Financial Services sector_Real Estate sector_Communication Services sector_Consumer Cyclical sector_Energy sector_Basic Materials
2013-01-03 00:00:00+00:00 Equity(0 [A]) True False False False False False False False False False False
Equity(1 [AAL]) False False False True False False False False False False False

Shift Target

We'll use shifted 5 day returns for training the model.

all_factors['target'] = all_factors.groupby(level=1)['return_5d'].shift(-5)

all_factors[['return_5d','target']].reset_index().sort_values(['level_1', 'level_0']).head(10)
level_0 level_1 return_5d target
0 2013-01-03 00:00:00+00:00 Equity(0 [A]) 0 0.00000000
471 2013-01-04 00:00:00+00:00 Equity(0 [A]) 0 0.00000000
942 2013-01-07 00:00:00+00:00 Equity(0 [A]) 0 0.00000000
1413 2013-01-08 00:00:00+00:00 Equity(0 [A]) 0 1.00000000
1884 2013-01-09 00:00:00+00:00 Equity(0 [A]) 0 0.00000000
2355 2013-01-10 00:00:00+00:00 Equity(0 [A]) 0 0.00000000
2826 2013-01-11 00:00:00+00:00 Equity(0 [A]) 0 0.00000000
3297 2013-01-14 00:00:00+00:00 Equity(0 [A]) 0 0.00000000
3768 2013-01-15 00:00:00+00:00 Equity(0 [A]) 1 0.00000000
4239 2013-01-16 00:00:00+00:00 Equity(0 [A]) 0 0.00000000
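
To make the groupby/shift mechanics concrete, here's a toy sketch on a small (date, asset) MultiIndex with made-up values; shifting by -n within each asset pairs today's row with the value realized n rows later for that same asset:

import pandas as pd

idx = pd.MultiIndex.from_product(
    [pd.date_range('2013-01-01', periods=3), ['A', 'B']])
s = pd.Series(range(6), index=idx)

# Each asset's own series moves up one row; the last row per asset is NaN.
print(s.groupby(level=1).shift(-1))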

IID Check of Target

Let's see if the returns are independent and identically distributed.

from scipy.stats import spearmanr

def sp(group, col1_name, col2_name):
    x = group[col1_name]
    y = group[col2_name]
    return spearmanr(x, y)[0]

all_factors['target_p'] = all_factors.groupby(level=1)['return_5d_p'].shift(-5)
all_factors['target_1'] = all_factors.groupby(level=1)['return_5d'].shift(-4)
all_factors['target_2'] = all_factors.groupby(level=1)['return_5d'].shift(-3)
all_factors['target_3'] = all_factors.groupby(level=1)['return_5d'].shift(-2)
all_factors['target_4'] = all_factors.groupby(level=1)['return_5d'].shift(-1)

g = all_factors.dropna().groupby(level=0)
for i in range(4):
    label = 'target_'+str(i+1)
    ic = g.apply(sp, 'target', label)
    ic.plot(ylim=(-1, 1), label=label)
plt.legend(bbox_to_anchor=(1.04, 1), borderaxespad=0)
plt.title('Rolling Autocorrelation of Labels Shifted 1,2,3,4 Days')
plt.show()

[Plot: Rolling Autocorrelation of Labels Shifted 1,2,3,4 Days]

Question: What do you observe in the rolling autocorrelation of labels shifted?

#TODO: Put Answer In this Cell

Train/Valid/Test Splits

Now let's split the data into train, validation, and test datasets. Implement the function train_valid_test_split to split the input samples, all_x, and target values, all_y, into train, validation, and test datasets. The proportion sizes are train_size, valid_size, and test_size respectively.

When splitting, make sure the data is in order: train, then validation, then test. Say train_size is 0.7, valid_size is 0.2, and test_size is 0.1. The first 70 percent of all_x and all_y would be the train set. The next 20 percent of all_x and all_y would be the validation set. The last 10 percent of all_x and all_y would be the test set. Make sure not to split a day between multiple datasets; each day should be contained within a single dataset.

def train_valid_test_split(all_x, all_y, train_size, valid_size, test_size):
    """
    Generate the train, validation, and test dataset.

    Parameters
    ----------
    all_x : DataFrame
        All the input samples
    all_y : Pandas Series
        All the target values
    train_size : float
        The proportion of the data used for the training dataset
    valid_size : float
        The proportion of the data used for the validation dataset
    test_size : float
        The proportion of the data used for the test dataset

    Returns
    -------
    x_train : DataFrame
        The train input samples
    x_valid : DataFrame
        The validation input samples
    x_test : DataFrame
        The test input samples
    y_train : Pandas Series
        The train target values
    y_valid : Pandas Series
        The validation target values
    y_test : Pandas Series
        The test target values
    """
    assert train_size >= 0 and train_size <= 1.0
    assert valid_size >= 0 and valid_size <= 1.0
    assert test_size >= 0 and test_size <= 1.0
    assert train_size + valid_size + test_size == 1.0

    # TODO: Implement
    dates = all_x.index.levels[0]

    # Split on whole days so a single day never spans two datasets.
    train_split = round(train_size * len(dates))
    valid_split = round(valid_size * len(dates)) + train_split

    x_train = all_x.loc[:dates[train_split - 1]]
    x_valid = all_x.loc[dates[train_split]:dates[valid_split - 1]]
    x_test = all_x.loc[dates[valid_split]:]

    y_train = all_y.loc[:dates[train_split - 1]]
    y_valid = all_y.loc[dates[train_split]:dates[valid_split - 1]]
    y_test = all_y.loc[dates[valid_split]:]

    return x_train, x_valid, x_test, y_train, y_valid, y_test

project_tests.test_train_valid_test_split(train_valid_test_split)
Tests Passed

With train_valid_test_split implemented, let's split the data into a train, validation, and test set. For this, we'll use some of the features and the 5 day returns for our target.

features = [
    'Mean_Reversion_Sector_Neutral_Smoothed', 'Momentum_1YR',
    'Overnight_Sentiment_Smoothed', 'adv_120d', 'adv_20d',
    'dispersion_120d', 'dispersion_20d', 'market_vol_120d',
    'market_vol_20d', 'volatility_20d',
    'is_January', 'is_December', 'weekday',
    'month_end', 'month_start', 'qtr_end', 'qtr_start'] + sector_columns
target_label = 'target'

temp = all_factors.dropna().copy()
X = temp[features]
y = temp[target_label]

X_train, X_valid, X_test, y_train, y_valid, y_test = train_valid_test_split(X, y, 0.6, 0.2, 0.2)

X_train.head()
Mean_Reversion_Sector_Neutral_Smoothed Momentum_1YR Overnight_Sentiment_Smoothed adv_120d adv_20d dispersion_120d dispersion_20d market_vol_120d market_vol_20d volatility_20d ... sector_Technology sector_Consumer Defensive sector_Industrials sector_Utilities sector_Financial Services sector_Real Estate sector_Communication Services sector_Consumer Cyclical sector_Energy sector_Basic Materials
2013-01-03 00:00:00+00:00 Equity(0 [A]) -0.26276899 -1.20797813 -1.48566901 1.33857307 1.39741144 0.01326964 0.01117804 0.12966421 0.13758558 -1.21980876 ... False False False False False False False False False False

5 rows × 28 columns

Random Forests

Visualize a Simple Tree

Let's see how a single tree would look using our data.

from IPython.display import display
from sklearn.tree import DecisionTreeClassifier

# This is to get consistent results between each run.
clf_random_state = 0

simple_clf = DecisionTreeClassifier(
    max_depth=3,
    criterion='entropy',
    random_state=clf_random_state)
simple_clf.fit(X_train, y_train)

display(project_helper.plot_tree_classifier(simple_clf, feature_names=features))
project_helper.rank_features_by_importance(simple_clf.feature_importances_, features)

[Figure: decision tree visualization]

      Feature                                    Importance
 1. dispersion_20d                         (0.46843340608552875)
 2. market_vol_120d                        (0.19358730055081794)
 3. sector_Real Estate                     (0.12813661543951108)
 4. Momentum_1YR                           (0.11211962120010666)
 5. sector_Healthcare                      (0.09772305672403549)
 6. sector_Basic Materials                 (0.0)
 7. weekday                                (0.0)
 8. Overnight_Sentiment_Smoothed           (0.0)
 9. adv_120d                               (0.0)
10. adv_20d                                (0.0)
11. dispersion_120d                        (0.0)
12. market_vol_20d                         (0.0)
13. volatility_20d                         (0.0)
14. is_January                             (0.0)
15. is_December                            (0.0)
16. month_end                              (0.0)
17. sector_Energy                          (0.0)
18. month_start                            (0.0)
19. qtr_end                                (0.0)
20. qtr_start                              (0.0)
21. sector_Technology                      (0.0)
22. sector_Consumer Defensive              (0.0)
23. sector_Industrials                     (0.0)
24. sector_Utilities                       (0.0)
25. sector_Financial Services              (0.0)
26. sector_Communication Services          (0.0)
27. sector_Consumer Cyclical               (0.0)
28. Mean_Reversion_Sector_Neutral_Smoothed (0.0)

Question: Why does dispersion_20d have the highest feature importance, when the first split is on the Momentum_1YR feature?

#TODO: Put Answer In this Cell

Train Random Forests with Different Tree Sizes

Let's build models using different tree sizes to find the model that best generalizes.

Parameters

When building the models, we'll use the following parameters.

n_days = 10
n_stocks = 500

clf_parameters = {
    'criterion': 'entropy',
    'min_samples_leaf': n_stocks * n_days,
    'oob_score': True,
    'n_jobs': -1,
    'random_state': clf_random_state}
n_trees_l = [50, 100, 250, 500, 1000]

Recall from the lesson that we'll choose a min_samples_leaf parameter to be small enough to allow the tree to fit the data with as much detail as possible, but not so small that it overfits. We can first propose 500, which is the number of assets in the estimation universe. Since we have about 500 stocks in the stock universe, we'll want at least 500 samples in a leaf for the leaf to make a prediction that is representative. It's common to multiply this by 2, 3, 5, or 10, so we'd have min_samples_leaf values of 500, 1000, 1500, 2500, and 5000. If we were to try these values, we'd notice that the model is "too good to be true" on the training data. A good rule of thumb for what is considered "too good to be true", and therefore a sign of overfitting, is a Sharpe ratio greater than 4. Based on this, we recommend using a min_samples_leaf of 10 * 500, or 5,000.

Feel free to try other values for these parameters, but keep in mind that making too many small adjustments to hyper-parameters can overfit even the validation data, leading to less generalizable performance on the out-of-sample test set. So when trying different parameter values, choose values that differ enough in scale (e.g. 10, 20, 100 instead of 10, 11, 12).

from sklearn.ensemble import RandomForestClassifier

train_score = []
valid_score = []
oob_score = []
feature_importances = []

for n_trees in tqdm(n_trees_l, desc='Training Models', unit='Model'):
    clf = RandomForestClassifier(n_trees, **clf_parameters)
    clf.fit(X_train, y_train)

    train_score.append(clf.score(X_train, y_train.values))
    valid_score.append(clf.score(X_valid, y_valid.values))
    oob_score.append(clf.oob_score_)
    feature_importances.append(clf.feature_importances_)
Training Models: 100%|██████████| 5/5 [10:32<00:00, 126.52s/Model]

Let's look at the accuracy of the classifiers over the number of trees.

project_helper.plot(
    [n_trees_l]*3,
    [train_score, valid_score, oob_score],
    ['train', 'validation', 'oob'],
    'Random Forest Accuracy',
    'Number of Trees')

[Plot: Random Forest accuracy vs. number of trees (train, validation, OOB)]

Question: 1) What do you observe with the accuracy vs tree size graph? 2) Does the graph indicate the model is overfitting or underfitting? Describe how it indicates this.

#TODO: Put Answer In this Cell

Now let's look at the average feature importances of the classifiers.

print('Features Ranked by Average Importance:\n')
project_helper.rank_features_by_importance(np.average(feature_importances, axis=0), features)
Features Ranked by Average Importance:

      Feature                                    Importance
 1. dispersion_20d                         (0.1269778070278301)
 2. volatility_20d                         (0.12109751092486871)
 3. market_vol_120d                        (0.10610349775271208)
 4. market_vol_20d                         (0.1031284505504851)
 5. Momentum_1YR                           (0.09619459615815973)
 6. dispersion_120d                        (0.07964771723886842)
 7. Overnight_Sentiment_Smoothed           (0.07791967844615746)
 8. Mean_Reversion_Sector_Neutral_Smoothed (0.06975870911774727)
 9. adv_120d                               (0.05918153910377677)
10. adv_20d                                (0.05434312068691877)
11. sector_Healthcare                      (0.03158933275282147)
12. sector_Basic Materials                 (0.012648395225842244)
13. sector_Consumer Defensive              (0.011217900024550611)
14. sector_Industrials                     (0.010608580029761724)
15. sector_Financial Services              (0.009124102603531675)
16. weekday                                (0.008277277899941867)
17. sector_Real Estate                     (0.0058334489315844)
18. sector_Utilities                       (0.005173200127194804)
19. sector_Technology                      (0.004050360629711794)
20. sector_Consumer Cyclical               (0.0038784870650338853)
21. sector_Energy                          (0.0026093313325679574)
22. is_January                             (0.0004917284046585393)
23. is_December                            (0.00011882364724656508)
24. month_end                              (2.6404318028065363e-05)
25. qtr_end                                (0.0)
26. month_start                            (0.0)
27. qtr_start                              (0.0)
28. sector_Communication Services          (0.0)

You might notice that some of the features have low to no importance. We will remove them when training the final model.

Model Results

Let's look at some additional metrics to see how well a model performs. We've created the function show_sample_results to show the following results of a model:

  • Sharpe Ratios
  • Factor Returns
  • Factor Rank Autocorrelation

import alphalens as al

all_assets = all_factors.index.levels[1].values.tolist()
all_pricing = get_pricing(
    data_portal,
    trading_calendar,
    all_assets,
    factor_start_date,
    universe_end_date)

def show_sample_results(data, samples, classifier, factors, pricing=all_pricing):
    # Calculate the Alpha Score: P(class 1) - P(class 0), a signed score in [-1, 1]
    prob_array = [-1, 1]
    alpha_score = classifier.predict_proba(samples).dot(np.array(prob_array))

    # Add Alpha Score to rest of the factors
    alpha_score_label = 'AI_ALPHA'
    factors_with_alpha = data.loc[samples.index].copy()
    factors_with_alpha[alpha_score_label] = alpha_score

    # Setup data for AlphaLens
    print('Cleaning Data...\n')
    factor_data = project_helper.build_factor_data(factors_with_alpha[factors + [alpha_score_label]], pricing)
    print('\n-----------------------\n')

    # Calculate Factor Returns and Sharpe Ratio
    factor_returns = project_helper.get_factor_returns(factor_data)
    sharpe_ratio = project_helper.sharpe_ratio(factor_returns)

    # Show Results
    print('             Sharpe Ratios')
    print(sharpe_ratio.round(2))
    project_helper.plot_factor_returns(factor_returns)
    project_helper.plot_factor_rank_autocorrelation(factor_data)

Results

Let's compare our AI Alpha factor to a few other factors. We'll use the following:

factor_names = [
    'Mean_Reversion_Sector_Neutral_Smoothed',
    'Momentum_1YR',
    'Overnight_Sentiment_Smoothed',
    'adv_120d',
    'volatility_20d']

Training Prediction

Let's see how well the model runs on training data.

show_sample_results(all_factors, X_train, clf, factor_names)
Cleaning Data...

Dropped 0.2% entries from factor data: 0.2% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
 ...
Dropped 0.2% entries from factor data: 0.2% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!

-----------------------

             Sharpe Ratios
Mean_Reversion_Sector_Neutral_Smoothed   0.91000000
Momentum_1YR                             0.19000000
Overnight_Sentiment_Smoothed             0.80000000
adv_120d                                 0.40000000
volatility_20d                           1.06000000
AI_ALPHA                                 5.75000000
dtype: float64

[Plot: factor returns]

[Plot: factor rank autocorrelation]

Validation Prediction

Let's see how well the model runs on validation data.

show_sample_results(all_factors, X_valid, clf, factor_names)
Cleaning Data...

Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
 ...
max_loss is 35.0%, not exceeded: OK!

-----------------------

             Sharpe Ratios
Mean_Reversion_Sector_Neutral_Smoothed    0.29000000
Momentum_1YR                              0.76000000
Overnight_Sentiment_Smoothed             -1.33000000
adv_120d                                 -1.34000000
volatility_20d                            0.09000000
AI_ALPHA                                  1.95000000
dtype: float64

[Plot: factor returns]

[Plot: factor rank autocorrelation]

So that's pretty extraordinary. Even when the input factor returns are sideways to down, the AI Alpha is positive, with a Sharpe ratio near 2. If we hope that this model will perform well in production, though, we need to correct for the non-IID labels and mitigate the likely overfitting.

Overlapping Samples

Let's fix this by removing overlapping samples. We can do a number of things:

  • Don't use overlapping samples
  • Use BaggingClassifier's max_samples
  • Build an ensemble of non-overlapping trees

In this project, we'll do all three methods and compare.
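
To see why overlapping labels violate the IID assumption, here's a small self-contained sketch on synthetic data: rolling 5-day sums of IID noise are strongly autocorrelated at short lags, exactly like our 5-day return labels.

import numpy as np
import pandas as pd

np.random.seed(0)
daily = pd.Series(np.random.randn(1000))  # IID daily "returns"
label = daily.rolling(5).sum()            # overlapping 5-day labels

print(label.autocorr(lag=1))  # ~0.8: adjacent labels share 4 of 5 days
print(label.autocorr(lag=5))  # ~0.0: non-overlapping labels are uncorrelated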

Drop Overlapping Samples

This is the simplest of the three methods. We'll just drop any overlapping samples from the dataset. Implement the non_overlapping_samples function to return a new dataset without overlapping samples.

def non_overlapping_samples(x, y, n_skip_samples, start_i=0):
    """
    Get the non overlapping samples.

    Parameters
    ----------
    x : DataFrame
        The input samples
    y : Pandas Series
        The target values
    n_skip_samples : int
        The number of samples to skip
    start_i : int
        The starting index to use for the data

    Returns
    -------
    non_overlapping_x : 2 dimensional Ndarray
        The non overlapping input samples
    non_overlapping_y : 1 dimensional Ndarray
        The non overlapping target values
    """
    assert len(x.shape) == 2
    assert len(y.shape) == 1

    # TODO: Implement
    non_overlapping_x = x.loc[x.index.levels[0][start_i::n_skip_samples + 1].tolist()]
    non_overlapping_y = y.loc[y.index.levels[0][start_i::n_skip_samples + 1].tolist()]

    return non_overlapping_x, non_overlapping_y

project_tests.test_non_overlapping_samples(non_overlapping_samples)
Tests Passed
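
As a rough check of how much data this throws away (illustrative; it assumes the X_train/y_train split from above), skipping 4 samples keeps roughly every 5th trading day:

x_sub, y_sub = non_overlapping_samples(X_train, y_train, n_skip_samples=4)
print(len(X_train), '->', len(x_sub))  # roughly one fifth of the rows remain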

With the dataset created without overlapping samples, let's train a new model and look at the results.

Train Model

train_score = []
valid_score = []
oob_score = []

for n_trees in tqdm(n_trees_l, desc='Training Models', unit='Model'):
    clf = RandomForestClassifier(n_trees, **clf_parameters)
    clf.fit(*non_overlapping_samples(X_train, y_train, 4))

    train_score.append(clf.score(X_train, y_train.values))
    valid_score.append(clf.score(X_valid, y_valid.values))
    oob_score.append(clf.oob_score_)
Training Models: 100%|██████████| 5/5 [01:11<00:00, 14.22s/Model]

Results

project_helper.plot(
    [n_trees_l]*3,
    [train_score, valid_score, oob_score],
    ['train', 'validation', 'oob'],
    'Random Forest Accuracy',
    'Number of Trees')

[Plot: Random Forest accuracy vs. number of trees]

show_sample_results(all_factors, X_valid, clf, factor_names)
Cleaning Data...

Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
 ...
Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!

-----------------------

             Sharpe Ratios
Mean_Reversion_Sector_Neutral_Smoothed    0.29000000
Momentum_1YR                              0.76000000
Overnight_Sentiment_Smoothed             -1.33000000
adv_120d                                 -1.34000000
volatility_20d                            0.09000000
AI_ALPHA                                 -0.37000000
dtype: float64

[Plot: factor returns]

[Plot: factor rank autocorrelation]

This looks better, but we are throwing away a lot of information by taking every 5th row.

Use BaggingClassifier's max_samples

In this method, we'll set max_samples to be on the order of the average uniqueness of the labels. With 5-day labels, each sample overlaps the four labels before and after it, so the average uniqueness is roughly 1/5 = 0.2, which is the max_samples value we'll use below. Since RandomForestClassifier does not take this parameter, we're using BaggingClassifier. Implement bagging_classifier to build the bagging classifier.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def bagging_classifier(n_estimators, max_samples, max_features, parameters):
    """
    Build the bagging classifier.

    Parameters
    ----------
    n_estimators : int 
        The number of base estimators in the ensemble
    max_samples : float 
        The proportion of input samples drawn from when training each base estimator
    max_features : float 
        The proportion of input sample features drawn from when training each base estimator
    parameters : dict
        Parameters to use in building the bagging classifier
        It should contain the following parameters:
            criterion
            min_samples_leaf
            oob_score
            n_jobs
            random_state

    Returns
    -------
    bagging_clf : Scikit-Learn BaggingClassifier
        The bagging classifier
    """

    required_parameters = {'criterion', 'min_samples_leaf', 'oob_score', 'n_jobs', 'random_state'}
    assert not required_parameters - set(parameters.keys())

    # TODO: Implement
    bagging_clf = BaggingClassifier(
        base_estimator=DecisionTreeClassifier(
            criterion=parameters['criterion'],
            min_samples_leaf=parameters['min_samples_leaf']),
        n_estimators=n_estimators,
        max_samples=max_samples,
        max_features=max_features,
        oob_score=parameters['oob_score'],
        n_jobs=parameters['n_jobs'],
        random_state=parameters['random_state'])

    return bagging_clf

project_tests.test_bagging_classifier(bagging_classifier)
Tests Passed

With the bagging classifier built, let's train a new model and look at the results.

Train Model

train_score = []
valid_score = []
oob_score = []

for n_trees in tqdm(n_trees_l, desc='Training Models', unit='Model'):
    clf = bagging_classifier(n_trees, 0.2, 1.0, clf_parameters)
    clf.fit(X_train, y_train)

    train_score.append(clf.score(X_train, y_train.values))
    valid_score.append(clf.score(X_valid, y_valid.values))
    oob_score.append(clf.oob_score_)
Training Models: 100%|██████████| 5/5 [18:01<00:00, 216.30s/Model]

Results

project_helper.plot(
    [n_trees_l]*3,
    [train_score, valid_score, oob_score],
    ['train', 'validation', 'oob'],
    'Random Forest Accuracy',
    'Number of Trees')

[Plot: Random Forest accuracy vs. number of trees]

show_sample_results(all_factors, X_valid, clf, factor_names)
Cleaning Data...

Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
 ...
Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!

-----------------------

             Sharpe Ratios
Mean_Reversion_Sector_Neutral_Smoothed    0.29000000
Momentum_1YR                              0.76000000
Overnight_Sentiment_Smoothed             -1.33000000
adv_120d                                 -1.34000000
volatility_20d                            0.09000000
AI_ALPHA                                  0.78000000
dtype: float64

[Plot: factor returns]

[Plot: factor rank autocorrelation]

This seems much "better" in the sense that the train, validation, and out-of-bag accuracies now track each other much more closely.

Build an ensemble of non-overlapping trees

The last method is to create an ensemble of non-overlapping trees. Here we are going to write a custom scikit-learn estimator: we inherit from VotingClassifier and override the fit method so that each sub-estimator is fit on a non-overlapping subset of the data.

import abc

from sklearn.ensemble import VotingClassifier
from sklearn.base import clone
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import Bunch

class NoOverlapVoterAbstract(VotingClassifier):
    @abc.abstractmethod
    def _calculate_oob_score(self, classifiers):
        raise NotImplementedError

    @abc.abstractmethod
    def _non_overlapping_estimators(self, x, y, classifiers, n_skip_samples):
        raise NotImplementedError

    def __init__(self, estimator, voting='soft', n_skip_samples=4):
        # List of estimators for all the subsets of data
        estimators = [('clf'+str(i), estimator) for i in range(n_skip_samples + 1)]

        self.n_skip_samples = n_skip_samples
        super().__init__(estimators, voting)

    def fit(self, X, y, sample_weight=None):
        estimator_names, clfs = zip(*self.estimators)
        self.le_ = LabelEncoder().fit(y)
        self.classes_ = self.le_.classes_

        clone_clfs = [clone(clf) for clf in clfs]
        self.estimators_ = self._non_overlapping_estimators(X, y, clone_clfs, self.n_skip_samples)
        self.named_estimators_ = Bunch(**dict(zip(estimator_names, self.estimators_)))
        self.oob_score_ = self._calculate_oob_score(self.estimators_)

        return self

You might notice that two of the methods are abstract. These are the functions that you need to implement.

OOB Score

In order to get the correct OOB score, we need to take the average of all the estimators' OOB scores. Implement calculate_oob_score to calculate this score.

def calculate_oob_score(classifiers):
    """
    Calculate the mean out-of-bag score from the classifiers.

    Parameters
    ----------
    classifiers : list of Scikit-Learn Classifiers
        The classifiers used to calculate the mean out-of-bag score

    Returns
    -------
    oob_score : float
        The mean out-of-bag score
    """

    # TODO: Implement
    return np.mean([classifier.oob_score_ for classifier in classifiers])

project_tests.test_calculate_oob_score(calculate_oob_score)
Tests Passed

Non Overlapping Estimators

With calculate_oob_score implemented, let's create non-overlapping estimators. Implement non_overlapping_estimators to build non-overlapping subsets of the data, then fit an estimator on each subset.

def non_overlapping_estimators(x, y, classifiers, n_skip_samples):
    """
    Fit the classifiers to non overlapping data.

    Parameters
    ----------
    x : DataFrame
        The input samples
    y : Pandas Series
        The target values
    classifiers : list of Scikit-Learn Classifiers
        The classifiers used to fit on the non overlapping data
    n_skip_samples : int
        The number of samples to skip

    Returns
    -------
    fit_classifiers : list of Scikit-Learn Classifiers
        The classifiers fit to the non overlapping data
    """

    # TODO: Implement
    fit_classifiers = []
    for i in range(len(classifiers)):
        non_overlapping_x, non_overlapping_y = non_overlapping_samples(x, y, n_skip_samples, start_i=i)
        fit_classifiers.append(classifiers[i].fit(non_overlapping_x, non_overlapping_y))

    return fit_classifiers

project_tests.test_non_overlapping_estimators(non_overlapping_estimators)
Tests Passed

class NoOverlapVoter(NoOverlapVoterAbstract):
    def _calculate_oob_score(self, classifiers):
        return calculate_oob_score(classifiers)

    def _non_overlapping_estimators(self, x, y, classifiers, n_skip_samples):
        return non_overlapping_estimators(x, y, classifiers, n_skip_samples)

Now that we have our NoOverlapVoter class, let's train it.

Train Model

train_score = []
valid_score = []
oob_score = []

for n_trees in tqdm(n_trees_l, desc='Training Models', unit='Model'):
    clf = RandomForestClassifier(n_trees, **clf_parameters)

    clf_nov = NoOverlapVoter(clf)
    clf_nov.fit(X_train, y_train)

    train_score.append(clf_nov.score(X_train, y_train.values))
    valid_score.append(clf_nov.score(X_valid, y_valid.values))
    oob_score.append(clf_nov.oob_score_)
Training Models: 100%|██████████| 5/5 [05:56<00:00, 71.36s/Model]

Results

project_helper.plot(
    [n_trees_l]*3,
    [train_score, valid_score, oob_score],
    ['train', 'validation', 'oob'],
    'Random Forest Accuracy',
    'Number of Trees')

[Plot: Random Forest accuracy vs. number of trees]

show_sample_results(all_factors, X_valid, clf_nov, factor_names)
Cleaning Data...

Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
 ...
Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!

-----------------------

             Sharpe Ratios
Mean_Reversion_Sector_Neutral_Smoothed    0.29000000
Momentum_1YR                              0.76000000
Overnight_Sentiment_Smoothed             -1.33000000
adv_120d                                 -1.34000000
volatility_20d                            0.09000000
AI_ALPHA                                  0.29000000
dtype: float64

[Plot: factor returns]

[Plot: factor rank autocorrelation]

Final Model

Re-Training Model

In production, we would roll the training forward. Typically you would re-train up to the "current day" and then test. Here, we will train on the combined train and validation datasets.

n_trees = 500

clf = RandomForestClassifier(n_trees, **clf_parameters)
clf_nov = NoOverlapVoter(clf)
clf_nov.fit(
    pd.concat([X_train, X_valid]),
    pd.concat([y_train, y_valid]))
NoOverlapVoter(estimator=None, n_skip_samples=4, voting='soft')

Results

Accuracy

print('train: {}, valid: {}, oob: {}'.format(
    clf_nov.score(X_train, y_train.values),
    clf_nov.score(X_valid, y_valid.values),
    clf_nov.oob_score_))
train: 0.5139232658735564, valid: 0.5122386644165191, oob: 0.5068127879069977

Train

show_sample_results(all_factors, X_train, clf_nov, factor_names)
Cleaning Data...

Dropped 0.2% entries from factor data: 0.2% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
 ...
Dropped 0.2% entries from factor data: 0.2% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!

-----------------------

             Sharpe Ratios
Mean_Reversion_Sector_Neutral_Smoothed   0.91000000
Momentum_1YR                             0.19000000
Overnight_Sentiment_Smoothed             0.80000000
adv_120d                                 0.40000000
volatility_20d                           1.06000000
AI_ALPHA                                 2.32000000
dtype: float64

[Plot: factor returns]

[Plot: factor rank autocorrelation]

Validation

show_sample_results(all_factors, X_valid, clf_nov, factor_names)
Cleaning Data...

Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
...
Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!

-----------------------

             Sharpe Ratios
Mean_Reversion_Sector_Neutral_Smoothed    0.29000000
Momentum_1YR                              0.76000000
Overnight_Sentiment_Smoothed             -1.33000000
adv_120d                                 -1.34000000
volatility_20d                            0.09000000
AI_ALPHA                                  2.89000000
dtype: float64

[Plot: factor returns]

[Plot: factor rank autocorrelation]

Test

show_sample_results(all_factors, X_test, clf_nov, factor_names)
Cleaning Data...

Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!

-----------------------

             Sharpe Ratios
Mean_Reversion_Sector_Neutral_Smoothed   -2.01000000
Momentum_1YR                              2.60000000
Overnight_Sentiment_Smoothed              0.33000000
adv_120d                                 -1.57000000
volatility_20d                           -1.68000000
AI_ALPHA                                  0.90000000
dtype: float64
...

[Plot: factor returns]

[Plot: factor rank autocorrelation]

So, hopefully you are appropriately amazed by this. Despite the significant differences between the factor performances in the three sets, the AI ALPHA is able to deliver positive performance.

Those who act often succeed; those who keep walking often arrive. (为者常成，行者常至)