AI For Trading: Feature Engineering and Labeling (113)

Feature Engineering and Labeling

We'll use the price-volume data and generate features that we can feed into a model. We'll use this notebook for all the coding exercises of this lesson, so please open this notebook in a separate tab of your browser.

Please run the following code up to and including "Make Factors." Then continue on with the lesson.

import sys
!{sys.executable} -m pip install --quiet -r requirements.txt
tensorflow 1.3.0 requires tensorflow-tensorboard<0.2.0,>=0.1.0, which is not installed.
import numpy as np
import pandas as pd
import time

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (14, 8)

Registering data

import os
import project_helper
from zipline.data import bundles

os.environ['ZIPLINE_ROOT'] = os.path.join(os.getcwd(), '..', '..', 'data', 'module_4_quizzes_eod')

ingest_func = bundles.csvdir.csvdir_equities(['daily'], project_helper.EOD_BUNDLE_NAME)
bundles.register(project_helper.EOD_BUNDLE_NAME, ingest_func)

print('Data Registered')
Data Registered
from zipline.pipeline import Pipeline
from zipline.pipeline.factors import AverageDollarVolume
from zipline.utils.calendars import get_calendar

universe = AverageDollarVolume(window_length=120).top(500) 
trading_calendar = get_calendar('NYSE') 
bundle_data = bundles.load(project_helper.EOD_BUNDLE_NAME)
engine = project_helper.build_pipeline_engine(bundle_data, trading_calendar)
universe_end_date = pd.Timestamp('2016-01-05', tz='UTC')

universe_tickers = engine\
from zipline.data.data_portal import DataPortal

data_portal = DataPortal(

def get_pricing(data_portal, trading_calendar, assets, start_date, end_date, field='close'):
    end_dt = pd.Timestamp(end_date.strftime('%Y-%m-%d'), tz='UTC', offset='C')
    start_dt = pd.Timestamp(start_date.strftime('%Y-%m-%d'), tz='UTC', offset='C')

    end_loc = trading_calendar.closes.index.get_loc(end_dt)
    start_loc = trading_calendar.closes.index.get_loc(start_dt)

    return data_portal.get_history_window(
        bar_count=end_loc - start_loc,

Make Factors

  • We'll use the same factors we have been using in the lessons about alpha factor research. Factors can be features that we feed into the model.
from zipline.pipeline.factors import CustomFactor, DailyReturns, Returns, SimpleMovingAverage
from zipline.pipeline.data import USEquityPricing

factor_start_date = universe_end_date - pd.DateOffset(years=3, days=2)
sector = project_helper.Sector()

def momentum_1yr(window_length, universe, sector):
    return Returns(window_length=window_length, mask=universe) \
        .demean(groupby=sector) \
        .rank() \
        .zscore()

def mean_reversion_5day_sector_neutral(window_length, universe, sector):
    return -Returns(window_length=window_length, mask=universe) \
        .demean(groupby=sector) \
        .rank() \
        .zscore()

def mean_reversion_5day_sector_neutral_smoothed(window_length, universe, sector):
    unsmoothed_factor = mean_reversion_5day_sector_neutral(window_length, universe, sector)
    return SimpleMovingAverage(inputs=[unsmoothed_factor], window_length=window_length) \
        .rank() \
        .zscore()

class CTO(Returns):
    """
    Computes the overnight return, per hypothesis from
    """
    inputs = [USEquityPricing.open, USEquityPricing.close]

    def compute(self, today, assets, out, opens, closes):
        """
        The opens and closes matrix is 2 rows x N assets, with the most recent at the bottom.
        As such, opens[-1] is the most recent open, and closes[0] is the earlier close
        """
        out[:] = (opens[-1] - closes[0]) / closes[0]

class TrailingOvernightReturns(Returns):
    """
    Sum of trailing 1m O/N returns
    """
    window_safe = True

    def compute(self, today, asset_ids, out, cto):
        out[:] = np.nansum(cto, axis=0)

def overnight_sentiment(cto_window_length, trail_overnight_returns_window_length, universe):
    cto_out = CTO(mask=universe, window_length=cto_window_length)
    return TrailingOvernightReturns(inputs=[cto_out], window_length=trail_overnight_returns_window_length) \
        .rank() \
        .zscore()

def overnight_sentiment_smoothed(cto_window_length, trail_overnight_returns_window_length, universe):
    unsmoothed_factor = overnight_sentiment(cto_window_length, trail_overnight_returns_window_length, universe)
    return SimpleMovingAverage(inputs=[unsmoothed_factor], window_length=trail_overnight_returns_window_length) \
        .rank() \
        .zscore()

universe = AverageDollarVolume(window_length=120).top(500)
sector = project_helper.Sector()

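pipeline = Pipeline(screen=universe)
pipeline.add(momentum_1yr(252, universe, sector), 'Momentum_1YR')
pipeline.add(mean_reversion_5day_sector_neutral_smoothed(20, universe, sector), 'Mean_Reversion_Sector_Neutral_Smoothed')
pipeline.add(overnight_sentiment_smoothed(2, 10, universe), 'Overnight_Sentiment_Smoothed')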

all_factors = engine.run_pipeline(pipeline, factor_start_date, universe_end_date)

                                             Mean_Reversion_Sector_Neutral_Smoothed  Momentum_1YR  Overnight_Sentiment_Smoothed
2013-01-03 00:00:00+00:00  Equity(0 [A])                                  -0.262769     -1.207978                     -1.485669
                           Equity(1 [AAL])                                 0.099926      1.713471                      0.919350
                           Equity(2 [AAP])                                 1.669138     -1.535061                      1.507733
                           Equity(3 [AAPL])                                1.698746      1.193111                     -1.367992
                           Equity(4 [ABBV])                                     NaN           NaN                     -0.250063

Stop here and continue with the lesson section titled "Features".

Universal Quant Features

  • stock volatility: zipline has a custom factor called AnnualizedVolatility. Its source code is pasted below:
class AnnualizedVolatility(CustomFactor):
    """
    Volatility. The degree of variation of a series over time as measured by
    the standard deviation of daily returns.

    **Default Inputs:** :data:`zipline.pipeline.factors.Returns(window_length=2)`  # noqa

    Parameters
    ----------
    annualization_factor : float, optional
        The number of time units per year. Defaults is 252, the number of NYSE
        trading days in a normal year.
    """
    inputs = [Returns(window_length=2)]
    params = {'annualization_factor': 252.0}
    window_length = 252

    def compute(self, today, assets, out, returns, annualization_factor):
        out[:] = nanstd(returns, axis=0) * (annualization_factor ** .5)
from zipline.pipeline.factors import AnnualizedVolatility
AnnualizedVolatility((Returns((USEquityPricing.close::float64,), window_length=2),), window_length=252)


We can see that the Returns window_length is 2, because we're dealing with daily returns, which are calculated as the percent change from one day to the following day (so two days of prices are needed). The AnnualizedVolatility window_length is 252 by default, because it's the one-year volatility. Try to adjust the call to the constructor of AnnualizedVolatility so that it represents one-month volatility (still annualized, but calculated over a time window of 20 trading days).


AnnualizedVolatility((Returns((USEquityPricing.close::float64,), window_length=2),), window_length=20)
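To make the effect of window_length concrete, here is a minimal numpy sketch of the same calculation outside of zipline. The function name annualized_volatility and the random returns are made up for illustration; only the formula (standard deviation of daily returns times the square root of the annualization factor) comes from the source code above.

```python
import numpy as np

def annualized_volatility(returns, annualization_factor=252.0):
    """Per-asset std of daily returns (rows = days, columns = assets), annualized."""
    return np.nanstd(returns, axis=0) * np.sqrt(annualization_factor)

rng = np.random.default_rng(0)
# Hypothetical data: 252 days x 3 assets of daily returns
daily_returns = rng.normal(0.0, 0.01, size=(252, 3))

vol_1y = annualized_volatility(daily_returns)        # full 252-day window
vol_1m = annualized_volatility(daily_returns[-20:])  # most recent 20 days only
print(vol_1y)
print(vol_1m)
```

Both results are on the same annual scale; the 20-day version simply reacts faster to recent moves because it sees fewer days.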

Quiz: Create one-month and six-month annualized volatility.

Create AnnualizedVolatility objects for 20 day and 120 day (one month and six-month) time windows. Remember to set the mask parameter to the universe object created earlier (this filters the stocks to match the list in the universe). Convert these to ranks, and then convert the ranks to zscores.

volatility_20d = AnnualizedVolatility(window_length=20, mask=universe).rank().zscore()
volatility_120d = AnnualizedVolatility(window_length=120, mask=universe).rank().zscore()
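As a rough illustration of what rank followed by zscore does to a cross-section, here is a small pandas sketch on made-up numbers. It assumes ordinal ranking and a population standard deviation; zipline's exact tie-breaking and normalization may differ in detail.

```python
import numpy as np
import pandas as pd

# Hypothetical one-day cross-section of raw volatilities, including an outlier
raw = pd.Series([0.12, 0.35, 0.20, 4.00, 0.18], index=list("ABCDE"))

ranks = raw.rank(method="first")                      # 1..N, ties broken by order
zscores = (ranks - ranks.mean()) / ranks.std(ddof=0)  # population z-score of the ranks
print(zscores.round(4))
```

Because ranks are evenly spaced, the outlier (4.00) ends up only one step above the next value, and the z-scored ranks stay bounded (below the square root of 3 in absolute value when there are no ties). This is consistent with the largest values of about 1.73 in the factor tables.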

Add to the pipeline

pipeline.add(volatility_20d, 'volatility_20d')
pipeline.add(volatility_120d, 'volatility_120d')

Quiz: Average Dollar Volume feature

We've been using AverageDollarVolume to choose the stock universe based on stocks that have the highest dollar volume. We can also use it as a feature that is input into a predictive model.
Use 20 day and 120 day window_length for average dollar volume. Then rank it and convert to a zscore.

"""already imported earlier, but shown here for reference"""
#from zipline.pipeline.factors import AverageDollarVolume 

# TODO: 20-day and 120 day average dollar volume
adv_20d = AverageDollarVolume(window_length=20, mask=universe).rank().zscore()

adv_120d = AverageDollarVolume(window_length=120, mask=universe).rank().zscore()
GroupedRowTransform((Rank(AverageDollarVolume((USEquityPricing.close::float64, USEquityPricing.volume::float64), window_length=20), method='ordinal', mask=AssetExists()), Everything((), window_length=0)), window_length=0)

Add average dollar volume features to pipeline

pipeline.add(adv_20d, 'adv_20d')
pipeline.add(adv_120d, 'adv_120d')

Market Regime Features

We are going to try to capture market-wide regimes. Market-wide means we'll look at the aggregate movement of the universe of stocks.

High and low dispersion: dispersion is the standard deviation of the cross section of all stock returns at each period of time (on each day). We'll inherit from CustomFactor. We'll feed in DailyReturns as the inputs.


If the inputs to our market dispersion factor are the daily returns, and we plan to calculate the market dispersion on each day, what should be the window_length of the market dispersion class?


Quiz: market dispersion feature

Create a class that inherits from CustomFactor. Override the compute function to calculate the population standard deviation of all the stocks over a specified window of time.

Mean returns

$$\mu = \frac{1}{T}\sum_{t=1}^{T}\frac{1}{N}\sum_{i=1}^{N}r_{i,t}$$

Market Dispersion

$$\sqrt{\frac{1}{T} \sum_{t=1}^{T} \frac{1}{N}\sum_{i=1}^{N}(r_{i,t} - \mu)^2}$$

Use numpy.nanmean to calculate the average market return $\mu$ and to calculate the average of the squared differences.

class MarketDispersion(CustomFactor):
    inputs = [DailyReturns()]
    window_length = 1
    window_safe = True

    def compute(self, today, assets, out, returns):

        # TODO: calculate average returns
        mean_returns = np.nanmean(returns)

        #TODO: calculate standard deviation of returns
        out[:] = np.sqrt(np.nanmean((returns - mean_returns)**2))
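The same calculation can be checked on a tiny array outside of zipline. This is only a sketch with made-up returns; it shows that the two nanmean steps together are the population standard deviation with missing values ignored.

```python
import numpy as np

# Hypothetical returns for one day: 1 row (window_length=1) x 5 stocks, one missing
returns = np.array([[0.010, -0.020, 0.005, np.nan, 0.015]])

mean_return = np.nanmean(returns)
dispersion = np.sqrt(np.nanmean((returns - mean_return) ** 2))

# Should agree with numpy's NaN-aware population standard deviation
print(dispersion, np.nanstd(returns))
```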


Create the MarketDispersion object. Apply two separate smoothing operations using SimpleMovingAverage. One with a one-month window, and another with a 6-month window. Add both to the pipeline.

# TODO: create MarketDispersion object
dispersion = MarketDispersion(mask=universe)

# TODO: apply one-month simple moving average
dispersion_20d = SimpleMovingAverage(inputs=[dispersion], window_length=20)

# TODO: apply 6-month simple moving average
dispersion_120d = SimpleMovingAverage(inputs=[dispersion], window_length=120)

# Add to pipeline
pipeline.add(dispersion_20d, 'dispersion_20d')
pipeline.add(dispersion_120d, 'dispersion_120d')

Market volatility feature

  • High and low volatility
    We'll also build a class for market volatility, which inherits from CustomFactor. This will measure the standard deviation of the returns of the "market". In this case, we're approximating the "market" as the equal weighted average return of all the stocks in the stock universe.
Market return

$r_{m,t} = \frac{1}{N}\sum_{i=1}^{N}r_{i,t}$ for each day $t$ in window_length.

Average market return

Also calculate the average market return over the window_length $T$ of days:

$$\mu_{m} = \frac{1}{T}\sum_{t=1}^{T} r_{m,t}$$

Standard deviation of market return

Then calculate the standard deviation of the market return

$$\sigma_{m,t} = \sqrt{252 \times \frac{1}{T} \sum_{t=1}^{T}(r_{m,t} - \mu_{m})^2 } $$

  • Please use numpy.nanmean so that it ignores null values.
  • When using numpy.nanmean:
    axis=0 will calculate one average for every column (think of it like creating a new row in a spreadsheet)
    axis=1 will calculate one average for every row (think of it like creating a new column in a spreadsheet)
  • The returns data in compute has one day in each row, and one stock in each column.
  • Notice that we defined a dictionary params that has a key annualization_factor. This annualization_factor can be used as a regular variable, and you'll be using it in the compute function. This is also done in the definition of AnnualizedVolatility (as seen earlier in the notebook).
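To see the axis hint in action, here is a small sketch on a made-up returns window laid out the same way (one day per row, one stock per column).

```python
import numpy as np

# Hypothetical window: 3 days (rows) x 4 stocks (columns), with one missing value
returns = np.array([
    [0.01,  0.02, -0.01, np.nan],
    [0.00, -0.02,  0.01, 0.03],
    [0.02,  0.01,  0.00, 0.01],
])

per_stock = np.nanmean(returns, axis=0)  # shape (4,): one average per column (stock)
per_day = np.nanmean(returns, axis=1)    # shape (3,): one average per row (day)
print(per_stock.shape, per_day.shape)
```

The per-day version is the one that gives a market return for each day in the window; the missing value in the first row is simply left out of that day's average.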
class MarketVolatility(CustomFactor):
    inputs = [DailyReturns()]
    window_length = 1  # We'll want to set this in the constructor when creating the object.
    window_safe = True
    params = {'annualization_factor': 252.0}

    def compute(self, today, assets, out, returns, annualization_factor):

        # TODO
        # For each row (each row represents one day of returns),
        # calculate the average of the cross-section of stock returns,
        # so that mkt_returns has one value for each day in the window_length.
        # Choose the appropriate axis (please see the hints above).
        mkt_returns = np.nanmean(returns, axis=1)

        # TODO
        # Calculate the mean of market returns
        mkt_returns_mu = np.nanmean(mkt_returns)

        # TODO
        # Calculate the standard deviation of the market returns, then annualize them.
        out[:] = np.sqrt(annualization_factor * np.nanmean((mkt_returns-mkt_returns_mu)**2))
# TODO: create market volatility features using one month and six-month windows
market_vol_20d = MarketVolatility(window_length=20, mask=universe)
market_vol_120d = MarketVolatility(window_length=120, mask=universe)
# add market volatility features to pipeline
pipeline.add(market_vol_20d, 'market_vol_20d')
pipeline.add(market_vol_120d, 'market_vol_120d')
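As a sanity check, the market volatility formula can be reproduced in plain numpy on random data. The function name market_volatility and the simulated returns are assumptions for illustration only.

```python
import numpy as np

def market_volatility(returns, annualization_factor=252.0):
    """Annualized std of the equal-weighted market return over the window."""
    mkt_returns = np.nanmean(returns, axis=1)  # one market return per day
    mu = np.nanmean(mkt_returns)               # average market return over the window
    return np.sqrt(annualization_factor * np.nanmean((mkt_returns - mu) ** 2))

rng = np.random.default_rng(1)
# Hypothetical window: 20 days x 50 stocks
returns = rng.normal(0.0005, 0.02, size=(20, 50))

market_vol = market_volatility(returns)
print(market_vol)
```

Note that, unlike the per-stock volatility features, this produces a single number per day that is shared by every stock, which is why the market_vol columns repeat the same value across assets on a given date.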

Stop here and continue with the lesson section "Sector and Industry"

Sector and Industry

Add sector code

Note that after we run the pipeline and get the data in a dataframe, we can work on enhancing the sector code feature with one-hot encoding.

pipeline.add(sector, 'sector_code')

Run pipeline to calculate features

all_factors = engine.run_pipeline(pipeline, factor_start_date, universe_end_date)
                                             Mean_Reversion_Sector_Neutral_Smoothed  Momentum_1YR  Overnight_Sentiment_Smoothed   adv_120d    adv_20d  dispersion_120d  dispersion_20d  market_vol_120d  market_vol_20d  sector_code  volatility_120d  volatility_20d
2013-01-03 00:00:00+00:00  Equity(0 [A])                                  -0.262769     -1.207978                     -1.485669   1.338573   1.397411         0.013270        0.011178         0.127654        0.135452            0        -0.836546       -1.219809
                           Equity(1 [AAL])                                 0.099926      1.713471                      0.919350   1.139994   1.081155         0.013270        0.011178         0.127654        0.135452            3         1.639924        1.566220
                           Equity(2 [AAP])                                 1.669138     -1.535061                      1.507733  -0.301547  -0.919350         0.013270        0.011178         0.127654        0.135452            8         1.072400       -1.470404
                           Equity(3 [AAPL])                                1.698746      1.193111                     -1.367992   1.728377   1.728377         0.013270        0.011178         0.127654        0.135452            1         1.050289        1.617813
                           Equity(4 [ABBV])                                     NaN           NaN                     -0.250063  -1.728377  -1.647475         0.014595        0.014595         0.127654        0.135452            0              NaN             NaN

One-hot encode sector

Let's get all the unique sector codes. Then we'll use the == comparison operator to check when the sector code equals a particular value. This returns a series of True/False values. For some functions that we'll use in a later lesson, it's easier to work with numbers instead of booleans. We can convert the booleans to type int, so False becomes 0 and True becomes 1.

sector_code_l = set(all_factors['sector_code'])
sector_0 = all_factors['sector_code'] == 0
2013-01-03 00:00:00+00:00  Equity(0 [A])        True
                           Equity(1 [AAL])     False
                           Equity(2 [AAP])     False
                           Equity(3 [AAPL])    False
                           Equity(4 [ABBV])     True
Name: sector_code, dtype: bool
sector_0_numeric = sector_0.astype(int)
2013-01-03 00:00:00+00:00  Equity(0 [A])       1
                           Equity(1 [AAL])     0
                           Equity(2 [AAP])     0
                           Equity(3 [AAPL])    0
                           Equity(4 [ABBV])    1
Name: sector_code, dtype: int64

Quiz: One-hot encode sector

Choose column names that look like "sector_code_0", "sector_code_1" etc. Store the values as 1 when the row matches the sector code of the column, 0 otherwise.

# TODO: one-hot encode sector and store into dataframe
for s in sector_code_l:
    all_factors[f'sector_code_{s}'] = (all_factors['sector_code'] == s).astype(int)
                                             Mean_Reversion_Sector_Neutral_Smoothed  Momentum_1YR  Overnight_Sentiment_Smoothed   adv_120d    adv_20d  dispersion_120d  dispersion_20d  market_vol_120d  market_vol_20d  sector_code  ...  sector_code_2  sector_code_3  sector_code_4  sector_code_5  sector_code_6  sector_code_7  sector_code_8  sector_code_9  sector_code_10  sector_code_-1
2013-01-03 00:00:00+00:00  Equity(0 [A])                                  -0.262769     -1.207978                     -1.485669   1.338573   1.397411         0.013270        0.011178         0.127654        0.135452            0  ...              0              0              0              0              0              0              0              0               0               0
                           Equity(1 [AAL])                                 0.099926      1.713471                      0.919350   1.139994   1.081155         0.013270        0.011178         0.127654        0.135452            3  ...              0              1              0              0              0              0              0              0               0               0
                           Equity(2 [AAP])                                 1.669138     -1.535061                      1.507733  -0.301547  -0.919350         0.013270        0.011178         0.127654        0.135452            8  ...              0              0              0              0              0              0              1              0               0               0
                           Equity(3 [AAPL])                                1.698746      1.193111                     -1.367992   1.728377   1.728377         0.013270        0.011178         0.127654        0.135452            1  ...              0              0              0              0              0              0              0              0               0               0
                           Equity(4 [ABBV])                                     NaN           NaN                     -0.250063  -1.728377  -1.647475         0.014595        0.014595         0.127654        0.135452            0  ...              0              0              0              0              0              0              0              0               0               0

5 rows × 24 columns
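For comparison, pandas also has a built-in helper, get_dummies, that produces the same kind of columns. The sketch below uses a small made-up dataframe named toy in place of all_factors and checks that both approaches agree.

```python
import pandas as pd

# Hypothetical stand-in for the all_factors['sector_code'] column
toy = pd.DataFrame({"sector_code": [0, 3, 8, 1, 0, -1]})

# Loop-based encoding, as in the quiz
for s in sorted(set(toy["sector_code"])):
    toy[f"sector_code_{s}"] = (toy["sector_code"] == s).astype(int)

# Same encoding with pandas' built-in helper
dummies = pd.get_dummies(toy["sector_code"], prefix="sector_code").astype(int)

# Compare the two encodings column by column
same = (dummies.to_numpy() == toy[list(dummies.columns)].to_numpy()).all()
print(same)
```

Either way, each row has exactly one column set to 1. The explicit loop makes the column naming obvious, while get_dummies is shorter.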

Stop here and continue with the lesson section "Date Parts".

Date Parts

  • We will make features that might capture trader/investor behavior due to calendar anomalies.
  • We can get the dates from the index of the dataframe that is returned from running the pipeline.

Accessing index of dates

  • Note that we can access the date index using DataFrame.index.get_level_values(0), since the date is stored as index level 0, and the asset name is stored in index level 1. This is of type DatetimeIndex.
DatetimeIndex(['2013-01-03', '2013-01-03', '2013-01-03', '2013-01-03',
               '2013-01-03', '2013-01-03', '2013-01-03', '2013-01-03',
               '2013-01-03', '2013-01-03',
               ...
               '2016-01-05', '2016-01-05', '2016-01-05', '2016-01-05',
               '2016-01-05', '2016-01-05', '2016-01-05', '2016-01-05',
               '2016-01-05', '2016-01-05'],
              dtype='datetime64[ns, UTC]', length=363734, freq=None)

DatetimeIndex attributes

  • The month attribute holds one integer per row: 1 for January, 2 for February, and so on up to 12 for December.

  • We can use a comparison operator such as == to return True or False.

  • It's usually easier to have all data of a similar type (numeric), so we recommend converting booleans to integers.
    The numpy ndarray has a method .astype() that can cast the data to a specified type.
    For instance, astype(int) converts False to 0 and True to 1.

# Example
print(all_factors.index.get_level_values(0).month == 1)
print( (all_factors.index.get_level_values(0).month == 1).astype(int) )
[1 1 1 ... 1 1 1]
[ True  True  True ...  True  True  True]
[1 1 1 ... 1 1 1]


  • Create a numpy array that has 1 when the month is January, and 0 otherwise. Store it as a column in the all_factors dataframe.
  • Add another similar column to indicate when the month is December
# TODO: create a feature that indicates whether it's January
all_factors['is_January'] = (all_factors.index.get_level_values(0).month == 1).astype(int)

# TODO: create a feature to indicate whether it's December
all_factors['is_December'] = (all_factors.index.get_level_values(0).month == 12).astype(int)
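The same pattern can be tried on a tiny two-level index without running the pipeline. The dataframe toy below is a made-up stand-in for all_factors, with (date, asset) as the index.

```python
import pandas as pd

# Hypothetical two-level index shaped like the pipeline output: (date, asset)
dates = pd.to_datetime(["2015-01-02", "2015-06-15", "2015-12-31"]).tz_localize("UTC")
index = pd.MultiIndex.from_product([dates, ["A", "AAL"]], names=["date", "asset"])
toy = pd.DataFrame({"x": range(len(index))}, index=index)

# Level 0 holds the dates; .month gives one integer per row
month = toy.index.get_level_values(0).month
toy["is_January"] = (month == 1).astype(int)
toy["is_December"] = (month == 12).astype(int)
print(toy)
```

Every asset on the same date gets the same flag, since the feature depends only on index level 0.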

Weekday, quarter

  • Add columns to the all_factors dataframe that specify the weekday, quarter and year.
  • As you can see in the documentation for DatetimeIndex, weekday, quarter, and year are attributes that you can use here.
# we can see that 0 is for Monday, 4 is for Friday
{0, 1, 2, 3, 4}
# Q1, Q2, Q3 and Q4 are represented by integers too
{1, 2, 3, 4}


Add features for weekday, quarter and year.

all_factors['weekday'] = all_factors.index.get_level_values(0).weekday
all_factors['quarter'] = all_factors.index.get_level_values(0).quarter
all_factors['year'] = all_factors.index.get_level_values(0).year

Start and end-of features

  • The start and end of the week, month, and quarter may have structural differences in trading activity.
  • pandas.date_range takes a start date, an end date, and a frequency (the start, end, and freq arguments).
  • The frequency code for the last business day of each month is BM.
# Example
tmp = pd.date_range(start=factor_start_date, end=universe_end_date, freq='BM')
DatetimeIndex(['2013-01-31', '2013-02-28', '2013-03-29', '2013-04-30',
               '2013-05-31', '2013-06-28', '2013-07-31', '2013-08-30',
               '2013-09-30', '2013-10-31', '2013-11-29', '2013-12-31',
               '2014-01-31', '2014-02-28', '2014-03-31', '2014-04-30',
               '2014-05-30', '2014-06-30', '2014-07-31', '2014-08-29',
               '2014-09-30', '2014-10-31', '2014-11-28', '2014-12-31',
               '2015-01-30', '2015-02-27', '2015-03-31', '2015-04-30',
               '2015-05-29', '2015-06-30', '2015-07-31', '2015-08-31',
               '2015-09-30', '2015-10-30', '2015-11-30', '2015-12-31'],
              dtype='datetime64[ns, UTC]', freq='BM')


Create a DatetimeIndex that stores the dates which are the last business day of each month.
Use the .isin function, passing in these last days of the month, to create a series of booleans.
Convert the booleans to integers.

last_day_of_month = pd.date_range(start=factor_start_date, end=universe_end_date, freq='BM')
DatetimeIndex(['2013-01-31', '2013-02-28', '2013-03-29', '2013-04-30',
               '2013-05-31', '2013-06-28', '2013-07-31', '2013-08-30',
               '2013-09-30', '2013-10-31', '2013-11-29', '2013-12-31',
               '2014-01-31', '2014-02-28', '2014-03-31', '2014-04-30',
               '2014-05-30', '2014-06-30', '2014-07-31', '2014-08-29',
               '2014-09-30', '2014-10-31', '2014-11-28', '2014-12-31',
               '2015-01-30', '2015-02-27', '2015-03-31', '2015-04-30',
               '2015-05-29', '2015-06-30', '2015-07-31', '2015-08-31',
               '2015-09-30', '2015-10-30', '2015-11-30', '2015-12-31'],
              dtype='datetime64[ns, UTC]', freq='BM')
tmp_month_end = all_factors.index.get_level_values(0).isin(last_day_of_month)
array([False, False, False, ..., False, False, False])
tmp_month_end_int = tmp_month_end.astype(int)
array([0, 0, 0, ..., 0, 0, 0])
all_factors['month_end'] = tmp_month_end_int
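Here is a self-contained sketch of the same steps on a short, made-up range of business days. It passes the offset object pd.offsets.BMonthEnd() as the frequency instead of the 'BM' string; this is an assumption made so the sketch does not depend on which alias spelling a given pandas version accepts.

```python
import pandas as pd

# Hypothetical daily index of business days (exchange holidays are not removed)
days = pd.bdate_range("2015-01-01", "2015-03-31")

# Last business day of each month in the range
month_ends = pd.date_range(days[0], days[-1], freq=pd.offsets.BMonthEnd())

# Boolean membership test, then cast to 0/1
is_month_end = days.isin(month_ends).astype(int)
print(month_ends.strftime("%Y-%m-%d").tolist())
print(is_month_end.sum())
```

Note that when the calendar month ends on a weekend, the flagged date is the preceding Friday, which matches entries such as 2015-01-30 and 2015-02-27 in the output above.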

Quiz: Start of Month

Create a feature that indicates the first business day of each month.

Hint: The frequency for first business day of the month uses the code BMS.

# TODO: month_start feature
first_day_of_month = pd.date_range(start=factor_start_date, end=universe_end_date, freq='BMS')
all_factors['month_start'] = (all_factors.index.get_level_values(0).isin(first_day_of_month)).astype(int)

Quiz: Quarter end and quarter start

Create features for the last business day of each quarter, and first business day of each quarter.
Hint: use freq=BQ for business day end of quarter, and freq=BQS for business day start of quarter.

# TODO: qtr_end feature
last_day_qtr = pd.date_range(start=factor_start_date, end=universe_end_date, freq='BQ')
all_factors['qtr_end'] = (all_factors.index.get_level_values(0).isin(last_day_qtr)).astype(int)
# TODO: qtr_start feature
first_day_qtr = pd.date_range(start=factor_start_date, end=universe_end_date, freq='BQS')
all_factors['qtr_start'] = (all_factors.index.get_level_values(0).isin(first_day_qtr)).astype(int)
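The quarter features can be checked the same way on a made-up year of business days. As an assumption for this sketch, the offsets are written out as objects with startingMonth given explicitly, so that the quarters are the calendar quarters regardless of alias defaults.

```python
import pandas as pd

# Hypothetical daily index of business days for one year
days = pd.bdate_range("2015-01-01", "2015-12-31")

# Last and first business day of each calendar quarter
qtr_ends = pd.date_range(days[0], days[-1], freq=pd.offsets.BQuarterEnd(startingMonth=12))
qtr_starts = pd.date_range(days[0], days[-1], freq=pd.offsets.BQuarterBegin(startingMonth=1))

is_qtr_end = days.isin(qtr_ends).astype(int)
is_qtr_start = days.isin(qtr_starts).astype(int)
print(qtr_ends.strftime("%Y-%m-%d").tolist())
print(qtr_starts.strftime("%Y-%m-%d").tolist())
```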

View all features


Note that we can skip the sector_code feature, since we one-hot encoded it into separate features.

features = ['Mean_Reversion_Sector_Neutral_Smoothed',
 #'sector_code', # removed sector_code

Stop here and continue to the lesson section "Targets"

Targets (Labels)

  • We are going to try to predict the forward 1-week return.
  • Very important! Quantize the target. Why do we do this?
    • It makes the target a market-neutral return.
    • It normalizes changing volatility and dispersion over time.
    • It makes the target robust to changes in market regimes.
  • The factor we create is the trailing 5-day return.
# we'll create a separate pipeline to handle the target
pipeline_target = Pipeline(screen=universe)


We'll convert weekly returns into 2-quantiles.

return_5d_2q = Returns(window_length=5, mask=universe).quantiles(2)
Quantiles((Returns((USEquityPricing.close::float64,), window_length=5),), window_length=0)
pipeline_target.add(return_5d_2q, 'return_5d_2q')
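To illustrate why quantizing helps, here is a small pandas sketch that buckets made-up returns into two quantiles separately on each date. It uses pandas.qcut rather than zipline's quantiles method, so details such as the label used for missing values (zipline shows -1 in its output) are not reproduced.

```python
import pandas as pd

# Hypothetical 5-day returns for 6 assets on two dates: a calm day and a volatile day
dates = pd.to_datetime(["2015-01-05", "2015-01-06"])
index = pd.MultiIndex.from_product([dates, list("ABCDEF")], names=["date", "asset"])
returns = pd.Series(
    [0.01, -0.01, 0.02, -0.02, 0.005, -0.005,  # calm day
     0.10, -0.10, 0.20, -0.20, 0.05, -0.05],   # volatile day, same ordering of assets
    index=index,
)

# Bucket each date's cross-section into 2 quantiles: 0 = bottom half, 1 = top half
labels = returns.groupby(level="date").transform(
    lambda x: pd.qcut(x, 2, labels=False)
)
print(labels.unstack())
```

The second day's returns are ten times larger, yet the labels are identical on both days, because only each stock's position relative to the rest of the cross-section matters.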


Create another weekly return target that's converted to 5-quantiles.

# TODO: create a target using 5-quantiles
return_5d_5q = Returns(window_length=5, mask=universe).quantiles(5)

# TODO: add the feature to the pipeline
pipeline_target.add(return_5d_5q, 'return_5d_5q')

# Let's run the pipeline to get the dataframe
targets_df = engine.run_pipeline(pipeline_target, factor_start_date, universe_end_date)
                                             return_5d_2q  return_5d_5q
2013-01-03 00:00:00+00:00  Equity(0 [A])                0             0
                           Equity(1 [AAL])              1             1
                           Equity(2 [AAP])              0             0
                           Equity(3 [AAPL])             1             1
                           Equity(4 [ABBV])            -1            -1
Index(['return_5d_2q', 'return_5d_5q'], dtype='object')