AI Quantitative Trading: Implementing a Basic Pairs Trading Algorithm (Part 2)

[Hands-On] Implementing a Basic Pairs Trading Algorithm

Getting the Data

To do quantitative trading, you first need data. Market data can come from many places: directly from the exchanges, or from third-party vendors such as Wind, Bloomberg, and Reuters. Most of these are paid services, which is no problem for institutions but a stretch for individual users. So here are a few free market-data sites you can play with:

https://www.alphavantage.co/

http://www.financialcontent.com/

https://www.tiingo.com/

https://finance.yahoo.com/

https://www.quandl.com/ (free for educational use)

They all provide intuitive, convenient APIs; just register an account and you can download or import whatever data you need.
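For example, the daily_PEP.csv-style files used below can be produced with Alpha Vantage's daily time-series endpoint. A minimal sketch, assuming you have registered for a free key (YOUR_API_KEY is a placeholder):

import pandas as pd

# Alpha Vantage can return CSV directly; pandas reads it straight from the URL
url = ("https://www.alphavantage.co/query"
       "?function=TIME_SERIES_DAILY&symbol=PEP"
       "&outputsize=full&datatype=csv&apikey=YOUR_API_KEY")
daily_pep = pd.read_csv(url)
daily_pep.to_csv('./data/daily_PEP.csv', index=False)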

Selecting Target Stocks

Following the theory from the previous part, we first look for companies within the same industry, to make sure there is a sufficiently strong underlying economic link between them.

Here are a few US-listed beverage stocks you will recognize:

[figure: the four tickers used below — PEP (PepsiCo), KO (Coca-Cola), KDP (Keurig Dr Pepper), MNST (Monster Beverage)]

import pandas as pd

Download or import these stocks' historical trading records from any of the data sources above, then read them in:

data_PEP = pd.read_csv('./data/daily_PEP.csv', index_col=[0],usecols=['timestamp','close'])
data_PEP.columns = ["PEP"]
data_KO = pd.read_csv('./data/daily_KO.csv', index_col=[0],usecols=['timestamp','close'])
data_KO.columns = ["KO"]
data_KDP = pd.read_csv('./data/daily_KDP.csv', index_col=[0],usecols=['timestamp','close'])
data_KDP.columns = ["KDP"]
data_MNST = pd.read_csv('./data/daily_MNST.csv', index_col=[0],usecols=['timestamp','close'])
data_MNST.columns = ["MNST"]

We don't need every field for now, only each day's closing price, so we keep the data lean (the usecols argument above already restricts each file to timestamp and close):

print(data_PEP)
               PEP
timestamp         
2018-12-13  118.35
2018-12-12  117.00
2018-12-11  117.29
2018-12-10  116.19
2018-12-07  115.82
2018-12-06  116.84
2018-12-04  117.80
2018-12-03  118.98
2018-11-30  121.94
2018-11-29  118.27
2018-11-28  118.50
2018-11-27  116.44
2018-11-26  115.86
2018-11-23  115.41
2018-11-21  115.28
2018-11-20  116.00
...            ...
1998-01-27   36.06
1998-01-05   36.50
1998-01-02   36.00

[5272 rows x 1 columns]

Now stick all the series together:

# sort=True keeps the sorted date index and silences the pandas FutureWarning
data = pd.concat([data_PEP, data_KO, data_KDP, data_MNST], axis=1, sort=True)
print(data)
               PEP     KO    KDP    MNST
1998-01-02   36.00  66.94    NaN   1.810
1998-01-05   36.50  66.44    NaN   1.781
1998-01-06   35.19  66.13    NaN   1.750
1998-01-07   35.88  66.19    NaN   1.719
1998-01-08   35.88  66.63    NaN   1.813
1998-01-09   34.75  64.13    NaN   1.625
1998-01-12   35.38  66.06    NaN   1.500
1998-01-13   36.50  65.38    NaN   1.500
1998-01-14   36.69  64.63    NaN   1.469
1998-01-15   35.94  63.75    NaN   1.625
1998-01-16   36.69  65.00    NaN   1.688
1998-01-20   37.63  65.94    NaN   1.656
1998-01-21   37.00  65.50    NaN   1.750
1998-01-22   36.88  65.44    NaN   1.563
1998-01-23   36.31  63.81    NaN   1.563
1998-01-26   36.44  63.06    NaN   1.719
...            ...    ...    ...     ...
2018-10-31  112.38  47.88  26.00  52.850
2018-11-01  111.51  47.74  26.52  54.450
2018-12-11  117.29  49.54  26.29  57.580
2018-12-12  117.00  49.22  26.12  57.440
2018-12-13  118.35  49.47  26.15  53.430

[5272 rows x 4 columns]


Just for this example, we pick the window 2016-03-01 ~ 2016-10-01:

data = data['2016-03-01':'2016-10-01']

Besides that, we have to deal with the NaNs. If a day's close is NaN, either the data point is missing or the market was closed that day; in both cases we carry the previous day's price forward (note that fillna returns a new frame, so we assign it back):

data = data.fillna(method='ffill')
print(data)
               PEP     KO    KDP    MNST
2016-03-01   99.09  43.69  92.71  129.73
2016-03-02   98.33  43.77  91.98  127.75
2016-03-03   99.16  43.96  91.95  128.47
2016-03-04  100.00  44.11  91.78  128.89
2016-03-07   99.25  44.01  90.21  127.99
2016-03-08   99.74  44.32  90.75  129.33
2016-03-09  100.20  44.81  92.02  130.07
2016-03-10  100.78  45.23  92.12  131.65
2016-03-11  101.31  45.20  91.04  133.86
2016-03-14  100.65  45.29  90.75  131.96
2016-03-18  101.29  45.60  90.35  136.53
2016-03-21  101.54  45.67  90.24  134.52
2016-03-22  100.77  45.50  89.37  134.60
2016-03-23  100.81  45.46  89.96  132.83
2016-03-24  100.68  45.58  88.23  131.25
...            ...    ...    ...     ...
2016-09-07  107.21  43.64  93.55  152.17
2016-09-08  106.88  43.63  92.99  152.12
2016-09-09  104.05  42.27  89.84  147.55
2016-09-12  106.02  43.19  91.26  148.30

[150 rows x 4 columns]

# Sanity checks on the trimmed frame
print(data.shape)   # (150, 4)
print(type(data))   # <class 'pandas.core.frame.DataFrame'>

keys = data.keys()
print(keys)
Index(['PEP', 'KO', 'KDP', 'MNST'], dtype='object')

Finding Cointegration

In this step, for every pair of stocks in the candidate pool, we test whether the two price series are cointegrated:

import numpy as np
from statsmodels.tsa.stattools import coint

# Test every pair of columns for cointegration
def find_cointegrated_pairs(data):
    n = data.shape[1]
    score_matrix = np.zeros((n, n))
    pvalue_matrix = np.ones((n, n))
    keys = data.keys()
    pairs = []
    for i in range(n):
        for j in range(i+1, n):
            S1 = data[keys[i]]
            S2 = data[keys[j]]
            # coint returns (t-statistic, p-value, critical values)
            score, pvalue, _ = coint(S1, S2)
            score_matrix[i, j] = score
            pvalue_matrix[i, j] = pvalue
            if pvalue < 0.05:
                pairs.append((keys[i], keys[j]))
    return score_matrix, pvalue_matrix, pairs

Plotting the resulting p-values as a heatmap makes it easy to see at a glance which pair looks best:

%matplotlib inline
import seaborn

scores, pvalues, pairs = find_cointegrated_pairs(data)

# Draw the p-value matrix as a heatmap; pairs that fail the 0.05
# threshold are masked out
seaborn.heatmap(pvalues,
                xticklabels=['PEP', 'KO', 'KDP', 'MNST'],
                yticklabels=['PEP', 'KO', 'KDP', 'MNST'],
                cmap='RdYlGn_r',
                mask=(pvalues >= 0.05))

[figure: heatmap of pairwise cointegration p-values]

From the heatmap, we pick ['KDP', 'MNST'] as the pair to examine in depth:

S1 = data['KDP']
S2 = data['MNST']

Let's check the p-value for this pair. (Note that the value printed below is well above the 0.05 threshold; we keep the pair anyway purely to walk through the mechanics.)

score, pvalue, _ = coint(S1, S2)
pvalue
0.4548934445039827

Computing the Spread

Following the recipe from before, we run an OLS regression of S2 on S1 to obtain our spread, then plot it. (Heads-up: the outputs below were captured from a run over the 2018 window used in the out-of-sample section later, which is why the printed dates start at 2018-03-01.)

import statsmodels.api as sm
import matplotlib.pyplot as plt

print("S1\n", S1.head(5))
S1 = sm.add_constant(S1)    # add an intercept column for the regression
print("S1-constant\n", S1.head(5))

# OLS (ordinary least squares) regression of S2 on S1
results = sm.OLS(S2, S1).fit()
S1 = S1['KDP']              # drop the constant column again
print("results:\n", results.params)

# Print the model's full diagnostic summary
print("results.summary:\n", results.summary())

b = results.params['KDP']   # the hedge ratio

spread = S2 - b * S1
spread.plot()
plt.axhline(spread.mean(), color='black')
plt.legend(['Spread']);
S1
 2018-03-01    116.44
2018-03-02    116.24
2018-03-05    115.91
2018-03-06    116.12
2018-03-07    116.30
Name: KDP, dtype: float64
S1-constant
             const     KDP
2018-03-01    1.0  116.44
2018-03-02    1.0  116.24
2018-03-05    1.0  115.91
2018-03-06    1.0  116.12
2018-03-07    1.0  116.30
results:
 const    61.59733
KDP      -0.05579
dtype: float64
results.summary:
                             OLS Regression Results                            
==============================================================================
Dep. Variable:                   MNST   R-squared:                       0.533
Model:                            OLS   Adj. R-squared:                  0.530
Method:                 Least Squares   F-statistic:                     167.7
Date:                Sun, 06 Oct 2019   Prob (F-statistic):           4.51e-26
Time:                        08:17:25   Log-Likelihood:                -345.15
No. Observations:                 149   AIC:                             694.3
Df Residuals:                     147   BIC:                             700.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         61.5973      0.406    151.771      0.000      60.795      62.399
KDP           -0.0558      0.004    -12.950      0.000      -0.064      -0.047
==============================================================================
Omnibus:                       20.992   Durbin-Watson:                   0.141
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               25.176
Skew:                          -0.979   Prob(JB):                     3.41e-06
Kurtosis:                       3.469   Cond. No.                         189.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

[figure: the spread S2 - b*S1 with its mean drawn as a horizontal line]

Following standard practice, we compute a standard score (z-score). That puts the baseline clearly at 0 and makes it trivial to draw two boundary lines:

# Normalize a series to zero mean and unit standard deviation
def zscore(series):
    return (series - series.mean()) / np.std(series)

Now we can draw the two guide lines at +1.0 and -1.0:

zscore(spread).plot()
plt.axhline(zscore(spread).mean(), color='black')
plt.axhline(1.0, color='red', linestyle='--')
plt.axhline(-1.0, color='green', linestyle='--')
plt.legend(['Spread z-score', 'Mean', '+1', '-1']);

[figure: spread z-score with its mean and the +1/-1 bands]

Our trading rule is brutally simple (a code sketch follows the list):

  • When the z-score drops below the -1.0 line, long the spread
  • When the z-score rises above the +1.0 line, short the spread
  • When the z-score comes back near 0, exit the hedged position
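A minimal sketch of these rules, assuming the zscore helper and spread from above (trade_signals and its entry/exit thresholds are illustrative choices, not part of the original write-up):

import pandas as pd

def trade_signals(spread, entry=1.0, exit_band=0.5):
    # +1 = long the spread, -1 = short the spread, 0 = flat
    z = zscore(spread)
    position = 0
    positions = []
    for value in z:
        if position == 0 and value < -entry:
            position = 1          # spread unusually low -> long the spread
        elif position == 0 and value > entry:
            position = -1         # spread unusually high -> short the spread
        elif position != 0 and abs(value) < exit_band:
            position = 0          # back near 0 -> exit the hedged position
        positions.append(position)
    return pd.Series(positions, index=spread.index)

signals = trade_signals(spread)

Here "long the spread" means buying S2 and selling b units of S1; shorting the spread is the reverse.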

[Hands-On] An Improved Pairs Trading Model

Moving Average

Since we are moving to rolling averages, we need a rolling OLS that re-estimates the hedge ratio at every step:

from pyfinance.ols import PandasRollingOLS

# Regress S2 on S1 (matching the static OLS above) over a 30-day rolling window
rolling_beta = PandasRollingOLS(y=S2, x=S1, window=30)
print(rolling_beta.beta.head())
            feature1
2016-04-12 -0.119287
2016-04-13 -0.063562
2016-04-14  0.018802
2016-04-15  0.094118
2016-04-18  0.151175
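If pyfinance isn't available, recent statsmodels (0.11+) ships a RollingOLS that produces the same kind of rolling hedge ratio. A minimal sketch under that assumption:

import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS

# One row of coefficients ('const', 'KDP') per window end date
model = RollingOLS(S2, sm.add_constant(S1), window=30).fit()
rolling_b = model.params['KDP']   # the rolling hedge ratio
spread_alt = S2 - rolling_b * S1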

Then we use this rolling OLS to compute the rolling spread and its moving averages:


spread = S2 - rolling_beta.beta['feature1'] * S1
spread.name = 'spread'

# 1-day moving average of the spread (i.e., the spread itself)
spread_mavg1 = spread.rolling(window=1).mean()
spread_mavg1.name = 'spread 1d mavg'

# 30-day moving average of the spread
spread_mavg30 = spread.rolling(window=30).mean()
spread_mavg30.name = 'spread 30d mavg'

plt.plot(spread_mavg1.index, spread_mavg1.values)
plt.plot(spread_mavg30.index, spread_mavg30.values)

plt.legend(['1 Day Spread MAVG', '30 Day Spread MAVG'])

plt.ylabel('Spread');

[figure: 1-day vs 30-day moving averages of the spread]

As you can see, the 30-day rolling average is much smoother, and it gives a far clearer read on when to act.

Compared with a single global mean, a rolling mean adapts as the market moves, so it tracks the shape of the curve much better.

With the 1-day and 30-day averages of the spread in hand, we can combine them into a single standard score, which shows at a glance when the price relationship hits an "extreme":

# Take a rolling 30 day standard deviation
std_30 = spread.rolling(window=30).std()
std_30.name = 'std 30d'

# Compute the z score for each day
zscore_30_1 = (spread_mavg1 - spread_mavg30)/std_30
zscore_30_1.name = 'z-score'
zscore_30_1.plot()
plt.axhline(0, color='black')
plt.axhline(1.0, color='red', linestyle='--')
plt.axhline(-1.0, color='green', linestyle='--');

[figure: rolling 30d/1d z-score with the +1/-1 bands]

To make the relationship between this signal and the actual stock prices clearer, we can draw all three lines on the same axes:

# Plot the prices scaled down along with the negative z-score
# just divide the stock prices by 10 to make viewing it on the plot easier
plt.plot(S1.index, S1.values/10)
plt.plot(S2.index, S2.values/10)
plt.plot(zscore_30_1.index, zscore_30_1.values)
plt.legend(['S1 Price / 10', 'S2 Price / 10', 'Price Spread Rolling z-Score']);

[figure: scaled prices of S1 and S2 plotted alongside the rolling z-score]

Out of Sample

What if we want to trade in a time window beyond the one we examined? Then we have to run the cointegration test again.

For example, let's pull S1 and S2 over a new time window and compare:

data_KDP = pd.read_csv('./data/daily_KDP.csv', index_col=[0], usecols=['timestamp', 'close'])
data_KDP.columns = ["KDP"]
data_MNST = pd.read_csv('./data/daily_MNST.csv', index_col=[0], usecols=['timestamp', 'close'])
data_MNST.columns = ["MNST"]
data = pd.concat([data_KDP, data_MNST], axis=1, sort=True)
data = data['2018-03-01':'2018-10-01']
data = data.fillna(method='ffill')
print(data)
               KDP   MNST
2018-03-01  116.44  54.22
2018-03-02  116.24  54.16
2018-03-05  115.91  55.74
2018-03-06  116.12  55.67
2018-03-07  116.30  55.55
...            ...    ...
2018-10-01   23.10  57.36

[149 rows x 2 columns]

S1 = data['KDP']
S2 = data['MNST']
score, pvalue, _ = coint(S1, S2)
print(pvalue)
0.5241888917528962

See, this value is clearly above our 0.05 threshold, so we can no longer use this pair for pairs trading.

At this point we should exit the strategy and go looking for new opportunities; one way to automate the check is sketched below.
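A hypothetical helper for that re-check (the lookback window and threshold are illustrative parameters): re-run the test on the most recent observations before each trading period, and keep trading only while it passes.

from statsmodels.tsa.stattools import coint

def still_cointegrated(s1, s2, lookback=120, alpha=0.05):
    # Re-test cointegration on the most recent `lookback` observations
    _, pvalue, _ = coint(s1.iloc[-lookback:], s2.iloc[-lookback:])
    return pvalue < alpha

if not still_cointegrated(S1, S2):
    print("Pair no longer cointegrated - exit and re-screen the candidate pool")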

Those who act often succeed; those who keep walking often arrive.