AI Quantitative Trading: Implementing a Basic Pairs Trading Algorithm (Part 2)

[Hands-On] Implementing a Basic Pairs Trading Algorithm

Getting the Data

To do quantitative trading, you first need data. Market data can come from many places: directly from the exchanges, or from third-party vendors such as Wind, Bloomberg, and Reuters. Most of these are paid services, which is no problem for institutions but a stretch for individual users. So here are a few free market-data sites you can play with:

https://www.alphavantage.co/

http://www.financialcontent.com/

https://www.tiingo.com/

https://finance.yahoo.com/

https://www.quandl.com/ (free for educational use)

They all provide intuitive, convenient APIs; just register an account and you can download or import whatever data you need.
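For example, the daily_PEP.csv-style files used below can be produced with Alpha Vantage's daily time-series endpoint. A minimal sketch, assuming you have registered for a free key (YOUR_API_KEY is a placeholder):

import pandas as pd

# Alpha Vantage can return CSV directly; pandas reads it straight from the URL
url = ("https://www.alphavantage.co/query"
       "?function=TIME_SERIES_DAILY&symbol=PEP"
       "&outputsize=full&datatype=csv&apikey=YOUR_API_KEY")
daily_pep = pd.read_csv(url)
daily_pep.to_csv('./data/daily_PEP.csv', index=False)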

Selecting Target Stocks

Following the theory from the previous part, we first look for companies within the same industry, to make sure there is a sufficiently strong underlying economic link between them.

Here are a few US-listed beverage stocks you will recognize:

[figure: the four tickers used below — PEP (PepsiCo), KO (Coca-Cola), KDP (Keurig Dr Pepper), MNST (Monster Beverage)]

import pandas as pd

Download or import these stocks' historical trading records from any of the data sources above, then read them in:

data_PEP = pd.read_csv('./data/daily_PEP.csv', index_col=[0],usecols=['timestamp','close'])
data_PEP.columns = ["PEP"]
data_KO = pd.read_csv('./data/daily_KO.csv', index_col=[0],usecols=['timestamp','close'])
data_KO.columns = ["KO"]
data_KDP = pd.read_csv('./data/daily_KDP.csv', index_col=[0],usecols=['timestamp','close'])
data_KDP.columns = ["KDP"]
data_MNST = pd.read_csv('./data/daily_MNST.csv', index_col=[0],usecols=['timestamp','close'])
data_MNST.columns = ["MNST"]

We don't need every field for now, only each day's closing price, so we keep the data lean (the usecols argument above already restricts each file to timestamp and close):

print(data_PEP)
               PEP
timestamp         
2018-12-13  118.35
2018-12-12  117.00
2018-12-11  117.29
2018-12-10  116.19
2018-12-07  115.82
2018-12-06  116.84
2018-12-04  117.80
2018-12-03  118.98
2018-11-30  121.94
2018-11-29  118.27
2018-11-28  118.50
2018-11-27  116.44
2018-11-26  115.86
2018-11-23  115.41
2018-11-21  115.28
2018-11-20  116.00
...            ...
1998-01-27   36.06
1998-01-05   36.50
1998-01-02   36.00

[5272 rows x 1 columns]

Now stick all the series together:

# sort=True keeps the sorted date index and silences the pandas FutureWarning
data = pd.concat([data_PEP, data_KO, data_KDP, data_MNST], axis=1, sort=True)
print(data)
               PEP     KO    KDP    MNST
1998-01-02   36.00  66.94    NaN   1.810
1998-01-05   36.50  66.44    NaN   1.781
1998-01-06   35.19  66.13    NaN   1.750
1998-01-07   35.88  66.19    NaN   1.719
1998-01-08   35.88  66.63    NaN   1.813
1998-01-09   34.75  64.13    NaN   1.625
1998-01-12   35.38  66.06    NaN   1.500
1998-01-13   36.50  65.38    NaN   1.500
1998-01-14   36.69  64.63    NaN   1.469
1998-01-15   35.94  63.75    NaN   1.625
1998-01-16   36.69  65.00    NaN   1.688
1998-01-20   37.63  65.94    NaN   1.656
1998-01-21   37.00  65.50    NaN   1.750
1998-01-22   36.88  65.44    NaN   1.563
1998-01-23   36.31  63.81    NaN   1.563
1998-01-26   36.44  63.06    NaN   1.719
...            ...    ...    ...     ...
2018-10-31  112.38  47.88  26.00  52.850
2018-11-01  111.51  47.74  26.52  54.450
2018-12-11  117.29  49.54  26.29  57.580
2018-12-12  117.00  49.22  26.12  57.440
2018-12-13  118.35  49.47  26.15  53.430

[5272 rows x 4 columns]


Just for this example, we pick the window 2016-03-01 ~ 2016-10-01:

data = data['2016-03-01':'2016-10-01']

Besides that, we have to deal with the NaNs. If a day's close is NaN, either the data point is missing or the market was closed that day; in both cases we carry the previous day's price forward (note that fillna returns a new frame, so we assign it back):

data = data.fillna(method='ffill')
print(data)
               PEP     KO    KDP    MNST
2016-03-01   99.09  43.69  92.71  129.73
2016-03-02   98.33  43.77  91.98  127.75
2016-03-03   99.16  43.96  91.95  128.47
2016-03-04  100.00  44.11  91.78  128.89
2016-03-07   99.25  44.01  90.21  127.99
2016-03-08   99.74  44.32  90.75  129.33
2016-03-09  100.20  44.81  92.02  130.07
2016-03-10  100.78  45.23  92.12  131.65
2016-03-11  101.31  45.20  91.04  133.86
2016-03-14  100.65  45.29  90.75  131.96
2016-03-18  101.29  45.60  90.35  136.53
2016-03-21  101.54  45.67  90.24  134.52
2016-03-22  100.77  45.50  89.37  134.60
2016-03-23  100.81  45.46  89.96  132.83
2016-03-24  100.68  45.58  88.23  131.25
...            ...    ...    ...     ...
2016-09-07  107.21  43.64  93.55  152.17
2016-09-08  106.88  43.63  92.99  152.12
2016-09-09  104.05  42.27  89.84  147.55
2016-09-12  106.02  43.19  91.26  148.30

[150 rows x 4 columns]

# Sanity checks on the trimmed frame
print(data.shape)   # (150, 4)
print(type(data))   # <class 'pandas.core.frame.DataFrame'>

keys = data.keys()
print(keys)
Index(['PEP', 'KO', 'KDP', 'MNST'], dtype='object')

Finding Cointegration

In this step, for every pair of stocks in the candidate pool, we test whether the two price series are cointegrated:

import numpy as np
from statsmodels.tsa.stattools import coint

# Test every pair of columns for cointegration
def find_cointegrated_pairs(data):
    n = data.shape[1]
    score_matrix = np.zeros((n, n))
    pvalue_matrix = np.ones((n, n))
    keys = data.keys()
    pairs = []
    for i in range(n):
        for j in range(i+1, n):
            S1 = data[keys[i]]
            S2 = data[keys[j]]
            # coint returns (t-statistic, p-value, critical values)
            score, pvalue, _ = coint(S1, S2)
            score_matrix[i, j] = score
            pvalue_matrix[i, j] = pvalue
            if pvalue < 0.05:
                pairs.append((keys[i], keys[j]))
    return score_matrix, pvalue_matrix, pairs

Plotting the resulting p-values as a heatmap makes it easy to see at a glance which pair looks best:

%matplotlib inline
import seaborn

scores, pvalues, pairs = find_cointegrated_pairs(data)

# Draw the p-value matrix as a heatmap; pairs that fail the 0.05
# threshold are masked out
seaborn.heatmap(pvalues,
                xticklabels=['PEP', 'KO', 'KDP', 'MNST'],
                yticklabels=['PEP', 'KO', 'KDP', 'MNST'],
                cmap='RdYlGn_r',
                mask=(pvalues >= 0.05))

[figure: heatmap of pairwise cointegration p-values]

From the heatmap, we pick ['KDP', 'MNST'] as the pair to examine in depth:

S1 = data['KDP']
S2 = data['MNST']

Let's check the p-value for this pair. (Note that the value printed below is well above the 0.05 threshold; we keep the pair anyway purely to walk through the mechanics.)

score, pvalue, _ = coint(S1, S2)
pvalue
0.4548934445039827

Computing the Spread

Following the recipe from before, we run an OLS regression of S2 on S1 to obtain our spread, then plot it. (Heads-up: the outputs below were captured from a run over the 2018 window used in the out-of-sample section later, which is why the printed dates start at 2018-03-01.)

import statsmodels.api as sm
import matplotlib.pyplot as plt

print("S1\n", S1.head(5))
S1 = sm.add_constant(S1)    # add an intercept column for the regression
print("S1-constant\n", S1.head(5))

# OLS (ordinary least squares) regression of S2 on S1
results = sm.OLS(S2, S1).fit()
S1 = S1['KDP']              # drop the constant column again
print("results:\n", results.params)

# Print the model's full diagnostic summary
print("results.summary:\n", results.summary())

b = results.params['KDP']   # the hedge ratio

spread = S2 - b * S1
spread.plot()
plt.axhline(spread.mean(), color='black')
plt.legend(['Spread']);
S1
 2018-03-01    116.44
2018-03-02    116.24
2018-03-05    115.91
2018-03-06    116.12
2018-03-07    116.30
Name: KDP, dtype: float64
S1-constant
             const     KDP
2018-03-01    1.0  116.44
2018-03-02    1.0  116.24
2018-03-05    1.0  115.91
2018-03-06    1.0  116.12
2018-03-07    1.0  116.30
results:
 const    61.59733
KDP      -0.05579
dtype: float64
results.summary:
                             OLS Regression Results                            
==============================================================================
Dep. Variable:                   MNST   R-squared:                       0.533
Model:                            OLS   Adj. R-squared:                  0.530
Method:                 Least Squares   F-statistic:                     167.7
Date:                Sun, 06 Oct 2019   Prob (F-statistic):           4.51e-26
Time:                        08:17:25   Log-Likelihood:                -345.15
No. Observations:                 149   AIC:                             694.3
Df Residuals:                     147   BIC:                             700.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         61.5973      0.406    151.771      0.000      60.795      62.399
KDP           -0.0558      0.004    -12.950      0.000      -0.064      -0.047
==============================================================================
Omnibus:                       20.992   Durbin-Watson:                   0.141
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               25.176
Skew:                          -0.979   Prob(JB):                     3.41e-06
Kurtosis:                       3.469   Cond. No.                         189.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

[figure: the spread S2 - b*S1 with its mean drawn as a horizontal line]

Following standard practice, we compute a standard score (z-score). That puts the baseline clearly at 0 and makes it trivial to draw two boundary lines:

# Normalize a series to zero mean and unit standard deviation
def zscore(series):
    return (series - series.mean()) / np.std(series)

Now we can draw the two guide lines at +1.0 and -1.0:

zscore(spread).plot()
plt.axhline(zscore(spread).mean(), color='black')
plt.axhline(1.0, color='red', linestyle='--')
plt.axhline(-1.0, color='green', linestyle='--')
plt.legend(['Spread z-score', 'Mean', '+1', '-1']);

[figure: spread z-score with its mean and the +1/-1 bands]

Our trading rule is brutally simple (a code sketch follows the list):

  • When the z-score drops below the -1.0 line, long the spread
  • When the z-score rises above the +1.0 line, short the spread
  • When the z-score comes back near 0, exit the hedged position
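A minimal sketch of these rules, assuming the zscore helper and spread from above (trade_signals and its entry/exit thresholds are illustrative choices, not part of the original write-up):

import pandas as pd

def trade_signals(spread, entry=1.0, exit_band=0.5):
    # +1 = long the spread, -1 = short the spread, 0 = flat
    z = zscore(spread)
    position = 0
    positions = []
    for value in z:
        if position == 0 and value < -entry:
            position = 1          # spread unusually low -> long the spread
        elif position == 0 and value > entry:
            position = -1         # spread unusually high -> short the spread
        elif position != 0 and abs(value) < exit_band:
            position = 0          # back near 0 -> exit the hedged position
        positions.append(position)
    return pd.Series(positions, index=spread.index)

signals = trade_signals(spread)

Here "long the spread" means buying S2 and selling b units of S1; shorting the spread is the reverse.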

[Hands-On] An Improved Pairs Trading Model

Moving Average

Since we are moving to rolling averages, we need a rolling OLS that re-estimates the hedge ratio at every step:

from pyfinance.ols import PandasRollingOLS

# Regress S2 on S1 (matching the static OLS above) over a 30-day rolling window
rolling_beta = PandasRollingOLS(y=S2, x=S1, window=30)
print(rolling_beta.beta.head())
            feature1
2016-04-12 -0.119287
2016-04-13 -0.063562
2016-04-14  0.018802
2016-04-15  0.094118
2016-04-18  0.151175
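If pyfinance isn't available, recent statsmodels (0.11+) ships a RollingOLS that produces the same kind of rolling hedge ratio. A minimal sketch under that assumption:

import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS

# One row of coefficients ('const', 'KDP') per window end date
model = RollingOLS(S2, sm.add_constant(S1), window=30).fit()
rolling_b = model.params['KDP']   # the rolling hedge ratio
spread_alt = S2 - rolling_b * S1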

Then we use this rolling OLS to compute the rolling spread and its moving averages:


spread = S2 - rolling_beta.beta['feature1'] * S1
spread.name = 'spread'

# 1-day moving average of the spread (i.e., the spread itself)
spread_mavg1 = spread.rolling(window=1).mean()
spread_mavg1.name = 'spread 1d mavg'

# 30-day moving average of the spread
spread_mavg30 = spread.rolling(window=30).mean()
spread_mavg30.name = 'spread 30d mavg'

plt.plot(spread_mavg1.index, spread_mavg1.values)
plt.plot(spread_mavg30.index, spread_mavg30.values)

plt.legend(['1 Day Spread MAVG', '30 Day Spread MAVG'])

plt.ylabel('Spread');

[figure: 1-day vs 30-day moving averages of the spread]

As you can see, the 30-day rolling average is much smoother, and it gives a far clearer read on when to act.

Compared with a single global mean, a rolling mean adapts as the market moves, so it tracks the shape of the curve much better.

With the 1-day and 30-day averages of the spread in hand, we can combine them into a single standard score, which shows at a glance when the price relationship hits an "extreme":

# Take a rolling 30 day standard deviation
std_30 = spread.rolling(window=30).std()
std_30.name = 'std 30d'

# Compute the z score for each day
zscore_30_1 = (spread_mavg1 - spread_mavg30)/std_30
zscore_30_1.name = 'z-score'
zscore_30_1.plot()
plt.axhline(0, color='black')
plt.axhline(1.0, color='red', linestyle='--')
plt.axhline(-1.0, color='green', linestyle='--');

[figure: rolling 30d/1d z-score with the +1/-1 bands]

To make the relationship between this signal and the actual stock prices clearer, we can draw all three lines on the same axes:

# Plot the prices scaled down along with the negative z-score
# just divide the stock prices by 10 to make viewing it on the plot easier
plt.plot(S1.index, S1.values/10)
plt.plot(S2.index, S2.values/10)
plt.plot(zscore_30_1.index, zscore_30_1.values)
plt.legend(['S1 Price / 10', 'S2 Price / 10', 'Price Spread Rolling z-Score']);

[figure: scaled prices of S1 and S2 plotted alongside the rolling z-score]

Out of Sample

What if we want to trade in a time window beyond the one we examined? Then we have to run the cointegration test again.

For example, let's pull S1 and S2 over a new time window and compare:

data_KDP = pd.read_csv('./data/daily_KDP.csv', index_col=[0], usecols=['timestamp', 'close'])
data_KDP.columns = ["KDP"]
data_MNST = pd.read_csv('./data/daily_MNST.csv', index_col=[0], usecols=['timestamp', 'close'])
data_MNST.columns = ["MNST"]
data = pd.concat([data_KDP, data_MNST], axis=1, sort=True)
data = data['2018-03-01':'2018-10-01']
data = data.fillna(method='ffill')
print(data)
               KDP   MNST
2018-03-01  116.44  54.22
2018-03-02  116.24  54.16
2018-03-05  115.91  55.74
2018-03-06  116.12  55.67
2018-03-07  116.30  55.55
...            ...    ...
2018-10-01   23.10  57.36

[149 rows x 2 columns]

S1 = data['KDP']
S2 = data['MNST']
score, pvalue, _ = coint(S1, S2)
print(pvalue)
0.5241888917528962

See, this value is clearly above our 0.05 threshold, so we can no longer use this pair for pairs trading.

At this point we should exit the strategy and go looking for new opportunities; one way to automate the check is sketched below.
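A hypothetical helper for that re-check (the lookback window and threshold are illustrative parameters): re-run the test on the most recent observations before each trading period, and keep trading only while it passes.

from statsmodels.tsa.stattools import coint

def still_cointegrated(s1, s2, lookback=120, alpha=0.05):
    # Re-test cointegration on the most recent `lookback` observations
    _, pvalue, _ = coint(s1.iloc[-lookback:], s2.iloc[-lookback:])
    return pvalue < alpha

if not still_cointegrated(S1, S2):
    print("Pair no longer cointegrated - exit and re-screen the candidate pool")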

Those who act often succeed; those who keep walking often arrive.