Computing PSI, a Metric for Evaluating Model Stability

2021-12-26 17:09:49

1. Basic concepts

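PSI (population stability index) measures how far the distribution of a validation sample across score bins has drifted from the distribution of the modeling sample. In modeling it is commonly used to screen feature variables and to evaluate model stability.
Stability is always relative to a reference, so two distributions are needed: the actual distribution and the expected distribution. In modeling, the training sample (in-sample, INS) usually serves as the expected distribution and the validation sample as the actual distribution.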

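To compare the two distributions, overlay them on the same axis. The procedure is as follows:

Train a model on the training data.
Score the training sample with the model; the distribution of these scores is the expected distribution.
Score the validation sample with the same model; the distribution of these scores is the actual distribution.
Cut both sets of scores into bins using the same cut points.
Compute the share of samples in each bin for each distribution, then compare the shares bin by bin.
The size of the difference reflects how stable the model is.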
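
The binning and comparison steps above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the scores, the cut points in `edges`, and the helper name `bin_shares` are made up here and are not taken from the source code discussed later.

```python
from bisect import bisect_right

def bin_shares(values, edges):
    """Return the share of `values` that falls into each bin defined by `edges`.

    Bins are [edges[i], edges[i+1]); the last bin also includes its right edge.
    Values outside the edge range are not counted in any bin.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue
        # bisect_right returns the index of the first edge greater than v
        idx = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]

# Made-up scores: expected sample (training) and actual sample (validation)
expected_scores = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
actual_scores = [0.10, 0.20, 0.30, 0.60, 0.62, 0.70, 0.80, 0.88, 0.90, 0.97]

edges = [0.0, 0.25, 0.5, 0.75, 1.0]  # the same cut points for both samples
expected_shares = bin_shares(expected_scores, edges)
actual_shares = bin_shares(actual_scores, edges)
print(expected_shares)
print(actual_shares)
```

The important design point is that both samples are binned with one shared set of cut points; otherwise the per-bin shares would not be comparable.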

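Discretize the expected distribution of the variable by binning.
Count the share of samples in each bin, for the expected and the actual distribution alike.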
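
Once the per-bin shares are known, PSI is the sum over all bins of (A_i - E_i) × ln(A_i / E_i), where A_i is the actual share and E_i the expected share in bin i. The sketch below works this out for made-up shares. The helper name `psi_from_shares` and the `eps` floor for empty bins are assumptions for illustration; the floor value 0.0001 mirrors the one used in the source code in the next section.

```python
import math

def psi_from_shares(expected_shares, actual_shares, eps=1e-4):
    """PSI from two lists of per-bin shares (same bins, same order)."""
    total = 0.0
    for e, a in zip(expected_shares, actual_shares):
        # Substitute a small share for empty bins so the logarithm is defined
        e = e if e > 0 else eps
        a = a if a > 0 else eps
        total += (a - e) * math.log(a / e)
    return total

expected_shares = [0.2, 0.3, 0.2, 0.3]  # made-up shares on the expected sample
actual_shares = [0.2, 0.1, 0.3, 0.4]    # made-up shares on the actual sample
psi_value = psi_from_shares(expected_shares, actual_shares)
print(round(psi_value, 4))  # roughly 0.289
```

Each term is non-negative, because the difference and the log ratio always have the same sign, so PSI is zero only when the two sets of shares are identical. Swapping the roles of the two samples gives the same value.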

PSI source code

https://gitee.com/pingfanrenbiji/population-stability-index-argo/blob/master/psi.py

import numpy as np

def calculate_psi(expected, actual, buckettype='bins', buckets=10, axis=0):
    '''Calculate the PSI (population stability index) across all variables

    Args:
       expected: numpy matrix of original values
       actual: numpy matrix of new values, same size as expected
       buckettype: type of strategy for creating buckets, bins splits into even splits, quantiles splits into quantile buckets
       buckets: number of quantiles to use in bucketing variables
       axis: axis by which variables are defined, 0 for vertical, 1 for horizontal

    Returns:
       psi_values: ndarray of psi values for each variable

    Author:
       Matthew Burke
       github.com/mwburke
       worksofchart.com
    '''

    def psi(expected_array, actual_array, buckets):
        '''Calculate the PSI for a single variable

        Args:
           expected_array: numpy array of original values
           actual_array: numpy array of new values, same size as expected
           buckets: number of percentile ranges to bucket the values into

        Returns:
           psi_value: calculated PSI value
        '''

        def scale_range(values, lower, upper):
            # Linearly rescale `values` in place so that they span [lower, upper]
            values += -(np.min(values))
            values /= np.max(values) / (upper - lower)
            values += lower
            return values

        breakpoints = np.arange(0, buckets + 1) / (buckets) * 100

        if buckettype == 'bins':
            breakpoints = scale_range(breakpoints, np.min(expected_array), np.max(expected_array))
        elif buckettype == 'quantiles':
            breakpoints = np.stack([np.percentile(expected_array, b) for b in breakpoints])

        expected_percents = np.histogram(expected_array, breakpoints)[0] / len(expected_array)
        actual_percents = np.histogram(actual_array, breakpoints)[0] / len(actual_array)

        def sub_psi(e_perc, a_perc):
            '''Calculate the actual PSI value from comparing the values.
               Update the actual value to a very small number if equal to zero
            '''
            if a_perc == 0:
                a_perc = 0.0001
            if e_perc == 0:
                e_perc = 0.0001

            value = (e_perc - a_perc) * np.log(e_perc / a_perc)
            return value

        # Use the built-in sum: passing a generator to np.sum is deprecated
        psi_value = sum(sub_psi(expected_percents[i], actual_percents[i]) for i in range(len(expected_percents)))

        return psi_value

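    if len(expected.shape) == 1:
        # A single variable: one PSI value
        psi_values = np.empty(len(expected.shape))
    else:
        # One PSI value per variable; with axis=0 the variables are the columns
        psi_values = np.empty(expected.shape[1 - axis])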

    for i in range(0, len(psi_values)):
        if len(psi_values) == 1:
            psi_values = psi(expected, actual, buckets)
        elif axis == 0:
            psi_values[i] = psi(expected[:,i], actual[:,i], buckets)
        elif axis == 1:
            psi_values[i] = psi(expected[i,:], actual[i,:], buckets)

    return(psi_values)
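
For a single variable, the same calculation can also be written in vectorized form. The following is a minimal sketch, not the author's implementation: the name `psi_1d`, the `eps` parameter, and the synthetic normal samples are assumptions. It uses quantile cut points from the expected sample and, unlike the `np.histogram` call above, counts actual values that fall outside the expected range in the outermost bins instead of dropping them.

```python
import numpy as np

def psi_1d(expected, actual, buckets=10, eps=1e-4):
    """Hypothetical helper: PSI of one variable with quantile bins of `expected`."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Inner cut points are quantiles of the expected sample only
    inner = np.percentile(expected, np.linspace(0, 100, buckets + 1)[1:-1])
    # searchsorted maps every value to a bin index 0..buckets-1; values below
    # the first or above the last cut point land in the outermost bins
    e_counts = np.bincount(np.searchsorted(inner, expected, side="right"), minlength=buckets)
    a_counts = np.bincount(np.searchsorted(inner, actual, side="right"), minlength=buckets)
    e_perc = e_counts / len(expected)
    a_perc = a_counts / len(actual)
    # Replace empty bins by a small share to avoid log(0) and division by zero
    e_perc = np.where(e_perc == 0, eps, e_perc)
    a_perc = np.where(a_perc == 0, eps, a_perc)
    return float(np.sum((a_perc - e_perc) * np.log(a_perc / e_perc)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5000)    # expected sample
same = rng.normal(0.0, 1.0, 5000)     # new sample from the same population
shifted = rng.normal(0.5, 1.0, 5000)  # new sample from a shifted population

psi_same = psi_1d(train, same)
psi_shifted = psi_1d(train, shifted)
print(psi_same, psi_shifted)
```

As a sanity check, the shifted sample should give a clearly larger PSI than the sample drawn from the same population.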

Applying PSI to your own business requirements

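With the material above, we now know what PSI is,
and we have a ready-made implementation of the algorithm.
The algorithm needs two inputs.
The first is the expected data, which comes from the training sample.

Take a loan dataset as an example,
where the label column records whether the borrower repaid on time.
The expected data is the set of scores the trained model assigns to the borrowers in the training sample.

The second is the actual data, which comes from new samples.

The actual data is the set of scores
that the same trained model assigns
to users who are about to apply for a loan, predicting whether they will repay.

All that remains is to pull these two sets of data from your own company's business systems and call the algorithm.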
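
To make the loan scenario concrete, here is a rough end-to-end sketch on synthetic data. Everything in it is an assumption for illustration: the `score` function with fixed weights stands in for a trained model, the applicant features are random numbers, and `psi_scores` is a hypothetical helper that uses equal-width bins on [0, 1]. In practice the two score arrays would come from your own model and business data.

```python
import numpy as np

def score(features, weights, bias):
    """Stand-in for a trained model: a logistic score in (0, 1). Hypothetical."""
    return 1.0 / (1.0 + np.exp(-(features @ weights + bias)))

def psi_scores(expected, actual, buckets=10, eps=1e-4):
    """PSI of two score samples using equal-width bins on [0, 1]."""
    edges = np.linspace(0.0, 1.0, buckets + 1)
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    # Replace empty bins by a small share so the logarithm is defined
    e = np.where(e == 0, eps, e)
    a = np.where(a == 0, eps, a)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(42)
weights, bias = np.array([0.8, -0.5]), 0.1  # made-up model parameters

# Made-up applicant features: training-time borrowers vs. new applicants,
# where the first feature of the new applicants has drifted upwards
train_features = rng.normal(0.0, 1.0, size=(4000, 2))
new_features = rng.normal([0.4, 0.0], 1.0, size=(4000, 2))

train_scores = score(train_features, weights, bias)  # expected
new_scores = score(new_features, weights, bias)      # actual
psi_value = psi_scores(train_scores, new_scores)
print(round(psi_value, 4))
```

Because both samples are scored by the same model, any PSI above zero here comes from the shift in the applicant population, which is exactly what the metric is meant to surface.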

Those who keep doing will succeed; those who keep walking will arrive.

Author: Corwien