Python: Analyzing Student Data Lab (69)

2019-03-28 23:31:38

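Lab: Analyzing Student Data

Now we are ready to put neural networks into practice. We will analyze the following graduate admissions data from the University of California, Los Angeles (UCLA).

In this notebook, you will carry out some of the steps of training a neural network, namely:

One-hot encoding the data
Scaling the data
Writing the backpropagation step

Predicting student admissions with a neural network

In this notebook, we predict graduate admissions at UCLA from the following three pieces of data:

GRE Scores (Test)
GPA Scores (Grades)
Class rank (1-4)

Dataset source: http://www.ats.ucla.edu/

Loading the data

To load the data and format it nicely, we will use two very useful packages, Pandas and NumPy. See each package's official documentation for details.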

# Importing pandas and numpy
import pandas as pd
import numpy as np

# Reading the csv file into a pandas DataFrame
data = pd.read_csv('student_data.csv')

# Printing out the first 10 rows of our data
data[:10]

	admit	gre	gpa	rank
0	0	380	3.61	3
1	1	660	3.67	3
2	1	800	4.00	1
3	1	640	3.19	4
4	0	520	2.93	4
5	1	760	3.00	2
6	1	560	2.98	1
7	0	400	3.08	2
8	1	540	3.39	3
9	0	700	3.92	2
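
As a quick sanity check after loading, it can help to confirm the column types and look for missing values. The sketch below rebuilds the ten rows shown above from an in-memory string so that it runs without `student_data.csv`; the names `csv_text` and `sample` are illustrative and not part of the original notebook.

```python
import io
import pandas as pd

# The first ten rows printed above, inlined so the example needs no file on disk
csv_text = """admit,gre,gpa,rank
0,380,3.61,3
1,660,3.67,3
1,800,4.00,1
1,640,3.19,4
0,520,2.93,4
1,760,3.00,2
1,560,2.98,1
0,400,3.08,2
1,540,3.39,3
0,700,3.92,2
"""
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.dtypes)          # data type inferred for each column
print(sample.isnull().sum())  # number of missing values per column
print(sample.shape)           # (rows, columns)
```

On the real file, the same three calls can be applied directly to `data`.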

Plotting the data

First, let's plot the data to see what it looks like. To get a 2D plot, we ignore the rank for now.

# Importing matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
# Function to help us plot
def plot_points(data):
    X = np.array(data[["gre","gpa"]])
    y = np.array(data["admit"])
    # Boolean masks select the rows belonging to each class
    admitted = X[y == 1]
    rejected = X[y == 0]
    plt.scatter(rejected[:, 0], rejected[:, 1], s = 25, color = 'red', edgecolor = 'k')
    plt.scatter(admitted[:, 0], admitted[:, 1], s = 25, color = 'cyan', edgecolor = 'k')
    plt.xlabel('Test (GRE)')
    plt.ylabel('Grades (GPA)')

# Plotting the points
plot_points(data)
plt.show()

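[Figure: scatter plot of Grades (GPA) against Test (GRE); admitted students in cyan, rejected students in red]

Roughly speaking, students with high grades and test scores were admitted and those with low scores were not, but the data is not as cleanly separated as we would like. Perhaps taking the rank into account would help? Next we draw 4 plots, one per rank.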

# Separating the ranks
data_rank1 = data[data["rank"]==1]
data_rank2 = data[data["rank"]==2]
data_rank3 = data[data["rank"]==3]
data_rank4 = data[data["rank"]==4]

print(data_rank1.head(5))
print(data_rank2.head(5))

# Plotting the graphs
plot_points(data_rank1)
plt.title("Rank 1")
plt.show()
plot_points(data_rank2)
plt.title("Rank 2")
plt.show()
plot_points(data_rank3)
plt.title("Rank 3")
plt.show()
plot_points(data_rank4)
plt.title("Rank 4")
plt.show()

    admit  gre   gpa  rank
2       1  800  4.00     1
6       1  560  2.98     1
11      0  440  3.22     1
12      1  760  4.00     1
14      1  700  4.00     1
    admit  gre   gpa  rank
5       1  760  3.00     2
7       0  400  3.08     2
9       0  700  3.92     2
13      0  700  3.08     2
18      0  800  3.75     2

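[Figure: four scatter plots of Grades (GPA) against Test (GRE), titled Rank 1 to Rank 4]

This looks better: the lower the rank number, the higher the acceptance rate appears to be. Let's use rank as one of our inputs. To do that, we should one-hot encode it.

One-hot encoding the rank

We will use the get_dummies function in Pandas.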
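
The visual impression can also be checked numerically by computing the admission rate per rank. The sketch below is only an illustration on the ten rows printed earlier (held in a hypothetical `sample` frame), so its numbers are noisy and do not represent the full dataset; on the real data the same expression would be `data.groupby("rank")["admit"].mean()`.

```python
import pandas as pd

# Hypothetical mini-dataset: the admit and rank columns of the first ten rows
sample = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0, 1, 1, 0, 1, 0],
    "rank":  [3, 3, 1, 4, 4, 2, 1, 2, 3, 2],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the admission rate
admit_rate = sample.groupby("rank")["admit"].mean()
print(admit_rate)
```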

# TODO:  Make dummy variables for rank
# Concatenate the tables side by side (axis=1, rows aligned on the index); see https://blog.csdn.net/mr_hhh/article/details/79488445
one_hot_data = pd.concat([data, pd.get_dummies(data['rank'], prefix='rank')], axis=1)
print(one_hot_data.head(5))

# TODO: Drop the previous rank column
one_hot_data = one_hot_data.drop('rank', axis=1)

# Print the first 10 rows of our data
one_hot_data[:10]

   admit  gre   gpa  rank  rank_1  rank_2  rank_3  rank_4
0      0  380  3.61     3       0       0       1       0
1      1  660  3.67     3       0       0       1       0
2      1  800  4.00     1       1       0       0       0
3      1  640  3.19     4       0       0       0       1
4      0  520  2.93     4       0       0       0       1

	admit	gre	gpa	rank_1	rank_2	rank_3	rank_4
0	0	380	3.61	0	0	1	0
1	1	660	3.67	0	0	1	0
2	1	800	4.00	1	0	0	0
3	1	640	3.19	0	0	0	1
4	0	520	2.93	0	0	0	1
5	1	760	3.00	0	1	0	0
6	1	560	2.98	1	0	0	0
7	0	400	3.08	0	1	0	0
8	1	540	3.39	0	0	1	0
9	0	700	3.92	0	1	0	0
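
As an alternative sketch, `pd.get_dummies` can encode a column and drop the original in one call through its `columns` parameter. The `sample` frame below is a hypothetical stand-in built from the first six rows, and `dtype=int` is passed because recent pandas releases return boolean dummy columns by default, whereas the output above shows 0/1 integers.

```python
import pandas as pd

# Hypothetical stand-in for the data: the first six rows shown earlier
sample = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0, 1],
    "gre":   [380, 660, 800, 640, 520, 760],
    "gpa":   [3.61, 3.67, 4.00, 3.19, 2.93, 3.00],
    "rank":  [3, 3, 1, 4, 4, 2],
})

# Encode "rank" and remove the original column in a single step;
# the new columns are named rank_1 ... rank_4 after the source column
encoded = pd.get_dummies(sample, columns=["rank"], dtype=int)
print(encoded)
```

Each row has exactly one 1 among the `rank_*` columns, which is what makes the encoding "one-hot".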

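Scaling the data

The next step is to scale the data. Grades (gpa) range over 1.0-4.0, whereas test scores (gre) range over roughly 200-800, which is much larger. Features on such different scales make the data hard for a neural network to handle. Let's fit both features into the 0-1 range by dividing the grades by 4.0 and the test scores by 800.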

# Making a copy of our data
# (.copy() creates an independent DataFrame; a slice such as one_hot_data[:] does not)
processed_data = one_hot_data.copy()

# TODO: Scale the columns
processed_data['gre'] = processed_data['gre'] / 800
processed_data['gpa'] = processed_data['gpa'] / 4.0
# Printing the first 10 rows of our processed data
processed_data[:10]

	admit	gre	gpa	rank_1	rank_2	rank_3	rank_4
0	0	0.475	0.9025	0	0	1	0
1	1	0.825	0.9175	0	0	1	0
2	1	1.000	1.0000	1	0	0	0
3	1	0.800	0.7975	0	0	0	1
4	0	0.650	0.7325	0	0	0	1
5	1	0.950	0.7500	0	1	0	0
6	1	0.700	0.7450	1	0	0	0
7	0	0.500	0.7700	0	1	0	0
8	1	0.675	0.8475	0	0	1	0
9	0	0.875	0.9800	0	1	0	0
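
The same scaling can be wrapped in a small helper that takes the per-column maxima as a dictionary and leaves the input frame untouched. This is a minimal sketch: the function name `scale_by_max` and the three-row `sample` frame are hypothetical, and the maxima 800 and 4.0 are the ones assumed in the text above.

```python
import pandas as pd

def scale_by_max(frame, max_values):
    """Return a copy of frame with each listed column divided by its maximum."""
    scaled = frame.copy()
    for column, max_value in max_values.items():
        scaled[column] = scaled[column] / max_value
    return scaled

# Hypothetical three-row sample using values from the table shown earlier
sample = pd.DataFrame({"gre": [380, 660, 800], "gpa": [3.61, 3.67, 4.00]})
scaled = scale_by_max(sample, {"gre": 800, "gpa": 4.0})

print(scaled)   # values now lie in the 0-1 range
print(sample)   # the original frame is unchanged
```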

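Splitting the data into training and test sets

To test our algorithm, we split the data into a training set and a test set. The test set is 10% of the total data.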

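# Draw 90% of the index labels at random, without replacement
sample = np.random.choice(processed_data.index, size=int(len(processed_data)*0.9), replace=False)
# Select the training rows by label (.loc) so the selection matches drop(), which also works on labels
train_data, test_data = processed_data.loc[sample], processed_data.drop(sample)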

print("Number of training samples is", len(train_data))
print("Number of testing samples is", len(test_data))
print(train_data[:10])
print(test_data[:10])

Number of training samples is 360
Number of testing samples is 40
     admit    gre     gpa  rank_1  rank_2  rank_3  rank_4
353      0  0.875  0.8800       0       1       0       0
52       0  0.925  0.8425       0       0       0       1
369      0  1.000  0.9725       0       1       0       0
362      0  0.850  0.7850       0       1       0       0
46       1  0.725  0.8650       0       1       0       0
121      1  0.600  0.6675       0       1       0       0
101      0  0.725  0.8925       0       0       1       0
202      1  0.875  1.0000       1       0       0       0
151      0  0.500  0.8450       0       1       0       0
358      1  0.700  0.9225       0       0       1       0
    admit    gre     gpa  rank_1  rank_2  rank_3  rank_4
5       1  0.950  0.7500       0       1       0       0
10      0  1.000  1.0000       0       0       0       1
11      0  0.550  0.8050       1       0       0       0
17      0  0.450  0.6400       0       0       1       0
24      1  0.950  0.8375       0       1       0       0
33      1  1.000  1.0000       0       0       1       0
37      0  0.650  0.7250       0       0       1       0
57      0  0.475  0.7350       0       0       1       0
69      0  1.000  0.9325       1       0       0       0
76      0  0.700  0.8400       0       0       1       0
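
Because the split above is random, each run produces different rows. One way to make it repeatable is `DataFrame.sample` with a fixed `random_state`. The sketch below uses a hypothetical 20-row `frame` in place of `processed_data`, and the seed 42 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for processed_data: 20 rows with a default integer index
frame = pd.DataFrame({
    "admit": np.arange(20) % 2,
    "gre": np.linspace(0.3, 1.0, 20),
})

# Draw 90% of the rows without replacement; random_state makes the split repeatable
train = frame.sample(frac=0.9, random_state=42)
# Every row that was not drawn becomes the test set
test = frame.drop(train.index)

print(len(train), len(test))
```

Dropping by `train.index` guarantees the two sets are disjoint and together cover every row.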

Splitting the data into features and targets (labels)

Now, as the final step before training, we split the data into features (X) and targets (y).

features = train_data.drop('admit', axis=1)
targets = train_data['admit']
features_test = test_data.drop('admit', axis=1)
targets_test = test_data['admit']

print(features[:10])
print(targets[:10])

       gre     gpa  rank_1  rank_2  rank_3  rank_4
353  0.875  0.8800       0       1       0       0
52   0.925  0.8425       0       0       0       1
369  1.000  0.9725       0       1       0       0
362  0.850  0.7850       0       1       0       0
46   0.725  0.8650       0       1       0       0
121  0.600  0.6675       0       1       0       0
101  0.725  0.8925       0       0       1       0
202  0.875  1.0000       1       0       0       0
151  0.500  0.8450       0       1       0       0
358  0.700  0.9225       0       0       1       0
353    0
52     0
369    0
362    0
46     1
121    1
101    0
202    1
151    0
358    1
Name: admit, dtype: int64

Training the 2-layer neural network

The following function trains the 2-layer neural network. First, we write some helper functions.

# Activation (sigmoid) function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
def sigmoid_prime(x):
    return sigmoid(x) * (1-sigmoid(x))
def error_formula(y, output):
    return - y*np.log(output) - (1 - y) * np.log(1-output)
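
A quick way to gain confidence in these helpers is to compare `sigmoid_prime` against a numerical derivative and to look at how `error_formula` behaves for good and bad predictions. The sketch below repeats the three helpers so that it runs on its own; the test points and step size `eps` are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

def error_formula(y, output):
    return -y * np.log(output) - (1 - y) * np.log(1 - output)

x = np.linspace(-5, 5, 11)
eps = 1e-6
# A central finite difference approximates the derivative of sigmoid
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(np.allclose(numeric, sigmoid_prime(x), atol=1e-8))

# Cross-entropy is small for a confident correct prediction
# and large for a confident wrong one (target y = 1 in both cases)
print(error_formula(1, 0.9), error_formula(1, 0.1))
```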

Backpropagating the error

Now it's your turn to practice: write the error term. Remember that it is given by the equation $$ (y-\hat{y}) $$.

# TODO: Write the error term formula
def error_term_formula(y, output):
    return (y-output)
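
For readers wondering why this error term carries no sigmoid-derivative factor, here is a short derivation. It assumes the error is the cross-entropy from error_formula and that the output is $\hat{y} = \sigma(h)$ with $h = \sum_i w_i x_i$; the symbols $E$, $h$, $w_i$ and $\eta$ (the learning rate) are introduced here for illustration.

$$ E = -y \ln \hat{y} - (1 - y) \ln(1 - \hat{y}) $$

$$ \frac{\partial E}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} $$

$$ \frac{\partial \hat{y}}{\partial h} = \sigma'(h) = \hat{y}(1 - \hat{y}) $$

$$ \frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial \hat{y}} \, \frac{\partial \hat{y}}{\partial h} \, \frac{\partial h}{\partial w_i} = -(y - \hat{y}) \, x_i $$

The factor $\hat{y}(1 - \hat{y})$ cancels, so a gradient descent step is $w_i \leftarrow w_i + \eta \, (y - \hat{y}) \, x_i$, which matches the error term above.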

# Neural Network hyperparameters
epochs = 1000
learnrate = 0.5

# Training function
def train_nn(features, targets, epochs, learnrate):

    # Use the same seed to make debugging easier
    np.random.seed(42)

    n_records, n_features = features.shape
    last_loss = None

    # Initialize weights
    weights = np.random.normal(scale=1 / n_features**.5, size=n_features)

    for e in range(epochs):
        del_w = np.zeros(weights.shape)
        for x, y in zip(features.values, targets):
            # Loop through all records, x is the input, y is the target

            # Activation of the output unit
            #   Notice we multiply the inputs and the weights here 
            #   rather than storing h as a separate variable 
            output = sigmoid(np.dot(x, weights))

            # The error: cross-entropy between the target and the network output
            #   (computed for reference; the weight update below does not use it)
            error = error_formula(y, output)

            # The error term
            #   With the cross-entropy error and a sigmoid output, the
            #   sigmoid-derivative factor cancels, leaving (y - output)
            error_term = error_term_formula(y, output)

            # The gradient descent step: the error term times the inputs
            del_w += error_term * x

        # Update the weights here. The learning rate times the 
        # change in weights, divided by the number of records to average
        weights += learnrate * del_w / n_records

        # Printing out the mean squared error on the training set
        # (a progress indicator only; the updates follow the cross-entropy gradient)
        if e % (epochs / 10) == 0:
            out = sigmoid(np.dot(features, weights))
            loss = np.mean((out - targets) ** 2)
            print("Epoch:", e)
            if last_loss and last_loss < loss:
                print("Train loss: ", loss, "  WARNING - Loss Increasing")
            else:
                print("Train loss: ", loss)
            last_loss = loss
            print("=========")
    print("Finished training!")
    return weights

weights = train_nn(features, targets, epochs, learnrate)

Epoch: 0
Train loss:  0.259480332669
=========
Epoch: 100
Train loss:  0.207651606773
=========
Epoch: 200
Train loss:  0.206908198898
=========
Epoch: 300
Train loss:  0.20638022849
=========
Epoch: 400
Train loss:  0.20590160288
=========
Epoch: 500
Train loss:  0.205462049248
=========
Epoch: 600
Train loss:  0.205056408521
=========
Epoch: 700
Train loss:  0.204680361936
=========
Epoch: 800
Train loss:  0.204330297852
=========
Epoch: 900
Train loss:  0.204003215114
=========
Finished training!
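
The inner Python loop over records can be replaced by a single matrix product. The sketch below checks, on small random data, that the vectorized weight update equals the loop version; the arrays `X`, `y` and `weights` are hypothetical stand-ins and not the admissions data.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
X = rng.random((8, 3))            # hypothetical feature matrix: 8 records, 3 features
y = rng.integers(0, 2, size=8)    # hypothetical 0/1 targets
weights = rng.normal(scale=1 / 3**0.5, size=3)

# Loop version, mirroring the inner loop of train_nn
del_w_loop = np.zeros(weights.shape)
for x_i, y_i in zip(X, y):
    output = sigmoid(np.dot(x_i, weights))
    del_w_loop += (y_i - output) * x_i

# Vectorized version: one matrix product replaces the Python loop
outputs = sigmoid(X @ weights)
del_w_vec = (y - outputs) @ X

print(np.allclose(del_w_loop, del_w_vec))
```

With only a few hundred records the loop is fast enough, so vectorizing is mainly worthwhile on larger datasets.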

Calculating accuracy on the test data

# Calculate accuracy on test data
test_out = sigmoid(np.dot(features_test, weights))
predictions = test_out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))

Prediction accuracy: 0.700
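
An accuracy figure is easier to interpret next to a baseline, such as always predicting the most common class. The sketch below shows that comparison on made-up numbers: `targets` and `probabilities` are hypothetical and are not outputs of the model trained above.

```python
import numpy as np

# Hypothetical targets and predicted probabilities, for illustration only
targets = np.array([1, 0, 0, 0, 1, 0, 0, 1, 0, 0])
probabilities = np.array([0.7, 0.2, 0.4, 0.6, 0.3, 0.1, 0.45, 0.8, 0.35, 0.2])

# Same thresholding rule as in the test-set evaluation
predictions = probabilities > 0.5
accuracy = np.mean(predictions == targets)

# Accuracy obtained by always predicting the most common class
majority_class = int(np.mean(targets) >= 0.5)
baseline = np.mean(targets == majority_class)

print("Model accuracy: {:.3f}".format(accuracy))
print("Majority-class baseline: {:.3f}".format(baseline))
```

If the model's accuracy is close to the baseline, the features add little beyond the class frequencies, which is worth checking on the real test set.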


Author: Corwien
