Watermelon Book Exercise 3.3 (Logistic Regression, LR)

Implement logistic regression in code and report the results on the watermelon dataset.

The watermelon dataset is as follows:

    ID  density  Sugar_content  label
0    1    0.697         0.4600      1
1    2    0.774         0.3760      1
2    3    0.634         0.2640      1
3    4    0.608         0.3180      1
4    5    0.556         0.2150      1
5    6    0.403         0.2370      1
6    7    0.481         0.1490      1
7    8    0.437         0.2110      1
8    9    0.666         0.0910      0
9   10    0.243         0.0267      0
10  11    0.245         0.0570      0
11  12    0.343         0.0990      0
12  13    0.639         0.1610      0
13  14    0.657         0.1980      0
14  15    0.360         0.3700      0
15  16    0.593         0.0420      0
16  17    0.719         0.1030      0

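The key to this exercise is understanding logistic regression itself. The handwritten derivation of the formulas is attached:

[Two images: handwritten derivation of logistic regression]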
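
For readers without the images, the standard derivation of the gradient can be sketched as follows. This is a reconstruction in common notation, not a transcription of the handwritten pages. The symbols are assumptions: $\hat{x}_i = (1, x_{i1}, x_{i2})^\top$ is a sample with a leading 1, $w$ is the weight vector, $y_i \in \{0, 1\}$ is the label, $X$ stacks the $\hat{x}_i^\top$ as rows, and $\alpha$ is the step size.

```latex
\begin{aligned}
p_i &= \sigma(w^\top \hat{x}_i) = \frac{1}{1 + e^{-w^\top \hat{x}_i}},
\qquad \sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr) \\
\ell(w) &= \sum_{i=1}^{m} \bigl[\, y_i \ln p_i + (1 - y_i) \ln (1 - p_i) \,\bigr] \\
\nabla_w \ell(w) &= \sum_{i=1}^{m} (y_i - p_i)\, \hat{x}_i = X^\top (y - p) \\
w &\leftarrow w + \alpha\, X^\top (y - p)
\end{aligned}
```

The last line is the gradient ascent update: the log-likelihood $\ell(w)$ is maximized, so the step goes along the gradient rather than against it.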

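The derivation of LR yields the gradient formula. Next, LR is implemented with gradient ascent (Newton's method is another option; I'll add it later when I have time, qwq~).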


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

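# Read the data file
df = pd.read_csv('watermelon3.0alpha.csv')
print(df)
# Add a column of ones so the bias term folds into the matrix product
df['one'] = 1.0
# Design matrix of the training set, shape (m, 3)
train_X = df[['one','density','Sugar_content']].to_numpy()
# Labels as a column vector, shape (m, 1)
labels = df[['label']].to_numpy()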
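
If `watermelon3.0alpha.csv` is not at hand, equivalent arrays can be built directly from the table above. This is a minimal sketch using NumPy only; the names `rows`, `X_inline`, and `y_inline` are made up for this illustration and do not appear in the original code.

```python
import numpy as np

# (density, Sugar_content, label) for the 17 samples listed in the table above
rows = np.array([
    (0.697, 0.4600, 1), (0.774, 0.3760, 1), (0.634, 0.2640, 1),
    (0.608, 0.3180, 1), (0.556, 0.2150, 1), (0.403, 0.2370, 1),
    (0.481, 0.1490, 1), (0.437, 0.2110, 1), (0.666, 0.0910, 0),
    (0.243, 0.0267, 0), (0.245, 0.0570, 0), (0.343, 0.0990, 0),
    (0.639, 0.1610, 0), (0.657, 0.1980, 0), (0.360, 0.3700, 0),
    (0.593, 0.0420, 0), (0.719, 0.1030, 0),
])

# Prepend a column of ones for the bias term: columns are (one, density, Sugar_content)
X_inline = np.column_stack([np.ones(len(rows)), rows[:, :2]])
# Labels as a column vector of shape (17, 1)
y_inline = rows[:, 2:3]

print(X_inline.shape, y_inline.shape)
```

These two arrays can stand in for `train_X` and `labels` in the code that follows.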


# Sigmoid function
def sigmoid(z):
    return 1.0/(1.0 + np.exp(-z))

# Gradient ascent
def grad(train_X,labels,iters = 2000):
    m,n = train_X.shape
    # Step size alpha
    alpha = 0.05
    # Initialize all weights to 1
    weights = np.ones((n,1))

    # Run iters iterations (2000 by default)
    for k in range(iters):
        # Step along the gradient direction and update the weights
        P = sigmoid(train_X.dot(weights))
        error = labels - P
        weights += alpha * np.dot(train_X.T,error)

    return weights

# Fit the regression parameters
weights = grad(train_X,labels)
print(weights)

The resulting parameters are:


[[-3.12066518]
 [ 0.76966008]
 [13.22972573]]
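
The Newton's method variant mentioned earlier is not implemented in this post. For comparison, here is a minimal sketch of what it could look like. It is not the author's code: the helper name `newton_lr`, the zero initialization, and the fixed 20 iterations are all assumptions, and the data is inlined from the table so the snippet runs on its own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_lr(X, y, iters=20):
    """Maximize the log-likelihood with Newton's method (hypothetical helper)."""
    w = np.zeros((X.shape[1], 1))
    for _ in range(iters):
        p = sigmoid(X @ w)                    # predicted probabilities, shape (m, 1)
        g = X.T @ (y - p)                     # gradient of the log-likelihood
        H = -(X.T * (p * (1.0 - p)).T) @ X    # Hessian: -X^T diag(p(1-p)) X
        w -= np.linalg.solve(H, g)            # Newton step: w <- w - H^{-1} g
    return w

# Samples from the table above
density = np.array([0.697, 0.774, 0.634, 0.608, 0.556, 0.403, 0.481, 0.437, 0.666,
                    0.243, 0.245, 0.343, 0.639, 0.657, 0.360, 0.593, 0.719])
sugar = np.array([0.460, 0.376, 0.264, 0.318, 0.215, 0.237, 0.149, 0.211, 0.091,
                  0.0267, 0.057, 0.099, 0.161, 0.198, 0.370, 0.042, 0.103])
y = np.array([1.0] * 8 + [0.0] * 9).reshape(-1, 1)
X = np.column_stack([np.ones_like(density), density, sugar])

w_newton = newton_lr(X, y)
# A gradient norm near zero indicates the iteration has reached a stationary point
grad_norm = np.linalg.norm(X.T @ (y - sigmoid(X @ w_newton)))
print(w_newton.ravel(), grad_norm)
```

Newton's method uses curvature information, so it typically needs far fewer iterations than fixed-step gradient ascent, at the cost of solving a small linear system per step.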

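# Plotting
# Four groups of points, split by true label and by whether the prediction is correct
x1,y1 = [],[]   # label 1, predicted 1 (correct)
x2,y2 = [],[]   # label 1, predicted 0 (misclassified)
x3,y3 = [],[]   # label 0, predicted 0 (correct)
x4,y4 = [],[]   # label 0, predicted 1 (misclassified)

for k in range(train_X.shape[0]):
    if labels[k] == 1:
        if sigmoid(np.dot(train_X[k,:],weights)) >= 0.5 :
            x1.append(train_X[k,1])
            y1.append(train_X[k,2])
        else:
            x2.append(train_X[k,1])
            y2.append(train_X[k,2])
    else:
        if sigmoid(np.dot(train_X[k,:],weights)) < 0.5 :
            x3.append(train_X[k,1])
            y3.append(train_X[k,2])
        else:
            x4.append(train_X[k,1])
            y4.append(train_X[k,2])

# Red = label 1, green = label 0; dots are classified correctly, crosses are misclassified
plt.scatter(x1,y1,s=30,c='red')
plt.scatter(x2,y2,s=30,c='red',marker='x')
plt.scatter(x3,y3,s=30,c='green')
plt.scatter(x4,y4,s=30,c='green',marker='x')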

# Draw the decision boundary  w0 + w1*x1 + w2*x2 = 0
X = np.arange(0,0.8,0.01)
Y = -(weights[0] + weights[1] * X)/weights[2]

# Note: use plot for lines and scatter for points
plt.plot(X,Y)

plt.xlabel('Density')
plt.ylabel('Sugar_Content')
plt.title("LogisticRegression")
plt.show()
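
The four marker groups in the plot amount to a confusion-matrix count. Here is a minimal sketch that tallies them numerically, assuming the parameters printed above and a 0.5 threshold. The names `w_reported`, `pred`, and `accuracy` are introduced for this illustration only, and the data is inlined from the table so the snippet runs on its own.

```python
import numpy as np

# Samples from the table above
density = np.array([0.697, 0.774, 0.634, 0.608, 0.556, 0.403, 0.481, 0.437, 0.666,
                    0.243, 0.245, 0.343, 0.639, 0.657, 0.360, 0.593, 0.719])
sugar = np.array([0.460, 0.376, 0.264, 0.318, 0.215, 0.237, 0.149, 0.211, 0.091,
                  0.0267, 0.057, 0.099, 0.161, 0.198, 0.370, 0.042, 0.103])
y = np.array([1] * 8 + [0] * 9)

# Parameters as printed in the post: (bias, density weight, Sugar_content weight)
w_reported = np.array([-3.12066518, 0.76966008, 13.22972573])

X = np.column_stack([np.ones_like(density), density, sugar])
prob = 1.0 / (1.0 + np.exp(-(X @ w_reported)))
pred = (prob >= 0.5).astype(int)

tp = int(np.sum((pred == 1) & (y == 1)))  # red dots: label 1, predicted 1
fn = int(np.sum((pred == 0) & (y == 1)))  # red crosses: label 1, predicted 0
tn = int(np.sum((pred == 0) & (y == 0)))  # green dots: label 0, predicted 0
fp = int(np.sum((pred == 1) & (y == 0)))  # green crosses: label 0, predicted 1

accuracy = (tp + tn) / len(y)
print(f"TP={tp} FN={fn} TN={tn} FP={fp} training accuracy={accuracy:.3f}")
```

This is accuracy on the training set itself, so it says nothing about how the model would generalize to unseen melons.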

[Figure: scatter plot of the samples (Density vs. Sugar_Content) with the fitted decision boundary]