Logistic Regression: Handling Missing Values
In a machine learning project, data preprocessing is a crucial step, and handling missing values is an essential part of it. Using the Titanic passenger dataset as an example, this article walks through how to handle missing values when building a logistic regression model.
Identifying Missing Values
First, let's check which columns in the dataset are missing more than half of their values:
df.isnull().sum() > (len(df) / 2)
The output is as follows:
PassengerId False
Survived False
Pclass False
Gender False
Age False
SibSp False
Parch False
Fare False
Cabin True
Embarked False
dtype: bool
The result shows that more than half of the values in the "Cabin" column are missing, so we drop it from the DataFrame:
df.drop('Cabin', axis=1, inplace=True)
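The same half-missing threshold can also be applied generically with `dropna`. Below is a minimal sketch on a toy frame; the column names are hypothetical, not from the Titanic data:

```python
import numpy as np
import pandas as pd

# Toy frame: 'mostly_missing' has more than half NaN, like 'Cabin'
toy = pd.DataFrame({
    'kept': [1, 2, 3, 4],
    'mostly_missing': [np.nan, np.nan, np.nan, 4.0],
})

# Keep only columns with at least len/2 non-null values
toy = toy.dropna(axis=1, thresh=int(len(toy) / 2))
print(list(toy.columns))  # ['kept']
```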
Handling the Remaining Missing Values
Next, let's check whether any columns have just a few null values:
df.info()
The output is as follows:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Gender 891 non-null object
4 Age 714 non-null float64
5 SibSp 891 non-null int64
6 Parch 891 non-null int64
7 Fare 891 non-null float64
8 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 62.8+ KB
The result shows that the "Age" and "Embarked" columns have a small number of missing values. Let's first count the missing values in "Age":
df['Age'].isnull().value_counts()
The output is as follows:
False 714
True 177
Name: Age, dtype: int64
Unlike "Cabin", "Age" is a variable we ultimately want to use in our model, so we need to deal with its missing values. Before settling on a strategy, we should check whether the median age is the same for men and women:
df.groupby('Gender')['Age'].median().plot(kind='bar')
The output is a bar chart of the median age for each gender.
The chart shows that the median ages of men and women aboard the Titanic differ, which means we should handle the missing values accordingly. A reasonable strategy is to replace each passenger's missing age with the median age of passengers of the same gender:
df['Age'] = df.groupby('Gender')['Age'].transform(lambda x: x.fillna(x.median()))
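On a toy frame, this group-wise median imputation looks like the following (`transform` keeps the result aligned with the original index, which makes the assignment safe):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Gender': ['male', 'male', 'female', 'female', 'female'],
    'Age':    [20.0, np.nan, 30.0, 40.0, np.nan],
})

# Fill each missing Age with the median Age of that passenger's gender
toy['Age'] = toy.groupby('Gender')['Age'].transform(lambda x: x.fillna(x.median()))
print(toy['Age'].tolist())  # [20.0, 20.0, 30.0, 40.0, 35.0]
```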
Handling Missing Values in the "Embarked" Column
Next, let's look at the missing values in the "Embarked" column:
df['Embarked'].value_counts()
The output is as follows:
S 644
C 168
Q 77
Name: Embarked, dtype: int64
Since the vast majority of Titanic passengers embarked at Southampton, we can fill the two missing values with the statistically most likely value (the mode), namely "S" for Southampton:
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].value_counts().idxmax())
df['Embarked'].value_counts()
The output is as follows:
S 646
C 168
Q 77
Name: Embarked, dtype: int64
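The `value_counts().idxmax()` expression simply picks the most frequent category (the mode); a minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

ports = pd.Series(['S', 'C', 'S', np.nan, 'Q', 'S'])

most_common = ports.value_counts().idxmax()  # equivalent to ports.mode()[0]
ports = ports.fillna(most_common)
print(most_common, int(ports.isnull().sum()))  # S 0
```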
Converting Categorical Data
Next, use the get_dummies method to convert the categorical data in the "Gender" and "Embarked" columns into indicator variables:
df = pd.get_dummies(data=df, columns=['Gender', 'Embarked'], drop_first=True)
df.head()
The output is as follows:
|   | PassengerId | Survived | Pclass | Age  | SibSp | Parch | Fare    | Gender_male | Embarked_Q | Embarked_S |
|---|-------------|----------|--------|------|-------|-------|---------|-------------|------------|------------|
| 0 | 1           | 0        | 3      | 22.0 | 1     | 0     | 7.2500  | 1           | 0          | 1          |
| 1 | 2           | 1        | 1      | 38.0 | 1     | 0     | 71.2833 | 0           | 0          | 0          |
| 2 | 3           | 0        | 3      | 26.0 | 0     | 0     | 7.9250  | 0           | 0          | 1          |
| 3 | 4           | 0        | 1      | 35.0 | 1     | 0     | 53.1000 | 0           | 0          | 1          |
| 4 | 5           | 0        | 3      | 35.0 | 0     | 0     | 8.0500  | 1           | 0          | 1          |
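A small sketch of what `get_dummies` with `drop_first=True` does on toy data (the first level of each column is dropped so the remaining indicator columns are not perfectly collinear):

```python
import pandas as pd

toy = pd.DataFrame({
    'Gender': ['male', 'female', 'male'],
    'Embarked': ['S', 'Q', 'C'],
})

# drop_first=True removes the first level of each column
# ('female' and 'C' here), avoiding the dummy variable trap
dummies = pd.get_dummies(data=toy, columns=['Gender', 'Embarked'], drop_first=True)
print(list(dummies.columns))  # ['Gender_male', 'Embarked_Q', 'Embarked_S']
```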
A Final Check of the Correlation Matrix
Let's take one last look at the correlation matrix to see whether anything else should be dropped:
df.corr()
The output is as follows:
|             | PassengerId | Survived  | Pclass    | Age       | SibSp     | Parch     | Fare      | Gender_male | Embarked_Q | Embarked_S |
|-------------|-------------|-----------|-----------|-----------|-----------|-----------|-----------|-------------|------------|------------|
| PassengerId | 1.000000    | -0.005007 | -0.035144 | 0.035734  | -0.057527 | -0.001652 | 0.012658  | 0.042939    | -0.033606  | 0.022204   |
| Survived    | -0.005007   | 1.000000  | -0.338481 | -0.073296 | -0.035322 | 0.081629  | 0.257307  | -0.543351   | 0.003650   | -0.149683  |
| Pclass      | -0.035144   | -0.338481 | 1.000000  | -0.338056 | 0.083081  | 0.018443  | -0.549500 | 0.131900    | 0.221009   | 0.074053   |
| Age         | 0.035734    | -0.073296 | -0.338056 | 1.000000  | -0.236376 | -0.176038 | 0.094161  | 0.095256    | -0.032994  | -0.005855  |
| SibSp       | -0.057527   | -0.035322 | 0.083081  | -0.236376 | 1.000000  | 0.414838  | 0.159651  | -0.114631   | -0.026354  | 0.068734   |
| Parch       | -0.001652   | 0.081629  | 0.018443  | -0.176038 | 0.414838  | 1.000000  | 0.216225  | -0.245489   | -0.081228  | 0.060814   |
| Fare        | 0.012658    | 0.257307  | -0.549500 | 0.094161  | 0.159651  | 0.216225  | 1.000000  | -0.182333   | -0.117216  | -0.162184  |
| Gender_male | 0.042939    | -0.543351 | 0.131900  | 0.095256  | -0.114631 | -0.245489 | -0.182333 | 1.000000    | -0.074115  | 0.119224   |
| Embarked_Q  | -0.033606   | 0.003650  | 0.221009  | -0.032994 | -0.026354 | -0.081228 | -0.117216 | -0.074115   | 1.000000   | -0.499421  |
| Embarked_S  | 0.022204    | -0.149683 | 0.074053  | -0.005855 | 0.068734  | 0.060814  | -0.162184 | 0.119224    | -0.499421  | 1.000000   |
The correlation matrix shows that "Pclass" and "Fare" are fairly strongly correlated (about -0.55), so we can drop one of them. We also need to remove "Survived" from the X DataFrame, since it will become our y response variable:
X = df.drop(['Survived', 'Pclass'], axis=1)
y = df['Survived']
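The reasoning behind dropping one of a highly correlated pair can be sketched on toy data (the column names and the 0.5 threshold here are purely illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],  # perfectly correlated with 'a'
    'c': [5, 3, 4, 1, 2],
})

corr = toy.corr()
# Drop 'b' if its correlation with 'a' exceeds the illustrative threshold
if abs(corr.loc['a', 'b']) > 0.5:
    toy = toy.drop('b', axis=1)
print(list(toy.columns))  # ['a', 'c']
```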
Try It Yourself
Now we need to split the data into training and test sets, which is left as an exercise:
from sklearn.model_selection import train_test_split
See how we used train_test_split in the Linear Regression: Fitting the Model unit. Set test_size=0.3 and random_state=67 so that running the rest of the code examples gives the same results shown here.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=67)
Now, import and fit the logistic regression model:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
The output is as follows:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
Then, save the predictions so they can be compared against the y_test values (also known as the ground-truth outputs):
predictions = lr.predict(X_test)
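To actually compare `predictions` with `y_test`, scikit-learn's metrics module can be used. The sketch below runs the same split/fit/predict flow end to end on synthetic data, since the prepared Titanic frame itself is not reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared Titanic features
X, y = make_classification(n_samples=500, n_features=8, random_state=67)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=67)

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
predictions = lr.predict(X_test)

# Fraction of test samples whose outcome was predicted correctly
print(accuracy_score(y_test, predictions))
```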