Web Novel Reading Data: Analysis and Visualization in Practice

Author: @小白创作中心

Source: 51CTO

https://blog.51cto.com/u_16213621/12429848

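This article scrapes data from a novel website and analyzes it visually. It walks through the crawler implementation, the data cleaning, and the concrete steps for visualization with matplotlib and regression analysis with sklearn.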

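I. Background

In the age of e-reading, what kinds of books do young people like to read online? Urban romance? Classic literature? Horror and thrillers? Only by understanding what young readers are drawn to can writers tailor their work, broaden their audience, and write novels that meet young readers' needs and fit the spirit of the times, which in turn helps guide young people toward sound values. In this analysis we scrape the read count of each novel category on a novel website, along with the reward count of each individual novel, to find out which categories young readers currently prefer and how read count and reward count are related by regression.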

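II. Crawler Design

Crawler name: read-count and reward-count crawler for the novel categories of a novel website.
What is scraped: the total read count of each novel category, plus the read count and reward count of each individual novel.
Data characteristics: read count and reward count are both numeric and sit in similarly formatted HTML elements, so both can be extracted with re regular expressions.
Approach: crawl first, analyze second. The crawler uses requests to open the site's full novel catalog, collects novel URLs from the catalog pages, then visits each novel's detail page and extracts its category, read count, and reward count. Once the data is collected, it is lightly cleaned, the relevant charts are drawn, and a regression model is fitted to look for a relationship between read count and reward count.
Difficulties: as the saying goes, if you solve a problem with regular expressions, you now have two problems. Writing correct re patterns is one difficulty; using matplotlib and sklearn fluently is another.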
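The extraction step described above can be tried offline before any request is sent. The sketch below is only an illustration: `SAMPLE_HTML` is a hypothetical fragment modelled on the element format that the article's own patterns target, and `extract_count` is a helper name introduced here, not part of the original script.

```python
import re

# Hypothetical fragment, shaped like the markup the article's patterns expect.
SAMPLE_HTML = (
    '<span class="colorQianHui mgLeft20">总阅读数：'
    '<span class="colorHei">12345</span></span>'
    '<span class="colorQianHui mgLeft20">总鲜花数：'
    '<span class="colorHei">67</span></span>'
)

# One template serves both fields because their elements share a format.
COUNT_PATTERN = r'{label}：<span class="colorHei">(\d+)</span>'


def extract_count(html, label):
    """Return the integer that follows `label`, or None if nothing matches."""
    match = re.search(COUNT_PATTERN.format(label=re.escape(label)), html)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(extract_count(SAMPLE_HTML, "总阅读数"))
    print(extract_count(SAMPLE_HTML, "总鲜花数"))
```

Returning None on a miss, rather than calling `.group(1)` on a failed search, lets the caller decide how to handle a page whose layout differs.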

III. Crawler Implementation

The chosen novel site is fairly simple and has no real anti-scraping measures. The process is as follows:

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
# Author:  ZhangQiancheng

import re
import requests
import numpy as np
from time import sleep
from random import random
import matplotlib.pyplot as plt
from sklearn import linear_model

class Work(object):
    def __init__(self):
        self.novel_urls = []  # list of novel URLs
        self.novel_type_data = {
            "玄幻奇幻": 0,
            "武侠小说": 0,
            "同人小说": 0,
            "都市言情": 0,
            "军事历史": 0,
            "科幻网游": 0,
            "恐怖灵异": 0,
            "青春校园": 0
        }   # total read count per category; keys match the category names on the site
        self.novel_data = {}  # per-novel data, keyed by novel URL
        plt.rcParams['font.sans-serif'] = ['SimHei']  # font with CJK glyphs so the Chinese category names render
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}

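    def novel_url_spider(self):
        """
        Collect novel URLs from the catalog pages.
        :return:
        """
        novel_urls = set()  # a set de-duplicates URLs across pages
        for page in range(1, 51):
            index_url = f"https://b.faloo.com/y_0_0_0_0_0_0_{page}.html"
            index_res = requests.get(index_url, headers=self.headers)
            index_res.encoding = "gb2312"  # decode the page as GB2312; same below
            novel_urls_temp = re.findall(r"//b\.faloo\.com/\d+\.html", index_res.text)
            novel_urls.update(f'https:{i}' for i in novel_urls_temp)  # accumulate instead of overwriting per page
            sleep(random())  # random pause: less likely to trip anti-scraping checks, lighter load on the site
            break  # for testing only: stop after the first page; remove to crawl all 50 pages
        self.novel_urls = list(novel_urls)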

    def novel_data_spider(self):
        """
        Scrape each novel's data: category, read count, and reward count.
        :return:
        """
        self.novel_url_spider()
        for novel_url in self.novel_urls:
            novel_res = requests.get(novel_url, headers=self.headers)
            novel_res.encoding = "gb2312"
            try:  # catch extraction failures so one bad page does not crash the run
                # Extract the fields with regular expressions.
                novel_type = re.search('<span class="colorQianHui">小说分类.+?<a href=.+? target="_blank" title=.+?>(.+?)</a>', novel_res.text, flags=re.DOTALL).group(1)
                reading_count = re.search(r'<span class="colorQianHui mgLeft20">总阅读数：<span class="colorHei">(\d+?)</span></span>', novel_res.text).group(1)
                reading_count = int(reading_count)
                reward_count = re.search(r'<span class="colorQianHui mgLeft20">总鲜花数：<span class="colorHei">(\d+?)</span></span>', novel_res.text).group(1)
                reward_count = int(reward_count)  # convert to int, like reading_count
            except (AttributeError, ValueError):
                print("Failed to extract data")
                continue
            else:
                # Store the values in the category totals and the per-novel dict.
                if novel_type in self.novel_type_data:
                    self.novel_type_data[novel_type] += reading_count
                self.novel_data[novel_url] = {"reading_count": reading_count, "reward_count": reward_count}
                print(f'{novel_type}: {reading_count} reads - {reward_count} rewards')
            sleep(random())

        print(self.novel_type_data)

IV. Data Analysis and Visualization

1. Scatter Plot

def make_scatter_plot(self):
    """
    Draw a scatter plot of read count against reward count.
    :return:
    """
    x_reading, y_reward = self.data_clean()  # cleaned data; same below

    axes_3 = plt.axes()
    axes_3.scatter(x_reading, y_reward, s=5)
    axes_3.set_title("Read Count vs. Reward Count Scatter Plot")
    axes_3.set_xlabel("Read count")
    axes_3.set_ylabel("Reward count")
    plt.savefig("./scatter_plot.png")
    plt.show()

2. Bar Chart

def make_histogram(self):
    """
    Draw a bar chart of the read count of each novel category.
    :return:
    """
    axes_1 = plt.axes()
    x = []
    y = []
    for each in self.novel_type_data:
        x.append(each)
        y.append(self.novel_type_data[each])

    axes_1.bar(x, y, 0.4, color="blue")
    axes_1.set_xlabel("Novel category")  # x-axis label; same below
    axes_1.set_ylabel("Read count")   # y-axis label; same below
    axes_1.set_title("Read Count by Novel Category")  # chart title; same below
    plt.savefig("./histogram.png")  # save the chart locally; same below
    plt.show()  # display the chart; same below

3. Pie Chart

def make_pie_chart(self):
    """
    Draw a pie chart of each category's share of total reads.
    :return:
    """
    novel_types = []
    novel_reading_data = []
    for novel_type in self.novel_type_data:
        if self.novel_type_data[novel_type]:  # skip categories with zero reads
            novel_types.append(novel_type)
            novel_reading_data.append(self.novel_type_data[novel_type])

    axes_2 = plt.axes()
    axes_2.pie(novel_reading_data, labels=novel_types)
    axes_2.set_title("Share of Total Reads by Novel Category")
    plt.savefig("./pie_chart.png")
    plt.show()

4. Regression Equation

def regression_equation(self):
    """
    Fit a linear regression model of reward count on read count.
    :return:
    """
    x_reading, y_reward = self.data_clean()  # cleaned data
    x_reading = np.array(x_reading)

    regression = linear_model.LinearRegression()
    regression.fit(x_reading.reshape(-1, 1), y_reward)  # fit the model
    a, b = regression.coef_, regression.intercept_  # slope and intercept
    print(f"Regression equation: y={a[0]}x+{b}")  # print the fitted intercept, not the literal letter b

    axes_4 = plt.axes()
    axes_4.scatter(x_reading, y_reward, s=5, color="red")
    axes_4.plot(x_reading, regression.predict(x_reading.reshape(-1, 1)), color="blue")
    axes_4.set_title("Read Count vs. Reward Count Regression Model")
    axes_4.set_xlabel("Read count")
    axes_4.set_ylabel("Reward count")
    plt.savefig("./regression_equation.png")
    plt.show()
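For a single predictor, ordinary least squares has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept is mean(y) - slope * mean(x). The sketch below computes these in plain Python, together with the coefficient of determination R², as a way to sanity-check a fitted line. The function name `fit_line` and the sample numbers are made up for illustration and are not data from the article.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept; returns (slope, intercept, r2)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot if ss_tot else 1.0
    return slope, intercept, r2


if __name__ == "__main__":
    # Made-up (read count, reward count) pairs, for illustration only.
    slope, intercept, r2 = fit_line([100, 200, 300, 400], [3, 5, 7, 9])
    print(f"y = {slope}x + {intercept}, R^2 = {r2}")
```

An R² close to 1 means the line explains most of the variation in reward count; a value near 0 suggests read count alone says little about rewards.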

The main function calls novel_data_spider(), make_histogram(), make_pie_chart(), make_scatter_plot(), regression_equation(), and save_data() in turn, each producing its own output:

def main(self):
    """
    Call each step in order.
    :return:
    """
    self.novel_data_spider()
    self.make_histogram()
    self.make_pie_chart()
    self.make_scatter_plot()
    self.regression_equation()
    self.save_data()

if __name__ == "__main__":
    work = Work()
    work.main()
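The listing calls data_clean() and save_data(), but the article does not show either of them. The sketch below is one guess at what they might do, written as standalone functions that take the novel_data dict rather than as methods of Work. The cleaning rules (drop records with missing, non-numeric, or negative counts), the CSV format, and the default file name are all assumptions, not the author's implementation.

```python
import csv


def data_clean(novel_data):
    """Return parallel lists (x_reading, y_reward) of ints, skipping bad records.

    Assumed rule: a record is dropped if either count is missing,
    cannot be converted to int, or is negative.
    """
    x_reading, y_reward = [], []
    for record in novel_data.values():
        try:
            reading = int(record["reading_count"])
            reward = int(record["reward_count"])
        except (KeyError, TypeError, ValueError):
            continue
        if reading < 0 or reward < 0:
            continue
        x_reading.append(reading)
        y_reward.append(reward)
    return x_reading, y_reward


def save_data(novel_data, path="novel_data.csv"):
    """Write one CSV row per novel: url, reading_count, reward_count."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "reading_count", "reward_count"])
        for url, record in novel_data.items():
            writer.writerow([url, record.get("reading_count"), record.get("reward_count")])
```

Inside the class, these would become methods that read self.novel_data, so that self.data_clean() returns the two lists the plotting and regression methods unpack.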