Fixing Errors When Extracting Chinese Wikipedia Corpus Data with Gensim's Built-in corpora Module

Source

CSDN

https://blog.csdn.net/chenghao1012/article/details/139144383

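When using the Gensim library to process Chinese Wikipedia corpus data, you may run into a few common errors. This article explains what causes them and gives concrete fixes so that you can extract and process the corpus smoothly.

Error Message

When trying to extract Chinese Wikipedia corpus data with Gensim's built-in corpora module, you may see the following error: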

NotImplementedError: The lemmatize parameter is no longer supported. If you need to lemmatize, use e.g. <https://github.com/clips/pattern>. 
Perform lemmatization as part of your tokenization function and pass it as the tokenizer_func parameter to this initializer.

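Where the Problem Lies

The lemmatize parameter: if you pass lemmatize when creating a WikiCorpus instance, remove it. The parameter is no longer supported, and in gensim 4.x any value other than None, including lemmatize=False, raises the NotImplementedError above.
The dictionary parameter: older tutorials often pass dictionary={} together with lemmatize. As far as the gensim 4.x source shows, dictionary is still accepted and is not what triggers this error, so removing it is optional. When no dictionary is given, WikiCorpus builds one itself by making a full pass over the dump, which can take a long time before any text is written. If you need a custom dictionary, process the text yourself after creating the WikiCorpus object.
The get_texts method: get_texts is an ordinary instance method of WikiCorpus, not a static method, and it yields one list of tokens per article. wiki.get_texts() and WikiCorpus.get_texts(wiki) are the same call.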
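
The error message recommends doing any lemmatization or other normalization inside a tokenizer function and passing it as tokenizer_func. A minimal sketch of that idea follows, assuming gensim 4.x, where the documented interface is tokenizer_func(text, token_min_len, token_max_len, lower) returning a list of strings. The names cjk_tokenizer and build_corpus are hypothetical, and the regular expression is only a crude stand-in for real Chinese word segmentation. Passing dictionary={} is an assumption made here to skip the automatic vocabulary pass when only the raw text is needed.

```python
import re

# Runs of CJK ideographs, or runs of ASCII letters/digits (a rough illustration only).
_TOKEN_RE = re.compile(r"[\u4e00-\u9fff]+|[A-Za-z0-9]+")


def cjk_tokenizer(text, token_min_len=1, token_max_len=100, lower=True):
    """Hypothetical tokenizer following the interface gensim documents for tokenizer_func."""
    if lower:
        text = text.lower()
    # Any normalization (for example lemmatization of English words) would go here.
    return [
        tok for tok in _TOKEN_RE.findall(text)
        if token_min_len <= len(tok) <= token_max_len
    ]


def build_corpus(dump_path):
    """Create a WikiCorpus without the removed lemmatize argument (assumes gensim 4.x)."""
    # Imported lazily so the tokenizer above can be tried without gensim installed.
    from gensim.corpora import WikiCorpus

    return WikiCorpus(
        dump_path,
        tokenizer_func=cjk_tokenizer,
        dictionary={},  # assumption: skip the automatic dictionary-building pass
    )


if __name__ == "__main__":
    print(cjk_tokenizer("Gensim 处理维基百科中文语料"))
```

The tokenizer is a module-level function on purpose: WikiCorpus hands it to worker processes, so it has to be importable by them. Because WikiCorpus passes its own token_min_len and token_max_len to the tokenizer, long unsegmented runs of Chinese text may be filtered out unless those limits are adjusted or a real segmenter replaces the regular expression.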

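Solution

Remove the lemmatize parameter (the dictionary parameter can be removed as well).
Iterate over the articles with get_texts. The final code at the end of this article calls it as WikiCorpus.get_texts(wiki), which is equivalent to wiki.get_texts().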
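
The second step works because of how Python methods behave in general: calling a method through the class and passing the instance explicitly is the same as calling it on the instance. The sketch below illustrates this with FakeCorpus, a hypothetical stand-in class that is not part of gensim, so it runs without downloading a dump.

```python
class FakeCorpus:
    """Hypothetical stand-in for a corpus class with a get_texts generator method."""

    def __init__(self, articles):
        self.articles = articles

    def get_texts(self):
        # Yield one list of tokens per article, as a streaming corpus would.
        for article in self.articles:
            yield article.split()


corpus = FakeCorpus(["hello wiki", "second article here"])

via_instance = list(corpus.get_texts())         # usual bound-method call
via_class = list(FakeCorpus.get_texts(corpus))  # same function, instance passed explicitly

print(via_instance == via_class)
```

Either spelling can therefore be used with a real WikiCorpus object; the instance form is the more common one.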

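Other Issues

When a WikiCorpus is initialized, gensim uses multiprocessing to process the dump in parallel. On platforms that start child processes by spawning a fresh interpreter (Windows in particular), this can fail with a RuntimeError saying that an attempt has been made to start a new process before the current process has finished its bootstrapping phase.

To resolve this, try the following steps:

Make sure the code that creates the WikiCorpus runs only when the script is executed as the main module, by placing it inside an if __name__ == '__main__': block rather than at module top level.
If the script is imported and run from another Python script, make sure the importing script follows the same pattern.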
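
The pattern these steps rely on can be sketched as follows. This is a generic illustration using multiprocessing.Pool rather than gensim itself, and square and run_in_pool are hypothetical names. Everything that starts worker processes sits behind the main-module check, so a child process that re-imports the module does not try to start workers of its own.

```python
import multiprocessing


def square(x):
    # Defined at module top level so worker processes can import and call it.
    return x * x


def run_in_pool(values, processes=2):
    # Starting a Pool is the step that must not run at import time.
    with multiprocessing.Pool(processes=processes) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    # Only has an effect in frozen Windows executables; harmless elsewhere.
    multiprocessing.freeze_support()
    print(run_in_pool([1, 2, 3]))
```

Creating a WikiCorpus plays the same role as run_in_pool here: it is the call that launches worker processes and so belongs inside the guarded block.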

Final Working Code

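from gensim.corpora import WikiCorpus
import multiprocessing

if __name__ == '__main__':
    # If your program is not frozen into an executable, the next line can be omitted
    multiprocessing.freeze_support()
    # Separator written between tokens; "" concatenates them, use " " to keep token boundaries
    space = ""
    with open('wiki-zh-article.txt', 'w', encoding="utf8") as f:
        # lemmatize and dictionary arguments removed
        wiki = WikiCorpus('zhwiki-latest-pages-articles.xml.bz2')
        # get_texts yields one list of tokens per article; same as wiki.get_texts()
        for text in WikiCorpus.get_texts(wiki):
            f.write(space.join(text) + "\n")

    print("Finished Saved")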