How to Use Large Language Models for Data Annotation (Very Detailed): From Zero to Mastery, This One Article Is All You Need

Author: @小白创作中心

Source: CSDN, https://blog.csdn.net/Python_0011/article/details/140781222

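Data annotation is an important application of large language models (LLMs). This article shares insights and lessons from using ChatGPT (the GPT-3.5 and GPT-4 API versions) for Aspect-Based Sentiment Analysis (ABSA). It compares traditional manual annotation with LLM-based annotation and examines how they differ in cost, performance, and efficiency.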

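The ABSA Task and Dataset

Aspect-Based Sentiment Analysis (ABSA) is a natural language processing task that identifies and extracts the sentiment expressed toward specific aspects of a product or service. For example, here is a restaurant review labeled for four aspects: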

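This place is really cool and the decor is great. The drinks are okay, but a bit pricey.

Ambience: positive

Food: neutral

Price: negative

Service: not mentioned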

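For a research project, we created an ABSA dataset of hotel reviews. It contains about 2 million hotel reviews and covers three aspects:

Staff service
Service provided by service robots
Human-robot interaction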

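The Traditional Process

An ABSA task usually involves the following steps:

Manually annotate part of the data in two steps: first identify the aspects, then label the sentiment for each one. Check the consistency of the annotations, and bring in more annotators if the team is short-handed.
Train a model on the annotated data and check its performance (annotate more data if needed).
Use the trained model to predict labels for the rest of the dataset.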
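
The article does not say how annotation consistency was checked. One common choice is Cohen's kappa between two annotators; the sketch below assumes that metric, and the labels are made up purely for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels for one aspect (staff service) on six reviews.
annotator_1 = ["positive", "negative", "neutral", "positive", "unknown", "positive"]
annotator_2 = ["positive", "negative", "positive", "positive", "unknown", "negative"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

A kappa near 1 indicates strong agreement, while values near 0 mean the annotators agree no more often than chance, which would suggest the aspect definitions need tightening before more data is labeled.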

This process is labor-intensive and costly. For example, we manually annotated about 25,000 reviews, which took several annotators a few weeks.

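The LLM-Based Process

The process for ABSA with an LLM is as follows:

Manually annotate a small test dataset, for example a few hundred reviews.
Write a few-shot annotation prompt and use it to label 100 reviews.
Review the initial results and add any mislabeled examples, especially challenging cases, to the few-shot prompt as additional examples. Repeat as needed.
Label the rest of the dataset with the final prompt.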
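
The review-and-refine iteration can be sketched as a loop that labels a sample, compares the output with the hand-labeled test set, and collects the disagreements as candidate few-shot examples. In this sketch `label_fn` is a hypothetical stand-in for the actual LLM call, and the toy keyword labeler exists only so the code runs offline.

```python
def find_mislabeled(test_set, label_fn):
    """Return (accuracy, errors) for a labeling function on a gold test set.

    test_set: list of (review, gold_label) pairs labeled by hand.
    label_fn: hypothetical callable wrapping the LLM prompt; review -> label.
    """
    errors = []
    for review, gold in test_set:
        predicted = label_fn(review)
        if predicted != gold:
            errors.append({"review": review, "gold": gold, "predicted": predicted})
    accuracy = 1 - len(errors) / len(test_set)
    return accuracy, errors

def add_examples(few_shot_examples, errors, limit=5):
    """Append up to `limit` misclassified reviews, with gold labels, as new examples."""
    new = [(e["review"], e["gold"]) for e in errors[:limit]]
    return few_shot_examples + [ex for ex in new if ex not in few_shot_examples]

# Toy stand-in for the LLM (an assumption, not the real model).
def toy_label_fn(review):
    return "negative" if "slow" in review or "loud" in review else "positive"

gold_set = [
    ("The robot is very convenient", "positive"),
    ("The robot delivers too slowly", "negative"),
    ("The robot never arrived", "negative"),
]
accuracy, errors = find_mislabeled(gold_set, toy_label_fn)
examples = add_examples([("The robot is very convenient", "positive")], errors)
print(f"accuracy={accuracy:.2f}, examples now: {len(examples)}")
```

In practice a person would still inspect the collected errors before adding them, since some disagreements may point to mistakes in the gold labels rather than in the prompt.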

Prompt Engineering

The prompt was built following the steps outlined above. The final version reads:

You are an experienced data labeling engineer with extensive experience in labeling hotel reviews. Your task is to classify a review based on three dimensions, with four categories: positive, negative, neutral, and not mentioned.

The definitions and examples of the three dimensions are as follows:

Dimension 1: Quality of hotel staff service
Definition: customer perceptions directly related to staff behavior or attitude, such as timely service, skilled, knowledgeable, professional, polite, caring, understanding, sincere, helpful, etc.
Examples:
Review: The cleaning lady cleans in a timely manner.
Sentiment: Positive
Review: Staff were testing robots in the hallway, the noise was very loud and annoying, and the front desk did nothing about it!
Sentiment: Negative
[more examples]...

Dimension 2: Quality of robot service
Definition: customer perceptions of robot functionality or perceptions of the service result after using the robot
Examples:
Review: The robot is very convenient
Sentiment: Positive
Review: The robot delivers too slowly
Sentiment: Negative
[more examples]...

Dimension 3: Human-robot interaction perception
Definition: customer perceptions other than robot functionality, such as robot social intelligence (communication understanding ability), robot social existence (making one feel it has human characteristics or experiences a human can bring), robot design and novelty (voice and posture, freshness, curiosity, advanced, coolness).
Examples:
Review: The little robot speaks adorably, too cute
Sentiment: Positive
Review: The robot's voice is too loud and noisy;
Sentiment: Negative
[more examples]...

Now, classify the sentiment of the following review into three dimensions using a JSON object as the output method, with "employee_service", "robot_service", "human_robot_interaction" as the keys and the value is one of positive/negative/neutral/unknown.

Here is the hotel review:
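
Because the prompt asks for a JSON object, the reply can be validated before it is stored. The sketch below assumes the reply arrives as a plain string. `PROMPT_PREFIX` is a placeholder for the full prompt text, and `build_prompt` and `parse_labels` are illustrative names, not part of any library.

```python
import json

KEYS = ("employee_service", "robot_service", "human_robot_interaction")
VALUES = {"positive", "negative", "neutral", "unknown"}

# Placeholder: in practice this would hold the full few-shot prompt text.
PROMPT_PREFIX = "...\nHere is the hotel review:\n"

def build_prompt(review):
    """Append one review to the fixed prompt prefix."""
    return PROMPT_PREFIX + review.strip()

def parse_labels(reply):
    """Parse and validate the model's JSON reply; raise ValueError if malformed."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    labels = {}
    for key in KEYS:
        # Normalize case so "Positive" and "positive" are treated alike.
        value = str(data.get(key, "")).lower()
        if value not in VALUES:
            raise ValueError(f"unexpected value for {key!r}: {data.get(key)!r}")
        labels[key] = value
    return labels

# Hypothetical reply, written by hand for illustration.
reply = '{"employee_service": "Positive", "robot_service": "negative", "human_robot_interaction": "unknown"}'
print(parse_labels(reply))
```

Rejecting malformed replies early makes it easy to retry or set aside a review instead of silently writing a bad label into the dataset.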

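Performance Comparison

At the latest API pricing, GPT-4 costs roughly 20 times as much as GPT-3.5, while OpenAI's newest model, GPT-4o, costs half as much as the earlier GPT-4.

The prompt above is fairly long: about 1,400 tokens, counted with tiktoken. Together with the actual review, the input to the OpenAI API comes to roughly 1,500 tokens.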

The cost of labeling one review is therefore roughly as follows (the output is short, so its cost is left out of the estimate):

GPT-4: $0.015
GPT-3.5: $0.00075
GPT-4o: $0.0075

For 2 million reviews, the total cost is:

GPT-4: $30K, quite expensive!
GPT-3.5: $1.5K
GPT-4o: $15K
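
These estimates follow from simple arithmetic. The per-million-token input prices in the sketch below are back-computed from the per-review figures in the text (cost divided by 1,500 tokens), not taken from an official price list, so treat them as assumptions.

```python
# Input prices in USD per 1M tokens, back-computed from the per-review
# costs in the text (assumption: not an official price list).
PRICE_PER_M_INPUT = {"GPT-4": 10.0, "GPT-3.5": 0.5, "GPT-4o": 5.0}

TOKENS_PER_REQUEST = 1500   # ~1,400-token prompt plus one review
NUM_REVIEWS = 2_000_000

def cost_per_review(model, tokens=TOKENS_PER_REQUEST):
    """Input-only cost of one request; output tokens are ignored, as in the text."""
    return PRICE_PER_M_INPUT[model] * tokens / 1_000_000

for model in PRICE_PER_M_INPUT:
    per_review = cost_per_review(model)
    total = per_review * NUM_REVIEWS
    print(f"{model}: ${per_review:.5f} per review, ${total:,.0f} total")
```

Changing `TOKENS_PER_REQUEST` shows how strongly the total depends on prompt length, which is what motivates the cost-reduction attempts discussed next.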

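To cut costs, we tried the following approaches:

Mini-batching: append 100 reviews (instead of 1) to the prompt to reduce the total number of API calls. Note: with too many reviews per batch, the output may exceed the 4,096-token maximum.
Supervised fine-tuning (SFT): fine-tune GPT-3.5 on 100 reviews and drop the examples from the few-shot prompt above to reduce input tokens (from about 1,500 to about 500).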
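
Mini-batching could look like the following sketch: reviews are numbered and grouped so that each group can be appended to the shared prompt, and the batch size is capped so the expected output stays under the 4,096-token limit. The budget of 40 output tokens per review is an assumption for one small JSON object, and `max_batch_size` and `build_batches` are illustrative helpers.

```python
MAX_OUTPUT_TOKENS = 4096
ASSUMED_OUTPUT_TOKENS_PER_REVIEW = 40  # assumption: one short JSON object per review

def max_batch_size(limit=MAX_OUTPUT_TOKENS, per_review=ASSUMED_OUTPUT_TOKENS_PER_REVIEW):
    """Largest batch whose estimated output fits within the output-token limit."""
    return limit // per_review

def build_batches(reviews, batch_size):
    """Split reviews into batches and number each review inside its batch."""
    batches = []
    for start in range(0, len(reviews), batch_size):
        chunk = reviews[start:start + batch_size]
        numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(chunk, start=1))
        batches.append(numbered)
    return batches

# Dummy reviews so the sketch runs without real data.
reviews = [f"review text {n}" for n in range(250)]
size = min(100, max_batch_size())
batches = build_batches(reviews, size)
print(f"batch size {size}, {len(batches)} API calls instead of {len(reviews)}")
```

With this layout the roughly 1,400-token prompt is sent once per batch rather than once per review, which is where the savings would come from.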

However, both mini-batching and SFT produced much worse labeling performance, as shown below.