ChromaDB Tutorial: From Basic Operations to LangChain Integration
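ChromaDB is an open-source vector database for storing and retrieving vector embeddings of text. With the rise of large language models (LLMs), vector databases have become increasingly important. ChromaDB supports several underlying storage options, provides SDKs for Python and JavaScript/TypeScript, and focuses on simplicity, speed, and enabling analysis. This article walks through how to use ChromaDB: creating collections, adding documents, running similarity searches, and using it together with LangChain.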
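What is a vector store?
A vector store is a database designed specifically to store and retrieve vector embeddings efficiently. Traditional relational databases are poorly suited to storing and searching high-dimensional vector data, whereas a vector store indexes vectors and uses a similarity measure to find similar ones quickly. In applications such as personalized chatbots, a vector store lets you convert the user's input into an embedding and search a document collection for similar text, which the model can then use to generate more personalized and accurate responses.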
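To make the idea concrete, here is a minimal sketch of what a vector store does at query time: rank stored vectors by cosine similarity to a query vector. The three-dimensional vectors and the `search` helper are made up for illustration. Real embeddings have hundreds or thousands of dimensions, and stores such as ChromaDB rely on an approximate nearest-neighbor index rather than this brute-force scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(store, query_vector, n_results=2):
    """Return the n_results ids whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda doc_id: cosine_similarity(store[doc_id], query_vector),
        reverse=True,
    )
    return ranked[:n_results]


# Hypothetical toy "embeddings"; a real model would produce these from text.
store = {
    "student": [0.9, 0.1, 0.0],
    "club": [0.2, 0.8, 0.1],
    "university": [0.1, 0.3, 0.9],
}

top_ids = search(store, query_vector=[1.0, 0.2, 0.0])
print(top_ids)  # ['student', 'club']
```

The query vector points almost in the same direction as the "student" vector, so that id ranks first.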
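What is ChromaDB?
ChromaDB is an open-source vector store for storing and retrieving vector embeddings of text. Its main use is to save embeddings together with their metadata so that a large language model can use them later. Its key features include: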
- Support for different underlying storage options, such as DuckDB and ClickHouse
- SDKs for Python and JavaScript/TypeScript
- A focus on simplicity, speed, and enabling analysis
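Getting started with ChromaDB
In this section we create a vector database, add a collection, add text to the collection, and run a query. First, make sure the chromadb and openai Python packages are installed: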
pip install chromadb openai
Next, create a persistent database:
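import chromadb

# PersistentClient (chromadb 0.4 and later) stores data on disk under `path`.
# Older releases configured this with
# chromadb.Client(Settings(chroma_db_impl="duckdb+parquet", persist_directory="db/")).
client = chromadb.PersistentClient(path="db/")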
Use the client to create a collection object:
collection = client.create_collection(name="Students")
Prepare some sample text:
student_info = """
Alexandra Thompson, a 19-year-old computer science sophomore with a 3.7 GPA,
is a member of the programming and chess clubs who enjoys pizza, swimming, and hiking
in her free time in hopes of working at a tech company after graduating from the University of Washington.
"""
club_info = """
The university chess club provides an outlet for students to come together and enjoy playing
the classic strategy game of chess. Members of all skill levels are welcome, from beginners learning
the rules to experienced tournament players. The club typically meets a few times per week to play casual games,
participate in tournaments, analyze famous chess matches, and improve members' skills.
"""
university_info = """
The University of Washington, founded in 1861 in Seattle, is a public research university
with over 45,000 students across three campuses in Seattle, Tacoma, and Bothell.
As the flagship institution of the six public universities in Washington state,
UW encompasses over 500 buildings and 20 million square feet of space,
including one of the largest library systems in the world.
"""
Add the text to the collection:
collection.add(
    documents=[student_info, club_info, university_info],
    metadatas=[{"source": "student info"}, {"source": "club info"}, {"source": "university info"}],
    ids=["id1", "id2", "id3"]
)
Run a similarity search:
results = collection.query(
    query_texts=["What is the student name?"],
    n_results=2
)
print(results)
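The printed result is a dictionary of parallel lists, with one inner list per query text under keys such as `ids`, `documents`, `metadatas`, and `distances`. The following sketch shows one way to turn that structure into one row per matching document. The `rows_for_query` helper is hypothetical, and `sample_results` is a hand-written stand-in with placeholder text and distances, not real output.

```python
def rows_for_query(results, query_index=0):
    """Zip the parallel lists for one query into per-document dicts."""
    return [
        {"id": doc_id, "distance": distance, "source": metadata["source"], "text": document}
        for doc_id, distance, metadata, document in zip(
            results["ids"][query_index],
            results["distances"][query_index],
            results["metadatas"][query_index],
            results["documents"][query_index],
        )
    ]


# Hand-written stand-in with the same shape as a query result.
# The distances and texts are placeholders, not values produced by ChromaDB.
sample_results = {
    "ids": [["id1", "id2"]],
    "distances": [[1.0, 1.5]],
    "metadatas": [[{"source": "student info"}, {"source": "club info"}]],
    "documents": [["<student text>", "<club text>"]],
}

for row in rows_for_query(sample_results):
    print(row["id"], row["source"], row["distance"])
```

A smaller distance means a closer match, so the first row for each query is the most similar document.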
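Using other models
Collections use a built-in default embedding model unless told otherwise. ChromaDB can also use other embedding models, such as OpenAI's text-embedding-ada-002: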
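import os
from chromadb.utils import embedding_functions
# The API key is read from the OPENAI_API_KEY environment variable.
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-ada-002"
)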
students_embeddings = openai_ef([student_info, club_info, university_info])
print(students_embeddings)
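The call above takes a list of texts and returns one vector per text. As a rough illustration of that contract only, here is a toy stand-in that needs no API key. The `toy_embed` function is hypothetical: it hashes words into a fixed number of buckets, so its vectors carry none of the semantic meaning a learned model such as text-embedding-ada-002 provides.

```python
import hashlib
import math


def toy_embed(texts, dimensions=8):
    """Map each text to a unit-length vector by hashing its words into buckets."""
    vectors = []
    for text in texts:
        vector = [0.0] * dimensions
        for word in text.lower().split():
            digest = hashlib.sha256(word.encode("utf-8")).digest()
            vector[digest[0] % dimensions] += 1.0
        # Normalize so that vectors of different-length texts are comparable.
        norm = math.sqrt(sum(v * v for v in vector)) or 1.0
        vectors.append([v / norm for v in vector])
    return vectors


embeddings = toy_embed(["chess club", "public research university"])
print(len(embeddings), len(embeddings[0]))  # 2 8
```

The same input always yields the same vector, and every vector has the same length, which is what a vector store needs in order to compare them.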
Updating and deleting data
Update a record in the collection:
collection.update(
    ids=["id1"],
    documents=["Kristiane Carina, a 19-year-old computer science sophomore with a 3.7 GPA"],
    metadatas=[{"source": "student info"}],
)
Delete a record from the collection:
collection.delete(ids=['id1'])
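Other management operations
ChromaDB provides a range of management operations, including creating a client, listing collections, creating a new collection, getting a collection, and deleting a collection. Here are some examples: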
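import chromadb
# Create a client
# client = chromadb.Client()  # in-memory mode
client = chromadb.PersistentClient(path="./chromac")  # data is saved to disk
# List collections
print(client.list_collections())
# Create a new collection
collection = client.create_collection("testname")
# Get an existing collection
collection = client.get_collection("testname")
# Create the collection, or get it if it already exists
collection = client.get_or_create_collection("testname")
# Add data
collection.add(
    documents=[
        "On February 2, 2022, the US Department of Defense announced that it would send additional troops to Europe in response to tensions along the Russia-Ukraine border.",
        "On February 17, the Ukrainian military said that militias in the east had shelled government-controlled areas, while the militias accused Ukrainian government forces of first using heavy weapons to attack; tensions in eastern Ukraine continued to escalate.",
    ],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    ids=["id1", "id2"]
)
# Peek at the first few records in the collection (10 by default)
print(collection.peek())
# Update data
collection.update(
    ids=["id1"],
    documents=["Kristiane Carina, a 19-year-old computer science sophomore with a 3.7 GPA"],
    metadatas=[{"source": "student info"}],
)
# Delete data
collection.delete(ids=['id1'])
# Query data (only id2 remains after the delete above, so ask for one result)
results = collection.query(
    query_texts=["On what date did the Russia-Ukraine war break out?"],
    n_results=1
)
print(results)
# Delete the collection (done last, because the collection object cannot be used afterwards)
client.delete_collection("testname")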
Using ChromaDB with LangChain
LangChain is a framework for building AI applications, and it integrates well with ChromaDB. Here is an example:
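# These import paths follow older LangChain releases; newer releases move
# these classes into separate packages such as langchain_community.
from langchain.vectorstores import Chroma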
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader
import os
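os.environ["OPENAI_API_KEY"] = 'sk-xxxxxx'  # placeholder; use your own key
# Set encoding to match the file: 'gbk' for GBK-encoded Chinese text, 'utf-8' for UTF-8 files
loader = TextLoader('./russia.txt', encoding='gbk')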
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=400, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
embedding = OpenAIEmbeddings()
# Persist the data to disk
persist_directory = './chromadb'
vectordb = Chroma.from_documents(documents=docs, embedding=embedding, persist_directory=persist_directory)
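# Explicit persist() is for older versions; with chromadb 0.4 and later data is
# written automatically and this call can be dropped.
vectordb.persist()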
# Load the persisted data directly
vectordb = Chroma(persist_directory="./chromadb", embedding_function=embedding)
query = "What did the president say about Ketanji Brown Jackson"
docs = vectordb.similarity_search(query)
print(docs[0].page_content)
# Load the database directly, then query for similar text
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)
query = "On what date did the war between Russia and Ukraine take place?"
retriever = vectordb.as_retriever(search_type="mmr")
s = retriever.get_relevant_documents(query)
print(s[0].page_content)
# Fetch records directly with get
res = vectordb.get(limit=2)
print(res)
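The retriever in the example above is created with search_type="mmr", which stands for maximal marginal relevance: each pick balances similarity to the query against similarity to results that have already been chosen, so near-duplicate chunks are less likely to fill the result list. The following is a minimal sketch of that general technique, not LangChain's own implementation. The `mmr` function, its `lambda_mult` parameter name, and the two-dimensional vectors are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Pick k candidate indices, trading query relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical 2-D vectors: candidates 0 and 1 are near-duplicates.
query = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.12], [0.8, -0.6]]

print(mmr(query, candidates))                   # [0, 2]
print(mmr(query, candidates, lambda_mult=1.0))  # [0, 1]
```

With lambda_mult=1.0 the redundancy term vanishes and the ranking reduces to plain similarity search, which returns both near-duplicates. At 0.5 the second pick skips the duplicate in favor of the less similar but more distinct candidate.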
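Summary
ChromaDB is becoming an important component of systems built on large language models. By providing dedicated storage and efficient retrieval of vector embeddings, it gives an LLM fast access to relevant semantic information. This article covered ChromaDB's basic operations and some of its more advanced features to help readers get up to speed quickly.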