Asked by wildcat89 on 11/16/2023 · Updated 11/16/2023 · Viewed 83 times
Query GPT4All local model with Langchain and many .txt files - KeyError: 'input_variables'
Q:
python 3.8, Windows 10, neo4j==5.14.1, langchain==0.0.336
I'm trying to use a local Langchain model (GPT4All) to help me convert a corpus of loaded .txt files into a neo4j data structure via queries. I've provided minimal reproducible example code below, along with references to the article/repository I'm trying to emulate. I've also provided a "context" that should be included in the query along with all the Document objects. I'm still learning how to use Langchain, so I don't really know what I'm doing, but the current traceback I'm getting looks like this:
Traceback (most recent call last):
  File ".\neo4jmain.py", line xx, in <module>
    prompt_template = PromptTemplate(
  File "C:\Users\chalu\AppData\Local\Programs\Python\Python38\lib\site-packages\langchain\load\serializable.py", line 97, in __init__
    super().__init__(**kwargs)
  File "pydantic\main.py", line 339, in pydantic.main.BaseModel.__init__
  File "pydantic\main.py", line 1102, in pydantic.main.validate_model
  File "C:\Users\chalu\AppData\Local\Programs\Python\Python38\lib\site-packages\langchain\schema\prompt_template.py", line 76, in validate_variable_names
    if "stop" in values["input_variables"]:
KeyError: 'input_variables'
As you can see, I don't actually define input_variables anywhere, so I'm assuming this is Langchain's default behavior, but again, not sure. I'm also getting an error:
LLaMA ERROR: The prompt is 5161 tokens and the context window is 2048!
ERROR: The prompt size exceeds the context window size and cannot be processed.
...which is obviously a result of the query string itself being too large. I'd like to be able to query my documents for answers while providing the model the documents to reference. How can I do that? The Langchain documentation isn't great for noobs in this space; it's all over the place and lacks many simple use cases aimed at beginners, so I'm asking here.
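For the context-window error, one common pattern (a sketch, not LangChain-specific; token_len, CONTEXT_WINDOW, and the template string here are stand-ins, not names from the code below) is to format the prompt once per chunk and pre-check its token count against the model's window before calling the model:

```python
# Sketch: build one prompt per chunk and pre-check token counts, so a
# single oversized prompt never reaches the model.
def token_len(text: str) -> int:
    # Crude whitespace-based stand-in for a real tokenizer-based count.
    return len(text.split())

CONTEXT_WINDOW = 2048  # the window reported in the LLaMA error above
TEMPLATE = "Extract terms.\nContext Documents: {documents}"

def prompts_that_fit(chunks):
    """Yield formatted prompts whose token count fits the window."""
    for chunk in chunks:
        prompt = TEMPLATE.format(documents=chunk)
        if token_len(prompt) <= CONTEXT_WINDOW:
            yield prompt

chunks = ["short chunk one", "short chunk two", "x " * 5000]
fitting = list(prompts_that_fit(chunks))
print(len(fitting))  # the oversized third chunk is filtered out
```

Chunks that don't fit would be re-split rather than silently dropped in practice; the point is that each model call carries one chunk, not the whole corpus.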
# https://medium.com/neo4j/enhanced-qa-integrating-unstructured-and-graph-knowledge-using-neo4j-and-langchain-6abf6fc24c27
# https://github.com/sauravjoshi23/ai/blob/main/retrieval%20augmented%20generation/integrated-qa-neo4j-langchain.ipynb
# Script to convert a corpus of many text files into a neo4j graph
# Imports
import os
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms.gpt4all import GPT4All
from langchain.prompts import PromptTemplate
from transformers import AutoTokenizer
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def bert_len(text):
    """Return the length of a text in BERT tokens."""
    tokens = tokenizer.encode(text)
    return len(tokens)
def get_files(path: str) -> list:
    """Return a list of all files in a directory, recursively."""
    files = []
    for file in os.listdir(path):
        file_path = os.path.join(path, file)
        if os.path.isdir(file_path):
            files.extend(get_files(file_path))
        else:
            files.append(file_path)
    return files
# Get the text files
all_txt_files = get_files('data')
raw_txt_files = []
for current_file in all_txt_files:
    raw_txt_files.extend(TextLoader(current_file, encoding='utf-8').load())
# Create a text splitter object that will help us split the text into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,  # 200,
    chunk_overlap=128,  # 20
    length_function=bert_len,
    separators=['\n\n', '\n', ' ', ''],
)
# Split the text into "documents"
documents = text_splitter.create_documents([raw_txt_files[0].page_content])
# Utilizing these Document objects, we want to query the GPT4All model to help us create
# a JSON object that contains the ontology of terms mentioned in the given context,
# while mitigating "max_tokens" error.
# Create a PromptTemplate object that will help us create the prompt for GPT4All(?)
prompt_template = PromptTemplate(
    template="""
    You are a network graph maker who extracts terms and their relations from a given context.
    You are provided with a context chunk (delimited by ```). Your task is to extract the ontology
    of terms mentioned in the given context. These terms should represent the key concepts as per the context.
    Thought 1: While traversing through each sentence, Think about the key terms mentioned in it.
        Terms may include object, entity, location, organization, person,
        condition, acronym, documents, service, concept, etc.
        Terms should be as atomistic as possible
    Thought 2: Think about how these terms can have one on one relation with other terms.
        Terms that are mentioned in the same sentence or the same paragraph are typically related to each other.
        Terms can be related to many other terms
    Thought 3: Find out the relation between each such related pair of terms.
    Format your output as a list of json. Each element of the list contains
    a pair of terms and the relation between them, like the following:
    [Dict("node_1": "A concept from extracted ontology",
          "node_2": "A related concept from extracted ontology",
          "edge": "relationship between the two concepts, node_1 and node_2 in one or two sentences",
    ),
    Dict("node_1": "A concept from extracted ontology",
          "node_2": "A related concept from extracted ontology",
          "edge": "relationship between the two concepts, node_1 and node_2 in one or two sentences",
    ),
    Dict(...)]
    Context Documents: {documents}
    """,
    variables={
        "documents": documents,
    }
)
# Create a GPT4All object that will help us query the GPT4All model
llm = GPT4All(
    model=r"C:\Users\chalu\AppData\Local\nomic.ai\GPT4All\gpt4all-falcon-q4_0.gguf",
    n_threads=3,
    max_tokens=5162,  # <-- attempt to mitigate "max_tokens" error
    verbose=True,
)
# Get the response from GPT-4-All
response = llm(prompt_template)
print(response)
A:
1 upvote
cybersam
11/16/2023
#1
Regarding the KeyError: 'input_variables' error: as mentioned, the PromptTemplate parameter for the input variables is named input_variables, so you need to change the parameter name from variables to input_variables.
And, yes, the error message should be worded better.
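For context, the mechanism behind the KeyError can be reproduced without LangChain. A minimal sketch, where ToyPromptTemplate is a hypothetical stand-in for PromptTemplate and the validator line mirrors the one shown in the traceback:

```python
# Toy stand-in illustrating the KeyError: the validator looks up
# values["input_variables"] directly, so passing the list under any other
# keyword (e.g. variables=) leaves that key missing, and the lookup raises
# KeyError before a clearer validation error can be produced.
class ToyPromptTemplate:
    def __init__(self, **kwargs):
        values = dict(kwargs)
        # Mirrors the check in langchain/schema/prompt_template.py:
        if "stop" in values["input_variables"]:  # KeyError if misnamed
            raise ValueError("'stop' is a reserved variable name")
        self.template = values["template"]
        self.input_variables = values["input_variables"]

# Correct keyword: constructs fine.
ok = ToyPromptTemplate(template="Context: {documents}",
                       input_variables=["documents"])

# Wrong keyword, as in the question: reproduces KeyError: 'input_variables'.
try:
    ToyPromptTemplate(template="Context: {documents}", variables=["documents"])
except KeyError as e:
    print("KeyError:", e)
```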
Comments
0 upvotes
wildcat89
11/17/2023
The docs are all over the place and hard to follow. However, I changed that line to input_variables=["documents"] and now get the error: ValueError: Argument 'prompt' is expected to be a string. Instead found <class 'langchain.prompts.prompt.PromptTemplate'>. If you want to run the LLM on multiple prompts, use 'generate' instead. How do I fix this now? GPT4All has no generate method??
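The ValueError in the last comment arises because calling an LLM object directly expects a plain string, not a PromptTemplate. A minimal sketch of the fix pattern, with fake_llm as a hypothetical stand-in for the GPT4All call: format the template into a string first, then pass the string to the model.

```python
# Sketch of the fix pattern: format the template into a string, then
# pass the string to the model. All names here are stand-ins.
template = (
    "You are a network graph maker. "
    "Extract terms from the context below.\n"
    "Context Documents: {documents}"
)

def fake_llm(prompt: str) -> str:
    # Stand-in for the GPT4All call, which requires a plain string.
    assert isinstance(prompt, str)
    return "ok"

chunk_text = "Example chunk of a loaded .txt file."
prompt = template.format(documents=chunk_text)  # str, not PromptTemplate
response = fake_llm(prompt)
```

In LangChain of that era the same thing is typically done with prompt_template.format(documents=...) or by composing the template and LLM in a chain, so that only formatted strings reach the model.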