
Graph RAG

This guide provides an introduction to Graph RAG. For detailed documentation of all supported features and configurations, refer to the Graph RAG Project Page.

Overview

The GraphRetriever from the langchain-graph-retriever package provides a LangChain retriever that combines unstructured similarity search on vectors with structured traversal of metadata properties. This enables graph-based retrieval over an existing vector store.

Integration details

Retriever | Source | PyPI Package | Latest | Project Page
GraphRetriever | github.com/datastax/graph-rag | langchain-graph-retriever | PyPI - Version | Graph RAG

Benefits

  • Link based on existing metadata: Use existing metadata fields without additional processing. Retrieve more from an existing vector store!

  • Change links on demand: Edges can be specified on-the-fly, allowing different relationships to be traversed based on the question.

  • Pluggable Traversal Strategies: Use built-in traversal strategies like Eager or MMR, or define custom logic to select which nodes to explore.

  • Broad compatibility: Adapters are available for a variety of vector stores, and support for additional stores is easy to add.
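
To give a feel for what an MMR-style strategy trades off, here is a minimal, library-independent sketch of greedy maximal marginal relevance selection. It is not the package's implementation: the function names (cosine, mmr_select), the lambda_mult parameter, and the toy 2-D vectors are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mmr_select(query, vectors, k, lambda_mult):
    """Greedy MMR: reward similarity to the query, penalize similarity to prior picks."""
    selected = []
    remaining = list(vectors)
    while remaining and len(selected) < k:

        def score(name):
            relevance = cosine(query, vectors[name])
            redundancy = max(
                (cosine(vectors[name], vectors[s]) for s in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical embeddings: "rodent_b" is nearly a duplicate of "rodent_a".
vectors = {
    "rodent_a": (1.0, 0.1),
    "rodent_b": (1.0, 0.12),
    "bird": (0.6, 0.8),
}
query = (1.0, 0.0)

relevance_only = mmr_select(query, vectors, k=2, lambda_mult=1.0)
diverse = mmr_select(query, vectors, k=2, lambda_mult=0.3)
print(relevance_only)  # ['rodent_a', 'rodent_b']
print(diverse)         # ['rodent_a', 'bird']
```

With lambda_mult=1.0 the selection reduces to plain similarity ranking; lowering it makes the near-duplicate less attractive than a more distinct candidate.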

Setup

Installation

This retriever lives in the langchain-graph-retriever package.

pip install -qU langchain-graph-retriever

Instantiation

The following examples will show how to perform graph traversal over some sample Documents about animals.

Prerequisites

  1. Ensure you have Python 3.10+ installed

  2. Install the following package that provides sample data.

    pip install -qU graph_rag_example_helpers
  3. Download the test documents:

    from graph_rag_example_helpers.datasets.animals import fetch_documents
    animals = fetch_documents()
  4. Select an embeddings model. Available options: OpenAI, Azure, Google, AWS, HuggingFace, Ollama, Cohere, MistralAI, Nomic, NVIDIA, Voyage AI, IBM, Fake. The example below uses OpenAI:

pip install -qU langchain-openai
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

Populating the Vector store

This section shows how to populate a variety of vector stores with the sample data.

For help on choosing one of the vector stores below, or to add support for your vector store, consult the documentation about Adapters and Supported Stores.

Install the langchain-graph-retriever package with the astra extra:

pip install "langchain-graph-retriever[astra]"

Then create a vector store and load the test documents:

from langchain_astradb import AstraDBVectorStore

vector_store = AstraDBVectorStore.from_documents(
    documents=animals,
    embedding=embeddings,
    collection_name="animals",
    api_endpoint=ASTRA_DB_API_ENDPOINT,
    token=ASTRA_DB_APPLICATION_TOKEN,
)

For the ASTRA_DB_API_ENDPOINT and ASTRA_DB_APPLICATION_TOKEN credentials, consult the AstraDB Vector Store Guide.

note

For faster initial testing, consider using the InMemory Vector Store.

Graph Traversal

This graph retriever starts with a single animal that best matches the query, then traverses to other animals sharing the same habitat and/or origin.

from graph_retriever.strategies import Eager
from langchain_graph_retriever import GraphRetriever

traversal_retriever = GraphRetriever(
    store=vector_store,
    edges=[("habitat", "habitat"), ("origin", "origin")],
    strategy=Eager(k=5, start_k=1, max_depth=2),
)

The above creates a graph traversing retriever that starts with the nearest animal (start_k=1), retrieves 5 documents (k=5) and limits the search to documents that are at most 2 steps away from the first animal (max_depth=2).

The edges define how metadata values can be used for traversal. In this case, every animal is connected to other animals with the same habitat and/or origin.
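
As a rough illustration of what an edge such as ("habitat", "habitat") expresses, the sketch below checks whether two documents are linked by comparing metadata values. It is a simplified model, not the library's matching logic: the Doc class, the shared_edges helper, and the metadata values are hypothetical, and any richer matching the library may support is ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (id and metadata only)."""

    id: str
    metadata: dict = field(default_factory=dict)


def shared_edges(a, b, edges):
    """Return the edges linking a to b: a's source field equals b's target field."""
    matches = []
    for source_field, target_field in edges:
        value = a.metadata.get(source_field)
        if value is not None and value == b.metadata.get(target_field):
            matches.append((source_field, target_field))
    return matches


edges = [("habitat", "habitat"), ("origin", "origin")]

# Illustrative metadata values, not taken from the sample dataset.
capybara = Doc("capybara", {"habitat": "wetlands", "origin": "south america"})
heron = Doc("heron", {"habitat": "wetlands", "origin": "north america"})
camel = Doc("camel", {"habitat": "desert", "origin": "africa"})

print(shared_edges(capybara, heron, edges))  # [('habitat', 'habitat')]
print(shared_edges(capybara, camel, edges))  # []
```

The list of matching edges doubles as the explanation for why one document is reachable from another.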

results = traversal_retriever.invoke("what animals could be found near a capybara?")

for doc in results:
    print(f"{doc.id}: {doc.page_content}")
capybara: capybaras are the largest rodents in the world and are highly social animals.
heron: herons are wading birds known for their long legs and necks, often seen near water.
crocodile: crocodiles are large reptiles with powerful jaws and a long lifespan, often living over 70 years.
frog: frogs are amphibians known for their jumping ability and croaking sounds.
duck: ducks are waterfowl birds known for their webbed feet and quacking sounds.

Graph traversal improves retrieval quality by leveraging structured relationships in the data. Unlike standard similarity search (see below), it provides a clear, explainable rationale for why documents are selected.

In this case, the documents capybara, heron, crocodile, frog, and duck all share the same habitat=wetlands, as defined by their metadata. This should increase Document Relevance and the quality of the answer from the LLM.

Comparison to Standard Retrieval

When max_depth=0, the graph traversing retriever behaves like a standard retriever:

standard_retriever = GraphRetriever(
    store=vector_store,
    edges=[("habitat", "habitat"), ("origin", "origin")],
    strategy=Eager(k=5, start_k=5, max_depth=0),
)

This creates a retriever that starts with the nearest 5 animals (start_k=5), and returns them without any traversal (max_depth=0). The edge definitions are ignored in this case.
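
The interaction of start_k, k, and max_depth can be sketched with a small breadth-first traversal in plain Python. This is only a conceptual model under stated assumptions, not the Eager strategy's actual implementation: the Doc class, the cosine and traverse helpers, and the toy 2-D vectors and metadata are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical document with a toy embedding and metadata."""

    id: str
    vector: tuple
    metadata: dict


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def linked(a, b, edges):
    """True if any edge's source field on a matches the target field on b."""
    return any(
        a.metadata.get(s) is not None and a.metadata.get(s) == b.metadata.get(t)
        for s, t in edges
    )


def traverse(query, docs, edges, k, start_k, max_depth):
    """Seed with the start_k most similar docs, then follow edges breadth-first."""
    ranked = sorted(docs, key=lambda d: cosine(query, d.vector), reverse=True)
    frontier = ranked[:start_k]
    selected = {d.id: 0 for d in frontier}  # id -> depth at which it was found
    depth = 0
    while frontier and depth < max_depth and len(selected) < k:
        depth += 1
        next_frontier = []
        for node in frontier:
            for cand in ranked:  # candidates are visited in similarity order
                if cand.id in selected or len(selected) >= k:
                    continue
                if linked(node, cand, edges):
                    selected[cand.id] = depth
                    next_frontier.append(cand)
        frontier = next_frontier
    return selected


docs = [
    Doc("capybara", (1.0, 0.0), {"habitat": "wetlands"}),
    Doc("guinea pig", (0.9, 0.1), {"habitat": "grassland"}),
    Doc("heron", (0.1, 0.9), {"habitat": "wetlands"}),
    Doc("frog", (0.0, 1.0), {"habitat": "wetlands"}),
]
query = (1.0, 0.05)
edges = [("habitat", "habitat")]

graph_ids = list(traverse(query, docs, edges, k=3, start_k=1, max_depth=2))
flat_ids = list(traverse(query, docs, edges, k=3, start_k=3, max_depth=0))
print(graph_ids)  # ['capybara', 'heron', 'frog']
print(flat_ids)   # ['capybara', 'guinea pig', 'heron']
```

With max_depth=0 the loop never runs, so the result is just the start_k nearest documents; with traversal enabled, the similar but unlinked guinea pig is replaced by documents that share the seed's habitat.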

This is essentially the same as:

standard_retriever = vector_store.as_retriever(search_kwargs={"k":5})

For either case, invoking the retriever returns:

results = standard_retriever.invoke("what animals could be found near a capybara?")

for doc in results:
    print(f"{doc.id}: {doc.page_content}")
capybara: capybaras are the largest rodents in the world and are highly social animals.
iguana: iguanas are large herbivorous lizards often found basking in trees and near water.
guinea pig: guinea pigs are small rodents often kept as pets due to their gentle and social nature.
hippopotamus: hippopotamuses are large semi-aquatic mammals known for their massive size and territorial behavior.
boar: boars are wild relatives of pigs, known for their tough hides and tusks.

These documents are selected based on similarity alone. Any structural data in the store is ignored. Compared to graph retrieval, this can decrease Document Relevance because the returned results are less likely to help answer the query.

Usage

Following the examples above, .invoke is used to initiate retrieval on a query.

Use within a chain

Like other retrievers, GraphRetriever can be incorporated into LLM applications via chains.

pip install -qU "langchain[groq]"
import getpass
import os

if not os.environ.get("GROQ_API_KEY"):
    os.environ["GROQ_API_KEY"] = getpass.getpass("Enter API key for Groq: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("llama3-8b-8192", model_provider="groq")
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the context provided.

Context: {context}

Question: {question}"""
)

def format_docs(docs):
    return "\n\n".join(f"text: {doc.page_content} metadata: {doc.metadata}" for doc in docs)

chain = (
    {"context": traversal_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
chain.invoke("what animals could be found near a capybara?")
Animals that could be found near a capybara include herons, crocodiles, frogs,
and ducks, as they all inhabit wetlands.
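
To see what the LLM receives as context, the format_docs helper can be exercised on its own. The sketch below uses a hypothetical Doc dataclass in place of a LangChain Document, with made-up content and metadata, so it runs without a vector store or API key.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in exposing the two attributes format_docs reads."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    # Same formatting as in the chain above: text and metadata per document.
    return "\n\n".join(
        f"text: {doc.page_content} metadata: {doc.metadata}" for doc in docs
    )


# Illustrative documents, not the sample dataset.
docs = [
    Doc("herons are wading birds.", {"habitat": "wetlands"}),
    Doc("frogs are amphibians.", {"habitat": "wetlands"}),
]
context = format_docs(docs)
print(context)
```

Because the metadata is serialized into the context, the model can plausibly cite shared properties such as the habitat even when the page content does not mention them.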

API reference

To explore all available parameters and advanced configurations, refer to the Graph RAG API reference.

