RAG, but Better: Graph RAG – 1

Imagine stepping into a colossal library, its shelves stretching endlessly, filled with books. You ask, “What are the key trends in the tech industry?” Now, picture the librarian not only finding relevant books but also summarizing all the tech-related resources in the library to provide you with a concise, insightful answer. This is precisely what Graph RAG does—an innovative “smart librarian” for complex data.
 
Graph RAG takes traditional Retrieval-Augmented Generation (RAG) systems to the next level, designed to answer complex questions and summarize large-scale datasets. Moving beyond simple information retrieval, it organizes data using graph-based structures and condenses information into community-level summaries. This ensures a deeper understanding of the context and core of any query.
 
Today, we’ll explore how Graph RAG works, its research breakthroughs, and its real-world applications.

The Limitations of Traditional RAG

RAG retrieves relevant information and generates answers effectively for straightforward questions like “Who won the Nobel Prize in 2023?” However, it faces challenges with complex queries requiring contextual understanding and summarization, such as “What are the main themes of this dataset?”

These issues stem from RAG’s tendency to treat data as isolated fragments, often overlooking broader connections. Moreover, large datasets can surpass the processing limits of LLMs. Graph RAG overcomes these limitations by restructuring data into graph-based formats and summarizing it at the community level, enabling more efficient and holistic data analysis.

How Graph RAG Works

Graph RAG goes beyond merely retrieving data—it divides and organizes it into meaningful units. Let’s take a closer look at how it works step by step.

Graph RAG pipeline.

1. Data Splitting (Source Documents → Text Chunks)  
First, the dataset is divided into appropriately sized text chunks. If the chunks are too small, extraction requires more LLM calls and processing costs rise; if they are too large, they strain the LLM's context window and recall suffers. Research suggests that around 600 tokens strikes a good balance between efficiency and accuracy: 600-token chunks extracted nearly twice as many entity references from the HotPotQA dataset as 2400-token chunks.
How the entity references detected varies with chunk size and gleanings.
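The chunking step can be sketched in a few lines. The overlap size and the word-based token count below are simplifying assumptions; a production pipeline would count real LLM tokens with a tokenizer.

```python
# Minimal sketch of step 1: splitting a source document into fixed-size,
# overlapping chunks. Whitespace-separated words stand in for LLM tokens.

def chunk_text(text, chunk_size=600, overlap=100):
    """Split `text` into chunks of ~chunk_size tokens, overlapping by `overlap`."""
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

The overlap keeps entities that straddle a chunk boundary visible in at least one chunk, at the cost of some duplicated extraction work.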

2. Extracting Key Elements (Text Chunks → Element Instances)  
This step extracts key elements (nodes) and relationships (edges) from each text chunk. The LLM identifies:  
 
  • Nodes: Independent units of information, such as entities (e.g., people, places, or concepts)
  • Edges: Connections or relationships between entities (e.g., “A is part of B”)
 
To tailor the extraction process, the researchers used domain-specific prompts and few-shot learning. For example, relationships like diseases and symptoms in medical data or case precedents and legal arguments in legal data are identified. An iterative “gleaning” process ensures that initially overlooked entities are later captured.  
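The extraction-plus-gleaning loop can be sketched as follows. `call_llm` is a hypothetical stand-in for a real model call; here it is stubbed with canned output so the control flow is runnable, and the prompt wording is purely illustrative.

```python
# Hedged sketch of step 2: entity/relationship extraction with gleaning.

def call_llm(prompt):
    # Placeholder: a real system would send `prompt` to an LLM and parse
    # structured output. This stub simulates a first pass and a gleaning pass.
    if "anything missed" in prompt:
        return {"nodes": [("Solar Power", "concept")], "edges": []}
    return {"nodes": [("EU", "organization"), ("Green Deal", "policy")],
            "edges": [("EU", "proposed", "Green Deal")]}

def extract_elements(chunk, max_gleanings=1):
    result = call_llm(f"Extract entities and relations from: {chunk}")
    nodes, edges = list(result["nodes"]), list(result["edges"])
    for _ in range(max_gleanings):
        # "Gleaning": ask the model whether any entities were overlooked.
        extra = call_llm(f"Was anything missed in: {chunk}")
        if not extra["nodes"] and not extra["edges"]:
            break
        nodes += extra["nodes"]
        edges += extra["edges"]
    return nodes, edges
```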
 
3. Generating Element Summaries (Element Instances → Element Summaries)  

Graph RAG summarizes extracted nodes and edges into concise descriptions, generating text that explains the nodes and their relationships to create meaningful summaries. For example, a node for “Renewable Energy” might have a summary such as, “Key trends include solar technology and policy support.”

To ensure clarity and avoid redundancy, variations of names (e.g., “AI” and “Artificial Intelligence”) are linked, and related elements are grouped together. This approach enhances data organization and improves the overall efficiency of information processing.
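The alias-merging idea can be illustrated like this; the alias table and the join-based "summary" are assumptions standing in for LLM-driven resolution and summarization.

```python
# Sketch of step 3: collapsing name variants before summarization.

ALIASES = {"AI": "Artificial Intelligence", "ML": "Machine Learning"}

def canonical(name):
    return ALIASES.get(name, name)

def merge_descriptions(instances):
    """Group per-chunk descriptions under one canonical entity name.
    `instances` is a list of (entity_name, description) pairs."""
    merged = {}
    for name, desc in instances:
        merged.setdefault(canonical(name), []).append(desc)
    # A real pipeline would ask an LLM to condense each list into one
    # abstractive summary; here we simply concatenate.
    return {name: " ".join(descs) for name, descs in merged.items()}
```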

 
4. Creating Graphs and Communities (Element Summaries → Graph Communities)

The summarized nodes and edges are connected to form a graph, with related nodes grouped into communities using the Leiden algorithm.

For instance, in news data, a “Climate Change” community might cluster topics such as “Renewable Energy,” “Carbon Emissions,” and “Policy Reform.” This hierarchical structuring of communities facilitates efficient analysis of complex data, offering a clearer understanding of interrelated topics.
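As a dependency-free illustration of the grouping idea, the sketch below clusters nodes by connected component. This is a deliberate simplification: actual Graph RAG runs the Leiden algorithm (available via the `igraph`/`leidenalg` packages), which also splits dense components into hierarchical sub-communities.

```python
# Simplified stand-in for step 4: group related nodes into communities.
from collections import defaultdict

def communities(edges):
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, groups = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:                     # depth-first traversal
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(adj[node] - seen)
        groups.append(group)
    return groups
```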

 
5. Summarizing Communities (Graph Communities → Community Summaries)  

Each community is summarized into a report using LLMs, highlighting key nodes and edges. These summaries can serve as:

  1. Indexes to answer specific questions.  
  2. Standalone insights to understand the dataset’s structure and meaning even without a query.
 
Here’s how community summaries are structured:  
  • Leaf-Level Communities:
    Detailed summaries prioritize key nodes, edges, and covariates, arranging them by importance before integrating them into the LLM’s context window.
  • Higher-Level Communities:
    The system aggregates summaries from leaf communities. When the context window limit is reached, it compresses lower-level summaries into concise text, retaining essential information.
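The importance-ordered packing described above can be sketched as follows; the token budget and word-based cost estimate are illustrative assumptions.

```python
# Sketch of step 5's context packing: fill a fixed LLM context budget
# with the most important element/community summaries first.

def pack_summaries(summaries, budget=1000):
    """summaries: list of (importance, text) pairs.
    Returns the texts that fit within `budget`, highest importance first."""
    packed, used = [], 0
    for importance, text in sorted(summaries, reverse=True):
        cost = len(text.split())      # crude token estimate
        if used + cost > budget:
            continue                   # skip entries that don't fit
        packed.append(text)
        used += cost
    return packed
```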
 
6. Generating Responses (Community Summaries → Community Answers → Global Answer)
When a query arrives, Graph RAG generates an intermediate response from each relevant community summary. The LLM scores each response's helpfulness from 0 to 100, discards responses scored 0, and combines the highest-scoring ones into a final global answer.
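This map-reduce pattern can be sketched as below. The LLM is stubbed with a naive keyword-overlap scorer so the example runs; in a real system both the partial answer and its 0-100 score would come from the model.

```python
# Sketch of step 6: map community summaries to scored partial answers,
# then reduce the best ones into a global answer.

def answer_query(query, community_summaries, llm, top_k=3):
    scored = []
    for summary in community_summaries:
        partial, score = llm(query, summary)   # map step
        if score > 0:                          # drop unhelpful answers
            scored.append((score, partial))
    scored.sort(reverse=True)
    # Reduce step: concatenate the top-k partial answers. A real system
    # would feed them back to the LLM for a final synthesized answer.
    return " ".join(partial for _, partial in scored[:top_k])

def fake_llm(query, summary):
    # Hypothetical stand-in: score = keyword overlap with the query.
    overlap = len(set(query.lower().split()) & set(summary.lower().split()))
    return summary, overlap * 50
```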

Back into the Library

Imagine an LLM receiving the question, “What are the key trends in the tech industry today?” and generating a response using Graph RAG.

Community Summarization:
The dataset is divided into multiple communities. For instance:
  • An “AI Ethics” community might include nodes like “AI Transparency” and “Responsible Data Usage.”

Question Processing:
When a query is received, Graph RAG identifies the most relevant community summaries.
  • For the question “What are the key trends in the tech industry today?”, it references summaries from the “AI Ethics,” “Data Privacy,” and “Generative AI Applications” communities.

Final Response Construction:
Based on the relevant community summaries, the LLM generates individual responses:
  • AI Ethics Community: “AI transparency and responsible data usage are critical topics of discussion.”
  • Data Privacy Community: “Data protection regulations are tightening, driving advancements in technologies for secure personal data management.”
  • Generative AI Applications Community: “Generative AI is being applied across industries, including content creation, customer service, and product design.”

These responses are then combined to form a comprehensive and structured final answer:

“The key trends in the tech industry today are AI ethics, data privacy, and generative AI applications. AI transparency and responsible data usage are emphasized, while data protection regulations are being strengthened. Additionally, generative AI is expanding its applicability across various industries.”

Coming Soon

We examined how Graph RAG processes vast datasets by structuring them into communities. Unlike traditional methods that merely list data, Graph RAG leverages graph-based structures and community summaries to extract context and meaning, enabling efficient and insightful analysis of large datasets.

In our next article, we’ll explore the performance evaluation and real-world use cases of Graph RAG. Join us to uncover the full potential of this innovative tool and its transformative impact on data analysis and summarization. Stay tuned!
