RAG, but Better: Graph RAG – 1


Imagine stepping into a colossal library, its shelves stretching endlessly, filled with books. You ask, “What are the key trends in the tech industry?” Now, picture the librarian not only finding relevant books but also summarizing all the tech-related resources in the library to provide you with a concise, insightful answer. This is precisely what Graph RAG does—an innovative “smart librarian” for complex data.
 
Graph RAG takes traditional Retrieval-Augmented Generation (RAG) systems to the next level, designed to answer complex questions and summarize large-scale datasets. Moving beyond simple information retrieval, it organizes data using graph-based structures and condenses information into community-level summaries. This ensures a deeper understanding of the context and core of any query.
 
Today, we’ll explore how Graph RAG works, its research breakthroughs, and its real-world applications.

The Limitations of RAG and the Need for Graph RAG

RAG is a system that retrieves relevant information and generates answers based on it. While it works well for straightforward questions like “Who won the Nobel Prize in 2023?”, it struggles with queries requiring contextual understanding and summarization, such as “What are the main themes of this dataset?”
 
This limitation arises because RAG processes data as isolated fragments, often missing the overall connections. Additionally, handling large datasets can exceed the processing capacity of LLMs (Large Language Models). To address these challenges, Graph RAG reorganizes data using graph-based structures and summarizes it at the community level, enabling more efficient and comprehensive data processing.

How Graph RAG Works

Graph RAG goes beyond merely retrieving data—it divides and organizes it into meaningful units. Let’s take a closer look at how it works step by step.

Graph RAG pipeline.


1. Data Splitting (Source Documents → Text Chunks)  
First, the source documents are divided into appropriately sized text chunks. If the chunks are too small, the number of LLM calls (and thus cost) increases; if they are too large, extraction quality degrades and information is lost. The research found that 600-token chunks strike a good balance between efficiency and accuracy: on the HotPotQA dataset, 600-token chunks extracted nearly twice as many entity references as 2,400-token chunks.
How the number of entity references detected varies with chunk size and number of gleanings (HotPotQA dataset).
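As a rough sketch of this chunking step (a simple whitespace split stands in for a real BPE tokenizer such as tiktoken, and the size and overlap parameters are illustrative, not the paper's exact configuration):

```python
# Sketch of fixed-size token chunking with overlap.
# NOTE: `text.split()` is a stand-in for a real tokenizer.
def chunk_text(text: str, chunk_size: int = 600, overlap: int = 100) -> list[str]:
    """Split text into chunks of roughly `chunk_size` tokens with overlap."""
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks

# A 1,500-token toy document yields three overlapping 600-token chunks.
doc = ("word " * 1500).strip()
chunks = chunk_text(doc, chunk_size=600, overlap=100)
```

The overlap keeps entity mentions that straddle a chunk boundary from being split in two, at the cost of some duplicated processing.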

2. Extracting Key Elements (Text Chunks → Element Instances)  
This step extracts key elements (nodes) and relationships (edges) from each text chunk. The LLM identifies:  
 
  • Nodes: Independent units of information, such as entities (e.g., people, places, or concepts)
  • Edges: Connections or relationships between entities (e.g., “A is part of B”)
 
To tailor the extraction process, the researchers used domain-specific prompts and few-shot learning: in medical data, for example, the prompts target diseases and symptoms; in legal data, case precedents and legal arguments. An iterative “gleaning” process then re-prompts the LLM so that initially overlooked entities are captured in later passes.  
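The extraction step can be sketched as parsing a delimited LLM response into nodes and edges. The record format below is a simplified stand-in for the tuple format used in the actual extraction prompts, and the mocked response replaces a real LLM call:

```python
# Sketch: parse a delimited extraction response into nodes and edges.
# In practice this response comes from the LLM; gleaning re-prompts the
# model up to N times and merges newly found records into the same dicts.
def parse_extraction(response: str):
    nodes, edges = {}, []
    for record in response.strip().splitlines():
        fields = [f.strip() for f in record.split("|")]
        if fields[0] == "entity":          # entity|name|type|description
            nodes[fields[1]] = {"type": fields[2], "description": fields[3]}
        elif fields[0] == "relationship":  # relationship|source|target|description
            edges.append((fields[1], fields[2], fields[3]))
    return nodes, edges

# Mocked LLM output for one text chunk:
response = """\
entity|Solar Power|technology|Renewable energy from sunlight
entity|EU Green Deal|policy|European climate policy package
relationship|EU Green Deal|Solar Power|The policy subsidizes solar deployment"""

nodes, edges = parse_extraction(response)
```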
 
3. Generating Element Summaries (Element Instances → Element Summaries)  
Extracted nodes and edges are summarized into concise descriptions. For instance, a node for “Renewable Energy” might have a summary like, “Key trends include solar technology and policy support.” To avoid redundancy, variations of names (e.g., “AI” and “Artificial Intelligence”) are linked, and related elements are grouped. This keeps the resulting graph clear and efficiently organized.  
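The name-linking step can be sketched with a small alias table. The table itself is a hypothetical example; in the actual system, the LLM proposes which names refer to the same entity while generating the summaries:

```python
# Sketch: merge name variations under one canonical node before summarizing.
# ALIASES is a hypothetical, hand-written example of LLM-proposed links.
ALIASES = {"AI": "Artificial Intelligence", "ML": "Machine Learning"}

def canonical(name: str) -> str:
    return ALIASES.get(name, name)

def merge_mentions(mentions: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group per-chunk descriptions under one canonical entity name."""
    merged: dict[str, list[str]] = {}
    for name, description in mentions:
        merged.setdefault(canonical(name), []).append(description)
    return merged

mentions = [
    ("AI", "AI adoption is accelerating"),
    ("Artificial Intelligence", "Regulation of AI is under debate"),
]
merged = merge_mentions(mentions)
```

Both mentions end up under a single “Artificial Intelligence” node, so the element summary covers all of its descriptions instead of splitting them across duplicates.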
 
4. Creating Graphs and Communities (Element Summaries → Graph Communities)
The summarized nodes and edges are connected to form a graph. Using the Leiden algorithm, related nodes are grouped into communities.  
 
For example, in news data, a “Climate Change” community might group related topics like “Renewable Energy,” “Carbon Emissions,” and “Policy Reform.” Hierarchical structuring of these communities enables efficient analysis of complex data structures.
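A minimal sketch of the grouping step, assuming an edge list as input. Note that Graph RAG uses the Leiden algorithm for hierarchical community detection; the connected-components grouping below is a much cruder stand-in used only to illustrate the idea of partitioning the graph:

```python
# Sketch: partition graph nodes into groups of connected nodes.
# Leiden (used by Graph RAG) optimizes modularity and is hierarchical;
# connected components are a simplified stand-in.
from collections import defaultdict

def communities(edges: list[tuple[str, str]]) -> list[set[str]]:
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, groups = set(), []
    for node in adjacency:
        if node in seen:
            continue
        group, stack = set(), [node]
        while stack:                      # depth-first traversal
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(adjacency[current] - group)
        seen |= group
        groups.append(group)
    return groups

edges = [
    ("Climate Change", "Renewable Energy"),
    ("Renewable Energy", "Carbon Emissions"),
    ("Generative AI", "Content Creation"),
]
groups = communities(edges)
```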
 
5. Summarizing Communities (Graph Communities → Community Summaries)  

Each community is summarized into a report using LLMs. These summaries include key nodes and edges and can function as:

  1. Indexes to answer specific questions.  
  2. Standalone insights to understand the dataset’s structure and meaning even without a query.
 
Here’s how community summaries are structured:  
  • Leaf-Level Communities: Detailed summaries include key nodes, edges, and covariates, ordered by importance and added to the LLM’s context window.
  • Higher-Level Communities: Summaries from leaf communities are aggregated. If context window limits are exceeded, lower-level summaries are compressed into shorter text while retaining critical information.  
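The context-packing logic above can be sketched as follows, with precomputed importance scores and a whitespace token count as stand-ins for the real ordering heuristics and tokenizer:

```python
# Sketch: pack element summaries into the LLM context window in
# descending order of importance, stopping at a token budget.
# Token counts use a whitespace split as a stand-in for a real tokenizer.
def pack_context(summaries: list[tuple[float, str]], budget: int = 8000) -> str:
    """summaries: (importance, text) pairs; returns the packed context."""
    packed, used = [], 0
    for importance, text in sorted(summaries, key=lambda s: -s[0]):
        cost = len(text.split())
        if used + cost > budget:
            break  # the real system compresses lower-level summaries instead
        packed.append(text)
        used += cost
    return "\n".join(packed)

summaries = [(0.9, "solar policy summary " * 3), (0.5, "minor detail " * 3000)]
context = pack_context(summaries, budget=100)
```

Here the oversized low-importance summary is simply dropped; as the bullet above notes, Graph RAG would compress it into shorter text rather than discard it.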
 
6. Generating Responses (Community Summaries → Community Answers → Global Answer)
When a query is provided, Graph RAG generates intermediate responses based on relevant community summaries. Each response is scored by the LLM for utility (0–100). The responses with the highest scores are then combined to form a final global answer.
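A minimal sketch of this map-reduce step, with a lookup table standing in for the LLM's 0–100 helpfulness scoring and a simple concatenation standing in for the final reduce call:

```python
# Sketch of the map-reduce query step: score each community answer,
# keep the top-scoring ones, and combine them into a global answer.
# `score` stands in for an LLM scoring call; the join stands in for
# the final LLM reduce call.
def global_answer(partial_answers: list[str], score, top_k: int = 3) -> str:
    scored = [(score(a), a) for a in partial_answers]
    scored = [(s, a) for s, a in scored if s > 0]    # drop unhelpful answers
    best = sorted(scored, key=lambda p: -p[0])[:top_k]
    return " ".join(a for _, a in best)

answers = ["AI ethics is a key trend.", "Off-topic remark.",
           "Data privacy rules tighten."]
fake_scores = {"AI ethics is a key trend.": 90,
               "Off-topic remark.": 0,
               "Data privacy rules tighten.": 75}
final = global_answer(answers, score=fake_scores.get)
```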

Back into the Library

Imagine an LLM receiving the question, “What are the key trends in the tech industry today?” and generating a response using Graph RAG.


Community Summarization:
The dataset is divided into multiple communities. For instance, an “AI Ethics” community might include nodes like “AI Transparency” and “Responsible Data Usage.”

Question Processing:
When a query is received, Graph RAG identifies the most relevant community summaries. For the question “What are the key trends in the tech industry today?”, it references summaries from the “AI Ethics,” “Data Privacy,” and “Generative AI Applications” communities.

Final Response Construction:
Based on the relevant community summaries, the LLM generates individual responses:
  • AI Ethics Community: “AI transparency and responsible data usage are critical topics of discussion.”
  • Data Privacy Community: “Data protection regulations are tightening, driving advancements in technologies for secure personal data management.”
  • Generative AI Applications Community: “Generative AI is being applied across industries, including content creation, customer service, and product design.”

These responses are then combined to form a comprehensive and structured final answer:

“The key trends in the tech industry today are AI ethics, data privacy, and generative AI applications. AI transparency and responsible data usage are emphasized, while data protection regulations are being strengthened. Additionally, generative AI is expanding its applicability across various industries.”

Coming Soon

We explored how Graph RAG structures vast amounts of data and organizes it into communities. Rather than simply listing data, Graph RAG extracts context and meaning through graph-based structures and community summaries, enabling efficient and effective analysis of large datasets.  
 
In our next article, we’ll dive into the performance evaluation and use cases of Graph RAG. Let’s discover just how powerful this tool is and the innovations it brings to data analysis and summarization. Stay tuned! 🚀
