Chunking & Indexing Data

Indexing is essential for searching chunked data effectively. It is the process of storing chunks in a database in a way that allows them to be retrieved efficiently.

Indexed data enables quick similarity comparisons with search queries, helping the system identify the most relevant information. This process builds the infrastructure needed to handle complex search requests with both accuracy and speed.
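As a rough illustration of what an index provides, the sketch below builds a minimal in-memory vector index that ranks stored chunks by cosine similarity against a query embedding. The class name and brute-force search are illustrative only; production systems typically use a vector database with approximate-nearest-neighbor indexes.

```python
import numpy as np

class SimpleVectorIndex:
    """A toy vector index: stores chunk embeddings and retrieves
    the most similar chunks by cosine similarity."""

    def __init__(self):
        self.vectors = []   # embedding vectors
        self.chunks = []    # chunk texts, aligned with self.vectors

    def add(self, chunk_text, embedding):
        self.chunks.append(chunk_text)
        self.vectors.append(np.asarray(embedding, dtype=float))

    def search(self, query_embedding, top_k=3):
        q = np.asarray(query_embedding, dtype=float)
        # Cosine similarity between the query and every stored vector
        sims = [
            float(np.dot(v, q) / (np.linalg.norm(v) * np.linalg.norm(q)))
            for v in self.vectors
        ]
        # Highest-similarity chunks first
        order = np.argsort(sims)[::-1][:top_k]
        return [(self.chunks[i], sims[i]) for i in order]
```

Even this brute-force version shows the core contract of an index: embeddings go in once, and similarity-ranked chunks come out at query time.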

Chunking Strategy for Accurate Search

For AI to read and respond effectively, how data is divided into chunks is critically important. AI models have limited context windows and cannot process large amounts of text at once, and languages like Korean, with their structural complexity, pose an added challenge. This is why chunking is essential.

However, it’s not just about splitting text arbitrarily. Each chunk must contain enough meaningful information for the AI to accurately answer questions. If chunks lack density or relevance, responses may become fragmented or incorrect. The denser and more contextually rich each chunk is, the better AI can understand the context and deliver precise answers. Therefore, information density and semantic relevance must be prioritized during the chunking process.


Levels of Chunking Strategy: Level 1–5

Chunking strategies can be categorized into levels based on how data is segmented. Each level handles a different degree of complexity and information density, enabling AI to efficiently search, retrieve, and utilize information.


Level 1–3: Basic Chunking


[Level 1] Fixed-Size Chunking:

  • Divides text into fixed character lengths (e.g., 500 characters).
  • Pros: Simple to implement.
  • Cons: Ignores context and structure, often breaking sentences awkwardly.
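A minimal sketch of fixed-size chunking (the 500-character size and the overlap value are illustrative, not prescribed by any particular library):

```python
def fixed_size_chunks(text, size=500, overlap=50):
    # Slide a window of `size` characters, stepping by size - overlap,
    # so adjacent chunks share some context across the boundary.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```

The overlap is a common mitigation for the main drawback noted above: a sentence cut at a chunk boundary still appears whole in at least one chunk.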

[Level 2] Recursive Chunking:

  • Splits text based on spaces or sentence boundaries.
  • Pros: Better reflects sentence structure.
  • Cons: May still disrupt semantic meaning across chunks.
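A simplified sketch of the recursive idea: try the coarsest separator first, fall back to finer separators only for oversized pieces, then merge adjacent pieces back up to the size limit. The separator list and length limit here are illustrative.

```python
def recursive_chunks(text, max_len=200, separators=("\n\n", ". ", " ")):
    # Base case: text fits, or no separators left to try.
    if len(text) <= max_len or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if len(part) <= max_len:
            pieces.append(part)
        else:
            # Still too long: retry with the next, finer separator.
            pieces.extend(recursive_chunks(part, max_len, rest))
    # Merge adjacent pieces back together while they fit, so chunks
    # stay as large (and as contextual) as possible.
    merged, current = [], ""
    for piece in pieces:
        candidate = (current + sep + piece) if current else piece
        if len(candidate) <= max_len:
            current = candidate
        else:
            if current:
                merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged
```

Pieces that cannot be split any further (a single very long word, for example) pass through whole, which is why this method can still exceed the limit in pathological cases.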

[Level 3] Document-Based Chunking:

  • Divides text based on document structure (e.g., sections, titles).
  • Pros: Preserves the document’s natural flow and context.
  • Cons: Effectiveness depends on the clarity of the document’s structure.

Special Cases in Document-Based Chunking:

  • Markdown Documents:
    Uses Markdown separators (e.g., #, -, ---) as chunking points.
    Example: Each section marked by # becomes a separate chunk.
  • Python/JavaScript Documents:
    Uses code structures like class and function as chunking boundaries.
    Example: Each def or class block forms a separate chunk.
  • Table-Based Documents:
    Fixed-size or sentence-based chunking isn’t ideal for tables. Instead:
    • Convert tables into Markdown format.
    • Summarize table contents into embedding vectors for semantic search.
  • Image-Based Documents (Multi-Modal):
    AI cannot directly interpret images in traditional chunking methods. Instead:
    • Use Multi-Modal LLMs to extract image descriptions.
    • Convert these descriptions into embedding vectors for later search and retrieval.
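For the Markdown case above, a minimal sketch of heading-based splitting might look like this (the function is illustrative, not a specific library's API):

```python
import re

def markdown_chunks(text):
    # Split a Markdown document at headings ('#' through '######'),
    # keeping each heading together with the body that follows it.
    chunks = []
    current = []
    for line in text.splitlines():
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks
```

The same pattern generalizes to the code case: swap the heading regex for one matching `def` or `class` lines to chunk Python source at function and class boundaries.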

Level 4–5: Advanced Chunking Strategies

As we move into advanced chunking techniques, the focus shifts from simple divisions based on size or structure to deep semantic analysis and proposition-driven organization. These methods enable AI to process and retrieve information with higher precision and context awareness.


[Level 4] Semantic Chunking

Semantic Chunking analyzes the semantic similarity between sentences to create meaningful chunks. Instead of relying on text length or document structure, this method identifies logical breakpoints based on shifts in meaning.

How It Works:

  1. Sentence to Embedding Vectors:
    • Each sentence (or group of sentences) is converted into an embedding vector—a numerical representation of its meaning.
  2. Similarity Calculation:
    • The cosine similarity between adjacent embeddings is calculated to measure how semantically related they are.
  3. Identifying Breakpoints:
    • When the cosine similarity drops sharply (a Breakpoint), a new chunk is created.
    • This threshold is often visualized on a graph to identify natural semantic divisions.
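The three steps above can be sketched as follows. Here `embed` stands in for any sentence-embedding function (e.g. a sentence-transformer model), and the 0.6 threshold is an arbitrary illustrative value; in practice the threshold is tuned, often by inspecting the similarity graph mentioned above.

```python
import numpy as np

def semantic_chunks(sentences, embed, threshold=0.6):
    # Step 1: convert each sentence to an embedding vector.
    vectors = [np.asarray(embed(s), dtype=float) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, curr, sent in zip(vectors, vectors[1:], sentences[1:]):
        # Step 2: cosine similarity between adjacent embeddings.
        sim = np.dot(prev, curr) / (np.linalg.norm(prev) * np.linalg.norm(curr))
        # Step 3: a sharp drop in similarity is a breakpoint.
        if sim < threshold:
            chunks.append([sent])       # start a new chunk
        else:
            chunks[-1].append(sent)     # continue the current chunk
    return [" ".join(c) for c in chunks]
```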

Key Strength:
Semantic chunking ensures that each chunk contains logically connected ideas, enabling the AI to better understand and respond to context-specific queries.


[Level 5] Agentic Chunking

Agentic Chunking represents the most advanced chunking strategy, where AI itself actively identifies key propositions and organizes them into meaningful chunks.

How It Works:

  1. Proposition Extraction:
    • The AI scans the document and extracts key propositions—self-contained units of core ideas or facts.
  2. Assigning Propositions to Chunks:
    • The extracted propositions are analyzed to determine if they fit into existing chunks or if a new chunk needs to be created.
    • This decision is made using LLM prompting techniques.
  3. Chunk Property Updates:
    • If the proposition fits into an existing chunk, the chunk’s summary and metadata are updated.
    • If a new chunk is needed, it’s created with the extracted proposition as its core.
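A schematic sketch of this loop, with `extract_propositions` and `pick_chunk` as hypothetical stand-ins for the LLM prompting steps described above (neither is a real API; only the control flow is the point here):

```python
def agentic_chunking(document, extract_propositions, pick_chunk):
    # Each chunk holds a running summary plus its member propositions.
    chunks = []
    for prop in extract_propositions(document):       # step 1: extraction
        idx = pick_chunk(prop, chunks)                # step 2: LLM decides fit
        if idx is None:                               # step 3a: new chunk
            chunks.append({"summary": prop, "propositions": [prop]})
        else:                                         # step 3b: update existing
            chunks[idx]["propositions"].append(prop)
            chunks[idx]["summary"] += " | " + prop
    return chunks
```

In a real system, both helper functions would be LLM calls, which is what makes this level far more expensive, and far more context-aware, than the levels below it.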

Key Strength:

Agentic chunking delivers the highest information density and logical clarity by centering chunks around meaningful propositions. It reduces the risk of misaligned chunks and ensures that critical information is preserved in each segment.


Key Differences Between Level 4 and Level 5

While Semantic Chunking (Level 4) focuses on logical divisions based on meaning shifts, Agentic Chunking (Level 5) takes it a step further by actively identifying and organizing core propositions.

  • Level 4: Primarily uses semantic similarity to define boundaries between chunks.
  • Level 5: Involves AI-driven analysis and extraction of propositions, creating highly optimized chunks tailored for advanced retrieval and reasoning tasks.

Both strategies are powerful tools for data structuring in RAG systems, ensuring AI can efficiently process, search, and generate accurate responses from vast and complex datasets.

Data structuring and chunking techniques are critical factors that determine the performance of RAG systems. The strategic approach to different chunking levels plays a key role in how precisely AI can respond to complex queries.

In our next article, we’ll explore another crucial aspect of data structuring: “Data Structures for Accurate Retrieval.” We’ll dive into advanced strategies like tree-based and graph-based approaches, uncovering how they further enhance and refine RAG systems. 

