langchain-data-handling

Load and chunk documents for LangChain 1.0 RAG pipelines correctly — language-aware splitters, table-safe PDF loaders, Cloudflare-compatible web loaders, chunk-boundary strategies that survive real-world structure. Use when building a RAG pipeline, diagnosing why retrieval misquotes a table, or debugging a crawler returning blank content. Trigger with "langchain document loader", "text splitter", "chunking strategy", "pdf loader", "markdown splitter", "webbaseloader".

5 Tools · langchain-py-pack Plugin · saas packs Category

Allowed Tools

Read, Write, Edit, Bash(python:*), Bash(pip:*)

Provided by Plugin

langchain-py-pack

Claude Code skill pack for LangChain 1.0 + LangGraph 1.0 (Python) - 34 skills covering chains, agents, RAG, middleware, checkpointing, HITL, streaming, and production patterns

saas packs v2.0.0

Installation

This skill is included in the langchain-py-pack plugin:

/plugin install langchain-py-pack@claude-code-plugins-plus


Instructions

LangChain Data Handling — Loaders and Splitters (Python)

Overview

You have a RAG system over a Python docs site. A user asks "what does
trim_messages do?" and the retriever returns this chunk:

### `trim_messages(strategy="last", include_system=True)`

Trim a message history to fit a token budget. The newest messages are kept;
older messages are dropped. Pass `include_system=True` to preserve the system

...and that's it. The chunk ends there. The code example showing the function
body — the actual thing the user wanted — is in a different chunk, retrieved
with a lower similarity score and dropped before the LLM sees it. The model
then hallucinates the function's behavior from the signature alone.

This is pain-catalog entry P13. RecursiveCharacterTextSplitter's default
separators are ["\n\n", "\n", " ", ""]. It splits on any blank line — including
inside triple-backtick code fences in Markdown. The fix is a one-line swap
to RecursiveCharacterTextSplitter.from_language(Language.MARKDOWN), which
treats the fence as an atomic unit, but you have to know the bug exists.

The sibling failures this skill prevents:

  • P49 — PyPDFLoader splits by page. A 5-row financial table that spans
    a page break gets torn in half; rows 1-3 go in one chunk, rows 4-5 in
    another with no header. A RAG answer sourced from the second chunk
    misquotes the numbers because the column meanings are in the first chunk.
    Fix: use PyMuPDFLoader or UnstructuredPDFLoader, which detect tables and
    emit them as distinct structured elements.

  • P50 — WebBaseLoader's default User-Agent is python-requests/2.x.
    Cloudflare-protected sites flag this as a bot and return a **403
    interstitial HTML page** ("Checking your browser...") instead of real
    content. The crawler indexes the challenge page. You notice weeks later
    when every retrieval from that source returns the same Cloudflare text.
    Fix: set a realistic header_template={"User-Agent": "Mozilla/5.0 ..."},
    respect robots.txt, and rate-limit per-host to 1 req/sec.

Pinned versions: langchain-core 1.0.x, langchain-community 1.0.x,
langchain-text-splitters 1.0.x, pymupdf, unstructured.

Pain-catalog anchors: P13, P49, P50, P15.

This skill is the upstream half of the RAG pipeline — load and chunk.
For the downstream half (embedding, scoring, reranking) see the pair skill
langchain-embeddings-search, which covers score semantics (P12), dim guards
(P14), and reranker filtering (P15). Do not re-implement chunking there.

Prerequisites

  • Python 3.10+
  • langchain-core >= 1.0, < 2.0 and langchain-community >= 1.0, < 2.0
  • langchain-text-splitters >= 1.0, < 2.0
  • PDF support: pip install pymupdf unstructured[pdf]
  • Web loading: pip install beautifulsoup4 requests
  • For corpus dedup (optional): pip install datasketch

Instructions

Step 1 — Choose a loader by source format

Loader selection is the first decision — get it wrong and no amount of
splitter tuning will recover. Use the decision table:

| Source | Use | NOT | Why |
|---|---|---|---|
| PDF with tables | PyMuPDFLoader or UnstructuredPDFLoader | PyPDFLoader | Tables torn by page splits (P49) |
| PDF text-only | PyPDFLoader | — | Simple, fast, OK when no tables |
| Web page | WebBaseLoader(header_template=...) | Default UA | Cloudflare 403 (P50) |
| Markdown docs | UnstructuredMarkdownLoader | Plain text read | Preserves heading structure |
| HTML long-form | WebBaseLoader + HTMLHeaderTextSplitter | Plain text | Keeps h1/h2 heading context |
| Code repo | GenericLoader with language parser | DirectoryLoader as text | Language-aware chunking |
| Corpus (1000+ docs) | DirectoryLoader + glob filter | One-by-one | Parallel load, progress |

from langchain_community.document_loaders import (
    PyMuPDFLoader,            # table-aware PDF
    WebBaseLoader,            # web pages (set custom UA)
    UnstructuredMarkdownLoader,
    DirectoryLoader,
)

# PDF with tables — P49 fix
pdf_docs = PyMuPDFLoader("10-Q-filing.pdf").load()

# Web page — P50 fix
web_docs = WebBaseLoader(
    "https://example.com/article",
    header_template={
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
    },
).load()

# Markdown docs site
md_docs = UnstructuredMarkdownLoader("docs/guide.md").load()

# Corpus
corpus = DirectoryLoader(
    "./docs", glob="**/*.md",
    loader_cls=UnstructuredMarkdownLoader,
    show_progress=True,
).load()

Hard limit: keep single-PDF ingestion under 5 MB per call. Larger files
should be pre-split with pdftk / qpdf to avoid OOM on PyMuPDFLoader's
full-document parse.

See Loader Selection Matrix for the full per-format table with cost and
accuracy notes.

Step 2 — Pick a splitter by content type

| Content | Splitter | chunk_size | chunk_overlap | Why |
|---|---|---|---|---|
| Prose (docs, articles) | RecursiveCharacterTextSplitter.from_language(Language.MARKDOWN) | 1000 | 100 | Preserves code fences (P13) |
| Python source | RecursiveCharacterTextSplitter.from_language(Language.PYTHON) | 1500 | 150 | Splits at def/class |
| FAQ / Q&A | RecursiveCharacterTextSplitter with separators=["\n\n"] | 500 | 50 | One chunk per Q-A pair |
| HTML long-form | HTMLHeaderTextSplitter | — | — | Headers become metadata |
| Generic text | RecursiveCharacterTextSplitter | 1000 | 100 | Safe default |

from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    Language,
    HTMLHeaderTextSplitter,
)

# GOOD — P13 fix for Markdown
md_splitter = RecursiveCharacterTextSplitter.from_language(
    Language.MARKDOWN, chunk_size=1000, chunk_overlap=100,
)

# GOOD — Python code
py_splitter = RecursiveCharacterTextSplitter.from_language(
    Language.PYTHON, chunk_size=1500, chunk_overlap=150,
)

# GOOD — HTML long-form with heading-as-metadata
html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")],
)

# BAD — breaks inside code fences (P13)
bad = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

See Language-Aware Splitters for the full list of Language.* enum values,
custom separator patterns, and the code-fence-detection regex for when you
need a custom splitter.
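The fence-detection idea behind that custom-splitter escape hatch can be sketched in plain regex: find the fenced regions first, then accept only blank-line split points that fall outside them. This is a stdlib illustration of the principle, not the langchain-text-splitters implementation:

```python
import re

# A fenced block: a line starting with ``` through the next line of ```
FENCE_RE = re.compile(r"^```.*?^```[ \t]*$", re.M | re.S)

def safe_split_points(text: str) -> list[int]:
    """Offsets of blank-line separators that are NOT inside a code fence."""
    fences = [m.span() for m in FENCE_RE.finditer(text)]
    return [
        m.start()
        for m in re.finditer(r"\n\n", text)
        if not any(a <= m.start() < b for a, b in fences)
    ]

doc = "intro\n\n```python\nx = 1\n\ny = 2\n```\n\noutro"
# the blank line inside the fence is skipped; the two outside survive
assert len(safe_split_points(doc)) == 2
```

This is exactly what the plain RecursiveCharacterTextSplitter fails to do with its default separators, and what the Language.MARKDOWN preset handles for you.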

Step 3 — Tune chunk_size and overlap

Defaults from the table work for most corpora. Tune when:

  • Retrieval misses context: increase chunk_size (1000 → 1500) or
    chunk_overlap (100 → 200). Overlap is what bridges a concept that
    crosses chunk boundaries.
  • Retrieval too broad, answers wander: decrease chunk_size (1000 → 500).
    Smaller chunks = more precise retrieval but more chunks to index.
  • Tables / structured data: do NOT tune — index them separately (step 4).

A 1% overlap-to-size ratio is too low (200/20000); 20% is the sweet spot for
most prose. Code needs less overlap (10%) because function boundaries are
natural splits.
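The three tuning moves can be written down as a lookup so the trade-offs stay explicit. A sketch — the symptom names are this example's own, the numbers mirror the step-2 table:

```python
DEFAULTS = {"chunk_size": 1000, "chunk_overlap": 100}

def tuned_params(symptom: str) -> dict:
    """Map a retrieval symptom to splitter kwargs, per the rules above."""
    if symptom == "missing_context":
        # bigger chunks + more overlap bridges cross-boundary concepts
        return {"chunk_size": 1500, "chunk_overlap": 200}
    if symptom == "answers_wander":
        # smaller chunks sharpen retrieval at the cost of more index entries
        return {"chunk_size": 500, "chunk_overlap": 50}
    return dict(DEFAULTS)

# overlap-to-size sanity check: prose wants roughly 10-20%, never ~1%
for p in (DEFAULTS, tuned_params("missing_context"), tuned_params("answers_wander")):
    assert 0.05 <= p["chunk_overlap"] / p["chunk_size"] <= 0.25
```

The resulting dict is passed straight to `RecursiveCharacterTextSplitter.from_language(Language.MARKDOWN, **tuned_params(...))`.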

Step 4 — Detect and index tables as structured records

Tables are not text. If your corpus has financial filings, product specs,
or any tabular data, index tables as separate records with column metadata:


import fitz  # pymupdf directly for table detection

def extract_tables_as_records(pdf_path: str) -> list[dict]:
    """Extract tables as one record per row."""
    doc = fitz.open(pdf_path)
    records = []
    for page_num, page in enumerate(doc):
        for table_idx, table in enumerate(page.find_tables().tables):
            rows = table.extract()
            if not rows:
                continue
            headers = rows[0]
            for row_idx, row in enumerate(rows[1:], start=1):
                record = {
                    "page": page_num,
                    "table_idx": table_idx,
                    "row_idx": row_idx,
                    "content": " | ".join(f"{h}: {v}" for h, v in zip(headers, row)),
                    "metadata": dict(zip(headers, row)),
                }
                records.append(record)
    return records

Now a question like "what was Q3 revenue?" retrieves a single row with its
column headers attached, not half a table missing the column meanings. See
Table Preservation for the full pattern including hybrid retrieval
(prose + table records).

Step 5 — Preserve metadata through the pipeline

The loader attaches metadata (source, page, heading); the splitter propagates
it. Front-matter in Markdown, PDF page numbers, and web URLs should all end
up in doc.metadata so retrieval results are citable:


for doc in md_docs:
    # Markdown front-matter (if loader extracted it)
    print(doc.metadata.get("title"), doc.metadata.get("date"))

# Splitter-preserved metadata
chunks = md_splitter.split_documents(md_docs)
assert chunks[0].metadata == md_docs[0].metadata  # preserved

Custom metadata (tenant_id, version, confidence) should be added before
splitting so every chunk inherits it.
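The inheritance rule is easy to verify with a toy model. A stdlib-only sketch — `Document` here is a stand-in for `langchain_core.documents.Document`, and `naive_split` mimics what `split_documents()` does with metadata:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

def naive_split(doc: Document, size: int) -> list[Document]:
    # like split_documents(): each chunk gets a copy of the parent's metadata
    return [
        Document(doc.page_content[i:i + size], dict(doc.metadata))
        for i in range(0, len(doc.page_content), size)
    ]

doc = Document("x" * 2500, {"source": "docs/guide.md"})
doc.metadata["tenant_id"] = "acme"  # added BEFORE splitting
chunks = naive_split(doc, 1000)
assert all(c.metadata["tenant_id"] == "acme" for c in chunks)
```

Add the same keys after splitting and you would have to touch every chunk individually — and any chunk created later by re-ingestion would silently miss them.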

Step 6 — Deduplicate noisy corpora

Web crawls and scraped docs often contain near-duplicate pages (nav chrome,
footer boilerplate, syndicated posts). MinHash-based dedup at the chunk level
keeps the index clean:


from datasketch import MinHash, MinHashLSH

lsh = MinHashLSH(threshold=0.9, num_perm=128)
kept = []
for i, chunk in enumerate(all_chunks):
    mh = MinHash(num_perm=128)
    for tok in chunk.page_content.lower().split():
        mh.update(tok.encode())
    if not list(lsh.query(mh)):
        lsh.insert(str(i), mh)
        kept.append(chunk)

A threshold of 0.9 catches near-duplicates (minor wording differences) without
eating legitimate paraphrases.
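The loop above can be wrapped as a helper with the same shape as the `dedup_minhash(...)` call commented out in step 7. Shown here stdlib-only with exact Jaccard similarity for clarity; swap in the MinHashLSH version above for large corpora, where exact pairwise comparison is too slow:

```python
def dedup_jaccard(texts: list[str], threshold: float = 0.9) -> list[str]:
    """Keep each text unless it is >= threshold Jaccard-similar to a kept one."""
    kept_texts: list[str] = []
    kept_tokens: list[set[str]] = []
    for text in texts:
        toks = set(text.lower().split())
        dup = any(
            toks and seen and len(toks & seen) / len(toks | seen) >= threshold
            for seen in kept_tokens
        )
        if not dup:
            kept_tokens.append(toks)
            kept_texts.append(text)
    return kept_texts

a = "alpha beta gamma delta epsilon zeta eta theta iota kappa"
b = a + " lambda"  # Jaccard 10/11 ≈ 0.91 — flagged as near-duplicate
c = "completely different content about web crawling"
assert dedup_jaccard([a, b, c]) == [a, c]
```

For LangChain chunks, pass `chunk.page_content` in and map the survivors back to their Document objects.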

Step 7 — Compose the pipeline


# Multi-stage: load → split → dedup → index
def build_rag_index(source_dir: str, store):
    # 1. Load
    docs = DirectoryLoader(
        source_dir, glob="**/*.md",
        loader_cls=UnstructuredMarkdownLoader,
    ).load()

    # 2. Clean (empty-content filter)
    docs = [d for d in docs if d.page_content.strip()]

    # 3. Split (language-aware)
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.MARKDOWN, chunk_size=1000, chunk_overlap=100,
    )
    chunks = splitter.split_documents(docs)

    # 4. Dedup (optional for noisy corpora)
    # chunks = dedup_minhash(chunks, threshold=0.9)

    # 5. Index — handoff to langchain-embeddings-search
    store.add_documents(chunks)
    return store

For the embedding + indexing + retrieval steps, see langchain-embeddings-search.

Output

  • Loader chosen from the selection matrix matching source format and table needs
  • Splitter chosen from the decision tree matching content type
  • Chunk size + overlap tuned from the defaults (1000/100 prose, 1500/150 code, 500/50 FAQ)
  • Tables extracted as structured records with column metadata (not text chunks)
  • Web loaders configured with realistic User-Agent and robots.txt respect
  • Metadata preserved through loader → splitter → index
  • Optional MinHash dedup (threshold 0.9) for noisy corpora

Error Handling

| Error / symptom | Cause | Fix |
|---|---|---|
| RAG retrieves function signature without body | RecursiveCharacterTextSplitter broke inside code fence (P13) | Use from_language(Language.MARKDOWN) or add "`" as first separator |
| Table rows misquoted in RAG answer | PyPDFLoader tore table by page (P49) | Switch to PyMuPDFLoader; index tables as structured records |
| WebBaseLoader returns 403 / blank content | Default UA flagged by Cloudflare (P50) | Set header_template={"User-Agent": "Mozilla/5.0 ..."}; respect robots.txt |
| ValueError: expected str, NoneType found during split | Empty page_content | Filter [d for d in docs if d.page_content.strip()] before splitting |
| MemoryError loading PDF | PDF > 5 MB ingested in one call | Pre-split with pdftk / qpdf; process chunks separately |
| Chunks missing metadata after split | Custom metadata added after splitting was lost | Add metadata before split_documents(); verify chunks[0].metadata preserved |
| Retrieval quality low on FAQ corpus | Chunks too large, one chunk holds multiple Q-A pairs | Drop to chunk_size=500, chunk_overlap=50 with separators=["\n\n"] |
| Web crawl indexes Cloudflare challenge page | No check for HTTP status / response length | Assert len(doc.page_content) > 500 and reject pages containing "Checking your browser" |
| Duplicate chunks eat retrieval slots | Syndicated content, nav chrome not stripped | MinHash dedup at threshold 0.9 before indexing |
| Reranker scores inconsistent across chunks | Chunks of wildly different size change score distribution (P15) | Normalize chunk size within a corpus; target ±20% of chunk_size |
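The two crawl symptoms above (challenge pages, blank content) share one guard. A minimal sketch — the helper name and marker list are illustrative, not a LangChain API:

```python
# Known interstitial phrases from bot-protection challenge pages
INTERSTITIAL_MARKERS = ("Checking your browser", "Just a moment...")

def is_real_content(text: str, min_len: int = 500) -> bool:
    """Reject short pages and known challenge-page markers before indexing."""
    if len(text) < min_len:
        return False
    return not any(marker in text for marker in INTERSTITIAL_MARKERS)

assert not is_real_content("Checking your browser before accessing...")
assert is_real_content("Real article body. " * 50)
```

Run it on every `doc.page_content` after `WebBaseLoader(...).load()` and drop (and log) anything that fails, so a Cloudflare block surfaces at crawl time instead of weeks later in retrieval.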

Examples

Ingesting a Markdown docs site with code examples

Markdown docs with Python code fences require Language.MARKDOWN to keep
fence boundaries intact. Chunk size 1000 with 100 overlap preserves one
function-sized example per chunk. Front-matter fields (title, date, author)
are attached as metadata for citation. See Language-Aware Splitters.

Ingesting a PDF filing with financial tables

10-Q filings have dozens of multi-row tables. Use PyMuPDFLoader for the prose
and a direct fitz find_tables() pass to extract tables as structured records.
Index prose with chunk_size=1000 and tables as one-row-per-record with the
header row concatenated. Questions like "what was Q3 revenue?" hit a single
row with column meanings attached. See Table Preservation.

Crawling a documentation site behind Cloudflare

Set a realistic User-Agent, fetch robots.txt first and respect Disallow
rules, rate-limit to 1 req/sec per host, and prefer the site's sitemap or RSS
feed when available. Assert response length > 500 chars and reject known
interstitial patterns. See Crawler Hygiene.
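The robots.txt and rate-limit rules can be sketched with the stdlib alone. The UA string and URLs here are illustrative; in production you would `set_url(...)` / `read()` the real robots.txt instead of parsing inline rules:

```python
import time
from urllib.robotparser import RobotFileParser

UA = "Mozilla/5.0 (compatible; DocsIndexer/1.0)"  # hypothetical crawler UA

# Respect Disallow rules before fetching anything
rp = RobotFileParser()
# Production: rp.set_url("https://example.com/robots.txt"); rp.read()
rp.parse(["User-agent: *", "Disallow: /admin/"])
assert rp.can_fetch(UA, "https://example.com/docs/guide")
assert not rp.can_fetch(UA, "https://example.com/admin/panel")

_last_hit: dict[str, float] = {}

def throttle(host: str, min_interval: float = 1.0) -> None:
    """Sleep so the same host is hit at most once per min_interval seconds."""
    now = time.monotonic()
    wait = _last_hit.get(host, 0.0) + min_interval - now
    if wait > 0:
        time.sleep(wait)
    _last_hit[host] = time.monotonic()
```

Call `throttle(host)` before each WebBaseLoader fetch to the same host; the 1-second default matches the per-host limit above.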

Ingesting a Python code repo for code RAG

GenericLoader with LanguageParser(language=Language.PYTHON) preserves
function and class boundaries. Chunk size 1500 with 150 overlap gives enough
context for typical function-level queries. Imports and module docstrings
end up in their own chunks — tag them with metadata for higher precision
retrieval on "where is X imported from" queries.
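What "preserves function and class boundaries" means can be shown with the stdlib ast module: one record per top-level def/class, which is the shape LanguageParser produces. A sketch of the principle, not the LangChain parser itself:

```python
import ast

SRC = '''\
import os

def load(path):
    return open(path).read()

class Cache:
    def get(self, k):
        return None
'''

def split_at_defs(source: str) -> list[str]:
    """One chunk per top-level function/class, aligned with code structure."""
    tree = ast.parse(source)
    lines = source.splitlines()
    return [
        "\n".join(lines[node.lineno - 1:node.end_lineno])
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]

parts = split_at_defs(SRC)
assert len(parts) == 2
assert parts[0].startswith("def load")
assert parts[1].startswith("class Cache")
```

Note the `import os` line falls outside every chunk here — exactly the "imports end up in their own chunks" behavior described above, which is why tagging those with metadata pays off.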

Resources
