langchain-data-handling

"Load and chunk documents for LangChain 1.0 RAG pipelines correctly \u2014\

v2.5.0

Jeremy Longshore

MIT

Allowed Tools

ReadWriteEditBash(python:*)Bash(pip:*)

Provided by Plugin

langchain-py-pack

Claude Code skill pack for LangChain 1.0 + LangGraph 1.0 (Python) - 34 skills covering chains, agents, RAG, middleware, checkpointing, HITL, streaming, and production patterns

saas packs v2.5.0

View Plugin

Installation

This skill is included in the langchain-py-pack plugin:

/plugin install langchain-py-pack@claude-code-plugins-plus

Click to copy

Instructions

LangChain Data Handling — Loaders and Splitters (Python)

Overview

You have a RAG system over a Python docs site. A user asks "what does

trim_messages do?" and the retriever returns this chunk:


### `trim_messages(strategy="last", include_system=True)`

Trim a message history to fit a token budget. The newest messages are kept;
older messages are dropped. Pass `include_system=True` to preserve the system

...and that's it. The chunk ends there. The code example showing the function

body — the actual thing the user wanted — is in a different chunk, retrieved

with a lower similarity score and dropped before the LLM sees it. The model

then hallucinates the function's behavior from the signature alone.

This is pain-catalog entry P13. RecursiveCharacterTextSplitter's default

separators are ["\n\n", "\n", " ", ""]. It splits on any blank line — including

inside triple-backtick code fences in Markdown. The fix is a one-line swap

to RecursiveCharacterTextSplitter.from_language(Language.MARKDOWN), which

treats the fence as an atomic unit, but you have to know the bug exists.

The sibling failures this skill prevents:

P49 — PyPDFLoader splits by page. A 5-row financial table that spans

a page break gets torn in half; rows 1-3 go in one chunk, rows 4-5 in another

with no header. A RAG answer sourced from the second chunk misquotes the

numbers because the column meanings are in the first chunk. Fix: use

PyMuPDFLoader or UnstructuredPDFLoader, which detect tables and emit

them as distinct structured elements.

P50 — WebBaseLoader's default User-Agent is python-requests/2.x.

Cloudflare-protected sites flag this as a bot and return a **403 interstitial

HTML page** ("Checking your browser...") instead of real content. The crawler

indexes the challenge page. You notice weeks later when every retrieval from

that source returns the same Cloudflare text. Fix: set a realistic

header_template={"User-Agent": "Mozilla/5.0 ..."}, respect robots.txt,

and rate-limit per-host to 1 req/sec.

Pinned versions: langchain-core 1.0.x, langchain-community 1.0.x,

langchain-text-splitters 1.0.x, pymupdf, unstructured.

Pain-catalog anchors: P13, P49, P50, P15.

This skill is the upstream half of the RAG pipeline — load and chunk.

For the downstream half (embedding, scoring, reranking) see the pair skill

langchain-embeddings-search, which covers score semantics (P12), dim guards

(P14), and reranker filtering (P15). Do not re-implement chunking there.

Prerequisites

Python 3.10+
langchain-core >= 1.0, < 2.0 and langchain-community >= 1.0, < 2.0
langchain-text-splitters >= 1.0, < 2.0
PDF support: pip install pymupdf unstructured[pdf]
Web loading: pip install beautifulsoup4 requests
For corpus dedup (optional): pip install datasketch

Instructions

Step 1 — Choose a loader by source format

Loader selection is the first decision — get it wrong and no amount of

splitter tuning will recover. Use the decision table:

Source	Use	NOT	Why
PDF with tables	`PyMuPDFLoader` or `UnstructuredPDFLoader`	`PyPDFLoader`	Tables torn by page splits (P49)
PDF text-only	`PyPDFLoader`	—	Simple, fast, OK when no tables
Web page	`WebBaseLoader(header_template=...)`	Default UA	Cloudflare 403 (P50)
Markdown docs	`UnstructuredMarkdownLoader`	Plain text read	Preserves heading structure
HTML long-form	`WebBaseLoader` + `HTMLHeaderTextSplitter`	Plain text	Keeps / context
Code repo	`GenericLoader` with language parser	`DirectoryLoader` as text	Language-aware chunking
Corpus (1000+ docs)	`DirectoryLoader` + `glob` filter	One-by-one	Parallel load, progress


from langchain_community.document_loaders import (
    PyMuPDFLoader,            # table-aware PDF
    WebBaseLoader,            # web pages (set custom UA)
    UnstructuredMarkdownLoader,
    DirectoryLoader,
)

# PDF with tables — P49 fix
pdf_docs = PyMuPDFLoader("10-Q-filing.pdf").load()

# Web page — P50 fix
web_docs = WebBaseLoader(
    "https://example.com/article",
    header_template={
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
    },
).load()

# Markdown docs site
md_docs = UnstructuredMarkdownLoader("docs/guide.md").load()

# Corpus
corpus = DirectoryLoader(
    "./docs", glob="**/*.md",
    loader_cls=UnstructuredMarkdownLoader,
    show_progress=True,
).load()

Hard limit: keep single-PDF ingestion under 5 MB per call. Larger files

should be pre-split with pdftk / qpdf to avoid OOM on PyMuPDFLoader's

full-document parse.

See Loader Selection Matrix for the

full per-format table with cost and accuracy notes.

Step 2 — Pick a splitter by content type

Content	Splitter	chunk_size	chunk_overlap	Why
Prose (docs, articles)	`RecursiveCharacterTextSplitter.from_language(Language.MARKDOWN)`	1000	100	Preserves code fences (P13)
Python source	`RecursiveCharacterTextSplitter.from_language(Language.PYTHON)`	1500	150	Splits at `def`/`class`
FAQ / Q&A	`RecursiveCharacterTextSplitter` with `separators=["\n\n"]`	500	50	One chunk per Q-A pair
HTML long-form	`HTMLHeaderTextSplitter`	—	—	Headers become metadata
Generic text	`RecursiveCharacterTextSplitter`	1000	100	Safe default


from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    Language,
    HTMLHeaderTextSplitter,
)

# GOOD — P13 fix for Markdown
md_splitter = RecursiveCharacterTextSplitter.from_language(
    Language.MARKDOWN, chunk_size=1000, chunk_overlap=100,
)

# GOOD — Python code
py_splitter = RecursiveCharacterTextSplitter.from_language(
    Language.PYTHON, chunk_size=1500, chunk_overlap=150,
)

# GOOD — HTML long-form with heading-as-metadata
html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")],
)

# BAD — breaks inside code fences (P13)
bad = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

See Language-Aware Splitters for the

full list of Language.* enum values, custom separator patterns, and the

code-fence-detection regex for when you need a custom splitter.

Step 3 — Tune chunk_size and overlap

Defaults from the table work for most corpora. Tune when:

Retrieval misses context: increase chunk_size (1000 → 1500) or

chunk_overlap (100 → 200). Overlap is what bridges a concept that crosses

chunk boundaries.

Retrieval too broad, answers wander: decrease chunk_size (1000 → 500).

Smaller chunks = more precise retrieval but more chunks to index.

Tables / structured data: do NOT tune — index them separately (step 4).

A 1% overlap-to-size ratio is too low (200/20000); 20% is the sweet spot for

most prose. Code needs less overlap (10%) because function boundaries are

natural splits.

Step 4 — Detect and index tables as structured records

Tables are not text. If your corpus has financial filings, product specs,

or any tabular data, index tables as separate records with column metadata:


import fitz  # pymupdf directly for table detection

def extract_tables_as_records(pdf_path: str) -> list[dict]:
    """Extract tables as one record per row."""
    doc = fitz.open(pdf_path)
    records = []
    for page_num, page in enumerate(doc):
        tables = page.find_tables()
        for table in tables:
            rows = table.extract()
            if not rows:
                continue
            headers = rows[0]
            for row_idx, row in enumerate(rows[1:], start=1):
                record = {
                    "page": page_num,
                    "table_idx": tables.tables.index(table),
                    "row_idx": row_idx,
                    "content": " | ".join(f"{h}: {v}" for h, v in zip(headers, row)),
                    "metadata": dict(zip(headers, row)),
                }
                records.append(record)
    return records

Now a question like "what was Q3 revenue?" retrieves a single row with its

column headers attached, not half a table missing the column meanings. See

Table Preservation for the full pattern

including hybrid retrieval (prose + table records).

Step 5 — Preserve metadata through the pipeline

The loader attaches metadata (source, page, heading); the splitter propagates

it. Front-matter in Markdown, PDF page numbers, and web URLs should all end

up in doc.metadata so retrieval results are citable:


for doc in md_docs:
    # Markdown front-matter (if loader extracted it)
    print(doc.metadata.get("title"), doc.metadata.get("date"))

# Splitter-preserved metadata
chunks = md_splitter.split_documents(md_docs)
assert chunks[0].metadata == md_docs[0].metadata  # preserved

Custom metadata (tenant_id, version, confidence) should be added before

splitting so every chunk inherits it.

Step 6 — Deduplicate noisy corpora

Web crawls and scraped docs often contain near-duplicate pages (nav chrome,

footer boilerplate, syndicated posts). MinHash-based dedup at the chunk level

keeps the index clean:


from datasketch import MinHash, MinHashLSH

lsh = MinHashLSH(threshold=0.9, num_perm=128)
kept = []
for i, chunk in enumerate(all_chunks):
    mh = MinHash(num_perm=128)
    for tok in chunk.page_content.lower().split():
        mh.update(tok.encode())
    if not list(lsh.query(mh)):
        lsh.insert(str(i), mh)
        kept.append(chunk)

A threshold of 0.9 catches near-duplicates (minor wording differences) without

eating legitimate paraphrases.

Step 7 — Compose the pipeline


# Multi-stage: load → split → dedup → index
def build_rag_index(source_dir: str, store):
    # 1. Load
    docs = DirectoryLoader(
        source_dir, glob="**/*.md",
        loader_cls=UnstructuredMarkdownLoader,
    ).load()

    # 2. Clean (empty-content filter)
    docs = [d for d in docs if d.page_content.strip()]

    # 3. Split (language-aware)
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.MARKDOWN, chunk_size=1000, chunk_overlap=100,
    )
    chunks = splitter.split_documents(docs)

    # 4. Dedup (optional for noisy corpora)
    # chunks = dedup_minhash(chunks, threshold=0.9)

    # 5. Index — handoff to langchain-embeddings-search
    store.add_documents(chunks)
    return store

For the embedding + indexing + retrieval steps, see langchain-embeddings-search.

Output

Loader chosen from the selection matrix matching source format and table needs
Splitter chosen from the decision tree matching content type
Chunk size + overlap tuned from the defaults (1000/100 prose, 1500/150 code, 500/50 FAQ)
Tables extracted as structured records with column metadata (not text chunks)
Web loaders configured with realistic User-Agent and robots.txt respect
Metadata preserved through loader → splitter → index
Optional MinHash dedup (threshold 0.9) for noisy corpora

Error Handling

Error / symptom	Cause	Fix
RAG retrieves function signature without body	`RecursiveCharacterTextSplitter` broke inside code fence (P13)	Use `from_language(Language.MARKDOWN)` or add `"```"` as first separator
Table rows misquoted in RAG answer	`PyPDFLoader` tore table by page (P49)	Switch to `PyMuPDFLoader`; index tables as structured records
`WebBaseLoader` returns 403 / blank content	Default UA flagged by Cloudflare (P50)	Set `header_template={"User-Agent": "Mozilla/5.0 ..."}`; respect robots.txt
`ValueError: expected str, NoneType found` during split	Empty `page_content`	Filter `[d for d in docs if d.page_content.strip()]` before splitting
`MemoryError` loading PDF	PDF > 5 MB ingested in one call	Pre-split with `pdftk` / `qpdf`; process chunks separately
Chunks missing metadata after split	Custom metadata added after loading but before splitting was lost	Add metadata before `split_documents()`; verify `chunks[0].metadata` preserved
Retrieval quality low on FAQ corpus	Chunks too large, one chunk holds multiple Q-A pairs	Drop to `chunksize=500, chunkoverlap=50` with `separators=["\n\n"]`
Web crawl indexes Cloudflare challenge page	No check for HTTP status / response length	Assert `len(doc.page_content) > 500` and reject pages containing `"Checking your browser"`
Duplicate chunks eat retrieval slots	Syndicated content, nav chrome not stripped	MinHash dedup at threshold 0.9 before indexing
Reranker scores inconsistent across chunks	Chunks of wildly different size change score distribution (P15)	Normalize chunk size within a corpus; target ±20% of `chunk_size`

Examples

Ingesting a Markdown docs site with code examples

Markdown docs with Python code fences require Language.MARKDOWN to keep

fence boundaries intact. Chunk size 1000 with 100 overlap preserves one

function-sized example per chunk. Front-matter fields (title, date, author)

are attached as metadata for citation. See

Language-Aware Splitters.

Ingesting a PDF filing with financial tables

10-Q filings have dozens of multi-row tables. Use PyMuPDFLoader for the prose

and a direct fitz.find_tables() pass to extract tables as structured records.

Index prose with chunk_size=1000 and tables as one-row-per-record with the

header row concatenated. Questions like "what was Q3 revenue?" hit a single row

with column meanings attached. See

Table Preservation.

Crawling a documentation site behind Cloudflare

Set a realistic User-Agent, fetch robots.txt first and respect Disallow

rules, rate-limit to 1 req/sec per host, and prefer the site's sitemap or RSS

feed when available. Assert response length > 500 chars and reject known

interstitial patterns. See Crawler Hygiene.

Ingesting a Python code repo for code RAG

GenericLoader with LanguageParser(language=Language.PYTHON) preserves

function and class boundaries. Chunk size 1500 with 150 overlap gives enough

context for typical function-level queries. Imports and module docstrings

end up in their own chunks — tag them with metadata for higher precision

retrieval on "where is X imported from" queries.

Resources

LangChain Python: Text splitters
LangChain Python: Document loaders
LangChain Python: Retrievers
PyMuPDF (fitz) docs — table detection via page.find_tables()
Unstructured.io — table-aware PDF parsing
Cloudflare bot management — why default UAs fail
robots.txt spec — crawler obligations
Pair skill: langchain-embeddings-search (embed/score/rerank — downstream of this skill)
Pack pain catalog: docs/pain-catalog.md (entries P13, P49, P50, P15)

Allowed Tools

Provided by Plugin

langchain-py-pack

Installation

Instructions

LangChain Data Handling — Loaders and Splitters (Python)

Overview

Prerequisites

Instructions

Step 1 — Choose a loader by source format

/ context

context

Step 2 — Pick a splitter by content type

Step 3 — Tune chunk_size and overlap

Step 4 — Detect and index tables as structured records

Step 5 — Preserve metadata through the pipeline

Step 6 — Deduplicate noisy corpora

Step 7 — Compose the pipeline

Output

Error Handling

Examples

Ingesting a Markdown docs site with code examples

Ingesting a PDF filing with financial tables

Crawling a documentation site behind Cloudflare

Ingesting a Python code repo for code RAG

Resources

Ready to use langchain-py-pack?

Related Skills

abridge-ci-integration

abridge-common-errors

abridge-core-workflow-a

abridge-core-workflow-b

abridge-cost-tuning

abridge-debug-bundle

/
context