langchain-data-handling
Load and chunk documents for LangChain 1.0 RAG pipelines correctly — language-aware splitters, table-safe PDF loaders, Cloudflare-compatible web loaders, chunk-boundary strategies that survive real-world structure. Use when building a RAG pipeline, diagnosing why retrieval misquotes a table, or debugging a crawler returning blank content. Trigger with "langchain document loader", "text splitter", "chunking strategy", "pdf loader", "markdown splitter", "webbaseloader".
Allowed Tools
Provided by Plugin
langchain-py-pack
Claude Code skill pack for LangChain 1.0 + LangGraph 1.0 (Python) - 34 skills covering chains, agents, RAG, middleware, checkpointing, HITL, streaming, and production patterns
Installation
This skill is included in the langchain-py-pack plugin:
/plugin install langchain-py-pack@claude-code-plugins-plus
LangChain Data Handling — Loaders and Splitters (Python)
Overview
You have a RAG system over a Python docs site. A user asks "what does
trim_messages do?" and the retriever returns this chunk:
### `trim_messages(strategy="last", include_system=True)`
Trim a message history to fit a token budget. The newest messages are kept;
older messages are dropped. Pass `include_system=True` to preserve the system
...and that's it. The chunk ends there. The code example showing the function
body — the actual thing the user wanted — is in a different chunk, retrieved
with a lower similarity score and dropped before the LLM sees it. The model
then hallucinates the function's behavior from the signature alone.
This is pain-catalog entry P13. RecursiveCharacterTextSplitter's default
separators are ["\n\n", "\n", " ", ""]. It will split on any blank line, including
a blank line inside a triple-backtick code fence in Markdown. The fix is a one-line swap
to RecursiveCharacterTextSplitter.from_language(Language.MARKDOWN), which
tries heading and code-fence boundaries before falling back to blank lines, so
fenced examples are far less likely to be cut. But you have to know the bug exists.
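One way to check whether an existing index is affected is to scan the chunks for unbalanced fence markers. The sketch below is a hypothetical helper, not part of LangChain: the names `has_unclosed_fence` and `find_broken_chunks` are made up for illustration, and it assumes fences begin at the start of a line (after optional indentation).

```python
FENCE = "`" * 3  # triple backtick, built this way so the example itself stays fence-safe


def has_unclosed_fence(chunk_text: str) -> bool:
    """Return True when a chunk holds an odd number of fence lines."""
    fence_lines = [
        line for line in chunk_text.splitlines()
        if line.lstrip().startswith(FENCE)
    ]
    # Fences come in open/close pairs, so an odd count means one half is missing.
    return len(fence_lines) % 2 == 1


def find_broken_chunks(chunks: list[str]) -> list[int]:
    """Indices of chunks that look like they were split inside a code fence."""
    return [i for i, text in enumerate(chunks) if has_unclosed_fence(text)]
```

To run it against split output, pass `[c.page_content for c in chunks]`. A non-empty result suggests the splitter cut through at least one code example.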
The sibling failures this skill prevents:
- P49: PyPDFLoader splits by page. A 5-row financial table that spans
a page break gets torn in half; rows 1-3 go in one chunk, rows 4-5 in another
with no header. A RAG answer sourced from the second chunk misquotes the
numbers because the column meanings are in the first chunk. Fix: use
PyMuPDFLoader or UnstructuredPDFLoader, which detect tables and emit
them as distinct structured elements.
- P50: WebBaseLoader's default User-Agent is python-requests/2.x.
Cloudflare-protected sites flag this as a bot and return a **403 interstitial
HTML page** ("Checking your browser...") instead of real content. The crawler
indexes the challenge page. You notice weeks later when every retrieval from
that source returns the same Cloudflare text. Fix: set a realistic
header_template={"User-Agent": "Mozilla/5.0 ..."}, respect robots.txt,
and rate-limit per-host to 1 req/sec.
Pinned versions: langchain-core 1.0.x, langchain-community 1.0.x,
langchain-text-splitters 1.0.x, pymupdf, unstructured.
Pain-catalog anchors: P13, P49, P50, P15.
This skill is the upstream half of the RAG pipeline — load and chunk.
For the downstream half (embedding, scoring, reranking) see the pair skill
langchain-embeddings-search, which covers score semantics (P12), dim guards
(P14), and reranker filtering (P15). Do not re-implement chunking there.
Prerequisites
- Python 3.10+
- langchain-core >= 1.0, < 2.0 and langchain-community >= 1.0, < 2.0
- langchain-text-splitters >= 1.0, < 2.0
- PDF support: pip install pymupdf unstructured[pdf]
- Web loading: pip install beautifulsoup4 requests
- For corpus dedup (optional): pip install datasketch
Instructions
Step 1 — Choose a loader by source format
Loader selection is the first decision — get it wrong and no amount of
splitter tuning will recover. Use the decision table:
| Source | Use | NOT | Why |
|---|---|---|---|
| PDF with tables | PyMuPDFLoader or UnstructuredPDFLoader | PyPDFLoader | Tables torn by page splits (P49) |
| PDF text-only | PyPDFLoader | — | Simple, fast, OK when no tables |
| Web page | WebBaseLoader(header_template=...) | Default UA | Cloudflare 403 (P50) |
| Markdown docs | UnstructuredMarkdownLoader | Plain text read | Preserves heading structure |
| HTML long-form | WebBaseLoader + HTMLHeaderTextSplitter | Plain text | Keeps h1/h2 heading context |
| Code repo | GenericLoader with language parser | DirectoryLoader as text | Language-aware chunking |
| Corpus (1000+ docs) | DirectoryLoader + glob filter | One-by-one | Parallel load, progress |
```python
from langchain_community.document_loaders import (
    PyMuPDFLoader,  # table-aware PDF
    WebBaseLoader,  # web pages (set custom UA)
    UnstructuredMarkdownLoader,
    DirectoryLoader,
)

# PDF with tables (P49 fix)
pdf_docs = PyMuPDFLoader("10-Q-filing.pdf").load()

# Web page (P50 fix)
web_docs = WebBaseLoader(
    "https://example.com/article",
    header_template={
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
    },
).load()

# Markdown docs site
md_docs = UnstructuredMarkdownLoader("docs/guide.md").load()

# Corpus
corpus = DirectoryLoader(
    "./docs",
    glob="**/*.md",
    loader_cls=UnstructuredMarkdownLoader,
    show_progress=True,
).load()
```
Hard limit: keep single-PDF ingestion under 5 MB per call. Larger files
should be pre-split with pdftk / qpdf to avoid OOM on PyMuPDFLoader's
full-document parse.
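A size check before loading is one way to enforce that ceiling. The following is a minimal sketch using only the standard library; the helper names `pdf_within_limit` and `partition_by_size` are hypothetical, and the 5 MB constant simply mirrors the limit stated above.

```python
import os

MAX_PDF_BYTES = 5 * 1024 * 1024  # the 5 MB ceiling suggested above


def pdf_within_limit(path: str, max_bytes: int = MAX_PDF_BYTES) -> bool:
    """Return True when the file is strictly under the size limit."""
    return os.path.getsize(path) < max_bytes


def partition_by_size(
    paths: list[str], max_bytes: int = MAX_PDF_BYTES
) -> tuple[list[str], list[str]]:
    """Split paths into (safe to load directly, needs pre-splitting first)."""
    ok: list[str] = []
    too_big: list[str] = []
    for path in paths:
        (ok if pdf_within_limit(path, max_bytes) else too_big).append(path)
    return ok, too_big
```

Files in the second list would then go through pdftk or qpdf before being handed to the loader.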
See Loader Selection Matrix for the
full per-format table with cost and accuracy notes.
Step 2 — Pick a splitter by content type
| Content | Splitter | chunk_size | chunk_overlap | Why |
|---|---|---|---|---|
| Prose (docs, articles) | RecursiveCharacterTextSplitter.from_language(Language.MARKDOWN) | 1000 | 100 | Preserves code fences (P13) |
| Python source | RecursiveCharacterTextSplitter.from_language(Language.PYTHON) | 1500 | 150 | Splits at def/class |
| FAQ / Q&A | RecursiveCharacterTextSplitter with separators=["\n\n"] | 500 | 50 | One chunk per Q-A pair |
| HTML long-form | HTMLHeaderTextSplitter | — | — | Headers become metadata |
| Generic text | RecursiveCharacterTextSplitter | 1000 | 100 | Safe default |
```python
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    Language,
    HTMLHeaderTextSplitter,
)

# GOOD: P13 fix for Markdown
md_splitter = RecursiveCharacterTextSplitter.from_language(
    Language.MARKDOWN, chunk_size=1000, chunk_overlap=100,
)

# GOOD: Python code
py_splitter = RecursiveCharacterTextSplitter.from_language(
    Language.PYTHON, chunk_size=1500, chunk_overlap=150,
)

# GOOD: HTML long-form with heading-as-metadata
html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")],
)

# BAD: breaks inside code fences (P13)
bad = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
```
See Language-Aware Splitters for the
full list of Language.* enum values, custom separator patterns, and the
code-fence-detection regex for when you need a custom splitter.
Step 3 — Tune chunk_size and overlap
Defaults from the table work for most corpora. Tune when:
- Retrieval misses context: increase chunk_size (1000 → 1500) or
chunk_overlap (100 → 200). Overlap is what bridges a concept that crosses
chunk boundaries.
- Retrieval too broad, answers wander: decrease chunk_size (1000 → 500).
Smaller chunks = more precise retrieval but more chunks to index.
- Tables / structured data: do NOT tune — index them separately (step 4).
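Judge overlap as a fraction of chunk_size. A 1% ratio (for example 200/20000)
is too low to bridge anything. 10% (the table default) is a reasonable floor,
and 20% is the sweet spot for prose where concepts regularly straddle chunk
boundaries. Code needs less overlap (10%) because function boundaries are
natural splits.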
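The arithmetic behind that trade-off can be sketched in a few lines. This is an illustrative estimate only: it assumes a fixed sliding window over characters, whereas the recursive splitters above cut at separators and will produce somewhat different counts. The function names are hypothetical.

```python
import math


def overlap_ratio(chunk_size: int, chunk_overlap: int) -> float:
    """Fraction of each chunk that repeats text from its neighbour."""
    return chunk_overlap / chunk_size


def estimate_chunk_count(text_len: int, chunk_size: int, chunk_overlap: int) -> int:
    """Approximate chunk count for a fixed sliding window over text_len characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if text_len <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap  # new characters contributed by each chunk
    return 1 + math.ceil((text_len - chunk_size) / stride)


# Compare the default prose setting with the raised-overlap setting.
for size, overlap in [(1000, 100), (1000, 200)]:
    print(size, overlap, overlap_ratio(size, overlap),
          estimate_chunk_count(100_000, size, overlap))
```

Raising overlap shrinks the stride, so the same corpus yields more chunks to embed and store. That cost is worth weighing before moving from 10% to 20%.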
Step 4 — Detect and index tables as structured records
Tables are not text. If your corpus has financial filings, product specs,
or any tabular data, index tables as separate records with column metadata:
```python
import fitz  # PyMuPDF, used directly for table detection


def extract_tables_as_records(pdf_path: str) -> list[dict]:
    """Extract tables as one record per data row, with headers attached."""
    records = []
    with fitz.open(pdf_path) as doc:
        for page_num, page in enumerate(doc):
            # find_tables() returns a TableFinder; .tables is the list of tables.
            for table_idx, table in enumerate(page.find_tables().tables):
                rows = table.extract()
                if not rows:
                    continue
                headers = rows[0]  # assumes the first extracted row is the header
                for row_idx, row in enumerate(rows[1:], start=1):
                    records.append({
                        "page": page_num,  # zero-based page index
                        "table_idx": table_idx,
                        "row_idx": row_idx,
                        "content": " | ".join(f"{h}: {v}" for h, v in zip(headers, row)),
                        "metadata": dict(zip(headers, row)),
                    })
    return records
```
Now a question like "what was Q3 revenue?" retrieves a single row with its
column headers attached, not half a table missing the column meanings. See
Table Preservation for the full pattern
including hybrid retrieval (prose + table records).
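To make the record shape concrete without needing a PDF on hand, the same row-to-record step can be run on a plain list of rows. This is a sketch: `rows_to_records` is a hypothetical helper, and the sample figures are invented placeholders, not data from any filing.

```python
def rows_to_records(
    rows: list[list[str]], page: int = 0, table_idx: int = 0
) -> list[dict]:
    """Turn a header row plus data rows into one self-describing record per row."""
    if not rows:
        return []
    headers, *body = rows
    return [
        {
            "page": page,
            "table_idx": table_idx,
            "row_idx": row_idx,
            "content": " | ".join(f"{h}: {v}" for h, v in zip(headers, row)),
            "metadata": dict(zip(headers, row)),
        }
        for row_idx, row in enumerate(body, start=1)
    ]


# Invented sample values, for illustration only.
sample = [
    ["Quarter", "Revenue", "Net income"],
    ["Q2", "410", "52"],
    ["Q3", "455", "61"],
]
records = rows_to_records(sample)
print(records[1]["content"])  # Quarter: Q3 | Revenue: 455 | Net income: 61
```

Each record carries its own column names, so a retriever that returns only the Q3 row still gives the model enough context to read the numbers correctly.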
Step 5 — Preserve metadata through the pipeline
The loader attaches metadata (source, page, heading); the splitter propagates
it. Front-matter in Markdown, PDF page numbers, and web URLs should all end
up in doc.metadata so retrieval results are citable:
```python
for doc in md_docs:
    # Markdown front-matter (if loader extracted it)
    print(doc.metadata.get("title"), doc.metadata.get("date"))

# Splitter-preserved metadata
chunks = md_splitter.split_documents(md_docs)
assert chunks[0].metadata == md_docs[0].metadata  # preserved
```
Custom metadata (tenant_id, version, confidence) should be added before
splitting so every chunk inherits it.
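A small tagging helper, applied between loading and splitting, is one way to do that. The sketch below is an assumption-laden illustration: `tag_documents` is a hypothetical name, and `SimpleNamespace` objects stand in for loader output so the example runs without LangChain installed. It relies only on each document exposing a `metadata` dict.

```python
from types import SimpleNamespace


def tag_documents(docs, **extra):
    """Merge extra key/value pairs into each document's metadata dict, in place."""
    for doc in docs:
        doc.metadata.update(extra)
    return docs


# Stand-ins with the same .page_content / .metadata attributes a loader's documents expose.
docs = [
    SimpleNamespace(page_content="Guide text", metadata={"source": "docs/guide.md"}),
    SimpleNamespace(page_content="API text", metadata={"source": "docs/api.md"}),
]
tag_documents(docs, tenant_id="acme", version="1.0")
print(docs[0].metadata)
```

Calling the helper on the loaded documents before `split_documents()` means every resulting chunk carries the same tenant and version fields.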
Step 6 — Deduplicate noisy corpora
Web crawls and scraped docs often contain near-duplicate pages (nav chrome,
footer boilerplate, syndicated posts). MinHash-based dedup at the chunk level
keeps the index clean:
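```python
from datasketch import MinHash, MinHashLSH


def dedup_minhash(chunks, threshold=0.9, num_perm=128):
    """Keep the first chunk of each near-duplicate group."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for i, chunk in enumerate(chunks):
        mh = MinHash(num_perm=num_perm)
        for tok in chunk.page_content.lower().split():
            mh.update(tok.encode("utf-8"))
        if not lsh.query(mh):  # no similar chunk indexed yet
            lsh.insert(str(i), mh)
            kept.append(chunk)
    return kept


kept = dedup_minhash(all_chunks)  # all_chunks: the list returned by split_documents()
```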
A threshold of 0.9 catches near-duplicates (minor wording differences) without
eating legitimate paraphrases.
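MinHash estimates the Jaccard similarity of the two token sets, so computing exact Jaccard on a pair of strings gives a feel for what a 0.9 threshold demands. The sketch below uses only the standard library; the two sentences are made-up examples.

```python
def jaccard(a: str, b: str) -> float:
    """Exact Jaccard similarity over lowercase whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


original = "install the package with pip and restart the service"
near_dup = "install the package with pip then restart the service"
print(round(jaccard(original, near_dup), 2))  # 0.78
```

A single changed word in a nine-word sentence already falls below 0.9, while the same change inside a 1000-character chunk would barely move the score. The threshold is therefore strict for short chunks and lenient for long ones, which is worth keeping in mind for FAQ-sized chunks.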
Step 7 — Compose the pipeline
```python
from langchain_community.document_loaders import (
    DirectoryLoader,
    UnstructuredMarkdownLoader,
)
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter


# Multi-stage: load → split → dedup → index
def build_rag_index(source_dir: str, store):
    # 1. Load
    docs = DirectoryLoader(
        source_dir,
        glob="**/*.md",
        loader_cls=UnstructuredMarkdownLoader,
    ).load()

    # 2. Clean (empty-content filter)
    docs = [d for d in docs if d.page_content.strip()]

    # 3. Split (language-aware)
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.MARKDOWN, chunk_size=1000, chunk_overlap=100,
    )
    chunks = splitter.split_documents(docs)

    # 4. Dedup (optional for noisy corpora)
    # chunks = dedup_minhash(chunks, threshold=0.9)

    # 5. Index: handoff to langchain-embeddings-search
    store.add_documents(chunks)
    return store
```
For the embedding + indexing + retrieval steps, see langchain-embeddings-search.
Output
- Loader chosen from the selection matrix matching source format and table needs
- Splitter chosen from the decision tree matching content type
- Chunk size + overlap tuned from the defaults (1000/100 prose, 1500/150 code, 500/50 FAQ)
- Tables extracted as structured records with column metadata (not text chunks)
- Web loaders configured with a realistic User-Agent and robots.txt respect
- Metadata preserved through loader → splitter → index
- Optional MinHash dedup (threshold 0.9) for noisy corpora
Error Handling
| Error / symptom | Cause | Fix |
|---|---|---|
| RAG retrieves function signature without body | RecursiveCharacterTextSplitter broke inside code fence (P13) | Use from_language(Language.MARKDOWN) or add the triple-backtick fence marker as the first separator |
| Table rows misquoted in RAG answer | PyPDFLoader tore table by page (P49) | Switch to PyMuPDFLoader; index tables as structured records |
| WebBaseLoader returns 403 / blank content | Default UA flagged by Cloudflare (P50) | Set header_template={"User-Agent": "Mozilla/5.0 ..."}; respect robots.txt |
| ValueError: expected str, NoneType found during split | Empty page_content | Filter [d for d in docs if d.page_content.strip()] before splitting |
| MemoryError loading PDF | PDF > 5 MB ingested in one call | Pre-split with pdftk / qpdf; process the parts separately |
| Chunks missing metadata after split | Custom metadata was added to the source documents after splitting, so the chunks never inherited it | Add metadata before split_documents(); verify chunks[0].metadata preserved |
| Retrieval quality low on FAQ corpus | Chunks too large, one chunk holds multiple Q-A pairs | Drop to chunk_size=500, chunk_overlap=50 with separators=["\n\n"] |
| Web crawl indexes Cloudflare challenge page | No check for HTTP status / response length | Assert len(doc.page_content) > 500 and reject pages containing "Checking your browser" |
| Duplicate chunks eat retrieval slots | Syndicated content, nav chrome not stripped | MinHash dedup at threshold 0.9 before indexing |
| Reranker scores inconsistent across chunks | Chunks of wildly different size change score distribution (P15) | Normalize chunk size within a corpus; target ±20% of chunk_size |
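The challenge-page check from the table can be sketched as a filter that runs between the web loader and the splitter. The helper names and the marker tuple are hypothetical. The 500-character floor and the "Checking your browser" phrase come from the table above, and a real crawl would likely need more markers.

```python
CHALLENGE_MARKERS = ("checking your browser",)  # phrase quoted in the table above
MIN_CONTENT_CHARS = 500


def looks_like_real_content(
    text: str,
    min_chars: int = MIN_CONTENT_CHARS,
    markers: tuple[str, ...] = CHALLENGE_MARKERS,
) -> bool:
    """Reject pages that are too short or that contain a known interstitial phrase."""
    stripped = text.strip()
    if len(stripped) <= min_chars:
        return False
    lowered = stripped.lower()
    return not any(marker in lowered for marker in markers)


def filter_crawled(docs):
    """Keep only documents whose page_content passes the content check."""
    return [d for d in docs if looks_like_real_content(d.page_content)]
```

Logging the rejected URLs rather than dropping them silently makes it easier to notice when a whole host has started serving challenge pages.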
Examples
Ingesting a Markdown docs site with code examples
Markdown docs with Python code fences require Language.MARKDOWN to keep
fence boundaries intact. Chunk size 1000 with 100 overlap preserves one
function-sized example per chunk. Front-matter fields (title, date, author)
are attached as metadata for citation. See
Ingesting a PDF filing with financial tables
10-Q filings have dozens of multi-row tables. Use PyMuPDFLoader for the prose
and a direct fitz.find_tables() pass to extract tables as structured records.
Index prose with chunk_size=1000 and tables as one-row-per-record with the
header row concatenated. Questions like "what was Q3 revenue?" hit a single row
with column meanings attached. See
Crawling a documentation site behind Cloudflare
Set a realistic User-Agent, fetch robots.txt first and respect Disallow
rules, rate-limit to 1 req/sec per host, and prefer the site's sitemap or RSS
feed when available. Assert response length > 500 chars and reject known
interstitial patterns. See Crawler Hygiene.
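The robots.txt and rate-limit steps can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: `build_robots` and `PerHostRateLimiter` are hypothetical names, the robots.txt text is assumed to have been fetched already, and the clock and sleep functions are injectable so the limiter can be exercised without real waiting.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that was already fetched from the host."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class PerHostRateLimiter:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval: float = 1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request: dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Sleep if needed before hitting url's host; return the seconds slept."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay:
            self._sleep(delay)
        # Record when this request is actually sent (after any sleep).
        self._last_request[host] = now + delay
        return delay
```

In a crawl loop, each URL would first be checked with `robots.can_fetch(user_agent, url)`, then passed to `limiter.wait(url)` before the loader is called.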
Ingesting a Python code repo for code RAG
GenericLoader with LanguageParser(language=Language.PYTHON) preserves
function and class boundaries. Chunk size 1500 with 150 overlap gives enough
context for typical function-level queries. Imports and module docstrings
end up in their own chunks — tag them with metadata for higher precision
retrieval on "where is X imported from" queries.
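One possible way to tag those chunks is to classify each one with the standard `ast` module. This is a sketch, not part of GenericLoader or LanguageParser: `classify_python_chunk` and the label strings are hypothetical.

```python
import ast


def classify_python_chunk(source: str) -> str:
    """Label a chunk as 'imports', 'docstring', 'code', 'empty', or 'unparsed'."""
    try:
        body = ast.parse(source).body
    except SyntaxError:
        return "unparsed"  # the chunk was cut mid-statement
    if not body:
        return "empty"
    if all(isinstance(node, (ast.Import, ast.ImportFrom)) for node in body):
        return "imports"
    only = body[0]
    if (
        len(body) == 1
        and isinstance(only, ast.Expr)
        and isinstance(only.value, ast.Constant)
        and isinstance(only.value.value, str)
    ):
        return "docstring"  # a lone string literal, such as a module docstring
    return "code"
```

The label could be stored under a metadata key of your choosing (for example `chunk.metadata["chunk_kind"]`, a hypothetical key) and used as a filter for "where is X imported from" queries.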
Resources
- LangChain Python: Text splitters
- LangChain Python: Document loaders
- LangChain Python: Retrievers
- PyMuPDF (fitz) docs: table detection via page.find_tables()
- Unstructured.io: table-aware PDF parsing
- Cloudflare bot management: why default UAs fail
- robots.txt spec: crawler obligations
- Pair skill: langchain-embeddings-search (embed/score/rerank, downstream of this skill)
- Pack pain catalog: docs/pain-catalog.md (entries P13, P49, P50, P15)