langchain-ci-integration

Wire LangChain 1.0 / LangGraph 1.0 tests into a GitHub Actions pipeline — unit tests with FakeListChatModel, VCR-gated integration tests, warning-filter policy, and eval-regression merge gates. Complements langchain-local-dev-loop (F23) which covers the inner loop; THIS covers the CI wire-up. Use when setting up GHA for a new LLM service, after a VCR cassette leak incident, or hardening an existing pipeline. Trigger with "langchain ci", "langchain github actions", "langchain test pipeline", "vcr ci", "langchain eval gate", "pytest -W error langchain".

v2.0.0

Jeremy Longshore

MIT

claude-code, codex

5 Tools

langchain-py-pack Plugin

saas packs Category

Allowed Tools
Read, Write, Edit, Bash(python:*), Bash(pytest:*)

Provided by Plugin

langchain-py-pack

Claude Code skill pack for LangChain 1.0 + LangGraph 1.0 (Python) - 34 skills covering chains, agents, RAG, middleware, checkpointing, HITL, streaming, and production patterns

saas packs v2.0.0


Installation

This skill is included in the langchain-py-pack plugin:

/plugin install langchain-py-pack@claude-code-plugins-plus


Instructions

LangChain CI Integration (Python)

Overview

A PR passes every test on your laptop. You push. GHA runs pytest and aborts during collection, before a single test executes, with:


PytestUnraisableExceptionWarning: Exception ignored in: ...
DeprecationWarning: langchain_community.llms ...

The org runs pytest -W error and a provider SDK emitted a DeprecationWarning at import time, which the warning filter promoted to an exception while pytest was still walking the test tree. This is P45 and it blocks every PR for the team until someone pins a filterwarnings config.

Meanwhile the integration suite has its own failure mode: a VCR cassette recorded three months ago at temperature=0 against Anthropic is now flaking against a snapshot. temperature=0 is not deterministic on Claude (it still nucleus-samples, P05), so the cassette captured one valid completion, not the only valid completion. And yesterday a reviewer caught Authorization: Bearer sk-ant-... in a cassette file that had been committed six weeks ago (P44) because vcrpy records all request headers by default.

This skill covers the outer loop: the GitHub Actions workflow, the unit / integration / eval gate separation, VCR cassette hygiene, pytest warning policy, and a merge-blocking eval regression gate. The inner loop (fake model fixtures, VCR recording workflow, local determinism tricks) lives in langchain-local-dev-loop (F23); cross-reference it, do not duplicate it.

Pin: langchain-core 1.0.x, langgraph 1.0.x, actions/checkout@v4, actions/setup-python@v5, vcrpy 6.x. Pain-catalog anchors: P05, P43, P44, P45.

Prerequisites

Python 3.10, 3.11, or 3.12 (matrix)
langchain-core >= 1.0, < 2.0, langgraph >= 1.0, < 2.0
pytest >= 8, pytest-asyncio, pytest-timeout (provides the --timeout flag used in the workflow), vcrpy >= 6 (integration)
langchain-local-dev-loop (F23) applied locally — fixtures and recording workflow
GitHub repo with Actions enabled; secrets set for any live-API nightly job

Instructions

Step 1 — GHA workflow skeleton with four jobs

Single workflow at .github/workflows/tests.yml. Matrix on unit only; keep integration and eval single-version to control cost.


name: tests

on:
  pull_request:
  push:
    branches: [main]
  schedule:
    - cron: "0 6 * * *"  # nightly live-API re-record check (06:00 UTC)

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  unit:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python: ["3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python }}
          cache: pip
          cache-dependency-path: |
            pyproject.toml
            requirements*.txt
      - run: pip install -e ".[test]"
      - run: pytest tests/unit/ -W error --timeout=30 -q

  integration:
    needs: unit
    if: github.event_name == 'schedule' || contains(github.event.pull_request.labels.*.name, 'run-integration')
    runs-on: ubuntu-latest
    env:
      RUN_INTEGRATION: "1"
      VCR_MODE: "none"  # replay-only; nightly cron flips to "once"
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12", cache: pip }
      - run: pip install -e ".[test,integration]"
      - run: pytest tests/integration/ -W error --timeout=60 -q

  eval:
    needs: unit
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }   # need base ref for delta comparison
      - uses: actions/setup-python@v5
        with: { python-version: "3.12", cache: pip }
      - run: pip install -e ".[test,eval]"
      - run: python scripts/run_eval.py --baseline origin/${{ github.base_ref }} --head HEAD --n 100
      # run_eval.py posts a PR comment and exits nonzero on regression > threshold

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12", cache: pip }
      - run: pip install -e ".[dev]"
      - run: ruff check .
      - run: python scripts/dryrun_load_chains.py   # catches ImportError migration regressions

See GHA Workflow Reference for the full job definitions including the secret-injection pattern, the matrix caching nuance, and the softprops/action-gh-release-style PR comment action used by the eval job.

Step 2 — Unit job: `-W error` + `filterwarnings` to neutralize P45

Root cause of the collection abort: pytest collects tests by importing them. Some provider SDKs emit DeprecationWarning on import. With -W error those become exceptions during collection. Fix at the filter level, not by dropping the error policy (which would mask real warnings).

In pyproject.toml:


[tool.pytest.ini_options]
filterwarnings = [
    "error",
    # P45 — neutralize known import-time noise; scoped per module so new
    # warnings from YOUR code still fail the build.
    "ignore::DeprecationWarning:langchain_community.*",
    "ignore::DeprecationWarning:pydantic.*",
    "ignore:Pydantic serializer warnings:UserWarning",
]
asyncio_mode = "auto"
testpaths = ["tests"]

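The ordering matters: "error" first, specific "ignore" entries after. When several entries match a warning, pytest applies the last one, so the scoped ignores override the global promote-to-error. Keep the list narrow: a blanket ignore::DeprecationWarning hides regressions you need to see. One caveat: pytest applies command-line -W options after the ini entries, so they take precedence over this list. If the scoped ignores appear to have no effect, drop -W error from the pytest command line and rely on the "error" entry here.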
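The precedence rule can be checked with the standard library alone. The sketch below is illustrative: it reproduces the same two filters with the warnings module rather than through pytest, and the helper name promoted_to_error and the module name my_service.chains are made up for the demonstration.

```python
import warnings


def promoted_to_error(module_name: str) -> bool:
    """Return True if a DeprecationWarning attributed to module_name raises."""
    with warnings.catch_warnings():
        warnings.resetwarnings()
        # Same order as the ini list: the blanket "error" goes in first ...
        warnings.filterwarnings("error")
        # ... then the scoped ignore. Filters added later are consulted first,
        # so this entry wins for modules that match the regex.
        warnings.filterwarnings(
            "ignore",
            category=DeprecationWarning,
            module=r"langchain_community.*",
        )
        try:
            warnings.warn_explicit(
                "old API",
                DeprecationWarning,
                filename="example.py",
                lineno=1,
                module=module_name,
            )
        except DeprecationWarning:
            return True
        return False


print(promoted_to_error("langchain_community.llms"))  # False
print(promoted_to_error("my_service.chains"))         # True
```

Warnings attributed to the SDK are swallowed, while the same warning class raised from your own package still fails the run, which is the behavior the scoped list is meant to preserve.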

Unit tests use FakeListChatModel fixtures from F23 (do not redefine them here). One CI-specific gotcha (P43): FakeListChatModel does not emit response_metadata["token_usage"], so any callback that asserts on token counts will break. Either subclass the fake and inject generation_info, or gate the assertion:


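import pytest

def test_chain_uses_tokens(patched_chat_model):
    # Skip before invoking: the fake never populates token_usage (P43).
    if patched_chat_model.__class__.__name__ == "FakeListChatModel":
        pytest.skip("fake model doesn't emit token_usage (P43)")
    # `chain` is the chain under test, built on patched_chat_model elsewhere.
    result = chain.invoke({"input": "hi"})
    assert result.response_metadata["token_usage"]["total_tokens"] > 0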

Budget: the unit job should finish in under 2 minutes across the 3-version matrix. If it doesn't, something is probably calling out to a real provider. pytest --collect-only -q | wc -l gives a rough test count, and pytest tests/unit/ --durations=10 lists the slowest tests; audit those first for missing fake-model fixtures.
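Another way to gate the token assertion is to read the usage defensively and assert only when it is present. This is a minimal sketch: the helper name total_tokens is hypothetical, and the SimpleNamespace objects stand in for real and fake model replies so the example runs without LangChain installed.

```python
from types import SimpleNamespace
from typing import Any, Optional


def total_tokens(message: Any) -> Optional[int]:
    """Return total_tokens from response_metadata, or None when it is absent."""
    metadata = getattr(message, "response_metadata", None) or {}
    usage = metadata.get("token_usage") or {}
    return usage.get("total_tokens")


# Stand-ins for a provider reply and a fake-model reply (illustrative only).
real_reply = SimpleNamespace(response_metadata={"token_usage": {"total_tokens": 42}})
fake_reply = SimpleNamespace(response_metadata={})

for reply in (real_reply, fake_reply):
    tokens = total_tokens(reply)
    if tokens is None:
        print("no token_usage; skip the assertion")
    else:
        assert tokens > 0
        print(f"total_tokens={tokens}")
```

Checking the reply rather than the model class keeps the test valid if the fake is later subclassed to emit usage.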

Step 3 — Integration job: VCR replay + `filter_headers` (P44)

Integration tests replay pre-recorded VCR cassettes. Three rules:

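Gate the job. if: contains(github.event.pull_request.labels.*.name, 'run-integration') or env.RUN_INTEGRATION == "1", plus a nightly cron that flips to VCR_MODE=once and re-records against live APIs. PRs default to pure replay. Note that vcrpy's once mode records only when no cassette file exists, so the nightly job has to delete stale cassettes first (or use the all mode) to actually re-record.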
Enforce filter_headers at the fixture level — not per-test. A single conftest.py prevents any contributor from recording a cassette with raw credentials.
Pre-commit + CI both scan cassettes for leaked keys. Belt and suspenders.

Fixture (lives in tests/integration/conftest.py, owned by this skill's pipeline concern; F23 owns the recording workflow):


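import os

import pytest

@pytest.fixture(scope="module")
def vcr_config():
    # Read by pytest plugins that wrap vcrpy (for example pytest-recording).
    return {
        "filter_headers": [
            "authorization",
            "x-api-key",
            "anthropic-version",
            ("openai-organization", "REDACTED"),  # tuple form replaces the value
        ],
        "filter_post_data_parameters": ["api_key"],
        # CI default is replay-only; the workflow's VCR_MODE env var overrides it.
        "record_mode": os.environ.get("VCR_MODE", "none"),
        "match_on": ["method", "scheme", "host", "port", "path", "query"],
    }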
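To make the intended effect of filter_headers concrete, the sketch below applies the same rules to a plain dict. It is an illustration of the outcome, not vcrpy's implementation; redact_headers and the sample header values are invented for the example.

```python
from typing import Dict, Iterable, Tuple, Union

Rule = Union[str, Tuple[str, str]]


def redact_headers(headers: Dict[str, str], rules: Iterable[Rule]) -> Dict[str, str]:
    """Drop headers named by a plain string; replace values for (name, value) rules."""
    drop = set()
    replace = {}
    for rule in rules:
        if isinstance(rule, str):
            drop.add(rule.lower())
        else:
            name, value = rule
            replace[name.lower()] = value

    cleaned = {}
    for name, value in headers.items():
        key = name.lower()  # header names are case-insensitive
        if key in drop:
            continue
        cleaned[name] = replace.get(key, value)
    return cleaned


rules = ["authorization", "x-api-key", ("openai-organization", "REDACTED")]
recorded = {
    "Authorization": "Bearer not-a-real-token",
    "X-Api-Key": "not-a-real-key",
    "OpenAI-Organization": "org-example",
    "Content-Type": "application/json",
}
print(redact_headers(recorded, rules))
# {'OpenAI-Organization': 'REDACTED', 'Content-Type': 'application/json'}
```

A cassette that still contains any of the dropped header names after recording is a sign the fixture was bypassed.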

The integration suite must finish in under 5 minutes wall-clock on the runner, or you will start getting cancellation flakes from the concurrency block. If you exceed 5 minutes, split into a nightly-only long tier.

See Integration Gating for the full live-vs-replay decision tree, cost-per-run budget worksheet, and the VCR_MODE flip pattern.

Step 4 — Eval-regression gate: merge-blocking PR comment

The eval job runs the langchain-eval-harness harness (see that skill for the harness itself; this skill only covers the CI wire-up) against both the PR branch and the merge base. Post a comment; block merge on regression.

scripts/run_eval.py is a thin CI wrapper: check out baseline and head via git worktree, run the harness at each ref, diff the results, post a PR comment, exit nonzero on regression. Full implementation in Eval Regression Gate.

Thresholds:

Gate	Threshold	Rationale
Aggregate score	drop >2%	One-sigma noise on n=100 with well-behaved evals
Per-example score	drop >5% on any single case	Catches quiet regressions masked by aggregate averaging
Sample size floor	n ≥ 100	Below this, aggregate delta is dominated by noise
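The three gates in the table can be sketched as a single decision function. This is not the actual run_eval.py; it assumes per-example scores in the range 0 to 1 keyed by example id, and it reads the thresholds as absolute drops (0.02 and 0.05), which the table does not state explicitly.

```python
from typing import Dict, List

AGGREGATE_DROP = 0.02    # assumption: absolute drop in the mean score
PER_EXAMPLE_DROP = 0.05  # assumption: absolute drop on any single example
MIN_SAMPLES = 100


def gate_failures(baseline: Dict[str, float], head: Dict[str, float]) -> List[str]:
    """Return human-readable reasons the gate failed; an empty list means pass."""
    shared = sorted(set(baseline) & set(head))
    if len(shared) < MIN_SAMPLES:
        return [f"sample size {len(shared)} is below the floor of {MIN_SAMPLES}"]

    failures = []
    base_mean = sum(baseline[k] for k in shared) / len(shared)
    head_mean = sum(head[k] for k in shared) / len(shared)
    if base_mean - head_mean > AGGREGATE_DROP:
        failures.append(f"aggregate dropped {base_mean:.3f} -> {head_mean:.3f}")
    for key in shared:
        if baseline[key] - head[key] > PER_EXAMPLE_DROP:
            failures.append(f"example {key} dropped {baseline[key]:.2f} -> {head[key]:.2f}")
    return failures


baseline = {f"case-{i}": 1.0 for i in range(100)}
head = dict(baseline, **{"case-7": 0.5})
print(gate_failures(baseline, head))
# ['example case-7 dropped 1.00 -> 0.50']
```

In the example the aggregate moves by only 0.005, so the per-example gate is the one that catches the regression; a CI wrapper would exit nonzero whenever the returned list is non-empty.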

The PR comment is a Markdown table with before / after / Δ per metric plus a bold red line if the gate failed. Required-status-check on the eval job completes the enforcement. See Eval Regression Gate for the comment template and the noise-budget calculation.
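A comment body of that shape could be rendered roughly as follows. This is a hedged sketch rather than the template from Eval Regression Gate: render_comment and the failure wording are hypothetical, the color is left out, and both dicts are assumed to carry the same metric names.

```python
from typing import Dict


def render_comment(before: Dict[str, float], after: Dict[str, float], passed: bool) -> str:
    """Build a Markdown table of before / after / delta, plus a bold line on failure."""
    lines = [
        "| Metric | Before | After | Δ |",
        "| --- | ---: | ---: | ---: |",
    ]
    for metric in sorted(before):
        b, a = before[metric], after[metric]
        lines.append(f"| {metric} | {b:.3f} | {a:.3f} | {a - b:+.3f} |")
    if not passed:
        lines.append("")
        lines.append("**Eval regression gate FAILED**")
    return "\n".join(lines)


print(render_comment({"accuracy": 0.91}, {"accuracy": 0.88}, passed=False))
```

Posting the string is left to whichever comment action or API client the eval job already uses.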

Step 5 — Pre-commit hooks: secret scan + prompt lint

Two layers: local (pre-commit) and CI (re-runs the same hooks as a final catch). Local alone is not sufficient because contributors can skip hooks with git commit -n (--no-verify). CI alone gives slow feedback. Run both.

.pre-commit-config.yaml:


repos:
  - repo: local
    hooks:
      - id: vcr-secret-scan
        name: VCR cassette secret scan (P44)
        entry: python scripts/scan_cassettes.py
        language: system
        files: "tests/integration/cassettes/.*\\.ya?ml$"
        pass_filenames: true

      - id: prompt-convention-lint
        name: prompt-convention lint
        entry: python scripts/lint_prompts.py
        language: system
        files: "prompts/.*\\.j2$|src/.*prompts?\\.py$"

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9
    hooks:
      - id: ruff
      - id: ruff-format

  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.5.0
    hooks:
      - id: detect-secrets
        args: ["--baseline", ".secrets.baseline"]

scan_cassettes.py greps for sk-[A-Za-z0-9]{20,}, sk-ant-[A-Za-z0-9-]{20,}, AIza[A-Za-z0-9-]{35} (Google), xoxb-, and Bearer [A-Za-z0-9.-]{20,}. Fail on any match. This is your last line of defense before P44 ships to main. See Pre-Commit Hooks for the full pattern list, the prompt-convention lint rules (aligned with claude-prompt-conventions), and the detect-secrets baseline-rotation policy.
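A minimal sketch of such a scanner, assuming it receives cassette paths as arguments (as pass_filenames: true implies). The patterns are copied from the text above; the function names and output format are assumptions, and the full pattern list lives in Pre-Commit Hooks.

```python
import re
import sys
from pathlib import Path
from typing import Iterable, List, Tuple

# Patterns as listed in the text; extend from the full list as needed.
PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),
    re.compile(r"sk-ant-[A-Za-z0-9-]{20,}"),
    re.compile(r"AIza[A-Za-z0-9-]{35}"),
    re.compile(r"xoxb-"),
    re.compile(r"Bearer [A-Za-z0-9.-]{20,}"),
]


def scan_text(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, pattern) for every line matching a key pattern."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern in PATTERNS:
            if pattern.search(line):
                hits.append((lineno, pattern.pattern))
    return hits


def main(paths: Iterable[str]) -> int:
    """Scan each file and return 1 if anything matched, else 0."""
    failed = False
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        for lineno, pattern in scan_text(text):
            # Report the location and pattern only; never echo the secret itself.
            print(f"{path}:{lineno}: matches {pattern}")
            failed = True
    return 1 if failed else 0


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

Printing the pattern instead of the matched text keeps leaked keys out of CI logs.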

Step 6 — Dry-run chain loader: catch ImportError migration breaks

LangChain 0.x → 1.0 moved integrations into provider packages. A chain that still does from langchain.chat_models import ChatOpenAI works in local dev if you have the old compat shim installed, and explodes in CI. Dry-run-load every chain module at lint time:


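# scripts/dryrun_load_chains.py
import importlib
import pathlib
import sys
import traceback

# Running a script puts scripts/ (not the repo root) on sys.path, so add the
# working directory; this assumes the lint job runs from the repo root.
sys.path.insert(0, str(pathlib.Path.cwd()))

failures = []
for py in sorted(pathlib.Path("src/chains").rglob("*.py")):
    parts = py.with_suffix("").parts
    if parts[-1] == "__init__":
        parts = parts[:-1]  # import the package, not "pkg.__init__"
    mod = ".".join(parts)  # e.g. src.chains.foo, independent of path separator
    try:
        importlib.import_module(mod)
    except Exception:
        failures.append((mod, traceback.format_exc()))

if failures:
    for mod, tb in failures:
        # Keep the annotation on one line; print the traceback separately.
        print(f"::error::chain {mod} failed to import")
        print(tb, file=sys.stderr)
    sys.exit(1)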

Runs in the lint job. Costs ~5 seconds. Catches every ImportError and every top-level NameError from a bad rename before a single unit test fires.

Output

GHA workflow with four isolated jobs (unit / integration / eval / lint)
pyproject.toml filterwarnings config that survives -W error (P45)
VCR conftest.py fixture with enforced filter_headers (P44)
run_eval.py CI wrapper that posts PR comments and blocks merge on regression
.pre-commit-config.yaml with cassette secret scan + prompt lint + ruff
Dry-run chain loader that catches migration ImportErrors

Gate policy

Gate	Required?	Target speed	On failure
unit (3 Python versions)	yes, every PR	<2 min	block PR
lint + dryrun-load	yes, every PR	<30 s	block PR
integration (VCR replay)	on `run-integration` label or nightly	<5 min	block merge when run
integration (live, nightly cron)	no	<15 min	open issue on fail
eval regression (n≥100)	yes, every PR	<10 min	block merge if agg >2% or per-example >5%
pre-commit (local)	yes	<10 s	reject commit

Error Handling

Error	Cause	Fix
`PytestUnraisableExceptionWarning` during collection	`-W error` + SDK import-time `DeprecationWarning` (P45)	Add scoped `filterwarnings = ["ignore::DeprecationWarning:langchain_community.*"]` to `pyproject.toml`
VCR replay mismatch after weeks of passing	Cassette recorded at `temp=0` on Anthropic (P05); model drift	Re-record on nightly cron with `VCR_MODE=once`; treat replay mismatches as eval-gate concerns, not unit failures
`sk-ant-...` in cassette flagged by reviewer	`vcrpy` records all headers by default (P44)	Enforce `filter_headers` in `conftest.py`; add `scan_cassettes.py` to pre-commit AND CI
Callback `AssertionError: 'token_usage' not in response_metadata`	`FakeListChatModel` doesn't emit metadata (P43)	Subclass the fake to inject `generation_info`, or `pytest.skip` on fake-model detection
`ImportError: cannot import name 'ChatOpenAI' from 'langchain.chat_models'` in CI only	Legacy compat shim installed locally, not in CI	Add `dryrun_load_chains.py` to lint job; fail at lint, not at test
Eval job times out at 10 min	n too large or harness not using `asyncio` concurrency	Cap at n=100 for PRs; run n=500 nightly; see F23 for async harness pattern
Concurrency block cancels integration run	Long job + rapid pushes	Do not disable; keep integration <5 min or split long tier to nightly

Examples

Wiring a new repo from scratch

Copy the Step 1 workflow, the Step 2 pyproject.toml block, and the Step 5 pre-commit config. Create tests/unit/, tests/integration/cassettes/, scripts/run_eval.py, scripts/dryrun_load_chains.py, scripts/scan_cassettes.py. Apply langchain-local-dev-loop (F23) first so fake-model fixtures exist before the unit job runs. Enable required status checks: unit (3.10), unit (3.11), unit (3.12), lint, eval. Integration stays optional (label-gated).

See GHA Workflow Reference for the complete copy-pasteable workflow.

Hardening after a P44 cassette-leak incident

Rotate every leaked key first (not a CI concern; that is incident response). Then: add scan_cassettes.py to pre-commit, re-scan the full history with git log -p -- tests/integration/cassettes/, rewrite history with git-filter-repo if keys hit main, and enforce the filter_headers fixture going forward. See Pre-Commit Hooks for the full pattern list and the detect-secrets baseline-rotation playbook.
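The history re-scan can be partly automated by parsing the git log -p output. The sketch below works on the log text itself (for instance captured with subprocess.run and text=True); leaked_commits, the reduced pattern set, and the sample log are illustrative assumptions, not part of the pack.

```python
import re
from typing import List

# Reduced pattern set for the example; reuse the full cassette-scan list in practice.
KEY_PATTERN = re.compile(r"sk-ant-[A-Za-z0-9-]{20,}|sk-[A-Za-z0-9]{20,}")
COMMIT_LINE = re.compile(r"^commit ([0-9a-f]{7,64})")


def leaked_commits(patch_log: str) -> List[str]:
    """Return hashes of commits whose added lines contain a key-shaped string."""
    current = None
    leaked: List[str] = []
    for line in patch_log.splitlines():
        header = COMMIT_LINE.match(line)
        if header:
            current = header.group(1)
            continue
        # Added lines start with "+"; "+++" is the file header, not content.
        added = line.startswith("+") and not line.startswith("+++")
        if added and current and KEY_PATTERN.search(line) and current not in leaked:
            leaked.append(current)
    return leaked


sample_log = "\n".join([
    "commit 1111111111111111111111111111111111111111",
    "    add cassette",
    "+++ b/tests/integration/cassettes/chat.yaml",
    "+      x-api-key: sk-ant-" + "a" * 24,
    "commit 2222222222222222222222222222222222222222",
    "    tidy cassette",
    "+      content-type: application/json",
])
print(leaked_commits(sample_log))
# ['1111111111111111111111111111111111111111']
```

The resulting hashes show how far back the rewrite with git-filter-repo has to reach.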

Wiring the eval harness into an existing repo

The harness itself lives in langchain-eval-harness. THIS skill only supplies run_eval.py (the CI wrapper that reads the harness output, computes deltas, and posts PR comments) plus the gate thresholds. Drop in the Step 4 script, add the eval job to .github/workflows/tests.yml, make eval a required status check. See Eval Regression Gate for the PR-comment Markdown template and the n≥100 noise-budget derivation.
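One rough way to sanity-check the n≥100 floor, separate from the derivation in Eval Regression Gate, is the binomial standard error of a pass rate. The sketch assumes binary pass/fail scoring and an illustrative baseline pass rate of 0.96; neither is stated in this skill.

```python
import math


def standard_error(pass_rate: float, n: int) -> float:
    """Standard error of a pass rate measured on n independent pass/fail examples."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n)


for n in (25, 100, 400):
    print(n, round(standard_error(0.96, n), 4))
# 25 0.0392
# 100 0.0196
# 400 0.0098
```

Under these assumptions one standard error at n=100 is about 2 points, which lines up with the aggregate threshold; at a lower pass rate such as 0.90 it rises to 3 points, so noisier suites may need a larger n or a looser gate.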

Resources

LangChain Python: Testing
FakeListChatModel API
vcrpy docs — filtering sensitive data
GitHub Actions docs
pytest filterwarnings
Pair skill: langchain-local-dev-loop (F23) — fake fixtures, local recording
Pair skill: langchain-eval-harness — eval suite the gate runs against
Pack pain catalog: docs/pain-catalog.md (entries P05, P43, P44, P45)
