langchain-ci-integration

Wire LangChain 1.0 / LangGraph 1.0 tests into a GitHub Actions pipeline — unit tests with FakeListChatModel, VCR-gated integration tests, warning-filter policy, and eval-regression merge gates. Complements langchain-local-dev-loop (F23) which covers the inner loop; THIS covers the CI wire-up. Use when setting up GHA for a new LLM service, after a VCR cassette leak incident, or hardening an existing pipeline. Trigger with "langchain ci", "langchain github actions", "langchain test pipeline", "vcr ci", "langchain eval gate", "pytest -W error langchain".


Allowed Tools

Read, Write, Edit, Bash(python:*), Bash(pytest:*)

Provided by Plugin

langchain-py-pack

Claude Code skill pack for LangChain 1.0 + LangGraph 1.0 (Python) - 34 skills covering chains, agents, RAG, middleware, checkpointing, HITL, streaming, and production patterns

saas packs v2.0.0

Installation

This skill is included in the langchain-py-pack plugin:

/plugin install langchain-py-pack@claude-code-plugins-plus


Instructions

LangChain CI Integration (Python)

Overview

A PR passes every test on your laptop. You push. GHA runs pytest and aborts during collection, before a single test executes, with:


PytestUnraisableExceptionWarning: Exception ignored in: ...
DeprecationWarning: langchain_community.llms ...

The org runs pytest -W error and a provider SDK emitted a DeprecationWarning at import time, which the warning filter promoted to an exception while pytest was still walking the test tree. This is P45, and it blocks every PR for the team until someone pins a filterwarnings config.

Meanwhile the integration suite has its own failure mode: a VCR cassette recorded three months ago at temperature=0 against Anthropic is now flaking against a snapshot. temperature=0 is not deterministic on Claude; it still nucleus-samples (P05), so the cassette captured one valid completion, not the only valid one. And yesterday a reviewer caught Authorization: Bearer sk-ant-... in a cassette file that had been committed six weeks ago (P44), because vcrpy records all request headers by default.

This skill covers the outer loop: the GitHub Actions workflow, the unit / integration / eval gate separation, VCR cassette hygiene, pytest warning policy, and a merge-blocking eval regression gate. The inner loop (fake model fixtures, VCR recording workflow, local determinism tricks) lives in langchain-local-dev-loop (F23); cross-reference it, do not duplicate it.

Pin: langchain-core 1.0.x, langgraph 1.0.x, actions/checkout@v4, actions/setup-python@v5, vcrpy 6.x. Pain-catalog anchors: P05, P43, P44, P45.

Prerequisites

  • Python 3.10, 3.11, or 3.12 (matrix)
  • langchain-core >= 1.0, < 2.0, langgraph >= 1.0, < 2.0
  • pytest >= 8, pytest-asyncio, pytest-timeout (provides the --timeout flag used in the workflow), vcrpy >= 6 (integration)
  • langchain-local-dev-loop (F23) applied locally — fixtures and recording workflow
  • GitHub repo with Actions enabled; secrets set for any live-API nightly job

Instructions

Step 1 — GHA workflow skeleton with four jobs

Single workflow at .github/workflows/tests.yml. Matrix on unit only; keep integration and eval single-version to control cost.


name: tests

on:
  pull_request:
  push:
    branches: [main]
  schedule:
    - cron: "0 6 * * *"  # nightly live-API re-record check (06:00 UTC)

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  unit:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python: ["3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python }}
          cache: pip
          cache-dependency-path: |
            pyproject.toml
            requirements*.txt
      - run: pip install -e ".[test]"
      - run: pytest tests/unit/ -W error --timeout=30 -q

  integration:
    needs: unit
    if: github.event_name == 'schedule' || contains(github.event.pull_request.labels.*.name, 'run-integration')
    runs-on: ubuntu-latest
    env:
      RUN_INTEGRATION: "1"
      VCR_MODE: "none"  # replay-only; nightly cron flips to "once"
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12", cache: pip }
      - run: pip install -e ".[test,integration]"
      - run: pytest tests/integration/ -W error --timeout=60 -q

  eval:
    needs: unit
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }   # need base ref for delta comparison
      - uses: actions/setup-python@v5
        with: { python-version: "3.12", cache: pip }
      - run: pip install -e ".[test,eval]"
      - run: python scripts/run_eval.py --baseline origin/${{ github.base_ref }} --head HEAD --n 100
      # run_eval.py posts a PR comment and exits nonzero on regression > threshold

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12", cache: pip }
      - run: pip install -e ".[dev]"
      - run: ruff check .
      - run: python scripts/dryrun_load_chains.py   # catches ImportError migration regressions

See GHA Workflow Reference for the full job definitions, including the secret-injection pattern, the matrix caching nuance, and the softprops/action-gh-release-style PR comment action used by the eval job.

Step 2 — Unit job: -W error + filterwarnings to neutralize P45

Root cause of the collection abort: pytest collects tests by importing them. Some provider SDKs emit DeprecationWarning on import. With -W error those become exceptions during collection. Fix at the filter level, not by dropping -W error (which would mask real warnings).

In pyproject.toml:


[tool.pytest.ini_options]
filterwarnings = [
    "error",
    # P45 — neutralize known import-time noise; scoped per module so new
    # warnings from YOUR code still fail the build.
    "ignore::DeprecationWarning:langchain_community.*",
    "ignore::DeprecationWarning:pydantic.*",
    "ignore:Pydantic serializer warnings:UserWarning",
]
asyncio_mode = "auto"
testpaths = ["tests"]

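The ordering matters: "error" first, specific "ignore" entries after. pytest gives later entries in the list precedence, so the ignores override the global promote-to-error. Command-line -W options take precedence over the ini list, however, so a literal -W error on the pytest command line can re-promote the ignored warnings; if the ignores seem to have no effect, rely on the "error" entry here and drop the flag from the command. Keep the list narrow: a blanket ignore::DeprecationWarning hides regressions you need to see.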
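The ordering rule can be checked outside pytest with the standard warnings module, which likewise lets the most recently added filter win. This is only a minimal sketch: resolve_action is a hypothetical helper, the module names are placeholders, and it assumes pytest's handling of the ini list behaves analogously.

```python
import warnings


def resolve_action(module_name: str) -> str:
    """Return "error" or "ignored" for a DeprecationWarning attributed to module_name."""
    with warnings.catch_warnings():
        warnings.resetwarnings()
        # Same order as the pyproject.toml list: global "error" first ...
        warnings.filterwarnings("error")
        # ... then the scoped ignore, which is added later and therefore wins.
        warnings.filterwarnings(
            "ignore",
            category=DeprecationWarning,
            module=r"langchain_community.*",
        )
        try:
            warnings.warn_explicit(
                "deprecated API",
                DeprecationWarning,
                filename="example.py",
                lineno=1,
                module=module_name,
            )
        except DeprecationWarning:
            return "error"
        return "ignored"


if __name__ == "__main__":
    print(resolve_action("langchain_community.llms"))  # scoped ignore applies
    print(resolve_action("myservice.chains"))  # still promoted to an error
```

Note that the module part of a filter matches the module the warning is attributed to, which depends on the stacklevel the library passes, so a scoped ignore may need adjusting if a warning keeps slipping through.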

Unit tests use FakeListChatModel fixtures from F23 (do not redefine them here). One CI-specific gotcha (P43): FakeListChatModel does not emit response_metadata["token_usage"], so any callback that asserts on token counts will break. Either subclass the fake and inject generation_info, or gate the assertion:


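import pytest

def test_chain_uses_tokens(patched_chat_model):
    # Skip before invoking: the fake model never emits token_usage (P43).
    if patched_chat_model.__class__.__name__ == "FakeListChatModel":
        pytest.skip("fake model doesn't emit token_usage (P43)")
    # `chain` is the chain under test, built on patched_chat_model (see F23).
    result = chain.invoke({"input": "hi"})
    assert result.response_metadata["token_usage"]["total_tokens"] > 0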

Budget: the unit job should finish in <2 minutes across the 3-version matrix. If it doesn't, something is probably calling out to a real provider. Check the test count with pytest --collect-only -q | wc -l, list the slowest tests with pytest --durations=10, and audit which tests lack fake-model fixtures.

Step 3 — Integration job: VCR replay + filter_headers (P44)

Integration tests replay pre-recorded VCR cassettes. Three rules:

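  1. Gate the job. if: contains(github.event.pull_request.labels.*.name, 'run-integration') or env.RUN_INTEGRATION == "1", plus a nightly cron that flips to VCR_MODE=once and re-records against live APIs. PRs default to pure replay. vcrpy's once mode records only when no cassette file exists, so the nightly job has to delete stale cassettes first (or use all) for a re-record to happen.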
  2. Enforce filter_headers at the fixture level — not per-test. A single conftest.py prevents any contributor from recording a cassette with raw credentials.
  3. Pre-commit + CI both scan cassettes for leaked keys. Belt and suspenders.

Fixture (lives in tests/integration/conftest.py, owned by this skill's pipeline concern; F23 owns the recording workflow):


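import os

import pytest

# vcr_config is the fixture name read by pytest VCR plugins such as
# pytest-recording or pytest-vcr (assumed installed alongside vcrpy).
@pytest.fixture(scope="module")
def vcr_config():
    return {
        "filter_headers": [
            "authorization",
            "x-api-key",
            "anthropic-version",
            ("openai-organization", "REDACTED"),
        ],
        "filter_post_data_parameters": ["api_key"],
        # CI default is replay only; the nightly job overrides VCR_MODE.
        "record_mode": os.environ.get("VCR_MODE", "none"),
        "match_on": ["method", "scheme", "host", "port", "path", "query"],
    }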

The integration suite must finish in <5 minutes wall-clock on the runner, or you will start getting cancellation flakes from the concurrency block. If you exceed 5 minutes, split into a nightly-only long tier.

See Integration Gating for the full live-vs-replay decision tree, cost-per-run budget worksheet, and the VCR_MODE flip pattern.
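One way the VCR_MODE flip could be guarded is sketched below. It is an illustration, not the pattern from Integration Gating: resolve_record_mode and its policy (recording allowed only on scheduled runs) are assumptions, while GITHUB_ACTIONS and GITHUB_EVENT_NAME are default environment variables on GitHub runners and the four mode names are vcrpy's record modes.

```python
import os
from typing import Mapping

# Record modes defined by vcrpy.
VALID_MODES = {"none", "once", "new_episodes", "all"}


def resolve_record_mode(env: Mapping[str, str] = os.environ) -> str:
    """Pick the VCR record mode for this run, defaulting to replay-only."""
    mode = env.get("VCR_MODE", "none")
    if mode not in VALID_MODES:
        raise ValueError(
            f"unknown VCR_MODE {mode!r}; expected one of {sorted(VALID_MODES)}"
        )
    in_ci = env.get("GITHUB_ACTIONS") == "true"
    scheduled = env.get("GITHUB_EVENT_NAME") == "schedule"
    if in_ci and not scheduled and mode != "none":
        # Hypothetical policy: PR and push runs must never hit live APIs.
        raise RuntimeError(f"VCR_MODE={mode} is only allowed on the nightly schedule")
    return mode
```

A conftest.py could call this helper when building the record_mode entry, so a typo in the workflow fails fast instead of silently replaying.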

Step 4 — Eval-regression gate: merge-blocking PR comment

The eval job runs the langchain-eval-harness harness (see that skill for the harness itself; this skill only covers the CI wire-up) against both the PR branch and the merge base. Post a comment; block merge on regression.

scripts/run_eval.py is a thin CI wrapper: check out baseline and head via git worktree, run the harness at each ref, diff the results, post a PR comment, exit nonzero on regression. Full implementation in Eval Regression Gate.

Thresholds:

| Gate | Threshold | Rationale |
|---|---|---|
| Aggregate score drop | >2% | One-sigma noise on n=100 with well-behaved evals |
| Per-example score drop | >5% on any single case | Catches quiet regressions masked by aggregate averaging |
| Sample size floor | n ≥ 100 | Below this, aggregate delta is dominated by noise |
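The three thresholds can be expressed as a small check. This is a sketch under stated assumptions, not the run_eval.py implementation: check_gate and GateResult are hypothetical names, scores are assumed to lie in [0, 1] keyed by example id, and "drop" is read as an absolute difference rather than a relative one.

```python
from dataclasses import dataclass
from statistics import mean

AGG_DROP_MAX = 0.02      # aggregate score drop threshold (>2% fails)
EXAMPLE_DROP_MAX = 0.05  # per-example score drop threshold (>5% fails)
MIN_SAMPLES = 100        # sample size floor


@dataclass
class GateResult:
    passed: bool
    reasons: list[str]


def check_gate(baseline: dict[str, float], head: dict[str, float]) -> GateResult:
    """Compare per-example scores (example id -> score) at two refs."""
    shared = sorted(baseline.keys() & head.keys())
    reasons: list[str] = []
    if len(shared) < MIN_SAMPLES:
        reasons.append(f"only {len(shared)} shared examples; need >= {MIN_SAMPLES}")
        return GateResult(False, reasons)
    agg_drop = mean(baseline[k] for k in shared) - mean(head[k] for k in shared)
    if agg_drop > AGG_DROP_MAX:
        reasons.append(f"aggregate dropped by {agg_drop:.3f}")
    worst = [k for k in shared if baseline[k] - head[k] > EXAMPLE_DROP_MAX]
    if worst:
        reasons.append(
            f"{len(worst)} example(s) dropped by more than {EXAMPLE_DROP_MAX}"
        )
    return GateResult(not reasons, reasons)
```

The per-example check runs even when the aggregate passes, which is the point of the second row: a single badly regressed case barely moves the mean of 100 examples.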

The PR comment is a Markdown table with before / after / Δ per metric plus a bold red line if the gate failed. Marking the eval job as a required status check completes the enforcement. See Eval Regression Gate for the comment template and the noise-budget calculation.
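As a rough illustration of the comment body, the sketch below renders the before / after / Δ table. It is not the template from Eval Regression Gate: render_comment is a hypothetical helper, the wording of the failure line is made up, and the :red_circle: emoji is an assumed stand-in for "red", since GitHub comments do not render text colour.

```python
def render_comment(metrics: dict[str, tuple[float, float]], gate_failed: bool) -> str:
    """Build the PR comment body; metrics maps name -> (before, after)."""
    lines = ["| Metric | Before | After | Δ |", "|---|---|---|---|"]
    for name, (before, after) in sorted(metrics.items()):
        delta = after - before
        lines.append(f"| {name} | {before:.3f} | {after:.3f} | {delta:+.3f} |")
    if gate_failed:
        lines.append("")
        lines.append(":red_circle: **Eval gate failed: regression exceeds threshold.**")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_comment({"accuracy": (0.90, 0.85), "faithfulness": (0.80, 0.82)}, True))
```

The signed format (+.3f) keeps improvements and regressions visually distinct in the Δ column.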

Step 5 — Pre-commit hooks: secret scan + prompt lint

Two layers: local (pre-commit) and CI (re-runs the same hooks as a final catch). Local alone is not sufficient: contributors can skip hooks with git commit -n (--no-verify). CI alone is slow. Run both.

.pre-commit-config.yaml:


repos:
  - repo: local
    hooks:
      - id: vcr-secret-scan
        name: VCR cassette secret scan (P44)
        entry: python scripts/scan_cassettes.py
        language: system
        files: "tests/integration/cassettes/.*\\.ya?ml$"
        pass_filenames: true

      - id: prompt-convention-lint
        name: prompt-convention lint
        entry: python scripts/lint_prompts.py
        language: system
        files: "prompts/.*\\.j2$|src/.*prompts?\\.py$"

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9
    hooks:
      - id: ruff
      - id: ruff-format

  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.5.0
    hooks:
      - id: detect-secrets
        args: ["--baseline", ".secrets.baseline"]

scan_cassettes.py greps for sk-[A-Za-z0-9]{20,}, sk-ant-[A-Za-z0-9_-]{20,}, AIza[A-Za-z0-9_-]{35} (Google), xoxb-, and Bearer [A-Za-z0-9._-]{20,}. Fail on any match. This is your last line of defense before P44 ships to main. See Pre-Commit Hooks for the full pattern list, the prompt-convention lint rules (aligned with claude-prompt-conventions), and the detect-secrets baseline-rotation policy.
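A minimal sketch of what scan_cassettes.py might look like, using only the patterns named above. The function names are hypothetical, the underscores inside the character classes are an assumption about the intended patterns, and the full pattern list in Pre-Commit Hooks may differ.

```python
import re
import sys
from pathlib import Path

# Patterns named in the text; the real script may carry more.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),
    re.compile(r"sk-ant-[A-Za-z0-9_-]{20,}"),
    re.compile(r"AIza[A-Za-z0-9_-]{35}"),
    re.compile(r"xoxb-"),
    re.compile(r"Bearer [A-Za-z0-9._-]{20,}"),
]


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line number, pattern) for every pattern match, line by line."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern in SECRET_PATTERNS:
            if pattern.search(line):
                hits.append((lineno, pattern.pattern))
    return hits


def main(paths: list[str]) -> int:
    """Scan each file passed by pre-commit; return 1 if anything matched."""
    failed = False
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        for lineno, pattern in find_secrets(text):
            # Report location and pattern only; never echo the secret itself.
            print(f"{path}:{lineno}: matches {pattern}")
            failed = True
    return 1 if failed else 0


if __name__ == "__main__" and sys.argv[1:]:
    sys.exit(main(sys.argv[1:]))
```

Because the hook sets pass_filenames: true, pre-commit supplies the staged cassette paths as arguments; the CI re-run can pass the whole cassette directory listing instead.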

Step 6 — Dry-run chain loader: catch ImportError migration breaks

LangChain 0.x → 1.0 moved integrations into provider packages. A chain that still uses from langchain.chat_models import ChatOpenAI works in local dev if you have the old compat shim installed, and explodes in CI. Dry-run-load every chain module at lint time:


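# scripts/dryrun_load_chains.py
"""Import every chain module so migration ImportErrors fail the lint job."""
import importlib
import pathlib
import sys
import traceback

# Assumes the script runs from the repo root. Python puts scripts/ (not the
# root) on sys.path, so add the root to make src.chains.* importable.
sys.path.insert(0, str(pathlib.Path.cwd()))

failures = []
for py in sorted(pathlib.Path("src/chains").rglob("*.py")):
    parts = py.with_suffix("").parts
    if parts[-1] == "__init__":
        parts = parts[:-1]  # import the package itself, not pkg.__init__
    mod = ".".join(parts)  # portable: does not rely on "/" as the separator
    try:
        importlib.import_module(mod)
    except Exception:
        failures.append((mod, traceback.format_exc()))

if failures:
    for mod, tb in failures:
        # Keep the annotation on one line; print the traceback separately.
        print(f"::error::chain {mod} failed to import")
        print(tb)
    sys.exit(1)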

Runs in the lint job. Costs ~5 seconds. Catches every ImportError and every top-level NameError from a bad rename before a single unit test fires.

Output

  • GHA workflow with four isolated jobs (unit / integration / eval / lint)
  • pyproject.toml filterwarnings config that survives -W error (P45)
  • VCR conftest.py fixture with enforced filter_headers (P44)
  • run_eval.py CI wrapper that posts PR comments and blocks merge on regression
  • .pre-commit-config.yaml with cassette secret scan + prompt lint + ruff
  • Dry-run chain loader that catches migration ImportErrors

Gate policy

| Gate | Required? | Target speed | On failure |
|---|---|---|---|
| unit (3 Python versions) | yes, every PR | <2 min | block PR |
| lint + dryrun-load | yes, every PR | <30 s | block PR |
| integration (VCR replay) | on run-integration label or nightly | <5 min | block merge when run |
| integration (live, nightly cron) | no | <15 min | open issue on fail |
| eval regression (n≥100) | yes, every PR | <10 min | block merge if agg >2% or per-example >5% |
| pre-commit (local) | yes | <10 s | reject commit |

Error Handling

| Error | Cause | Fix |
|---|---|---|
| PytestUnraisableExceptionWarning during collection | -W error + SDK import-time DeprecationWarning (P45) | Add scoped filterwarnings = ["ignore::DeprecationWarning:langchain_community.*"] to pyproject.toml |
| VCR replay mismatch after weeks of passing | Cassette recorded at temp=0 on Anthropic (P05); model drift | Re-record on nightly cron with VCR_MODE=once; treat replay mismatches as eval-gate concerns, not unit failures |
| sk-ant-... in cassette flagged by reviewer | vcrpy records all headers by default (P44) | Enforce filter_headers in conftest.py; add scan_cassettes.py to pre-commit AND CI |
| Callback AssertionError: 'token_usage' not in response_metadata | FakeListChatModel doesn't emit metadata (P43) | Subclass the fake to inject generation_info, or pytest.skip on fake-model detection |
| ImportError: cannot import name 'ChatOpenAI' from 'langchain.chat_models' in CI only | Legacy compat shim installed locally, not in CI | Add dryrun_load_chains.py to lint job; fail at lint, not at test |
| Eval job times out at 10 min | n too large or harness not using asyncio concurrency | Cap at n=100 for PRs; run n=500 nightly; see F23 for async harness pattern |
| Concurrency block cancels integration run | Long job + rapid pushes | Do not disable; keep integration <5 min or split long tier to nightly |

Examples

Wiring a new repo from scratch

Copy the Step 1 workflow, the Step 2 pyproject.toml block, and the Step 5 pre-commit config. Create tests/unit/, tests/integration/cassettes/, scripts/run_eval.py, scripts/dryrun_load_chains.py, and scripts/scan_cassettes.py. Apply langchain-local-dev-loop (F23) first so fake-model fixtures exist before the unit job runs. Enable required status checks: unit (3.10), unit (3.11), unit (3.12), lint, eval. Integration stays optional (label-gated).

See GHA Workflow Reference for the complete copy-pasteable workflow.

Hardening after a P44 cassette-leak incident

Rotate every leaked key first (not a CI concern; that is incident response). Then: add scan_cassettes.py to pre-commit, re-scan the full history with git log -p -- tests/integration/cassettes/, rewrite history with git-filter-repo if keys hit main, and enforce the filter_headers fixture going forward. See Pre-Commit Hooks for the full pattern list and the detect-secrets baseline-rotation playbook.
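The history re-scan can be scripted so that it reports which commits introduced a key. The following is a sketch only: leaked_commits and scan_history are hypothetical helpers, the two patterns are a subset chosen for illustration, and the parser assumes git's default log format, where each entry starts with a "commit <hash>" line.

```python
import re
import subprocess

COMMIT_RE = re.compile(r"^commit ([0-9a-f]{7,40})")
KEY_RE = re.compile(r"sk-ant-[A-Za-z0-9_-]{20,}|Bearer [A-Za-z0-9._-]{20,}")


def leaked_commits(log_text: str) -> list[str]:
    """Return hashes of commits whose patch adds a line matching KEY_RE."""
    current = None
    leaked: list[str] = []
    for line in log_text.splitlines():
        match = COMMIT_RE.match(line)
        if match:
            current = match.group(1)
        elif line.startswith("+") and not line.startswith("+++"):
            # Only added lines matter; "+++" is the diff file header.
            if current and KEY_RE.search(line) and current not in leaked:
                leaked.append(current)
    return leaked


def scan_history(cassette_dir: str = "tests/integration/cassettes/") -> list[str]:
    """Run git log -p over the cassette directory and scan the patches."""
    result = subprocess.run(
        ["git", "log", "-p", "--", cassette_dir],
        capture_output=True,
        text=True,
        check=True,
    )
    return leaked_commits(result.stdout)
```

The returned hashes tell you how far back the exposure goes, which in turn decides whether a git-filter-repo rewrite is needed.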

Wiring the eval harness into an existing repo

The harness itself lives in langchain-eval-harness. THIS skill only supplies run_eval.py (the CI wrapper that reads the harness output, computes deltas, and posts PR comments) plus the gate thresholds. Drop in the Step 4 script, add the eval job to .github/workflows/tests.yml, make eval a required status check. See Eval Regression Gate for the PR-comment Markdown template and the n≥100 noise-budget derivation.
