palantir-observability

'Set up observability for Palantir Foundry integrations with metrics,

v2.0.0

Jeremy Longshore

MIT

3 Tools

palantir-pack Plugin

saas packs Category

Allowed Tools

Read, Write, Edit

Provided by Plugin

palantir-pack

Claude Code skill pack for Palantir (24 skills)

saas packs v1.0.0

Installation

This skill is included in the palantir-pack plugin:

/plugin install palantir-pack@claude-code-plugins-plus

Instructions

Palantir Observability

Overview

Set up comprehensive observability for Foundry integrations: structured logging with request IDs, Prometheus metrics for API latency/errors, health check endpoints, and alert rules.

Prerequisites

Working Foundry integration
Prometheus + Grafana (or equivalent monitoring stack)
Familiarity with palantir-prod-checklist

Instructions

Step 1: Structured Logging


import json
import logging
import time
import uuid


class FoundryLogger:
    """Emit one JSON log line per Foundry API call."""

    def __init__(self):
        self.logger = logging.getLogger("foundry")
        # getLogger returns a shared instance, so attach the handler only once.
        if not self.logger.handlers:
            handler = logging.StreamHandler()
            handler.setFormatter(logging.Formatter("%(message)s"))
            self.logger.addHandler(handler)
        self.logger.setLevel(logging.INFO)

    def log_api_call(self, method: str, endpoint: str, status: int, duration_ms: float):
        is_error = status >= 400
        # Emit at ERROR for 4xx/5xx so level-based filters agree with the "level" field.
        self.logger.log(logging.ERROR if is_error else logging.INFO, json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "request_id": str(uuid.uuid4())[:8],
            "service": "foundry",
            "method": method,
            "endpoint": endpoint,
            "status": status,
            "duration_ms": round(duration_ms, 2),
            "level": "error" if is_error else "info",
        }))
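
The logger above generates a fresh request ID for every line, so several log lines written during the same operation cannot be tied together. One possible way to share an ID is a `contextvars` variable combined with a `logging.Filter`. This is a minimal sketch using only the standard library; `request_id_var`, `RequestIdFilter`, `JsonFormatter`, `new_request`, and `build_logger` are hypothetical names, not part of the skill.

```python
import contextvars
import io
import json
import logging
import uuid

# Hypothetical context variable holding the ID of the operation in progress.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every record passing through."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "request_id": record.request_id,
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        })


def new_request():
    """Start a new logical operation and return its short ID."""
    rid = uuid.uuid4().hex[:8]
    request_id_var.set(rid)
    return rid


def build_logger(stream):
    """Create a JSON logger that writes to `stream` and tags each line with the request ID."""
    logger = logging.getLogger("foundry.correlated")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: both lines below carry the same request_id.
buffer = io.StringIO()
log = build_logger(buffer)
rid = new_request()
log.info("calling Foundry")
log.error("Foundry returned 503")
lines = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

Because the ID lives in a context variable, each asyncio task or thread sees its own value, so concurrent operations should not overwrite each other's IDs.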

Step 2: Prometheus Metrics


import time

from prometheus_client import Counter, Histogram, Gauge

foundry_requests = Counter(
    "foundry_api_requests_total",
    "Total Foundry API requests",
    ["method", "endpoint", "status"],
)
foundry_latency = Histogram(
    "foundry_api_latency_seconds",
    "Foundry API request latency",
    ["endpoint"],
    # The top bucket sits above the 10 s alert threshold so a p99 over 10 s is observable.
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0],
)
foundry_health = Gauge(
    "foundry_api_healthy",
    "1 if Foundry API is reachable, 0 otherwise",
)

def instrumented_call(client, method, *args, **kwargs):
    """Call a Foundry SDK method and record its request count and latency."""
    endpoint = method.__qualname__
    start = time.monotonic()
    # Set a default up front so the finally block never sees an unbound name.
    status = 200
    try:
        return method(*args, **kwargs)
    except Exception as e:
        # SDK API errors are assumed to expose status_code; anything else counts as 500.
        status = getattr(e, "status_code", 500)
        raise
    finally:
        duration = time.monotonic() - start
        foundry_requests.labels(method="API", endpoint=endpoint, status=str(status)).inc()
        foundry_latency.labels(endpoint=endpoint).observe(duration)
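
`instrumented_call` has to be invoked explicitly at every call site. If wrapping functions once is preferable, the same timing logic could be packaged as a decorator. The sketch below uses only the standard library and takes a hypothetical `record(endpoint, status, seconds)` callback in place of the Prometheus objects, so it can be wired to the counter and histogram above or to a test double. `instrument`, `FakeApiError`, and the two sample functions are illustrative names, not part of any Foundry SDK.

```python
import functools
import time


def instrument(record):
    """Build a decorator that reports (endpoint, status, seconds) to `record`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            status = 200
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Assumption: API errors expose status_code; others count as 500.
                status = getattr(exc, "status_code", 500)
                raise
            finally:
                record(func.__qualname__, status, time.monotonic() - start)

        return wrapper

    return decorator


# Usage with an in-memory recorder standing in for the Prometheus metrics.
samples = []


def remember(endpoint, status, seconds):
    samples.append((endpoint, status, seconds))


class FakeApiError(Exception):
    """Stand-in for an SDK error that carries an HTTP status."""
    status_code = 429


@instrument(remember)
def list_datasets():
    return ["dataset-a", "dataset-b"]


@instrument(remember)
def rate_limited():
    raise FakeApiError("too many requests")


result = list_datasets()
try:
    rate_limited()
except FakeApiError:
    pass
```

To connect this to Prometheus, `remember` could be swapped for a function that calls `foundry_requests.labels(...).inc()` and `foundry_latency.labels(...).observe(seconds)`.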

Step 3: Health Check with Metrics


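import asyncio
import time

async def foundry_health_check(client):
    """Probe Foundry with a cheap list call and update the foundry_health gauge."""
    start = time.monotonic()
    try:
        # The SDK call blocks, so run it in a worker thread to keep the event loop responsive.
        await asyncio.to_thread(lambda: list(client.ontologies.Ontology.list()))
        latency = (time.monotonic() - start) * 1000
        foundry_health.set(1)
        return {"status": "healthy", "latency_ms": round(latency, 1)}
    except Exception as e:
        foundry_health.set(0)
        return {"status": "unhealthy", "error": str(e)}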
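
The health check returns a dictionary but does not say how to expose it. A health endpoint would typically translate that dictionary into an HTTP status so that load balancers and probes can act on it. The sketch below uses the standard library's `http.server`; the `/healthz` path, the `health_to_http` helper, and the zero-argument `check` callable are assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_to_http(result):
    """Map a health-check dict to (HTTP status code, JSON body bytes)."""
    code = 200 if result.get("status") == "healthy" else 503
    return code, json.dumps(result).encode("utf-8")


def make_handler(check):
    """Build a request handler that serves check() at the assumed /healthz path."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":
                self.send_error(404)
                return
            code, body = health_to_http(check())
            self.send_response(code)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            # Silence the default stderr access log.
            pass

    return HealthHandler


def serve(check, port=8080):
    """Blocking server loop; `check` is any zero-argument callable returning a dict."""
    HTTPServer(("0.0.0.0", port), make_handler(check)).serve_forever()


# Usage: translate two sample results without starting a server.
ok_code, ok_body = health_to_http({"status": "healthy", "latency_ms": 12.3})
bad_code, bad_body = health_to_http({"status": "unhealthy", "error": "timeout"})
```

Returning 503 for an unhealthy result lets a probe fail on the status code alone. An async check can be adapted to the synchronous `check` callable with `asyncio.run`.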

Step 4: Alert Rules (Prometheus)


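groups:
  - name: foundry
    rules:
      - alert: FoundryAPIDown
        expr: foundry_api_healthy == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Foundry API unreachable for 2+ minutes"

      - alert: FoundryHighErrorRate
        # Share of 5xx responses among all responses, not a raw per-second rate.
        expr: |
          sum(rate(foundry_api_requests_total{status=~"5.."}[5m]))
            / sum(rate(foundry_api_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "More than 5% of Foundry API requests returned 5xx"

      - alert: FoundryHighLatency
        # histogram_quantile needs per-bucket rates aggregated by le.
        # The histogram needs a bucket boundary above 10 s for this rule to fire.
        expr: histogram_quantile(0.99, sum by (le) (rate(foundry_api_latency_seconds_bucket[5m]))) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Foundry API p99 latency above 10 s"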

Step 5: Dashboard Queries (Grafana)


# Request rate by status
sum by (status) (rate(foundry_api_requests_total[5m]))

# P99 latency
histogram_quantile(0.99, sum by (le) (rate(foundry_api_latency_seconds_bucket[5m])))

# Error ratio
sum(rate(foundry_api_requests_total{status=~"[45].."}[5m]))
/ sum(rate(foundry_api_requests_total[5m]))
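
Prometheus estimates quantiles from cumulative bucket counts by linear interpolation, so the result depends on where the bucket boundaries sit and cannot exceed the largest finite boundary. The following is a simplified Python rendering of that estimate for classic histograms, meant only to build intuition for the P99 query. The bucket boundaries and counts are made up for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile (0 < q <= 1) from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with a math.inf bucket,
    mirroring the `le` label on Prometheus histogram series.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the largest finite bucket, so the
                # estimate is capped at that bucket's upper bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Illustrative cumulative counts for 1000 requests, boundaries up to 10 s.
sample = [
    (0.1, 400), (0.25, 700), (0.5, 900), (1.0, 960),
    (2.5, 985), (5.0, 995), (10.0, 998), (math.inf, 1000),
]

p50 = histogram_quantile(0.50, sample)    # falls inside the 0.1-0.25 s bucket
p99 = histogram_quantile(0.99, sample)    # about 3.75 s, inside the 2.5-5 s bucket
p999 = histogram_quantile(0.999, sample)  # capped at 10.0, the largest finite bound
```

The capped `p999` value shows why a latency threshold has to sit below the largest finite bucket boundary: an estimate computed this way can equal that boundary but never exceed it.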

Output

Structured JSON logging with request IDs
Prometheus metrics for requests, latency, and health
Alert rules for API downtime, error rate, and latency
Grafana dashboard queries

Error Handling

Alert	Threshold	Action
API Down	Health check fails for 2 min	Page on-call, check `palantir-incident-runbook`
High Error Rate	5xx > 5% for 5 min	Check Foundry status, review logs
High Latency	p99 > 10 s for 10 min	Review query complexity, check Foundry load
Rate Limited	Spike in 429 responses	Tune rate limiter settings

Resources

Next Steps

For multi-environment setup, see palantir-multi-env-setup.
