glean-data-handling

PII filtering: strip emails, phone numbers, SSNs from document body

Tools: 6
Plugin: glean-pack
Category: saas packs

Allowed Tools

Read, Write, Edit, Bash(npm:*), Bash(curl:*), Grep

Provided by Plugin

glean-pack

Claude Code skill pack for Glean (24 skills)

saas packs v1.0.0

Installation

This skill is included in the glean-pack plugin:

/plugin install glean-pack@claude-code-plugins-plus


Instructions

Glean Data Handling

Overview

Glean enterprise search ingests documents from dozens of connectors (Google Drive, Confluence, Slack, Jira, Salesforce, etc.) and builds a unified search index with permission-aware access control. Data types include indexed document content, connector metadata, user permission maps, query logs, and search analytics. All document content must be PII-filtered before indexing, permission boundaries must be preserved to prevent data leakage across teams, and retention policies must be enforced to comply with corporate governance and GDPR/CCPA obligations.

Data Classification

| Data Type | Sensitivity | Retention | Encryption |
| Indexed document content | High (may contain PII) | Per source retention policy | AES-256 at rest |
| User permission maps | High (access control) | Sync lifecycle | TLS + at rest |
| Connector metadata | Medium | Until connector removed | AES-256 at rest |
| Search query logs | Medium (reveals intent) | 90 days default | AES-256 at rest |
| Search analytics/aggregates | Low | 1 year | TLS in transit |
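
The classification table can be mirrored in code so that retention checks read from a single place. The following is a minimal sketch, assuming the retention values listed in the table; `DataType`, `RetentionRule`, `RETENTION_RULES`, and `isPastRetention` are hypothetical names for illustration and are not part of any Glean SDK. Only the two rows with a fixed duration get a numeric limit.

```typescript
// Hypothetical mirror of the classification table; not a Glean API.
type DataType =
  | 'documentContent'
  | 'permissionMaps'
  | 'connectorMetadata'
  | 'queryLogs'
  | 'analytics';

interface RetentionRule {
  sensitivity: 'high' | 'medium' | 'low';
  // Maximum age in days. null means retention follows an external lifecycle
  // (source policy, permission sync, connector removal) rather than a fixed age.
  maxAgeDays: number | null;
}

const RETENTION_RULES: Record<DataType, RetentionRule> = {
  documentContent: { sensitivity: 'high', maxAgeDays: null },
  permissionMaps: { sensitivity: 'high', maxAgeDays: null },
  connectorMetadata: { sensitivity: 'medium', maxAgeDays: null },
  queryLogs: { sensitivity: 'medium', maxAgeDays: 90 },
  analytics: { sensitivity: 'low', maxAgeDays: 365 },
};

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns true when a record is older than the fixed retention window for its type.
function isPastRetention(type: DataType, createdAt: Date, now: Date = new Date()): boolean {
  const { maxAgeDays } = RETENTION_RULES[type];
  if (maxAgeDays === null) return false; // governed by an external lifecycle
  return now.getTime() - createdAt.getTime() > maxAgeDays * MS_PER_DAY;
}
```

A scheduled purge job could call `isPastRetention('queryLogs', log.createdAt)` for each log record and delete those that return true.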

Data Import


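// Simplified document shape used throughout this skill.
// Check the Glean Indexing API reference for the full schema before relying on it.
interface GleanDocument {
  id: string;
  datasource: string;
  title: string;
  body: string;
  permissions: { allowedUsers?: string[]; allowAnonymousAccess?: boolean };
  updatedAt: string;
  url: string;
}

async function indexDocuments(docs: GleanDocument[], datasource: string): Promise<void> {
  // Strip PII before any content leaves the process
  const sanitized = docs.map(doc => ({
    ...doc,
    body: stripPII(doc.body),
  }));
  // Upload in batches of at most 100 documents per request
  for (let i = 0; i < sanitized.length; i += 100) {
    const batch = sanitized.slice(i, i + 100);
    // "customer-be" is the host as written in this skill; substitute your own instance
    const res = await fetch(`https://customer-be.glean.com/api/index/v1/bulkindexdocuments`, {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${process.env.GLEAN_INDEXING_TOKEN}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ datasource, documents: batch }),
    });
    // fetch does not reject on HTTP errors, so check the status explicitly
    if (!res.ok) {
      throw new Error(`Bulk index failed for batch at offset ${i}: HTTP ${res.status}`);
    }
  }
}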

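// Regex-based redaction. The patterns cover common email, US phone, and US SSN formats only;
// names, addresses, and non-US identifiers need additional patterns.
function stripPII(text: string): string {
  return text
    .replace(/\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, '[EMAIL_REDACTED]')
    // SSN runs before phone so a looser phone pattern added later cannot claim SSN-shaped strings
    .replace(/\b\d{3}-\d{2}-\d{4}\b/g, '[SSN_REDACTED]')
    .replace(/\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, '[PHONE_REDACTED]');
}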
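
Because new PII patterns tend to surface after a datasource is already indexed, it can help to keep the patterns in a table rather than a fixed chain of `replace` calls. The sketch below shows one way to do that and also counts matches per category, which is useful for monitoring. `PIIPattern`, `PII_PATTERNS`, and `redactPII` are hypothetical names, and the regexes are the same illustrative email, US phone, and US SSN patterns used in `stripPII`, so they will miss other formats.

```typescript
interface PIIPattern {
  label: string; // used to build the replacement token, e.g. [EMAIL_REDACTED]
  regex: RegExp; // must carry the global flag so every match is replaced
}

// SSN is listed before PHONE as a precaution in case a looser phone pattern is added later.
const PII_PATTERNS: PIIPattern[] = [
  { label: 'EMAIL', regex: /\b[\w.+-]+@[\w-]+\.[\w.]+\b/g },
  { label: 'SSN', regex: /\b\d{3}-\d{2}-\d{4}\b/g },
  { label: 'PHONE', regex: /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g },
];

interface RedactionResult {
  text: string;
  counts: Record<string, number>; // number of replacements per label
}

function redactPII(text: string, patterns: PIIPattern[] = PII_PATTERNS): RedactionResult {
  const counts: Record<string, number> = {};
  let result = text;
  for (const { label, regex } of patterns) {
    // The replacer callback runs once per match, so it doubles as a counter
    result = result.replace(regex, () => {
      counts[label] = (counts[label] ?? 0) + 1;
      return `[${label}_REDACTED]`;
    });
  }
  return { text: result, counts };
}
```

Adding a pattern then means appending one entry to the list (or passing a custom list as the second argument) before re-indexing the affected datasource.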

Data Export


async function exportSearchAnalytics(startDate: string, endDate: string) {
  const res = await fetch(`https://customer-be.glean.com/api/v1/analytics`, {
    method: 'POST',
    headers: { Authorization: `Bearer ${process.env.GLEAN_API_TOKEN}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({ startDate, endDate, metrics: ['query_count', 'click_through', 'zero_results'] }),
  });
  // fetch does not reject on HTTP errors, so check the status explicitly
  if (!res.ok) throw new Error(`Analytics export failed: HTTP ${res.status}`);
  const data = await res.json();
  // Drop the userId key entirely (rather than setting it to undefined) and
  // mask missing or very short queries (3 characters or fewer)
  return (data.results ?? []).map(({ userId, ...rest }: any) => ({
    ...rest,
    query: typeof rest.query === 'string' && rest.query.length > 3 ? rest.query : '[SHORT_QUERY_REDACTED]',
  }));
}

Data Validation


// Returns a list of human-readable problems; an empty list means the document is valid.
function validateDocument(doc: GleanDocument): string[] {
  const errors: string[] = [];
  if (!doc.id || doc.id.length > 512) errors.push('Document ID missing or exceeds 512 chars');
  if (!doc.datasource) errors.push('Missing datasource identifier');
  if (!doc.title || doc.title.length > 1000) errors.push('Title missing or exceeds 1000 chars');
  if (!doc.body) errors.push('Empty document body');
  if (!doc.permissions) errors.push('Missing permissions (defaults to deny-all)');
  // updatedAt is required by GleanDocument, so a missing value is an error too
  if (!doc.updatedAt || isNaN(Date.parse(doc.updatedAt))) errors.push('Missing or invalid updatedAt timestamp');
  return errors;
}
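
Validation is most useful when it gates indexing. The sketch below shows how the error list returned by a validator such as `validateDocument` could be used to split a batch into documents that are safe to upload and documents to report. `partitionByValidation` and `Rejected` are hypothetical names; the helper is generic over the validator so that it stands alone here.

```typescript
interface Rejected<T> {
  item: T;
  errors: string[];
}

// Splits items into those with no validation errors and those with at least one.
function partitionByValidation<T>(
  items: T[],
  validate: (item: T) => string[],
): { valid: T[]; rejected: Rejected<T>[] } {
  const valid: T[] = [];
  const rejected: Rejected<T>[] = [];
  for (const item of items) {
    const errors = validate(item);
    if (errors.length === 0) {
      valid.push(item);
    } else {
      rejected.push({ item, errors });
    }
  }
  return { valid, rejected };
}
```

In this skill's terms, a caller might run `partitionByValidation(docs, validateDocument)`, pass `valid` to `indexDocuments`, and log `rejected` for follow-up rather than silently dropping those documents.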

Compliance

  • [ ] PII stripped from document body before indexing (emails, phones, SSNs)
  • [ ] Permission boundaries enforced: allowedUsers scope matches source system ACLs
  • [ ] Connector credentials stored in secret manager, rotated quarterly
  • [ ] Search query logs retained max 90 days, purged via automated job
  • [ ] GDPR right-to-erasure: delete all indexed content referencing a specific user on request
  • [ ] CCPA: honor do-not-sell signals for search analytics data
  • [ ] SOC 2 Type II audit trail for all indexing and deletion operations
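
For the audit-trail item in the checklist, each indexing or deletion call can emit a structured record. The following is a minimal sketch of what such a record might contain; `AuditEntry`, `AuditOperation`, and `buildAuditEntry` are hypothetical names, the field set is an assumption rather than a SOC 2 requirement, and persisting the record to an append-only store is left to the caller.

```typescript
type AuditOperation = 'index' | 'delete';

interface AuditEntry {
  timestamp: string; // ISO 8601, UTC
  actor: string; // service account or user that triggered the operation
  operation: AuditOperation;
  datasource: string;
  documentIds: readonly string[];
  success: boolean;
}

// Builds a frozen audit record so it cannot be modified after creation.
function buildAuditEntry(
  actor: string,
  operation: AuditOperation,
  datasource: string,
  documentIds: string[],
  success: boolean,
  now: Date = new Date(),
): Readonly<AuditEntry> {
  return Object.freeze({
    timestamp: now.toISOString(),
    actor,
    operation,
    datasource,
    // Copy and freeze the ID list so later changes to the input do not alter the record
    documentIds: Object.freeze([...documentIds]),
    success,
  });
}
```

Recording document IDs rather than document content keeps the audit trail itself free of the PII that the indexing pipeline removes.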

Error Handling

| Issue | Cause | Fix |
| 403 on bulk index | Expired or insufficient indexing token | Rotate token, verify datasource permissions |
| Permission mismatch in search | Stale ACL sync from connector | Force re-sync connector permissions via admin API |
| PII detected in indexed content | New PII pattern not in strip regex | Add pattern to stripPII, re-index affected datasource |
| Zero-result queries spike | Connector sync failure, stale index | Check connector health dashboard, trigger manual re-crawl |
| Rate limit 429 on indexing | Batch size too large or too frequent | Reduce batch to 50 docs, add 500ms delay between batches |
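
The fix for the 429 row (smaller batches plus a pause between them) can be sketched as a generic batching helper. This is an illustration under stated assumptions: `sendInBatches`, `BatchOptions`, and `sleep` are hypothetical names, the caller supplies a `send` function that uploads one batch and resolves to the HTTP status code, and the exponential backoff on 429 is an addition of this sketch, not something the table prescribes.

```typescript
interface BatchOptions {
  batchSize?: number; // defaults to 50, per the rate-limit fix
  delayMs?: number; // pause between batches, defaults to 500 ms
  maxRetries?: number; // retries per batch when the status is 429
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Sends items in batches; resolves to the number of items sent successfully.
async function sendInBatches<T>(
  items: T[],
  send: (batch: T[]) => Promise<number>,
  { batchSize = 50, delayMs = 500, maxRetries = 3 }: BatchOptions = {},
): Promise<number> {
  if (batchSize < 1) throw new Error('batchSize must be at least 1');
  let sent = 0;
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    let attempt = 0;
    for (;;) {
      const status = await send(batch);
      if (status >= 200 && status < 300) break;
      if (status === 429 && attempt < maxRetries) {
        attempt += 1;
        // Exponential backoff: 2x, 4x, 8x the base delay
        await sleep(delayMs * 2 ** attempt);
        continue;
      }
      throw new Error(`Batch starting at index ${i} failed with status ${status}`);
    }
    sent += batch.length;
    // Pause between batches, but not after the last one
    if (i + batchSize < items.length) await sleep(delayMs);
  }
  return sent;
}
```

Passing `send` in as a function keeps the helper independent of the endpoint and token, so the same loop can wrap the bulk-index request or be exercised with a stub in tests.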

Next Steps

See glean-security-basics.
