Detecting Data Anomalies
Overview
Identify anomalies and outliers in datasets using statistical and machine learning algorithms including Isolation Forest, One-Class SVM, Local Outlier Factor, and autoencoders. This skill handles the full detection pipeline from data ingestion and feature scaling through algorithm selection, threshold tuning, and result interpretation with anomaly scoring.
Prerequisites
- Python 3.9+ with scikit-learn >= 1.3 (pip install scikit-learn)
- pandas and NumPy for data manipulation (pip install pandas numpy)
- matplotlib or seaborn for anomaly visualizations (pip install matplotlib seaborn)
- Dataset in CSV, JSON, Parquet, or database-queryable format
- Minimum 500 data points for statistical significance (1000+ recommended)
- Optional: PyTorch or TensorFlow for autoencoder-based detection on complex patterns
Instructions
- Load the dataset using the Read tool and verify schema, column types, and row count
- Profile feature distributions using descriptive statistics to understand baseline behavior
- Handle missing values via imputation (median for numeric, mode for categorical) or row exclusion
- Apply StandardScaler or MinMaxScaler to numeric features to normalize magnitude differences
- Select the detection algorithm based on data characteristics:
  - Isolation Forest: high-dimensional data, no assumptions on distribution
  - One-Class SVM: well-defined normal class with clear decision boundary
  - Local Outlier Factor: density-varying data with local anomaly patterns
  - Autoencoder: complex temporal or image data with non-linear relationships
- Set the contamination parameter to the expected anomaly proportion (start with 0.01-0.05); for One-Class SVM, which has no contamination parameter, set nu instead
- Fit the model on the training partition and generate anomaly scores for each data point
- Apply the decision threshold to classify points as normal (1) or anomalous (-1), following the scikit-learn convention
- Analyze flagged anomalies for common characteristics, temporal clusters, or feature correlations
- Generate a summary report with detection counts, score distributions, and visualization plots
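The steps above can be sketched end to end as follows. This is a minimal illustration, not the reference implementation: the synthetic data, the column names feature_a and feature_b, and the contamination value of 0.02 are all assumptions chosen for the example. It uses Isolation Forest with median imputation and StandardScaler; scikit-learn's predict returns 1 for inliers and -1 for outliers.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset (hypothetical column names).
rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(980, 2))
outliers = rng.uniform(low=6.0, high=8.0, size=(20, 2))
df = pd.DataFrame(np.vstack([normal, outliers]), columns=["feature_a", "feature_b"])
df.loc[5, "feature_a"] = np.nan  # simulate one missing value

# Median imputation -> standardization -> Isolation Forest.
# contamination=0.02 is an assumed anomaly proportion, not a recommendation.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    IsolationForest(contamination=0.02, random_state=42),
)
model.fit(df)

# decision_function: lower values are more anomalous; negative values
# fall on the anomalous side of the fitted threshold.
scores = model.decision_function(df)
labels = model.predict(df)  # 1 = normal, -1 = anomalous

# Minimal summary of the run.
n_anomalies = int((labels == -1).sum())
summary = {
    "total_points": len(df),
    "anomaly_count": n_anomalies,
    "observed_rate": n_anomalies / len(df),
}
print(summary)
```

Here the model is fitted and scored on the same data for brevity; with a separate training partition, fit on that partition and call decision_function and predict on the data to be screened.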
See ${CLAUDESKILLDIR}/references/implementation.md for the detailed implementation guide.
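For the other scikit-learn detectors named in the algorithm-selection step, a comparable sketch is shown here. The data is synthetic and the parameter values (n_neighbors=20, nu=0.02, an RBF kernel) are illustrative assumptions. Note that Local Outlier Factor in its default mode scores only the data it was fitted on, and that One-Class SVM is configured through nu, an upper bound on the fraction of training errors, instead of contamination.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Synthetic data: a dense normal cloud plus a few distant points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(490, 2)),
    rng.uniform(6.0, 8.0, size=(10, 2)),
])
X_scaled = StandardScaler().fit_transform(X)

# Local Outlier Factor: compares each point's local density to that of
# its neighbors. fit_predict labels the training data itself.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
lof_labels = lof.fit_predict(X_scaled)        # 1 = normal, -1 = anomalous
lof_scores = lof.negative_outlier_factor_     # lower = more anomalous

# One-Class SVM: learns a boundary around the normal class.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.02)
ocsvm_labels = ocsvm.fit(X_scaled).predict(X_scaled)   # 1 / -1 as above
ocsvm_scores = ocsvm.decision_function(X_scaled)       # negative = outside boundary

print("LOF flagged:", int((lof_labels == -1).sum()))
print("One-Class SVM flagged:", int((ocsvm_labels == -1).sum()))
```

To score unseen data with Local Outlier Factor, construct it with novelty=True and use predict or decision_function on the new data instead of fit_predict.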
Output
- Anomaly detection summary: total points, anomaly count, contamination rate
- Per-record anomaly scores with classification labels
- Algorithm configuration: model type, contamination, distance metric, threshold
- Feature importance ranking showing which dimensions drive anomaly flags
- Visualization: scatter plot of anomaly scores, distribution histogram, t-SNE cluster plot
- CSV export of flagged records with anomaly scores and con