Event Clustering Project
This repo contains a set of JSON data files and Jupyter notebooks for exploring and clustering log entries from multiple tenants. The primary goal is to group similar log events using several clustering approaches (HDBSCAN, KMeans, MiniBatchKMeans) and text-representation techniques (TF-IDF, BERT embeddings).
Files and Structure
- train2_fixed.json
  - Pre-processed log data (14k rows) ready for the clustering pipelines.
- dataClustering.json
  - The original, raw event data before cleaning/parsing.
- EDA.ipynb
  - Notebook performing Exploratory Data Analysis (EDA): distributions, time-based counts, category/severity summaries, and visualizations (line charts, bar charts, UMAP/t-SNE plots, etc.).
- PipelineAllMiniLM.ipynb
  - Demonstrates clustering with HDBSCAN + BERT embeddings (the all-MiniLM-L6-v2 model).
  - Shows how to generate the embeddings, one-hot encode other features (category, severity, producer), and cluster the combined matrix with HDBSCAN (a minimal sketch of this flow follows the file list).
- PipelineHDBSCAN.ipynb
  - Same pipeline as above, but with TF-IDF + HDBSCAN instead of BERT embeddings.
- PipelineKMeans.ipynb
  - Example KMeans clustering pipeline with TF-IDF and numeric features (hour_of_day, message_length); also sketched below.
- PipelineMiniBatchKMeans.ipynb
  - Demonstrates the MiniBatchKMeans approach for partial-fitting new data ("online" clustering).
  - Uses TF-IDF + numeric features, updated in small batches.
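
The sketch below illustrates the PipelineAllMiniLM.ipynb flow in minimal form: embed messages with all-MiniLM-L6-v2, one-hot encode category/severity/producer, and cluster the combined matrix with HDBSCAN. Column names follow this README; the hyperparameters and the assumption that train2_fixed.json loads directly into a flat DataFrame are illustrative, not the notebook's exact code.

```python
import numpy as np
import pandas as pd
import hdbscan
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import OneHotEncoder

# Assumes the file parses into a flat DataFrame with the columns used below
df = pd.read_json("train2_fixed.json")

# 384-dimensional sentence embeddings of the raw log messages
model = SentenceTransformer("all-MiniLM-L6-v2")
text_emb = model.encode(df["message"].astype(str).tolist(), show_progress_bar=True)

# One-hot encode the categorical metadata (use sparse=False on scikit-learn < 1.2)
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
cat_features = encoder.fit_transform(df[["category", "severity", "producer"]])

# Concatenate text embeddings with the one-hot features and cluster
X = np.hstack([text_emb, cat_features])
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, prediction_data=True)
df["cluster_label"] = clusterer.fit_predict(X)

print(df["cluster_label"].value_counts())  # -1 marks HDBSCAN noise points
```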
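
The TF-IDF route (PipelineHDBSCAN.ipynb / PipelineKMeans.ipynb) can be sketched in the same spirit, shown here with KMeans plus the numeric columns described under Key Features. The vectorizer settings, the cluster count, and the assumption that hour_of_day / message_length are already present in the file are illustrative only.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

df = pd.read_json("train2_fixed.json")

# Sparse TF-IDF matrix over the log messages
tfidf = TfidfVectorizer(max_features=5000)
text_features = tfidf.fit_transform(df["message"].astype(str))

# Scale the numeric columns so they are comparable to the TF-IDF weights
numeric = StandardScaler().fit_transform(df[["hour_of_day", "message_length"]])

# Combine the sparse text features with the (sparsified) numeric block
X = hstack([text_features, csr_matrix(numeric)])

kmeans = KMeans(n_clusters=20, n_init=10, random_state=42)
df["cluster_label"] = kmeans.fit_predict(X)
```

Swapping KMeans for MiniBatchKMeans and feeding X in slices via partial_fit gives the online variant used in PipelineMiniBatchKMeans.ipynb.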
Key Features
- Tenant-Specific Pipelines
  - Each tenant's logs are handled separately to avoid larger tenants overshadowing smaller tenants' patterns.
- Text Embeddings
  - TF-IDF approach (PipelineHDBSCAN.ipynb, PipelineKMeans.ipynb)
  - BERT approach (PipelineAllMiniLM.ipynb)
- Numeric Columns
  - hour_of_day (extracted from received_at)
  - message_length (characters in message)
  - Including these numeric attributes alongside the text embeddings enhances cluster quality by considering time-of-day and log message size (derivation sketched after this list).
- New Incoming Data
  - Demonstrations of approximate_predict (HDBSCAN) or partial_fit (MiniBatchKMeans) for logs with known or new tenant IDs (also sketched below).
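
A minimal sketch of how these two numeric columns can be derived with pandas (the exact parsing in the notebooks may differ; reading from dataClustering.json and the errors="coerce" choice are assumptions):

```python
import pandas as pd

df = pd.read_json("dataClustering.json")

# Hour of day from the event timestamp
df["received_at"] = pd.to_datetime(df["received_at"], errors="coerce")
df["hour_of_day"] = df["received_at"].dt.hour

# Number of characters in the raw log message
df["message_length"] = df["message"].astype(str).str.len()
```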
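
Handling new incoming data can be sketched as below. It assumes an HDBSCAN model fitted with prediction_data=True, a fitted MiniBatchKMeans, and a hypothetical build_features() helper that reproduces the training feature matrix for the new rows; the function and output column names are illustrative only.

```python
import hdbscan
import pandas as pd

def score_new_events(new_df: pd.DataFrame, clusterer, mbk, build_features):
    """Label freshly received log events using the already-fitted models."""
    X_new = build_features(new_df)  # must mirror the training feature pipeline

    # HDBSCAN: approximate assignment of new points to the existing clusters
    labels, strengths = hdbscan.approximate_predict(clusterer, X_new)
    new_df["cluster_label_hdbscan"] = labels
    new_df["cluster_strength"] = strengths

    # MiniBatchKMeans: update the centroids online, then label the batch
    mbk.partial_fit(X_new)
    new_df["cluster_label_kmeans"] = mbk.predict(X_new)
    return new_df
```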
How to Use
- Data: Ensure train2_fixed.json and dataClustering.json are in place.
- EDA: Run EDA.ipynb to get an overview of data distributions and relationships.
- Choose a Pipeline:
  - BERT + HDBSCAN (PipelineAllMiniLM.ipynb),
  - TF-IDF + HDBSCAN (PipelineHDBSCAN.ipynb),
  - KMeans (PipelineKMeans.ipynb), or
  - MiniBatchKMeans (PipelineMiniBatchKMeans.ipynb).
- Clustering Results: Check the cluster_label column produced for each tenant in the notebooks (a per-tenant summary helper is sketched below).
- Adapt: Modify the scripts for additional tenants, different hyperparameters, or streaming scenarios.
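
For a quick per-tenant view of the results, a small helper like the one below can summarize the labelled DataFrame; cluster_label comes from the pipelines above, while the tenant_id column name is an assumption.

```python
import pandas as pd

def summarize_clusters(df: pd.DataFrame) -> pd.DataFrame:
    """Count events per (tenant, cluster) pair, largest clusters first."""
    return (
        df.groupby(["tenant_id", "cluster_label"])
          .size()
          .rename("events")
          .reset_index()
          .sort_values(["tenant_id", "events"], ascending=[True, False])
    )
```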