Event Clustering Project
This repo contains a set of JSON data files and Jupyter notebooks for exploring and clustering log entries from multiple tenants. The primary goal is to group similar log events using several clustering approaches (HDBSCAN, KMeans, MiniBatchKMeans) and text-representation techniques (TF-IDF, BERT embeddings).
Files and Structure
- train2_fixed.json
  - Pre-processed log data (14k rows) ready for the clustering pipelines.
- dataClustering.json
  - The original, raw event data before cleaning/parsing.
- EDA.ipynb
  - Notebook performing Exploratory Data Analysis (EDA): distributions, time-based counts, category/severity summaries, and visualizations (line charts, bar charts, UMAP/t-SNE plots, etc.).
- PipelineAllMiniLM.ipynb
  - Demonstrates clustering with HDBSCAN + BERT embeddings (the all-MiniLM-L6-v2 model).
  - Shows how to generate the embeddings, one-hot encode other features (category, severity, producer), and cluster the combined matrix with HDBSCAN (a minimal sketch of this flow follows the file list).
- PipelineHDBSCAN.ipynb
  - Same pipeline as above, but with TF-IDF + HDBSCAN instead of BERT embeddings.
- PipelineKMeans.ipynb
  - Example KMeans clustering pipeline with TF-IDF and numeric features (hour_of_day, message_length); also sketched below.
- PipelineMiniBatchKMeans.ipynb
  - Demonstrates the MiniBatchKMeans approach for partial-fitting new data ("online" clustering).
  - Uses TF-IDF + numeric features, updated in small batches.
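
The sketch below illustrates the PipelineAllMiniLM.ipynb flow in minimal form: embed messages with all-MiniLM-L6-v2, one-hot encode category/severity/producer, and cluster the combined matrix with HDBSCAN. Column names follow this README; the hyperparameters and the assumption that train2_fixed.json loads directly into a flat DataFrame are illustrative, not the notebook's exact code.

```python
import numpy as np
import pandas as pd
import hdbscan
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import OneHotEncoder

# Assumes the file parses into a flat DataFrame with the columns used below
df = pd.read_json("train2_fixed.json")

# 384-dimensional sentence embeddings of the raw log messages
model = SentenceTransformer("all-MiniLM-L6-v2")
text_emb = model.encode(df["message"].astype(str).tolist(), show_progress_bar=True)

# One-hot encode the categorical metadata (use sparse=False on scikit-learn < 1.2)
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
cat_features = encoder.fit_transform(df[["category", "severity", "producer"]])

# Concatenate text embeddings with the one-hot features and cluster
X = np.hstack([text_emb, cat_features])
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, prediction_data=True)
df["cluster_label"] = clusterer.fit_predict(X)

print(df["cluster_label"].value_counts())  # -1 marks HDBSCAN noise points
```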
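
The TF-IDF route (PipelineHDBSCAN.ipynb / PipelineKMeans.ipynb) can be sketched in the same spirit, shown here with KMeans plus the numeric columns described under Key Features. The vectorizer settings, the cluster count, and the assumption that hour_of_day / message_length are already present in the file are illustrative only.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

df = pd.read_json("train2_fixed.json")

# Sparse TF-IDF matrix over the log messages
tfidf = TfidfVectorizer(max_features=5000)
text_features = tfidf.fit_transform(df["message"].astype(str))

# Scale the numeric columns so they are comparable to the TF-IDF weights
numeric = StandardScaler().fit_transform(df[["hour_of_day", "message_length"]])

# Combine the sparse text features with the (sparsified) numeric block
X = hstack([text_features, csr_matrix(numeric)])

kmeans = KMeans(n_clusters=20, n_init=10, random_state=42)
df["cluster_label"] = kmeans.fit_predict(X)
```

Swapping KMeans for MiniBatchKMeans and feeding X in slices via partial_fit gives the online variant used in PipelineMiniBatchKMeans.ipynb.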
Key Features
- Tenant-Specific Pipelines
  - Each tenant's logs are handled separately to avoid larger tenants overshadowing smaller tenants' patterns.
- Text Embeddings
  - TF-IDF approach (PipelineHDBSCAN.ipynb, PipelineKMeans.ipynb)
  - BERT approach (PipelineAllMiniLM.ipynb)
- Numeric Columns
  - hour_of_day (extracted from received_at)
  - message_length (characters in message)
  - Including these numeric attributes alongside the text embeddings enhances cluster quality by considering time-of-day and log message size (derivation sketched after this list).
- New Incoming Data
  - Demonstrations of approximate_predict (HDBSCAN) or partial_fit (MiniBatchKMeans) for logs with known or new tenant IDs (also sketched below).
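
A minimal sketch of how these two numeric columns can be derived with pandas (the exact parsing in the notebooks may differ; reading from dataClustering.json and the errors="coerce" choice are assumptions):

```python
import pandas as pd

df = pd.read_json("dataClustering.json")

# Hour of day from the event timestamp
df["received_at"] = pd.to_datetime(df["received_at"], errors="coerce")
df["hour_of_day"] = df["received_at"].dt.hour

# Number of characters in the raw log message
df["message_length"] = df["message"].astype(str).str.len()
```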
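
Handling new incoming data can be sketched as below. It assumes an HDBSCAN model fitted with prediction_data=True, a fitted MiniBatchKMeans, and a hypothetical build_features() helper that reproduces the training feature matrix for the new rows; the function and output column names are illustrative only.

```python
import hdbscan
import pandas as pd

def score_new_events(new_df: pd.DataFrame, clusterer, mbk, build_features):
    """Label freshly received log events using the already-fitted models."""
    X_new = build_features(new_df)  # must mirror the training feature pipeline

    # HDBSCAN: approximate assignment of new points to the existing clusters
    labels, strengths = hdbscan.approximate_predict(clusterer, X_new)
    new_df["cluster_label_hdbscan"] = labels
    new_df["cluster_strength"] = strengths

    # MiniBatchKMeans: update the centroids online, then label the batch
    mbk.partial_fit(X_new)
    new_df["cluster_label_kmeans"] = mbk.predict(X_new)
    return new_df
```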
How to Use
- Data: Ensure train2_fixed.json and dataClustering.json are in place.
- EDA: Run EDA.ipynb to get an overview of data distributions and relationships.
- Choose a Pipeline:
  - BERT + HDBSCAN (PipelineAllMiniLM.ipynb),
  - TF-IDF + HDBSCAN (PipelineHDBSCAN.ipynb),
  - KMeans (PipelineKMeans.ipynb), or
  - MiniBatchKMeans (PipelineMiniBatchKMeans.ipynb).
- Clustering Results: Check the cluster_label column produced for each tenant in the notebooks (a per-tenant summary helper is sketched below).
- Adapt: Modify the scripts for additional tenants, different hyperparameters, or streaming scenarios.
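
For a quick per-tenant view of the results, a small helper like the one below can summarize the labelled DataFrame; cluster_label comes from the pipelines above, while the tenant_id column name is an assumption.

```python
import pandas as pd

def summarize_clusters(df: pd.DataFrame) -> pd.DataFrame:
    """Count events per (tenant, cluster) pair, largest clusters first."""
    return (
        df.groupby(["tenant_id", "cluster_label"])
          .size()
          .rename("events")
          .reset_index()
          .sort_values(["tenant_id", "events"], ascending=[True, False])
    )
```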