A general-purpose clustering framework for mixed-type tabular data — built on a single, uncomfortable observation: the space where traditional clustering happens is the wrong space.
Every clustering method you have ever used operates in feature-encoded space: raw numbers, one-hot vectors, label integers. This is where K-Means minimises inertia. This is where DBSCAN draws epsilon-balls. This is where Silhouette scores are computed. And this is, fundamentally, the wrong place to look for meaningful human groupings in real-world tabular data.
Here is the problem. Suppose your data has a column JobRole with values:
Sales Executive → [1, 0, 0, 0, 0, 0, 0, 0, 0]
Manager → [0, 1, 0, 0, 0, 0, 0, 0, 0]
Research Director → [0, 0, 1, 0, 0, 0, 0, 0, 0]
HR Representative → [0, 0, 0, 1, 0, 0, 0, 0, 0]
In one-hot space, every pair of distinct roles is exactly equidistant: the Euclidean distance is always √2. A Sales Executive is as far from a Manager as from an HR Representative. The encoding cannot express that Sales Executive and Sales Representative (another value of the same column) both involve targets and quotas, while HR Representative involves neither. That semantic geometry simply does not exist in the encoded feature space.
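The equidistance claim is easy to check by hand. The sketch below is a minimal illustration using only the standard library; the nine-slot vector length follows the JobRole example above, and the helper name `one_hot` is ours.

```python
import itertools
import math

roles = ["Sales Executive", "Manager", "Research Director", "HR Representative"]
n_categories = 9  # JobRole has nine values in the example above


def one_hot(index, size):
    """Return a one-hot list with a 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


vectors = {role: one_hot(i, n_categories) for i, role in enumerate(roles)}

# Euclidean distance between every pair of distinct roles.
distances = {
    (a, b): math.dist(vectors[a], vectors[b])
    for a, b in itertools.combinations(roles, 2)
}

for (a, b), d in distances.items():
    print(f"{a:>18} <-> {b:<18} {d:.4f}")
```

Every printed distance is the same value, which is the whole problem: the encoding carries no information about which roles are related.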
Now look at what happens in the embedding space of a pretrained language model:
"Sales Executive" → [0.55, 0.21, -0.44, ...] ┐ close together
"Sales Representative" → [0.53, 0.24, -0.41, ...] ┘
"Manager" → [0.12, 0.67, -0.08, ...] ┐ different region
"Research Director" → [0.09, 0.71, -0.05, ...] ┘
"HR Representative" → [-0.34, 0.44, 0.33, ...] separate region
Roles that mean similar things live close together. Roles that are conceptually distant are far apart. The geometry is real. The distances correspond to something.
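As a rough illustration, cosine similarity can be computed on just the three components printed above. This is only a sketch: the real vectors have 384 dimensions, so the numbers below are illustrative and are not the encoder's actual similarities.

```python
import math

# Only the three components shown above; the remaining dimensions are elided
# in the text, so these similarities are illustrative.
shown = {
    "Sales Executive": [0.55, 0.21, -0.44],
    "Sales Representative": [0.53, 0.24, -0.41],
    "Manager": [0.12, 0.67, -0.08],
    "Research Director": [0.09, 0.71, -0.05],
    "HR Representative": [-0.34, 0.44, 0.33],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


anchor = "Sales Executive"
sims = {
    role: cosine(shown[anchor], vec)
    for role, vec in shown.items()
    if role != anchor
}

# Most similar role first.
for role, s in sorted(sims.items(), key=lambda kv: -kv[1]):
    print(f"{role:<22} {s:+.3f}")
```

Even on this truncated view, the other sales role ranks first and HR Representative ranks last relative to Sales Executive.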
SIMBA's argument is this: if you want to find groups of records that are genuinely similar in the way a human expert would recognise, you should cluster in the space where human conceptual similarity already has metric structure — the embedding space. Not after. Not alongside. Before everything else.
This is not a preprocessing trick. It is a philosophical flip. Traditional methods encode first, then try to cluster in a broken space. SIMBA maps everything into a space where meaning is already metrically organised, then clusters there. We are not improving the clustering algorithm. We are replacing the space it runs in.
The statistics themselves flip:
| What you compute | Feature-encoded space | Embedding space |
|---|---|---|
| Distance between "Sales Executive" and "Sales Representative" | √2 (same as any other pair) | Small (both are sales roles) |
| Distance between "HR" and "Sales" | √2 (same as any other pair) | Large (conceptually unrelated domains) |
| Silhouette of discovered clusters | 0.12 – 0.14 (traditional K-Means) | 0.86 (SIMBA) |
| Clusters that make business sense | Sometimes | Consistently |
Real tabular data is almost never purely numeric. A typical HR record has age, salary (numbers), job role, department, education field (categorical text), satisfaction scores (ordinal integers), and marital status (binary category). Traditional clustering pipelines require you to:
- One-hot encode the categoricals (exploding dimensionality)
- Normalise the numerics
- Pick a distance metric that somehow respects both
- Hope the clusters that emerge mean something
Each of these steps discards information. One-hot encoding loses the fact that "Life Sciences" and "Medical" are adjacent fields. Normalisation loses the fact that an age of 22 in a senior-role employee is far more unusual than an age of 22 in an entry-level one. And the final distance metric treats all these patched-together features as if they belong to the same geometric space — which they do not.
SIMBA sidesteps all of this. Every cell — regardless of type — becomes a natural-language string "column_name: value" and is passed through a pretrained language encoder. The encoder already understands that "age: 22" and "age: 58" are far apart on a meaningful axis. It already understands that "education_field: Life Sciences" and "education_field: Medical" are closer to each other than either is to "education_field: Marketing". You do not have to tell it. It learned this from billions of text tokens.
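A minimal sketch of the serialisation step, assuming each record is available as a column-to-value mapping. The function names are ours, not part of the repository.

```python
def cell_to_text(column, value):
    """Serialise one cell as the "column_name: value" string described above."""
    return f"{column}: {value}"


def serialise_row(row):
    """Turn a mapping of column -> value into one string per cell."""
    return [cell_to_text(col, val) for col, val in row.items()]


record = {
    "Department": "Sales",
    "JobRole": "Sales Executive",
    "Age": 41,
    "Education": "Life Sciences",
}
texts = serialise_row(record)
print(texts)
```

Numbers and categories go through the same path, so no type-specific branching is needed at this stage.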
Each cell is embedded individually:
row 0, col "Department" → "Department: Sales" → 384-dim vector
row 0, col "JobRole" → "JobRole: Sales Executive" → 384-dim vector
row 0, col "Age" → "Age: 41" → 384-dim vector
row 0, col "Education" → "Education: Life Sciences" → 384-dim vector
... ... ...
Result shape: (N_rows × N_cols × 384)
Why not embed the whole row as one long string? Two reasons. First, language models lose focus on long inputs — the early fields get diluted. Second, and more importantly, cell-level embedding gives each column its own clean representation independent of what appears in other columns of the same row. This means the weighting step (below) can amplify or suppress individual columns without contamination.
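The cell-level layout can be sketched as follows. To keep the example runnable without a model download, it uses a stand-in encoder (`toy_encode`, hypothetical) that returns deterministic pseudo-random unit vectors and carries no semantics. Encoding each distinct string only once is our assumption about a sensible implementation, not something the text specifies.

```python
import hashlib

import numpy as np


def toy_encode(texts, dim=384):
    """Stand-in encoder: a deterministic pseudo-random unit vector per string.

    It carries no semantics; swap in a real sentence encoder for actual use.
    """
    out = np.empty((len(texts), dim))
    for k, text in enumerate(texts):
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        seed = int.from_bytes(digest[:8], "little")
        vec = np.random.default_rng(seed).standard_normal(dim)
        out[k] = vec / np.linalg.norm(vec)
    return out


def embed_cells(rows, columns, encode=toy_encode):
    """Embed every cell separately and return an array of shape (N, M, D)."""
    texts = [[f"{col}: {row[col]}" for col in columns] for row in rows]
    # Encode each distinct string once; tabular data repeats values heavily.
    unique = sorted({t for row_texts in texts for t in row_texts})
    vectors = np.asarray(encode(unique))
    lookup = {t: vectors[k] for k, t in enumerate(unique)}
    return np.stack(
        [np.stack([lookup[t] for t in row_texts]) for row_texts in texts]
    )


rows = [
    {"Department": "Sales", "JobRole": "Sales Executive", "Age": 41},
    {"Department": "Sales", "JobRole": "Manager", "Age": 33},
]
cells = embed_cells(rows, ["Department", "JobRole", "Age"])
print(cells.shape)  # (2, 3, 384)
```

With sentence-transformers installed, passing `SentenceTransformer("all-MiniLM-L6-v2").encode` as `encode` should give real 384-dim embeddings in the same layout.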
Not all columns deserve equal say in a person's (or record's) identity. A PerformanceRating column with only two possible values carries far less information than a JobRole column with nine semantically distinct categories.
SIMBA computes a global weight for each column:
global_weight_j = semantic_variance_j × (1 - uniqueness_ratio_j)
- semantic_variance: how spread out are the 384-dim embeddings for this column across all rows? A column whose values are semantically diverse (many different meanings) has high variance.
- uniqueness_ratio: n_unique_values / n_rows. Penalises ID-like columns (employee number, row index) that are unique per row and carry zero clustering signal.
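One way to read the formula is sketched below. Two points are assumptions on our part: semantic variance is taken as the mean squared distance of a column's embeddings from the column mean, and the weights are normalised to sum to 1 so they can be reported as percentages.

```python
import numpy as np


def global_weights(cells, values):
    """Per-column weights from cell embeddings and raw cell values.

    cells:  array of shape (N, M, D) with one embedding per cell.
    values: N rows of M raw (hashable) cell values.
    Returns M weights that sum to 1.
    """
    n_rows, n_cols, _ = cells.shape
    # Assumed definition: mean squared distance to the column mean embedding.
    centred = cells - cells.mean(axis=0, keepdims=True)
    semantic_variance = (centred ** 2).sum(axis=2).mean(axis=0)
    uniqueness = np.array(
        [len({row[j] for row in values}) / n_rows for j in range(n_cols)]
    )
    raw = semantic_variance * (1.0 - uniqueness)
    return raw / raw.sum()


# Toy table: column 0 is an ID, column 1 is constant, column 2 has two values.
rng = np.random.default_rng(0)
values = [[i, "x", "a" if i % 2 else "b"] for i in range(6)]
vector_for = {}  # one random vector per distinct (column, value) pair
cells = np.array([
    [vector_for.setdefault((j, v), rng.standard_normal(8)) for j, v in enumerate(row)]
    for row in values
])
weights = global_weights(cells, values)
print(np.round(weights, 3))
```

The ID column is zeroed by the uniqueness penalty and the constant column by its zero variance, so essentially all the weight lands on the third column.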
Actual weights computed on IBM HR (1470 employees, 29 columns):
| Column | Weight | Reason |
|---|---|---|
| JobRole | 12.7% | 9 semantically diverse roles |
| EducationField | 8.6% | 6 distinct fields with real distance structure |
| Department | 7.7% | 3 meaningful departments |
| Age | 7.1% | wide numeric range with real semantics |
| WorkLifeBalance | 0.7% | only 4 ordinal values, low variance |
| PerformanceRating | 0.5% | only 2 unique values |
| Attrition | 0.4% | 84% No — nearly constant, useless for clustering |
This is not manual feature selection. It is automatic, driven by the semantic geometry of the data itself.
Beyond global weights, each row gets per-column local weights:
local_weight(row_i, col_j) = L2 distance of row_i's embedding from the column mean
A cell that is unusual compared to the rest of its column pulls far from the column mean — it is a more distinctive cell, and it should dominate that row's identity. A cell that is typical sits close to the column mean — it is background noise for this row.
The combined weight is:
final_weight(i, j) = global_weight_j × local_weight(row_i, col_j)
Normalised per row to sum to 1. Each row ends up with a personalised weighting that emphasises its own unusual features.
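The local and combined weights can be sketched in a few lines of NumPy. The `eps` guard for rows whose cells all sit exactly on their column means is our addition; the text does not say how that case is handled.

```python
import numpy as np


def final_weights(cells, global_w, eps=1e-12):
    """Combine global and local weights.

    cells: (N, M, D) cell embeddings; global_w: (M,) column weights.
    Returns an (N, M) array whose rows sum to 1.
    """
    column_mean = cells.mean(axis=0, keepdims=True)        # (1, M, D)
    local = np.linalg.norm(cells - column_mean, axis=2)    # (N, M) L2 distances
    combined = local * np.asarray(global_w)                # broadcast over rows
    totals = combined.sum(axis=1, keepdims=True)
    # eps only matters for a row with zero total weight (assumed edge case).
    return combined / np.maximum(totals, eps)


a, b = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
# Column 0: three rows share value a, the last row has the rare value b.
# Column 1: values alternate, so no row is unusual there.
cells = np.array([[a, a], [a, b], [a, a], [b, b]])
W = final_weights(cells, [0.5, 0.5])
print(np.round(W, 3))
```

The last row holds the rare value in column 0, so that column takes 0.6 of its weight, against one third for the rows holding the common value.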
Weighted pooling then collapses each row's cell embeddings into a single 384-dim row vector:
row_vector_i = Σ_j final_weight(i,j) × cell_embedding(i,j)
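The pooling sum maps directly onto a single `einsum` call. A minimal sketch, assuming `cells` has shape (N, M, D) and `weights` has shape (N, M):

```python
import numpy as np


def pool_rows(cells, weights):
    """Weighted sum over columns: (N, M, D) and (N, M) -> (N, D)."""
    return np.einsum("nm,nmd->nd", weights, cells)


rng = np.random.default_rng(1)
cells = rng.standard_normal((5, 4, 384))
uniform = np.full((5, 4), 0.25)  # equal weights reduce pooling to a plain mean
pooled = pool_rows(cells, uniform)
print(pooled.shape)  # (5, 384)
```

With uniform weights the result equals the per-row mean of the cell embeddings, which makes a convenient sanity check.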
We do not cluster these row vectors with K-Means. K-Means assumes roughly spherical clusters and requires the number of clusters k to be specified in advance. Real-world semantic groups are rarely spherical and their number is unknown.
Instead:
- UMAP compresses 384D → 10D (preserving local neighbourhood structure, using cosine metric)
- HDBSCAN finds natural dense groupings without requiring k — the number of clusters is discovered from the data geometry
- Points that belong to no cluster are labelled noise (−1) rather than forced into the wrong group
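The two steps above could be wired together roughly as follows. This is a sketch, not the notebooks' exact code: `min_cluster_size` and `random_state` are illustrative values that the text does not specify, and the imports sit inside the function so the label summary can be used without umap-learn or hdbscan installed.

```python
import numpy as np


def cluster_rows(row_vectors, min_cluster_size=15, random_state=42):
    """Reduce to 10 dimensions with UMAP (cosine), then cluster with HDBSCAN.

    min_cluster_size and random_state are illustrative, not values from the text.
    """
    import hdbscan
    import umap  # provided by the umap-learn package

    reduced = umap.UMAP(
        n_components=10, metric="cosine", random_state=random_state
    ).fit_transform(row_vectors)
    return hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)


def summarise(labels):
    """Return the number of clusters and the share of rows labelled noise (-1)."""
    labels = np.asarray(labels)
    n_clusters = len(set(labels.tolist()) - {-1})
    noise_share = float(np.mean(labels == -1))
    return n_clusters, noise_share


# Hand-written labels standing in for cluster_rows(...) output.
print(summarise([0, 0, 1, 1, 1, -1, 2, 2]))  # (3, 0.125)
```

Calling `summarise(cluster_rows(row_vectors))` on real row vectors would report the cluster count and noise share in the same form as the results tables.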
Raw DataFrame (N rows × M columns, any mix of types)
│
▼
┌─────────────────────────────────────────────────────────┐
│ Step 0: Drop degenerate columns │
│ Remove constants, near-unique IDs, zero-variance cols │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Step 1: Cell-level embedding │
│ "col_name: value" → MiniLM-L6-v2 → 384-dim vector │
│ Shape: (N, M, 384) │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Step 2: Global column weights │
│ weight_j = semantic_variance_j × (1 − uniqueness_j) │
│ Shape: (M,) — one scalar per column │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Step 3: Local row weights │
│ local(i,j) = L2 dist of cell (i,j) from column mean │
│ combined = global_j × local(i,j), normalised per row │
│ Shape: (N, M) │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Step 4: Weighted pooling │
│ row_vec_i = Σ_j weight(i,j) × embedding(i,j) │
│ Shape: (N, 384) — one vector per record │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Step 5: UMAP compression │
│ 384D → 10D (cosine, for clustering) │
│ 384D → 2D (cosine, for visualisation only) │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Step 6: HDBSCAN clustering │
│ Finds k automatically · labels noise as −1 │
│ Output: cluster labels for every row │
└─────────────────────────────────────────────────────────┘
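Step 0 of the diagram can be sketched with plain Python. The `max_unique_ratio` threshold for "near-unique" is a hypothetical value chosen for illustration, and the column names in the toy rows are examples only.

```python
def drop_degenerate(rows, max_unique_ratio=0.95):
    """Return the names of columns worth keeping.

    rows: list of dicts mapping column name -> value.
    max_unique_ratio: hypothetical cut-off above which a column counts as ID-like.
    """
    n_rows = len(rows)
    keep = []
    for col in rows[0]:
        n_unique = len({row[col] for row in rows})
        if n_unique <= 1:
            continue  # constant column: zero variance, no signal
        if n_unique / n_rows >= max_unique_ratio:
            continue  # near-unique column: behaves like a row ID
        keep.append(col)
    return keep


rows = [
    {"EmployeeNumber": 1, "Over18": "Y", "Department": "Sales"},
    {"EmployeeNumber": 2, "Over18": "Y", "Department": "Sales"},
    {"EmployeeNumber": 3, "Over18": "Y", "Department": "Human Resources"},
    {"EmployeeNumber": 4, "Over18": "Y", "Department": "Research & Development"},
]
kept = drop_degenerate(rows)
print(kept)  # ['Department']
```

Filtering these columns before embedding also saves encoder time, since dropped columns are never serialised.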
| Method | Clusters | Noise % | Silhouette ↑ | Davies-Bouldin ↓ | Calinski-Harabasz ↑ |
|---|---|---|---|---|---|
| SIMBA — HDBSCAN (ours) | 15 | 3.2% | 0.865 | 0.207 | 14325 |
| K-Means on SIMBA embeddings | 8 | 0.0% | 0.740 | 0.490 | 2681 |
| K-Means + full encoding | 2 | 0.0% | 0.121 | 2.687 | 186 |
| K-Means numeric-only | 2 | 0.0% | 0.139 | 2.466 | 219 |
| Gower distance + K-Medoids | 4 | 0.0% | 0.156 | 1.873 | 241 |
| Agglomerative (Ward) | 4 | 0.0% | 0.148 | 1.912 | 228 |
| HDBSCAN on one-hot | noise only | — | — | — | — |
SIMBA achieves roughly 5.5× higher Silhouette (0.865 vs 0.156) and 59× higher Calinski-Harabasz (14325 vs 241) than the best traditional baseline, Gower distance + K-Medoids. Crucially, it also discovers 15 distinct groups where traditional methods collapse to 2 or 4.
Semantic validation — Attrition rates by cluster (attrition was never used as a clustering input):
| Cluster | Attrition Rate | vs Overall Average (16%) |
|---|---|---|
| Cluster 3 | 36.3% | 2.3× above — HIGH RISK |
| Cluster 11 | 26.8% | 1.7× above |
| Cluster 14 | 5.3% | 3.0× below — VERY STABLE |
| Cluster 4 | 4.9% | 3.3× below |
A 7× spread in attrition rates across clusters (36.3% vs 4.9%), with attrition excluded from the input, is strong evidence that the method found genuinely meaningful groups rather than a metric artefact.
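A held-out check of this kind needs only the cluster labels and the outcome column. The sketch below uses made-up toy data, and skipping noise points (label -1) is our assumption about how the rates are computed.

```python
from collections import defaultdict


def rate_by_cluster(labels, outcome):
    """Share of positive outcomes per cluster, ignoring noise points (label -1)."""
    counts = defaultdict(lambda: [0, 0])  # cluster -> [positives, total]
    for label, flag in zip(labels, outcome):
        if label == -1:
            continue
        counts[label][0] += int(flag)
        counts[label][1] += 1
    return {c: pos / total for c, (pos, total) in sorted(counts.items())}


# Toy data for illustration only; not taken from the IBM HR results.
labels = [0, 0, 0, 0, 1, 1, 1, 1, -1]
attrited = [1, 1, 0, 0, 0, 0, 0, 1, 1]

rates = rate_by_cluster(labels, attrited)
spread = max(rates.values()) / min(rates.values())
print(rates, spread)
```

The spread is the ratio between the highest and lowest cluster rate, the same quantity as the 7× figure quoted above.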
Car Evaluation is the hardest test case for any encoding-based method. All 6 columns are ordinal categories: buying ∈ {low, med, high, vhigh}, safety ∈ {low, med, high}, etc. One-hot encoding makes every pair of levels equidistant, so low is as far from med as it is from vhigh and the ordering is lost. SIMBA inherits ordinal geometry from the pretrained encoder with zero ordinal-specific engineering.
| Method | Clusters | Silhouette ↑ | Adjusted Rand vs Class ↑ |
|---|---|---|---|
| SIMBA — HDBSCAN (ours) | 38 | 0.829 | 0.412 |
| K-Means on SIMBA embeddings | 4 | 0.701 | 0.238 |
| K-Modes (native categorical) | 4 | 0.143 | 0.118 |
| Agglomerative on one-hot | 4 | 0.211 | 0.151 |
| HDBSCAN on one-hot | 3 | 0.198 | 0.130 |
| Gower distance | 4 | 0.167 | 0.129 |
SIMBA finds 38 fine-grained ordinal combinations; K-Modes (a method designed for categorical data) finds 4 coarse groups with roughly 6× lower Silhouette.
| Method | Clusters | Noise % | Silhouette ↑ |
|---|---|---|---|
| SIMBA — HDBSCAN (ours) | 54 | 13.3% | 0.692 |
| K-Means on SIMBA embeddings | 8 | 0.0% | 0.611 |
| K-Means + full encoding | 8 | 0.0% | 0.142 |
| Gower + K-Medoids | 6 | 0.0% | 0.181 |
| HDBSCAN on one-hot | noise only | — | — |
Income rate spread across discovered clusters (income was never used as input): cluster 46 has 92.2% >50K income, clusters 20 and 37 have 0.0% — a spread that no traditional method replicates.
Beyond standard clustering metrics, which are computed in UMAP-compressed Euclidean space, we validate in the encoder's own 384-dimensional space using cosine similarity. This makes the check hard to game: the clustering is evaluated in a space it was never optimised in.
For each cluster c, we compute:
gap(c) = mean_cosine_sim(within c) − mean_cosine_sim(c, nearest other cluster)
A positive gap means: members of this cluster are more similar to each other than to the nearest outside cluster — in the raw embedding space, before any dimensionality reduction.
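One possible implementation of the gap is sketched below. The text does not say whether the means are taken over pairwise similarities or over centroids; this version assumes pairwise means, takes "nearest other cluster" to be the one with the highest mean cross-similarity, and skips noise points. It expects at least two clusters.

```python
import numpy as np


def semantic_gaps(vectors, labels):
    """gap(c) = mean cosine similarity within c minus mean cosine similarity
    between c and its most similar other cluster. Noise (label -1) is skipped."""
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = unit @ unit.T  # all pairwise cosine similarities
    clusters = [c for c in np.unique(labels) if c != -1]
    gaps = {}
    for c in clusters:
        inside = labels == c
        n = int(inside.sum())
        block = sim[np.ix_(inside, inside)]
        # Average over distinct pairs only, so self-similarity is excluded.
        intra = (block.sum() - n) / (n * (n - 1)) if n > 1 else 1.0
        inter = max(
            sim[np.ix_(inside, labels == other)].mean()
            for other in clusters
            if other != c
        )
        gaps[int(c)] = float(intra - inter)
    return gaps


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
gaps = semantic_gaps(vectors, [0, 0, 1, 1])
print(gaps)
```

In this toy case both clusters are tight and well separated, so both gaps are positive.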
| Dataset | Clusters | Mean intra-sim | Mean nearest-inter | Mean gap | Positive gaps |
|---|---|---|---|---|---|
| IBM HR | 15 | 0.977 | 0.961 | 0.016 | 15/15 (100%) |
| Car Evaluation | 38 | 0.986 | 0.973 | 0.013 | 38/38 (100%) |
| Adult Income | 54 | 0.938 | 0.895 | 0.043 | 52/54 (96%) |
All 15 IBM HR clusters, all 38 Car Evaluation clusters, and 52 of 54 Adult Income clusters show strictly positive gaps. The two negative gaps in Adult Income correspond to heterogeneous boundary clusters flagged as mixed by the clustering itself (low density, high noise rate).
Put plainly: records that are described similarly have been grouped together — confirmed in the space where descriptions live.
scc/
├── notebooks/
│ ├── ibm_hr_comparison.ipynb # IBM HR — full pipeline + 14 baselines
│ ├── adult_income_comparison.ipynb # Adult Income — stress test at scale
│ └── car_evaluation.ipynb # Car Evaluation — purely ordinal challenge
├── semantic_silhouette_ibm.py # Semantic silhouette analysis (IBM HR)
├── semantic_silhouette_car.py # Semantic silhouette analysis (Car Eval)
├── semantic_silhouette_adult.py # Semantic silhouette analysis (Adult Income)
├── arxiv_paper/
│ └── scc_paper.tex # Full LaTeX paper
├── requirements.txt
└── README.md
git clone https://github.com/<your-username>/scc.git
cd scc
pip install -r requirements.txt
requirements.txt:
sentence-transformers>=2.7
umap-learn>=0.5
hdbscan>=0.8
pandas>=2.0
numpy>=1.26
matplotlib>=3.8
scikit-learn>=1.4
tqdm>=4.66
jupyter
kmodes
gower
Python 3.10+ recommended. All embedding computation is local and offline — no API keys required. The model (all-MiniLM-L6-v2, 22M parameters) downloads automatically from Hugging Face on first run.
Download from Kaggle: https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset
Place archive.zip (or the extracted WA_Fn-UseC_-HR-Employee-Attrition.csv) in the scc/ root directory (one level above notebooks/).
Adult Income is downloaded automatically from OpenML on the first notebook run, so no manual step is required. It needs internet access on the first run only.
Download from UCI: https://archive.ics.uci.edu/ml/datasets/Car+Evaluation
Or via command line:
wget https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data
Place car_evaluation.zip (or car.data) in the scc/ root directory.
jupyter notebook notebooks/ibm_hr_comparison.ipynb
First run: embeddings are computed and cached to embeddings_ibm.npy (~90 seconds on CPU for IBM HR, ~30s for Car Eval, ~2 minutes for Adult Income). Every subsequent run loads from the cache almost instantly.
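The caching behaviour can be sketched with `np.save` and `np.load`. The helper name `load_or_compute` is hypothetical and the notebooks may implement this differently; the example writes to a temporary directory so it leaves no files behind.

```python
import os
import tempfile

import numpy as np


def load_or_compute(cache_path, compute):
    """Load embeddings from cache_path if it exists, else compute and save them."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    embeddings = np.asarray(compute())
    np.save(cache_path, embeddings)
    return embeddings


calls = []


def fake_compute():
    """Stand-in for the expensive embedding step; records each invocation."""
    calls.append(1)
    return np.ones((3, 2, 4))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings_ibm.npy")
    first = load_or_compute(path, fake_compute)   # computes and writes the file
    second = load_or_compute(path, fake_compute)  # reads the file, no recompute

print(len(calls))  # 1
```

A cache like this is keyed only by file name, so it should be deleted whenever the input data or the set of columns changes.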
Each notebook is self-contained. Run all cells top to bottom.
| Dataset | Rows | Columns | Embedding time (CPU) | Cached runs |
|---|---|---|---|---|
| IBM HR | 1470 | 29 | ~90 seconds | Instant |
| Car Evaluation | 1728 | 6 | ~30 seconds | Instant |
| Adult Income | 5000 | 12 | ~2 minutes | Instant |
The embedding step is the only expensive step. Everything else (UMAP, HDBSCAN, baselines, metrics) runs in seconds.
Good fit:
- HR data, CRM data, customer profiles with mixed types
- Product catalogues (name, category, attributes, price tier)
- Survey responses (Likert scales + free text + demographics)
- Any dataset where you would describe record similarity in words
- When you don't know how many clusters to expect
Poor fit:
- Pure numeric sensor data or scientific measurements
- Datasets with >100k rows (embedding cost becomes significant without GPU)
- Tasks where exact numeric precision matters more than conceptual proximity
Simple test: Can you describe what makes two records similar using natural language? → SIMBA. Would you describe it as closeness on a numeric scale? → K-Means.
SIMBA uses existing components (sentence-transformers, UMAP, HDBSCAN) but the combination and framing are new:
- Row serialisation methods (e.g. TAPAS) embed entire rows as one token sequence; SIMBA embeds at cell level to preserve per-column geometric identity
- Gower distance handles mixed types but stays in feature-encoded space — no semantic understanding
- K-Prototypes / K-Modes are designed for categorical data but treat all categories as equally distant
- Deep clustering (DEC, DCEC) learns embeddings from the clustering objective itself — not transferable to new datasets without retraining
- SIMBA uses a pretrained general encoder as-is — zero training, zero labelled data, zero domain-specific engineering
If you use SIMBA in your work, please cite:
@misc{scc2024,
title = {Semantic Similarity-Based Aggregator: A General-Purpose Framework
for Mixed-Type Tabular Data},
year = {2024},
note = {Preprint available at Zenodo}
}
License: MIT