Semantic Similarity-Based Aggregator (SIMBA)

A general-purpose clustering framework for mixed-type tabular data — built on a single, uncomfortable observation: the space where traditional clustering happens is the wrong space.

The Core Idea — We Are Flipping the Statistics

Almost every clustering method you have used operates in feature-encoded space: raw numbers, one-hot vectors, label integers. This is where K-Means minimises inertia. This is where DBSCAN draws epsilon-balls. This is where Silhouette scores are usually computed. And this is, fundamentally, the wrong place to look for meaningful human groupings in real-world tabular data.

Here is the problem. Suppose your data has a column JobRole with values:

Sales Executive   →  [1, 0, 0, 0, 0, 0, 0, 0, 0]
Manager           →  [0, 1, 0, 0, 0, 0, 0, 0, 0]
Research Director →  [0, 0, 1, 0, 0, 0, 0, 0, 0]
HR Representative →  [0, 0, 0, 1, 0, 0, 0, 0, 0]

In one-hot space, every pair of distinct roles is exactly equidistant: the Euclidean distance is always √2. A Sales Executive is as far from a Manager as from an HR Representative. The encoding carries no hint that a Sales Executive and a Sales Representative both work to targets and quotas, while an HR Representative does not. That semantic geometry simply does not exist in the encoded feature space.
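The equidistance is easy to check numerically. The sketch below builds one-hot vectors for the four roles listed above and measures every pairwise Euclidean distance. The nine-slot width mirrors the example, and the `one_hot` helper is an illustrative name, not part of SIMBA.

```python
import itertools
import math

ROLES = ["Sales Executive", "Manager", "Research Director", "HR Representative"]
N_CATEGORIES = 9  # width of the one-hot vectors in the example above


def one_hot(index, width):
    """Return a one-hot list with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(width)]


vectors = {role: one_hot(i, N_CATEGORIES) for i, role in enumerate(ROLES)}

# Euclidean distance for every unordered pair of distinct roles.
distances = {
    (a, b): math.dist(vectors[a], vectors[b])
    for a, b in itertools.combinations(ROLES, 2)
}

for (a, b), d in distances.items():
    print(f"{a:18s} vs {b:18s}: {d:.4f}")
```

Every line prints the same value, because two distinct one-hot vectors always differ in exactly two positions.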

Now look at what happens in the embedding space of a pretrained language model:

"Sales Executive"   →  [0.55, 0.21, -0.44, ...]   ┐ close together
"Sales Representative" → [0.53, 0.24, -0.41, ...]  ┘
"Manager"           →  [0.12, 0.67, -0.08, ...]   ┐ different region
"Research Director" →  [0.09, 0.71, -0.05, ...]   ┘
"HR Representative" →  [-0.34, 0.44, 0.33, ...]     separate region

Roles that mean similar things live close together. Roles that are conceptually distant are far apart. The geometry is real. The distances correspond to something.
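As a rough illustration, cosine similarity can be computed on the three components shown above. These are truncated, illustrative values rather than real 384-dimensional MiniLM outputs, so only the ordering of the similarities is meaningful here.

```python
import math

# First three components only, copied from the illustration above.
# Real embeddings have 384 dimensions; treat these numbers as a toy example.
EMB = {
    "Sales Executive": [0.55, 0.21, -0.44],
    "Sales Representative": [0.53, 0.24, -0.41],
    "Manager": [0.12, 0.67, -0.08],
    "Research Director": [0.09, 0.71, -0.05],
    "HR Representative": [-0.34, 0.44, 0.33],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


anchor = "Sales Executive"
sims = {role: cosine(EMB[anchor], vec) for role, vec in EMB.items() if role != anchor}

# Most similar role first.
for role, s in sorted(sims.items(), key=lambda kv: -kv[1]):
    print(f"{anchor} vs {role:22s}: {s:+.3f}")
```

On these toy values the other sales role ranks first and the HR role last, which is the ordering the diagram describes.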

SIMBA's argument is this: if you want to find groups of records that are genuinely similar in the way a human expert would recognise, you should cluster in the space where human conceptual similarity already has metric structure — the embedding space. Not after. Not alongside. Before everything else.

This is not a preprocessing trick. It is a philosophical flip. Traditional methods encode first, then try to cluster in a broken space. SIMBA maps everything into a space where meaning is already metrically organised, then clusters there. We are not improving the clustering algorithm. We are replacing the space it runs in.

The statistics themselves flip:

What you compute	Feature-encoded space	Embedding space
Distance between "Sales Executive" and "Sales Representative"	√2 (same as any other pair)	Small (both are sales roles)
Distance between "HR" and "Sales"	√2 (same as any other pair)	Large (conceptually unrelated domains)
Silhouette of discovered clusters	0.12 – 0.14 (traditional K-Means)	0.86 (SIMBA)
Clusters that make business sense	Sometimes	Consistently

Why This Matters for Mixed-Type Data

Real tabular data is almost never purely numeric. A typical HR record has age and salary (numbers), job role, department and education field (categorical text), satisfaction scores (ordinal integers), and marital status (nominal category). Traditional clustering pipelines require you to:

One-hot encode the categoricals (exploding dimensionality)
Normalise the numerics
Pick a distance metric that somehow respects both
Hope the clusters that emerge mean something

Each of these steps discards information. One-hot encoding loses the fact that "Life Sciences" and "Medical" are adjacent fields. Normalisation loses the fact that an age of 22 in a senior-role employee is far more unusual than an age of 22 in an entry-level one. And the final distance metric treats all these patched-together features as if they belong to the same geometric space — which they do not.

SIMBA sidesteps all of this. Every cell — regardless of type — becomes a natural-language string "column_name: value" and is passed through a pretrained language encoder. The encoder already understands that "age: 22" and "age: 58" are far apart on a meaningful axis. It already understands that "education_field: Life Sciences" and "education_field: Medical" are closer to each other than either is to "education_field: Marketing". You do not have to tell it. It learned this from billions of text tokens.

The Three Core Ideas

1 — Cell-Level Embedding (not row serialisation)

Each cell is embedded individually:

row 0, col "Department"   →  "Department: Sales"       →  384-dim vector
row 0, col "JobRole"      →  "JobRole: Sales Executive" →  384-dim vector
row 0, col "Age"          →  "Age: 41"                  →  384-dim vector
row 0, col "Education"    →  "Education: Life Sciences"  →  384-dim vector
         ...                         ...                          ...

Result shape: (N_rows × N_cols × 384)

Why not embed the whole row as one long string? Two reasons. First, a sentence encoder pools all tokens into a single vector, so each field's share of that vector shrinks as the row gets longer, and anything beyond the model's maximum sequence length is truncated. Second, and more importantly, cell-level embedding gives each column its own clean representation, independent of what appears in other columns of the same row. This means the weighting step (below) can amplify or suppress individual columns without contamination.
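A minimal sketch of cell-level embedding follows. The function names (`cell_strings`, `embed_cells`, `load_default_encoder`) are hypothetical, the input is a plain dict of columns rather than a DataFrame, and encoding each distinct string only once is an optimisation assumed here, not something the description above specifies. The toy run uses a stand-in encoder so it works without downloading a model.

```python
import numpy as np


def cell_strings(columns):
    """Turn {column_name: [values]} into per-row lists of "column_name: value" strings."""
    names = list(columns)
    n_rows = len(columns[names[0]])
    return [[f"{name}: {columns[name][i]}" for name in names] for i in range(n_rows)]


def embed_cells(columns, encode):
    """Embed every cell separately; returns an array of shape (N_rows, N_cols, dim).

    `encode` is any callable mapping a list of strings to a 2-D array.
    """
    rows = cell_strings(columns)
    n_rows, n_cols = len(rows), len(rows[0])
    flat = [s for row in rows for s in row]
    # Encode each distinct string once: categorical columns repeat heavily.
    unique = sorted(set(flat))
    vectors = np.asarray(encode(unique), dtype=np.float32)
    lookup = {s: i for i, s in enumerate(unique)}
    idx = np.array([lookup[s] for s in flat])
    return vectors[idx].reshape(n_rows, n_cols, -1)


def load_default_encoder():
    """Assumed real encoder, imported lazily so the code above runs without it."""
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer("all-MiniLM-L6-v2").encode


def fake_encode(strings, dim=8):
    """Stand-in encoder for the toy run: random vectors, one per input string."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(strings), dim))


toy = {"Department": ["Sales", "Sales", "HR"], "Age": [41, 29, 41]}
E = embed_cells(toy, fake_encode)
print(E.shape)  # (3, 2, 8)
```

Passing `load_default_encoder()` instead of `fake_encode` would give 384-dimensional vectors per cell.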

2 — Variance-Aware Column Weighting

Not all columns deserve equal say in a person's (or record's) identity. A PerformanceRating column with only two possible values carries far less information than a JobRole column with nine semantically distinct categories.

SIMBA computes a global weight for each column:

global_weight_j  =  semantic_variance_j  ×  (1 - uniqueness_ratio_j)

semantic_variance: how spread out are the 384-dim embeddings for this column across all rows? A column whose values are semantically diverse (many different meanings) has high variance.
uniqueness_ratio: n_unique_values / n_rows. Penalises ID-like columns (employee number, row index) that are unique per row and carry zero clustering signal.
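The formula can be sketched in NumPy as below. Two details are assumptions, since the text does not pin them down: semantic variance is taken as the mean squared L2 distance of a column's cell embeddings from the column mean, and the weights are normalised to sum to 1 so they read as percentages.

```python
import numpy as np


def semantic_variance(E):
    """E: (N, M, D) cell embeddings -> (M,) spread of each column.

    Assumed definition: mean squared L2 distance from the column mean embedding.
    """
    centred = E - E.mean(axis=0, keepdims=True)      # (N, M, D)
    return (centred ** 2).sum(axis=2).mean(axis=0)   # (M,)


def uniqueness_ratio(columns):
    """columns: {name: [values]}, in the same order as E's column axis -> (M,)."""
    return np.array([len(set(v)) / len(v) for v in columns.values()])


def global_weights(E, columns):
    """global_weight_j = semantic_variance_j * (1 - uniqueness_ratio_j)."""
    w = semantic_variance(E) * (1.0 - uniqueness_ratio(columns))
    total = w.sum()
    # Normalising to sum to 1 is an assumption made for readability.
    return w / total if total > 0 else w


# Toy example: a constant column, an ID-like column, and an informative column.
rng = np.random.default_rng(0)
N, D = 6, 4
cols = {
    "Constant": ["x"] * N,                    # zero semantic variance
    "EmployeeID": list(range(N)),             # unique per row
    "Role": ["a", "b", "a", "b", "a", "b"],   # two distinct meanings
}
E = np.zeros((N, 3, D))
E[:, 1] = rng.normal(size=(N, D))
E[:, 2] = np.where(np.arange(N)[:, None] % 2 == 0, 1.0, -1.0)

weights = global_weights(E, cols)
print(weights)
```

The constant column is zeroed by its variance and the ID-like column by its uniqueness ratio, so all weight lands on `Role`.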

Weights computed on IBM HR (1470 employees, 29 columns), selected columns shown:

Column	Weight	Reason
JobRole	12.7%	9 semantically diverse roles
EducationField	8.6%	6 distinct fields with real distance structure
Department	7.7%	3 meaningful departments
Age	7.1%	wide numeric range with real semantics
WorkLifeBalance	0.7%	only 4 ordinal values, low variance
PerformanceRating	0.5%	only 2 unique values
Attrition	0.4%	84% No — nearly constant, useless for clustering

This is not manual feature selection. It is automatic, driven by the semantic geometry of the data itself.

Beyond global weights, each row gets per-column local weights:

local_weight(row_i, col_j)  =  L2 distance of row_i's embedding from the column mean

A cell that is unusual compared to the rest of its column pulls far from the column mean — it is a more distinctive cell, and it should dominate that row's identity. A cell that is typical sits close to the column mean — it is background noise for this row.

The combined weight is:

final_weight(i, j)  =  global_weight_j  ×  local_weight(row_i, col_j)

Normalised per row to sum to 1. Each row ends up with a personalised weighting that emphasises its own unusual features.
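A possible NumPy sketch of the local and combined weights is shown here. The function names are hypothetical, and the fallback for a row whose combined weights are all zero (every cell sitting exactly on its column mean) is an assumption, since the text does not say how that case is handled.

```python
import numpy as np


def local_weights(E):
    """E: (N, M, D) -> (N, M) L2 distance of each cell embedding from its column mean."""
    col_mean = E.mean(axis=0, keepdims=True)
    return np.linalg.norm(E - col_mean, axis=2)


def final_weights(E, global_w, eps=1e-12):
    """Multiply global (M,) by local (N, M) weights, then normalise each row to sum to 1."""
    combined = global_w[None, :] * local_weights(E)
    row_sums = combined.sum(axis=1, keepdims=True)
    # Assumed fallback: an all-zero row reuses the normalised global weights.
    fallback = np.broadcast_to(global_w / max(global_w.sum(), eps), combined.shape)
    return np.where(row_sums > eps, combined / np.maximum(row_sums, eps), fallback)


# Toy example: 4 rows, 2 columns, 2-dim embeddings.
E = np.array([
    [[0.0, 0.0], [1.0, 0.0]],
    [[0.0, 0.0], [-1.0, 0.0]],
    [[0.0, 0.0], [1.0, 0.0]],
    [[4.0, 0.0], [-1.0, 0.0]],   # unusual value in column 0
])
global_w = np.array([0.5, 0.5])
W = final_weights(E, global_w)
print(W)
```

The last row has an unusual value in column 0, so that column receives 0.75 of the row's weight while the typical rows split their weight evenly.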

3 — Density-Based Clustering in Embedding Space

After computing one 384-dim weighted row vector per record:

row_vector_i  =  Σ_j  final_weight(i,j)  ×  cell_embedding(i,j)

We do not apply K-Means. K-Means assumes spherical clusters and requires the number of clusters k to be specified in advance. Real-world semantic groups are not spherical and their number is unknown.

Instead:

UMAP compresses 384D → 10D (preserving local neighbourhood structure, using cosine metric)
HDBSCAN finds natural dense groupings without requiring k — the number of clusters is discovered from the data geometry
Points that belong to no cluster are labelled noise (−1) rather than forced into the wrong group
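The pooling and clustering steps could look roughly like this. `pool_rows` and `cluster_rows` are hypothetical names, and `min_cluster_size` and `seed` are placeholder values, not settings taken from the SIMBA notebooks. The UMAP and HDBSCAN imports are deferred so the pooling check runs with NumPy alone.

```python
import numpy as np


def pool_rows(E, W):
    """row_vector_i = sum_j W[i, j] * E[i, j, :]; E is (N, M, D), W is (N, M)."""
    return np.einsum("nm,nmd->nd", W, E)


def cluster_rows(row_vectors, n_components=10, min_cluster_size=15, seed=42):
    """UMAP (cosine metric) down to `n_components`, then HDBSCAN; label -1 marks noise.

    Requires umap-learn and hdbscan. Parameter defaults are placeholders.
    """
    import umap
    import hdbscan

    reduced = umap.UMAP(
        n_components=n_components, metric="cosine", random_state=seed
    ).fit_transform(row_vectors)
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)
    return reduced, labels


# Pooling check on a tiny example: 2 rows, 2 columns, 3-dim embeddings.
E = np.array([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
              [[0.0, 0.0, 2.0], [2.0, 0.0, 0.0]]])
W = np.array([[0.5, 0.5],
              [0.25, 0.75]])
rows = pool_rows(E, W)
print(rows)
```

On real data, `cluster_rows(rows)` would return the 10-dimensional UMAP coordinates and one label per record, with -1 for noise.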

Full Pipeline

Raw DataFrame  (N rows × M columns, any mix of types)
        │
        ▼
┌─────────────────────────────────────────────────────────┐
│  Step 0: Drop degenerate columns                        │
│  Remove constants, near-unique IDs, zero-variance cols  │
└─────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────┐
│  Step 1: Cell-level embedding                           │
│  "col_name: value"  →  MiniLM-L6-v2  →  384-dim vector │
│  Shape: (N, M, 384)                                     │
└─────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────┐
│  Step 2: Global column weights                          │
│  weight_j = semantic_variance_j × (1 − uniqueness_j)   │
│  Shape: (M,)  — one scalar per column                   │
└─────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────┐
│  Step 3: Local row weights                              │
│  local(i,j) = L2 dist of cell (i,j) from column mean   │
│  combined = global_j × local(i,j),  normalised per row  │
│  Shape: (N, M)                                          │
└─────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────┐
│  Step 4: Weighted pooling                               │
│  row_vec_i = Σ_j weight(i,j) × embedding(i,j)          │
│  Shape: (N, 384)  — one vector per record               │
└─────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────┐
│  Step 5: UMAP compression                               │
│  384D → 10D  (cosine, for clustering)                   │
│  384D →  2D  (cosine, for visualisation only)           │
└─────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────┐
│  Step 6: HDBSCAN clustering                             │
│  Finds k automatically  ·  labels noise as −1           │
│  Output: cluster labels for every row                   │
└─────────────────────────────────────────────────────────┘

Results

IBM HR Employee Attrition (1470 employees, 29 columns after cleaning)

Method	Clusters	Noise %	Silhouette ↑	Davies-Bouldin ↓	Calinski-Harabasz ↑
SIMBA — HDBSCAN (ours)	15	3.2%	0.865	0.207	14325
K-Means on SIMBA embeddings	8	0.0%	0.740	0.490	2681
K-Means + full encoding	2	0.0%	0.121	2.687	186
K-Means numeric-only	2	0.0%	0.139	2.466	219
Gower distance + K-Medoids	4	0.0%	0.156	1.873	241
Agglomerative (Ward)	4	0.0%	0.148	1.912	228
HDBSCAN on one-hot	noise only	—	—	—	—

SIMBA achieves roughly 5.5× higher Silhouette (0.865 vs 0.156) and 59× higher Calinski-Harabasz (14325 vs 241) than the best traditional baseline, Gower distance + K-Medoids. It also discovers 15 distinct groups where the traditional methods settle on 2 or 4.

Semantic validation — Attrition rates by cluster (attrition was never used as a clustering input):

Cluster	Attrition Rate	vs Overall Average (16%)
Cluster 3	36.3%	2.3× above — HIGH RISK
Cluster 11	26.8%	1.7× above
Cluster 14	5.3%	3.0× below — VERY STABLE
Cluster 4	4.9%	3.3× below

A 7× spread in attrition rates across clusters (36.3% against 4.9%), despite attrition never appearing in the input, is strong evidence that the method found meaningful groups. This points to real structure rather than a metric artefact.
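This kind of check amounts to computing the rate of a held-out label inside each cluster. A small sketch, using toy data rather than the IBM HR results, with a hypothetical helper name `rate_by_cluster`:

```python
from collections import defaultdict


def rate_by_cluster(labels, outcome, positive="Yes", noise_label=-1):
    """Share of `positive` outcomes in each cluster, skipping noise points."""
    counts = defaultdict(lambda: [0, 0])  # cluster -> [positives, total]
    for lab, out in zip(labels, outcome):
        if lab == noise_label:
            continue
        counts[lab][0] += out == positive
        counts[lab][1] += 1
    return {lab: pos / total for lab, (pos, total) in sorted(counts.items())}


# Toy data, not the IBM HR results.
labels = [0, 0, 0, 0, 1, 1, 1, 1, -1]
attrition = ["Yes", "Yes", "No", "No", "No", "No", "No", "Yes", "Yes"]

rates = rate_by_cluster(labels, attrition)
overall = sum(a == "Yes" for a in attrition) / len(attrition)

for lab, r in rates.items():
    print(f"cluster {lab}: {r:.1%} ({r / overall:.2f}x overall)")
```

Because the label is excluded from the clustering input, large differences between per-cluster rates indicate that the clusters track something related to the outcome.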

UCI Car Evaluation (1728 cars, 6 purely ordinal columns)

This is a demanding test case for encoding-based methods. All 6 columns are ordinal categories: buying ∈ {low, med, high, vhigh}, safety ∈ {low, med, high}, etc. One-hot encoding makes every pair of levels equidistant, adjacent or not, so the ordering is lost. SIMBA relies on whatever ordinal geometry the pretrained encoder already carries, with zero ordinal-specific engineering.

Method	Clusters	Silhouette ↑	Adjusted Rand vs Class ↑
SIMBA — HDBSCAN (ours)	38	0.829	0.412
K-Means on SIMBA embeddings	4	0.701	0.238
K-Modes (native categorical)	4	0.143	0.118
Agglomerative on one-hot	4	0.211	0.151
HDBSCAN on one-hot	3	0.198	0.130
Gower distance	4	0.167	0.129

SIMBA finds 38 fine-grained ordinal combinations; K-Modes (a method designed for categorical data) finds 4 coarse groups with roughly 6× lower Silhouette.

UCI Adult Income (5000 rows sampled, 12 columns)

Method	Clusters	Noise %	Silhouette ↑
SIMBA — HDBSCAN (ours)	54	13.3%	0.692
K-Means on SIMBA embeddings	8	0.0%	0.611
K-Means + full encoding	8	0.0%	0.142
Gower + K-Medoids	6	0.0%	0.181
HDBSCAN on one-hot	noise only	—	—

Income rate spread across discovered clusters (income was never used as input): cluster 46 has 92.2% >50K income, while clusters 20 and 37 have 0.0%, a spread that none of the traditional baselines above reproduced.

Semantic Silhouette — Proof in the Encoder's Own Space

Standard clustering metrics are computed in UMAP-compressed Euclidean space. As an additional check, we validate in the encoder's own 384-dimensional space using cosine similarity. This guards against metric gaming: the clusters are scored in a space that the UMAP and HDBSCAN steps never optimised directly.

For each cluster c, we compute:

gap(c)  =  mean_cosine_sim(within c)  −  mean_cosine_sim(c, nearest other cluster)

A positive gap means: members of this cluster are more similar to each other than to the nearest outside cluster — in the raw embedding space, before any dimensionality reduction.
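One way to compute the gap is sketched below. Two choices are assumptions, since the definition above leaves them open: self-pairs are excluded from the within-cluster mean, and the "nearest" other cluster is the one with the highest mean cross-similarity. Input vectors are assumed to be non-zero.

```python
import numpy as np


def semantic_gaps(X, labels, noise_label=-1):
    """Per-cluster gap: mean within-cluster cosine similarity minus the mean
    cosine similarity to the nearest other cluster.

    X: (N, D) row vectors in the encoder's space; labels: (N,) cluster ids.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = unit @ unit.T                                  # (N, N) cosine similarities
    clusters = [c for c in np.unique(labels) if c != noise_label]
    gaps = {}
    for c in clusters:
        in_c = labels == c
        n = int(in_c.sum())
        if n < 2 or len(clusters) < 2:
            continue                                   # gap undefined for this cluster
        block = S[np.ix_(in_c, in_c)]
        # Exclude the diagonal (each point's similarity with itself).
        intra = (block.sum() - np.trace(block)) / (n * (n - 1))
        inter = max(S[np.ix_(in_c, labels == o)].mean() for o in clusters if o != c)
        gaps[int(c)] = float(intra - inter)
    return gaps


# Toy example: two tight clusters and one noise point.
X = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0], [-1.0, -1.0]]
labels = [0, 0, 1, 1, -1]
gaps = semantic_gaps(X, labels)
print(gaps)
```

The full similarity matrix is N × N, which is fine at the dataset sizes reported here but would need batching for much larger inputs.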

Dataset	Clusters	Mean intra-sim	Mean nearest-inter	Mean gap	Positive gaps
IBM HR	15	0.977	0.961	0.016	15/15 (100%)
Car Evaluation	38	0.986	0.973	0.013	38/38 (100%)
Adult Income	54	0.938	0.895	0.043	52/54 (96%)

All 15 IBM HR clusters, all 38 Car Evaluation clusters, and 52 of 54 Adult Income clusters show strictly positive gaps. The two negative gaps in Adult Income correspond to heterogeneous boundary clusters flagged as mixed by the clustering itself (low density, high noise rate).

Put plainly: records that are described similarly have been grouped together — confirmed in the space where descriptions live.

Repository Structure

scc/
├── notebooks/
│   ├── ibm_hr_comparison.ipynb       # IBM HR — full pipeline + 14 baselines
│   ├── adult_income_comparison.ipynb # Adult Income — stress test at scale
│   └── car_evaluation.ipynb          # Car Evaluation — purely ordinal challenge
├── semantic_silhouette_ibm.py        # Semantic silhouette analysis (IBM HR)
├── semantic_silhouette_car.py        # Semantic silhouette analysis (Car Eval)
├── semantic_silhouette_adult.py      # Semantic silhouette analysis (Adult Income)
├── arxiv_paper/
│   └── scc_paper.tex                 # Full LaTeX paper
├── requirements.txt
└── README.md

Installation

git clone https://github.com/<your-username>/scc.git
cd scc
pip install -r requirements.txt

requirements.txt:

sentence-transformers>=2.7
umap-learn>=0.5
hdbscan>=0.8
pandas>=2.0
numpy>=1.26
matplotlib>=3.8
scikit-learn>=1.4
tqdm>=4.66
jupyter
kmodes
gower

Python 3.10+ recommended. All embedding computation runs locally and needs no API keys. The model (all-MiniLM-L6-v2, 22M parameters) downloads automatically from Hugging Face on first run; after that, embedding works offline.

Datasets

IBM HR Employee Attrition

Download from Kaggle: https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset

Place archive.zip (or the extracted WA_Fn-UseC_-HR-Employee-Attrition.csv) in the scc/ root directory (one level above notebooks/).

UCI Adult Income

Downloaded automatically from OpenML on first notebook run — no manual step required. Requires internet on the first run only.

UCI Car Evaluation

Download from UCI: https://archive.ics.uci.edu/ml/datasets/Car+Evaluation

Or via command line:

wget https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data

Place car_evaluation.zip (or car.data) in the scc/ root directory.

Running the Notebooks

jupyter notebook notebooks/ibm_hr_comparison.ipynb

First run: embeddings are computed and cached to embeddings_ibm.npy (~90 seconds on CPU for IBM HR, ~30s for Car Eval, ~2 minutes for Adult Income). Every subsequent run loads from cache — instant.
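The caching behaviour described above follows a common load-or-compute pattern. The sketch below is an assumption about how such a cache could be written, not the notebooks' actual code. `load_or_compute` is a hypothetical name, and the cache path should end in `.npy` because `np.save` appends that extension when it is missing.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_or_compute(cache_path, compute):
    """Return the cached array if `cache_path` exists; otherwise compute and save it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return np.load(cache_path)
    arr = np.asarray(compute())
    np.save(cache_path, arr)
    return arr


# Demonstration with a stand-in for the expensive embedding step.
calls = []


def expensive():
    calls.append(1)
    return np.arange(6, dtype=np.float32).reshape(2, 3)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings_demo.npy"
    first = load_or_compute(path, expensive)    # computes and writes the file
    second = load_or_compute(path, expensive)   # reads the file, no recompute

print(len(calls))  # 1
```

A cache like this does not notice when the data or the encoder changes, so the file should be deleted before re-running on modified inputs.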

Each notebook is self-contained. Run all cells top to bottom.

Runtime and Embedding Cost

Dataset	Rows	Columns	Embedding time (CPU)	Cached runs
IBM HR	1470	29	~90 seconds	Instant
Car Evaluation	1728	6	~30 seconds	Instant
Adult Income	5000	12	~2 minutes	Instant

The embedding step is the only expensive step. Everything else (UMAP, HDBSCAN, baselines, metrics) runs in seconds.

When to Use SIMBA

Good fit:

HR data, CRM data, customer profiles with mixed types
Product catalogues (name, category, attributes, price tier)
Survey responses (Likert scales + free text + demographics)
Any dataset where you would describe record similarity in words
When you don't know how many clusters to expect

Poor fit:

Pure numeric sensor data or scientific measurements
Datasets with >100k rows (embedding cost becomes significant without GPU)
Tasks where exact numeric precision matters more than conceptual proximity

Simple test: Can you describe what makes two records similar using natural language? → SIMBA. Would you describe it as closeness on a numeric scale? → K-Means.

How SIMBA Relates to Existing Work

SIMBA uses existing components (sentence-transformers, UMAP, HDBSCAN) but the combination and framing are new:

Serialisation-based methods such as TAPAS flatten a whole table or row into one token sequence; SIMBA embeds at cell level to preserve per-column geometric identity
Gower distance handles mixed types but stays in feature-encoded space — no semantic understanding
K-Prototypes / K-Modes are designed for categorical data but treat all categories as equally distant
Deep clustering (DEC, DCEC) learns embeddings from the clustering objective itself — not transferable to new datasets without retraining
SIMBA uses a pretrained general encoder as-is — zero training, zero labelled data, zero domain-specific engineering

Citation

If you use SIMBA in your work, please cite:

@misc{scc2024,
  title  = {Semantic Similarity-Based Aggregator: A General-Purpose Framework
             for Mixed-Type Tabular Data},
  year   = {2024},
  note   = {Preprint available at Zenodo}
}

License

MIT
