SmartKNN vs Weighted_KNN & KNN: A Practical Benchmark on Real Regression Datasets
K-Nearest Neighbours is still widely used in industry because it’s simple, interpretable, and surprisingly strong on tabular data.
But vanilla KNN falls apart when the real world kicks in — noise, irrelevant features, skewed scales, and mixed data types.
**This benchmark compares three variants across 31 real OpenML regression datasets:**
- KNN
- Weighted_KNN
- SmartKNN

All experiments were run on these datasets in two batches.
1. Benchmark Summary
| Metric | Weighted_KNN | SmartKNN |
|---|---|---|
| Avg MSE (Batch 1) | 4.146e7 | 4.181e7 |
| Avg MSE (Batch 2) | 2.354e6 | 1.423e6 |
| Typical R² | 0.10 – 0.50 | 0.50 – 0.88 |
| RMSE trend | Higher | Lower |
Interpretation:
Batch 1 included several outlier datasets with very large target variance, which inflated MSE for both models.
Batch 2 is more representative: SmartKNN produces more accurate, stable, and variance-aware predictions.
SmartKNN consistently achieves higher R² and noticeably lower RMSE on realistic tabular tasks.
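For reference, the R² and RMSE figures reported above follow their standard definitions. A minimal NumPy sketch (independent of either model) shows how both metrics are computed from predictions:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(rmse(y_true, y_pred))     # sqrt(0.025) ≈ 0.158
print(r2_score(y_true, y_pred)) # 0.98
```

Note that R² compares model error against the variance of the targets, which is why high-variance outlier datasets (as in Batch 1) can make raw MSE misleading while R² stays interpretable.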
2. SmartKNN vs Weighted_KNN — Where Each One Shines
SmartKNN Strong Wins
Datasets (OpenML IDs): 622, 634, 637, 638, 645, 654, 656, 657, 659, 695, 712
SmartKNN performs exceptionally well on:
- Medium-to-large tabular data
- Mixed numeric/categorical datasets
- High-variance or noisy features
- Datasets with uneven feature importance or irrelevant columns
Weighted_KNN tends to break down when noise or skewed scaling appears; SmartKNN stays stable thanks to its feature weighting and filtering.
Weighted_KNN Wins
Datasets (OpenML IDs): 675, 683, 687, 690
Weighted_KNN has a slight advantage on trivial, low-noise datasets: SmartKNN's extra weighting and filtering add a small overhead that can marginally reduce performance there.
SmartKNN vs Vanilla KNN
| Metric | SmartKNN | KNN |
|---|---|---|
| Avg MSE (Batch 1) | 4.622e7 | 4.649e7 |
| Avg MSE (Batch 2) | 1.304e6 | 1.613e6 |
| R² trend | Higher in complex datasets | Higher in trivial datasets |
- SmartKNN wins: 24
- KNN wins: 7
SmartKNN is a substantially better baseline for regression tasks in modern ML pipelines.
Why SmartKNN Works Better
| SmartKNN Component | Effect |
|---|---|
| Weighted neighbour influence | Handles noisy & imbalanced features |
| Adaptive feature scaling | Reduces collapse on high variance datasets |
| Noise-aware preprocessing | Boosts resilience to outliers |
| Feature filtering | Removes weak or non-informative dimensions to improve signal clarity |
| Weighted Euclidean distance | Improves neighbour ranking through data-driven feature importance |
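The "feature filtering" and "weighted Euclidean distance" components in the table can be illustrated with a minimal sketch. This is a hypothetical toy helper for intuition only, not SmartKNN's actual implementation: features whose weight falls below a threshold are dropped, and the remaining features contribute to the distance in proportion to their importance.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, weights, k=3, weight_threshold=0.0):
    """Toy feature-weighted KNN regression (illustration only)."""
    keep = weights > weight_threshold  # feature filtering: drop weak dimensions
    diff = X_train[:, keep] - x[keep]
    # Weighted Euclidean distance: important features dominate the ranking.
    dists = np.sqrt((weights[keep] * diff ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return float(y_train[nearest].mean())

X = np.array([[0.0, 100.0], [1.0, 100.0], [10.0, 0.0]])
y = np.array([1.0, 2.0, 10.0])
w = np.array([1.0, 0.001])  # second feature is nearly irrelevant noise
pred = weighted_knn_predict(X, y, np.array([0.5, 0.0]), w, k=2, weight_threshold=0.01)
print(pred)  # 1.5 — the noisy second feature no longer dominates the neighbour ranking
```

An unweighted KNN on the same data would rank neighbours by the irrelevant second feature (whose scale is 100×), which is exactly the failure mode the weighting is meant to fix.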
**SmartKNN keeps the interpretability of KNN but fixes its long-standing weaknesses.**
Code Example
```python
from smart_knn import SmartKNN

model = SmartKNN(k=8, weight_threshold=0.009)
model.fit(X_train, y_train)
preds = model.predict(X_test)
```
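For comparison, a vanilla KNN baseline like the one benchmarked above can be built with scikit-learn's `KNeighborsRegressor` (an assumption for illustration; the article does not specify its exact baseline implementation):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Tiny synthetic regression set, just to show the baseline API.
X_train = np.array([[0.0], [1.0], [10.0]])
y_train = np.array([1.0, 2.0, 10.0])

# Vanilla KNN: unweighted neighbours, plain Euclidean distance.
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X_train, y_train)

pred = knn.predict(np.array([[0.5]]))
print(pred)  # mean of the two nearest targets: [1.5]
```

Swapping this baseline for SmartKNN in a pipeline is a drop-in change, since both expose the same `fit`/`predict` interface.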
Install
```shell
pip install smart-knn
```
Conclusion
SmartKNN isn’t trying to replace neural networks or ensemble methods.
It’s designed to be a modern, robust upgrade to classical KNN:
- Handles noisy features
- Adapts to complex datasets
- Improves R² and RMSE significantly in most real-world scenarios
- More stable on almost every real-world regression task
A modern KNN baseline for regression tasks.
Useful Links
- GitHub Repo: https://github.com/thatipamula-jashwanth/smart-knn
- Kaggle Notebook
- DOI: https://doi.org/10.5281/zenodo.17713746
- Hugging Face: https://huggingface.co/JashuXo/smart-knn
Feedback & Collaboration
If you’d like to benchmark SmartKNN on your own dataset or contribute to upcoming features (classification mode, automated hyperparameter search), you’re welcome to connect and collaborate.
