Machine learning, spicy comments, and one stubborn class imbalance problem

Teaching machines to read the room… or at least the comment section.

This project tackles multi-class comment category prediction using NLP, metadata, and a stacking ensemble that behaved like a well-coached superhero team. The dataset was huge, the classes were imbalanced, and class 3 basically played hide-and-seek the entire time.

Plot twist: the final stacking ensemble reached a validation macro F1 of 0.820, the best score of any model tried in the project.

Why this project exists

Because the internet does not come with a pause button.

Online platforms collect mountains of comments every day, and expecting humans to manually categorize all of them is a recipe for burnout and caffeine dependency. This project builds an automated classifier that predicts one of four labels for each comment using both text and structured metadata.

  • The mission: classify online comments into labels 0 to 3.
  • The challenge: class imbalance made minority classes much harder to learn.
  • The strategy: feature engineering, classical ML, and a stacking ensemble finale.

Pipeline breakdown

From messy comments to machine-learned confidence.

The pipeline mixes text preprocessing, metadata engineering, feature selection, and multiple classifiers. Nothing sci-fi, everything practical, and somehow that made it work really well.

Step 01

EDA with receipts

We checked duplication, missing values, skewed engagement counts, temporal patterns, and feature-label relationships. `if_2` quietly became the mysterious overachiever of the structured features.
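As a taste of what that looked like, here is a minimal pandas sketch. The file name and the `comment`/`label` column names are assumptions for illustration, not the project's actual schema:

```python
import pandas as pd

# Assumed illustrative schema: 'comment' (text), 'label' (0-3), plus metadata columns.
df = pd.read_csv("comments.csv")

print(df.duplicated().sum())                     # exact duplicate rows
print(df.isna().mean().sort_values())            # missing-value rate per column
print(df["label"].value_counts(normalize=True))  # class imbalance at a glance
print(df.select_dtypes("number").skew())         # skewed engagement counts show up here
```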

Step 02

Feature engineering

We extracted time buckets, comment length stats, punctuation counts, lexical diversity, reaction features, and emoticon density. Yes, even emoticons got audited.
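Here is one plausible implementation of those features. The column names, the emoticon regex, and the 6-hour time buckets are all assumptions, since the exact definitions are not spelled out above:

```python
import re
import pandas as pd

EMOTICON = re.compile(r"[:;=8][\-o\*']?[\)\(\[\]dDpP/]")  # simple ASCII emoticons

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    text = out["comment"].fillna("")
    out["char_len"] = text.str.len()
    out["word_count"] = text.str.split().str.len()
    out["punct_count"] = text.str.count(r"[!?.,;:]")
    # Lexical diversity: unique words over total words (0 for empty comments).
    out["lexical_div"] = text.map(
        lambda t: len(set(t.split())) / len(t.split()) if t.split() else 0.0
    )
    # Emoticon density: emoticon matches per character.
    out["emoticon_density"] = text.map(lambda t: len(EMOTICON.findall(t)) / max(len(t), 1))
    # Time buckets: four 6-hour slots per day, assuming a 'timestamp' column.
    out["time_bucket"] = pd.to_datetime(out["timestamp"]).dt.hour // 6
    return out
```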

Step 03

Text to vectors

Comments were cleaned, transformed with TF-IDF, and trimmed from 149,597 terms down to 30,000 using chi-square selection. The feature space stayed rich without becoming a computational horror movie.
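In scikit-learn terms, that step looks roughly like the sketch below; `train_texts` and `y_train` are placeholder names, and `min_df` is an assumed setting:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# TF-IDF over the cleaned comments, then keep the 30,000 terms most
# associated with the labels. Chi-square requires non-negative features,
# which TF-IDF satisfies.
tfidf = TfidfVectorizer(min_df=2)
X_text = tfidf.fit_transform(train_texts)   # ~149,597 terms before selection
selector = SelectKBest(chi2, k=30_000)
X_text_sel = selector.fit_transform(X_text, y_train)
```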

What went into the model

Structured features plus text features = better instincts

  • Text branch: cleaned comment text, TF-IDF vectorization, and chi-square feature selection.
  • Structured branch: scaling for numeric features, one-hot encoding for categorical data, and pass-through for binary flags.
  • Final feature size: 30,057 dimensions per sample after concatenation (sketched right below).
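A minimal sketch of the two-branch assembly, assuming illustrative column names (the real project's column groups differ):

```python
from scipy.sparse import hstack
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["char_len", "word_count", "reactions"]   # assumed names
categorical_cols = ["time_bucket"]
binary_cols = ["if_2"]

structured = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("bin", "passthrough", binary_cols),
])
X_struct = structured.fit_transform(train_df)

# Concatenate with the 30,000 selected TF-IDF columns; in the real project
# the structured branch contributes 57 dimensions, for 30,057 total.
X_full = hstack([X_text_sel, X_struct]).tocsr()
```
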
Models tried

The classifier audition was very competitive.

  • Logistic Regression came in strong and balanced.
  • SGD and Passive Aggressive added speed and diversity.
  • LinearSVC handled sparse text features like a pro.
  • LightGBM and XGBoost tested the tree-based route.
  • Stacking Ensemble took the crown with the best validation score (a sketch follows this list).
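For the curious, here is what a stacking setup along those lines can look like in scikit-learn. The base-model settings and the logistic-regression meta-learner are assumptions, not the project's tuned configuration, and the tree models are left out to keep the sketch dependency-free:

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import (LogisticRegression, PassiveAggressiveClassifier,
                                  SGDClassifier)
from sklearn.svm import LinearSVC

# LightGBM/XGBoost could be appended to the estimator list if installed.
stack = StackingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000, class_weight="balanced")),
        ("sgd", SGDClassifier(class_weight="balanced")),
        ("pa", PassiveAggressiveClassifier(class_weight="balanced")),
        ("svc", LinearSVC(class_weight="balanced")),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,        # out-of-fold predictions feed the meta-learner
    n_jobs=-1,
)
stack.fit(X_full, y_train)
```
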
Performance board

Who cooked, who almost cooked, and who needed seasoning.

The final comparison shows a tight race near the top, with the stacking ensemble edging out the rest. LightGBM was close enough to keep the competition interesting.

Model                 Train macro F1   Validation macro F1   Quick vibe check
Logistic Regression   0.886            0.805                 Reliable, balanced, no drama.
SGD Classifier        0.824            0.792                 Fast and useful in the ensemble squad.
Passive Aggressive    0.872            0.792                 A little intense, but respectable.
LinearSVC             0.904            0.806                 Sparse-text specialist energy.
LightGBM              0.917            0.814                 Almost stole the show solo.
XGBoost               0.844            0.787                 Not bad, just not the main character.
Stacking Ensemble     0.898            0.820                 Avengers assemble, but for classifiers.
Validation F1 bars

Macro F1 at a glance

The ensemble wins by a small but meaningful margin, which is exactly the kind of tiny edge that matters in competitive ML work.

[Bar chart: validation macro F1 per model, from 0.787 (XGBoost) to 0.820 (Stacking Ensemble); values match the table above.]

What made this hard

Class imbalance was the final boss.

The biggest pain point was minority-class performance, especially class 3. A lot of those examples got confused with class 2, suggesting overlapping language patterns that were not fully captured by surface-level text features.

  • Problem: rare classes had fewer examples and weaker separability.
  • Observation: class weighting helped (a sketch follows below), but it did not fully solve the issue.
  • Outcome: ensembling gave the strongest overall balance.
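For reference, scikit-learn's "balanced" heuristic reweights each class by n_samples / (n_classes * class_count), so a rare class 3 counts for more in the loss. A minimal sketch, assuming the usual `y_train` placeholder:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Inspect the weights 'balanced' would assign to each class.
classes = np.unique(y_train)
weights = compute_class_weight("balanced", classes=classes, y=y_train)
print(dict(zip(classes, weights)))

# Or simply let the estimator apply them internally.
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
```
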
Future directions

How this project could get even more dangerous.

  • Transformers: try BERT, RoBERTa, or DistilBERT for deeper contextual understanding.
  • Oversampling: test SMOTE or ADASYN for minority classes (a sketch follows this list).
  • Interpretability: investigate `if_2`, the mysterious feature MVP.
  • Calibration: align predicted probabilities with real-world confidence.
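If the oversampling route gets explored, the key trap is resampling before the train/validation split. A minimal sketch with imbalanced-learn, whose pipeline applies SMOTE only while fitting:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# SMOTE runs only on the training data inside the pipeline, so synthetic
# minority examples never leak into validation scores.
pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_full, y_train)
```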