My First scikit-learn Classifier and Everything I Got Wrong

Built a text classifier for my MSc thesis. Got 94% accuracy. Shipped garbage.

I've been working on my MSc thesis for about six months now. The topic is automated testing for mobile apps that generate sensor data, and at some point it made sense to add a classification component to categorize the data outputs.

So I picked up scikit-learn, watched a few hours of Andrew Ng's Coursera course, and built a text classifier. Trained it, evaluated it, got 94% accuracy, and felt pretty good about myself.

Then I actually tried using it on real app output and it fell apart immediately.

Here's what I got wrong.

The dataset was too clean

I collected training samples myself, from a controlled setup. Consistent formatting, no malformed inputs, predictable timing. The classifier learned those patterns well.

Real app output doesn't look like that. Sensor readings come in at irregular intervals. The formatting isn't always clean. Some records are partial. My classifier had never seen any of that during training.

94% on my neat test set. Maybe 60% on what the app actually produces.
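One check that would have caught this early: take copies of the clean test set, deliberately mangle them the way real output is mangled (dropped fields, ragged spacing), and re-run the evaluation on those. A rough sketch of the kind of perturbation I mean; the log format here is invented, not my actual data:

```python
import random

def make_messy(record: str, rng: random.Random) -> str:
    """Simulate what real app output looks like: partial records, ragged spacing."""
    fields = record.split()
    if len(fields) > 1 and rng.random() < 0.5:
        fields.pop(rng.randrange(len(fields)))  # drop a field to simulate a partial record
    return "  ".join(fields)  # inconsistent whitespace, like real logs

rng = random.Random(0)
clean = ["accel 0.1 0.2 0.3", "gyro 4.5 1.2 0.0"]
messy = [make_messy(r, rng) for r in clean]
print(messy)
```

The number that matters is the model's score on `messy`, not on `clean`. If they differ a lot, the clean-set score is telling you about your collection setup, not your classifier.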

I chased the wrong number

Accuracy went up when I tuned things, so I kept tuning. The problem is my classes were imbalanced. A classifier that always predicted the most common class would have hit around 80% without learning anything useful at all.

I should have been looking at precision and recall. I learned that the hard way.
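To see why accuracy alone misleads here, a minimal sketch using sklearn's DummyClassifier on an illustrative 80/20 class split (the proportions are made up, not my thesis data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic imbalanced labels: roughly 80% class 0, 20% class 1.
rng = np.random.default_rng(0)
y = rng.choice([0, 1], size=1000, p=[0.8, 0.2])
X = np.zeros((len(y), 1))  # features are irrelevant to this baseline

# A "classifier" that always predicts the most common class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

print(f"accuracy: {accuracy_score(y, pred):.2f}")                    # looks fine
print(f"recall, class 1: {recall_score(y, pred, pos_label=1):.2f}")  # 0.00, useless
```

Accuracy comes out around 0.80 while the model has never once identified the minority class. Recall on the rare class exposes that immediately, which is exactly the number I wasn't looking at.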

I had no idea what the model was actually doing

I treated it like a black box. Feed data in, number comes out, number goes up, done. I never looked at which samples were being misclassified or why.

That was a mistake. The model had learned some patterns in my training data that had nothing to do with what I actually cared about.
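The inspection I skipped is only a few lines. A sketch on synthetic data, where `make_classification` stands in for my real text features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for vectorized samples, with the same kind of imbalance.
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)
pred = clf.predict(X_te)

# The confusion matrix shows *where* the errors are, not just how many.
print(confusion_matrix(y_te, pred))

# Pull out the misclassified samples so they can actually be read.
mis = np.flatnonzero(pred != y_te)
print(f"{len(mis)} of {len(y_te)} test samples misclassified")
```

Reading even a dozen of the samples at those indices would have shown me what the model had actually latched onto.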

The whole pipeline was manual

Every step from raw data to trained model involved me doing something by hand in an IPython notebook. No way to reproduce it reliably if I changed something early in the process.

Fine for a one-time experiment. Not fine for a thesis I have to defend.
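The fix I've since adopted is putting every step into a single scikit-learn Pipeline, so going from raw text to a trained model is one `fit` call. A toy sketch with made-up sensor-log lines standing in for my real samples:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical sensor-log lines and labels, not my actual data.
texts = [
    "accel 0.1 0.2 0.3",
    "accel 0.2 0.1 0.4",
    "gyro 4.5 1.2 0.0",
    "gyro 3.9 0.8 0.1",
]
labels = ["accelerometer", "accelerometer", "gyroscope", "gyroscope"]

# Vectorizer and model are fit together: change any step, rerun fit,
# and the whole chain is rebuilt identically. No manual notebook cells.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
pipe.fit(texts, labels)
print(pipe.predict(["accel 9.9 0.0 0.1"]))
```

The same object can be pickled, cross-validated, or grid-searched as a unit, which is what makes a result defensible rather than "it worked on my notebook that one time."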

What I'm doing differently now

Starting with the data. Making sure it looks like real data before touching a model. Picking metrics that actually reflect what I'm trying to measure. Writing down every step so I can reproduce it.

The 94% was a real number. It just wasn't measuring the right thing.

scikit-learn is a good library. The problem was entirely me.

With gusto, Fatih.