I’m a big fan of Amazon Web Services. So scalable. So intuitive. So click-able. Maybe I just feel like a kid again, playing with my digital LEGOs. Naturally, I had to try out Amazon’s latest toy–Machine Learning.
Admittedly, Amazon’s embrace of cloud-based ML is perhaps a bit tardy (announced in April 2015). After all, Google appears to have launched its Prediction API in 2010, and Microsoft Azure ML pre-dates Amazon’s offering by about a year. Furthermore, at the time of writing, Prediction API and Azure ML are arguably more complete offerings. For example, Azure ML features modules for neural nets, tree-based methods, SVMs, etc. Here’s the most complete list I could find. But I digress.
The inspiration for this post comes from Guy Ernest over at AWS. (If you don’t already, do yourself a favor and follow the AWS Big Data Blog.)
The Kaggle competition
While perusing the current Kaggle competitions, I happened upon the San Francisco Crime Prediction challenge. The aim of the competition is to predict which type of crime will occur at a given place and time. [ Imagine bad, obligatory Minority Report joke here. ] Specifically, we’re given date, time, day of week, police district, address, latitude, and longitude. From these inputs, we attempt to predict one of several classes (burglary, prostitution, forgery/counterfeiting, …).
Pre-processing
The data needs a little loving. First of all, date and time are smooshed together in a single timestamp. Is the specific time or date even relevant? Let’s turn those into coarser and hopefully more significant units of time: month, year, and hour. After implementing this minimal pre-processing in Python, I was ready to upload. As an aside, I’d recommend following Guy Ernest’s advice to shuffle the data and upload to S3 with the AWS CLI.
year | month | hour | DayOfWeek | PdDistrict | Address | X (longitude) | Y (latitude) | Category |
2014 | 9 | 3 | Thursday | BAYVIEW | 500 Block of 16TH ST | -122.3899696 | 37.76690677 | BURGLARY |
2003 | 3 | 15 | Thursday | BAYVIEW | 400 Block of HAMILTON ST | -122.409338 | 37.72545523 | MISSING PERSON |
2011 | 4 | 3 | Saturday | NORTHERN | 3000 Block of BUCHANAN ST | -122.4322319 | 37.79799064 | OTHER OFFENSES |
2004 | 9 | 17 | Sunday | SOUTHERN | MABINI ST / BONIFACIO ST | -122.3998374 | 37.78221275 | LARCENY/THEFT |
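The pre-processing described above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the script actually used for this post: the timestamp column name ("Dates"), its format string, and the two demo rows are assumptions made for the example.

```python
import csv
import io
import random
from datetime import datetime


def split_timestamp(row, ts_field="Dates", ts_format="%Y-%m-%d %H:%M:%S"):
    """Return a copy of `row` with the timestamp replaced by year, month, hour.

    `ts_field` and `ts_format` are assumptions about the raw CSV layout.
    """
    ts = datetime.strptime(row[ts_field], ts_format)
    out = {"year": ts.year, "month": ts.month, "hour": ts.hour}
    out.update({k: v for k, v in row.items() if k != ts_field})
    return out


def preprocess(in_file, out_file, seed=0):
    """Split the timestamp in every row, shuffle the rows, and write them out."""
    rows = [split_timestamp(r) for r in csv.DictReader(in_file)]
    random.Random(seed).shuffle(rows)  # shuffle before uploading
    writer = csv.DictWriter(out_file, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)


# Tiny in-memory demo with two made-up rows (not taken from the real file).
raw = io.StringIO(
    "Dates,DayOfWeek,PdDistrict,Category\n"
    "2014-09-04 03:15:00,Thursday,BAYVIEW,BURGLARY\n"
    "2011-04-02 03:40:00,Saturday,NORTHERN,OTHER OFFENSES\n"
)
cleaned = io.StringIO()
n_rows = preprocess(raw, cleaned)
print(cleaned.getvalue())
```

For the real data, the same `preprocess` function would take file handles opened on the Kaggle CSV and on the output file instead of the in-memory buffers.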
Running Amazon ML
When you point to a new data source in Amazon ML, you’re required to specify a schema and select a target variable (which later determines the machine learning algorithm). They’ve clearly thought this process through–evidenced by helpful graphs, such as the target variable distribution in the training set.
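For a sense of what that schema looks like outside the console, here is a hedged sketch that builds one as JSON. The field names follow the Amazon ML data schema format as I understand it; the type chosen for each column is purely an assumption for illustration (for instance, whether hour is better treated as NUMERIC or CATEGORICAL is a modelling choice this post does not settle).

```python
import json

# Column names come from the pre-processed table; the attribute types are
# assumptions made for this sketch, not settings confirmed by the post.
ATTRIBUTE_TYPES = {
    "year": "NUMERIC",
    "month": "NUMERIC",
    "hour": "NUMERIC",
    "DayOfWeek": "CATEGORICAL",
    "PdDistrict": "CATEGORICAL",
    "Address": "TEXT",
    "X": "NUMERIC",
    "Y": "NUMERIC",
    "Category": "CATEGORICAL",  # the target; categorical implies multi-class
}


def build_schema(attribute_types, target):
    """Assemble a schema dictionary with one entry per CSV column."""
    if target not in attribute_types:
        raise ValueError(f"target {target!r} is not one of the columns")
    return {
        "version": "1.0",
        "targetAttributeName": target,
        "dataFormat": "CSV",
        "dataFileContainsHeader": True,
        "attributes": [
            {"attributeName": name, "attributeType": kind}
            for name, kind in attribute_types.items()
        ],
    }


schema = build_schema(ATTRIBUTE_TYPES, target="Category")
schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

The target's type is what matters most here: a categorical target with more than two values is what leads to a multi-class model.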
Model results
With the data source defined, we create a new ML model. The algorithm is automatically determined by the target variable (multi-class classification). We can customize a few options: the train-test split, the regularization type, and some advanced feature engineering using a “recipe”. I wanted to see how well this performed out-of-the-box, so I chose default settings.
Amazon ML takes a few minutes to run on the 90 MB, 878,000-row training data; insofar as I’m able to divine from the log files, it’s running atop Amazon EMR (i.e. Hadoop). Still waiting. Wonder if I still have one of those Lagunitas Little Sumpin’ Extras in my fridge?
OK, it’s done. And it really didn’t take that long. For multi-class classification problems, F1 is the metric reported. And our report card doesn’t look very good.
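For reference, here is a minimal sketch of how a multi-class F1 score can be computed, assuming the reported figure is macro-averaged (the unweighted mean of the per-class F1 scores). The labels in the demo are made up for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, where F1 = 2TP / (2TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted this class, but it was wrong
            fn[truth] += 1  # missed the class that was actually there
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up demo: a model that always answers with the most common class.
y_true = ["LARCENY/THEFT", "LARCENY/THEFT", "BURGLARY", "ROBBERY"]
y_pred = ["LARCENY/THEFT"] * 4
majority_score = macro_f1(y_true, y_pred)
perfect_score = macro_f1(y_true, y_true)
print(round(majority_score, 4), perfect_score)
```

Because every class counts equally in the average, a model that leans on one popular class scores poorly even when that class is common.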
With an F1 score of 0.17 (the theoretical maximum is 1), we’re not impressing anyone. Amazon reassures me that this is better than “baseline” (which, per the Amazon ML documentation, is the macro-average F1 score of a model that always predicts the most frequent class). It was a nice gesture, Amazon ML. But I have beer, and you’re trapped inside a box. Who should be consoling whom? Let’s see what went wrong; we can pull up the confusion matrix with a click:
The problem is apparent: we’re predicting too many larceny/theft crimes, far more (51% of predictions) than were present in the training data (20%). But there’s a glimmer of hope: this Kaggle competition doesn’t use F1 as the evaluation metric. It uses a log-loss function. However, the two are likely to be strongly correlated.
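To make the log-loss metric concrete, here is a rough sketch of a multi-class version: the average negative log of the probability assigned to the true class. The row rescaling and the clipping constant (1e-15) are assumptions modelled on how such metrics are commonly defined, and the two demo rows are invented.

```python
import math


def multiclass_log_loss(y_true, prob_rows, eps=1e-15):
    """Mean of -log(p_true), after rescaling each row and clipping to [eps, 1 - eps]."""
    total = 0.0
    for label, probs in zip(y_true, prob_rows):
        row_sum = sum(probs.values())
        p = probs[label] / row_sum          # rescale so the row sums to 1
        p = min(max(p, eps), 1 - eps)       # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)


# Invented demo: the same model scored on probabilities vs. one-hot guesses.
y_true = ["BURGLARY", "ROBBERY"]
soft = [
    {"BURGLARY": 0.6, "ROBBERY": 0.4},
    {"BURGLARY": 0.7, "ROBBERY": 0.3},
]
hard = [
    {"BURGLARY": 1.0, "ROBBERY": 0.0},
    {"BURGLARY": 1.0, "ROBBERY": 0.0},
]
soft_loss = multiclass_log_loss(y_true, soft)
hard_loss = multiclass_log_loss(y_true, hard)
print(round(soft_loss, 4), round(hard_loss, 2))
```

One property worth noting: log-loss punishes confident mistakes severely. With this clipping, a single row that puts all of its probability on the wrong class contributes about 34.5 to the sum, whereas a hedged wrong guess costs far less.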
Submission
To generate predictions on the Kaggle-provided test.csv, we need to give it the same pre-processing we gave the training data, upload it to S3, designate it as a datasource in Amazon ML, then run a Batch Prediction on it. Here’s a snippet of the output:
tag | BURGLARY | PROSTITUTION | SECONDARY CODES | NON-CRIMINAL | DRUG/NARCOTIC | ROBBERY |
0 | 5.37E-02 | 1.31E-04 | 2.02E-02 | 5.98E-02 | 1.84E-02 | 3.76E-02 |
1 | 7.70E-04 | 3.52E-04 | 1.95E-02 | 1.00E-01 | 2.25E-02 | 5.94E-02 |
2 | 8.04E-02 | 1.63E-04 | 9.50E-03 | 6.94E-02 | 9.45E-03 | 1.17E-02 |
3 | 2.26E-02 | 4.16E-05 | 2.09E-02 | 1.23E-01 | 1.66E-02 | 5.15E-02 |
The output is more informative than a simple one prediction per row. The entries are one-against-all probabilities (see Amazon ML documentation), so to arrive at a single category prediction, we simply select the column with the maximum value (cue more Python finessing). Finally, the predictions are in the format Kaggle requires. Upload. Wait. 262nd place out of 330 with a log-loss metric of 24.9 (lower is better).
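The "select the column with the maximum value" step might look something like the sketch below. It uses only the six columns and four rows shown in the snippet above, so the picks are the maximum among those columns and not necessarily the overall winner across every category; the helper name `top_category` is hypothetical.

```python
import csv
import io


def top_category(row, id_field="tag"):
    """Return the column name holding the largest score in this row."""
    scores = {k: float(v) for k, v in row.items() if k != id_field}
    return max(scores, key=scores.get)


# The four rows from the batch prediction snippet, as CSV.
batch_output = io.StringIO(
    "tag,BURGLARY,PROSTITUTION,SECONDARY CODES,NON-CRIMINAL,DRUG/NARCOTIC,ROBBERY\n"
    "0,5.37E-02,1.31E-04,2.02E-02,5.98E-02,1.84E-02,3.76E-02\n"
    "1,7.70E-04,3.52E-04,1.95E-02,1.00E-01,2.25E-02,5.94E-02\n"
    "2,8.04E-02,1.63E-04,9.50E-03,6.94E-02,9.45E-03,1.17E-02\n"
    "3,2.26E-02,4.16E-05,2.09E-02,1.23E-01,1.66E-02,5.15E-02\n"
)
picks = {row["tag"]: top_category(row) for row in csv.DictReader(batch_output)}
for tag, category in picks.items():
    print(tag, category)
```

Among the columns shown, rows 0, 1, and 3 come out as NON-CRIMINAL and row 2 as BURGLARY.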
Amazon ML may not be the right tool for this job; or perhaps it just requires more clever feature-engineering than I had devised. Perhaps some precision in the latitude and longitude is being lost due to rounding. Would I go to a client with this sort of accuracy? Absolutely not. But Amazon ML is about speed, convenience, and scalability. For the SF crime problem, it’s a way to “fail quickly” and move on to approach #2.