Amazon’s Machine Learning is so easy, it should be a crime

I’m a big fan of Amazon Web Services.  So scalable. So intuitive.  So click-able.  Maybe I just feel like a kid again, playing with my digital LEGOs.  Naturally, I had to try out Amazon’s latest toy–Machine Learning.

Admittedly, Amazon’s embrace of cloud-based ML is perhaps a bit tardy (announced in April 2015). After all, Google appears to have launched their Prediction API in 2010, and Microsoft Azure ML pre-dates Amazon’s offering by about a year. Furthermore, at the time of writing, Prediction API and Azure ML are arguably more complete offerings. For example, Azure ML features modules for neural nets, tree-based methods, SVMs, etc. Here’s the most complete list I could find. But I digress.

The inspiration for this post comes from Guy Ernest over at AWS.  (If you don’t already, do yourself a favor and follow the AWS Big Data Blog.)

The Kaggle competition

While perusing the current Kaggle competitions, I happened upon the San Francisco Crime Prediction challenge. The aim of the competition is to predict which type of crime will occur at a given place and time. [ Imagine bad, obligatory Minority Report joke here. ] Specifically, we’re given date, time, day of week, police district, address, latitude, and longitude. From these inputs, we attempt to predict one of several classes (burglary, prostitution, forgery/counterfeiting, …).

Pre-processing

The data needs a little loving.  First of all, date and time are smooshed together in a single timestamp.  Is the specific time or date even relevant? Let’s turn those into coarser and hopefully more significant units of time: month, year, and hour.  After implementing this minimal pre-processing in Python, I was ready to upload.  As an aside, I’d recommend following Guy Ernest’s advice to shuffle the data and upload to S3 with the AWS CLI.

year month hour DayOfWeek PdDistrict Address X Y Category
2014 9 3 Thursday BAYVIEW 500 Block of 16TH ST -122.3899696 37.76690677 BURGLARY
2003 3 15 Thursday BAYVIEW 400 Block of HAMILTON ST -122.409338 37.72545523 MISSING PERSON
2011 4 3 Saturday NORTHERN 3000 Block of BUCHANAN ST -122.4322319 37.79799064 OTHER OFFENSES
2004 9 17 Sunday SOUTHERN MABINI ST / BONIFACIO ST -122.3998374 37.78221275 LARCENY/THEFT
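
For reference, here’s a minimal sketch of that pre-processing with pandas. The raw column names (Dates, DayOfWeek, PdDistrict, Address, X, Y, Category) are my recollection of the Kaggle train.csv and should be checked against the actual file:

import pandas as pd

# Assumed raw Kaggle columns; verify against the downloaded train.csv
df = pd.read_csv("train.csv", parse_dates=["Dates"])

# Split the timestamp into coarser units of time
df["year"] = df["Dates"].dt.year
df["month"] = df["Dates"].dt.month
df["hour"] = df["Dates"].dt.hour

# Keep the model inputs and target, shuffle the rows, and write out for the S3 upload
cols = ["year", "month", "hour", "DayOfWeek", "PdDistrict", "Address", "X", "Y", "Category"]
df[cols].sample(frac=1, random_state=42).to_csv("train_processed.csv", index=False)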

Running Amazon ML

When you point Amazon ML at a new data source, you’re required to specify a schema and select a target variable (which later determines the machine learning algorithm). They’ve clearly thought this process through, as evidenced by helpful graphs such as the distribution of the target variable in the training set.

[Figure: distribution of the target variable in the training data]

Model results

With the data source defined, we create a new ML model. The algorithm is automatically determined by the target variable (multi-class classification). We can customize a few options (the training/test split and the regularization type) and do some advanced feature engineering using a “recipe”. I wanted to see how well this performed out-of-the-box, so I chose the default settings.

Amazon ML takes a few minutes to run on the 90 MB, 878,000-row training data; insofar as I’m able to divine from the log files, it’s running atop Amazon EMR (i.e. Hadoop).  Still waiting.  Wonder if I still have one of those Lagunitas Little Sumpin’ Extras in my fridge?
OK, it’s done. And it really didn’t take that long. For multi-class classification problems, F1 is the metric reported. And our report card doesn’t look very good.

[Figure: reported F1 score for the model]
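
For reference, F1 is the harmonic mean of precision and recall; for multi-class problems, the number Amazon ML reports is (as far as I can tell) an average of the per-class F1 scores:

F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}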

With an F1 score of 0.17 (the theoretical maximum is 1), we’re not impressing anyone.  Amazon reassures me that this is better than “baseline” (likely the F1 score achieved by random guessing).  It was a nice gesture, Amazon ML.  But I have beer, and you’re trapped inside a box. Who should be consoling whom?  Let’s see what went wrong; we can pull up the confusion matrix with a click:

[Figure: confusion matrix for the multi-class model]

The problem is apparent: we’re predicting far more larceny/theft crimes (51%) than are present in the training data (20%). But there’s a glimmer of hope: this Kaggle competition doesn’t use F1 as the evaluation metric; it uses a log-loss function. However, the two are likely to be strongly correlated.
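
For reference, the multi-class log loss is typically defined as

\mathrm{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij})

where N is the number of observations, M is the number of crime categories, y_{ij} is 1 if observation i belongs to class j (and 0 otherwise), and p_{ij} is the predicted probability of that assignment.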

Submission

To generate predictions on the Kaggle-provided test.csv, we need to give it the same pre-processing we gave the training data, upload it to S3, designate it as a datasource in Amazon ML, then run a Batch Prediction on it.  Here’s a snippet of the output:

tag BURGLARY PROSTITUTION SECONDARY CODES NON-CRIMINAL DRUG/NARCOTIC ROBBERY
0 5.37E-02 1.31E-04 2.02E-02 5.98E-02 1.84E-02 3.76E-02
1 7.70E-04 3.52E-04 1.95E-02 1.00E-01 2.25E-02 5.94E-02
2 8.04E-02 1.63E-04 9.50E-03 6.94E-02 9.45E-03 1.17E-02
3 2.26E-02 4.16E-05 2.09E-02 1.23E-01 1.66E-02 5.15E-02

The output is more informative than a simple one prediction per row. The entries are one-against-all probabilities (see the Amazon ML documentation), so to arrive at a single category prediction, we simply select the column with the maximum value (cue more Python finessing). Finally, the predictions are in the format Kaggle requires. Upload. Wait. 262nd place out of 330 with a log-loss metric of 24.9 (lower is better).
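
A minimal sketch of that finessing step (the file name is a placeholder, and the exact column layout of the batch prediction output should be checked against your own download):

import pandas as pd

# Hypothetical file name; the batch prediction output has a "tag" row
# identifier plus one one-vs-all probability column per crime category
preds = pd.read_csv("batch_prediction_output.csv")
prob_cols = [c for c in preds.columns if c != "tag"]

# Single-category prediction: the class with the highest probability in each row
preds["Category"] = preds[prob_cols].idxmax(axis=1)
print(preds[["tag", "Category"]].head())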

Amazon ML may not be the right tool for this job; or perhaps it just requires more clever feature-engineering than I had devised.  Perhaps some precision in the latitude and longitude is being lost due to rounding.  Would I go to a client with this sort of accuracy?  Absolutely not.  But Amazon ML is about speed, convenience, and scalability.  For the SF crime problem, it’s a way to “fail quickly” and move on to approach #2.

Pyomo Meets Fantasy Football

Intro

So-called “daily fantasy leagues” are popping up on the internet en masse. This form of gambling is classified as a ‘game of skill’ and hence remains legal (for now) in all states but Arizona, Iowa, Louisiana, Montana, and Washington, according to one website’s disclaimers. The 30-second explanation of a daily fantasy contest follows:

You, the team manager, must select a team from a pool of players; your choices are constrained principally by a salary cap and the positional requirements of the roster.  The pool of available players might consist, for example, of all NFL players who have a game on Sunday, November 30.  Each player has an associated cost–a healthy Peyton Manning might cost $9,000 in a league with a $50,000 salary cap.  If we can obtain player point projections from a third-party site, an interesting albeit recreational optimization problem emerges.

Pyomo

Pyomo is a flexible modeling framework for Python, written and maintained by Sandia National Laboratories. Furthermore, it’s supported by the COIN-OR Foundation (under the now-deprecated name Coopr). For an open-source project, it possesses an unusually good level of support and documentation. I’ve worked with a few specialized algebraic modeling languages (MathProg in conjunction with GLPK, and Xpress-Mosel) and came to realize some inherent limitations in their static treatment of data. Sure, there are control structures in Mosel and difficult-to-learn APIs for GLPK, but nothing I’ve encountered offers the fluid model-building paradigm of Pyomo. Do you want to read a static data set, pull in real-time S&P 500 data, build dynamic constraints, pause to cook an omelet, then throw out all data lines that contain the word “aardvark”? Pyomo has you covered. Model building can be stopped and started at will because it’s all performed in-line with your other Python code. For more information about supported solvers and additional components, such as the stochastic programming extension, see http://www.pyomo.org/about/.

The Code

Let’s conjure up a fantasy league in which we’d like to choose 1 quarterback, 2 wide receivers, 2 running backs, 2 tight ends, 1 defense, and 1 kicker.  The salary cap is $50,000 and we’re keen on maximizing the sum of projected points.
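
Written out, this is a small binary program (a sketch of what the code below implements):

\max \sum_{p \in P} \mathrm{proj}_p \, x_p \quad \text{s.t.} \quad \sum_{p \in P} \mathrm{cost}_p \, x_p \le 50000, \qquad \sum_{p \in P :\, \mathrm{pos}(p) = k} x_p = \mathrm{limit}_k \;\; \forall k, \qquad x_p \in \{0, 1\}

Here P is the player pool, x_p indicates whether player p makes the roster, and k ranges over the six positions.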

The first few rows of input data:

name position team cost proj_pts
Shayne Graham K NO 4800 9.13
New Orleans D/ST D NO 4600 7.11
Drew Brees QB NO 9100 22.07
Luke McCown QB NO 5000 0.84
Jimmy Graham TE NO 7000 13.16
Mark Ingram RB NO 6100 10.75
Marques Colston WR NO 6700 9.45

Without further ado, the Python code:

from __future__ import division
import pandas as pd
from pyomo.environ import *
from pyomo.opt import SolverFactory

## IMPORT DATA ##
fileName = "ff_data.csv"
df = pd.read_csv(fileName)

## SETTINGS ##
max_salary = 50000

## DATA PREP ##
POSITIONS = ['QB', 'RB', 'WR', 'TE', 'D', 'K']
psn_limits = {'QB': 1, 'RB': 2, 'WR': 2, 'TE': 2, 'D': 1, 'K': 1}
PLAYERS = list(set(df['name']))
proj = df.set_index(['name'])['proj_pts'].to_dict()
cost = df.set_index(['name'])['cost'].to_dict()
pos = df.set_index(['name'])['position'].to_dict()

## DEFINE MODEL ##
model = ConcreteModel()

# decision variable
model.x = Var(PLAYERS, domain=Boolean, initialize=0)

# constraint: salary cap
def constraint_cap_rule(model):
 salary = sum(model.x[p] * cost[p] for p in PLAYERS)
 return salary <= max_salary
model.constraint_cap = Constraint(rule=constraint_cap_rule)

# constraint: positional limits
def constraint_position_rule(model, psn):
 psn_count = sum(model.x[p] for p in PLAYERS if pos[p] == psn)
 return psn_count == psn_limits[psn]
model.constraint_position = Constraint(POSITIONS, rule=constraint_position_rule)

# objective function: maximize projected points
def obj_expression(model):
 return summation(model.x, proj, index=PLAYERS)
model.OBJ = Objective(rule=obj_expression, sense=maximize)

# good for debugging the model
#model.pprint()

## SOLVE ##
opt = SolverFactory('glpk')

# create model instance, solve
instance = model.create()
results = opt.solve(instance)
instance.load(results) #load results back into model framework

## REPORT ##
print("status=" + str(results.Solution.Status))
print("solution=" + str(results.Solution.Objective.x1.value) + "\n")
print("*******optimal roster********")
P = [p for p in PLAYERS if instance.x[p].value==1]
for p in P:
 print(p + "\t" + pos[p] + "\t cost=" + str(cost[p]) + "\t projection=" + str(proj[p]))
print("roster cost=" + str(sum(cost[p] for p in P)))

The first four lines warrant brief commentary: I’m using Python 2.7.9, and I want to make sure that integer division returns a floating point number, hence line 1. Also noteworthy is

from pyomo.opt import SolverFactory

The SolverFactory class constructs an interface to the solver (GLPK in this case) and returns the results directly to Python, keeping the whole workflow in a single self-contained script.

Next, the program imports the player data using pandas in lines 6-8.

By convention, index sets are capitalized while data vectors are presented in lower-case. The dictionary psn_limits prescribes roster limits for each position.

POSITIONS = ['QB', 'RB', 'WR', 'TE', 'D', 'K']
psn_limits = {'QB': 1, 'RB': 2, 'WR': 2, 'TE': 2, 'D': 1, 'K': 1}

Next, the player names (index set), projected points (data), cost (data), and position (data) are extracted from the pandas data frame.

PLAYERS = list(set(df['name']))
proj = df.set_index(['name'])['proj_pts'].to_dict()
cost = df.set_index(['name'])['cost'].to_dict()
pos = df.set_index(['name'])['position'].to_dict()

When I first wrote this script, Pyomo preferred Python dictionaries and lists as model inputs; this may no longer be the case, but I encourage the reader to consult the extensive documentation for him or herself. The intermediate set() function in line 16 removes potential duplicate player names.

model = ConcreteModel()

Pyomo users may formulate concrete or abstract models. Abstract models are purely algebraic constructs which are later populated wholesale by static datasets, in much the same way as MathProg in GLPK or Xpress-Mosel. Concrete models, to me, better illustrate the power of Pyomo, namely the ability to dynamically load data from native Python structures.
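
For contrast, a minimal abstract version of a few of the same pieces might look like the following sketch; the .dat file name is hypothetical, and the exact instantiation call varies across Pyomo versions:

from pyomo.environ import AbstractModel, Set, Param, Var, Boolean

# Declare the algebra only; the data arrives later from a data command file
abstract = AbstractModel()
abstract.PLAYERS = Set()
abstract.proj = Param(abstract.PLAYERS)
abstract.cost = Param(abstract.PLAYERS)
abstract.x = Var(abstract.PLAYERS, domain=Boolean)

# Data is bound wholesale at instantiation time, e.g.:
# instance = abstract.create("ff_data.dat")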

For example, we take the recently extracted set PLAYERS to index the vector of decision variables:

model.x = Var(PLAYERS, domain=Boolean, initialize=0)

Pyomo constraints can be defined by rules, which are merely functions that return the desired constraint expression.

# constraint: salary cap
def constraint_cap_rule(model):
 salary = sum(model.x[p] * cost[p] for p in PLAYERS)
 return salary <= max_salary
model.constraint_cap = Constraint(rule=constraint_cap_rule)

The second constraint is really a set of constraints, one for each element in POSITIONS. Note the extra argument in the rule-function definition and the index set passed to the Constraint object.

# constraint: positional limits
def constraint_position_rule(model, psn):
 psn_count = sum(model.x[p] for p in PLAYERS if pos[p] == psn)
 return psn_count == psn_limits[psn]
model.constraint_position = Constraint(POSITIONS, rule=constraint_position_rule)

Approaching the end, we define the objective function with syntax similar to constraint definition. The summation() function provides a useful shorthand for element-wise multiplication followed by summing (i.e. a dot product).

# objective function: maximize projected points
def obj_expression(model):
 return summation(model.x, proj, index=PLAYERS)
model.OBJ = Objective(rule=obj_expression, sense=maximize)
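
The same objective could also be written as an explicit sum, which makes the dot product more obvious:

# equivalent, more explicit form of the objective
def obj_expression(model):
    return sum(proj[p] * model.x[p] for p in PLAYERS)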

The remainder of the code runs the optimization model and prints the results. Line 53 offers a helpful trick: the optimization results are returned in a raw form from the solver; instance.load() pulls the raw data back into the model framework where we can more easily query the results by player name.

Here’s the output:

status=optimal
solution=100.42

*******optimal roster********
Cleveland D/ST D  cost=4800  projection=10.18
Jermaine Kearse WR  cost=4700  projection=7.33
Heath Miller TE  cost=5400  projection=8.72
Lamar Miller RB  cost=7300  projection=16.15
Billy Cundiff K  cost=4600  projection=8.67
Jeremy Hill RB  cost=5200  projection=10.37
Brian Hoyer QB  cost=6200  projection=17.41
Miles Austin WR  cost=4800  projection=8.43
Jimmy Graham TE  cost=7000  projection=13.16
roster cost=50000

 

Final Remarks

Some leagues have a “FLEX” position which any one of a running back, wide receiver, or tight end may occupy. The implementation of one or more FLEX positions is left as an exercise. (Hint: constrain the total number of WR + RB + TE.)
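
One possible encoding of that hint, plugging into the model above (a sketch only; it assumes the RB/WR/TE limits in constraint_position_rule are relaxed from == to >= so the extra player can land at any of those positions):

# sketch of a single FLEX slot shared by RB, WR, and TE
FLEX_ELIGIBLE = ['RB', 'WR', 'TE']

def constraint_flex_rule(model):
    flex_total = sum(model.x[p] for p in PLAYERS if pos[p] in FLEX_ELIGIBLE)
    return flex_total == sum(psn_limits[k] for k in FLEX_ELIGIBLE) + 1

model.constraint_flex = Constraint(rule=constraint_flex_rule)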

Disclaimer: it is the author’s wish not to encourage gambling, but rather the pursuit of statistical education.

The 48th Mersenne Prime

The 48th Mersenne prime has been discovered through the Great Internet Mersenne Prime Search (GIMPS)!  The full story can be found here:

http://www.mersenne.org/various/57885161.htm

Also, here is some helpful background reading for the mathematically inclined:

http://mathworld.wolfram.com/MersennePrime.html

Over lunch yesterday, a coworker remarked, “I wonder if the digits are uniformly distributed?”  We had no reason to believe they wouldn’t be, but I decided to investigate.

First, I downloaded and cleaned the data from here. The 17,425,170 digits of the 48th Mersenne prime take up about 35 MB on my hard drive in comma-separated form. Next, I imported the data into R and used a humble hist() command to generate a histogram. The graph itself isn’t terribly informative, but below are the resulting frequencies. The whole process consumed about 6 seconds of wall-clock time on an AMD Phenom II 3 GHz quad-core.

0       1       2       3       4       5       6       7       8       9
1739652 1743497 1739844 1745602 1743528 1739641 1742677 1743436 1743298 1743995

At a glance, the digits do appear to be uniformly distributed. After all, we’d expect to see 1,742,517 of each digit. Satisfied? Hardly. There are all sorts of statistical tests available for assessing a frequency distribution’s similarity to a known distribution. For this task, I chose the χ² goodness-of-fit test. The null hypothesis supposes that the observed distribution of digits is consistent with a uniform distribution.

\chi^2 = \sum_{i=1}^{10} \frac{(O_i - E_i)^2}{E_i}

where O_i is the observed count of digit i and E_i is the expected count of digit i.

The result is χ² = 22.2603 on 9 degrees of freedom, with a p-value of 0.008, illustrated below.

[Figure: illustration of the χ² test result]

In conclusion, the goodness-of-fit test causes us to reject the null hypothesis for “typical” values of α (0.01, 0.05).  Against intuition, the test rejects the notion that the digits are uniformly distributed.

Here’s the R code.  Feel free to modify and explore!

-Nate

## Author: Nate De Jong
## Date: 2/12/13
## Description: a new Mersenne prime was discovered in Feb. 2013.
## a coworker wondered aloud at lunch: "I wonder if the digits are
## uniformly distributed?"
## this is the result

# begin timing
ptm = proc.time()

digits = scan("M57885161_digits.csv", sep=",")

#plot histogram
bins=c(-0.5, 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5)
digit_hist = hist(digits, breaks=bins)

# extract freq. counts from the histogram object
obs = digit_hist$counts

# chi-square test for uniformity
k = 10
N = sum(obs)
expected = rep(N/k, 10)
test_stat = sum((obs - expected)^2 / expected)
p_val = pchisq(test_stat, df=k-1, lower.tail=FALSE)

# end timing
proc.time() - ptm