
Machine learning to predict San Francisco crime

In today’s post, we document our submission to the recent Kaggle competition aimed at predicting the category of San Francisco crimes, given only their time and location of occurrence. As a reminder, Kaggle is a site where one can compete with other data scientists on various data challenges. We took this competition as an opportunity to explore the Naive Bayes algorithm. With the few steps discussed below, we were able to quickly move from the middle of the pack to the top 33% on the competition leaderboard, all while sticking with this simple model!



MLB predictions take off!

[Screenshot: MLB dashboard]

Announcing: EFavDB’s first Major League Baseball prediction project! Just as in our corresponding NBA project, we will be providing free-of-charge game winner predictions on a weekly basis. In addition, we’ve implemented an MLB dashboard (screenshot above) where you can get a quick summary of each team’s prior results. You can also use the dashboard to check out our guesses for who would beat whom, were they to play today, etc.

The algorithm we’ll be applying this season is similar to that discussed here, for the NBA. We’ve set its parameters to generate reasonably conservative predictions, but ones that will also lead to interesting upset predictions when appropriate. Unlike many other sites, our predictions do not take into account the over-under values published by bookies, and so are independent of their opinions. Looking forward to a great season!



Forecasting Bike Sharing Demand

In today’s post, we document our efforts at applying a gradient boosted trees model to forecast bike sharing demand — a problem posed in a recent Kaggle competition. For those not familiar, Kaggle is a site where one can compete with other data scientists on various data challenges. Top scorers often win prize money, but the site more generally serves as a great place to grab interesting datasets to explore and play with. With the simple optimization steps discussed below, we managed to quickly move from the bottom 10% of the competition — our first-pass attempt’s score — to the top 10%: no sweat!




Measles vaccination rate by USA state and relation to mean outbreak size

In this post, we provide a quick overview of the data and science of measles spread. Using Python (code provided), we extract from a CDC data set the 2012 youth vaccination rate for each US state (see figure below). To aid in the interpretation of this data, we also review and describe the results of a generalized “SIR” model for disease spread.



Machine learning for facial recognition

A guest post, contributed by Damien Ramunno-Johnson (LinkedIn, bio-sketch)

Follow us on Twitter for new submission alerts!


The ability to identify faces is a skill that people develop very early in life and can apply almost effortlessly. One reason for this is that our brains are very well adapted for pattern recognition. In contrast, facial recognition can be a somewhat difficult problem for computers. Today, given a full frontal image of a face, computer facial recognition software works well. However, problems can arise given large camera angles, poor lighting, or exaggerated facial expressions: computers still have a way to go before they catch up with us in this arena.
