
NBA 2015-16!!!

NBA is back this Tuesday! The dashboard and weekly predictions are once again live*. Each will be updated daily with game-winner predictions, hypothetical who-would-beat-whom matchup predictions, and more. For a discussion of how we make our predictions, see our first post on this topic. Note that our approach does not make use of any bookie predictions (unlike many other sites), and so provides an independent look at the games.

This season, we hope to crack 70% accuracy!

* Note that we have left up the results of last season’s completed games, for review purposes. Once every team has played one game, we’ll switch the dashboard over to the current season’s results.


Support Vector Machines for classification

To whet your appetite for support vector machines, here’s a quote from machine learning researcher Andrew Ng:

“SVMs are among the best (and many believe are indeed the best) ‘off-the-shelf’ supervised learning algorithms.”

Andrew Ng

Professor Ng covers SVMs in his excellent Machine Learning MOOC, a gateway for many into the realm of data science, but he leaves out some details. That motivated us to put together some notes here to answer the question:

“What are the support vectors in support vector machines?”
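
To make the question concrete, here is a minimal scikit-learn sketch (our own illustration, not code from the post): after fitting, the classifier exposes exactly which training points ended up as support vectors.

```python
# A minimal sketch (not from the post itself): fit a linear SVM on two
# separable blobs and inspect which training points are the support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 2, rng.randn(20, 2) + 2])  # two blobs
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_)          # indices of the support vectors within X
print(clf.support_vectors_)  # the support vectors themselves
```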


A review of parameter regularization and Bayesian regression

Here, we review parameter regularization, which is a method for improving regression models through the penalization of non-zero parameter estimates. Why is this effective? Biasing parameters towards zero will (of course!) unfavorably bias a model, but it will also reduce its variance. At times the latter effect can win out, resulting in a net reduction in generalization error. We also review Bayesian regressions — in effect, these generalize the regularization approach, biasing model parameters to any specified prior estimates, not necessarily zero.
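
As a quick, hedged illustration of this bias-variance trade (our own sketch, not the post's code), compare ordinary least squares with ridge regression, which penalizes non-zero coefficients:

```python
# Minimal sketch: ridge regression shrinks coefficient estimates towards
# zero, trading a little bias for (often) lower variance.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(1)
X = rng.randn(30, 5)
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + 0.5 * rng.randn(30)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha sets the penalty strength
print(ols.coef_)    # unpenalized least-squares estimates
print(ridge.coef_)  # estimates shrunk towards zero
```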

This is the second of a series of posts expounding on topics discussed in the text, “An Introduction to Statistical Learning”. Here, we cover material from its Chapters 2 and 6. See the prior post here.


Getting started with Pandas

We have made use of Python’s Pandas package in a variety of posts on this site. These posts have showcased some of Pandas’ abilities, including the following:

  • DataFrames for data manipulation with built-in indexing
  • Handling of missing data
  • Data alignment
  • Melting/stacking and Pivoting/unstacking data sets
  • Groupby feature allowing split -> apply -> combine operations on data sets
  • Data merging and joining

Pandas is also a high-performance library, with much of its code written in Cython or C. Unfortunately, Pandas can have a bit of a steep learning curve. In this post, I’ll cover some introductory tips and tricks to help one get started with this excellent package.
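
For a tiny, hedged taste of the split -> apply -> combine pattern mentioned above (a toy example of our own, not taken from the notebook):

```python
# Minimal sketch: a split -> apply -> combine operation with pandas.
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "points": [10, 12, 8, 15],
})
print(df.groupby("team")["points"].mean())  # one mean per team
```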

Notes:

  • This post was partially inspired by Tom Augspurger’s Pandas tutorial, which has a YouTube video that can be viewed alongside it. We also suggest some other excellent resource materials, where relevant, below.
  • The notebook we use below can be downloaded from our GitHub page. Feel free to grab it and follow along.


Follow us on Twitter for new submission alerts!


Stochastic geometric series

Let $a_1, a_2, \ldots$ be an infinite sequence of non-negative samples drawn from a distribution $P_0(a)$, and write
$$\tag{1} \label{problem}
S = 1 + a_1 + a_1 a_2 + a_1 a_2 a_3 + \ldots.
$$
Notice that if the $a_i$ were all equal to some $a < 1$, $S$ would be an ordinary geometric series, with value $S = \frac{1}{1-a}$. How will the introduction of $a_i$ randomness change this sum? Will $S$ necessarily converge? How is $S$ distributed? In this post, we discuss some simple techniques for answering these questions.
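
One quick way to build intuition (a hedged sketch of our own, not the post's code) is to estimate the distribution of $S$ by Monte Carlo, truncating the series once its terms are negligible:

```python
# Minimal sketch: Monte Carlo samples of S = 1 + a1 + a1*a2 + ... with
# a_i ~ Uniform(0, 0.9); truncating at n_terms is a tiny approximation
# here, since each extra term is suppressed by another factor below 0.9.
import numpy as np

rng = np.random.default_rng(0)

def sample_S(n_terms=1000):
    a = rng.uniform(0.0, 0.9, size=n_terms)
    return 1.0 + np.cumprod(a).sum()  # partial products a1, a1*a2, ...

samples = np.array([sample_S() for _ in range(10_000)])
print(samples.mean(), samples.std())  # empirical mean and spread of S
```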

Note: This post covers work done in collaboration with my aged p, S. Landy.


Build a web scraper for a literature search – from soup to nuts

Code, references, and examples of this project are on Github.

In this post, I’ll describe the soup-to-nuts process of automating a literature search in PubMed Central using R.

It feels deeply satisfying to sit back and let the code do the dirty work.

Is it as satisfying as a bowl of red-braised beef noodle soup with melt-in-your-mouth tendons from Taipei’s Yong Kang Restaurant (featured image)?

If you have to do a lit search like this more than once, then I have to say the answer is yes — unequivocally, yes.

Leave-one-out cross-validation

This will be the first of a series of short posts relating to subject matter discussed in the text, “An Introduction to Statistical Learning”. This is an interesting read, but it often skips over statement proofs; that’s where this series of posts comes in! Here, I consider the content of Section 5.1.2, which gives a lightning-quick shortcut for evaluating a regression’s leave-one-out cross-validation error. The method is applicable to any least-squares linear fit.
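
For orientation, here is the standard result (eq. 5.2 of the text; the full post supplies the proof): if $\hat{y}_i$ are the fitted values from a single least-squares fit and $h_i$ the corresponding leverage values, then
$$
CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2,
$$
so one fit on the full data set yields all $n$ leave-one-out residuals.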


Machine learning to predict San Francisco crime

In today’s post, we document our submission to the recent Kaggle competition aimed at predicting the category of San Francisco crimes, given only their time and location of occurrence. As a reminder, Kaggle is a site where one can compete with other data scientists on various data challenges. We took this competition as an opportunity to explore the Naive Bayes algorithm. With the few steps discussed below, we were able to quickly move from the middle of the pack to the top 33% of the competition leaderboard, all the while continuing with this simple model!
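
As a hedged, minimal illustration of the model family (not our actual submission code; the features here are hypothetical):

```python
# Minimal sketch: a Gaussian Naive Bayes classifier on made-up features
# (hour of day, latitude, longitude), as one might for the SF crime data.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[23, 37.77, -122.42],
              [14, 37.78, -122.41],
              [ 2, 37.76, -122.43],
              [ 9, 37.77, -122.40]])
y = np.array(["LARCENY", "ASSAULT", "LARCENY", "ASSAULT"])

clf = GaussianNB().fit(X, y)
print(clf.predict([[16, 37.77, -122.42]]))  # predicted crime category
```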


How not to sort by average rating, revisited

What is the best method for ranking items that have positive and negative reviews? Some sites, including reddit, have adopted an algorithm suggested by Evan Miller to generate their item rankings. However, this algorithm can sometimes be unfairly pessimistic about new, good items. This is especially true of items whose first few votes are negative, an issue that can be “gamed” by adversaries. In this post, we consider three alternative ranking methods that can enable high-quality items to more easily bubble up. The last is the simplest, but continues to give good results: one simply seeds each item’s vote count with a suitable fixed number of hidden “starter” votes.
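
A minimal sketch of that last method (our own toy version; the starter-vote counts are illustrative, not prescriptive):

```python
# Minimal sketch: seed each item's tally with hidden "starter" votes
# (a up, b down) and rank by the resulting smoothed score.
def seeded_score(ups, downs, a=3, b=3):
    return (ups + a) / (ups + downs + a + b)

print(seeded_score(0, 1))    # a new item with one downvote isn't buried
print(seeded_score(90, 10))  # an established good item still ranks high
```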

A review of the online course “Introduction to Big Data with Apache Spark”

This is a review of Introduction to Big Data with Apache Spark (CS100.1x), the first in a two-part series introducing the big data processing engine, Spark.

The month-long MOOC was offered on edX for the first time in June 2015, and its sequel, Scalable Machine Learning (CS190.1x), is currently underway.

[Image: still from a lecture in “Introduction to Big Data with Apache Spark”]
