## Try Caffe pre-installed on a VirtualBox image

A previous post showed beginners how to try out deep learning libraries by

1. using an Amazon Machine Image (AMI) pre-installed with deep learning libraries
2. setting up a Jupyter notebook server to play with said libraries

If you have VirtualBox and Vagrant, you can follow a similar procedure on your own computer. The advantage is that you can develop locally, then deploy on an expensive AWS EC2 gpu instance when your scripts are ready.
(more…)

## Start deep learning with Jupyter notebooks in the cloud

Want a quick and easy way to play around with deep learning libraries? Puny GPU got you down? Thanks to Amazon Web Services (AWS) — specifically, AWS Elastic Compute Cloud (EC2) — no data scientist need be left behind.

Jupyter/IPython notebooks are indispensable tools for learning and tinkering. This post shows how to set up a public Jupyter notebook server in EC2 and then access it remotely through your web browser, just as you would if you were using a notebook launched from your own laptop.
(more…)

## Dotfiles for peace of mind

Reinstalling software and configuring settings on a new computer is a pain. After my latest hard drive failure set the stage for yet another round of download-extract-install and configuration file twiddling, it was time to overhaul my approach. "Enough is enough!"

This post walks through

1. how to back up and automate the installation and configuration process
2. how to set up a minimal framework for data science

We’ll use a dotfiles repository on Github to illustrate both points in parallel.

## Independent component analysis

Two microphones are placed in a room where two conversations are taking place simultaneously. Given these two recordings, can one “remix” them in some prescribed way to isolate the individual conversations? Yes! In this post, we review one simple approach to solving this type of problem, Independent Component Analysis (ICA). We share an ipython document implementing ICA and link to a youtube video illustrating its application to audio de-mixing.

## Maximum-likelihood asymptotics

In this post, we review two facts about maximum-likelihood estimators: 1) They are consistent, meaning that they converge to the correct values given a large number of samples, $N$, and 2) They satisfy the Cramer-Rao lower bound for unbiased parameter estimates in this same limit — that is, they have the lowest possible variance of any unbiased estimator, in the $N\gg 1$ limit.

## Principal component analysis

We review the two essentials of principal component analysis (“PCA”): 1) The principal components of a set of data points are the eigenvectors of the correlation matrix of these points in feature space. 2) Projecting the data onto the subspace spanned by the first $k$ of these — listed in descending eigenvalue order — provides the best possible $k$-dimensional approximation to the data, in the sense of captured variance.

## NBA 2015-16!!!

NBA is back this Tuesday! The dashboard and weekly predictions are now live*, once again. These will each be updated daily, with game winner predictions, hypothetical who-would-beat-whom daily matchup predictions, and more. For a discussion on how we make our predictions, see our first post on this topic. Note that our approach does not make use of any bookie predictions (unlike many other sites), and so provide an independent look on the game.

This season, we hope to crack 70% accuracy!

* Note that we have left up last season’s completed games results, for review purposes. Once every team has played one game, we’ll switch it over to the current season’s results.

## Support Vector Machines for classification

To whet your appetite for support vector machines, here’s a quote from machine learning researcher Andrew Ng:

“SVMs are among the best (and many believe are indeed the best) ‘off-the-shelf’ supervised learning algorithms.”

Professor Ng covers SVMs in his excellent Machine Learning MOOC, a gateway for many into the realm of data science, but leaves out some details, motivating us to put together some notes here to answer the question:

“What are the support vectors in support vector machines?”

## A review of parameter regularization and Bayesian regression

Here, we review parameter regularization, which is a method for improving regression models through the penalization of non-zero parameter estimates. Why is this effective? Biasing parameters towards zero will (of course!) unfavorably bias a model, but it will also reduce its variance. At times the latter effect can win out, resulting in a net reduction in generalization error. We also review Bayesian regressions — in effect, these generalize the regularization approach, biasing model parameters to any specified prior estimates, not necessarily zero.

This is the second of a series of posts expounding on topics discussed in the text, “An Introduction to Statistical Learning”. Here, we cover material from its Chapters 2 and 6. See prior post here.

## Getting started with Pandas

We have made use of Python’s Pandas package in a variety of posts on the site. These have showcased some of Pandas’ abilities including the following:

• DataFrames for data manipulation with built in indexing
• Handling of missing data
• Data alignment
• Melting/stacking and Pivoting/unstacking data sets
• Groupby feature allowing split -> apply -> combine operations on data sets
• Data merging and joining

Pandas is also a high performance library, with much of its code written in Cython or C. Unfortunately, Pandas can have a bit of a steep learning curve — In this post, I’ll cover some introductory tips and tricks to help one get started with this excellent package.

Notes:

• This post was partially inspired by Tom Augspurger’s Pandas tutorial, which has a youtube video that can be viewed along side it. We also suggest some other excellent resource materials — where relevant — below.
• The notebook we use below can be downloaded from our github page. Feel free to grab it and follow along.