We review the two essentials of principal component analysis (“PCA”): 1) The principal components of a set of data points are the eigenvectors of the correlation matrix of these points in feature space. 2) Projecting the data onto the subspace spanned by the first $k$ of these — listed in descending eigenvalue order — provides the best possible $k$-dimensional approximation to the data, in the sense of captured variance.

# Methods

## Support Vector Machines for classification

To whet your appetite for support vector machines, here’s a quote from machine learning researcher Andrew Ng:

“SVMs are among the best (and many believe are indeed the best) ‘off-the-shelf’ supervised learning algorithms.”

Professor Ng covers SVMs in his excellent Machine Learning MOOC, a gateway for many into the realm of data science, but leaves out some details, motivating us to put together some notes here to answer the question:

“What are the *support vectors* in support vector machines?”

## Getting started with Pandas

We have made use of Python’s Pandas package in a variety of posts on the site. These have showcased some of Pandas’ abilities including the following:

- DataFrames for data manipulation with built in indexing
- Handling of missing data
- Data alignment
- Melting/stacking and Pivoting/unstacking data sets
- Groupby feature allowing split -> apply -> combine operations on data sets
- Data merging and joining

Pandas is also a high performance library, with much of its code written in Cython or C. Unfortunately, Pandas can have a bit of a steep learning curve — In this post, I’ll cover some introductory tips and tricks to help one get started with this excellent package.

Notes:

- This post was partially inspired by Tom Augspurger’s Pandas tutorial, which has a youtube video that can be viewed along side it. We also suggest some other excellent resource materials — where relevant — below.
- The notebook we use below can be downloaded from our github page. Feel free to grab it and follow along.

Follow @efavdb

Follow us on twitter for new submission alerts!

## Build a web scraper for a literature search – from soup to nuts

*Code, references, and examples of this project are on Github.*

In this post, I’ll describe the soup to nuts process of automating a literature search in Pubmed Central using R.

It feels deeply satisfying to sit back and let the code do the dirty work.

Is it as satisfying as a bowl of red-braised beef noodle soup with melt-in-your-mouth tendons from Taipei’s Yong Kang Restaurant (featured image)?

If you have to do a lit search like this more than once, then I have to say the answer is yes — unequivocally, yes.

(more…)

## Leave-one-out cross-validation

This will be the first of a series of short posts relating to subject matter discussed in the text, “An Introduction to Statistical Learning”. This is an interesting read, but it often skips over statement proofs — that’s where this series of posts comes in! Here, I consider the content of Section 5.1.2: This gives a lightning-quick “short cut” method for evaluating a regression’s leave-one-out cross-validation error. The method is applicable to any least-squares linear fit.

## The mean shift clustering algorithm

### Mean shift clustering

Mean shift clustering is a general non-parametric cluster finding procedure — introduced by Fukunaga and Hostetler [1], and popular within the computer vision field. Nicely, and in contrast to the more-well-known K-means clustering algorithm, the output of mean shift does not depend on any explicit assumptions on the shape of the point distribution, the number of clusters, or any form of random initialization.

(more…)

## Machine Learning Methods: Decision trees and forests

This post contains our crib notes on the basics of decision trees and forests. We first discuss the construction of individual trees, and then introduce random and boosted forests. We also discuss efficient implementations of greedy tree construction algorithms, showing that a single tree can be constructed in $O(k \times n \log n)$ time, given $n$ training examples having $k$ features each. We provide exercises on interesting related points and an appendix containing relevant python/sk-learn function calls.

(more…)