Interpreting the results of linear regression

Our last post showed how to obtain the least-squares solution for linear regression and discussed the idea of sampling variability in the best estimates for the coefficients. In this post, we continue the discussion of uncertainty in linear regression, both in the estimates of the individual coefficients and in the quality of the overall fit.

Specifically, we’ll discuss how to calculate the 95% confidence intervals and p-values from hypothesis tests that are output by many statistical packages, such as Python’s statsmodels or R. An example with code is provided at the end.
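To preview what that looks like in practice, here is a minimal sketch using Python’s statsmodels on simulated data (the data and variable names are made up for illustration; the post’s own worked example comes at the end):

```python
import numpy as np
import statsmodels.api as sm

# Simulated data for illustration: y = 2 + 3*x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(scale=2, size=100)

X = sm.add_constant(x)        # add an intercept column
fit = sm.OLS(y, X).fit()      # ordinary least squares

print(fit.summary())             # coefficients, standard errors, t-statistics, p-values
print(fit.conf_int(alpha=0.05))  # 95% confidence intervals for the coefficients
```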

Try Caffe pre-installed on a VirtualBox image

A previous post showed beginners how to try out deep learning libraries by

  1. using an Amazon Machine Image (AMI) pre-installed with deep learning libraries
  2. setting up a Jupyter notebook server to play with said libraries

If you have VirtualBox and Vagrant, you can follow a similar procedure on your own computer. The advantage is that you can develop locally, then deploy to an expensive AWS EC2 GPU instance once your scripts are ready.

Start deep learning with Jupyter notebooks in the cloud

Want a quick and easy way to play around with deep learning libraries? Puny GPU got you down? Thanks to Amazon Web Services (AWS) — specifically, AWS Elastic Compute Cloud (EC2) — no data scientist need be left behind.

Jupyter/IPython notebooks are indispensable tools for learning and tinkering. This post shows how to set up a public Jupyter notebook server in EC2 and then access it remotely through your web browser, just as you would if you were using a notebook launched from your own laptop.
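The heart of that setup is a handful of lines in the notebook configuration file. A rough sketch, assuming the classic Jupyter notebook server (not the post’s full walkthrough, and the values here are placeholders):

```python
# ~/.jupyter/jupyter_notebook_config.py -- illustrative values only
c = get_config()  # provided by Jupyter when it loads this file

c.NotebookApp.ip = '0.0.0.0'        # listen on all interfaces, not just localhost
c.NotebookApp.port = 8888           # must also be opened in the EC2 security group
c.NotebookApp.open_browser = False  # headless server: don't try to launch a browser
# hashed password, e.g. generated with: from notebook.auth import passwd; passwd()
c.NotebookApp.password = 'sha1:...'  # placeholder -- substitute your own hash
```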

Dotfiles for peace of mind

Reinstalling software and configuring settings on a new computer is a pain. After my latest hard drive failure set the stage for yet another round of download-extract-install and configuration file twiddling, it was time to overhaul my approach. "Enough is enough!"

This post walks through

  1. how to back up and automate the installation and configuration process
  2. how to set up a minimal framework for data science

We’ll use a dotfiles repository on GitHub to illustrate both points in parallel.
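To give a flavor of point 1 before diving in, the core trick is to symlink version-controlled config files from the repository into your home directory. This is a hypothetical sketch in Python, not the repository’s actual install script, and the repo path and file names are made up:

```python
#!/usr/bin/env python3
"""Hypothetical sketch: link dotfiles from a cloned repo into $HOME."""
from pathlib import Path

REPO = Path.home() / "dotfiles"              # assumed location of the cloned repo
FILES = [".bashrc", ".vimrc", ".gitconfig"]  # illustrative file names

for name in FILES:
    src = REPO / name
    dest = Path.home() / name
    if dest.exists() or dest.is_symlink():
        dest.rename(dest.parent / (dest.name + ".bak"))  # back up any existing file
    dest.symlink_to(src)  # the live dotfile now points at the version-controlled copy
    print(f"linked {dest} -> {src}")
```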

Support Vector Machines for classification

To whet your appetite for support vector machines, here’s a quote from machine learning researcher Andrew Ng:

“SVMs are among the best (and many believe are indeed the best) ‘off-the-shelf’ supervised learning algorithms.”

Andrew Ng

Professor Ng covers SVMs in his excellent Machine Learning MOOC, a gateway into data science for many, but leaves out some details. That gap motivated us to put together some notes here to answer the question:

“What are the support vectors in support vector machines?”
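As a concrete teaser, scikit-learn exposes the answer directly after fitting a classifier. A toy example on synthetic data (not taken from the post or the MOOC):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters of points; only the few points nearest the
# decision boundary end up mattering for the fit
X, y = make_blobs(n_samples=60, centers=2, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_vectors_)  # the training points that define the margin
print(clf.support_)          # their indices in the training set
```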

Build a web scraper for a literature search – from soup to nuts

Code, references, and examples of this project are on GitHub.

In this post, I’ll describe the soup-to-nuts process of automating a literature search in PubMed Central using R.
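The post builds the scraper in R; as a rough illustration of the same idea in Python, here is a minimal sketch that queries NCBI’s public E-utilities API (the search term is made up, not the post’s actual query):

```python
import requests

# NCBI E-utilities: esearch returns the IDs of records matching a query
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

params = {
    "db": "pmc",                        # search PubMed Central
    "term": "machine learning[Title]",  # illustrative query only
    "retmax": 20,
    "retmode": "json",
}
resp = requests.get(ESEARCH, params=params, timeout=30)
resp.raise_for_status()

ids = resp.json()["esearchresult"]["idlist"]
print(f"found {len(ids)} article IDs:", ids[:5])
```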

It feels deeply satisfying to sit back and let the code do the dirty work.

Is it as satisfying as a bowl of red-braised beef noodle soup with melt-in-your-mouth tendons from Taipei’s Yong Kang Restaurant (featured image)?

If you have to do a lit search like this more than once, then I have to say the answer is yes — unequivocally, yes.

A review of the online course “Introduction to Big Data with Apache Spark”

This is a review of Introduction to Big Data with Apache Spark (CS100.1x), the first in a two-part series introducing the big data processing engine, Spark.

The month-long MOOC was offered on edX for the first time in June 2015, and its sequel, Scalable Machine Learning (CS190.1x), is currently under way.

[Still from a lecture in “Introduction to Big Data with Apache Spark”]
