We illustrate the application of two linear compression algorithms in python: Principal component analysis (PCA) and least-squares feature selection. Both can be used to compress a passed array, and they both work by stripping out redundant columns from the array. The two differ in that PCA operates in a particular rotated frame, while the feature selection solution operates directly on the original columns. As we illustrate below, PCA always gives a stronger compression. However, the feature selection solution is often comparably strong, and its output has the benefit of being relatively easy to interpret — a virtue that is important for many applications.
This is a tutorial post relating to our python feature selection package,
linselect. The package allows one to easily identify minimal, informative feature subsets within a given data set.
Here, we demonstrate
linselect‘s basic API by exploring the relationship between the daily percentage lifts of 50 tech stocks over one trading year. We will be interested in identifying minimal stock subsets that can be used to predict the lifts of the others.
This is a demonstration walkthrough, with commentary and interpretation throughout. See the package docs folder for docstrings that succinctly detail the API.
- Load the data and examine some stock traces
- FwdSelect, RevSelect; supervised, single target
- FwdSelect, RevSelect; supervised, multiple targets
- FwdSelect, RevSelect; unsupervised
The data and a Jupyter notebook containing the code for this demo are available on our github, here.
linselect package can be found on our github, here.
It has been quite awhile since I have posted, largely because soon after I started my job at Square I had a child! I hope to have some newer blog post soon. But along those lines I want to share a blog post I did with a coworker (Juan Hernandez) for Square that gives a taste of some of the cool data science work we have been up to. This post covers work we did to create a framework for making models interpretable.
This note addresses the typical applied problem of estimating from data how a target “conversion rate” function varies with some available scalar score function — e.g., estimating conversion rates from some marketing campaign as a function of a targeting model score. The idea centers around estimating the integral of the rate function; differentiating this gives the rate function. The method is a variation on a standard technique for estimating pdfs via fits to empirical cdfs.
We review the math and code needed to fit a Gaussian Process (GP) regressor to data. We conclude with a demo of a popular application, fast function minimization through GP-guided search. The gif below illustrates this approach in action — the red points are samples from the hidden red curve. Using these samples, we attempt to leverage GPs to find the curve’s minimum as fast as possible.
Appendices contain quick reviews on (i) the GP regressor posterior derivation, (ii) SKLearn’s GP implementation, and (iii) GP classifiers.
Here, I give a quick review of the concept of a Martingale. A Martingale is a sequence of random variables satisfying a specific expectation conservation law. If one can identify a Martingale relating to some other sequence of random variables, its use can sometimes make quick work of certain expectation value evaluations.
This note is adapted from Chapter 2 of Stochastic Calculus and Financial Applications, by Steele.
We review binary logistic regression. In particular, we derive a) the equations needed to fit the algorithm via gradient descent, b) the maximum likelihood fit’s asymptotic coefficient covariance matrix, and c) expressions for model test point class membership probability confidence intervals. We also provide python code implementing a minimal “LogisticRegressionWithError” class whose “predict_proba” method returns prediction confidence intervals alongside its point estimates.
Our python code can be downloaded from our github page, here. Its use requires the jupyter, numpy, sklearn, and matplotlib packages.
The AUC score is a popular summary statistic that is often used to communicate the performance of a classifier. However, we illustrate here that this score depends not only on the quality of the model in question, but also on the difficulty of the test set considered: If samples are added to a test set that are easily classified, the AUC will go up — even if the model studied has not improved. In general, this behavior implies that isolated, single AUC scores cannot be used to meaningfully qualify a model’s performance. Instead, the AUC should be considered a score that is primarily useful for comparing and ranking multiple models — each at a common test set difficulty.