This note addresses the typical applied problem of estimating from data how a target “conversion rate” function varies with some available scalar score function — e.g., estimating conversion rates from some marketing campaign as a function of a targeting model score. The idea centers around estimating the integral of the rate function; differentiating this gives the rate function. The method is a variation on a standard technique for estimating pdfs via fits to empirical cdfs.

# Statistics

## Gaussian Processes

We review the math and code needed to fit a Gaussian Process (GP) regressor to data. We conclude with a demo of a popular application, fast function minimization through GP-guided search. The gif below illustrates this approach in action — the red points are samples from the hidden red curve. Using these samples, we attempt to leverage GPs to find the curve’s minimum as fast as possible.

Appendices contain quick reviews on (i) the GP regressor posterior derivation, (ii) SKLearn’s GP implementation, and (iii) GP classifiers.

## Logistic Regression

We review binary logistic regression. In particular, we derive a) the equations needed to fit the algorithm via gradient descent, b) the maximum likelihood fit’s asymptotic coefficient covariance matrix, and c) expressions for model test point class membership probability confidence intervals. We also provide python code implementing a minimal “LogisticRegressionWithError” class whose “predict_proba” method returns prediction confidence intervals alongside its point estimates.

Our python code can be downloaded from our github page, here. Its use requires the jupyter, numpy, sklearn, and matplotlib packages.

## Hyperparameter sample-size dependence

Here, we briefly review a subtlety associated with machine-learning model selection: the fact that the optimal hyperparameters for a model can vary with training set size, $N.$ To illustrate this point, we derive expressions for the optimal strength for both $L_1$ and $L_2$ regularization in single-variable models. We find that the optimal $L_2$ approaches a finite constant as $N$ increases, but that the optimal $L_1$ decays exponentially fast with $N.$ Sensitive dependence on $N$ such as this should be carefully extrapolated out when optimizing mission-critical models.

## Average queue wait times with random arrivals

Queries ping a certain computer server at random times, on average $\lambda$ arriving per second. The server can respond to one per second and those that can’t be serviced immediately are queued up. What is the average wait time per query? Clearly if $\lambda \ll 1$, the average wait time is zero. But if $\lambda > 1$, the queue grows indefinitely and the answer is infinity! Here, we give a simple derivation of the general result — (9) below.

## Improved Bonferroni correction factors for multiple pairwise comparisons

A common task in applied statistics is the pairwise comparison of the responses of $N$ treatment groups in some statistical test — the goal being to decide which pairs exhibit differences that are statistically significant. Now, because there is one comparison being made for each pairing, a naive application of the Bonferroni correction analysis suggests that one should set the individual pairwise test sizes to $\alpha_i \to \alpha_f/{N \choose 2}$ in order to obtain a desired family-wise type 1 error rate of $\alpha_f$. Indeed, this solution is suggested by many texts. However, implicit in the Bonferroni analysis is the assumption that the comparisons being made are each mutually independent. This is not the case here, and we show that as a consequence the naive approach often returns type 1 error rates far from those desired. We provide adjusted formulas that allow for error-free Bonferroni-like corrections to be made.

[edit (7/4/2016): After posting this article, I’ve since found that the method we suggest here is related to / is a generalization of Tukey’s range test — see here.]

## Maximum-likelihood asymptotics

In this post, we review two facts about maximum-likelihood estimators: 1) They are consistent, meaning that they converge to the correct values given a large number of samples, $N$, and 2) They satisfy the Cramer-Rao lower bound for unbiased parameter estimates in this same limit — that is, they have the lowest possible variance of any unbiased estimator, in the $N\gg 1$ limit.

## A review of parameter regularization and Bayesian regression

Here, we review parameter regularization, which is a method for improving regression models through the penalization of non-zero parameter estimates. Why is this effective? Biasing parameters towards zero will (of course!) unfavorably bias a model, but it will also reduce its variance. At times the latter effect can win out, resulting in a net reduction in generalization error. We also review Bayesian regressions — in effect, these generalize the regularization approach, biasing model parameters to any specified prior estimates, not necessarily zero.

This is the second of a series of posts expounding on topics discussed in the text, “An Introduction to Statistical Learning”. Here, we cover material from its Chapters 2 and 6. See prior post here.

## Stochastic geometric series

Let $a_1, a_2, \ldots$ be an infinite set of non-negative samples taken from a distribution $P_0(a)$, and write

$$\tag{1} \label{problem}

S = 1 + a_1 + a_1 a_2 + a_1 a_2 a_3 + \ldots.

$$

Notice that if the $a_i$ were all the same, $S$ would be a regular geometric series, with value $S = \frac{1}{1-a}$. How will the introduction of $a_i$ randomness change this sum? Will $S$ necessarily converge? How is $S$ distributed? In this post, we discuss some simple techniques to answer these questions.

Note: This post covers work done in collaboration with my aged p, S. Landy.