Skip to main content
Untitled-2

Improved Bonferroni correction factors for multiple pairwise comparisons

A common task in applied statistics is the pairwise comparison of the responses of $N$ treatment groups in some statistical test — the goal being to decide which pairs exhibit differences that are statistically significant. Now, because there is one comparison being made for each pairing, a naive application of the Bonferroni correction analysis suggests that one should set the individual pairwise test sizes to $\alpha_i \to \alpha_f/{N \choose 2}$ in order to obtain a desired family-wise type 1 error rate of $\alpha_f$. Indeed, this solution is suggested by many texts. However, implicit in the Bonferroni analysis is the assumption that the comparisons being made are each mutually independent. This is not the case here, and we show that as a consequence the naive approach often returns type 1 error rates far from those desired. We provide adjusted formulas that allow for error-free Bonferroni-like corrections to be made.

[edit (7/4/2016): After posting this article, I’ve since found that the method we suggest here is related to / is a generalization of Tukey’s range test — see here.]

(more…)

Read More

Independent component analysis

Two microphones are placed in a room where two conversations are taking place simultaneously. Given these two recordings, can one “remix” them in some prescribed way to isolate the individual conversations? Yes! In this post, we review one simple approach to solving this type of problem, Independent Component Analysis (ICA). We share an ipython document implementing ICA and link to a youtube video illustrating its application to audio de-mixing.

(more…)

Read More

Maximum-likelihood asymptotics

In this post, we review two facts about maximum-likelihood estimators: 1) They are consistent, meaning that they converge to the correct values given a large number of samples, $N$, and 2) They satisfy the Cramer-Rao lower bound for unbiased parameter estimates in this same limit — that is, they have the lowest possible variance of any unbiased estimator, in the $N\gg 1$ limit.

(more…)

Read More

Principal component analysis

We review the two essentials of principal component analysis (“PCA”): 1) The principal components of a set of data points are the eigenvectors of the correlation matrix of these points in feature space. 2) Projecting the data onto the subspace spanned by the first $k$ of these — listed in descending eigenvalue order — provides the best possible $k$-dimensional approximation to the data, in the sense of captured variance.

(more…)

Read More

NBA 2015-16!!!

NBA is back this Tuesday! The dashboard and weekly predictions are now live*, once again. These will each be updated daily, with game winner predictions, hypothetical who-would-beat-whom daily matchup predictions, and more. For a discussion on how we make our predictions, see our first post on this topic. Note that our approach does not make use of any bookie predictions (unlike many other sites), and so provide an independent look on the game.

This season, we hope to crack 70% accuracy!

* Note that we have left up last season’s completed games results, for review purposes. Once every team has played one game, we’ll switch it over to the current season’s results.

Read More

A review of parameter regularization and Bayesian regression

Here, we review parameter regularization, which is a method for improving regression models through the penalization of non-zero parameter estimates. Why is this effective? Biasing parameters towards zero will (of course!) unfavorably bias a model, but it will also reduce its variance. At times the latter effect can win out, resulting in a net reduction in generalization error. We also review Bayesian regressions — in effect, these generalize the regularization approach, biasing model parameters to any specified prior estimates, not necessarily zero.

This is the second of a series of posts expounding on topics discussed in the text, “An Introduction to Statistical Learning”. Here, we cover material from its Chapters 2 and 6. See prior post here.

(more…)

Read More

Stochastic geometric series

Let $a_1, a_2, \ldots$ be an infinite set of non-negative samples taken from a distribution $P_0(a)$, and write
$$\tag{1} \label{problem}
S = 1 + a_1 + a_1 a_2 + a_1 a_2 a_3 + \ldots.
$$
Notice that if the $a_i$ were all the same, $S$ would be a regular geometric series, with value $S = \frac{1}{1-a}$. How will the introduction of $a_i$ randomness change this sum? Will $S$ necessarily converge? How is $S$ distributed? In this post, we discuss some simple techniques to answer these questions.

Note: This post covers work done in collaboration with my aged p, S. Landy.

(more…)

Read More

Leave-one-out cross-validation

This will be the first of a series of short posts relating to subject matter discussed in the text, “An Introduction to Statistical Learning”. This is an interesting read, but it often skips over statement proofs — that’s where this series of posts comes in! Here, I consider the content of Section 5.1.2: This gives a lightning-quick “short cut” method for evaluating a regression’s leave-one-out cross-validation error. The method is applicable to any least-squares linear fit.

(more…)

Read More

Pseudotsuga_seed_seedling

How not to sort by average rating, revisited

What is the best method for ranking items that have positive and negative reviews? Some sites, including reddit, have adopted an algorithm suggested by Evan Miller to generate their item rankings. However, this algorithm can sometimes be unfairly pessimistic about new, good items. This is especially true of items whose first few votes are negative — an issue that can be “gamed” by adversaries. In this post, we consider three alternative ranking methods that can enable high-quality items to more-easily bubble-up. The last is the simplest, but continues to give good results: One simply seeds each item’s vote count with a suitable fixed number of hidden “starter” votes.
(more…)

Read More

Multivariate Cramer-Rao inequality

The Cramer-Rao inequality addresses the question of how accurately one can estimate a set of parameters $\vec{\theta} = \{\theta_1, \theta_2, \ldots, \theta_m \}$ characterizing a probability distribution $P(x) \equiv P(x; \vec{\theta})$, given only some samples $\{x_1, \ldots, x_n\}$ taken from $P$. Specifically, the inequality provides a rigorous lower bound on the covariance matrix of any unbiased set of estimators to these $\{\theta_i\}$ values. In this post, we review the general, multivariate form of the inequality, including its significance and proof.
(more…)

Read More