Introduction
Abstract
We're interested in the behavior/performance of estimators/algorithms.
In applied microeconometrics, where we are generally interested in the causal effect of some policy/intervention, we use estimators with the following signature.
\[\begin{align*}
\mathcal{A} :: \{ \mathcal{X} \times \mathcal{Y} \}^n \to \mathbb{R}^p
\end{align*}\]
Occasionally, these estimators will have an analytical form. For example, if we are interested in the average outcome we may use the following estimator.
\[\begin{align*}
&\mathcal{A} :: \{ \mathcal{X} \times \mathcal{Y} \}^n \to \mathbb{R}\\
&\mathcal{A} \big(\{x_i, y_i \}_{i=1}^n\big) = \frac{1}{n} \sum _i y_i
\end{align*}\]
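As a minimal sketch (in Python, with illustrative names), the sample-mean estimator consumes \(n\) pairs and returns a single real number:

```python
import numpy as np

def mean_estimator(sample: list[tuple[np.ndarray, float]]) -> float:
    """A :: {X x Y}^n -> R: the average outcome in the sample."""
    return float(np.mean([y for _, y in sample]))
```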
Or if we are interested in the linear approximation to the CEF (conditional expectation function), we may use the following estimator.
\[\begin{align*}
&\mathcal{A} :: \{ \mathcal{X} \times \mathcal{Y} \}^n \to \mathbb{R}^p\\
&\mathcal{A} \big(\{x_i, y_i \}_{i=1}^n\big) = (X^TX)^{-1}X^TY
\end{align*}\]
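A corresponding sketch for the least-squares estimator, using numpy's solver rather than forming \((X^TX)^{-1}\) explicitly:

```python
import numpy as np

def ols_estimator(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """A :: {X x Y}^n -> R^p: the least-squares coefficients (X^T X)^{-1} X^T Y.

    lstsq solves the least-squares problem without forming the inverse.
    """
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta
```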
We may, though, be interested in more "complex" estimators that involve neural networks.
\[\begin{align*}
&\mathcal{A} :: \{ \mathcal{X} \times \mathcal{Y} \}^n \to \mathbb{R}\\
&\mathcal{A} \big(\{x_i, y_i \}_{i=1}^n\big) = \sum _i \big( f(\theta_1, x_i) - f(\theta_2, x_i) \big) \\
& \quad \textrm{where} \ \theta_1, \theta_2 = m^*\big(\{x_i, y_i \}_{i=1}^n\big)
\end{align*}\]
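The text leaves \(f\) and \(m^*\) abstract. The sketch below makes the purely illustrative assumptions that \(f(\theta, x)\) is a linear predictor and that \(m^*\) fits \(\theta_1, \theta_2\) by least squares on two subsamples indexed by a hypothetical treatment indicator `d`; it also averages rather than sums, so the output does not scale with \(n\):

```python
import numpy as np

def m_star(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Stand-in for m*: least squares here; a neural-network trainer in general."""
    theta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return theta

def f(theta: np.ndarray, x: np.ndarray) -> float:
    """Stand-in for the fitted model f(theta, x): a linear predictor here."""
    return float(x @ theta)

def difference_estimator(X: np.ndarray, Y: np.ndarray, d: np.ndarray) -> float:
    """A :: {X x Y}^n -> R. theta_1 and theta_2 are fit on the d == 1 and
    d == 0 subsamples -- an illustrative choice, since the text leaves m*
    abstract."""
    theta_1 = m_star(X[d == 1], Y[d == 1])
    theta_2 = m_star(X[d == 0], Y[d == 0])
    return float(np.mean([f(theta_1, x) - f(theta_2, x) for x in X]))
```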
To Do
Make the connection to kernel methods
Probability Space
Warning
Is the Borel Sigma Algebra well defined here? Should we add restrictions on \(\mathcal{X}, \mathcal{Y}\)?
\[\Big( \mathcal{X} \times \mathcal{Y}, \mathcal{B}(\mathcal{X} \times \mathcal{Y}), \mathbb{P}\Big)\]
Function Space / Hypothesis Class / Model Class
Warning
What additional restrictions should we place on \(\mathcal{H}\)?
\[ \mathcal{H} := \{h \mid h : \mathcal{X} \to \mathcal{Y} \}\]
Loss Function
Warning
Make note on parameterization
\[\begin{align*}
&l :: \mathcal{H} \to \mathcal{X} \to \mathcal{Y} \to \mathbb{R}_+ \\
&l(h, x, y) = (y - h(x))^2
\end{align*}\]
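A direct transcription of the squared loss, with a hypothesis represented as a plain callable:

```python
from typing import Callable
import numpy as np

Hypothesis = Callable[[np.ndarray], float]

def squared_loss(h: Hypothesis, x: np.ndarray, y: float) -> float:
    """l(h, x, y) = (y - h(x))^2"""
    return (y - h(x)) ** 2
```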
Population Risk
\[\begin{align*}
&L :: \mathcal{H} \to \mathbb{R}_+ \\
&L(h) := \underset{(x,y)\sim \mathbb{P}}{\mathbb{E}} \big[l(h, x, y)\big] \\
&L(h) = \int _{\mathcal{X} \times \mathcal{Y}} l(h, x, y) \, d\mathbb{P}(x, y) \end{align*}\]
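Since \(\mathbb{P}\) is unknown in practice, \(L(h)\) can only be approximated. A sketch that does so by Monte Carlo, under an assumed synthetic data-generating process:

```python
import numpy as np

rng = np.random.default_rng(0)

def population_risk(h, n_mc: int = 1_000_000) -> float:
    """Monte Carlo approximation of L(h) = E[(y - h(x))^2] under an
    assumed DGP, used purely for illustration."""
    x = rng.normal(size=n_mc)
    y = 2.0 * x + rng.normal(size=n_mc)  # assumed DGP: y = 2x + noise
    return float(np.mean((y - h(x)) ** 2))

# For h(x) = 2x the risk is just the noise variance, 1.0
print(population_risk(lambda x: 2.0 * x))  # ~1.0
```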
Empirical Risk
\[
\begin{align*}
&\hat{L} :: \{\mathcal{X} \times \mathcal{Y}\}^n \to \Theta \to \mathbb{R}_+ \\
&\hat{L}(\{x_i, y_i\}_{i=1}^n, \theta) = \frac{1}{n} \sum _i l(\theta, x_i, y_i)
\end{align*}\]
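A sketch of \(\hat{L}\) under the assumed linear parameterization \(h_\theta(x) = \theta x\) (the parameterization itself is an illustrative choice; see the warning above):

```python
import numpy as np

def empirical_risk(x: np.ndarray, y: np.ndarray, theta: float) -> float:
    """L_hat({x_i, y_i}, theta) = (1/n) sum_i (y_i - theta * x_i)^2,
    for the assumed parameterization h_theta(x) = theta * x."""
    return float(np.mean((y - theta * x) ** 2))
```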
Partial Evaluation: Expectation
Partially evaluated at \(\theta\), \(\hat{L}\) is a random variable. Assuming the \((x_i, y_i)\) are i.i.d. draws from \(\mathbb{P}\), taking its expectation:
\[
\begin{align*}
\mathbb{E}\big[ \hat{L} _{\theta}\big] &= \mathbb{E}\Big[ \frac{1}{n} \sum _i l(\theta, x_i, y_i)\Big] \\
&= \frac{1}{n} \sum _i \mathbb{E}\big[ l(\theta, x_i, y_i)\big] \\
&= L(\theta) \end{align*}\]
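A quick simulation check of \(\mathbb{E}[\hat{L}_\theta] = L(\theta)\), under an assumed DGP for which \(L(\theta)\) is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 1.5, 50, 20_000

risks = []
for _ in range(reps):
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(size=n)  # assumed DGP: y = 2x + noise
    risks.append(np.mean((y - theta * x) ** 2))

# E[L_hat_theta] should match L(theta) = (2 - theta)^2 * E[x^2] + 1 = 1.25
print(np.mean(risks))  # ~1.25
```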
Partial Evaluation: Variance
Taking its variance, where the second step uses independence across the i.i.d. observations:
\[
\begin{align*}
\textrm{Var}\big(\hat{L} _{\theta}\big) &= \textrm{Var}\Big[ \frac{1}{n} \sum _i l(\theta, x_i, y_i)\Big] \\
&= \frac{1}{n^2} \sum _i \textrm{Var}\big[ l(\theta, x_i, y_i)\big] \\
&= \frac{\textrm{Var}\big[ l(\theta, x, y)\big]}{n}
\end{align*}\]
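A companion simulation for the variance, under the same assumed DGP: quadrupling \(n\) should roughly quarter \(\textrm{Var}(\hat{L}_\theta)\).

```python
import numpy as np

rng = np.random.default_rng(0)
theta, reps = 1.5, 20_000

def empirical_risks(n: int) -> np.ndarray:
    x = rng.normal(size=(reps, n))
    y = 2.0 * x + rng.normal(size=(reps, n))    # assumed DGP
    return ((y - theta * x) ** 2).mean(axis=1)  # one L_hat_theta per dataset

for n in (25, 100, 400):
    print(n, empirical_risks(n).var())  # shrinks like 1/n
```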
Hmm
But we are not really interested in this relationship, because \(\hat{L}\) is not evaluated at a fixed \(\theta\) in practice: we use our training data to determine \(\theta\). That is, we have an algorithm \(\mathcal{A}\).
Algorithm
\[
\begin{align*}
\mathcal{A} :: \{\mathcal{X} \times \mathcal{Y}\}^n \to \Theta
\end{align*}
\]
- Empirical Risk Minimization (implemented in the sketch after this list):
\[
\begin{align*}
&\textrm{ERM} :: \{\mathcal{X} \times \mathcal{Y}\}^n \to \Theta \\
&\textrm{ERM}(\{x_i, y_i\}_{i=1}^n) = \underset{\theta \in \Theta}{\textrm{argmin}} \ \hat{L}(\{x_i, y_i\}_{i=1}^n, \theta)
\end{align*}\]
- Consistency: writing \(\mathcal{A}_n\) for the output of \(\mathcal{A}\) on an \(n\)-sample, for every \(\varepsilon > 0\),
\[\mathbb{P} \big( \| \mathcal{A}_n - \theta _0 \| > \varepsilon \big) \to 0 \quad \textrm{as } n \to \infty \]
- The composition \(L \circ \mathcal{A}\), the population risk of the learned parameter, is itself a random variable. We can evaluate an algorithm by its expectation (estimated by simulation in the sketch after this list):
\[
\begin{align*}
\mathbb{E} \big[ L \circ \mathcal{A} \big]
\end{align*}
\]
- Excess Risk:
\[
\begin{align*}
&E :: \mathcal{H} \to \mathbb{R}_+ \\
&E(h) = L(h) - \underset{g \in \mathcal{H}}{\inf} L(g)
\end{align*}\]