Estimating Distribution Mean and Quantiles

Introduction

In this note, we look into the problem of estimating statistics such as the mean and quantiles of an unknown distribution from a sample drawn from it.

More concretely, consider a univariate continuous random variable \(X\) following an unknown distribution with probability density function (pdf) \(p_X(x)\). Given a sample set \(\{x_i\}_{i=1}^n\) drawn i.i.d. from the distribution, we want to estimate the mean and quantiles of that distribution.

Disclaimer: I've tried to keep all the math below valid yet as simple as possible. Feel free to correct me if you find anything horribly wrong.

The mean

For an arbitrary \(m \in \mathbb{R}\), denote \(\mathcal{L}(m) = \int (x-m)^2~p(x)~dx\), where we write \(p(x)\) for \(p_X(x)\), and let \(\mu = \int x~p(x)~dx\) denote the mean.

Claim 1: \(m^{*} = \underset{m}{\textrm{argmin}}~\mathcal{L}(m) = \mu\).

Proof sketch:

$$ \begin{array}{rcl} \underset{m}{\textrm{argmin}}~\mathcal{L}(m) &=& \underset{m}{\textrm{argmin}} \left [\int m^2~p(x)~dx - \int 2mx~p(x)~dx \right ] \\ &=& \underset{m}{\textrm{argmin}} \left [m^2 - 2m\int x~p(x)~dx \right] \\ &=& \underset{m}{\textrm{argmin}}~\left (m - \int x~p(x)~dx\right )^2 \\ &=& \int x~p(x)~dx = \mu \end{array} $$
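As a quick numerical sanity check of Claim 1 (a sketch, not part of the derivation; the Exponential(1) density and the grid sizes are arbitrary choices of ours), we can discretize \(\mathcal{L}(m)\) for a distribution with a known mean and locate the minimizer on a grid:

```python
import numpy as np

# Discretize L(m) = ∫ (x - m)^2 p(x) dx for an Exponential(1) density,
# whose true mean is 1, and locate the minimizing m on a grid.
x = np.linspace(0.0, 20.0, 20_001)
dx = x[1] - x[0]
p = np.exp(-x)                                   # pdf of Exponential(1)
ms = np.linspace(0.0, 3.0, 301)                  # candidate values of m
losses = np.array([np.sum((x - m) ** 2 * p) * dx for m in ms])
m_star = ms[np.argmin(losses)]
print(m_star)                                    # close to the true mean, 1.0
```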

Sample estimate

Since \(p(x)\) is unknown, we instead minimize an approximate of the \(\mathcal{L}(m)\) using the given sample,

$$\hat{\mathcal{L}}(m) = \frac{1}{n} \sum_{i=1}^n (x_i - m)^2$$

and obtain an estimate of the mean:

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^n~x_i = \underset{m}{\textrm{argmin}}~\hat{\mathcal{L}}(m)$$
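As a small sketch (the normal sample and the brute-force grid search are purely illustrative), minimizing \(\hat{\mathcal{L}}(m)\) numerically recovers the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(loc=2.0, scale=1.5, size=1000)

# Empirical squared loss and a brute-force grid search for its minimizer.
def loss_hat(m):
    return np.mean((xs - m) ** 2)

ms = np.linspace(xs.min(), xs.max(), 10_001)
m_hat = ms[np.argmin([loss_hat(m) for m in ms])]
print(m_hat, xs.mean())  # the two agree up to the grid resolution
```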

The median

Let \(F(x)\) denote the cumulative distribution, i.e., \(F(x) = \textrm{Prob}(X \leq x)\).

A median, denoted \(q_{0.5}\), is a point that splits the probability mass of the distribution in half, i.e.,

$$F(q_{0.5}) = 0.5$$

For an arbitrary \(m \in \mathbb{R}\), denote

$$\mathcal{L}_{0.5} (m) = \int |x-m|~p(x)~dx$$

Claim 2: \(m^{*} = \underset{m}{\textrm{argmin}}~\mathcal{L}_{0.5}(m) = q_{0.5}\).

See the next section for a proof sketch of the general case of an arbitrary quantile \(\tau \in (0,1)\).

Sample estimate

Similar to the mean estimate, we can obtain an estimate of the median from the given sample,

$$\hat{q}_{0.5} = \underset{m}{\textrm{argmin}}~\frac{1}{n} \sum_{i=1}^n |x_i-m| = \textrm{median}(\{x_i\}_{i=1}^n)$$
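A small sketch of this (the exponential sample is arbitrary; for odd \(n\) with distinct values, the minimizer of the empirical absolute loss is one of the data points, so we only need to search over them):

```python
import numpy as np

rng = np.random.default_rng(1)
xs = rng.standard_exponential(1001)   # odd n: the minimizer is a data point

# Mean absolute loss evaluated at every data point as a candidate m.
losses = np.abs(xs[None, :] - xs[:, None]).mean(axis=1)
m_hat = xs[np.argmin(losses)]
print(m_hat == np.median(xs))         # True
```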

The \(\tau\)-quantile

For an arbitrary \(m \in \mathbb{R}\), denote

$$\mathcal{L}_{\tau}(m) = \int (x-m)(\tau - 1_{x < m})~p(x)~dx$$

where

$$ 1_{x < m} = \left\{\begin{array}{ll} 1 & \text{if}~x < m \\ 0 & \text{otherwise} \end{array}\right. $$

Note that \((x-m)(\tau - 1_{x < m})\) is nothing other than the quantile loss (also called the pinball loss) often used in quantile regression.
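In code, this loss for a single observation can be written as follows (a minimal sketch; the function name `quantile_loss` is ours):

```python
import numpy as np

def quantile_loss(x, m, tau):
    """Pinball/quantile loss (x - m)(tau - 1_{x < m}), vectorized over x."""
    d = np.asarray(x, dtype=float) - m
    return d * (tau - (d < 0))

# For tau = 0.5 this is half the absolute error:
print(quantile_loss([3.0, -1.0], 0.0, 0.5))   # [1.5 0.5]
```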

Claim 3: \(m^{*} = \underset{m}{\textrm{argmin}}~\mathcal{L}_\tau (m) = q_{\tau}\)

Derivation of the quantile loss

Curious readers may wonder how to come up with the above formula of \(\mathcal{L}_{\tau}(m)\). Let's reverse-engineer the quantile loss.

Given arbitrary \(c_1, c_2 > 0\), consider a generalized form of the median loss \(\mathcal{L}_{0.5}\),

$$\mathcal{L}_{c_1, c_2}(m) = c_1 \underset{x \geq m}{\int} (x-m)~p(x)~dx + c_2 \underset{x < m}{\int} (m-x)~p(x)~dx$$

Intuitively, we assign different costs to \(x\) depending on its value relative to \(m\). The problem is, for a given \(\tau \in (0,1)\), to find \(c_1, c_2\) such that

$$\underset{m}{\textrm{argmin}}~\mathcal{L}_{c_1, c_2}(m) = q_\tau$$

Apply the change of variables \(u = F(x)\) (assuming \(F\) is strictly increasing, so that \(F^{-1}\) exists); then \(du = dF(x) = p(x)~dx\) and \(x = F^{-1}(u)\), and write \(v = F(m)\) for some \(v \in [0,1]\). Note that we slightly abuse notation here, writing \(\mathcal{L}_{c_1, c_2}(v)\) for the loss after the change of variables to avoid introducing a new symbol.

$$ \begin{array}{lcl} \mathcal{L}_{c_1, c_2}(v) &=& c_1\int_v^1(F^{-1}(u)-F^{-1}(v))~du + c_2\int_0^v(F^{-1}(v)-F^{-1}(u))~du \\ &=& c_1\int_v^1 F^{-1}(u)~du - c_1 F^{-1}(v) (1-v) + c_2 F^{-1}(v) v - c_2\int_0^v F^{-1}(u)~du \end{array} $$

From the Leibniz integral rule and the product rule,

$$ \begin{array}{lcl} \frac{d}{dv}\mathcal{L}_{c_1, c_2}(v) & = & -c_1 F^{-1}(v) + c_1 F^{-1}(v) - c_1 (1-v)\frac{d}{dv}F^{-1}(v) \\ & & +~c_2 F^{-1}(v) + c_2 v\frac{d}{dv}F^{-1}(v) - c_2 F^{-1}(v) \\ & = & \frac{d}{dv}F^{-1}(v) (-c_1 + c_1 v + c_2 v)~~(*) \end{array} $$

For \(q_\tau\) to be the minimizer of \(\mathcal{L}_{c_1, c_2}(m)\), or equivalently, for \(\tau\) to be the minimizer of \(\mathcal{L}_{c_1, c_2}(v)\), we need

$$\frac{d}{dv}\mathcal{L}_{c_1, c_2}(v)_{\mid v=\tau} = 0,$$

and since \(\frac{d}{dv}F^{-1}(v) = \frac{1}{p(F^{-1}(v))} > 0\) wherever the density is positive, this implies

$$c_1 \tau + c_2 \tau - c_1 = 0$$

Note that the loss minimizer doesn't change if we scale \(c_1, c_2\) by the same factor, so without loss of generality we can set \(c_1 = 1\). Solving for \(c_2\) then gives \(c_2 = \frac{1-\tau}{\tau}\). Substituting \(c_1 = 1, c_2 = \frac{1-\tau}{\tau}\) into the original loss:

$$\mathcal{L}_{c_1, c_2}(m) = \underset{x \geq m}{\int} (x-m)~p(x)~dx + \frac{1-\tau}{\tau}\underset{x < m}{\int} (m-x)~p(x)~dx$$

Since \(\tau > 0\), minimizing \(\mathcal{L}_{c_1, c_2}(m)\) is equivalent to minimizing the following:

$$\tau\underset{x \geq m}{\int} (x-m)~p(x)~dx + (1-\tau)\underset{x < m}{\int} (m-x)~p(x)~dx$$

which is the same as

$$\mathcal{L}_{\tau}(m) = \int (x-m)(\tau - 1_{x < m})~p(x)~dx$$
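A quick numeric spot check (a sketch; the sample and the constants \(m, \tau\) are arbitrary) that the piecewise form and the compact form agree on a finite sample:

```python
import numpy as np

rng = np.random.default_rng(2)
xs = rng.normal(size=10_000)
m, tau = 0.3, 0.8

# Piecewise form: tau * Σ_{x >= m} (x - m) + (1 - tau) * Σ_{x < m} (m - x)
piecewise = (tau * np.sum(np.maximum(xs - m, 0.0))
             + (1 - tau) * np.sum(np.maximum(m - xs, 0.0)))
# Compact form: Σ (x - m)(tau - 1_{x < m})
compact = np.sum((xs - m) * (tau - (xs < m)))
print(np.isclose(piecewise, compact))   # True
```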

Why is \(q_\tau\) the minimizer of \(\mathcal{L}_{\tau}(m)\)?

As shown above, \(\tau\) is a critical point of \(\mathcal{L}_{\tau}(v)\),

$$\frac{d}{dv}\mathcal{L}_{\tau}(v)_{\mid v=\tau} = 0$$

From \((*)\) with \(c_1 = \tau, c_2 = 1 - \tau\), we have \(\frac{d}{dv}\mathcal{L}_{\tau}(v) = \frac{d}{dv}F^{-1}(v)\,(v - \tau)\). Since \(F\) is non-decreasing, so is \(F^{-1}\); in fact \(\frac{d}{dv}F^{-1}(v) = \frac{1}{p(F^{-1}(v))} > 0\) wherever the density is positive. Thus \(\frac{d}{dv}\mathcal{L}_{\tau}(v)\) has the same sign as \(v - \tau\): it is negative for \(v < \tau\) and positive for \(v > \tau\), so \(\mathcal{L}_{\tau}(v)\) is decreasing on \([0, \tau]\) and increasing on \([\tau, 1]\).

For any \(v \in [0,1]\), \(\mathcal{L}_{\tau}(v) \geq \mathcal{L}_{\tau}(\tau)\). Thus \(\tau\) is a minimizer of \(\mathcal{L}_{\tau}(v)\), or equivalently \(q_\tau\) is a minimizer of \(\mathcal{L}_{\tau}(m)\).

Sample estimate

Finally, we can obtain an estimate of the quantile as:

$$\hat{q}_{\tau} = \underset{m}{\textrm{argmin}}~\frac{1}{n} \sum_{i=1}^n (x_i-m)(\tau - 1_{x_i < m})$$
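As a final sketch (standard-normal sample and \(\tau = 0.9\), both arbitrary choices), the minimizer of the empirical quantile loss matches NumPy's quantile estimate; since the empirical loss is piecewise linear in \(m\) with breakpoints at the data points, it suffices to search over the data points themselves:

```python
import numpy as np

rng = np.random.default_rng(3)
xs = rng.standard_normal(2001)
tau = 0.9

# Empirical quantile loss evaluated at every data point as a candidate m.
d = xs[None, :] - xs[:, None]           # d[j, i] = x_i - m_j with m_j = x_j
losses = (d * (tau - (d < 0))).mean(axis=1)
q_hat = xs[np.argmin(losses)]
print(abs(q_hat - np.quantile(xs, tau)))   # essentially zero
```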