
We are given a dataset $X=(x_1,\ldots,x_n)$ of words (duplicates are possible). We assume each $x_i$ was drawn from a probability distribution in which word $x_i$ occurs with probability $p_i$ (we do not know these values; we have to estimate them from the dataset).

Define the metric $d_k=\sum_{j=1}^k p_{i_j}$, the sum of the $k$ largest probabilities $p_{i_j}$.

The question is: how can I determine whether the dataset $X$ is large enough to allow a good estimation of $d_k$?

I can estimate the probabilities from the dataset $X$ as $\overline{p_{i_j}} = \frac{f_{i_j}}{|X|}$, where $f_{i_j}$ is the frequency of $x_{i_j}$ in $X$. This gives an estimate of $d_k$: sum the $k$ largest values of $\overline{p_{i_j}}$.
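As a minimal sketch of this plug-in estimator (the function name and list-of-words input are my own choices, not from the question):

```python
from collections import Counter

def estimate_dk(words, k):
    """Plug-in estimate of d_k: sum of the k largest empirical
    word probabilities f_i / |X| computed from the sample."""
    n = len(words)
    counts = Counter(words)  # frequency f_i of each distinct word
    probs = sorted((c / n for c in counts.values()), reverse=True)
    return sum(probs[:k])
```

For example, on the sample `["a", "a", "b", "c"]` the empirical probabilities are $0.5, 0.25, 0.25$, so the estimate of $d_1$ is $0.5$ and of $d_2$ is $0.75$.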

We know that $MSE = \text{bias}^2 + \text{variance}$, so I could decide whether my estimate of $d_k$ is good by computing its MSE and comparing it with a threshold: if the MSE is below the threshold, $X$ is large enough; otherwise it is not. Is this a good approach?
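The MSE involves the unknown true $p_i$, so it cannot be computed directly; one hedged way to approximate its variance component from the data alone is a bootstrap over $X$ (this is my suggestion for illustration, not something from the question, and it does not capture the bias term):

```python
import random
from collections import Counter

def estimate_dk(words, k):
    """Plug-in estimate of d_k from a sample of words."""
    n = len(words)
    probs = sorted((c / n for c in Counter(words).values()), reverse=True)
    return sum(probs[:k])

def bootstrap_dk(words, k, n_boot=1000, seed=0):
    """Resample X with replacement n_boot times and return the mean
    and sample variance of the resulting d_k estimates. A small
    variance is (weak) evidence that X is large enough."""
    rng = random.Random(seed)
    n = len(words)
    ests = [estimate_dk([rng.choice(words) for _ in range(n)], k)
            for _ in range(n_boot)]
    mean = sum(ests) / n_boot
    var = sum((e - mean) ** 2 for e in ests) / (n_boot - 1)
    return mean, var
```

One caveat on the design: the plug-in estimator of a sum of the $k$ largest probabilities is biased upward (the largest empirical frequencies overshoot their true probabilities), so a small bootstrap variance alone does not guarantee a small MSE.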
