
First, my background is not math.

My objective is to find the value that occurs most frequently in a data sample OR the value that is most likely.

Let's say my sample is [1,5,6,6,7,10]. Finding the mode for this sample is simple (the mode is 6).

But if I change the sample to [1,5,6,7,10], I don't know how to find the mode. The result I want is 6, since 6 is the most probable value.

Problem is, I don't even know what to google (tried for hours), and even when I do find something that MAY be the answer (kernel density estimation, continuous probability distribution), I don't understand what the hell they're talking about.

The actual data set consists of hundreds of floating-point values saved in Excel. I would appreciate it if someone could demo this in Excel.

1 Answer


For the record, here are some general solution sketches that also work for high-dimensional distributions (probably too complex for the asker, though; some sort of kernel density estimation is much simpler and reasonably good):

  • Train an f-GAN with reverse KL divergence, without giving any random input to the generator (i.e. force it to be deterministic).

  • Train an f-GAN with reverse KL divergence, gradually shrink the generator's input distribution towards a Dirac delta function as training progresses, and add a gradient penalty to the generator loss function.

  • Train a (differentiable) generative model that can tractably evaluate an approximation of the pdf at any point (I believe that e.g. a VAE, a flow-based model, or an autoregressive model would do). Then use some type of optimization (some flavor of gradient ascent can be used if model inference is differentiable) to find a maximum of that approximation.
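To make the last idea concrete without training anything, here is a rough sketch that swaps the learned model for a plain Gaussian kernel density estimate and climbs it with mean-shift, which behaves like gradient ascent with an adaptive step size. The function name `mean_shift_mode`, the hand-picked `bandwidth`, and the stopping parameters are my own illustrative choices, not part of any library.

```python
import numpy as np

def mean_shift_mode(samples, bandwidth, n_iter=200, tol=1e-8):
    """Climb a Gaussian KDE from every data point and keep the highest peak.

    Hypothetical helper: the KDE stands in for the learned density model.
    """
    x = np.asarray(samples, dtype=float)

    def density(p):
        # Unnormalized KDE value at p (the constant factor does not move the argmax).
        return np.exp(-0.5 * ((p - x) / bandwidth) ** 2).sum()

    best_point, best_density = None, -np.inf
    for start in x:
        p = start
        for _ in range(n_iter):
            # Mean-shift update: kernel-weighted average of the data around p.
            w = np.exp(-0.5 * ((p - x) / bandwidth) ** 2)
            new_p = (w * x).sum() / w.sum()
            converged = abs(new_p - p) < tol
            p = new_p
            if converged:
                break
        d = density(p)
        if d > best_density:
            best_point, best_density = p, d
    return float(best_point)

# The asker's sample; bandwidth=1.0 is an assumption chosen by eye.
print(mean_shift_mode([1, 5, 6, 7, 10], bandwidth=1.0))
```

Starting from every data point matters because the ascent only finds a local maximum; the isolated values 1 and 10 each have their own small bump, and comparing densities at the end picks the peak near 6.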

  • I believe these solutions converge as the sample size and the "network approximation power" increase (assuming training works well). 2018-12-20
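For the simpler kernel density estimation route mentioned at the top of the answer, a minimal sketch could look like the following. It assumes a Gaussian kernel, a rule-of-thumb bandwidth (Silverman's), and a brute-force search over an evenly spaced grid; the name `kde_mode` and the `grid_size` default are illustrative, and a different bandwidth can move the answer.

```python
import numpy as np

def kde_mode(samples, bandwidth=None, grid_size=2001):
    """Estimate the mode of 1-D data as the argmax of a Gaussian KDE.

    Hypothetical helper, not a library function.
    """
    x = np.asarray(samples, dtype=float)
    n = x.size
    if bandwidth is None:
        # Silverman's rule of thumb (an assumption; other choices are valid).
        std = x.std(ddof=1)
        q75, q25 = np.percentile(x, [75, 25])
        iqr = q75 - q25
        spread = min(std, iqr / 1.34) if iqr > 0 else std
        bandwidth = 0.9 * spread * n ** (-0.2)
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive (are all samples identical?)")

    # Evaluate the density on a grid spanning the data and take the highest point.
    grid = np.linspace(x.min(), x.max(), grid_size)
    z = (grid[:, None] - x[None, :]) / bandwidth
    density = np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))
    return float(grid[np.argmax(density)])

# The asker's second sample, where no value repeats.
print(kde_mode([1, 5, 6, 7, 10]))
```

For the sample in the question this lands very close to 6, because 5, 6 and 7 sit close together while 1 and 10 are isolated. The same grid-and-sum calculation can be reproduced in a spreadsheet with one column of grid values and one column of summed kernel values.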