
Say I have 10,000 data points in 2-D and I want to fit a curve to them. There are many functional forms this curve could take: polynomial, B-spline, trigonometric, and so on. I've decided that I only want to use 4 parameters.

Is there a way to figure out which functional form is the most accurate? That is, considering all possible functions with 4 parameters, which one fits the best in, e.g., an $L_2$ sense?

Edit: maybe I should ask about the most accurate function with the same Vapnik–Chervonenkis dimension as a polynomial of degree $4$, rather than one with "4 parameters"?
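To make the question concrete, here is a minimal sketch of comparing two hypothetical 4-parameter families by their $L_2$ residuals; the data, the two candidate families, and all starting values are made up for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic 2-D data (hypothetical stand-in for the 10,000 points).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)

# Two candidate 4-parameter families (both arbitrary choices):
def poly3(x, a, b, c, d):
    # cubic polynomial: 4 coefficients
    return a + b * x + c * x**2 + d * x**3

def sine4(x, a, w, p, c):
    # amplitude, frequency, phase, offset: also 4 parameters
    return a * np.sin(w * x + p) + c

# Fit each family by (nonlinear) least squares and compare L2 residuals.
p_poly, _ = curve_fit(poly3, x, y)
p_sine, _ = curve_fit(sine4, x, y, p0=[1.0, 6.0, 0.0, 0.0])

err_poly = np.linalg.norm(y - poly3(x, *p_poly))
err_sine = np.linalg.norm(y - sine4(x, *p_sine))
print(err_poly, err_sine)
```

Of course this only compares families I chose in advance; the question is whether the search over *all* 4-parameter families can be made well-posed.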

  • Suppose the function that actually fits your data is $f$ and pick three other random functions $g, h, i$. Then clearly the $4$-parameter family of functions $af + bg + ch + di$ is accurate and you'll want to use $a = 1, b = 0, c = 0, d = 0$. There's no way to reasonably answer this question without explicit constraints on what kind of functions you're willing to use. – 2011-05-20
  • @Qiaochu Yuan You can't necessarily suppose that $\exists f$ that fits the data. For example, $f(3)$ cannot equal both 17 and 177, but $(3,17)$ and $(3,177)$ might both be in the data set. – 2011-05-20
  • @Lao Tzu: the point still stands. There are arbitrarily good answers if you allow yourself arbitrary classes of functions. – 2011-05-20
  • @Qiaochu Yuan In your example $f(x)$ can change direction 9999 times, which would take a polynomial with far more than 4 parameters. Equally, $f$ might expand and shrink its "period" many more times than a sum of 4 trig functions could. Is there a way to capture this seeming "tradeoff"? I tagged the question with Vapnik–Chervonenkis dimension because I thought maybe there's something deeper in that direction. – 2011-05-20
  • I recommend that you take a look at [Support Vector Machines](http://en.wikipedia.org/wiki/Support_vector_machine). They have the nice property that they are self-regularizing (i.e. the VC dimension adjusts to the data). – 2011-06-29
  • @Tpofofn It was actually in reading about SVMs that I heard about VC dimension. I'm wondering more generally about the "size" of the class of functions with VC dimension $d$. – 2011-07-01
  • @Lao Tzu: It depends on the nature of your data. The VC dimension is equivalent to the number of free independent parameters. Some classes of functions are better suited to particular types of data (in SVM this amounts to the choice of kernel). I recommend that you start with a Gaussian RBF or polynomial kernel and see which gives you better performance. Given that you have a reasonably sized data set with only two dimensions, you should be able to get good results with either of these two kernels using a cross-validated assessment of the error. – 2011-07-02

0 Answers