
I have a bunch of real-world data sets, and from manually plotting some of the data I've discovered that some data sets look pretty much logarithmic and some look linear or exponential (and some look like a mess :).

I've been reading up on curve fitting / data fitting on Wikipedia, and if I understand it correctly (which I seriously doubt) I can calculate a curve of best fit using least squares, but I first have to decide whether the curve should fit a logarithmic, linear, or exponential (etc.) pattern.

What I would really like to do is to pass a data set into a function (I'm a programmer with poor math skills) and have that return something like "this data set looks more linear than logarithmic" or "this looks exponential".

My question is: is that even possible without a human looking at a graph and recognizing the pattern?

My guess is: yes. But before I invest a ton of time in figuring out how to program this, I just want to make sure I'm not barking up the wrong tree and confirm this with you guys if possible.

Sorry if this is a dumb question. Just to be clear, I'm not looking for a how-to answer; a simple yes or no will do. However, if you have suggestions on how to tackle the problem, that would be awesome of course.

  • 1
    @Mitch Probably mistyped "shape the data..." 2018-11-13

2 Answers


You may not realize it, but this is a statistics question, of a kind statisticians have been studying forever.

So the simple answer to your question 'Can you do it?' is yes.

But of course there's more nuance to that.

Normally a statistician will say: 'Pick your model first, one of those three (linear, exponential or logarithmic), and then I can tell you what the "best" line is for that model.' That is, plain old linear regression will give you the best-fit line for a linear model, and for the other two you can transform the data first (take the log of the $y$ values for an exponential model, or the log of the $x$ values for a logarithmic one) and -then- do linear regression. Part of the output of linear regression is a value that says how good the match is (the correlation coefficient).
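
To make the transformation step concrete, here is one way to write out the three models so that each becomes a straight line in suitably transformed variables. The coefficient names $a$ and $b$ are placeholders I am introducing for illustration; they do not come from the question. The exponential case assumes $a > 0$ (so $y > 0$) and the logarithmic case assumes $x > 0$, since otherwise the logs are undefined.

$$
\begin{aligned}
\text{linear:} &\quad y = a + b\,x && \text{regress } y \text{ on } x \\
\text{exponential:} &\quad y = a\,e^{b x} \;\Longrightarrow\; \ln y = \ln a + b\,x && \text{regress } \ln y \text{ on } x \\
\text{logarithmic:} &\quad y = a + b\,\ln x && \text{regress } y \text{ on } \ln x
\end{aligned}
$$

In all three rows the right-hand column is an ordinary linear regression, just applied to different columns of data.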

But your question is more along the lines of which 'model' is the best. You might think that you would just compare the three correlation coefficients and pick the best one. I would think that too, except I am not a statistician and something tells me a statistician would have a fit over something so simple (probably also over my suggestion of using the correlation coefficient). So, for a real answer, I think a statistician would be able to answer much better (hmm...isn't there a statistics.stackexchange?). But for the moment, this is a good first approximation to an answer.
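
For what it's worth, here is a minimal Python sketch of that "compare the three correlation coefficients" idea. It is a first approximation only, with the same caveats as above, not a statistically rigorous model-selection method. The function names `pearson_r` and `guess_shape` are made up for this example, and the code assumes every $x$ and $y$ value is strictly positive (so the logs exist) and that neither column is constant.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Assumption: neither sequence is constant, so sxx * syy > 0.
    return sxy / math.sqrt(sxx * syy)


def guess_shape(xs, ys):
    """Return (best_name, scores), where scores maps model name -> |r|.

    Assumption: all xs and ys are > 0, so math.log is defined.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    scores = {
        "linear": abs(pearson_r(xs, ys)),          # y vs x
        "exponential": abs(pearson_r(xs, log_y)),  # ln y vs x
        "logarithmic": abs(pearson_r(log_x, ys)),  # y vs ln x
    }
    best = max(scores, key=scores.get)
    return best, scores


# Synthetic, noise-free exponential data, purely for illustration.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
best, scores = guess_shape(xs, ys)
print(best)
print(scores)
```

On clean synthetic data like this the largest score picks out the generating model. On noisy real-world data the three scores can be close to each other, which is exactly where a statistician's advice becomes necessary.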


Edit: In the meantime, I asked directly at stats.stackexchange (as suggested by Rahul). They confirmed my suspicion: the simple answer is yes, but it's not so simple.

  • 0
    @Rahul: Thanks, I took you up on your suggestion. See my link in my edited answer. 2011-04-08