
I was reading an intro to stats textbook and I came across this:

'There are some standard transformations that are often applied when much of the data cluster near zero (relative to the larger values in the data set) and all observations are positive. A transformation is a rescaling of the data using a function. For instance, a plot of the natural logarithm...Transformed data are sometimes easier to work with when applying statistical models because the transformed data are much less skewed and outliers are usually less extreme. '

But doesn't applying a log function to every data point potentially change their relative distances from each other? For example, say I took the log base 10 of two numbers: 10 and 100. The respective logs would be 1 and 2, so the tenfold ratio between the original numbers becomes a difference of 1 on the log scale. Is this why taking the logs of a data set is acceptable to do? Because equal ratios in the original data become equal differences?

Does the skew change when you do this? Is that a problem?

Do the transformed data still have the same observations in the middle 50% (the interquartile range)?

  • The respective logs would be 1 and 2. Logs transform a ratio into a difference. (2017-02-03)

1 Answer


You are right that transformations have their difficulties as well as their advantages. Sometimes they make statistical analysis easier, but they can also make the results of that analysis more difficult to interpret.

Here are some situations in which log transformations have been used in practice.

Richter scale for earthquakes. The scale is logarithmic: magnitude is essentially the base-10 log of the measured shaking amplitude, so each whole-number step corresponds to a tenfold increase in amplitude (and roughly a 32-fold increase in energy released). If raw energy measurements were used, the numbers for the most destructive earthquakes would be unmanageably large compared with those for the common small quakes. However, even on the logged Richter scale, damaging quakes are still outliers. In earthquake country there are hundreds of tiny quakes a year noticed only by sensitive instruments or by people right at the epicenter (perhaps magnitude 2), and there are dozens of quakes that rattle the windows for miles around while causing little if any damage (perhaps magnitude 3 or 4). Seismic events below magnitude 1 usually aren't recorded. Fortunately, destructive and catastrophic quakes (say, above magnitude 5) occur years apart.
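The ratio-to-difference arithmetic can be sketched in a few lines. (The `10**1.5` energy factor here is the conventional Gutenberg-Richter relation, an assumption I'm adding, not something stated above.)

```python
# On a base-10 log scale, a *difference* in magnitude corresponds
# to a *ratio* of the raw measurements.

def amplitude_ratio(m_big: float, m_small: float) -> float:
    """Tenfold amplitude ratio per whole magnitude step."""
    return 10 ** (m_big - m_small)

def energy_ratio(m_big: float, m_small: float) -> float:
    """Conventional Gutenberg-Richter factor: ~31.6x energy per step."""
    return 10 ** (1.5 * (m_big - m_small))

print(amplitude_ratio(5, 3))      # 100.0 -- two steps, 10*10
print(round(energy_ratio(5, 3)))  # 1000
```

A magnitude-5 quake is thus a hundredfold larger in amplitude than a magnitude-3 quake, even though the two numbers sit close together on the logged scale.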

Lumber from trees. One way to estimate the lumber a fir tree will produce is to measure its circumference $X_1$ at 5 feet off the ground (with a tape measure) and to measure its height $X_2$ (by sighting its top through a special instrument and using some trigonometry). Knowing lumber yields $Y$ from dozens of trees already cut, one can try to estimate by regression $\beta_0, \beta_1,$ and $\beta_2$ in the equation $Y = \beta_0 + \beta_1X_1 + \beta_2X_2.$ But this doesn't work well because the relationships aren't linear. A much more successful approach is to take logs of $Y, X_1,$ and $X_2,$ which I'll designate by *'s, and then to estimate $\beta$'s in the equation $Y^* = \beta_0 + \beta_1X_1^* + \beta_2X_2^*.$ (Because the lumber-producing part of a tree is sort of a mildly 'bulgy' cone, the $\beta_1$ in the second equation turns out to be around $2.$)
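A minimal sketch of that log-log fit: the tree measurements below are fabricated (generated noiselessly from $Y = 0.02\,X_1^2 X_2$, an assumed power law, not real forestry data) just to show that regressing the logged variables recovers the exponents.

```python
# Log-log regression sketch: fit Y* = b0 + b1*X1* + b2*X2*
# on fake, noise-free tree data so the recovered exponents are exact.
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.uniform(3.0, 8.0, size=50)    # circumference, feet (fake)
X2 = rng.uniform(40.0, 90.0, size=50)  # height, feet (fake)
Y = 0.02 * X1**2 * X2                  # yield, built from a power law

# Least-squares fit on the logged variables.
design = np.column_stack([np.ones_like(X1), np.log(X1), np.log(X2)])
coef, *_ = np.linalg.lstsq(design, np.log(Y), rcond=None)
b0, b1, b2 = coef
print(f"b1 = {b1:.3f}, b2 = {b2:.3f}")  # recovers the exponents 2 and 1
```

With real (noisy) data the fitted $\beta_1$ would only be *near* 2, as the answer describes, but the mechanics of the fit are the same.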

'Stabilizing' variances. Suppose we have exponential life times for three kinds of devices A, B, and C, and want to compare them (for example, using an analysis of variance). Here are fake data to represent typical results.

A: 0.3 0.9 0.1 0.3 0.7 0.9 0.1 1.5 0.2 0.2;    mean .52    SD .464
B: 0.4 0.4 0.1 0.2 0.1 0.6 0.3 0.3 0.2 0.6;    mean .32    SD .181
C: 0.1 0.1 0.3 0.1 0.7 0.4 0.4 0.1 0.1 0.4;    mean .27    SD .206

Notice that the standard deviations differ by quite a lot, which means that the variances do also; this violates the equal-variance assumption behind a standard analysis of variance and makes the analysis unreliable. Let's look at the log-transformed data:

lnA:   mean -1.04    SD .954
lnB:   mean -1.31    SD .646
lnC:   mean -1.58    SD .787

Now the means are still (significantly) different. But the SDs are more nearly the same, so we would get more reliable results from a traditional analysis of variance.
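The summary statistics above can be reproduced directly from the fake lifetime data, using only the standard library (sample standard deviations, as `statistics.stdev` computes):

```python
# Reproduce the means and SDs of the fake lifetime data, before and
# after the log transformation, to see the SDs pulled closer together.
from math import log
from statistics import mean, stdev

A = [0.3, 0.9, 0.1, 0.3, 0.7, 0.9, 0.1, 1.5, 0.2, 0.2]
B = [0.4, 0.4, 0.1, 0.2, 0.1, 0.6, 0.3, 0.3, 0.2, 0.6]
C = [0.1, 0.1, 0.3, 0.1, 0.7, 0.4, 0.4, 0.1, 0.1, 0.4]

for name, x in [("A", A), ("B", B), ("C", C)]:
    lx = [log(v) for v in x]
    print(f"{name}:  mean {mean(x):.2f}  SD {stdev(x):.3f}    "
          f"ln{name}:  mean {mean(lx):.2f}  SD {stdev(lx):.3f}")
```

The largest-to-smallest SD ratio shrinks from about 2.6 on the raw scale to about 1.5 on the log scale, which is the "stabilizing" effect being described.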

'Potential difficulties understanding transformed data. The difficulty in the lifetime data is that differences among the logged data correspond to ratios among the original data, so comparisons among groups on the log scale have to be interpreted multiplicatively, which can make various kinds of comparisons harder to explain (see the Comment by @QthePlatypus).

Taking more sophisticated approaches for the tree and lifetime data, one could do useful and reliable statistical analyses without needing to take logs. I prefer not to do log transformations (or other kinds of transformations) of data, except as a last resort. There are situations in which taking logs appears to make an analysis easier, but that is because the transformation hides a difficulty that ought to be addressed before analysis.