
I've analysed newspapers by counting the language distributions of the articles.

The results look like this:

Day 1               Day 2              Day 3

Economy             Economy            Economy
language 1: 0,35    language 1: 0,30   language 1: 0,90
language 2: 0,11    language 2: 0,10   language 2: 0,00
language 3: 0,54    language 3: 0,60   language 3: 0,10

Sports              Sports             Sports
language 1: 0,40    language 1: 0,30   language 1: 1,00
language 2: 0,20    language 2: 0,20   language 2: 0,00
language 3: 0,40    language 3: 0,50   language 3: 0,00

So, for instance, on day 1, 35 % of the Economy articles are written in language 1, 11 % in language 2, and so on.

Now I want to remove the outliers (e.g. day 3) from my data. I was thinking about taking two standard deviations around the mean and removing all values that fall outside that range.

Does that make sense? Is it a problem if my values are not normally distributed? Or is there another way to get rid of the outliers?

In the end I want to calculate the average of the language 1, language 2, etc. values for each category over time and see how the values change.

Is there a technique for doing that?

Thanks in advance.

1 Answer

As I said in response to your other question, Dixon's ratio test is a good way to detect a large outlier. What you are suggesting is called a 2-sigma rule. It has two problems: (1) if the data are normally distributed, approximately 5% of the population lies more than two standard deviations from the mean, so you would throw out those legitimate values in addition to the outliers that presumably came from a different distribution, and (2) you don't know the variance, so you have to estimate it from all the data, but outliers inflate the variance estimate and thus help to hide themselves. Dixon's test doesn't have these problems. Again, I caution that identifying outliers does not necessarily tell you that you should remove them.
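To make the masking effect concrete, here is a minimal Python sketch that applies both ideas to the three "Economy, language 1" values from the question (0,35, 0,30 and 0,90, written with decimal points). The function names `two_sigma_flags` and `dixon_q` are my own, and `dixon_q` assumes the simplest form of Dixon's ratio (the gap between the suspect value and its nearest neighbour, divided by the range). It only computes the ratio; you would still have to compare it with a tabulated critical value for your sample size and confidence level, which is not reproduced here.

```python
import statistics


def two_sigma_flags(values):
    """Flag values lying more than two sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # estimated from ALL values, outliers included
    return [abs(v - mean) > 2 * sd for v in values]


def dixon_q(values):
    """Return (suspect, Q) for the most extreme value, with Q = gap / range."""
    if len(values) < 3:
        raise ValueError("need at least three values")
    ordered = sorted(values)
    spread = ordered[-1] - ordered[0]
    if spread == 0:
        raise ValueError("all values are equal")
    gap_low = ordered[1] - ordered[0]     # gap below the smallest value's neighbour
    gap_high = ordered[-1] - ordered[-2]  # gap above the largest value's neighbour
    if gap_high >= gap_low:
        return ordered[-1], gap_high / spread
    return ordered[0], gap_low / spread


# Economy, language 1, days 1 to 3 (values taken from the question)
economy_lang1 = [0.35, 0.30, 0.90]

print(two_sigma_flags(economy_lang1))  # [False, False, False]
suspect, q = dixon_q(economy_lang1)
print(suspect, round(q, 3))            # 0.9 0.917
```

The 2-sigma rule flags nothing here, because the day 3 value itself inflates the standard deviation, while the ratio singles out 0.9 as the suspect. With only three days the evidence is weak either way, so treat this as an illustration of the mechanics and not as a verdict on day 3.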