I've analysed newspapers by computing, for each day and each section, the share of articles written in each language.
The results look like this:
             Day 1    Day 2    Day 3

Economy
language 1:  0,35     0,30     0,90
language 2:  0,11     0,10     0,00
language 3:  0,54     0,60     0,10

Sports
language 1:  0,40     0,30     1,00
language 2:  0,20     0,20     0,00
language 3:  0,40     0,50     0,00
So, for instance, on day 1, 35% of the Economy articles are written in language 1, 11% in language 2, and so on.
Now I want to remove the outliers (e.g. day 3) from my data. I was thinking about calculating the mean and standard deviation of each series and removing all values that lie more than two standard deviations from the mean.
Does that make sense? Is it a problem if my values are not normally distributed? Or is there another way to get rid of the outliers?
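To make the idea concrete, here is a rough sketch of the two-standard-deviation rule applied to the Economy / language 1 values from the table, next to a median-based variant (median absolute deviation, MAD) as one possible alternative. Python is just my choice for illustration, and the function names, the cut-off of 2 and the MAD threshold of 3.5 are assumptions, not something fixed by my data.

```python
import statistics


def two_sigma_outliers(values, k=2.0):
    """Indices of values more than k sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) > k * sd]


def mad_outliers(values, threshold=3.5):
    """Indices of values whose modified z-score (based on the median and the
    median absolute deviation) exceeds the threshold. The constant 0.6745 and
    the threshold 3.5 are conventional choices, assumed here."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


# Economy, language 1, days 1 to 3 (decimal commas written as points)
economy_lang1 = [0.35, 0.30, 0.90]

print(two_sigma_outliers(economy_lang1))
print(mad_outliers(economy_lang1))
```

With only the three days shown, the two-standard-deviation rule flags nothing, because the extreme value on day 3 also pulls up the mean and inflates the standard deviation. The median-based variant flags day 3 (index 2).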
In the end I want to average the language 1, language 2, etc. values of each category over time and see how the values change.
Is there a standard technique for doing that?
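As a sketch of what I mean by averaging over time, this is roughly the computation, using the values from the table. The nested-dictionary layout, the function names and the `skip_days` parameter are made up for illustration, and which days to skip would have to come from whatever outlier rule turns out to be appropriate.

```python
from statistics import mean

# Shares per category and language, one value per day (days 1 to 3)
shares = {
    "Economy": {
        "language 1": [0.35, 0.30, 0.90],
        "language 2": [0.11, 0.10, 0.00],
        "language 3": [0.54, 0.60, 0.10],
    },
    "Sports": {
        "language 1": [0.40, 0.30, 1.00],
        "language 2": [0.20, 0.20, 0.00],
        "language 3": [0.40, 0.50, 0.00],
    },
}


def average_shares(data, skip_days=()):
    """Mean share per category and language, ignoring the 0-based day
    indices listed in skip_days."""
    result = {}
    for category, languages in data.items():
        result[category] = {
            lang: mean([v for i, v in enumerate(values) if i not in skip_days])
            for lang, values in languages.items()
        }
    return result


def daily_changes(values):
    """Difference between each day's value and the previous day's value."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Averages with day 3 (index 2) left out, and the day-to-day changes
averages = average_shares(shares, skip_days={2})
print(averages["Economy"])
print(daily_changes(shares["Economy"]["language 1"]))
```

Leaving out a whole day, rather than single values, keeps the averaged shares of each category summing to 1, since every remaining day's shares sum to 1.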
Thanks in advance.