2

I have two matrices/datasets whose columns hold either continuous or categorical values. One matrix is a perturbed version of the other. I'm looking for a distance measure, for comparison and reporting, that takes the data type into account, i.e., a difference in a categorical column should incur a higher penalty than a numerically equal difference in a continuous column.

I have two $n \times m$ matrices, $X$ and $Y$, where $Y$ is a perturbed version of $X$. Usually I can measure the distance between them with an entrywise measure, say

$$d(X,Y) = \sum_{i=1}^n\sum_{j=1}^m|x_{ij} - y_{ij}|$$

but $X$ and $Y$ have some columns with binary values and other columns with continuous values. Let's say my binary column indicates patient status (alive/dead). When I use the above formula, it gives equal weight to all columns, as if they were all on the same scale, i.e., for this measure, 33-32 is the same as 1-0 from the status column.

I want to use a distance measure that weighs the columns accordingly. Ideally, a status perturbed from 0 to 1 in the binary column should count as worse than, say, a weight perturbed from 70 to 71.

Apologies if it is straightforward.

  • 0
    I have no idea what you are asking. 2017-01-21
  • 0
    @copper.hat I want to compare two matrices using some kind of norm that puts more weight on differences in categorical data, i.e., one that treats continuous and categorical columns differently. 2017-01-21
  • 0
    In cluster analysis we calculate a distance which can be a function of mixed data types. Look at CrossValidated for information on mixed data clustering: http://stats.stackexchange.com/search?q=cluster+analysis+mixed+data+types 2017-01-21
  • 0
    Are all the categorical variables binary? 2017-01-21
  • 0
    @Rodrigo de Azevedo, for simplicity we can assume they are. 2017-01-22

1 Answer

1

To weight the columns of the $n \times m$ matrix $\mathrm X - \mathrm Y$, right-multiply it by a diagonal matrix $\mbox{diag} (\mathrm w)$, where $\mathrm w \geq 0_m$ is the weight vector, so that column $j$ is scaled by $w_j$. Then vectorize the matrix product and take the $1$-norm:

$$\| \mbox{vec} \left((\mathrm X - \mathrm Y) \, \mbox{diag} (\mathrm w)\right) \|_1 = \| (\mbox{diag} (\mathrm w) \otimes \mathrm I_n) \, \mbox{vec} (\mathrm X - \mathrm Y) \|_1 = \| \mbox{diag} (\mathrm w \otimes 1_n) \, \mbox{vec} (\mathrm X - \mathrm Y) \|_1$$
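As a rough numerical check of the identity above, here is a minimal NumPy sketch. The function names, the toy data (column 0 as patient weight, column 1 as binary status), and the weight of 10 on the status column are all illustrative assumptions, not something prescribed by the question.

```python
import numpy as np

def weighted_l1_distance(X, Y, w):
    """Weighted entrywise 1-norm: || vec((X - Y) diag(w)) ||_1."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    w = np.asarray(w, dtype=float)
    if X.shape != Y.shape or w.shape != (X.shape[1],):
        raise ValueError("X and Y must share a shape; w needs one weight per column")
    if np.any(w < 0):
        raise ValueError("weights must be nonnegative")
    # Broadcasting scales column j of (X - Y) by w[j], like right-multiplying by diag(w).
    return float(np.abs((X - Y) * w).sum())

def weighted_l1_distance_kron(X, Y, w):
    """Same quantity via the last form: || diag(w kron 1_n) vec(X - Y) ||_1."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    w = np.asarray(w, dtype=float)
    n = X.shape[0]
    d = (X - Y).flatten(order="F")      # vec() stacks columns, i.e. column-major order
    scale = np.kron(w, np.ones(n))      # each weight repeated n times, one block per column
    return float(np.abs(scale * d).sum())

# Hypothetical toy data: column 0 = weight (continuous), column 1 = status (binary).
X = np.array([[70.0, 0.0], [82.0, 1.0], [65.0, 1.0]])
Y = np.array([[71.0, 0.0], [82.0, 0.0], [65.0, 1.0]])
w = np.array([1.0, 10.0])               # assumed weights: a status flip costs 10x

print(weighted_l1_distance(X, Y, w))        # 1*|70-71| + 10*|1-0| = 11.0
print(weighted_l1_distance_kron(X, Y, w))   # same value
```

With these assumed weights, the single status flip contributes ten times as much as the one-unit change in weight, which is the behaviour the question asks for. How large the weight on a binary column should be is a modelling choice.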

To penalize the squares of the differences instead of their absolute values, use the squared Frobenius norm:

$$\| (\mathrm X - \mathrm Y) \, \mbox{diag} (\mathrm w) \|_{\text{F}}^2$$
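The squared-Frobenius variant can be sketched the same way. Again, the function name, the toy data, and the weights are illustrative assumptions.

```python
import numpy as np

def weighted_sq_frobenius(X, Y, w):
    """Squared Frobenius norm of (X - Y) diag(w)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    w = np.asarray(w, dtype=float)
    if X.shape != Y.shape or w.shape != (X.shape[1],):
        raise ValueError("X and Y must share a shape; w needs one weight per column")
    D = (X - Y) * w                     # column j scaled by w[j]
    return float(np.sum(D ** 2))

# Hypothetical toy data: column 0 = weight (continuous), column 1 = status (binary).
X = np.array([[70.0, 0.0], [82.0, 1.0], [65.0, 1.0]])
Y = np.array([[71.0, 0.0], [82.0, 0.0], [65.0, 1.0]])
w = np.array([1.0, 10.0])               # assumed weights

print(weighted_sq_frobenius(X, Y, w))   # (1*1)^2 + (10*1)^2 = 101.0

# Cross-check against the explicit matrix form using NumPy's Frobenius norm.
explicit = np.linalg.norm((X - Y) @ np.diag(w), "fro") ** 2
print(np.isclose(weighted_sq_frobenius(X, Y, w), explicit))
```

Note that in this version the weights enter squared: column $j$ contributes $w_j^2 \sum_i (x_{ij} - y_{ij})^2$, so a weight of 10 multiplies that column's penalty by 100 rather than by 10.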