Let's assume that I own a modeling agency (a far-fetched example), and I would like to create a tool for my HR department that gives an estimated salary for any given profile.
I have a huge database of profiles, where each name has a salary, sex, age, height, eye color, country of origin, weight, experience...
For example: (Mister $1234$, Male, $\$200$k/year, $6'3''$ or $190$ cm, $200$ pounds or $90$ kg, Blue eyes, Brown hair, $22$ years old, ...)
The idea is that every time there is a new applicant, I input his/her characteristics into the tool, and it returns a salary.
The first idea that came to mind is to assume that the salary $S_{\delta}$ can be written as $$S_{\delta}=S(\delta_1,\dots,\delta_n)=\prod_{k=1}^n S_k^{\delta_k}$$
where $\delta=(\delta_1,\dots,\delta_n)$, $S_k$ is a coefficient related to the $k^{th}$ criterion, and $\delta_k$ equals one (resp. zero) when the $k^{th}$ criterion is matched (resp. not matched). $\delta$ entirely defines a model's profile.
I can rewrite $S_{\delta}$ as $$\log(S_{\delta})=\sum_{k=1}^{n}\delta_k\log(S_k)$$
or
$$Y_{\delta}=\sum_{k=1}^{n}{\delta_kY_k}$$
with $Y_k=\log(S_k)$ and $Y_{\delta}=\log(S_{\delta})$.
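To make the equivalence between the product form and the log-linear form concrete, here is a minimal sketch. The coefficient values and the profile vector are hypothetical and chosen only for illustration.

```python
import math

def salary_product(coeffs, delta):
    """Multiplicative form: S_delta = prod_k S_k ** delta_k."""
    result = 1.0
    for s_k, d_k in zip(coeffs, delta):
        result *= s_k ** d_k
    return result

def log_salary_sum(coeffs, delta):
    """Log-linear form: Y_delta = sum_k delta_k * log(S_k)."""
    return sum(d_k * math.log(s_k) for s_k, d_k in zip(coeffs, delta))

# Hypothetical coefficients S_k (not taken from any real data)
S = [1.2, 0.9, 1.5, 100.0]
# Hypothetical profile: criteria 1, 3 and 4 are matched, criterion 2 is not
delta = [1, 0, 1, 1]

s_prod = salary_product(S, delta)   # 1.2 * 1.5 * 100
y_sum = log_salary_sum(S, delta)    # log(1.2) + log(1.5) + log(100)

# Both forms describe the same salary, up to floating-point error
print(s_prod, math.exp(y_sum))
```

Unmatched criteria contribute a factor of $1$ in the product and a term of $0$ in the sum, which is why the two forms agree.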
I guess you know what I want to do next: I look at my database and create a binary table for all profiles.
For example,
$$
\begin{array}{c|ccccccc}
\text{Names} & \text{Salaries} & \text{Male} & \text{Female} & \text{under 185 cm} & \text{above 185 cm} & \text{17-20yo} & \text{21-24yo} \\ \hline
\text{C.C} & 200 & 1 & 0 & 0 & 1 & 0 & 1 \\
\text{A.A} & 300 & 0 & 1 & 1 & 0 & 1 & 0 \\
\text{B.B} & 400 & 0 & 1 & 1 & 0 & 0 & 1
\end{array}
$$
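As a rough sketch of how such a binary table could be built from raw profiles: the heights and ages below are hypothetical, invented only so that the encoding reproduces the three rows above, and assigning exactly 185 cm to the "above" bucket is also an assumption.

```python
# Hypothetical raw profiles; heights and ages are invented for illustration
profiles = [
    {"name": "C.C", "salary": 200, "sex": "male", "height_cm": 190, "age": 22},
    {"name": "A.A", "salary": 300, "sex": "female", "height_cm": 175, "age": 19},
    {"name": "B.B", "salary": 400, "sex": "female", "height_cm": 180, "age": 23},
]

# Each criterion is a (label, predicate) pair; delta_k = 1 when the predicate holds
criteria = [
    ("Male", lambda p: p["sex"] == "male"),
    ("Female", lambda p: p["sex"] == "female"),
    ("under 185 cm", lambda p: p["height_cm"] < 185),
    ("above 185 cm", lambda p: p["height_cm"] >= 185),  # assumption: 185 counts as "above"
    ("17-20yo", lambda p: 17 <= p["age"] <= 20),
    ("21-24yo", lambda p: 21 <= p["age"] <= 24),
]

def encode(profile):
    """Return the binary vector delta for one profile."""
    return [int(test(profile)) for _, test in criteria]

binary_table = {p["name"]: encode(p) for p in profiles}

for name, delta in binary_table.items():
    print(name, delta)
```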
Let's assume I have a sample of size $N_{sample}$. Since $Y_{\delta}$ is a log-salary, I will solve $$\min_{Y_1,\dots,Y_n}\sum_{l=1}^{N_{sample}}{\left(Y_{\delta_l}-\log(Salary(l))\right)^2}$$
where $\delta_l$ and $Salary(l)$ are, respectively, the criteria and the salary of the $l^{th}$ model.
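For readers who prefer code to a spreadsheet, here is a minimal sketch of this least-squares fit using NumPy on the three rows of the example table. It assumes the fit is done on log-salaries, in line with the definition $Y_{\delta}=\log(S_{\delta})$. With only three profiles and six columns the system is underdetermined, so `numpy.linalg.lstsq` returns the minimum-norm solution.

```python
import numpy as np

# Binary table from the example (columns: Male, Female, under 185 cm,
# above 185 cm, 17-20yo, 21-24yo)
X = np.array([
    [1, 0, 0, 1, 0, 1],  # C.C
    [0, 1, 1, 0, 1, 0],  # A.A
    [0, 1, 1, 0, 0, 1],  # B.B
], dtype=float)
salaries = np.array([200.0, 300.0, 400.0])

# Fit Y_k on the log scale, since Y_delta = log(S_delta)
y = np.log(salaries)
Y_hat, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

# Back-transform to salaries: S_delta = exp(sum_k delta_k * Y_k)
fitted = np.exp(X @ Y_hat)

print("rank of the binary table:", rank)
print("fitted salaries:", fitted)
```

The reported rank is smaller than the number of columns, which is one numerical symptom of redundant criteria in the table.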
Excel can solve such a problem quickly by using the function LINEST.
My issue is not about solving the equation, but about how to pre-process the data when the input is a binary table.
Indeed, in another context, I would use a classical tool such as PCA on the correlation matrix. However, here I have a binary table, and in that case we can clearly distinguish causality and correlation.
Besides using common sense to get rid of irrelevant factors (e.g. male and above 185 cm always go together, so one factor is redundant), are there any other techniques to reduce the dimension of the problem in that context?
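The "common sense" check mentioned above (two factors that always go together) can at least be automated. The sketch below is only an illustration on the three example rows, not a general dimension-reduction technique: it flags pairs of columns that are identical or exact complements of each other.

```python
from itertools import combinations

labels = ["Male", "Female", "under 185 cm", "above 185 cm", "17-20yo", "21-24yo"]
rows = [
    [1, 0, 0, 1, 0, 1],  # C.C
    [0, 1, 1, 0, 1, 0],  # A.A
    [0, 1, 1, 0, 0, 1],  # B.B
]

# Store each column of the binary table as a tuple keyed by its label
columns = {label: tuple(row[k] for row in rows) for k, label in enumerate(labels)}

# Pairs of criteria that are matched by exactly the same profiles
identical = [
    (a, b) for a, b in combinations(labels, 2)
    if columns[a] == columns[b]
]

# Pairs of criteria where exactly one of the two is matched for every profile
complementary = [
    (a, b) for a, b in combinations(labels, 2)
    if all(x + y == 1 for x, y in zip(columns[a], columns[b]))
]

print("identical:", identical)
print("complementary:", complementary)
```

On a sample this small, many pairs look redundant by coincidence, so the output only means something on the full database.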