Feature engineering (defining the set of inputs) is a key part of building a predictive model, if not the most important one. In this post, I'd like to share my experience of how to come up with an initial set of features and how to evolve it as we learn more.
First, we need to acknowledge two forces in this setting:
- Domain experts tend to be narrowly focused on (and potentially biased by) their prior experience. Their domain knowledge can usually be encoded as "business rules", which tend to be simple and obvious (if a pattern is too complex or hidden, the human brain is not good at picking it up).
- Data scientists tend to be less biased and are good at mining a large set of signals to determine how relevant they are in an objective and quantitative manner. Unfortunately, raw data rarely gives strong signals, and without domain expertise a data scientist alone will struggle to come up with a good set of features (which usually have to be derived by combining raw data). Notice that trying out all combinations is impractical because there is an effectively unlimited number of ways to combine raw data. Also, when there are too many input features, the training data will not be enough, resulting in a model with high variance.
The best project setting (in my opinion) is to let the data scientist take control of the whole exercise (since being less biased is an advantage) while being guided by input from the domain experts.
Indicator Feature
This is a binary variable based on a very specific boolean condition (i.e. true or false) that the domain expert believes to be highly indicative of the output. For example, when predicting a stock, one indicator feature is whether the stock has dropped more than 15% in a day.
Notice that indicator features can be added at any time, as soon as the domain expert discovers a new boolean condition. Indicator features don't need to be independent of each other; in fact, most of the time they are highly correlated with one another.
After fitting these indicator features into the predictive model, we can see how much influence each of them exerts on the final prediction, which gives the domain experts feedback about the strength of these signals.
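To make this concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the function names, the toy price data, the second "filler" flag, and the choice of a small logistic regression trained by gradient descent as the predictive model. In a real project you would use whatever model and library you already have; the point is only how an indicator is computed and how the fitted weights can be read as rough feedback on signal strength.

```python
import math

def big_drop_indicator(prev_close, close, threshold=0.15):
    """Return 1 if the price fell by more than `threshold` (a fraction) in one day, else 0."""
    daily_return = (close - prev_close) / prev_close
    return 1 if daily_return < -threshold else 0

def fit_logistic(rows, labels, learning_rate=0.5, epochs=2000):
    """Logistic regression via batch gradient descent; returns (weights, bias)."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias

# Hypothetical toy data: (previous close, close, filler flag, label).
# The label is made up and simply follows the big-drop condition;
# the filler flag is deliberately unrelated to it.
days = [
    (100.0, 80.0, 1, 1),
    (100.0, 84.0, 0, 1),
    (50.0, 42.0, 1, 1),
    (100.0, 90.0, 1, 0),
    (100.0, 101.0, 0, 0),
    (20.0, 19.0, 1, 0),
    (100.0, 70.0, 0, 1),
    (40.0, 41.0, 0, 0),
]

rows = [[big_drop_indicator(prev, close), flag] for prev, close, flag, _ in days]
labels = [label for *_, label in days]

weights, bias = fit_logistic(rows, labels)
print("weight of big-drop indicator:", weights[0])
print("weight of filler flag:", weights[1])
```

On this toy data the big-drop indicator ends up with a much larger weight than the filler flag, which is the kind of feedback we can hand back to the domain expert. Keep in mind that when indicators are highly correlated, individual weights become harder to interpret.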
Derived Feature
This is a numeric variable (i.e. a quantity) that the domain expert believes to be important for predicting the output. The idea is the same as for an indicator feature, except that it is numeric in nature.
Expert Stacking
Here we build a predictive model whose input features are taken from each expert's prediction output. For example, to predict a stock, our model takes 20 analysts' predictions as its input.
The strength of this approach is that it can incorporate domain expertise very easily, because it treats the experts as black boxes (without needing to understand their logic). The model we train will take into account the relative accuracy of each expert's prediction and adjust its weighting accordingly. On the other hand, one weakness is the reliance on domain expertise at prediction time, which may or may not be available on an ongoing basis.
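As a rough sketch of expert stacking, the snippet below fits a linear model on top of two experts' past forecasts. The numbers, the names, and the use of plain least squares trained by gradient descent are all my own assumptions for illustration (two experts instead of twenty, to keep it short); any regression model could play the same role.

```python
def fit_linear(rows, targets, learning_rate=0.1, epochs=5000):
    """Least-squares linear regression via batch gradient descent; returns (weights, bias)."""
    n = len(rows)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, targets):
            err = bias + sum(w * xi for w, xi in zip(weights, x)) - y
            for j, xi in enumerate(x):
                grad_w[j] += 2.0 * err * xi / n
            grad_b += 2.0 * err / n
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias

def predict(weights, bias, expert_forecasts):
    """Combine one round of expert forecasts into a single stacked prediction."""
    return bias + sum(w * f for w, f in zip(weights, expert_forecasts))

# Hypothetical history: what actually happened (e.g. daily return in %),
# and what each expert had forecast for the same days.
actual = [1.0, -0.5, 2.0, 0.0, -1.5, 0.5]
expert_a = [1.1, -0.4, 1.9, 0.1, -1.6, 0.4]   # made up to track the outcome closely
expert_b = [-0.5, 1.0, 0.3, -1.2, 0.8, 0.2]   # made up to be mostly unrelated

# Each row holds the experts' forecasts for one day; the experts stay black boxes.
history = [[a, b] for a, b in zip(expert_a, expert_b)]
stack_weights, stack_bias = fit_linear(history, actual)

print("weight given to expert A:", stack_weights[0])
print("weight given to expert B:", stack_weights[1])
print("stacked prediction for new forecasts:", predict(stack_weights, stack_bias, [0.8, -0.3]))
```

The fitted model leans on the expert whose past forecasts tracked the outcome and largely ignores the other one. Note that `predict` needs a fresh forecast from every expert, which is exactly the ongoing dependency mentioned above.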