
Sunday, September 2, 2018

Structure Learning and Imitation Learning

In the classical prediction use case, the predicted output is either a number (for regression) or a category (for classification).  A set of training data (x, y), where x is the input and y is the labeled output, is provided to train a parameterized predictive model.

  • The model is characterized by a set of parameters w
  • Given an input x, the model predicts y_hat = f(x; w) for regression, or the probability of each possible class for classification
  • Define a loss function L(y, y_hat) for regression, or L(y, P(y=a | x), P(y=b | x), ...) for classification, and find the parameters w that minimize L
This is typically viewed as an optimization problem and solved with a gradient descent approach.
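As a tiny illustration of this loop (predict, compute the loss, follow the gradient), here is a minimal sketch in R; the linear model y_hat = w0 + w1*x, the squared loss and the data are all illustrative choices of mine, not from the post.

# Gradient descent on the squared loss of a simple linear model.
set.seed(1)
x <- runif(100, 0, 10)
y <- 3 + 2 * x + rnorm(100, sd = 0.5)      # hypothetical training data
w <- c(0, 0)                               # parameters (w0, w1)
lr <- 0.01
for (step in 1:5000) {
  y_hat <- w[1] + w[2] * x
  grad <- c(mean(2 * (y_hat - y)),         # dL/dw0
            mean(2 * (y_hat - y) * x))     # dL/dw1
  w <- w - lr * grad
}
w                                          # should approach c(3, 2)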



Need for Structure Learning

However, in some cases, y is not as simple as a number or a class.  For example
  • For machine translation, x is a sentence in English, and y is a translated sentence in French
  • For a self-driving vehicle, x is a camera image, and y is the control action on the steering wheel, brake, and gas pedal
In these cases, the output y can be viewed as an object.  But wait, can we break the object down into multiple numbers / categories and use the classical regression / classification approach to solve it?  Not quite, because the loss function cannot simply be formulated as the sum of the losses of the individual components.  For example, two French sentences using different words may both be very good translations of the same English sentence.  So we need to generalize a bit and introduce the concepts of an object and compatibility here.

The prediction problem can be generalized as: given input x, find an object y that is the "most compatible" with x.  The compatibility is a parameterized function that we are going to learn from the training data.
  • The compatibility function is defined as F(x, y; w)
  • During the training phase, we tune the parameters w such that for every sample in the training data, F(x, y; w) is the maximum.  In other words, F(x, Y=y; w) > F(x, Y=other_val; w) for any other_val ≠ y


Notice that the training process differs from classical ML in the following ways.
  • There are two optimization loops here.  a) Given the parameters w, find the y_opt that maximizes F(x, y; w).  b) Given the loss = the gap between F(x, y; w) and F(x, y_opt; w), find the w that minimizes the gap.
  • It turns out the first optimization is solved in a problem-specific way, while the second can be solved by the classical gradient descent approach.
After we learn the compatibility function parameters, at inference time we apply the first optimization to the given input x to find the most compatible y_opt such that F(x, y_opt; w) is the maximum.
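To make the two loops concrete, here is a toy sketch in the spirit of a structured perceptron; the linear compatibility F(x, y; w) = w · phi(x, y), the feature map, and the tiny enumerable candidate set are all illustrative assumptions of mine, not from the post.  The inner loop finds the most compatible object under the current w, and the outer loop nudges w whenever that object beats the labeled y.

# Toy structure-learning sketch (structured-perceptron style).  In real
# problems the inner argmax is solved in a problem-specific way; here the
# "objects" are just small integers so we can enumerate them.
phi <- function(x, y) c(-abs(y - 2 * x), 1)       # hypothetical joint feature map
F_compat <- function(x, y, w) sum(w * phi(x, y))  # F(x, y; w) = w . phi(x, y)

candidates <- 1:6                                  # toy object space
train <- list(list(x = 1, y = 2), list(x = 2, y = 4), list(x = 3, y = 6))

w <- c(0, 0)
for (epoch in 1:10) {
  for (ex in train) {
    scores <- sapply(candidates, function(y) F_compat(ex$x, y, w))
    y_opt <- candidates[which.max(scores)]         # inner loop: most compatible y
    if (y_opt != ex$y) {                           # outer loop: close the gap
      w <- w + phi(ex$x, ex$y) - phi(ex$x, y_opt)
    }
  }
}
w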

Rather than trying to exactly match y_hat to the y in the training data, structure learning enables us to learn a more abstract relationship (i.e. compatibility) between x and y, so that we can output another equally good y even if it is not the same as the y in the training data.  This more generalized form is very powerful when we don't have a lot of training data.  The downside of structure learning is that it is compute intensive, because at inference time it needs to solve an optimization problem, which is typically expensive.

Imitation Learning

In a typical Reinforcement Learning setting, a digital agent observes the state of the environment, uses its policy to determine an action that it believes will maximize its cumulative future reward, takes the action, gets the reward from the environment, and transitions to the next state.  Reinforcement learning is about how the agent optimizes its policy from its experience of interacting with the environment.



For an overview of Reinforcement Learning and its basic algorithms, you can visit my previous blog post here.

Basically, reinforcement learning learns its lessons through trial and error.  This approach can be very costly because serious mistakes can be made during the trial process (imagine using reinforcement learning to learn to drive a car: we may crash many cars before we learn anything meaningful).  In practice, people rely on a simulator to mimic the environment.  However, coming up with a good simulator is not easy because it requires a very deep understanding of how the actual environment behaves; this is one of the limitations that restrict reinforcement learning from being broadly applied.

Another important design consideration is how the reward is assigned.  Of course, we can use the actual reward from the environment to train the policy, but this is usually very inefficient.  Imagine playing a chess game: we only get the reward at the end, when we win or lose.  Propagating this reward all the way back to each move is very inefficient.  To make learning faster, we often use a technique called "reward shaping": assigning some artificial reward along the trajectory to bias the agent towards certain desirable actions (based on domain knowledge).

One special form of reward shaping is "imitation learning", where we assign intermediate reward based on how "similar" the action is to what an expert does in real-life circumstances.  Let's say we collect a set of observations of the expert taking action y in state x, and try to learn a model that biases the agent to take action y when it sees state x.  But wait, doesn't this sound like a supervised learning problem?  Can we just train a prediction model from x to y and be done?

Unfortunately, it is not that simple.  Expert data is typically very sparse and expensive to get, meaning we usually don't have much data from the expert.  Imagine a self-driving program: if we want to learn how to react when the car is about to crash, we may not find any such situation in the expert's observations, because the expert may never run into such a dangerous situation at all.  On the other hand, we don't need to copy exactly what the expert did in every situation; we just need to copy the parts that are relevant to the situation.

"Inverse Reinforcement Learning" comes into rescue.  Basically, it cuts off the reward from the environment and replace it with a "reward estimator function", which is trained from a set of expert behavior, assuming that expert behavior will achieve highest reward.


The underlying algorithm of inverse reinforcement learning is based on the "structure learning" algorithm.  In this case, x is the start state and y is the expert's trajectory, which is basically the training data.  y_opt is the trajectory produced by the agent's policy, which is learned from the reward function using a Reinforcement Learning algorithm.  The compatibility function is basically our reward function, because we assume the expert behavior achieves the highest reward.

Then we bring in the structure learning algorithm below ...


The agent still needs to interact with the environment (or a simulator) to get its trajectory, but the environment only needs to determine the next state, not the reward.  Again, there are two nested optimization loops in the algorithm
  • Given a reward function (characterized by w), use classical RL to learn the optimal policy
  • Use the optimal policy to interact with the environment and collect the total reward of each episode, then adjust the reward function parameters w such that the expert behavior always gets the highest total reward.


Friday, August 25, 2017

Reinforcement Learning Overview

There are basically 3 different types of Machine Learning
  • Supervised Learning:  The major use case is Prediction.  We provide a set of training data including the input and output, then train a model that can predict output from an unseen input.
  • Unsupervised Learning:  The major use case is Pattern extraction.  We provide a set of data that has no output, the algorithm will try to extract the underlying non-trivial structure within the data.
  • Reinforcement Learning:  The major use case is Optimization.  Mimicking how humans learn from childhood, we use a trial-and-error approach to find out which actions produce good outcomes, and bias our preference towards those good actions.
In this post, I will provide an overview of the settings of Reinforcement Learning as well as some of its key algorithms.

Agent / Environment Interaction

Reinforcement Learning is all about how we can make good decisions through trial and error.  It is the interaction between the "agent" and the "environment".

Repeat the following steps until reaching a termination condition
  1. The agent observes that the environment is in state s
  2. Out of all possible actions, the agent needs to decide which action to take (this decision rule is called the "policy", a function that outputs an action given the current state)
  3. The agent takes the action, and the environment receives that action
  4. Through a transition probability model, the environment determines the next state and proceeds to that state
  5. Through a reward distribution model, the environment determines the reward given to the agent for taking action a in state s (see the sketch after this list)
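Here is a minimal sketch of this loop; the two-state, two-action toy environment, its transition and reward numbers, and the placeholder random policy are all illustrative assumptions.

# Agent / environment interaction loop on a toy two-state, two-action model.
set.seed(1)
P <- list(list(c(0.8, 0.2), c(0.2, 0.8)),   # P(s' | s, a), indexed [[s]][[a]]
          list(c(0.3, 0.7), c(0.7, 0.3)))
R <- matrix(c(1, 0, 0, 2), nrow = 2)        # R[s, a]
policy <- function(s) sample(2, 1)          # placeholder: pick an action at random

s <- 1
for (t in 1:10) {
  a <- policy(s)                                    # step 2: agent decides
  r <- R[s, a]                                      # step 5: environment rewards
  s_next <- sample(2, 1, prob = P[[s]][[a]])        # step 4: environment transitions
  cat(sprintf("t=%d s=%d a=%d r=%.0f s'=%d\n", t, s, a, r, s_next))
  s <- s_next
}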




The goal for the agent is to determine an optimal policy such that the "value" of the start state is maximized.

Some terminology
  • Episode:  a sequence of (s1, a1, r1, s2, a2, r2, s3, a3, r3 .... st, at, rt ... sT, aT, rT)
  • Reward rt:  The payoff the agent receives after taking the action at the state at time t
  • Return:  The cumulative reward from the time the action is taken (sum of rt, r[t+1], ... rT)
  • Value:  Expected return at a particular state, called "state value" V(s), or expected return when taking action a at state s, called "Q Value" Q(s,a)
The optimal policy can be formulated as choosing the action a* among all choices of a at state s such that Q(s, a*) is the maximum.

To deal with never-ending interaction, we put a discount factor "gamma" on future rewards.  This discount factor turns the sum of an infinite series into a finite number.
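For example, the discounted return of one (illustrative) episode can be computed as:

# Discounted return: G = r1 + gamma*r2 + gamma^2*r3 + ...
gamma <- 0.9
rewards <- c(1, 0, 0, 5, 2)                          # illustrative reward sequence
G <- sum(gamma^(seq_along(rewards) - 1) * rewards)
G                                                    # 1 + 0.9^3*5 + 0.9^4*2 = 5.9572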

Optimal Policy when model is known

If we know the "model", then figuring out the policy is easy.  We just need to use dynamic programming techniques to compute the optimal policy offline, and there is no need for learning.

Two algorithms can be used: 

"Value iteration" starts with a random value and iteratively update the value based on the Bellman's equation, and finally compute the "value" of each state or state/action pair (also call Q state). The optimal policy for a given state s is to choose the action a* that maximize the Q value, Q(s, a). 


Another algorithm "Policy iteration" starts with a random policy, and iteratively modifies the policy to make it better, until the policy at next iteration doesn't change any more.


However, in practice, we usually don't know the model, so we cannot compute the optimal policy as described above.

Optimal Policy when model is unknown

One solution is "model-based" learning: we spend some time learning the transition probability model as well as the reward distribution model.  To make sure we experience all combinations of state/action pairs, we take random actions in order to learn the model.

Once we learn the model, we can go back to use the value iteration or policy iteration to determine the optimal policy.

Learning has a cost, though.  Rather than taking the best action, we take random actions to explore actions we haven't tried before, and it is very likely that the associated reward is not the maximum.  However, we accumulate knowledge about how the environment reacts under a wider range of scenarios, and hopefully this will help us take better actions in the future.  In other words, we trade off short-term gain for long-term gain.

Striking the right balance is important.  A common approach is the epsilon-greedy algorithm: at each decision step, with a small probability e we take a random action, and with probability (1 - e) we take the best action we have found so far, as sketched below.
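A minimal sketch of an epsilon-greedy action selector (the names and numbers are illustrative):

# Epsilon-greedy: explore with probability epsilon, otherwise exploit.
epsilon_greedy <- function(Q_row, epsilon = 0.1) {
  # Q_row: current Q estimates for every action in the current state
  if (runif(1) < epsilon) {
    sample(seq_along(Q_row), 1)      # explore: random action
  } else {
    which.max(Q_row)                 # exploit: best known action
  }
}
epsilon_greedy(c(0.2, 0.5, 0.1))     # usually returns 2, occasionally random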


Another approach is "model-free" learning.  If we go back and look at the detailed formulas behind value iteration and policy iteration, the reason for knowing the model is to calculate the expected state value and Q value.  Can we directly figure out these expected values through trial and error?

Value-based model-free learning

If we modify the Q value iteration algorithm to replace the expected reward/next state with the actual observed reward/next state, we arrive at the SARSA algorithm below.
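Here is a minimal tabular SARSA sketch on a hypothetical two-state, two-action environment; the model numbers and hyperparameters are illustrative.

# Tabular SARSA: Q(s,a) <- Q(s,a) + alpha * (r + gamma*Q(s',a') - Q(s,a))
set.seed(1)
P <- list(list(c(0.8, 0.2), c(0.2, 0.8)),   # P(s' | s, a)
          list(c(0.3, 0.7), c(0.7, 0.3)))
R <- matrix(c(1, 0, 0, 2), nrow = 2)        # R[s, a]
gamma <- 0.9; alpha <- 0.1; epsilon <- 0.1

eps_greedy <- function(Q_row) {
  if (runif(1) < epsilon) sample(2, 1) else which.max(Q_row)
}

Q <- matrix(0, 2, 2)
s <- 1; a <- eps_greedy(Q[s, ])
for (t in 1:5000) {
  r <- R[s, a]
  s_next <- sample(2, 1, prob = P[[s]][[a]])   # actual next state, not an expectation
  a_next <- eps_greedy(Q[s_next, ])
  Q[s, a] <- Q[s, a] + alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
  s <- s_next; a <- a_next
}
Q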



Deep Q Learning 

The algorithm above requires us to keep a table of all Q(s, a) values, which can be huge, and which becomes infinite if the state or action is continuous.  To deal with this, we introduce the idea of a value function approximator.  The state and action become the inputs of this function, which creates "input features", feeds them into a (e.g. linear) model, and finally outputs the Q value.



Now we modify the previous SARSA algorithm to the following ...

  • Instead of looking up the Q(s, a) value, we call the function (which can be a DNN), pass in the features f(s, a), and get its output
  • We randomly initialize the parameters of the function (the weights, if the function is a DNN)
  • We update the parameters using gradient descent on the loss, which can be the difference between the estimated value and a target value (e.g. a one-step look-ahead estimate: r + gamma*max_a'[Q(s', a')]), as sketched below
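A minimal sketch of that idea, using a hypothetical one-hot feature map f(s, a) and a linear model on the same toy environment as before; all of these choices are illustrative.

# Q learning with a linear function approximator: Q(s,a) = w . f(s,a).
set.seed(1)
P <- list(list(c(0.8, 0.2), c(0.2, 0.8)),        # P(s' | s, a)
          list(c(0.3, 0.7), c(0.7, 0.3)))
R <- matrix(c(1, 0, 0, 2), nrow = 2)             # R[s, a]
gamma <- 0.9; alpha <- 0.01; epsilon <- 0.1

f <- function(s, a) {                            # hypothetical feature map f(s, a)
  v <- rep(0, 4); v[(s - 1) * 2 + a] <- 1; v     # one-hot over the (s, a) pairs
}
q_hat <- function(s, a, w) sum(w * f(s, a))

w <- rep(0, 4)
s <- 1
for (t in 1:20000) {
  if (runif(1) < epsilon) {
    a <- sample(2, 1)                                       # explore
  } else {
    a <- which.max(c(q_hat(s, 1, w), q_hat(s, 2, w)))       # exploit
  }
  r <- R[s, a]
  s_next <- sample(2, 1, prob = P[[s]][[a]])
  target <- r + gamma * max(q_hat(s_next, 1, w), q_hat(s_next, 2, w))
  # Gradient step on the TD error; the gradient of w . f(s,a) w.r.t. w is f(s,a)
  w <- w + alpha * (target - q_hat(s, a, w)) * f(s, a)
  s <- s_next
}
w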


If we further generalize the Q value function using a deep neural network, and update the parameters using back propagation, we reach a simple version of Deep Q Learning.

While this algorithm allows us to learn a Q value function that can represent a continuous state, we still need to evaluate every action and pick the one with the maximum Q value.  In other words, the action space can only be discrete and finite.

Policy gradient

Since the end goal is to pick the right action, and finding the Q value is just a means to that end (so we can pick the action with maximum Q), why don't we learn a function that takes a state and directly outputs an action?  With this policy function approach, we can handle both continuous and discrete action spaces.

The key idea is to learn a function that, given a state, outputs an action:

  • If the action is discrete, it outputs a probability distribution over the possible actions
  • If the action is continuous, it outputs the mean and variance of the action, assuming a normal distribution
The agent samples from the output distribution to determine the action, so its chosen action is stochastic (non-deterministic).  Then the environment determines the reward and the next state.  The cycle repeats ...

The goal is to find the best policy function such that the expected value of Q(s, a) is maximized.  Notice that s and a are random variables whose distribution is parameterized by θ.




To maximize an "expected value" of a function with parameters θ, we need to calculate the gradient of that function.
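One common way to write this gradient (the likelihood-ratio / policy-gradient form, stated here for reference) is:

\nabla_\theta J(\theta) \;=\; \mathbb{E}_{s,\,a \sim \pi_\theta}\!\left[\, \nabla_\theta \log \pi_\theta(a \mid s)\; Q^{\pi_\theta}(s, a) \,\right]

so an unbiased estimate of the gradient can be obtained by sampling actions from the policy and weighting the score function ∇_θ log π_θ(a | s) by an estimate of the Q value.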


Actor Critic Algorithm

There are 2 moving targets in this equation:

  • To improve the policy function, we need an accurate estimate of the Q value, and we also need to know the gradient of log π(a | s)
  • To make the Q value estimation more accurate, we need a stable policy function
We can break these down into two different roles
  • An actor, whose job is to improve the policy function by tuning the policy function parameters
  • A critic, whose job is to fine tune the estimation of Q value based on current (incrementally improving) policy
The "actor critic" algorithm is shown below.


Then we enhance this algorithm by adding the following steps
  • Replace the Q value function with an Advantage function, where A(s, a) = Q(s, a) - Expected Q(s, *).  ie:  A(s, a) = Q(s, a) - V(s)
  • Run multiple threads asynchronously
This is the state of the art A3C algorithm.



Learning resources and credits


Some of the algorithms I discussed above are extracted from the following sources


Saturday, July 15, 2017

Regression model outputting probability density distribution

For a classification problem (let's say the output is one of the labels R, G, B), how do we predict?

There are two formats that we can report our prediction
  1. Output a single value which is the most probable outcome, e.g. output "B" if P(B) > P(R) and P(B) > P(G)
  2. Output the probability estimation of each label.  (e.g. R=0.2, G=0.3, B=0.4)
But if we look at a regression problem (let's say we output a numeric value v), most regression models only output a single value (the one that minimizes the RMSE).  In this article, we will look at some use cases where outputting a probability density function is much preferred.

Predict the event occurrence time

As an illustrative example, we want to predict when a student will finish her work given that she has already spent some time s on it.  In other words, we want to estimate E[t | t > s], where t is a random variable representing the total duration and s is the elapsed time so far.

Estimating time t is generally hard if the model only outputs an expectation.  Notice that the model sees the same set of features, except that the elapsed time changes continuously as time passes.

Let's look at how we can train a prediction model that outputs a density distribution.

Let's say our raw data schema is: [feature, duration]
  • f1, 13.30
  • f2, 14.15
  • f3, 15.35
  • f4, 15.42
Take a look at the range (i.e. min and max) of the output value.  We transform it into training data with the following schema:
[feature, dur<13, dur<14, dur<15, dur<16]
  • f1, 0, 1, 1, 1
  • f2, 0, 0, 1, 1
  • f3, 0, 0, 0, 1
  • f4, 0, 0, 0, 1
After that, we train 4 classification models.
  • feature, dur<13
  • feature, dur<14
  • feature, dur<15
  • feature, dur<16


Now, given a new observation with the corresponding features, we can invoke these 4 models to output the binary classification probabilities (cumulative probabilities).  If we want the probability density, we simply take the differences (i.e. differentiate the cumulative probability).

At this point, we can output a probability distribution given the input features, as sketched below.
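Here is a sketch of the approach with simulated data (the feature, the numbers, and the use of logistic regression are illustrative choices): one binary classifier per duration threshold, then difference the cumulative probabilities to get a density.

# One logistic regression per threshold, then differentiate the CDF.
set.seed(1)
n <- 500
x <- runif(n)                                  # hypothetical input feature
duration <- 13 + 2 * x + rnorm(n, sd = 0.5)    # hypothetical total duration

thresholds <- c(13, 14, 15, 16)
models <- lapply(thresholds, function(th) {
  y <- as.numeric(duration < th)               # cumulative binary label
  glm(y ~ x, family = binomial)                # models P(duration < th | x)
})

new_obs <- data.frame(x = 0.7)
cum_prob <- sapply(models, function(m) predict(m, new_obs, type = "response"))
density <- diff(c(0, cum_prob))                # P(d<13), P(13<=d<14), P(14<=d<15), P(15<=d<16)
rbind(threshold = thresholds, cum_prob = cum_prob, density = density)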



Now, we can easily estimate the remaining time from the expected time in the shaded region.  As time passes, we just need to slide the red line continuously and recalculate the expected time; we don't need to re-run the prediction models unless the input features have changed.
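For example, here is a sketch of recomputing E[t | t > s] from the per-bucket density, using bucket midpoints as an illustrative discretization (the numbers are made up).

# Expected total duration given the elapsed time s, from the bucket density.
expected_duration_given_elapsed <- function(mid, density, s) {
  keep <- mid > s                              # only buckets still possible
  sum(mid[keep] * density[keep]) / sum(density[keep])
}
mid     <- c(12.5, 13.5, 14.5, 15.5)           # bucket midpoints
density <- c(0.05, 0.25, 0.45, 0.25)           # illustrative output of the models
expected_duration_given_elapsed(mid, density, s = 14)   # E[t | t > 14]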

Predict cancellation before commitment 

As an illustrative example, let's say a restaurant customer has reserved a table for 8:00pm.  It is now 7:55pm and the customer still hasn't arrived; what is the chance of a no-show?

Similarly, given a person (with features x) who still hasn't bought the ticket when the current time is S - t, we want to predict the probability of this person showing up to watch the movie.

Let's say our raw data schema is: [feature, arrival]
  • f1, -15.42
  • f2, -15.35
  • f3, -14.15
  • f4, -13.30
  • f5, infinity
  • f6, infinity
We transform it into training data with the following schema:
[feature, arr<-16, arr<-15, arr<-14, arr<-13]
  • f1, 0, 1, 1, 1
  • f2, 0, 1, 1, 1
  • f3, 0, 0, 1, 1
  • f4, 0, 0, 0, 1
  • f5, 0, 0, 0, 0
  • f6, 0, 0, 0, 0
After that, we train 4 classification models.
  • feature, arr<-16
  • feature, arr<-15
  • feature, arr<-14
  • feature, arr<-13
Notice that P(arr < 0) can be smaller than 1 because the customer can be a no-show.





In this post, we discussed some use cases where we need the regression model to output not just a value prediction but also a probability density distribution, and we illustrated how to build such a prediction model.

Sunday, July 2, 2017

How AI differs from ML

AI is not a new term; it is multiple decades old, dating back to around the early 80s when computer scientists designed algorithms that could "learn" and "mimic human behavior".

On the "learning" side, the most significant algorithm is Neural Network, which is not very successful due to overfitting (the model is too powerful but not enough data).  Nevertheless, in some more specific tasks, the idea of "using data to fit a function" has gained significant success and this form the foundation of "machine learning" today.

On the "mimic" side, people have focus in "image recognition", "speech recognition", "natural language processing", experts have been spending tremendous amount of time to create features like "edge detection", "color profile", "N-grams", "Syntax tree" ... etc.  Nevertheless, the success is moderate.

Traditional Machine Learning

Machine Learning (ML) techniques have played a significant role in prediction, and ML has undergone multiple generations, with a rich set of model structures, such as

  • Linear regression
  • Logistic regression
  • Decision tree
  • Support Vector Machine
  • Bayesian model
  • Regularization model
  • Ensemble model
  • Neural network

Each of these predictive models is based on a certain algorithmic structure, with parameters as tunable knobs.  Training a predictive model involves the following

  1. Choose a model structure (e.g. Logistic regression, or Random forest, or ...)
  2. Feed the model with training data (with both input and output)
  3. The learning algorithm will output the optimal model (ie: model with specific parameters that minimize the training error)

Each model has its own characteristics and will perform well in some tasks and badly in others.  But generally, we can group them into low-power (simple) models and high-power (complex) models.  Choosing between different models is a very tricky question.

Traditionally, using a low power / simple model is preferred over the use of a high power / complex model for the following reasons

  • Until we have massive processing power, training the high power model will take too long
  • Until we have a massive amount of data, training the high-power model will cause the overfitting problem (since the high-power model has rich parameters and can fit a wide range of data shapes, we may end up training a model that fits too specifically to the current training data and is not generalized enough to do good prediction on future data).

However, choosing a low-power model suffers from the so-called "under-fit" problem, where the model structure is too simple and unable to fit the training data when the data is more complex.  (Imagine the underlying data has a quadratic relationship, y = 5 * x^2: there is no way you can fit it with a linear regression y = a*x + b, no matter what a and b we pick.)

To mitigate the "under-fit" problem, data scientists typically apply their "domain knowledge" to come up with "input features" that have a more direct relationship with the output.  (e.g. Going back to the quadratic relationship y = 5 * x^2: if you create a feature z = x^2, then you can fit a linear regression y = a*z + b by picking a = 5 and b = 0.)
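A small illustration of that quadratic example with simulated data (the numbers are made up):

# Under-fit vs. an engineered feature: y = 5 * x^2 plus noise.
set.seed(1)
x <- runif(200, -3, 3)
y <- 5 * x^2 + rnorm(200)
summary(lm(y ~ x))$r.squared      # plain linear fit on x: poor (under-fit)
z <- x^2                          # feature created from domain knowledge
summary(lm(y ~ z))$r.squared      # fit on z: near 1
coef(lm(y ~ z))                   # coefficient on z is close to 5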

The major obstacle in "Machine Learning" is this "feature engineering" step, which requires deep "domain experts" to identify important signals before feeding data into the training process.  The feature engineering step is very manual and demands a lot of scarce domain expertise, and therefore becomes the major bottleneck of most machine learning tasks today.

In other words, if we don't have enough processing power and enough data, then we have to use the low-power / simpler model, which requires us to spend significant time and effort to create appropriate input features.  This is where most data scientists spend their time today.

Return of Neural Network

In the early 2000s, machine processing power increased tremendously with the advancement of cloud computing and massively parallel processing infrastructure, together with the big data era in which massive amounts of fine-grained event data are collected.  We are no longer restricted to low-power / simple models.  For example, the two most popular, mainstream machine learning models today are Random Forest and Gradient Boosting Trees.  Nevertheless, although both of them are very powerful and provide non-linear fits to the training data, data scientists still need to carefully create features in order to achieve good performance.

At the same time, computer scientists revisited the use of many-layer Neural Networks for these human-mimicking tasks.  This gave new life to DNNs (Deep Neural Networks) and provided significant breakthroughs in image classification and speech recognition tasks.  The major difference of a DNN is that you can feed the raw signals (e.g. the RGB pixel values) directly into the DNN without creating any domain-specific input features.  Through many layers of neurons (hence it is called a "deep" neural network), a DNN can "automatically" generate the appropriate features at each layer and finally provide a very good prediction.  This significantly saves on the "feature engineering" effort, a major bottleneck for data scientists.

DNNs have also evolved into many different network topologies, so we have CNNs (Convolutional Neural Networks), RNNs (Recurrent Neural Networks), LSTM (Long Short-Term Memory), GANs (Generative Adversarial Networks), Transfer Learning, Attention Models ... etc.  The whole spectrum is called Deep Learning, which is catching the whole machine learning community's attention today.

Reinforcement Learning

Another key component is how to mimic the way a person (or animal) learns.  Imagine the very natural animal behavior of the perceive/act/reward cycle.  A person or animal first understands the environment by sensing what "state" it is in.  Based on that, it picks an "action" that brings it to another "state".  Then it receives a "reward".  The cycle repeats until it dies.  This way of learning (called "Reinforcement Learning") is quite different from the "curve fitting" approach of traditional supervised machine learning.  In particular, learning in RL is very fast because every new piece of feedback (such as performing an action and receiving a reward) is sent immediately to influence subsequent decisions.  Reinforcement Learning has gained tremendous success in self-driving cars as well as AlphaGo (the Go-playing program).

Reinforcement Learning also provides a smooth integration between "Prediction" and "Optimization", because it maintains a belief about the current state and the possible transition probabilities when taking different actions, and then makes decisions about which action can lead to the best outcome.

AI = DL + RL

Compared to classical ML techniques, DL provides a more powerful prediction model that usually produces good prediction accuracy.  Compared to classical optimization models using LP, RL provides a much faster learning mechanism and is also more adaptive to changes in the environment.

Saturday, June 27, 2015

When machines replace humans

Recently, a good friend sent me an article from Harvard Business Review called "Beyond Automation", written by Thomas H. Davenport and Julia Kirby.  The article talks about how automation affects our workforce and displaces value from human workers.  It proposes 5 strategies for how we can get prepared to retain competitiveness in the automation era.  This is a very good article and it triggered a lot of thoughts for me.

I want to explore a fundamental question:  "Can machines replace humans in the future?"

Let's start by looking at what machines are and are not doing today.  Machines operate under a human's program, and therefore they can only solve those problems that we humans can express or codify in a structured form.  Don't underestimate the power underneath.  With good abstract thinking, the smartest humans in the world have partitioned a large number of problems (by their nature) into different problem categories.  Each category is expressed in the form of a "generic problem" and subsequently a "general solution" is developed.  Notice that computer scientists have been doing this for many decades, and have come up with powerful algorithms such as "sorting", "finding the shortest path", "heuristic search" ... etc.

By grouping concrete problems by their nature into a "generic, abstract problem", we can significantly reduce the number of cases/scenarios while still covering a lot of ground.  The "generic solution" we develop can also be specialized for each concrete problem scenario.  After that, we can develop a software program which can be executed on a large cluster of machines equipped with fast CPUs and a lot of memory.  Compare this automated solution with what a human can do in a manual fashion.  In these areas, once problems are well defined and solutions are automated by software, computers with much more powerful CPUs and memory will always beat humans by many, many orders of magnitude.  There is no question that human jobs in these areas will be eliminated.

In terms of capturing our experience using abstract data structures and algorithms, computer scientists are very far from done.  There is still a very large body of problems that even the smartest humans haven't completely figured out how to put into a structured form yet.  Things that involve "perception", "intuition", "decision making", "estimation", and "creativity" are primarily done today by humans.  I believe these types of jobs will continue to be done by human workers in the next decade.  On the other hand, with the latest technology research, we continuously push the boundary of automation into some of these areas.  "Face recognition" and "voice recognition", which involve a high degree of perception, can now be done very accurately by software.  With "machine learning" technology, we can do "prediction" and make judgements in a more objective way than a human.  Together with "planning" and "optimization" algorithms, a large percentage of decision making can be automated, and the result is usually better because it is done in a less biased, data-driven manner.

However, in these frontier areas where the latest software technology is unable to automate every step, a human is needed in the loop to make a final decision, or to intervene in those exceptional situations that the software is not programmed to handle.  There are jobs where a human and a machine can work together to produce a better outcome.  This is what the article calls "augmentation".  Some examples are artists using advanced software to touch up their photos, using computer graphics to create movies, using machine learning to do genome sequence processing, using robots to perform surgery, driverless vehicles ... etc.

Whether computer programs can replace humans completely remains to be seen, but I don't think this will happen in the next two decades.  We humans are unique and good at perceiving things at multiple levels of abstraction from different angles.  We are good at connecting the dots between unrelated areas.  We can invent new things.  These are things that machines will find very hard to do, or that will at least take a long time, if they are possible at all.

"When can we program a machine that can write program ?"

The HBR article suggests a person can consider five strategies (step up, step aside, step in, step narrowly and step forward) to retain value in the automation era.  I favor the "step forward" strategy because the person is driving the trend rather than passively reacting to it.  Looking back at our history, the human value system has shifted across the industrial revolution, the internet revolution, etc.  At the end of the day, it is more sophisticated humans who take away jobs (and value) from less sophisticated humans.  And it is always the people who drive the movement who are the winners of this value shift.  It happened in the past and will continue into the future.


Sunday, July 27, 2014

Incorporate domain knowledge into predictive model

As data scientists / consultants, in many cases we are called in to work with domain experts who have in-depth business knowledge of their industry.  The main objective is to help our clients validate and quantify the intuition of existing domain knowledge based on empirical data, and to remove any judgement bias.  In many cases, customers also want to build a predictive model to automate their business decision-making process.

To create a predictive model, feature engineering (defining the set of inputs) is a key part, if not the most important one.  In this post, I'd like to share my experience of how to come up with the initial set of features and how to evolve it as we learn more.

Firstly, we need to acknowledge two forces in this setting
  1. Domain experts tend to be narrowly focused on (and potentially biased towards) their prior experience.  Their domain knowledge can usually be encoded in terms of "business rules" and tends to be simple and obvious (if a rule is too complex and hidden, the human brain is not good at picking it up).
  2. Data scientists tend to be less biased and good at mining through a large set of signals to determine how relevant they are, in an objective and quantitative manner.  Unfortunately, raw data rarely gives strong signals.  And lacking the domain expertise, a data scientist alone will not even be able to come up with a good set of features (which usually requires derivation from combinations of raw data).  Notice that trying out all combinations is impractical because there are an infinite number of ways to combine raw data.  Also, when you have too many input features, the training data will not be enough, resulting in a model with high variance.
Maintaining a balance between these forces is a critical success factor in many data science projects.

The best project setting (in my opinion) is to let the data scientist take control of the whole exercise (as less bias is an advantage), while being guided by input from the domain experts.

Indicator Feature

This is a binary variable based on a very specific boolean condition (i.e. true or false) that the domain expert believes to be highly indicative of the output.  For example, for predicting a stock, one indicator feature is whether the stock has dropped more than 15% in a day.

Notice that indicator features can be added at any time, once a new boolean condition is discovered by the domain expert.  Indicator features don't need to be independent of each other, and in fact most of the time they are highly inter-correlated.

After fitting these indicator features into the predictive model, we can see how much influence each of these features exerts on the final prediction, and hence provide feedback to the domain experts about the strength of these signals.

Derived Feature

This is a numeric variable (i.e. a quantity) that the domain expert believes to be important for predicting the output.  The idea is the same as the indicator feature, except that it is numeric in nature.

Expert Stacking

Here we build a predictive model whose input features are taken from each expert's prediction output.  For example, to predict a stock, our model takes 20 analysts' predictions as its input.

The strength of this approach is that it can incorporate domain expertise very easily, because it treats the experts as black boxes (without needing to understand their logic).  The model we train will take into account the relative accuracy of each expert's prediction and adjust its weighting accordingly.  On the other hand, one weakness is the reliance on domain expertise at prediction time, which may or may not be available on an ongoing basis.
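A minimal sketch of expert stacking with simulated data (the experts and their noise levels are illustrative): the meta-model's inputs are simply the experts' predictions, and the learned weights reflect their relative accuracy.

# Expert stacking: a model whose input features are the experts' predictions.
set.seed(1)
n <- 1000
truth   <- rnorm(n)
expert1 <- truth + rnorm(n, sd = 0.5)          # a fairly accurate expert
expert2 <- truth + rnorm(n, sd = 2.0)          # a much noisier expert
stacker <- lm(truth ~ expert1 + expert2)
coef(stacker)                                  # expert1 gets a much larger weight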

Wednesday, March 12, 2014

Common Text Mining workflow

In this post, I want to summarize a common pattern that I have used in my previous text mining projects.

Text mining is different in that it uses vocabulary terms as key elements in feature engineering; otherwise it is quite similar to a statistical data mining project.  The following are the key steps ...
  1. Determine the "object" that we are interested in analyzing.  In some cases, the text document itself is the object (e.g. an email).  In other cases, the text document provides information about the object (e.g. a user comment about a product, tweets about a company)
  2. Determine the features of the object that we are interested in, and create the corresponding feature vector for the object.
  3. Feed the data (each object and its corresponding set of features) to standard descriptive analytics and predictive analytics techniques.
The overall process of text mining can be described in the following flow ...



Extract docs

In this phase, we extract text documents from various types of external sources into a text index (for subsequent search) as well as a text corpus (for text mining).

The document source can be a public web site, an internal file system, or a SaaS offering.  Extracting documents typically involves one of the following ...
  • Perform a Google search or crawl a predefined list of web sites, then download the web pages from the list of URLs, parse the DOM to extract text data from its sub-elements, eventually creating one or multiple documents, and store them into the text index as well as the text corpus.
  • Invoke the Twitter API to search for tweets (or monitor a particular topic stream of tweets), and store them into the text index and text corpus.
  • There is no limit on where the text data can come from.  In an intranet environment, this can be downloading text documents from a shared drive.  On the other hand, on a compromised computer, a user's email or IM can also be downloaded by the virus agent.
  • If the text is in a different language, we may also invoke some machine translation service (e.g. Google translate) to convert the language to English.
Once the document is stored in the text index (e.g. a Lucene index), it is available for search.  Also, once the document is stored in the text corpus, further text processing can be applied.

Transformation

After the document is stored in the corpus, here are some typical transformations ...
  • If we want to extract information about some entities mentioned in the document, we need to conduct sentence and paragraph segmentation in order to provide some local context, from which we can analyze each entity with respect to its relationships with other entities.
  • Attach Part-Of-Speech tagging, or entity tagging (person, place, company), to each word.
  • Apply standard text processing such as lower-casing, removing punctuation, removing numbers, removing stopwords, and stemming.
  • Perform domain-specific conversions, such as replacing date patterns (dddd-dd-dd) and phone number patterns ((ddd)ddd-dddd) with placeholder tokens, removing header and footer template text, and removing terms according to a domain-specific stop-word dictionary.
  • Optionally, normalize the words to their synonyms using WordNet or a domain-specific dictionary.  (A sketch of these standard transformations with R's tm package follows below.)
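Here is a sketch of these standard transformations using R's tm package (which is mentioned in the Tools section below); the toy documents are illustrative, and stemDocument additionally requires the SnowballC package.

# Typical tm transformation pipeline on a toy corpus.
library(tm)
docs <- c("The quick brown fox JUMPED over 2 lazy dogs!",
          "Dogs are jumping over foxes.")
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))      # lower case
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)                      # stemming (needs SnowballC)
corpus <- tm_map(corpus, stripWhitespace)
inspect(corpus[[1]])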

Extract Features

For text mining, the "bag-of-words model" is commonly used as the feature set.  In this model, each document is represented as a word vector (a high-dimensional vector whose magnitudes represent the importance of each word in the document).  Hence all documents within the corpus are represented as a giant document/term matrix.  The "term" can be generalized to a uni-gram, bi-gram, tri-gram or n-gram, while the cell value in the matrix represents the frequency with which the term appears in the document.  We can also use TF/IDF as the cell value to dampen the importance of terms that appear in many documents.  If we just want to represent whether the term appears in the document, we can binarize the cell value into 0 or 1.

After this phase, the corpus turns into a large, sparse document/term matrix.
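A sketch of building the document/term matrix with tm; the toy corpus and the weighting choices are illustrative.

# Document/term matrix with different cell weightings.
library(tm)
docs <- c("the cat sat on the mat", "the dog sat on the log", "cats and dogs play")
corpus <- VCorpus(VectorSource(docs))
dtm_counts <- DocumentTermMatrix(corpus)                   # raw term frequencies
dtm_tfidf  <- DocumentTermMatrix(corpus,
                control = list(weighting = weightTfIdf))   # TF/IDF weighting
dtm_binary <- weightBin(dtm_counts)                        # binarized 0/1 cells
inspect(dtm_tfidf)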

Reduce Dimensions

Since each row in the document/term matrix represents a document as a high-dimensional vector (with each dimension representing the occurrence of a term), there are two reasons we want to reduce its dimensionality ...
  1. For efficiency reasons, we want to reduce the memory footprint of storing the corpus
  2. We want to transform the vector from the "term" space to a "topic" space, which allows documents on similar topics to sit close to each other even if they use different terms.  (e.g. documents using the words "pet" and "cat" are mapped to the same topic based on their co-occurrence)
SVD (Singular Value Decomposition) is a common matrix factorization technique for converting a "term" vector into a "concept" vector.  SVD can be used to factor a large sparse M-by-N matrix into the product of three smaller dense matrices of sizes M*K, K*K and K*N.  Latent Semantic Indexing (LSI) is the application of SVD to the document/term matrix.

Another popular technique called topic modeling, based on LDA (Latent Dirichlet Allocation), is also commonly used to transform documents into a smaller set of topic dimensions.
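A sketch of LDA topic modeling with the topicmodels package (the toy corpus and the number of topics k are illustrative choices):

# Topic modeling with LDA on a toy corpus.
library(tm)
library(topicmodels)
docs <- c("cats and dogs are popular pets",
          "stock prices rose as markets rallied",
          "my dog chased the neighbour's cat",
          "investors worry about interest rates")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))
lda_model <- LDA(dtm, k = 2, control = list(seed = 1))   # fit 2 latent topics
terms(lda_model, 5)      # top 5 terms of each topic
topics(lda_model)        # most likely topic of each document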

Apply standard data mining

At this point, each document is represented as a topic vector.  We can also add more domain-specific features (for spam detection, for example, whether the document contains certain words or character patterns such as '$' or '!').  After that, we can feed each vector into the regular machine learning process.

Tools and Library

I have used Python's NLTK as well as R's tm and topicmodels libraries for performing the text mining work described above.  Both of these libraries provide a good set of features for mining text documents.

Monday, March 3, 2014

Estimating statistics via Bootstrapping and Monte Carlo simulation

We want to estimate some "statistics" (e.g. average income, 95 percentile height, variance of weight ... etc.) from a population.

It would be too tedious to enumerate all members of the whole population.  For efficiency reasons, we randomly pick a number of samples from the population and compute the statistics of the sample set to estimate the corresponding statistics of the population.  We understand that an estimate made this way (via random sampling) can deviate from the population value.  Therefore, in addition to our estimated statistics, we also include a "standard error" (how far our estimate may deviate from the actual population statistic) or a "confidence interval" (a lower and upper bound on the statistic which we are confident contains the true value).

The challenge is how to estimate the "standard error" or the "confidence interval".  A straightforward way is to repeat the sampling exercise many times, each time creating a different sample set from which we compute one estimate.  Then we look across all the estimates from the different sample sets to estimate the standard error and confidence interval of the estimate.

But what if collecting data for a different sample set is expensive, or for any reason the population is no longer accessible after we have collected our first sample set?  Bootstrapping provides a way to address this ...

Bootstrapping

Instead of creating additional sample sets from the population, we create additional sample sets by re-sampling data (with replacement) from the original sample set.  Each of the created sample sets follows the same data distribution as the original sample set, which, in turn, follows the population.

R provides a nice "boot" library to do this.

> library(boot)
> # Generate a population
> population.weight <- rnorm(100000, 160, 60)
> # Lets say we care about the ninety percentile
> quantile(population.weight, 0.9)
     90% 
236.8105 
> # We create our first sample set of 500 samples
> sample_set1 <- sample(population.weight, 500)
> # Here is our sample statistic of ninety percentile
> quantile(sample_set1, 0.9)
     90% 
232.3641 
> # Notice that the sample statistics deviates from the population statistics
> # We want to estimate how big is this deviation by using bootstrapping
> # I need to define my function to compute the statistics
> ninety_percentile <- function(x, idx) {return(quantile(x[idx], 0.9))}
> # Bootstrapping will call this function many times with different idx
> boot_result <- boot(data=sample_set1, statistic=ninety_percentile, R=1000)
> boot_result

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = sample_set1, statistic = ninety_percentile, R = 1000)


Bootstrap Statistics :
    original   bias    std. error
t1* 232.3641 2.379859     5.43342
> plot(boot_result)
> boot.ci(boot_result, type="bca")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = boot_result, type = "bca")

Intervals : 
Level       BCa          
95%   (227.2, 248.1 )  
Calculations and Intervals on Original Scale


Here is the visual output of the bootstrap plot

Bootstrapping is a powerful simulation technique for estimating any statistic in an empirical way.  It is also non-parametric, because it doesn't assume any model or parameters and just uses the original sample set to estimate the statistic.

If we assume a certain distribution model and want to see the distribution of a certain statistic, Monte Carlo simulation provides a powerful way to do this.

Monte Carlo Simulation

The idea is pretty simple: based on a particular distribution function (defined by specific model parameters), we generate many sets of samples.  We compute the statistic of each sample set and see how the statistic is distributed across the different sample sets.

For example, given a normal distribution population, what is the probability distribution of the max value of 5 randomly chosen samples.

> sample_stats <- rep(0, 1000)
> for (i in 1:1000) {
+     sample_stats[i] <- max(rnorm(5))
+ }
> mean(sample_stats)
[1] 1.153008
> sd(sample_stats)
[1] 0.6584022
> par(mfrow=c(1,2))
> hist(sample_stats, breaks=30)
> qqnorm(sample_stats)
> qqline(sample_stats)


Here is the distribution of the "max(5)" statistics, which shows some right skewness

Bootstrapping and Monte Carlo simulation are powerful tools for estimating statistics in an empirical manner, especially when we don't have an analytic solution.

Saturday, September 7, 2013

Exploration and Exploitation

"Cold Start" is a common problem happens quite frequent in recommendation system.  When a new item enters, there is no prior history that the recommendation system can use.  Since recommendation is an optimization engine which recommends item that matches the best with the user's interests.  Without prior statistics, the new item will hardly be picked up as recommendation and hence continuously lacking the necessary statistics that the recommendation system can use.

One example is movie recommendation, where a movie site recommends movies to users based on their past viewing history.  When a new movie arrives on the market, there aren't enough viewing statistics about it, so the new movie will not have a strong match score and won't be picked as a recommendation.  Because we never learn from items that we haven't recommended, the new movie will continue to have no statistics and therefore will never be picked in future recommendations.

Another cold-start example is online ad serving, when a new ad enters the ad repository.

Multilevel Granularity Prediction

One solution to the cold-start problem is to leverage existing items that are "SIMILAR" to the new item; "similarity" is based on content attributes (e.g. actors, genres).  Notice that here we are using a coarser level of granularity (a group of similar items) for prediction, which can be less accurate than a fine-grained model that uses viewing-history statistics for prediction.

In other words, we can make recommendations based on two models of different granularity.  The fine-grained model based on instance-specific history data is preferred because it usually has higher accuracy.  For the cold-start problem, when the new items don't have history data available, we fall back to the coarse-grained model based on other similar items to predict the user's interest in the new items.

A common approach is to combine the two models of different granularity using different weights, where the weights depend on the confidence level of the fine-grained model.  For new items, the fine-grained model will have low confidence, and therefore more weight is given to the coarse-grained model.

However, in case we don't have a coarser level of granularity, or the coarse level is too coarse and doesn't give good predictions, we have to use the fine-grained model to predict.  But how can we build up the instance-specific history for the fine-grained model when we are not sure whether the new items are good recommendations for the user?

Optimization under Uncertainty

The core of our problem is that we need to optimize under uncertainty.  We have two approaches
  1. Exploitation: Make the optimal choice based on current data.  Because of the uncertainty (high variance), the current data may deviate from its true expected value, so we may end up picking a non-optimal choice.
  2. Exploration: Make a random choice, or choices that we haven't made before.  The goal is to gather more data points and reduce the uncertainty.  This may waste cycles that could have been spent picking the optimal choice.
Let's start with a simple multi-armed bandit problem.  There are multiple bandits in a casino, and each bandit has a different probability of winning.  If you knew the true underlying winning probability of each bandit, you would pick the one with the highest winning probability and keep playing that one.
Unfortunately, you don't know the underlying probabilities and have only a limited number of rounds to play.  How would you choose which bandit to play to maximize the total number of rounds you win?

Our strategy should strike a good balance between exploiting and exploring.  To measure how good a strategy is, there is the concept of "regret", which is the gap between the following two quantities
  • The value you would obtain by following the batch-optimal strategy (after you have done batch analysis and have a clear picture of the underlying probability distribution)
  • The value you obtain by following your strategy
We'll pick our strategy to do more exploration initially, when there is a lot of uncertainty, and gradually tune down the proportion of exploration (and lean more on exploitation) as we collect more statistics.

Epsilon-Greedy Strategy 

In the "epsilon-greedy" strategy, at every play we throw a dice between explore and exploit.
With probability p(t) = k/t (where k is a constant and t is the number of tries so far),we pick a bandit randomly with equal chance (regardless of whether the bandit has been picked in the past).  And with probability 1 - p(t), we pick the bandit that has the highest probability of win based on passed statistics.

Epsilon-greedy has the desirable property of spending more time exploring initially and gradually reducing that portion as time passes.  However, it doesn't have a smooth transition between exploring and exploiting.  Also, while it explores, it picks each bandit uniformly without giving more weight to the unexplored bandits.  And while it exploits, it doesn't consider the confidence of the probability estimates.

Upper Confidence Bound: UCB

In the more sophisticated UCB strategy, each bandit is associated with an estimated mean and a confidence interval.  In every play, we choose the bandit whose upper confidence bound (i.e. mean + standard deviation) is the largest.

Initially each bandit has a zero mean and a large confidence interval.  As time goes on, we estimate the mean p[i] of bandit i based on how many times it has won since we started playing it.  We also adjust the confidence interval (reducing the deviation) as we play the bandit,
e.g. the standard deviation is (p*(1-p)/n)^0.5, as in the sketch below.
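A minimal UCB sketch for Bernoulli bandits, using the standard-deviation formula above as the confidence term (the hidden win probabilities are illustrative):

# UCB: play the bandit with the largest (estimated mean + standard deviation).
set.seed(1)
true_p <- c(0.3, 0.5, 0.7)                     # hidden winning probabilities
wins <- rep(0, 3); plays <- rep(0, 3)

for (t in 1:2000) {
  if (any(plays == 0)) {
    arm <- which(plays == 0)[1]                # play each bandit once first
  } else {
    p_hat <- wins / plays
    ucb <- p_hat + sqrt(p_hat * (1 - p_hat) / plays)   # mean + std deviation
    arm <- which.max(ucb)
  }
  reward <- rbinom(1, 1, true_p[arm])          # 1 = win, 0 = lose
  plays[arm] <- plays[arm] + 1
  wins[arm] <- wins[arm] + reward
}
rbind(true_p, estimated_p = wins / plays, plays)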

Notice that the UCB model can be used in a more general online machine learning setting.  We require the machine learning model to be able to output its estimate together with a confidence measure.  As a concrete example, let's say a user is visiting our movie site and we want to recommend a movie to the user, based on a bunch of input features (e.g. user features, query features ... etc.).

We can do a first-round selection (based on information retrieval techniques) to identify movie candidates based on relevancy (i.e. the user's viewing history or search query).  For each movie candidate, we can invoke the ML model to estimate the interest level, as well as a 68% confidence boundary (the confidence level is arbitrary and needs to be hand-tuned; 68% is roughly one standard deviation of a Gaussian distribution).  We then combine them by adding the 68% confidence range as an offset to the estimate, and recommend the movie that has the highest resulting value.

After the recommendation, we monitor whether the user clicks on it, views it ... etc., and the response is fed back to our ML model as new training data.  Our ML model works in an online learning setting and updates itself with this new training data.  The 68% confidence range is reduced over time as more and more data is gathered.

Relationship with A/B Testing

For most web sites, we run experiments continuously to improve the user experience by trying out different layouts, or to improve user engagement by recommending different types of content, or by trying out other things.  In general, we have an objective function that defines what aspect we are trying to optimize, and we run different experiments through A/B testing to try out different combinations of configurations to see which one maximizes our objective function.

When the number of experiments (combinations of different configurations) is small, A/B testing is mainly exploration.  In a typical setting, we use the old user experience as the control and the new user experience as the treatment.  The goal is to test whether the treatment causes any significant improvement over the control.  A certain percentage of production users (typically 5 - 10%) is directed to the new experience, and we measure whether the user engagement level (say this is our objective function) has increased significantly in a statistical sense.  Such splitting is typically done by hashing the user id (or browser cookie), and the range into which the hash code falls determines whether the user gets the new experience.  This hashing is consistent (the same user hashes into the same bucket in subsequent requests), so the user gets a consistent new experience when visiting the web site.

When the number of experiments is large and new experiments come out dynamically and unpredictably, the traditional A/B testing model described above will not be able to keep track of all pairs of control and treatment combinations.  In this case, we need to use the dynamic exploration/exploitation mechanism to find the best user experience.

Using the UCB approach, we can treat each user experience as a bandit that the A/B test framework can choose from.  Throughout the process, the A/B test framework explores/exploits among the different user experiences to optimize the objective function.  At any time, we can query the A/B testing framework to find out the latest statistics of each user experience.  This provides a much better way to look at a large number of experiment results at the same time.