
Wednesday, November 5, 2014

Common data science project flow

Working across multiple data science projects, I have observed a similar pattern across a group of strategic data science projects where a common methodology can be applied.  In this post, I want to sketch this methodology at a high level.

First of all, "data science" itself is a very generic term that means different things to different people.  Many of the projects I have been involved in target a very tactical and specific problem.  However, over the last few years more and more enterprises have started to realize the strategic value of data.  I have observed a growing number of strategic data science projects that start from a very broad scope and take a top-down approach to look at the overall business operation.  Along the way, the enterprise prioritizes the most critical areas within its operation cycle and builds sophisticated models to guide and automate the decision process.

Usually, my engagement starts as a data scientist / consultant with very little (or even no) domain knowledge.  Being unfamiliar with the domain is nothing to be proud of and often slows down the initial discussion.  Therefore, within a squeezed time period I need to quickly learn enough "basic domain knowledge" to keep the discussion smooth.  On the other hand, lacking a preconceived model enables me (or you can say forces me) to look at the problem with a fresh eye, from which I can trim off unnecessary details from the legacies and only focus on those essential elements that contribute to the core part of the data model.  It is also fun to go through the concept-blending process between a data scientist and a domain expert.  I force them to think in my way and they force me to think in their way.  This is by far the most effective way for me to learn any new concept.

Recently I had a discussion with a company that has a small but very sophisticated data science team building pricing and demand-forecasting models for their product line.  I am by no means an expert in their domain.  But their problem (how to predict demand, and how to set price) is general enough across many industries.  Therefore, I will use this problem as an example to illustrate the major steps in the common pattern that I describe above.

Problem Settings

Let's say a car manufacturer starts its quarterly planning process.  Here are some key decisions that need to be made by the management.
  • How many cars should the company produce for next year?
  • What should the price of the cars be?
First of all, we need to identify the ultimate "goal" of these decisions.  Such a goal is usually easy to find as it is usually stated in the company's mission statement.

In this problem, the goal is to ...
maximize: "Profit_2015"

In general, I find it a good starting point to look at the problem from an "optimization" angle, from which we define our goal in terms of an objective function as well as a set of constraints.

Step 1:  Identify variables and define their dependency graph

Build the dependency graph between different variables, starting from the objective function.  Separate the decision variables (where you have control) from the environment variables (where you have no control).

As an illustrative example, we start from our objective function "Profit_2015" and define the dependency relationships below.  The decision variables are the ones we control (such as Price_2015 and UnitProduced_2015).

Profit_2015 = F(UnitSold_2015, UnitProduced_2015, Price_2015, Cost_2015)
UnitSold_2015 = G(Supply_2015, Demand_2015, Price_2015, CompetitorPrice_2015)
Demand_2015 = H(GDP_2014, PhoneSold_2014)
GDP_2015 = T(GDP_2014, GDP_2013, GDP_2012, GDP_2011 ...)
...

Identifying these variables and their potential dependencies typically relies on well-studied theory or on domain experts in the industry.  At this stage, we don't need to know the exact formulas of the functions F/G/H.  We only need to capture the links between the variables.  It is also ok to include a link that shouldn't exist (i.e. there is no relationship between the two variables in reality).  However, it is not good if we miss a link (i.e. fail to capture a strong, existing dependency).

This round usually involves 4 to 5 half-day brainstorming sessions with the domain experts, facilitated by the data scientist/consultant who is familiar with the model-building process.  There may be additional investigation and background studies if subject matter experts are not available.  Starting from scratch, this round can take somewhere between a couple of weeks and a couple of months.

Step 2:  Define the dependency function

In this round, we want to identify the relationship between variables by specifying the formulas of F(), G(), H().


Well-Known Function
For relationships that are well studied, we can use a known mathematical model.

For example, in the relationship
Profit_2015 = F(UnitSold_2015, UnitProduced_2015, Price_2015, Cost_2015)

We can use the following mathematical formula in a very straightforward manner
Profit = (UnitSold * Price) - (UnitProduced * Cost)


Semi-Known Function
However, some relationships are not as straightforward as that.  For those relationships where we don't know the exact formula, but can make a reasonable assumption about its shape, we can assume the relationship follows a family of models (e.g. linear, quadratic ... etc.), and then figure out the optimal parameters that best fit the historical data.

For example, in the relationship
Demand_2015 = H(GDP_2014, PhoneSold_2014)

Let's assume the "demand" is a linear combination of "GDP" and "phones sold", which seems to be a reasonable assumption.

For the linear model we assume
Demand = w0 + (w1 * GDP) + (w2 * PhoneSold)

Then we feed the historical training data to build a linear regression model and figure out what the best-fit values of w0, w1, w2 should be.
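As a minimal sketch in R (the data frame and its numbers below are hypothetical, purely for illustration), the fit boils down to a call to lm():

# Hypothetical historical data: one row per year
history <- data.frame(
  demand     = c(1.20, 1.35, 1.10, 1.50, 1.65, 1.72),  # demand in millions of cars
  gdp        = c(2.1, 2.3, 2.0, 2.6, 2.8, 3.0),        # GDP of the previous year
  phone_sold = c(10, 12, 11, 15, 17, 18))              # phones sold in the previous year (millions)

# Fit Demand = w0 + (w1 * GDP) + (w2 * PhoneSold)
fit <- lm(demand ~ gdp + phone_sold, data=history)
coef(fit)    # the fitted w0, w1, w2

# Predict Demand_2015 from GDP_2014 and PhoneSold_2014
predict(fit, newdata=data.frame(gdp=2.9, phone_sold=19))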


Time-Series Function
In some cases, a variable depends only on its own past values and not on other variables.  Here we can train a time-series model to predict the variable based on its own history.  Typically, the model is decomposed into 3 components: noise, trend and seasonality.  One popular approach is to use exponential smoothing techniques such as the Holt/Winters model.  Another popular approach is to use the ARIMA model, which decomposes the value into an "auto-regression" part and a "moving-average" part.

For example, in the relationship
GDP_2015 = T(GDP_2014, GDP_2013, GDP_2012, GDP_2011 ...)

We can train a time-series model to learn the relationship between the historical values and the future value.
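As a minimal sketch in R (the GDP numbers are made up purely for illustration), we can fit an ARIMA model on the historical series and forecast the next value:

# Hypothetical annual GDP series (illustrative numbers only)
gdp <- ts(c(2.0, 2.1, 2.0, 2.2, 2.3, 2.5, 2.4, 2.6, 2.8, 2.7, 2.9, 3.0, 3.1, 3.0, 3.2), start=2000)

# Fit an ARIMA model (auto-regression + moving-average on the differenced series)
fit <- arima(gdp, order=c(1, 1, 1))

# Forecast the next year's GDP (i.e. GDP_2015 from the history up to 2014)
predict(fit, n.ahead=1)$pred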


Completely Unknown Function
But if we cannot even assume the model family, we can consider using a "k nearest neighbor" approach to interpolate the output from its inputs.  We need to define the "distance function" between data points based on domain knowledge and also figure out what the optimal value of k should be.  In many cases, using a weighted average of the k nearest neighbors is a good interpolation.

For example, in the relationship
UnitSold_2015 = G(Supply_2015, Demand_2015, Price_2015, CompetitorPrice_2015)
It is unclear what model should be used to represent UnitSold as a function of Supply, Demand, Price and CompetitorPrice.  So we go with a nearest-neighbor approach.

Based on the monthly sales of the past 3 years, we can use "Euclidean distance" (after scaling each input to a comparable range by subtracting its mean and dividing by its standard deviation) to find the closest 5 neighbors, and then use their weighted average to predict the units sold.
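A minimal sketch of this nearest-neighbor interpolation in base R (all column names and data below are hypothetical):

# history: hypothetical monthly records with the inputs plus the observed unit_sold
knn_predict <- function(history, query, k=5) {
  cols <- c("supply", "demand", "price", "competitor_price")
  mu   <- colMeans(history[, cols])
  sdev <- apply(history[, cols], 2, sd)
  # Scale history and query to a comparable range (subtract mean, divide by sd)
  h_scaled <- scale(history[, cols], center=mu, scale=sdev)
  q_scaled <- (unlist(query[1, cols]) - mu) / sdev
  # Euclidean distance from the query to every historical record
  d  <- sqrt(rowSums(sweep(h_scaled, 2, q_scaled)^2))
  nn <- order(d)[1:k]
  # Weighted average of the k nearest neighbors (weight = inverse distance)
  w <- 1 / (d[nn] + 1e-9)
  sum(w * history$unit_sold[nn]) / sum(w)
}

# Example usage with made-up monthly data for the past 3 years
history <- data.frame(supply=runif(36, 80, 120), demand=runif(36, 70, 130),
                      price=runif(36, 20, 30),   competitor_price=runif(36, 20, 30),
                      unit_sold=runif(36, 60, 110))
query <- data.frame(supply=100, demand=95, price=25, competitor_price=26)
knn_predict(history, query, k=5)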
 

Step 3: Optimization

At this point, we have the following defined
  • A goal defined by maximizing (or minimizing) an objective function
  • A set of variables (including the decision and environment variables)
  • A set of functions that define how these variables are inter-related to each other.  Some of them are defined by a mathematical formula and some of them are defined as a black box (based on a predictive model)
Our goal is to figure out what the decision variables (which we control) should be set to so that the objective function is optimized (maximized or minimized).

Determine the value of environment variables
For those environment variables that have no dependencies on other variables, we can acquire their values from external data sources.  For those environment variables that depend on other environment variables (but not decision variables), we can estimate their values using the corresponding dependency functions (of course, we need to estimate all their depending variables first).  For those environment variables that depend (directly or indirectly) on decision variables, leave them as undefined.

Determine the best value of decision variables
Once we have formulated the dependency functions, we can employ different optimization methods depending on the form of these functions.  Here is how I choose the appropriate method based on the formulation of the dependency functions.



Additional Challenges

To summarize, we have followed the process below
  1. Define an objective function, constraints, decision variables and environment variables
  2. Identify the relationship between different variables
  3. Collect or predict those environment variables
  4. Optimize those decision variables based on the objective functions
  5. Return the optimal value of decision variables as the answer
So far, our dependency graph is acyclic and our decisions won't affect the underlying variables.  Although this is reasonably true if the enterprise is an insignificant, small player in the market, it is no longer true if the enterprise is one of the few major players.  For example, its pricing strategy may cause competitors to change their own pricing strategy as well.  And how the competitors will react is less predictable, so historical data plays a less important role here.  At some point, human judgement will get involved to fill the gaps.



Sunday, July 27, 2014

Incorporate domain knowledge into predictive model

As data scientists / consultants, in many cases we are called in to work with domain experts who have in-depth business knowledge of the industry setting.  The main objective is to help our clients validate and quantify the intuition behind existing domain knowledge based on empirical data, and remove any judgement bias.  In many cases, customers will also want to build a predictive model to automate their business decision-making process.

To create a predictive model, feature engineering (defining the set of inputs) is a key part, if not the most important one.  In this post, I'd like to share my experience in how to come up with the initial set of features and how to evolve it as we learn more.

Firstly, we need to acknowledge two forces in this setting
  1. Domain experts tend to be narrowly focused (and potentially biased) towards their prior experience.  Their domain knowledge can usually be encoded in terms of "business rules" and tends to be simple and obvious (if a rule is too complex and hidden, the human brain is not good at picking it up).
  2. Data scientists tend to be less biased and good at mining through a large set of signals to determine how relevant they are in an objective and quantitative manner.  Unfortunately, raw data rarely gives strong signals.  And lacking the domain expertise, data scientists alone will not even be able to come up with a good set of features (which usually requires derivation from combinations of raw data).  Notice that trying out all combinations is impractical because there are an infinite number of ways to combine raw data.  Also, when you have too many features in the input, the training data will not be enough, resulting in a model with high variance.
Maintaining a balance between these forces is a critical success factor of many data science projects.

The best project setting (in my opinion) is to let the data scientist take control of the whole exercise (as less bias is an advantage) while being guided by input from the domain experts.

Indicator Feature

This is a binary variable based on a very specific boolean condition (i.e. true or false) that the domain expert believes to be highly indicative of the output.  For example, for predicting stocks, one indicator feature is whether the stock has dropped more than 15% in a day.

Notice that indicator features can be added at any time once a new boolean condition is discovered by the domain expert.  Indicator features don't need to be independent of each other and in fact most of the time they are highly inter-correlated.

After fitting these indicator features into the predictive model, we can see how much influence each of these features exerts on the final prediction, and hence provide feedback to the domain experts about the strength of these signals.

Derived Feature

This is a numeric variable (i.e. a quantity) that the domain expert believes to be important for predicting the output.  The idea is the same as the indicator feature except that it is numeric in nature.

Expert Stacking

Here we build a predictive model whose input features are taken from each expert's prediction output.  For example, to predict a stock, our model takes 20 analysts' predictions as its input.

The strength of this approach is that it can incorporate domain expertise very easily because it treats the experts as black boxes (without needing to understand their logic).  The model we train will take into account the relative accuracy of each expert's prediction and adjust its weighting accordingly.  On the other hand, one weakness is the reliance on domain expertise at prediction time, which may or may not be available in an ongoing manner.
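A minimal sketch of expert stacking in R (all data below is hypothetical; each expert column holds one analyst's prediction and outcome is what actually happened):

# Hypothetical training data: 3 experts' predictions plus the actual outcome
train <- data.frame(expert1 = c(10, 12,  9, 14, 11, 13),
                    expert2 = c(12, 11, 10, 15, 10, 14),
                    expert3 = c( 9, 13,  8, 12, 12, 13),
                    outcome = c(10, 12,  9, 13, 11, 14))

# The stacking model learns how much weight to give to each expert
stack <- lm(outcome ~ expert1 + expert2 + expert3, data=train)
coef(stack)

# At prediction time we still need each expert's prediction as input
predict(stack, newdata=data.frame(expert1=12, expert2=13, expert3=11))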

Saturday, November 9, 2013

Diverse recommender

This is a continuation of my previous blog on recommendation systems, which describes some basic algorithms for building them.  Those techniques evaluate each item against the user's interest independently and pick the top K items to construct the recommendation.  However, this approach suffers from a lack of diversity.  For example, the list may contain the same book in soft cover, hard cover, and Kindle versions.  Since a human's interests are usually diverse, a better recommendation list should contain items that cover a broad spectrum of the user's interests, even though each element by itself is not the most aligned with them.

In this post, I will discuss a recommendation algorithm that considers diversity in its list of recommendations.

Topic Space

First of all, let's define a "topic space" that both the content and the user will be mapped to.  Having a "topic space" is a common approach in recommendation because it reduces dimensionality, resulting in better system performance and improved generality.

The set of topics in the topic space can be extracted algorithmically using text mining techniques such as LDA, but for simplicity here we use a manual approach to define the topic space (topics should be orthogonal to each other, as highly correlated topics can distort the measures).  Let's say we have the following topics defined ...
  • Romance
  • Sports
  • Politics
  • Weather
  • Horror
  • ...

Content as Vector of topic weights

Once the topic space is defined, the content author can assign topic weights to each piece of content.  For example, a movie can be assigned genres and a web page can be assigned topics as well.  Notice that a single piece of content can be assigned multiple topics with different weights.  In other words, each piece of content can be described as a vector of topic weights.

User as Vector of topic weights

On the other hand, the user can also be represented as a vector of topic weights based on their interaction with content, such as viewing a movie, visiting a web page, buying a product ... etc.  Such an interaction can have a positive or negative effect depending on whether the user likes or dislikes the content.  If the user likes the content, the corresponding topic weights in the user vector are increased by multiplying by alpha (alpha > 1).  If the user dislikes the content, the corresponding topic weights are divided by alpha.  After each update, the user vector is normalized back to a unit vector.
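A minimal sketch of this update rule in R (the topic vectors and alpha are made up for illustration):

# Topic space: Romance, Sports, Politics, Weather, Horror
user    <- c(0.5, 0.3, 0.4, 0.1, 0.7)   # current user vector
content <- c(0, 1, 0, 0, 1)             # topics assigned to the content just consumed

update_user <- function(user, content, like, alpha=1.5) {
  m <- if (like) alpha else 1 / alpha
  # Multiply the weights of the content's topics by alpha if liked, divide if disliked
  user <- user * ifelse(content > 0, m, 1)
  user / sqrt(sum(user^2))              # re-normalize to a unit vector
}

user <- update_user(user, content, like=TRUE)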

Diversifying the recommendation

We use a utility function to model the diversity of the selected documents and then maximize that utility function.

[Figure: definition of the utility function over the selected document set A]

In practice, A is not computed from the full set of documents, which is usually huge.  The full set of documents is typically indexed using some kind of inverted-index technology with the set of topics as keywords, where each c[j,k] is represented as a tf-idf weight.

The user is represented as a "query" and is sent to the inverted index as a search request.  Relevant documents (based on a cosine distance measure w.r.t. the user) are returned as the candidate set D (e.g. the top 200 relevant pieces of content).

To pick the optimal set A out of D, we use a greedy approach as follows (a small sketch in R appears after the list) ...
  1. Start with an empty set A
  2. Repeat until |A| reaches the desired size H
  • Pick the doc i from D whose addition to the set A maximizes the utility function
  • Add doc i to A and remove doc i from D
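A minimal sketch of this greedy loop in R; utility() here is a hypothetical placeholder for whatever diversity-aware utility function is chosen, not the exact formula referenced above:

# D: candidate documents as a matrix of topic weights (one row per document)
greedy_select <- function(D, H, utility) {
  A <- integer(0)                          # indices of the selected documents
  remaining <- seq_len(nrow(D))
  while (length(A) < H && length(remaining) > 0) {
    # Pick the document whose addition maximizes the utility of the selected set
    gains <- sapply(remaining, function(i) utility(D[c(A, i), , drop=FALSE]))
    best  <- remaining[which.max(gains)]
    A <- c(A, best)
    remaining <- setdiff(remaining, best)
  }
  A
}

# Example: a toy utility that rewards covering many topics (purely illustrative)
toy_utility <- function(rows) sum(apply(rows, 2, max))
D <- matrix(runif(200 * 5), ncol=5)        # 200 candidate documents, 5 topics
picked <- greedy_select(D, H=10, utility=toy_utility)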

Saturday, September 7, 2013

Exploration and Exploitation

"Cold Start" is a common problem happens quite frequent in recommendation system.  When a new item enters, there is no prior history that the recommendation system can use.  Since recommendation is an optimization engine which recommends item that matches the best with the user's interests.  Without prior statistics, the new item will hardly be picked up as recommendation and hence continuously lacking the necessary statistics that the recommendation system can use.

One example is movie recommendation, where a movie site recommends movies to users based on their past viewing history.  When a new movie arrives in the market, there aren't enough viewing statistics about it, so the new movie will not have a strong match score and won't be picked as a recommendation.  Because we never learn from items that we haven't recommended, the new movie will continue to have no statistics and therefore will never be picked in future recommendations.

Another cold start example is online Ad serving when a new Ad enters the Ad repository.

Multilevel Granularity Prediction

One solution to the cold-start problem is to leverage existing items that are "similar" to the new item, where "similarity" is based on content attributes (e.g. actors, genres).  Notice that here we are using a coarser level of granularity (a group of similar items) for prediction, which can be less accurate than a fine-grain model that uses the viewing-history statistics for prediction.

In other words, we can make recommendations based on two models of different granularity.  The fine-grain model based on instance-specific history data is preferred because it usually has higher accuracy.  For the cold-start problem, when new items don't have history data available, we fall back to the coarse-grain model based on other similar items to predict the user's interest in the new items.

A common approach is to combine both models of different granularity using weights, where the weights depend on the confidence level of the fine-grain model.  For new items, the fine-grain model will have low confidence and therefore more weight is given to the coarse-grain model.
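A minimal sketch of such a blend in R, where (as an assumption of this sketch) the confidence of the fine-grain model is taken to grow with the number of historical observations n for the item:

# Blend the two scores: weight the fine-grain score by its confidence
blend_score <- function(fine_score, coarse_score, n, n0=50) {
  w <- n / (n + n0)        # confidence of the fine-grain model (0 when n = 0)
  w * fine_score + (1 - w) * coarse_score
}

blend_score(fine_score=0.8, coarse_score=0.6, n=0)    # new item: falls back to the coarse model
blend_score(fine_score=0.8, coarse_score=0.6, n=500)  # mature item: mostly the fine-grain model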

However, in some cases we don't have a coarser level of granularity, or the coarse level is too coarse and doesn't give good predictions.  Then we have to use the fine-grain model to predict.  But how can we build up the instance-specific history for the fine-grain model when we are not sure whether the new items are good recommendations for the user?

Optimization under Uncertainty

The core of our problem is that we need to optimize under uncertainty.  We have two approaches
  1. Exploitation: Make the most optimal choice based on current data.  Because of uncertainty (high variation), the current data may deviate from its true expected value, so we may end up picking a non-optimal choice.
  2. Exploration: Make a random choice or choices that we haven't made before.  The goal is to gather more data points and reduce the uncertainty.  This may waste cycles that could have been spent on the optimal choice.
Let's start with the simple multi-armed bandit problem.  There are multiple bandits in a casino, each with a different probability of winning.  If you knew the true underlying winning probability of each bandit, you would pick the one with the highest winning probability and keep playing that one.
Unfortunately, you don't know the underlying probabilities and have only a limited number of rounds to play.  How would you choose which bandit to play to maximize the total number of rounds you win?

Our strategy should strike a good balance between exploiting and exploring.  To measure how good a strategy is, there is the concept of "regret", which compares the following two quantities
  • The value you obtain by following the batch-optimal strategy (after you have done the batch analysis and have a clear picture of the underlying probability distribution)
  • The value you obtain by following your strategy
We pick our strategy to do more exploration initially, when there is a lot of uncertainty, and gradually tune down the ratio of exploration (and lean more on exploitation) as we collect more statistics.

Epsilon-Greedy Strategy 

In the "epsilon-greedy" strategy, at every play we throw a dice between explore and exploit.
With probability p(t) = k/t (where k is a constant and t is the number of tries so far),we pick a bandit randomly with equal chance (regardless of whether the bandit has been picked in the past).  And with probability 1 - p(t), we pick the bandit that has the highest probability of win based on passed statistics.

Epsilon-greedy has the desirable property of spending more time exploring initially and gradually reducing that portion as time passes.  However, it doesn't have a smooth transition between explore and exploit.  Also, while it explores, it picks each bandit uniformly without giving more weight to the unexplored bandits.  And while it exploits, it doesn't consider the confidence of the probability estimation.
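A minimal sketch of the epsilon-greedy strategy in R (the winning probabilities are made up; p(t) = k/t follows the description above):

set.seed(1)
true_p <- c(0.3, 0.5, 0.7)         # hidden winning probability of each bandit
wins <- plays <- rep(0, length(true_p))
k <- 5                             # constant in p(t) = k/t

for (t in 1:1000) {
  if (runif(1) < min(1, k / t)) {
    i <- sample(length(true_p), 1)            # explore: pick a bandit at random
  } else {
    i <- which.max(wins / pmax(plays, 1))     # exploit: best empirical win rate
  }
  reward   <- rbinom(1, 1, true_p[i])
  plays[i] <- plays[i] + 1
  wins[i]  <- wins[i] + reward
}

wins / pmax(plays, 1)   # empirical win-rate estimates; most plays should go to the best bandit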

Upper Confidence Bound: UCB

In the more sophisticated UCB strategy, each bandit is associated with an estimated mean and a confidence interval.  In every play, we choose the bandit whose upper confidence bound (i.e. mean + standard deviation) is the largest.

Initially each bandit has a zero mean and a large confidence interval.  As time goes on, we estimate the mean p[i] of bandit i based on how many times it has won among the times we have played it.  We also tighten the confidence interval (reducing the deviation) as we play the bandit,
e.g. the standard deviation is (p*(1-p)/n)^0.5
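A minimal sketch of this UCB rule in R, using the mean plus one standard deviation as the bound.  A small Laplace-style smoothing of the estimate is added (an assumption of this sketch, not part of the description above) so that the deviation never collapses to zero after an unlucky first play:

set.seed(1)
true_p <- c(0.3, 0.5, 0.7)         # hidden winning probability of each bandit
wins <- plays <- rep(0, length(true_p))

for (t in 1:1000) {
  n     <- plays + 2               # smoothed counts (Laplace-style, see note above)
  p_hat <- (wins + 1) / n          # smoothed estimate of each bandit's mean
  ucb   <- p_hat + sqrt(p_hat * (1 - p_hat) / n)   # mean + standard deviation
  i <- which.max(ucb)              # play the bandit with the largest upper bound
  reward   <- rbinom(1, 1, true_p[i])
  plays[i] <- plays[i] + 1
  wins[i]  <- wins[i] + reward
}

plays    # most of the plays should end up on the bandit with the highest true probability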

Notice that the UCB model can be used in a more general online machine learning setting.  We require the machine learning model to be able to output its estimation together with a confidence parameter.  As a concrete example, let's say a user is visiting our movie site and we want to recommend a movie to the user, based on a bunch of input features (e.g. user features, query features ... etc.).

We can do a first round of selection (based on information retrieval techniques) to identify movie candidates based on relevancy (i.e. the user's viewing history or search query).  For each movie candidate, we can invoke the ML model to estimate the interest level, as well as the 68% confidence boundary (the confidence level is arbitrary and needs to be hand-tuned; 68% is roughly one standard deviation of a Gaussian distribution).  We then combine them by adding the 68% confidence range as an offset to the estimation and recommend the movie that has the highest resulting value.

After the recommendation, we monitor whether the user clicks on it, views it ... etc., and the response is fed back to our ML model as new training data.  Our ML model operates in an online learning setting and will update itself with this new training data.  The 68% confidence range will shrink as more and more data is gathered.

Relationship with A/B Testing

For most web sites, we run experiments continuously to improve the user experience by trying out different layouts, or to improve user engagement by recommending different types of content, or by trying out other variations.  In general, we have an objective function that defines what aspect we are trying to optimize, and we run different experiments through A/B testing to try out different combinations of configurations to see which one maximizes our objective function.

When the number of experiments (combinations of different configurations) is small, A/B testing is mainly about exploration.  In a typical setting, we use the old user experience as the control and the new user experience as the treatment.  The goal is to test whether the treatment causes any significant improvement over the control.  A certain percentage of production users (typically 5 - 10%) will be directed to the new experience, and we measure whether the user engagement level (say this is our objective function) has increased significantly in a statistical sense.  Such splitting is typically done by hashing the user id (or browser cookie), and the range into which the hash code falls determines whether the user gets the new experience.  This hashing is consistent (the same user will hash into the same bucket in subsequent requests) and so the user will consistently see the same experience when visiting the web site.

When the number of experiments is large, and new experiments come out dynamically and unpredictably, the traditional A/B testing model described above will not be able to keep track of all the control/treatment combinations.  In that case, we need to use the dynamic exploration/exploitation mechanism to find the best user experience.

Using the UCB approach, we can treat each user experience as a bandit that the A/B test framework can choose from.  Throughout the process, the A/B test framework will explore/exploit among the different user experiences to optimize the objective function.  At any time, we can query the A/B testing framework to find out the latest statistics of each user experience.  This provides a much better way to look at a large number of experiment results at the same time.

Saturday, August 17, 2013

Six steps in data science

"Data science" and "Big data" has become a very hot term in last 2 years.  Since the data storage price dropped significantly, enterprises and web companies has collected huge amount of their customer's behavioral data even before figuring out how to use them.  Enterprises also start realizing that these data, if use appropriately, can turn into useful insight that can guide its business along a more successful path.

From my observation, the effort of data science can be categorized roughly into the following areas.
  • Data acquisition
  • Data visualization
  • OLAP, Report generation
  • Response automation
  • Predictive analytics
  • Decision optimization

Data Acquisition

Data acquisition is about collecting data into the system.  This is an important step before any meaningful processing or analysis can begin.  Most companies start with collecting business transaction records from their OLTP system.  Typically there is an ETL (Extract/Transform/Load) process involved to ingest the raw data, clean the data and transform it appropriately.  Finally, the data is loaded into a data warehouse where the subsequent data analytics exercises are performed.  In today's big data world, the destination has shifted from the traditional data warehouse to the Hadoop distributed file system (HDFS).

Data Visualization

Data visualization is usually the first step in analyzing the data.  It is typically done by plotting the data in different ways to get a quick sense of its shape, in order to guide the data scientist in determining what subsequent analysis should be conducted.  This is also where a more experienced data scientist can be distinguished from a less experienced one, based on how fast they can spot common patterns or anomalies in the data.  Most plotting packages work only for data that fits into a single machine.  Therefore, in the big data world, data sampling is typically conducted first to reduce the data size, and the sample is then imported to a single machine where R, SAS, or SPSS can be used to visualize it.

OLAP / Reporting

OLAP is about aggregating transaction data (e.g. revenue) along different dimensions (e.g. month, location, product) against which the enterprise defines the KPIs/business metrics that measure the company's performance.  This can be done either in an ad-hoc manner (OLAP) or in a predefined manner (report templates).  Report writers (e.g. Tableau, Microstrategy) are used to produce the reports.  The data is typically stored in a regular RDBMS or a multidimensional cube which is optimized for OLAP processing (i.e. slice, dice, rollup, drilldown).  In the big data world, Hive provides a SQL-like access mechanism and is commonly used to access data stored in HDFS.  Most popular report writers have integrated with Hive (or declared plans to integrate with it) to access big data stored in HDFS.

Response Automation

Response automation is about leveraging domain-specific knowledge to encode a set of "rules", each made of an event, a condition and an action.  The system monitors all observed events, matches them against the conditions (which can be a boolean expression over event attributes, or a sequence of event occurrences), and triggers the appropriate actions.  In the big data world, automating such responses is typically done by stream processing mechanisms (such as Flume and Storm).  Notice that the "rules" need to be well-defined and unambiguous.

Predictive Analytics

Prediction is about estimating unknown data based on observed data through a statistical/probabilistic approach.  Depending on the data type of the output, "prediction" can be subdivided into "classification" (when the output is a category) or "regression" (when the output is a number).

Prediction is typically done by first "training" a predictive model using historical data (where all input and output values are known).  This training is done via an iterative process, where the performance of the model is measured at the end of each iteration.  Additional input data or different model parameters will be used in the next round of iteration.  When the predictive performance is good enough and no significant improvement is made between subsequent iterations, the process stops and the best model created during the process is used.

Once we have the predictive model, we can use it to predict information we haven't observed, whether it is information that is hidden or that hasn't happened yet (i.e. predicting the future).

For a more detailed description of what is involved in performing predictive analytics, please refer to my blog post on that topic.

Decision Optimization

Decision optimization is about making the best decision after carefully evaluating the possible options against some measure of business objectives.  The business objective is defined by an "objective function", which is expressed as a mathematical formula over some "decision variables".  Through various optimization techniques, the system figures out what the decision variables should be in order to maximize (or minimize) the value of the objective function.  Optimization over discrete decision variables is typically done using exhaustive search, greedy/heuristic search, or integer programming techniques, while optimization over continuous decision variables is done using linear/quadratic programming.

From my observation, "decision optimization" is the end goal of most data science effort.  But "decision optimization" relies on previous effort.  To obtain the overall picture (not just observed data, but also hidden or future data) at the optimization phase, we need to make use of the output of the prediction.  And in order to train a good predictive model, we need to have the data acquisition to provide clean data.  All these efforts are inter-linked with each other in some sense.

Monday, July 29, 2013

OLAP operation in R

OLAP (Online Analytical Processing) is a very common way to analyze raw transaction data by aggregating along different combinations of dimensions.  This is a well-established field in Business Intelligence / Reporting.  In this post, I will highlight the key ideas in OLAP operation and illustrate how to do this in R.

Facts and Dimensions

The core part of OLAP is the so-called "multi-dimensional data model", which contains two types of tables: the "fact" table and the "dimension" table.

A fact table contains records, each describing an instance of a transaction.  Each transaction record contains categorical attributes (which describe contextual aspects of the transaction, such as space, time, user) as well as numeric attributes (called "measures", which describe quantitative aspects of the transaction, such as number of items sold or dollar amount).

A dimension table contains records that further elaborate the contextual attributes, such as user profile data, location details ... etc.

In a typical setting of Multi-dimensional model ...
  • Each fact table contains foreign keys that reference the primary keys of multiple dimension tables.  In its simplest form, this is called a STAR schema.
  • Dimension tables can contain foreign keys that reference other dimension tables.  This provides a more detailed breakdown of the contextual aspects and is also called a SNOWFLAKE schema.
  • Although this is not a hard rule, fact tables tend to be independent of other fact tables and usually don't contain reference pointers to each other.
  • However, different fact tables usually share the same set of dimension tables.  This is also called a GALAXY schema.
  • But it is a hard rule that a dimension table NEVER points to / references a fact table.
A simple STAR schema is shown in the following diagram.


Each dimension can also be hierarchical so that the analysis can be done at different degrees of granularity.  For example, the time dimension can be broken down into days, weeks, months, quarters and years; similarly, the location dimension can be broken down into countries, states, cities ... etc.

Here we first create a sales fact table that records each sales transaction.

# Setup the dimension tables

state_table <- 
  data.frame(key=c("CA", "NY", "WA", "ON", "QU"),
             name=c("California", "new York", "Washington", "Ontario", "Quebec"),
             country=c("USA", "USA", "USA", "Canada", "Canada"))

month_table <- 
  data.frame(key=1:12,
            desc=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),
            quarter=c("Q1","Q1","Q1","Q2","Q2","Q2","Q3","Q3","Q3","Q4","Q4","Q4"))

prod_table <- 
  data.frame(key=c("Printer", "Tablet", "Laptop"),
            price=c(225, 570, 1120))

# Function to generate the Sales table
gen_sales <- function(no_of_recs) {

  # Generate transaction data randomly
  loc <- sample(state_table$key, no_of_recs, 
                replace=T, prob=c(2,2,1,1,1))
  time_month <- sample(month_table$key, no_of_recs, replace=T)
  time_year <- sample(c(2012, 2013), no_of_recs, replace=T)
  prod <- sample(prod_table$key, no_of_recs, replace=T, prob=c(1, 3, 2))
  unit <- sample(c(1,2), no_of_recs, replace=T, prob=c(10, 3))
  amount <- unit*prod_table[prod,]$price

  sales <- data.frame(month=time_month,
                      year=time_year,
                      loc=loc,
                      prod=prod,
                      unit=unit,
                      amount=amount)

  # Sort the records by time order
  sales <- sales[order(sales$year, sales$month),]
  row.names(sales) <- NULL
  return(sales)
}

# Now create the sales fact table
sales_fact <- gen_sales(500)

# Look at a few records
head(sales_fact)

  month year loc   prod unit amount
1     1 2012  NY Laptop    1    225
2     1 2012  CA Laptop    2    450
3     1 2012  ON Tablet    2   2240
4     1 2012  NY Tablet    1   1120
5     1 2012  NY Tablet    2   2240
6     1 2012  CA Laptop    1    225


Multi-dimensional Cube

Now, we turn this fact table into a hypercube with multiple dimensions.  Each cell in the cube represents an aggregate value for a unique combination of dimension values.


 
# Build up a cube
revenue_cube <- 
    tapply(sales_fact$amount, 
           sales_fact[,c("prod", "month", "year", "loc")], 
           FUN=function(x){return(sum(x))})

# Showing the cells of the cube
revenue_cube

, , year = 2012, loc = CA

         month
prod         1    2     3    4    5    6    7    8    9   10   11   12
  Laptop  1350  225   900  675  675   NA  675 1350   NA 1575  900 1350
  Printer   NA 2280    NA   NA 1140  570  570  570   NA  570 1710   NA
  Tablet  2240 4480 12320 3360 2240 4480 3360 3360 5600 2240 2240 3360

, , year = 2013, loc = CA

         month
prod         1    2    3    4    5    6    7    8    9   10   11   12
  Laptop   225  225  450  675  225  900  900  450  675  225  675 1125
  Printer   NA 1140   NA 1140  570   NA   NA  570   NA 1140 1710 1710
  Tablet  3360 3360 1120 4480 2240 1120 7840 3360 3360 1120 5600 4480

, , year = 2012, loc = NY

         month
prod         1     2    3    4    5    6    7    8    9   10   11   12
  Laptop   450   450   NA   NA  675  450  675   NA  225  225   NA  450
  Printer   NA  2280   NA 2850  570   NA   NA 1710 1140   NA  570   NA
  Tablet  3360 13440 2240 2240 2240 5600 5600 3360 4480 3360 4480 3360

, , year = 2013, loc = NY

.....

dimnames(revenue_cube)

$prod
[1] "Laptop"  "Printer" "Tablet" 

$month
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"

$year
[1] "2012" "2013"

$loc
[1] "CA" "NY" "ON" "QU" "WA"


OLAP Operations

Here are some common operations of OLAP
  • Slice
  • Dice
  • Rollup
  • Drilldown
  • Pivot
"Slice" is about fixing certain dimensions to analyze the remaining dimensions.  For example, we can focus in the sales happening in "2012", "Jan", or we can focus in the sales happening in "2012", "Jan", "Tablet".

# Slice
# cube data in Jan, 2012
revenue_cube[, "1", "2012",]

         loc
prod        CA   NY   ON   QU   WA
  Laptop  1350  450   NA  225  225
  Printer   NA   NA   NA 1140   NA
  Tablet  2240 3360 5600 1120 2240
 
# cube data for Tablet in Jan, 2012
revenue_cube["Tablet", "1", "2012",]

  CA   NY   ON   QU   WA 
2240 3360 5600 1120 2240 


 "Dice" is about limited each dimension to a certain range of values, while keeping the number of dimensions the same in the resulting cube.  For example, we can focus in sales happening in [Jan/ Feb/Mar, Laptop/Tablet, CA/NY].

revenue_cube[c("Tablet","Laptop"), 
             c("1","2","3"), 
             ,
             c("CA","NY")]


, , year = 2012, loc = CA

        month
prod        1    2     3
  Tablet 2240 4480 12320
  Laptop 1350  225   900

, , year = 2013, loc = CA

        month
prod        1    2    3
  Tablet 3360 3360 1120
  Laptop  225  225  450

, , year = 2012, loc = NY

        month
prod        1     2    3
  Tablet 3360 13440 2240
  Laptop  450   450   NA

, , year = 2013, loc = NY

        month
prod        1    2    3
  Tablet 3360 4480 6720
  Laptop  450   NA  225


"Rollup" is about applying an aggregation function to collapse a number of dimensions.  For example, we want to focus in the annual revenue for each product and collapse the location dimension (ie: we don't care where we sold our product). 

apply(revenue_cube, c("year", "prod"),
      FUN=function(x) {return(sum(x, na.rm=TRUE))})


      prod
year   Laptop Printer Tablet
  2012  22275   31350 179200
  2013  25200   33060 166880
 


"Drilldown" is the reverse of "rollup" and applying an aggregation function to a finer level of granularity.  For example, we want to focus in the annual and monthly revenue for each product and collapse the location dimension (ie: we don't care where we sold our product).

apply(revenue_cube, c("year", "month", "prod"), 
      FUN=function(x) {return(sum(x, na.rm=TRUE))})


, , prod = Laptop

      month
year      1    2    3    4    5    6    7    8    9   10   11   12
  2012 2250 2475 1575 1575 2250 1800 1575 1800  900 2250 1350 2475
  2013 2250  900 1575 1575 2250 2475 2025 1800 2025 2250 3825 2250

, , prod = Printer

      month
year      1    2    3    4    5    6    7    8    9   10   11   12
  2012 1140 5700  570 3990 4560 2850 1140 2850 2850 1710 3420  570
  2013 1140 4560 3420 4560 2850 1140  570 3420 1140 3420 3990 2850

, , prod = Tablet

      month
year       1     2     3     4     5     6     7     8     9    10    11    12
  2012 14560 23520 17920 12320 10080 14560 13440 15680 25760 12320 11200  7840
  2013  8960 11200 10080  7840 14560 10080 29120 15680 15680  8960 12320 22400



"Pivot" is about analyzing the combination of a pair of selected dimensions.  For example, we want to analyze the revenue by year and month.  Or we want to analyze the revenue by product and location.

apply(revenue_cube, c("year", "month"), 
      FUN=function(x) {return(sum(x, na.rm=TRUE))})


      month
year       1     2     3     4     5     6     7     8     9    10    11    12
  2012 17950 31695 20065 17885 16890 19210 16155 20330 29510 16280 15970 10885
  2013 12350 16660 15075 13975 19660 13695 31715 20900 18845 14630 20135 27500
 

apply(revenue_cube, c("prod", "loc"),
      FUN=function(x) {return(sum(x, na.rm=TRUE))})


         loc
prod         CA     NY    ON    QU    WA
  Laptop  16425   9450  7650  7425  6525
  Printer 15390  19950  7980 10830 10260
  Tablet  90720 117600 45920 34720 57120



I hope this gives you a taste of the richness of the data processing model in R.

However, since R does all the processing in RAM, your data has to be small enough to fit into the local memory of a single machine.

Friday, February 15, 2013

Text Processing (part 1) : Entity Recognition

Entity recognition is commonly used to parse unstructured text documents and extract useful entity information (like location, person, brand) to construct a more useful structured representation.  It is one of the most common text processing tasks used to understand a text document.

I am planning to write a blog series on text processing.  In this first post of the series, I will introduce some basic algorithms for entity recognition.

Given the sentence: Ricky is living in CA USA.
Output: Ricky/SP is/NA living/NA in/NA CA/SL USA/CL

Basically we want to tag each word with an entity label, whose definition is domain-specific.  In this example, we define the following tags
  • NA - Not Applicable
  • SP - Start of a person name
  • CP - Continue of a person name
  • SL - Start of a location name
  • CL - Continue of a location name

Hidden Markov Model

Let's say there is a sequence of states.  If the states themselves were observable, we could model them directly with a probabilistic graph such as a Markov chain.


However, in our tagging example, we don't directly observe the tags.  Instead, we only observe the words.  In this case, we can use a Hidden Markov Model (HMM).


 
Now the tagging problem can be structured as follows.

Given a sequence of words, we want to predict the most likely tag sequence.

Find a tag sequence t1, t2, ... tn that maximizes the probability P(t1, t2, .... | w1, w2 ...)

Using Bayes rules,
P(t1, t2, .... | w1, w2 ...) = P(t1, t2, ... tn, w1, w2, ... wn) / P(w1, w2, ... wn)

Since the sequence w1, w2 ... wn is observed and constant across all tag sequences, this is equivalent to maximizing P(t1, t2, ... tn, w1, w2, ... wn), which is equal to P(t1|S)*P(t2|t1)…P(E|tn)*P(w1|t1)*P(w2|t2)… (where S and E denote the start and end of the sentence).

Now, P(t1|S), P(t2|t1) ... can be estimated by counting the occurrence within the training data.
P(t2|t1) = count(t1, t2) / count(t1, *)

Similarly, P(w1|t1) = count(w1, t1) / count(*, t1)
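A minimal sketch of estimating these probabilities by counting in R (using a tiny, made-up tagged corpus):

# A tiny made-up tagged corpus: parallel vectors of words and their tags
words <- c("Ricky", "is", "living", "in", "CA", "USA")
tags  <- c("SP",    "NA", "NA",     "NA", "SL", "CL")

# Transition counts: P(t2|t1) = count(t1, t2) / count(t1, *)
trans <- table(head(tags, -1), tail(tags, -1))
trans_prob <- trans / rowSums(trans)

# Emission counts: P(w|t) = count(w, t) / count(*, t)
emit <- table(tags, words)
emit_prob <- emit / rowSums(emit)

trans_prob["NA", "SL"]    # estimated P(SL | NA)
emit_prob["SL", "CA"]     # estimated P("CA" | SL)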

Viterbi Algorithm

Now the problem is to find a tag sequence t1, ... tn that maximizes
P(t1|S)*P(t2|t1)…P(E|tn)*P(w1|t1)*P(w2|t2)…

A naive method is to enumerate all possible tag sequences and then evaluate the above probability for each.  The order of complexity is O(|T|^n), where |T| is the number of possible tag values.  Notice that this is exponential in the length of the sentence.

However, there is a more efficient Viterbi algorithm that leverages the Markov chain property as well as the dynamic programming technique.



The key element is M(k, L), which is the maximum probability of any length-k sequence that ends with tk = L.  M(k, L) is computed by looking back at the different choices of S for the length k-1 sequence, and picking the one that maximizes M(k-1, S)*P(tk=L | tk-1=S)*P(wk|tk=L).  The complexity of this algorithm is O(n*|T|^2).

To find the actual tag sequence, we also maintain a back pointer from every cell to the S that leads to that cell.  Then we can trace back the path from the max cell M(n, STOP), where STOP is the end of the sentence.
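A minimal sketch of the Viterbi recursion in R.  The lookup tables start_p, trans_p and emit_p are assumed to have been estimated from counts as above (their names are hypothetical), and the STOP transition is omitted for brevity:

# words: the observed sentence; tag_set: the possible tags
# start_p[t]: P(t|S), trans_p[s, t]: P(t|s), emit_p[t, w]: P(w|t)
viterbi <- function(words, tag_set, start_p, trans_p, emit_p) {
  n <- length(words)
  M    <- matrix(-Inf, n, length(tag_set), dimnames=list(NULL, tag_set))           # best log-prob ending in each tag
  back <- matrix(NA_character_, n, length(tag_set), dimnames=list(NULL, tag_set))  # back pointers

  M[1, ] <- log(start_p[tag_set]) + log(emit_p[tag_set, words[1]])
  for (k in seq_len(n - 1) + 1) {
    for (L in tag_set) {
      scores <- M[k - 1, ] + log(trans_p[tag_set, L]) + log(emit_p[L, words[k]])
      M[k, L]    <- max(scores)
      back[k, L] <- tag_set[which.max(scores)]
    }
  }
  # Trace back from the best final cell
  path <- character(n)
  path[n] <- tag_set[which.max(M[n, ])]
  for (k in rev(seq_len(n - 1))) path[k] <- back[k + 1, path[k + 1]]
  path
}

# e.g. viterbi(c("Ricky","is","living","in","CA","USA"), tag_set, start_p, trans_p, emit_p)
# Words unseen in training would make the emit_p lookup fail here; see the grouping trick discussed below.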

Notice that for rare words that are not observed in the training data, P(wk|tk=L) will be zero and cause the whole term to be zero.  Such words can be numbers, dates, etc.  One way to solve this problem is to group these rare words into patterns (e.g. 3 digits, year2012 ... etc.) and then compute P(group(wk) | tk=L).  However, such grouping is domain-specific and has to be hand-tuned.

Reference

NLP course from Michael Collins of Columbia University

Monday, January 14, 2013

Optimization in R

Optimization is a very common problem in data analytics.  Given a set of variables (over which one has control), how do we pick the right values such that the benefit is maximized?  More formally, optimization is about determining a set of variables x1, x2, … that maximize or minimize an objective function f(x1, x2, …).

Unconstrained optimization

In the unconstrained case, each variable can be freely selected within its full range.

A typical solution is to compute the gradient vector of the objective function [∂f/∂x1, ∂f/∂x2, …] and set it to [0, 0, …].  Solving this equation gives the result x1, x2 …, which is a local optimum.

In R, this can be done by a numerical analysis method.

> f <- function(x){(x[1] - 5)^2 + (x[2] - 6)^2}
> initial_x <- c(10, 11)
> x_optimal <- optim(initial_x, f, method="CG")
> x_min <- x_optimal$par
> x_min
[1] 5 6

Equality constraint optimization

Moving on to the constrained case, let's say x1, x2 … are not independent and have to be related to each other in some particular way: g1(x1, x2, …) = 0,  g2(x1, x2, …) = 0.

The optimization problem can be expressed as …
Maximize objective function: f(x1, x2, …)
Subjected to equality constraints:
  • g1(x1, x2, …) = 0
  • g2(x1, x2, …) = 0
A typical solution is to turn the constrained optimization problem into an unconstrained optimization problem using Lagrange multipliers.

Define a new function F as follows ...
F(x1, x2, …, λ1, λ2, …) = f(x1, x2, …) + λ1.g1(x1, x2, …) + λ2.g2(x1, x2, …) + …

Then solve for ...
[∂F/∂x1, ∂F/∂x2, …, ∂F/∂λ1, ∂F/∂λ2, …] = [0, 0, ….]

Inequality constraint optimization

In this case, the constraints are inequalities.  We cannot use the Lagrange multiplier technique because it requires equality constraints.  There is no general solution for arbitrary inequality constraints.

However, we can put some restrictions on the form of the constraints.  In the following, we study two models where each constraint is restricted to be a linear function of the variables:
w1.x1 + w2.x2 + … >= 0

Linear Programming

Linear programming is a model where both the objective function and the constraint functions are restricted to be linear combinations of the variables.  The linear programming problem can be defined as follows ...

Maximize objective function: f(x1, x2) = c1.x1 + c2.x2

Subjected to inequality constraints:
  • a11.x1 + a12.x2 <= b1
  • a21.x1 + a22.x2 <= b2
  • a31.x1 + a32.x2 <= b3
  • x1 >= 0, x2 >=0
As an illustrative example, let's walk through a portfolio investment problem.  In the example, we want to find an optimal way to allocate the proportion of assets in our investment portfolio.
  • StockA gives 5% return on average
  • StockB gives 4% return on average
  • StockC gives 6% return on average
To set some constraints, let's say my investment in C must be less than the sum of A and B.  Also, investment in A cannot be more than twice that of B.  Finally, at least 10% of the investment must be in each stock.

To formulate this problem:

Variable: x1 = % investment in A, x2 = % in B, x3 = % in C

Maximize expected return: f(x1, x2, x3) = x1*5% + x2*4% + x3*6%

Subjected to constraints:
  • 10% < x1, x2, x3 < 100%
  • x1 + x2 + x3 = 1
  • x3 < x1 + x2
  • x1 < 2 * x2

> library(lpSolve)
> library(lpSolveAPI)
> # Set the number of vars
> model <- make.lp(0, 3)
> # Define the objective function: lpSolve minimizes by default, so negate to maximize
> set.objfn(model, c(-0.05, -0.04, -0.06))
> # Add the constraints
> add.constraint(model, c(1, 1, 1), "=", 1)
> add.constraint(model, c(1, 1, -1), ">", 0)
> add.constraint(model, c(1, -2, 0), "<", 0)
> # Set the upper and lower bounds
> set.bounds(model, lower=c(0.1, 0.1, 0.1), upper=c(1, 1, 1))
> # Compute the optimized model
> solve(model)
[1] 0
> # Get the value of the optimized parameters
> get.variables(model)
[1] 0.3333333 0.1666667 0.5000000
> # Get the value of the objective function
> get.objective(model)
[1] -0.05333333
> # Get the value of the constraint
> get.constraints(model)
[1] 1 0 0


Quadratic Programming

Quadratic programming is a model where the objective function is a quadratic function (containing products of up to two variables) while the constraint functions are restricted to be linear combinations of the variables.

The Quadratic Programming problem can be defined as follows ...

Minimize quadratic objective function:
f(x1, x2) = c1.x1^2 + c2.x1.x2 + c3.x2^2 - (d1.x1 + d2.x2)
Subject to constraints
  • a11.x1 + a12.x2 == b1
  • a21.x1 + a22.x2 == b2
  • a31.x1 + a32.x2 >= b3
  • a41.x1 + a42.x2 >= b4
  • a51.x1 + a52.x2 >= b5
Express the problem in Matrix form:

Minimize objective: (1/2) x^T D x - d^T x
Subject to constraint: A^T x >= b
where the first k columns of A correspond to the equality constraints

As an illustrative example, let's continue with the portfolio investment problem.  In the example, we want to find an optimal way to allocate the proportion of assets in our investment portfolio.
  • StockA gives 5% return on average
  • StockB gives 4% return on average
  • StockC gives 6% return on average
We also look into the variance of each stock (known as risk) as well as the covariance between stocks.

For investment, we not only want a high expected return, but also a low variance.  In other words, stocks with high return but also high variance are not very attractive.  Therefore, maximizing the expected return while minimizing the variance is the typical investment strategy.

One way to minimize variance is to pick multiple stocks (in a portfolio) to diversify the investment.  Diversification happens when the stock components within the portfolio move away from their averages in different directions (hence the variance is reduced).

The covariance matrix ∑ (between each pair of stocks) is given as follows:
1%, 0.2%, 0.5%
0.2%, 0.8%, 0.6%
0.5%, 0.6%, 1.2%

In this example, we want to achieve an overall return of at least 5.2% while minimizing the variance of the return.

To formulate the problem:

Variable: x1 = % investment in A, x2 = % in B, x3 = % in C

Minimize variance: x^T ∑ x

Subjected to constraint:
  • x1 + x2 + x3 == 1
  • x1*5% + x2*4% + x3*6% >= 5.2%
  • 0% < x1, x2, x3 < 100%
> library(quadprog)
> mu_return_vector <- c(0.05, 0.04, 0.06) 
> sigma <- matrix(c(0.01, 0.002, 0.005, 
+                   0.002, 0.008, 0.006, 
+                   0.005, 0.006, 0.012), 
+                  nrow=3, ncol=3)
> D.Matrix <- 2*sigma
> d.Vector <- rep(0, 3)
> A.Equality <- matrix(c(1,1,1), ncol=1)
> A.Matrix <- cbind(A.Equality, mu_return_vector, 
                    diag(3))
> b.Vector <- c(1, 0.052, rep(0, 3))
> out <- solve.QP(Dmat=D.Matrix, dvec=d.Vector, 
                  Amat=A.Matrix, bvec=b.Vector, 
                  meq=1)
> out$solution
[1] 0.4 0.2 0.4
> out$value
[1] 0.00672
>

Integration with Predictive Analytics

Optimization is usually integrated with predictive analytics, which is another important part of data analytics.  For an overview of predictive analytics, here is my previous blog about it.

The overall processing can be depicted as follows:


In this diagram, we use machine learning to train a predictive model in batch mode.  Once the predictive model is available, observation data is fed into it in real time and a set of output variables is predicted.  These output variables are fed into the optimization model as the environment parameters (e.g. the return of each stock, the covariance ... etc.), from which a set of optimal decision variables is recommended.

Thursday, October 4, 2012

Machine Learning in Gradient Descent

In Machine Learning, gradient descent is a very popular learning mechanism that is based on a greedy, hill-climbing approach.

Gradient Descent

The basic idea of Gradient Descent is to use a feedback loop to adjust the model based on the error it observes (between its predicted output and the actual output).  The adjustment (notice that there are multiple model parameters, so it should be considered a vector) points in the direction where the error decreases most steeply (hence the term "gradient").



Notice that we intentionally leave the following items vaguely defined so this approach can be applicable in a wide range of machine learning scenarios.
  • The Model
  • The loss function
  • The learning rate
Gradient Descent is a very popular method for the following reasons ...
  • Intuitive and easy to understand
  • Easy to run in parallel processing architecture
  • Easy to run incrementally with additional data
On the other hand, the greedy approach in Gradient Descent can be trapped in a local optimum.  This can be mitigated by choosing a convex loss function (which has a single optimum), or multiple starting points can be picked randomly (in which case, we hope the best local optimum found is close to the global optimum).
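A minimal sketch of batch gradient descent for a simple linear model with squared loss (the data and learning rate here are purely illustrative):

set.seed(1)
# Toy data generated from y = 2 + 3x + noise
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd=0.1)

w   <- c(0, 0)      # model parameters (intercept, slope)
eta <- 0.1          # learning rate

for (iter in 1:2000) {
  pred <- w[1] + w[2] * x
  err  <- pred - y
  # Gradient of the mean squared loss with respect to (intercept, slope)
  grad <- 2 * c(mean(err), mean(err * x))
  w <- w - eta * grad          # step in the direction of steepest descent
}
w                              # should be close to (2, 3)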

Batch vs Online Learning

While some other machine learning models (e.g. decision trees) require a batch of data points before the learning can start, Gradient Descent is able to learn from each data point independently and hence can support both batch learning and online learning easily.  The difference lies in how the training data is fed into the model and how the loss function computes its error.

In batch learning, all the training data is fed to the model, which estimates the output for all data points.  The errors are then summed to compute the loss, and the model is updated.  The model in this case is updated only after predicting the whole batch of data points.

In online learning mode (also called stochastic gradient descent), data is fed to the model one point at a time, and the adjustment of the model is made immediately after evaluating the error of this single data point.  Notice that the final result of incremental learning can be different from batch learning, but it can be proved that the difference is bounded and inversely proportional to the square root of the number of data points.

The learning rate can be adjusted as well to achieve better stability in convergence.  In general, the learning rate is higher initially and decreases over the iterations of training (in batch learning it decreases with each round; in online learning it decreases with every data point).  This is quite intuitive, as you pay less attention to the error once you have learned more and more.  Because of this, online learning is sensitive to the arrival order of the data.

One way to adjust the learning rate is to divide a constant by the square root of t (where t is the number of data points seen so far).

 ɳ = ɳ_initial / (t ^ 0.5)

By using a different decay factor, we can control how much attention we pay to late-arriving data.  In online learning, as data comes in at its time of occurrence, we can play around with this decay factor to guide how much attention the learning mechanism should be paying to the latest arriving data.  Online learning automatically adapts to changes of trend over time.

Most real-world machine learning scenarios rely on the stationarity of the model.  After all, learning is about "learning from past experience".  If the environment changes so rapidly that past experience becomes invalid, there is little value in learning.  For this reason, most machine learning projects are satisfied with batch learning (daily or weekly), and the demand for online learning is not very high.  A very common batch learning model is described in my previous blog here.

Parallel Learning


Because there is no dependency between data points during processing, Gradient Descent is very easy to put into a parallel processing environment such as Map/Reduce.  Here we illustrate how to parallelize the execution of batch learning.



Notice that there are multiple rounds of Map/Reduce before the model converges.  On the other hand, online learning is not feasible on Hadoop Map/Reduce, which does not support real-time processing at this moment.
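As a rough single-process sketch of this idea (split/lapply/Reduce stand in for the map and reduce steps; the partitioning and toy data are assumptions for illustration):

# Each partition computes a partial gradient ("map"), the partials are
# summed ("reduce"), and the model is updated; this repeats until convergence.
set.seed(1)
x <- runif(200)
y <- 3 * x + 5 + rnorm(200, sd = 0.5)
partitions <- split(data.frame(x, y), rep(1:4, length.out = 200))

w <- 0; b <- 0; rate <- 0.5
for (round in 1:2000) {                            # one Map/Reduce round per iteration
  partials <- lapply(partitions, function(p) {     # "map": partial sums per partition
    error <- (w * p$x + b) - p$y
    c(sum(error * p$x), sum(error), nrow(p))
  })
  total <- Reduce(`+`, partials)                   # "reduce": combine the partial sums
  w <- w - rate * total[1] / total[3]
  b <- b - rate * total[2] / total[3]
}
c(w, b)                                            # same result as the serial batch run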

In summary, Gradient Descent is a very powerful machine learning approach that works well in a wide spectrum of scenarios.

Sunday, September 23, 2012

Locality Sensitive Hashing in Map Reduce

Inspired by Dr. Gautam Shroff, who teaches the Coursera class "Web Intelligence and Big Data", I want to add another algorithm to my Map/Reduce algorithm catalog.  There are many scenarios where we want to compute the similarity between a large number of items (e.g. photos, products, persons, resumes, etc.).

For background on the Map/Reduce implementation on Hadoop, I have a previous post that covers the details.

Large Scale Similarity Computation

Let's say there are N items (N in the billions) and we want to find all pairs that are similar to each other (similarity is defined by a distance function).  The goal is to output a similarity matrix.  (Notice that this matrix is very sparse and most of its cells are infinite.)

One naive way is to compute the similarity of every possible pair of items, which is an O(N^2) computation and hence prohibitively large.  Can we reduce the order of complexity?

Locality Sensitive Hashing

First idea: Find a hash function such that similar items (say, those whose distance is less than some predefined threshold) are likely to be hashed to the same bucket.

Let's say we pick the hash function such that Probability(H(a) == H(b)) is proportional to the similarity between a and b.  Then we only perform the detailed comparison on items that fall into the same bucket.
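Minhashing over sets is one concrete example of such a hash family: the probability that two items share the same minhash value equals their Jaccard similarity.  A quick simulation sketch (toy feature sets; a random permutation is assumed as the hash function) illustrates this:

# Minhash demo: P(H(a) == H(b)) should be close to the Jaccard similarity
set.seed(1)
universe <- 1:1000
a <- sample(universe, 200)
b <- c(sample(a, 120), sample(setdiff(universe, a), 80))   # overlaps with a

jaccard <- length(intersect(a, b)) / length(union(a, b))

minhash_match <- function(a, b) {
  perm <- sample(universe)             # one random permutation = one hash function
  min(match(a, perm)) == min(match(b, perm))
}

est <- mean(replicate(2000, minhash_match(a, b)))
cat("Jaccard similarity:", round(jaccard, 3),
    " estimated P(H(a) == H(b)):", round(est, 3), "\n")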

Here is some R code that plots the relationship between similarity and the chance of performing a detail comparison.

x <- seq(0, 1, 0.01)
y <- x
plot(x, y, xlab="similarity", ylab="prob of detail compare")



Let's say we are interested in comparing all pairs of items whose similarity is above 0.3.  We have a problem here: a pair with similarity 0.3 still has probability 0.7 = 1 - 0.3 of not landing in the same bucket and therefore being missed.  We want a mechanism that is highly selective: the probability of performing a detailed comparison should be close to one when the similarity is above 0.3 and close to zero when it is below 0.3.

Second idea: Let's use 100 hash functions; two items that match on 30 or more of these hash functions will be selected for detailed comparison.

Here is some R code that plots the relationship between similarity and the chance of performing a detail comparison.

# Probability of having more than "threshold" matches out 
# of "no_of_hash" with a range of varying similarities

prob_select <- function(threshold, similarity, no_of_hash) {
  sum <- rep(0, length(similarity))
  for (k in 0:floor(no_of_hash * threshold)) {
    sum <- sum + dbinom(k, no_of_hash, similarity)
  }
  return(1 - sum)
}

x <- seq(0, 1, 0.01)
y <- prob_select(0.3, x, 100)
plot(x, y, main="black: 100 hashes, Red: 1000 hashes", 
xlab="similarity", ylab="prob of detail compare")
lines(x, y)
y <- prob_select(0.3, x, 1000)
lines(x, y, col="red")


The curve looks much better this time: the chance of being selected for detailed comparison jumps from zero to one sharply as the similarity crosses 0.3.

To find the items that are similar, we first compute 100 hashes (based on 100 different hash functions) for each item and output every combination of 30 of those hashes as a key.  Then we perform a pairwise comparison on all items that share the same key.

But look at the number of combinations of 30 out of 100: it is 100!/(30! * 70!) = 2.93 * 10^25, which is impractically huge.  Even though the graph looks nice, we cannot use this mechanism in practice.

Third idea: Let's use 100 hash functions and break them into b groups of r hashes each (i.e. b*r = 100).  Further, let's assume b = 20 and r = 5.  In other words, we have 20 groups: Group1 has hash1 to hash5, Group2 has hash6 to hash10, and so on.  Now we say that itemA's group1 matches itemB's group1 if all of their hash1 to hash5 values are equal.  We will perform a detailed comparison of itemA and itemB if any of their groups match.

The probability of a pair with similarity s being selected is 1 - (1 - s^r)^b

The idea can be visualized as follows




Notice that in this model, finding r and b based on s is a bit of trial and error.  Here we try 20 groups of 5, 33 groups of 3, and 10 groups of 10.

prob_select2 <- function(similarity, row_per_grp, no_of_grp) {
  return(1 - (1 - similarity^row_per_grp)^no_of_grp)
}

x <- seq(0, 1, 0.01)
y <- prob_select2(x, 5, 20)

plot(x, y, 
main="black:20 by 5, red:10 by 10, blue:33 by 3", 
xlab="similarity", ylab="prob of detail compare")

lines(x, y)
y <- prob_select2(x, 10, 10)
lines(x, y, col="red")
y <- prob_select2(x, 3, 33)
lines(x, y, col="blue")



From the graph, we see the blue curve (33 groups of 3) gives the sharpest transition around a similarity of 0.3, so let's use 33 by 3.

Notice that the ideal curve would be a step function where the probability jumps from 0 to 1 as the similarity crosses the threshold we are interested in capturing (i.e. we want all pairs whose similarity is above this threshold to land in the same bucket, and all pairs whose similarity is below it to land in different buckets).  Unfortunately, our curve is an S-curve, not a step function, which means there will be false positives and false negatives.  False positives lie on the left side of the similarity threshold, where we have a small chance of putting dissimilar items into the same bucket; this costs some extra work to compare them later and throw them away.  False negatives lie on the right side, where we have a small chance of putting items that are very similar into different buckets and never considering them in the detailed comparison.  Depending on whether we need to catch all the similar items above the threshold, we may need to shift the S-curve left or right by tuning the r and b parameters.

To perform the detailed comparison, we can use a parallel Map/Reduce implementation.

Map Reduce Implementation

Here we have two rounds of Map/Reduce.  In the first round, the map function computes all the group keys for each item and emits each (groupKey, item) pair.  All items that share the same group key land on the same reducer, which emits all possible pairs of those items (these are the candidates for pairwise comparison).

However, we don't want to perform the detailed comparison in the first round, because there may be many duplicates for item pairs that match on more than one group.  Therefore we perform another round of Map/Reduce to remove the duplicates.
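Before looking at the Map/Reduce rounds, here is a single-machine sketch of the same banding-plus-deduplication logic (toy items represented as sets, minhash signatures assumed as the hash family, with r = 3 and b = 33 as chosen above):

# Banding sketch: 99 minhashes split into 33 groups of 3
set.seed(42)
universe <- 1:500
items <- lapply(1:48, function(i) sample(universe, 40))
items[[49]] <- c(items[[1]][1:35], sample(setdiff(universe, items[[1]]), 5))  # near-duplicate of item 1
items[[50]] <- c(items[[2]][1:36], sample(setdiff(universe, items[[2]]), 4))  # near-duplicate of item 2

no_of_hash <- 99; r <- 3; b <- no_of_hash / r
perms <- lapply(1:no_of_hash, function(i) sample(universe))
signature <- function(item) sapply(perms, function(p) min(match(item, p)))
sigs <- sapply(items, signature)                  # 99 x 50 signature matrix

# Round 1 "map": emit one (groupKey, itemId) record per group per item
emit <- do.call(rbind, lapply(1:length(items), function(id) {
  keys <- sapply(1:b, function(g) {
    rows <- ((g - 1) * r + 1):(g * r)
    paste(g, paste(sigs[rows, id], collapse = "-"), sep = ":")
  })
  data.frame(key = keys, item = id)
}))

# Round 1 "reduce": items sharing a group key become candidate pairs
candidates <- do.call(rbind, lapply(split(emit$item, emit$key), function(ids) {
  if (length(ids) < 2) return(NULL)
  t(combn(sort(ids), 2))
}))

# Round 2: remove duplicate pairs before the detailed comparison
unique(candidates)                                # expect pairs like (1, 49) and (2, 50)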

The first round proceeds as follows ...




After that, the second round proceeds as follows ...




By combining Locality Sensitive Hashing and Map/Reduce, we can perform large-scale similarity calculation in an effective manner.

Monday, August 13, 2012

BIG Data Analytics Pipeline

"Big Data Analytics" has recently been one of the hottest buzzwords.  It is a combination of "Big Data" and "Deep Analysis".  The former is a phenomenon of Web2.0 where a lot of transaction and user activity data has been collected which can be mined for extracting useful information.  The later is about using advanced mathematical/statistical technique to build models from the data.  In reality I've found these two areas are quite different and disjoint and people working in each area have a pretty different background.

Big Data Camp

People working in this camp typically come from a Hadoop, PIG/Hive background.  They usually have implemented some domain-specific logic to process large amounts of raw data.  Often the logic is relatively straightforward, based on domain-specific business rules.

From my personal experience, most people working in big data come from a computer science and distributed parallel processing background, not from a statistical or mathematical discipline.

Deep Analysis Camp

On the other hand, people working in this camp usually come from a statistical and mathematical background, in which the first thing taught is how to use sampling to understand a large population's characteristics.  Notice that the magic of "sampling" is that the accuracy of estimating the large population depends only on the size of the sample, not on the actual size of the population.  In their world, there is never a need to process all the data in the population in the first place.  Therefore, Big Data Analytics is unnecessary under this philosophy.

Typical Data Processing Pipeline

Learning from my previous projects, I observe that most data processing pipelines fall into the following pattern.



In this model, data is created in the OLTP (On-Line Transaction Processing) system and flows into the Big Data Analytics system, which produces various outputs, including data marts/cubes for OLAP (On-Line Analytical Processing), reports for the consumption of business executives, and predictive models that feed decision support back to the OLTP system.

Big Data + Deep Analysis

The Big Data Analytics box is usually run in a batch fashion (e.g. once a day), and we usually see big data processing and deep data analysis happen at different stages of this batch process.



The big data processing part (colored in orange) is usually done using Hadoop/PIG/Hive technology with classical ETL logic.  By leveraging the Map/Reduce model that Hadoop provides, we can linearly scale up the processing by adding more machines to the Hadoop cluster.  Drawing on cloud computing resources (e.g. Amazon EMR) is a very common approach to performing this kind of task.

The deep analysis part (colored in green) is usually done in R, SPSS, or SAS using a much smaller amount of carefully sampled data that fits into a single machine's capacity (usually less than a couple hundred thousand data records).  The deep analysis part usually involves data visualization, data preparation, model learning (e.g. linear regression and regularization, K-nearest-neighbor / support vector machine / Bayesian network / neural network, decision trees and ensemble methods), and model evaluation.  For those who are interested, please read my earlier posts on these topics.

Implementation Architecture

There are many possible ways to implement the data pipeline described above.  Here is one common implementation that works well in many projects.


In this architecture, "Flume" is used to move data from OLTP system to Hadoop File System HDFS.  A workflow scheduler (typically a cron-tab entry calling a script) will periodically run to process the data using Map/Reduce.  The data has two portions:  a) Raw transaction data from HDFS  b) Previous model hosted in some NOSQL server.  Finally the "reducer" will update the previous model which will be available to the OLTP system.

For most of the big data analytics projects that I have been involved in, the above architecture works pretty well.  I believe projects requiring a real-time feedback loop may see some limitations in this architecture.  Real-time big data analytics is an interesting topic which I am planning to discuss in future posts.

Wednesday, August 8, 2012

Measuring similarity and distance function

Measuring the similarity or distance between two data points is fundamental to many machine learning algorithms such as K-Nearest-Neighbor, clustering, etc.  Depending on the nature of the data points, various measurements can be used.

 

Distance between numeric data points

When the dimension of data point is numeric, the general form is called Minkowski distance


( |x1 - x2|^p + |y1 - y2|^p )^(1/p)

When p = 2, this is equivalent to Euclidean distance.  When p = 1, this is equivalent to Manhattan distance.

This measure is independent of the underlying data distribution.  But what if the values along the x-dimension are much bigger than those along the y-dimension?  We need to bring all dimensions onto the same scale first.  A common way is to perform a z-transform, where each data point first subtracts the mean value and then divides by the standard deviation.


(x1, y1) becomes ( (x1 - μx)/σx , (y1 - μy)/σy )

This measure, although taking into consideration the distribution of each dimension, assumes that the dimensions are independent of each other.  But what if the x-dimension and y-dimension are correlated?  To take the correlation between different dimensions into account, we use ...


Mahalanobis distance = ( (v1 - v2)^T . CovMatrix^-1 . (v1 - v2) )^(1/2)  where v1 = (x1, y1)

If we care about the direction of the data rather than the magnitude, then cosine similarity (and the corresponding cosine distance) is a common approach.  It computes the dot product of the two data points divided by the product of their magnitudes.  Cosine similarity, together with the term/document matrix, is commonly used to measure the similarity between documents.
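A short sketch of these numeric measures (the 2-D toy data is made up for illustration):

# Minkowski, z-scaled, Mahalanobis, and cosine measures on toy 2-D points
data <- rbind(c(1, 120), c(2, 150), c(3, 160), c(5, 180), c(4, 170))
x1 <- data[2, ]; x2 <- data[4, ]

minkowski <- function(a, b, p) sum(abs(a - b)^p)^(1/p)
minkowski(x1, x2, 2)                        # Euclidean
minkowski(x1, x2, 1)                        # Manhattan

z <- scale(data)                            # z-transform: subtract mean, divide by sd
minkowski(z[2, ], z[4, ], 2)

sqrt(mahalanobis(x1, center = x2, cov = cov(data)))   # accounts for correlation

cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine_sim(x1, x2)                          # direction only, not magnitude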

 

Distance between categorical data points

Since there is no ordering between categorical values, we can only measure whether two categorical values are the same or not.  Essentially we are measuring the degree of overlap of attribute values.  Hamming distance measures how many attributes need to be changed for the two data points to match.  We can also calculate the ratio to determine how similar (or different) two data points are, using the simple matching coefficient:
noOfMatchAttributes / noOfAttributes

However, when the data point contains asymmetric binary attributes, equality of certain values doesn't mean anything.  For example, let's say each data point represents a user and each attribute represents a movie.  The data point is a high-dimensional binary vector representing whether the user has seen each movie (1 represents yes and 0 represents no).  Given that most users only see a very small portion of all movies, if both users haven't seen a particular movie (both values are zero), it doesn't indicate any similarity between them.  On the other hand, if both users saw the same movie (both values are one), it implies a lot of similarity between them.  In this case, equality of ones should carry a much higher weight than equality of zeros.  This leads to the Jaccard similarity:
noOfOnesInBoth / (noOfOnesInA + noOfOnesInB - noOfOnesInAandB)
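A small sketch contrasting the simple matching coefficient with Jaccard similarity on toy "user saw movie" vectors (the data is made up for illustration):

# Simple matching coefficient vs. Jaccard similarity on binary attributes
a <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)
b <- c(1, 0, 0, 0, 0, 0, 0, 1, 1, 0)

simple_matching <- function(a, b) sum(a == b) / length(a)
jaccard <- function(a, b) sum(a == 1 & b == 1) / sum(a == 1 | b == 1)

simple_matching(a, b)   # dominated by the shared zeros
jaccard(a, b)           # only counts the shared ones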

Beyond exact matching, if the categories are structured as a tree hierarchy, then the distance between two categories can be quantified by the path length to their common parent.  For example, "/product/spot/ballgame/basketball" is closer to "/product/spot/ballgame/soccer/shoes" than to "/product/luxury/handbags", because the common parent has a longer path.

 

Similarity between instances containing mixed types of attributes

When a data point contains a mix of attribute types, we can calculate the similarity for each attribute (or for each group of attributes of the same type) and then combine them using a weighted average.

But we have to be careful when treating asymmetric attributes, where a match on the "absent" value doesn't mean anything.  The indicator δk below handles this: it is 0 when the k-th attribute comparison is meaningless (e.g. an asymmetric binary attribute where both values are zero) and 1 otherwise.

combined_similarity(x, y) = Σover_k[ wk * δk * similarity(xk, yk) ] / Σover_k( δk )

where Σover_k( wk ) = 1
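A sketch of this weighted combination (the attribute types, weights, and per-attribute similarities are assumptions for illustration):

# Combine per-attribute similarities, skipping meaningless comparisons
combined_similarity <- function(sim, w, delta) {
  # sim:   per-attribute similarity values
  # w:     attribute weights (summing to 1)
  # delta: 1 if the attribute comparison is meaningful, 0 otherwise
  sum(w * delta * sim) / sum(delta)
}

sim   <- c(0.9, 0.4, 1.0)    # e.g. numeric, categorical, asymmetric binary
w     <- c(0.5, 0.3, 0.2)
delta <- c(1, 1, 0)          # the asymmetric attribute matched on absence, so ignore it
combined_similarity(sim, w, delta)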


Distance between sequence (String, TimeSeries)

In case each attribute represents an element of a sequence, we need a different way to measure the distance.  For example, let's say each data point is a string (a sequence of characters); then edit distance is a common measurement.  Basically, edit distance is the number of "modifications" (insert, modify, delete) needed to change stringA into stringB.  This is usually calculated using a dynamic programming technique.
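A minimal dynamic-programming sketch of edit (Levenshtein) distance; note that R also ships a built-in adist() that computes the same thing:

# Edit distance via dynamic programming
edit_distance <- function(a, b) {
  a <- strsplit(a, "")[[1]]; b <- strsplit(b, "")[[1]]
  d <- matrix(0, length(a) + 1, length(b) + 1)
  d[, 1] <- 0:length(a)               # deletions down to the empty string
  d[1, ] <- 0:length(b)               # insertions from the empty string
  for (i in 1:length(a)) {
    for (j in 1:length(b)) {
      cost <- if (a[i] == b[j]) 0 else 1
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1,      # delete
                             d[i + 1, j] + 1,      # insert
                             d[i, j] + cost)       # modify (or match)
    }
  }
  d[length(a) + 1, length(b) + 1]
}

edit_distance("kitten", "sitting")     # 3, same as adist("kitten", "sitting")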

Time series are another example of sequence data.  Similar in spirit to edit distance, Dynamic Time Warp is about distorting the time dimension by adding more data points to both time series such that the squared error between corresponding pairs of points is minimized.  Where to add these data points is solved using a similar dynamic programming technique.  Here is a very good paper that describes the details.
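A compact sketch of the Dynamic Time Warp recurrence, using squared error as the local cost (toy series for illustration):

# Dynamic Time Warp distance via dynamic programming
dtw_distance <- function(s, t) {
  n <- length(s); m <- length(t)
  D <- matrix(Inf, n + 1, m + 1)
  D[1, 1] <- 0
  for (i in 1:n) {
    for (j in 1:m) {
      cost <- (s[i] - t[j])^2
      D[i + 1, j + 1] <- cost + min(D[i, j + 1],   # stretch t
                                    D[i + 1, j],   # stretch s
                                    D[i, j])       # advance both
    }
  }
  D[n + 1, m + 1]
}

dtw_distance(c(1, 2, 3, 4, 3, 2), c(1, 1, 2, 3, 4, 3, 2, 2))   # small: similar shapes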

 

Distance between nodes in a network

In a homogeneous undirected graph (where nodes are of the same type), the distance between nodes can be measured by the shortest path.

In a bipartite graph, there are two types of nodes, and each node only connects to nodes of the other type (e.g. people joining communities).  Similarity between nodes (of the same type) can be measured by analyzing how similar their connected communities are.

SimRank is an iterative algorithm that computes the similarity between nodes of one type by summing the similarities between all pairs of nodes of the other type that they connect to, while the similarities of the other type are computed in the same way.
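A minimal SimRank sketch for a bipartite people-community graph (the adjacency matrix, the decay constant C, and the iteration count are assumptions for illustration):

# SimRank on a bipartite graph: A is a people x community adjacency matrix
simrank_bipartite <- function(A, C = 0.8, iters = 10) {
  np <- nrow(A); nc <- ncol(A)
  Sp <- diag(np)                 # people-people similarity
  Sc <- diag(nc)                 # community-community similarity
  for (it in 1:iters) {
    Sp_new <- diag(np); Sc_new <- diag(nc)
    for (a in 1:np) for (b in 1:np) if (a != b) {
      Ia <- which(A[a, ] == 1); Ib <- which(A[b, ] == 1)
      if (length(Ia) > 0 && length(Ib) > 0)
        Sp_new[a, b] <- C * sum(Sc[Ia, Ib]) / (length(Ia) * length(Ib))
    }
    for (x in 1:nc) for (y in 1:nc) if (x != y) {
      Ix <- which(A[, x] == 1); Iy <- which(A[, y] == 1)
      if (length(Ix) > 0 && length(Iy) > 0)
        Sc_new[x, y] <- C * sum(Sp[Ix, Iy]) / (length(Ix) * length(Iy))
    }
    Sp <- Sp_new; Sc <- Sc_new
  }
  list(people = Sp, community = Sc)
}

A <- rbind(c(1, 1, 0), c(1, 1, 0), c(0, 1, 1))   # 3 people, 3 communities
simrank_bipartite(A)$people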


We can also use a probabilistic approach such as a random walk to determine the similarity.  Each people node passes a token (labeled with that person's name) to a randomly picked community node that it is connected to (weighted by the strength of the connection).  Each community node propagates the received token back to a randomly picked person.  A person who receives a propagated token may drop it (with probability beta) or propagate it to a randomly chosen community again.  This process continues until all the tokens have died out (since they have a chance of being dropped at each step).  After that, we obtain the trace matrix and compute the similarity between two people based on the dot product of the token counts they have received.


 

Distance between population distribution

Instead of measuring the distance between individual data points, we can also compare collections of data points (i.e. populations) and measure the distance between them.  In fact, one important part of statistics is to measure the distance between two groups of samples and decide whether the "difference" is significant enough to conclude that they come from different populations.

Let's say the population contains members that belong to different categories, and we want to measure whether population A and population B have the same or different proportions of members across these categories.  Then we can use the Chi-Square statistic or KL-Divergence to measure their distance.
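A sketch of both measures on toy category counts (the counts are made up for illustration):

# Compare two categorical distributions: Chi-Square test and KL-divergence
counts_A <- c(red = 30, green = 50, blue = 20)
counts_B <- c(red = 45, green = 35, blue = 20)

chisq.test(rbind(counts_A, counts_B))        # test of equal proportions

kl_divergence <- function(p, q) {
  p <- p / sum(p); q <- q / sum(q)
  sum(p * log(p / q))                        # assumes no zero cells
}
kl_divergence(counts_A, counts_B)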

In case every member of the population has two different numeric attributes (e.g. weight and height), and we want to infer one attribute from the other (assuming they are correlated), the correlation coefficient quantifies their degree of correlation: whether the two attributes move in the same direction (heavier people are taller), in opposite directions (heavier people are shorter), or independently.  The correlation coefficient ranges from -1 (negatively correlated) through 0 (no correlation) to 1 (positively correlated).

If the two attributes are categorical (rather than numeric), then mutual information is a common way to measure their dependency and gives a good sense of whether knowing the value of one attribute helps in inferring the other.

Now suppose two judges rank a collection of items and we are interested in the degree of agreement of their rankings.  We can use Spearman's rank correlation coefficient to measure their degree of consensus in the ranking order.
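A short sketch of these last three measures on made-up data (the attribute values and rankings are assumptions for illustration):

# Correlation coefficient between two numeric attributes
weight <- c(60, 72, 80, 95, 55, 68)
height <- c(165, 175, 180, 190, 160, 172)
cor(weight, height)                          # Pearson correlation coefficient

# Mutual information between two categorical attributes
mutual_information <- function(x, y) {
  pxy <- table(x, y) / length(x)
  px <- rowSums(pxy); py <- colSums(pxy)
  sum(ifelse(pxy > 0, pxy * log(pxy / outer(px, py)), 0))
}
smoker  <- c("yes", "no", "no", "yes", "no", "yes")
disease <- c("yes", "no", "no", "yes", "no", "no")
mutual_information(smoker, disease)

# Agreement between two judges' rankings
judge1 <- c(1, 2, 3, 4, 5)
judge2 <- c(2, 1, 3, 5, 4)
cor(judge1, judge2, method = "spearman")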