
Saturday, April 29, 2017

An output of a truly random process

Recently I had a discussion with my data science team about whether we can challenge the claim that a set of observations follows a random process.  Basically, data science is all about learning the hidden patterns that affect the observations.  If the observations follow a random process, then there is nothing we can learn.  Let me walk through an example to illustrate.

Let's say someone claims that he is throwing a fair die (with numbers 1 to 6) sequentially.

He claims the output of his die throws is uniformly random, i.e. each number from 1 to 6 has an equal chance of appearing.

He then throws the die 12 times and shows you the output sequence.  From the output, can you judge whether this really is the sequential output of a fair die?  In other words, does the output really follow a random process as expected?

Let's look at 3 situations:
  • Situation 1 output is [4, 1, 3, 1, 2, 6, 3, 5, 5, 1, 2, 4]
  • Situation 2 output is [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
  • Situation 3 output is [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
At first glance, the output of situation 1 looks like the result of a random process.  Situation 2 definitely doesn't look like it.  Situation 3 is harder to judge.  If you look at the proportions of the output numbers, the frequency of each number in situation 3 definitely follows the uniform distribution of a fair die.  But if you look at the ordering, situation 3 follows a well-defined ordering that doesn't seem random at all.  Therefore, I don't think the output of situation 3 comes from a random process.

However, this seems to be a very arbitrary choice.  Why would I look at the number ordering at all?  Should I look at more properties, such as ...
  • Whether the numbers in the even positions are even
  • The average gap between consecutive throws
  • Whether the number in the 3rd position is always smaller than the number in the 10th position
  • ...
As you can see, depending on my imagination, the list can go on and on.  How can I tell whether situation 3 follows a random process or not?

Method 1: Randomization Test

This is based on the hypothesis-testing methodology.  We establish the null hypothesis H0 that situation 3 follows a random process.

First, I define an arbitrary list of statistics of my choice
  • statisticA = proportion of even numbers in even positions
  • statisticB = average gap between consecutive output numbers
  • statisticC = ...
Second, I run a simulation to generate 12 numbers from a random process and calculate the corresponding statistics defined above.

Third, I repeat the simulation N times and record the mean and standard deviation of each statistic.

If statisticA, B, or C of situation 3 is too far away from the corresponding simulated mean, measured in standard deviations (i.e. the p-value is below our significance threshold), then we conclude that situation 3 does not follow a random process.  Otherwise, we don't have enough evidence that the null hypothesis is violated, so we cannot reject the claim that situation 3 follows a random process.
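Here is a minimal sketch of this randomization test in R (the same language used in the bootstrap post below), using one illustrative statistic, the average absolute gap between consecutive throws; the choice of statistic and the number of simulation runs are my own assumptions for illustration.

observed <- c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6)    # situation 3
# statisticB: average absolute gap between consecutive throws
stat_gap <- function(x) mean(abs(diff(x)))

# Simulate N sequences of 12 fair-die throws and compute the same statistic
N <- 10000
sim_stats <- replicate(N, stat_gap(sample(1:6, 12, replace = TRUE)))

# How many standard deviations away from the simulated mean is situation 3?
obs_stat <- stat_gap(observed)
(obs_stat - mean(sim_stats)) / sd(sim_stats)

# Empirical p-value: proportion of simulations at least as extreme as the observation
mean(abs(sim_stats - mean(sim_stats)) >= abs(obs_stat - mean(sim_stats)))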

Method 2: Predictability Test

This is based on the theory of predictive analytics.

First, I pick a particular machine learning algorithm, let's say a time series forecast using ARIMA.
Notice that I could also choose RandomForest and create some arbitrary input features (such as the previous output number, the maximum of the last 3 numbers, etc.).

Second, I train my selected predictive model on the output data of situation 3 (in this example, situation 3 has only 12 data points, but imagine we had many more).

Third, I evaluate my model on a held-out test set and see whether its prediction is much better than a random guess.  For example, I can measure the lift of my model by comparing the RMSE (root mean square error) of my predictions with the standard deviation of the test data.  If the lift is insignificant, then I conclude that situation 3 results from a random process, because my predictive model doesn't learn any pattern.
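Below is a minimal sketch of the predictability test in R.  Since 12 throws are far too few to train on, the sketch simulates a longer sequence of fair-die throws as stand-in data; the ARIMA order, the train/test split, and the use of base R's arima() are my own illustrative choices.

set.seed(42)
throws <- sample(1:6, 200, replace = TRUE)   # stand-in for a much longer observed sequence

train <- throws[1:150]
test  <- throws[151:200]

# Fit a simple autoregressive model on the training portion (base R's arima)
model <- arima(train, order = c(2, 0, 0))

# Forecast over the test horizon
pred <- predict(model, n.ahead = length(test))$pred

# Lift: compare the model's RMSE with the spread of the test data itself
rmse_model    <- sqrt(mean((test - pred)^2))
rmse_baseline <- sd(test)
rmse_model / rmse_baseline   # a ratio close to 1 means no learnable pattern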


Monday, March 3, 2014

Estimating statistics via Bootstrapping and Monte Carlo simulation

We want to estimate some "statistics" (e.g. average income, 95th percentile of height, variance of weight, etc.) of a population.

It would be too tedious to enumerate all members of the whole population.  For efficiency reasons, we randomly pick a number of samples from the population and compute the statistics of the sample set to estimate the corresponding statistics of the population.  We understand that an estimate made this way (via random sampling) can deviate from the population value.  Therefore, in addition to our estimated statistics, we also include a "standard error" (how far our estimate may deviate from the actual population statistic) or a "confidence interval" (a lower and upper bound of the statistic which we are confident contains the true value).

The challenge is how to estimate the "standard error" or the "confidence interval".  A straightforward way is to repeat the sampling exercise many times; each time we create a different sample set from which we compute one estimate.  Then we look across the estimates from the different sample sets to estimate the standard error and confidence interval.

But what if collecting data for a different sample set is expensive, or for any reason the population is no longer accessible after we collected our first sample set?  Bootstrapping provides a way to address this ...

Bootstrapping

Instead of creating additional sample sets from the population, we create additional sample sets by re-sampling data (with replacement) from the original sample set.  Each of the created sample sets will follow the same data distribution as the original sample set, which in turn follows the population.

R provides a nice "boot" library to do this.

> library(boot)
> # Generate a population
> population.weight <- rnorm(100000, 160, 60)
> # Lets say we care about the ninety percentile
> quantile(population.weight, 0.9)
     90% 
236.8105 
> # We create our first sample set of 500 samples
> sample_set1 <- sample(population.weight, 500)
> # Here is our sample statistic of ninety percentile
> quantile(sample_set1, 0.9)
     90% 
232.3641 
> # Notice that the sample statistics deviates from the population statistics
> # We want to estimate how big is this deviation by using bootstrapping
> # I need to define my function to compute the statistics
> ninety_percentile <- function(x, idx) {return(quantile(x[idx], 0.9))}
> # Bootstrapping will call this function many times with different idx
> boot_result <- boot(data=sample_set1, statistic=ninety_percentile, R=1000)
> boot_result

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = sample_set1, statistic = ninety_percentile, R = 1000)


Bootstrap Statistics :
    original   bias    std. error
t1* 232.3641 2.379859     5.43342
> plot(boot_result)
> boot.ci(boot_result, type="bca")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = boot_result, type = "bca")

Intervals : 
Level       BCa          
95%   (227.2, 248.1 )  
Calculations and Intervals on Original Scale


Here is the visual output of the bootstrap plot

Bootstrapping is a powerful simulation technique for estimating any statistic in an empirical way.  It is also non-parametric: it doesn't assume any model or parameters, and just uses the original sample set to estimate the statistic.

If, on the other hand, we assume a certain distribution model and want to see the distribution of a certain statistic, Monte Carlo simulation provides a powerful way to do this.

Monte Carlo Simulation

The idea is pretty simple: based on a particular distribution function (defined by specific model parameters), we generate many sample sets.  We compute the statistic of each sample set and see how the statistic is distributed across the different sample sets.

For example, given a normally distributed population, what is the probability distribution of the maximum of 5 randomly chosen samples?

> sample_stats <- rep(0, 1000)
> for (i in 1:1000) {
+     sample_stats[i] <- max(rnorm(5))
+ }
> mean(sample_stats)
[1] 1.153008
> sd(sample_stats)
[1] 0.6584022
> par(mfrow=c(1,2))
> hist(sample_stats, breaks=30)
> qqnorm(sample_stats)
> qqline(sample_stats)


Here is the distribution of the "max(5)" statistic, which shows some right skewness

Bootstrapping and Monte Carlo simulation are powerful tools for estimating statistics in an empirical manner, especially when we don't have an analytic solution.

Wednesday, August 8, 2012

Measuring similarity and distance function

Measuring the similarity or distance between two data points is fundamental to many Machine Learning algorithms such as K-Nearest-Neighbor, Clustering, etc.  Depending on the nature of the data points, various measurements can be used.

 

Distance between numeric data points

When the dimensions of the data points are numeric, the general form is called the Minkowski distance


( |x1 - x2|^p + |y1 - y2|^p )^(1/p)

When p = 2, this is equivalent to Euclidean distance.  When p = 1, this is equivalent to Manhattan distance.

This measure is independent of the underlying data distribution.  But what if the values along the x-dimension are much bigger than those along the y-dimension?  Then we need to bring all dimensions onto the same scale first.  A common way is to perform a z-transform, where each data point first subtracts the mean and then divides by the standard deviation of each dimension.


(x1, y1) becomes ( (x1 - μx)/σx , (y1 - μy)/σy )

This measure, although it takes the distribution of each dimension into consideration, assumes the dimensions are independent of each other.  But what if the x-dimension and y-dimension are correlated?  To account for correlation between dimensions, we use ...


Mahalanobis distance = sqrt( (v1 - v2)^T . CovMatrix^-1 . (v1 - v2) ), where v1 = (x1, y1)

If we care about the direction of the data rather than the magnitude, then cosine similarity is a common approach: it computes the dot product of the two data points divided by the product of their magnitudes (cosine distance is one minus that).  Cosine similarity, together with a term/document matrix, is commonly used to measure the similarity between documents.
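Here is a minimal sketch of these numeric measures in R, using two made-up 2-D points and base R functions only (the reference data set used for the scaling and the covariance is also made up):

v1 <- c(x = 3, y = 100)
v2 <- c(x = 5, y =  40)

# Minkowski family: p = 2 is Euclidean, p = 1 is Manhattan
sum(abs(v1 - v2)^2)^(1/2)                          # Euclidean
sum(abs(v1 - v2))                                  # Manhattan
dist(rbind(v1, v2), method = "minkowski", p = 3)   # general Minkowski

# z-transform each dimension against a reference data set before measuring distance
data <- matrix(rnorm(200, mean = c(4, 70), sd = c(1, 30)), ncol = 2, byrow = TRUE)
scaled <- scale(rbind(v1, v2), center = colMeans(data), scale = apply(data, 2, sd))
dist(scaled)                                       # Euclidean distance on the scaled points

# Mahalanobis distance accounts for correlation between the dimensions
sqrt(mahalanobis(v1, center = v2, cov = cov(data)))

# Cosine similarity: dot product divided by the product of the magnitudes
sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))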

 

Distance between categorical data points

Since there is no ordering between categorical values, we can only measure whether the categorical values are the same or not.  Basically we are measuring the degree of overlap of attribute values.  Hamming distance measures how many attributes need to be changed in order for the two points to match each other.  We can turn this into a ratio of how similar (or different) two data points are using the simple matching coefficient:
noOfMatchAttributes / noOfAttributes

However, when the data points contain asymmetric binary attributes, equality of a certain value doesn't mean anything.  For example, let's say each data point represents a user, with one binary attribute per movie indicating whether the user has seen that movie (1 represents yes and 0 represents no).  Given that most users have only seen a very small portion of all movies, if neither user has seen a particular movie (both values are zero), it doesn't indicate any similarity between the users.  On the other hand, if both users saw the same movie (both values are one), it implies a lot of similarity between them.  In this case, equality of one should carry a much higher weight than equality of zero.  This leads to the Jaccard similarity:
noOfOnesInBoth / (noOfOnesInA + noOfOnesInB - noOfOnesInAandB)
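A minimal sketch contrasting the two coefficients in R, with two made-up binary "movies seen" vectors:

userA <- c(1, 0, 0, 1, 0, 0, 0, 1)
userB <- c(1, 0, 0, 0, 0, 0, 0, 1)

# Simple matching coefficient: every matching position counts, including 0/0
mean(userA == userB)

# Jaccard similarity: only positions where at least one user has a 1 count
sum(userA == 1 & userB == 1) / sum(userA == 1 | userB == 1)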

Besides matching or not, if the categories are structured as a tree hierarchy, then the distance between two categories can be quantified by the path length of their closest common parent.  For example, "/product/spot/ballgame/basketball" is closer to "/product/spot/ballgame/soccer/shoes" than to "/product/luxury/handbags", because the former pair's common parent has a longer path.
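A small sketch of this idea in R, measuring the depth of the longest common path prefix (the helper function below is hypothetical, for illustration only):

common_prefix_depth <- function(pathA, pathB) {
  a <- strsplit(pathA, "/")[[1]]
  b <- strsplit(pathB, "/")[[1]]
  k <- min(length(a), length(b))
  sum(cumprod(a[1:k] == b[1:k]))   # number of leading path segments in common
}

common_prefix_depth("/product/spot/ballgame/basketball",
                    "/product/spot/ballgame/soccer/shoes")   # deeper common parent
common_prefix_depth("/product/spot/ballgame/basketball",
                    "/product/luxury/handbags")              # shallow common parent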

 

Similarity between instances containing mixed types of attributes

When a data point contains a mix of attribute types, we can calculate the similarity of each attribute (or of each group of attributes of the same type), and then combine them using a weighted average.

But we have to be careful when treating asymmetric attributes, where the absence of a value in both data points doesn't mean anything.

combined_similarity(x, y) = Σ_k [ w_k * δ_k * similarity(x_k, y_k) ] / Σ_k δ_k

where Σ_k w_k = 1, and δ_k is 0 when the k-th attribute is an asymmetric attribute absent in both data points, and 1 otherwise
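A direct sketch of this formula in R (the per-group similarities, weights, and delta indicators below are made up):

combined_similarity <- function(sims, weights, delta) {
  sum(weights * delta * sims) / sum(delta)
}

sims    <- c(0.8, 1.0, 0.0)   # per-attribute-group similarities
weights <- c(0.5, 0.3, 0.2)   # weights sum to 1
delta   <- c(1, 1, 0)         # third group is an asymmetric attribute absent in both points
combined_similarity(sims, weights, delta)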


Distance between sequences (String, TimeSeries)

In case each attribute represents an element of a sequence, we need a different way to measure the distance.  For example, if each data point is a string (a sequence of characters), then edit distance is a common measurement.  Basically, edit distance counts how many "modifications" (insert, modify, delete) are needed to change stringA into stringB.  It is usually calculated with a dynamic programming technique.
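In R, base adist() already implements this dynamic-programming edit distance:

adist("kitten", "sitting")   # 3 edits: substitute k->s, substitute e->i, insert g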

Time series are another example of sequence data.  Similar to the concept of edit distance, Dynamic Time Warping is about distorting the time dimension by adding data points to both time series such that the squared error between corresponding pairs is minimized.  Where to add these data points is solved using a similar dynamic programming technique.  Here is a very good paper that describes the details.
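Here is a minimal sketch of the dynamic programming behind Dynamic Time Warping in R, using squared error as the local cost and no windowing constraints (a dedicated package would normally be used for real work):

dtw_distance <- function(a, b) {
  n <- length(a); m <- length(b)
  D <- matrix(Inf, n + 1, m + 1)   # cumulative cost matrix
  D[1, 1] <- 0
  for (i in 1:n) {
    for (j in 1:m) {
      cost <- (a[i] - b[j])^2
      D[i + 1, j + 1] <- cost + min(D[i, j + 1],   # stretch series b
                                    D[i + 1, j],   # stretch series a
                                    D[i, j])       # move both forward
    }
  }
  D[n + 1, m + 1]
}

series1 <- c(1, 2, 3, 4, 3, 2, 1)
series2 <- c(1, 1, 2, 3, 4, 3, 2, 1)   # same shape, stretched slightly in time
dtw_distance(series1, series2)          # small despite the different lengths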

 

Distance between nodes in a network

In a homogeneous undirected graph (where nodes are of the same type), the distance between nodes can be measured by the shortest path between them.
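A small sketch of shortest-path distance in R, using a made-up adjacency matrix and a plain breadth-first search (a graph library such as igraph would normally be used instead):

adj <- matrix(0, 5, 5, dimnames = list(LETTERS[1:5], LETTERS[1:5]))
edges <- rbind(c("A","B"), c("B","C"), c("C","D"), c("A","E"))
for (i in seq_len(nrow(edges))) {
  adj[edges[i, 1], edges[i, 2]] <- 1
  adj[edges[i, 2], edges[i, 1]] <- 1   # undirected graph
}

bfs_distance <- function(adj, from, to) {
  dist <- setNames(rep(Inf, nrow(adj)), rownames(adj))
  dist[from] <- 0
  queue <- from
  while (length(queue) > 0) {
    node <- queue[1]; queue <- queue[-1]
    for (nbr in names(which(adj[node, ] == 1))) {
      if (is.infinite(dist[nbr])) {       # not visited yet
        dist[nbr] <- dist[node] + 1
        queue <- c(queue, nbr)
      }
    }
  }
  dist[to]
}

bfs_distance(adj, "A", "D")   # 3 hops: A -> B -> C -> D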

In a bipartite graph, there are two types of nodes, and each node only connects to nodes of the other type (e.g. people joining communities).  The similarity between nodes of the same type can be measured by analyzing how similar their connected communities are.

SimRank is an iterative algorithm that computes the similarity between nodes of one type by summing the similarities between all pairs of connected nodes of the other type, while the other type's similarities are computed in the same way.
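Here is a minimal sketch of bipartite SimRank in R, with a made-up people-to-community membership matrix and a decay constant of 0.8 (both are illustrative assumptions, not part of the original description):

M <- matrix(c(1, 1, 0,    # person P1 joins communities C1, C2
              1, 0, 1,    # person P2 joins communities C1, C3
              0, 1, 1),   # person P3 joins communities C2, C3
            nrow = 3, byrow = TRUE,
            dimnames = list(c("P1","P2","P3"), c("C1","C2","C3")))

simrank_bipartite <- function(M, C = 0.8, iterations = 10) {
  Sp <- diag(nrow(M))   # people-to-people similarity, start from identity
  Sc <- diag(ncol(M))   # community-to-community similarity
  for (it in 1:iterations) {
    Sp_new <- Sp; Sc_new <- Sc
    for (a in 1:nrow(M)) for (b in 1:nrow(M)) if (a != b) {
      Na <- which(M[a, ] == 1); Nb <- which(M[b, ] == 1)
      Sp_new[a, b] <- C / (length(Na) * length(Nb)) * sum(Sc[Na, Nb])
    }
    for (a in 1:ncol(M)) for (b in 1:ncol(M)) if (a != b) {
      Na <- which(M[, a] == 1); Nb <- which(M[, b] == 1)
      Sc_new[a, b] <- C / (length(Na) * length(Nb)) * sum(Sp[Na, Nb])
    }
    Sp <- Sp_new; Sc <- Sc_new
  }
  Sp
}

simrank_bipartite(M)   # pairwise similarity between the people nodes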


We can also use a probabilistic approach such as RandomWalk to determine the similarity.  Each people node passes a token (labeled with the person's name) to a randomly picked community node that it is connected to (weighted by the strength of connectivity).  Each community node propagates the received token back to a randomly picked person.  A person who receives a propagated token may drop the token (with probability beta) or propagate it again to a randomly chosen community.  This process continues until all the tokens have died out (since each has a chance of being dropped).  After that, we obtain the trace matrix and compute the similarity between people based on the dot product of the tokens they have received.


 

Distance between population distribution

Instead of measuring the distance between individual data points, we can also compare collections of data points (i.e. populations) and measure the distance between them.  In fact, an important part of statistics is to measure the distance between two groups of samples and see whether the "difference" is significant enough to conclude they come from different populations.

Let's say the population contains members that belong to different categories, and we want to measure whether population A and population B have the same or different proportions of members across these categories; then we can use Chi-Square or KL-Divergence to measure their distance.
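As a small sketch in R (the category counts below are made up):

countsA <- c(red = 30, green = 50, blue = 20)
countsB <- c(red = 25, green = 55, blue = 20)

# Chi-square test: do the two populations share the same category proportions?
chisq.test(rbind(countsA, countsB))

# KL-divergence of distribution A from distribution B
pA <- countsA / sum(countsA)
pB <- countsB / sum(countsB)
sum(pA * log(pA / pB))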

In case every member of the population has two different numeric attributes (e.g. weight and height), and we want to infer one attribute from the other when they are correlated, the correlation coefficient quantifies their degree of correlation: whether the two attributes move in the same direction (heavier people are taller), in opposite directions (heavier people are shorter), or independently.  The correlation coefficient ranges from -1 (negatively correlated) through 0 (no correlation) to 1 (positively correlated).

If the two attributes are categorical (rather than numeric), then mutual information is a common way to measure their dependency and gives a good sense of whether knowing the value of one attribute helps in inferring the other.

Now suppose two judges rank a collection of items and we are interested in the degree of agreement between their ranking orders.  We can use Spearman's rank coefficient to measure their degree of consensus in the ranking order.
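Here is a minimal sketch of these last three measures (correlation coefficient, mutual information, and Spearman's rank coefficient) in R; all data below is made up for illustration.

# Pearson correlation coefficient between two numeric attributes
weight <- c(60, 72, 80, 55, 90, 68)
height <- c(165, 175, 182, 160, 185, 172)
cor(weight, height)                       # between -1 and 1

# Mutual information between two categorical attributes, from their joint counts
tab <- table(c("a","a","b","b","b","a"), c("x","y","x","x","y","x"))
pxy <- tab / sum(tab)
px  <- rowSums(pxy); py <- colSums(pxy)
sum(pxy * log(pxy / outer(px, py)), na.rm = TRUE)

# Spearman's rank coefficient between two judges' rankings of the same items
judgeA <- c(1, 2, 3, 4, 5)
judgeB <- c(2, 1, 3, 5, 4)
cor(judgeA, judgeB, method = "spearman")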