
Sunday, August 17, 2014

Lambda Architecture Principles

"Lambda Architecture" (introduced by Nathan Marz) has gained a lot of traction recently.  Fundamentally, it is a set of design patterns for dealing with batch and real-time data processing workflows that fuel many organizations' business operations.  Although I don't think any novel ideas have been introduced, it is the first time these principles have been outlined in such a clear and unambiguous manner.

In this post, I'd like to summarize the key principles of the Lambda architecture, focusing more on the underlying design principles and less on the choice of implementation technologies, where my preferences may differ from Nathan's.

One important distinction of the Lambda architecture is its clear separation between the batch processing pipeline (ie: Batch Layer) and the real-time processing pipeline (ie: Real-time Layer).  Such separation provides a means to localize and isolate the complexity of handling data updates.  To handle real-time queries, the Lambda architecture provides a mechanism (ie: Serving Layer) to merge/combine data from the Batch Layer and Real-time Layer and return the latest information to the user.

Data Source Entry

At the very beginning, data flows in Lambda architecture as follows ...
  • Transaction data starts streaming in from the OLTP system during business operations.  Transaction data ingestion can be materialized as records in OLTP systems, text lines in app log files, incoming API calls, or events in a queue (e.g. Kafka)
  • This transaction data stream is replicated and fed into both the Batch Layer and the Real-time Layer
Here is an overall architecture diagram for Lambda.



Batch Layer

For storing the ground truth, the "Master dataset" is the most fundamental DB; it captures every basic event that happens.  It stores data in the most "raw" form (and hence the finest granularity) that can be used to compute any perspective at any given point in time.  As long as we maintain the correctness of the master dataset, every data view derived from it will automatically be correct.

Since maintaining the correctness of the master dataset is crucial, and to avoid the complexity of maintenance, the master dataset is "immutable".  Specifically, data can only be appended, while updates and deletes are disallowed.  By disallowing changes to existing data, it completely avoids the complexity of handling conflicting concurrent updates.

Here is a conceptual schema of how the master dataset can be structured.  The center green table represents the old, traditional way of storing data in an RDBMS.  The surrounding blue tables illustrate how the master dataset can be structured, with some key highlights (a minimal code sketch follows the list)
  • Data is partitioned by columns and stored in different tables.  Columns that are closely related can be stored in the same table
  • NULL values are not stored
  • Each data record is associated with a time stamp from which the record is valid
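To make this concrete, here is a minimal sketch in Python of an append-only, timestamped master dataset.  The field names and helper functions are hypothetical, not part of Nathan's design:

import time

# Each fact is an immutable, timestamped record.  A "change" to a user's
# address is simply a new record with a later timestamp; nothing is ever
# updated or deleted in place, and absent (NULL) values are simply not stored.
def make_record(entity_id, attribute, value, ts=None):
    return {
        "entity_id": entity_id,                       # e.g. a user id
        "attribute": attribute,                       # e.g. "address"
        "value": value,
        "timestamp": ts if ts is not None else time.time(),  # valid from this time
    }

master_dataset = []                                   # append-only log

def append(record):
    master_dataset.append(record)                     # the only allowed write

def latest_value(entity_id, attribute):
    # The current state is the matching record with the largest timestamp.
    candidates = [r for r in master_dataset
                  if r["entity_id"] == entity_id and r["attribute"] == attribute]
    return max(candidates, key=lambda r: r["timestamp"])["value"] if candidates else None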




Notice that every piece of data is tagged with the time stamp at which the data is changed (or more precisely, at which a change record representing the data modification is created).  The latest state of an object can be retrieved by extracting the version of the object with the largest time stamp.

Although the master dataset stores data at the finest granularity and therefore can be used to compute the result of any query, it usually takes a long time to perform such computation starting from the raw form.  To speed up query processing, various intermediate datasets (called Batch Views) that align more closely with the queries are generated periodically.  These batch views (instead of the original master dataset) are used to serve real-time query processing.

To generate these batch views, the "Batch Layer" uses a massively parallel, brute-force approach to process the original master dataset.  Notice that since data in the master dataset is timestamped, the candidate data could be identified simply as those records with a time stamp later than the last round of batch processing.  Nevertheless, although it is less efficient, the Lambda architecture advocates that at each round of batch view generation, the previous batch view should simply be discarded and the new batch view computed from scratch from the master dataset.  This simple-minded, compute-from-scratch approach has the good property of stopping error propagation (since errors cannot accumulate), but the processing is not optimized and may take a longer time to finish.  This can increase the "staleness" of the batch view.
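As a concrete (and purely hypothetical) illustration, the sketch below recomputes a simple batch view, total sales per product, from scratch on every run, assuming each master-dataset record is a sales transaction with product_id, amount and timestamp fields:

from collections import defaultdict

def recompute_batch_view(master_dataset, batch_start_ts):
    # Brute force, compute-from-scratch: ignore the previous batch view and
    # re-aggregate every record with a timestamp up to the start of this run.
    view = defaultdict(float)
    for record in master_dataset:
        if record["timestamp"] <= batch_start_ts:
            view[record["product_id"]] += record["amount"]
    return dict(view)    # this completely replaces the previous batch view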

Real time Layer

As discussed above, generating the batch view requires scanning a large volume of the master dataset, which can take a few hours.  The batch view will therefore be stale for at least the processing time duration (ie: between the start and end of the batch processing).  But the maximum staleness can be up to the time period between the end of this batch processing and the end of the next batch processing (ie: the batch cycle).  The following diagram illustrates this staleness.


Even during the period when the batch view is stale, business operates as usual and transaction data continues to stream in.  To answer the user's query with the latest, up-to-date information, these business transaction records need to be captured and merged into the real-time view.  This is the responsibility of the Real-time Layer.  To reduce the latency of making the latest information available to close to zero, the merge mechanism has to be done incrementally so that no batching delays the processing.  This makes the real-time view update very different from the batch view update, which can tolerate high latency.  The end goal is that the latest information that is not captured in the Batch view will be made available in the Real-time view.

The logic of the incremental merge on the Real-time view is application specific.  As a common use case, let's say we want to compute a set of summary statistics (e.g. mean, count, max, min, sum, standard deviation, percentile) of the transaction data since the last batch view update.  To update the sum, we can simply add the new transaction amounts to the existing sum and then write the new sum back to the real-time view.  To update the mean, we multiply the existing mean by the existing count, add the new transaction amounts, and then divide by the existing count plus the number of new transactions.  To implement this logic, we need to READ data from the Real-time view, perform the merge and WRITE the data back to the Real-time view.  This requires the real-time serving DB (which hosts the Real-time view) to support both random READ and WRITE.  Fortunately, since the real-time view only needs to store the data of at most one batch cycle, its scale is limited to some degree.
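A minimal sketch of that read-merge-write loop, with a plain Python dict standing in for the real-time serving DB (the key and field names are made up):

def merge_transaction(realtime_view, product_id, amount):
    # READ the current partial aggregates for this key
    stats = realtime_view.get(product_id, {"count": 0, "sum": 0.0, "mean": 0.0})

    # Merge the new transaction incrementally
    stats["count"] += 1
    stats["sum"] += amount
    stats["mean"] = stats["sum"] / stats["count"]

    # WRITE the updated aggregates back to the real-time view
    realtime_view[product_id] = stats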
Once the batch view update is completed, the real-time layer discards the data from the real-time serving DB that has a time stamp earlier than the batch processing.  This not only limits the data volume of the real-time serving DB, but also allows any data inconsistency (of the real-time view) to be cleaned up eventually.  This drastically reduces the need for a sophisticated multi-user, large-scale DB.  Many DB systems support multi-user random read/write and can be used for this purpose.

Serving Layer

The serving layer is responsible for hosting the batch view (in the batch serving database) as well as the real-time view (in the real-time serving database).  Due to very different access patterns, the batch serving DB has quite different characteristics from the real-time serving DB.

As mentioned above, while it is required to support efficient random reads over a large data volume, the batch serving DB doesn't need to support random writes because data is only bulk-loaded into it.  On the other hand, the real-time serving DB is incrementally (and continuously) updated by the real-time layer, and therefore needs to support both random reads and random writes.

To keep the batch serving DB updated, the serving layer needs to periodically check the batch layer's progress to determine whether a later round of batch view generation has finished.  If so, it bulk-loads the batch view into the batch serving DB.  After completing the bulk load, the batch serving DB contains the latest version of the batch view, and some data in the real-time view has expired and can therefore be deleted.  The serving layer orchestrates these processes.  This purge action is especially important to keep the real-time serving DB small, which limits the complexity of handling real-time, concurrent read/write.

To process a real-time query, the serving layer splits the incoming query into 2 different sub-queries and forwards them to both the batch serving DB and the real-time serving DB, applies application-specific logic to combine/merge their corresponding results, and forms a single response to the query.  Since the data in the real-time view and the batch view differ from a timestamp perspective, the combine/merge is typically done by concatenating the results together.  In case of any conflict (same time stamp), the one from the Batch view overwrites the one from the Real-time view.
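Continuing the hypothetical sales example from the sketches above, the merge step of the serving layer could look like this, assuming the batch view and the real-time view cover disjoint time ranges:

def query_total_sales(product_id, batch_view, realtime_view):
    # Sub-query 1: the pre-computed batch view (complete up to the last batch run)
    batch_total = batch_view.get(product_id, 0.0)

    # Sub-query 2: the real-time view covering only the data that arrived
    # after the last batch run started
    realtime_total = realtime_view.get(product_id, {}).get("sum", 0.0)

    # Because the two views cover disjoint time ranges, merging the partial
    # results is just a matter of combining (here, adding) them
    return batch_total + realtime_total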

Final Thoughts

By separating different responsibilities into different layers, the Lambda architecture can leverage different optimization techniques specifically designed for different constraints.  For example, the Batch Layer focuses on large-scale data processing using a simple, start-from-scratch approach without worrying about processing latency.  On the other hand, the Real-time Layer covers where the Batch Layer leaves off and focuses on low-latency merging of the latest information without needing to worry about large scale.  Finally, the Serving Layer is responsible for stitching together the Batch View and the Real-time View to provide the final complete picture.

The clear demarcation of responsibility also enables a different technology stack to be utilized at each layer and hence tailored more closely to the organization's specific business needs.  Nevertheless, using very different mechanisms to update the Batch view (ie: start-from-scratch) and the Real-time view (ie: incremental merge) requires two different algorithm implementations and code bases to handle the same type of data.  This can increase the code maintenance effort and can be considered the price to pay for bridging the fundamental gap between the "scalability" and "low latency" needs.

Nathan's Lambda architecture also introduces a set of candidate technologies that he has developed and used in his past projects (e.g. Hadoop for storing the Master dataset, Hadoop for generating the Batch view, ElephantDB for the batch serving DB, Cassandra for the real-time serving DB, Storm for generating the Real-time view).  The beauty of the Lambda architecture is that the choice of technologies is completely decoupled, so I intentionally do not describe any of their details in this post.  On the other hand, I have my own favorites, which are different and will be covered in my future posts.

Wednesday, October 7, 2009

Architecture Review Process

If you are lucky enough to still have the engineers who are knowledgeable about the current system around, then conducting a formal "Architecture Review Process" is probably the most efficient way to understand how the existing system works.

However, if these people have already left the company, then a different reverse engineering effort needs to be taken instead.

Participants
It usually involves a number of key people
  • A facilitator who orchestrates the whole review process and controls the level of depth at different stages
  • A recorder who documents key points, ideas, observations and outstanding issues throughout the review process
  • Key architects and developers who collectively understand the details of the legacy systems, and are able to get down to any level of detail (even code walk-throughs) if necessary.
  • A domain expert who understands every detail of how people use the system, what their current pain points are and what features will help them most. The domain expert also helps to set business priorities between conflicting goals.
Process

The architecture review process has the following steps.

Use Cases and Actors
From the business domain expert, we focus on the "actors" (who use the system) as well as the "use cases" (how they use it). We also look at the key metrics used to measure the efficiency of their tasks (e.g. how long it takes to complete the use case).


Activity Analysis
Drilling down from each use case, we identify the activities that each actor performs. We look at what data needs to be captured at each activity and how actors interact with each other as well as with the system.

At this point, we should establish a good external view of the system. Now we dig into the internals of it ...

Technology Stack
The purpose here is to understand the building blocks underlying the system and get a good sense of whether the build vs buy combination is right. Things like which programming language (Java vs .NET), which app server, what DB, ORM vs direct SQL, XML vs JSON, which IOC or AOP container, messaging framework ... etc need to be discussed. We also need to distinguish the core features (which you mostly want to build) from the non-core features (which you mostly want to leverage 3rd party code for). By the end of this exercise, we'll get a very good understanding of the foundation on which we write our code, and perhaps we can also identify certain areas where we can swap in 3rd party code.

Component Analysis
This is the portion where most of the time is spent. Here we dissect the whole system into components. It starts off with the architect highlighting a list of the major components of the current system. For each component, we look at
  • The responsibility of the component
  • The persistent data owned by the component and the life cycle of maintaining this data
  • The interface of the component
  • The threading model executing the logic of the component (ie: caller thread vs listening thread vs a newly spawned thread) as well as any concurrent access implications
  • The potential bottlenecks of this component and how we can remove them when they occur
  • How does the component scale along the growth of different dimensions (e.g. more users, more data, higher traffic rate ... etc)?
  • What is the impact if this component crashes? How does recovery happen?
It is important to know whether the component communicates across VM boundaries. If so,
  • What is the authentication and authorization mechanism ?
  • What is the message format being communicated ?
  • Is the data transfer in clear text or encrypted ? And where is the secret key being stored ?
  • What is the handshaking sequence (protocol) ?
  • Is the component stateful or stateless ?
Since we have already dived deep into the architecture, I usually take a further step and drill into the code by asking the developers to walk me through the code of some key components. This usually gives me a good sense of the code quality: whether the code is easy to read, whether the logic is easy to follow, whether a method has too many lines of code, whether there is duplicated logic scattered around ... etc, and more importantly, whether there are sufficient unit tests around.

Maintainability Analysis
This focuses on the ongoing maintenance of the architecture, things like ...
  • Is the system sufficiently instrumented for monitoring purpose ?
  • When the problem happens, is there enough trace around to quickly identify what went wrong ?
  • Can the system continue working (with some tolerable degradation) when some components fail ?
Extensibility Analysis
Understand the parameters that affect the behavior of the system. When different scenarios of change happen, how much code needs to change to accommodate them? Or can the system still serve by just changing configurable parameters? For example, does the system hard-code business rules or use some kind of rule engine? Does the system hard-code the business flow or use some kind of workflow system? What if the system needs to serve a different UI device (like mobile devices)?

Output

The output of an architecture review process is typically
  • A set of documents/diagrams of the key abstractions that gives a better view of the overall system. These documents should help a newcomer get up to speed quicker as well as communicate the architecture to a broader audience.
  • A set of recommended action plans on what can be done to improve the current architecture.

Sunday, September 27, 2009

Reinforcement Learning

Reinforcement Learning (RL) is a type of Machine Learning distinct from "supervised learning" (which has a teaching phase that shows the learner the mapping between inputs and correct answers) and "unsupervised learning" (which discovers clusters and outliers from a set of input samples).

In RL, consider that there exists a set of "states" (from the environment) in which the agent is going to make a decision about which action to take, and this action will cause it to transfer to a different state. A reward is assigned to the agent after this state transition. During the RL process, the agent's goal is to go through a trial and error process to learn what the optimal decision at each state would be such that the overall reward is maximized.

The hard part of RL is knowing which action has a long term effect on the final outcome. For example, a wrong decision may not have an immediate bad result and therefore may be hidden. RL is about how to assign blame to previous decisions when a bad outcome has been detected.

Basic Iteration Approach
There is a reward matrix where each row represents the from-state and each column represents the to-state. The cell (i, j) represents the "immediate reward" obtained when moving from state i to state j.

The goal is to find an optimal policy which recommends the action that should be taken at each state in order to maximize the sum of reward.

We can use a value vector whose element (i) represents the agent's perception of the overall gained reward if it is at state (i). At the beginning, the value vector is set to some initial value (e.g. all zeros). We use the following iterative approach to modify the value vector until it converges.

def learn_value_vector
  current_state = initial_state
  set value_vector to all zeros
  repeat until value_vector.converges
    # Need to enumerate all reachable next states
    for each state(j) reachable from current state(i)
      take the action to reach next state(j)
      collect reward(i, j)
      action_value(j) =
        reward(i, j) + discount * value_vector(j)
    end
    # Since you will always take the path of max overall reward
    value_vector(i) = max_over_j(action_value(j))
    current_state = state(j) that gives the max action_value
  end
end
After we figure out this value vector, deriving the policy is straightforward. At each state we just need to look across the values of all subsequent next states and pick the one with the highest value.

def learn_policy1
  for each state(i)
    best_transition = nil
    max_value = 0
    for each state(j) reachable from state(i)
      if value(j) > max_value
        best_transition = j
        max_value = value(j)
      end
    end
    policy(i) = best_transition
  end
end

One problem with this approach is that it requires us to try out all possible actions and evaluate all the rewards to the next states. So there is an improved iterative approach, described below.

Q-Learning
In Q-Learning, we use a Q matrix instead of the value vector. Instead of estimating the value of each state, we estimate the value of each transition from the current state. In other words, we associate the value with the (state, next-state) pair instead of just the state.

Therefore, the cell (i, j) of the Q matrix represents the agent's perceived value of the transition from state(i) to state(j). We use the following iterative approach to modify the Q matrix until it converges.

def learn_q_matrix
  current_state = initial_state
  set Q matrix to all zeros
  repeat until Q matrix converges
    select the next state(j) randomly
    collect reward(i, j)
    value(j) = max of Q(j, k) across all k
    Q(i, j) = reward(i, j) + discount * value(j)
    current_state = state(j)
  end
end
After figuring out the Q matrix, the policy at state (i) is simply to pick the state(j) which has the max Q(i, j) value.

def learn_policy2
  for each state(i)
    best_transition = nil
    max_value = 0
    for each state(j) reachable from state(i)
      if Q(i, j) > max_value
        best_transition = j
        max_value = Q(i, j)
      end
    end
    policy(i) = best_transition
  end
end
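For readers who prefer something runnable, here is a minimal Q-learning sketch in Python on a tiny, made-up 3-state reward matrix (the states, rewards, discount factor and iteration count are purely illustrative):

import random

# reward[i][j] is the immediate reward for moving from state i to state j
# (None means the transition is not allowed)
reward = [[None, 0,    10],
          [0,    None, 10],
          [0,    0,    None]]
discount = 0.8
n_states = len(reward)

# Q[i][j] is the agent's perceived value of the transition i -> j
Q = [[0.0] * n_states for _ in range(n_states)]

for _ in range(1000):                        # crude stand-in for "repeat until converges"
    i = random.randrange(n_states)           # start from a random state
    reachable = [j for j in range(n_states) if reward[i][j] is not None]
    j = random.choice(reachable)             # explore a random transition
    value_j = max(Q[j])                      # best value achievable from state j
    Q[i][j] = reward[i][j] + discount * value_j

# The learned policy at state i simply picks the transition with the largest Q value
policy = {i: max(range(n_states), key=lambda j: Q[i][j]) for i in range(n_states)}
print(policy)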

The relationship between the action (what the agent does) and the state transition (what new state it ends up in) doesn't necessarily have to be deterministic. In real life, the action and its effect are usually probabilistic rather than deterministic (e.g. if you leave your house early, you are more likely to reach your office earlier, but it is not guaranteed). Imagine a probabilistic state transition diagram, where each action has multiple branches leading to each possible next state, with a probability assigned to each branch. Making decisions in this model is called a Markov Decision Process.

The Q-learning approach described above also works for a Markov Decision Process.

For some good articles on RL, see:
Reinforcement Learning: A tutorial
Q-learning by examples


Friday, May 8, 2009

Machine Learning: Linear Model

Linear Model is a family of model-based learning approaches that assume the output y can be expressed as a linear algebraic relation of the input attributes x1, x2 ...

Here our goal is to learn the parameters of the underlying model, which are the coefficients.

Linear Regression

Here the input and output are both real numbers, related through a simple linear relationship. The learning goal is to figure out the hidden weight values (ie: the W vector).

Given a batch of training data, we want to figure out the weight vector W such that the total sum of error (the difference between the predicted output and the actual output) is minimized.


Instead of using the batch processing approach, a more effective approach is to learn incrementally (updating the weight vector for each input data point) using a gradient descent approach.

Gradient Descent

Gradient descent is a very general technique that we can use to incrementally adjust the parameters of the linear model. The basic idea of "gradient descent" is to adjust each dimension (w0, w1, w2) of the W vector according to its contribution to the squared error. That contribution is measured by the gradient along the dimension, which is the derivative of the squared error with respect to w0, w1, w2.

In the case of Linear Regression ...
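Since the original formula image is not reproduced here, the following sketch shows the standard incremental (stochastic) gradient-descent update for linear regression under the squared-error loss; the learning rate, epoch count and toy data are made up for illustration:

def sgd_linear_regression(data, learning_rate=0.05, epochs=1000):
    # data is a list of (x, y) pairs where x is a list of input attributes
    n_features = len(data[0][0])
    w = [0.0] * (n_features + 1)              # w[0] is the bias term

    for _ in range(epochs):
        for x, y in data:
            xs = [1.0] + list(x)              # prepend 1 for the bias
            prediction = sum(wi * xi for wi, xi in zip(w, xs))
            error = y - prediction
            # Adjust each weight along its contribution to the squared error
            w = [wi + learning_rate * error * xi for wi, xi in zip(w, xs)]
    return w

# Example: learn y = 2*x + 1 from a few noise-free points
print(sgd_linear_regression([([1], 3), ([2], 5), ([3], 7)]))   # close to [1.0, 2.0]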


Logistic Regression

Logistic Regression is used when the output y is binary rather than a real number. The first part is the same as linear regression, while a second step applies a sigmoid function to clamp the output value between 0 and 1.

We use the exact same gradient descent approach to determine the weight vector W.

Neural Network

Inspired by how our brain works, a neural network organizes many logistic regression units into layers of perceptrons (each unit has both inputs and outputs in binary form).

Learning in a neural network is about discovering all the hidden values of w. In general, we use the same technique above to adjust the weights using gradient descent, layer by layer. We start from the output layer and move towards the input layer (this technique is called backpropagation). Except for the output layer, we don't exactly know the error at the hidden layers, so we need a way to estimate it.

But notice there is a symmetry between the weights and the inputs, so we can use the same technique we use to adjust the weights to estimate the error of the hidden layers.



Support Vector Machine

Tuesday, May 5, 2009

Machine Learning: Nearest Neighbor

This is the simplest technique for instance-based learning. Basically, find a previously seen data point that is "closest" to the query data point, and then use its output for prediction.

The concept of "close" is defined by a distance function: dist(A, B) gives a quantity which needs to obey the triangle inequality,
ie: dist(A, B) + dist(B, C) >= dist(A, C)

Defining the distance function can be domain specific. One popular generic distance function is to use the Euclidean distance.
dist(A, B) = square_root(sum_over_i(square(xai - xbi)))

In order to give each attribute the same degree of influence, you need to normalize their scales into the same range. You also need to figure out a way to compute the difference between categorical values (ie: whether "red" is more similar to "blue" or "green"). A common approach is to see whether "red" and "blue" affect the output value in a similar way. If both colors have a similar probability distribution across each output value, then we consider the two colors similar.

Therefore you need to transform the attributes xi.
  • Normalize their scale: transform xi = (xi - mean) / std-deviation
  • Quantify categorical data: if xi is categorical, then (xai - xbi) = sum_over_k( |P(class[k] | xai) – P(class[k] | xbi)| )
Nearest neighbor is sensitive to outliers: if you have a few abnormal data points, query points around these outliers will be wrongly estimated. One solution is to use multiple nearest neighbors and combine their outputs in a certain way. This is known as KNN (k-nearest-neighbor). If the problem is classification, every neighbor casts a vote with a weight inversely proportional to its "distance" from the query point, and the majority wins. If the problem is regression, the weighted average of their outputs is used instead.
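Below is a minimal KNN classifier sketch in Python following the description above; the toy data is made up and the attributes are assumed to be normalized already:

import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(query, training_data, k=3):
    # training_data is a list of (attributes, label) pairs
    scored = sorted(((euclidean(query, attrs), label) for attrs, label in training_data),
                    key=lambda pair: pair[0])

    # Each of the k nearest neighbors casts a vote weighted by inverse distance
    votes = defaultdict(float)
    for distance, label in scored[:k]:
        votes[label] += 1.0 / (distance + 1e-9)     # avoid division by zero
    return max(votes, key=votes.get)

training = [([1.0, 1.0], "red"), ([1.2, 0.9], "red"),
            ([5.0, 5.1], "blue"), ([4.8, 5.3], "blue")]
print(knn_classify([1.1, 1.0], training, k=3))      # likely "red"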

Execution Optimization

One problem of instance-based learning is that you need to store all previously seen data and also compute the distance of the query point to each of them. Both time and space complexity to serve a single query are O(M * N) where M is the number of dimensions and N is the number of previous data points.

Instead of computing the distance between the query point and each of the existing data points, you can organize the existing points into a KD Tree based on the distance function. The KD Tree has the property that the max distance between two nodes is bounded by the level of their common parent.

Using the KD Tree, you navigate the tree starting at the root node. Basically, you calculate the dist(current_node, query_point)

and each of the child nodes of the current_node
dist(child_j, query_point)

Then find the minimum of them; if the minimum is one of the children, you navigate down the tree by setting current_node to this child and repeat the process. You terminate if the current_node is the minimum, or when there are no more child nodes.

After terminating at a particular node, this node is pretty close to the query point. You then need to explore the surrounding nodes (its siblings, its siblings' children, its parent's siblings) to locate the K nearest neighbors.

By using a KD Tree, the time complexity depends on the depth of the tree and is hence of order O(M * log N).
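If you don't want to build the tree yourself, libraries such as SciPy ship a ready-made KD tree; here is a small usage sketch (with random made-up points):

import numpy as np
from scipy.spatial import KDTree

points = np.random.rand(1000, 3)              # 1000 previously seen points in 3 dimensions
tree = KDTree(points)                         # build the KD tree once

query = np.array([0.5, 0.5, 0.5])
distances, indices = tree.query(query, k=5)   # find the 5 nearest neighbors
print(indices, distances)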

Note that KD Tree is not effective when the data has high dimensions (> 6).

Another way is to throw away some of the previously seen data if they won't affect the prediction result (especially effective for classification problems: you can just keep the data at the boundary between two different output values and throw away the interior points of a cluster of data points that all have the same output value). However, if you are using KNN, then throwing away some points may change the result. So a general approach is to verify that the previously seen data is still correctly predicted after throwing out various combinations of points.

Recommendation Engine

A very popular application of KNN is the recommendation engine of many e-commerce web sites, using a technique called "Collaborative Filtering". E.g. an online user has purchased a book; the web site looks at other "similar" users to see what other books they have bought and recommends those to the current user.

First of all, how do we determine what attributes of the users should be captured? This is a domain-specific question because we want to identify those attributes that are most influential. Maybe we can use static information such as the user's age, gender, city ... etc. But here let's use something more direct ... the implicit transaction information (e.g. if the user has purchased a book online, we know that he likes that book) as well as explicit rating information (e.g. the user rates a book he bought previously so we know whether he/she likes the book or not).

Let's use a simple example to illustrate the idea. Here we have a number of users who rate a set of movies. The ratings range from [0 - 10] where 0 means hates it and 10 means extremely likes it.


The next important thing is to define the distance function. We don't want to use the rating directly, for the following reasons.

Some nice users give an average rating of 7 while some tough users give an average rating of 5. On the other hand, the range of ratings of some users is wide while that of other users is narrow. However, we don't want these factors to affect our calculation of user similarity. We consider two users to have the same taste as long as they rate the same movie above or below their own average by the same percentage of their rating range. Two users have different tastes if they rate the movies in different directions.

Let's use rating_i_x to denote user_i's rating on movie_x.

We can use the correlation coefficient to capture this.

sum_of_product =
sum_over_x( (rating_i_x - avg(rating_i)) * (rating_j_x - avg(rating_j)) )

If this number is +ve, then user_i and user_j are moving in the same direction. If this number is -ve, then they are moving in opposite direction (negatively correlated). If this number is zero, then they are uncorrelated.

We also need to normalize by the range of each user's ratings, so we compute
root_of_square_sum_i =
square_root(sum_over_x( (rating_i_x - avg(rating_i)) ** 2 ))
root_of_square_sum_j =
square_root(sum_over_x( (rating_j_x - avg(rating_j)) ** 2 ))

Define Pearson Coefficient = sum_of_product / (root_of_square_sum_i * root_of_square_sum_j)

We use the Pearson Coefficient to quantify the "similarity" between 2 users.
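A minimal sketch of this user-similarity computation in Python (the ratings are made up, and only movies rated by both users are compared):

import math

def pearson_similarity(ratings_i, ratings_j):
    # ratings are dicts mapping movie -> rating
    common = [m for m in ratings_i if m in ratings_j]
    if not common:
        return 0.0

    avg_i = sum(ratings_i.values()) / len(ratings_i)
    avg_j = sum(ratings_j.values()) / len(ratings_j)

    sum_of_product = sum((ratings_i[m] - avg_i) * (ratings_j[m] - avg_j) for m in common)
    norm_i = math.sqrt(sum((ratings_i[m] - avg_i) ** 2 for m in common))
    norm_j = math.sqrt(sum((ratings_j[m] - avg_j) ** 2 for m in common))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return sum_of_product / (norm_i * norm_j)

user_a = {"movie1": 8, "movie2": 3, "movie3": 9}
user_b = {"movie1": 9, "movie2": 2, "movie3": 10}
print(pearson_similarity(user_a, user_b))     # close to 1.0: very similar taste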

We could also use negatively correlated users to make recommendations. For example, if user_i and user_j are negatively correlated, then we can recommend the movies that user_j hates to user_i. However, this seems to be a bit risky, so we are not doing it here.

Monday, May 4, 2009

Machine Learning: Probabilistic Model

The probabilistic model is a very popular approach to "model-based learning" based on Bayesian theory. Under this approach, all input attributes are binary (for now) and the output is categorical.

Here, we are given a set of data with the structure [x1, x2 …, y] (in this case y is the output). The learning algorithm will learn (from the training set) how to predict the output y for future, unseen data.

We assume there exists a hidden probability distribution from the input attributes to the output. The goal is to learn this hidden distribution and apply it to the input attributes of a later encountered query point to pick the class that has the maximum probability.

Making Prediction

Let's say the possible values of the output y are {class_1, class_2, class_3}. Given input [x1, x2, x3, x4], we need to compute the probability of each output class_j, and predict the one which has the highest value:
max {P(class_j | observed_attributes)}

According to Bayes theorem, this value is equal to …
max { P(observed_attributes | class_j) * P(class_j) / P(observed_attributes) }

The denominator is the same for all class_j, so we can ignore it; we just need to find
max { P(observed_attributes | class_j) * P(class_j) }

P(class_j) is easy to find; we just calculate
P(class_j) = (samples_of_class_j / total_samples)

Now, lets look at the other term, P(observed_attributes | class_j), from Bayesian theory
P(x1 ^ x2 ^ x3 ^ x4 | class_j) =
P(x1 | class_j) *
P(x2 | class_j ^ x1) *
P(x3 | class_j ^ x1 ^ x2) *
P(x4 | class_j ^ x1 ^ x2 ^ x3)

Learning the probability distribution

In order to provide all the above terms for the prediction, we need to build the probability distribution model by observing the training data set.

Notice that finding the last term is difficult. Assuming x1, x2 ... are binary, there can be 2 ** (m – 1) possible situations to observe. It is very likely that we haven’t seen enough situations in the training data, in which case this term will be zero for all class_j. So a common solution is to start with count = 1 for all possible combinations and increase the count when we see an occurrence in the training data.


Bayesian Network

A Bayesian Network is based on the fact that we know certain attributes are clearly independent. By applying this domain knowledge, we draw a dependency graph between attributes. The probability of occurrence of a particular node only depends on the occurrence of its parent nodes and nothing else. To be more precise, nodeA and nodeB (which are not related by a parent-child relationship) don't need to be completely independent; they just need to be independent given their parents.

In other words, we don't mean P(A|B) = P(A),
we just need P(A | parentsOfA ^ B) = P(A | parentsOfA)
Therefore P(x4 | class_j ^ x1 ^ x2 ^ x3) = P(x4 | class_j ^ parentsOf_x4)

We only need to find 2 ** p situations of the occurrence combinations of the parent nodes, where p is the number of parent nodes.


Naive Bayes

Naive Bayes takes this a step further by assuming every attribute is conditionally independent of the others given the output class.


Note that x1, x2, x3, x4 can be categorical as well. For example, if x1 represents zip code, then P('95110' | class_j) is the same as P(x1 = '95110' | class_j).

What if x1 is a continuous variable? (say the height of a person). The challenge is that a continuous variable has infinitely many possible values, such that the chance of seeing a particular one in the training data is almost zero.

One way to deal with a continuous variable is to discretize it. In other words, we can partition x1 into buckets, each with an associated range, and assign x1 to the corresponding bucket if it falls into that range.

Another way is, for each output class_j, to presume a distribution function for x1 (let's say we pick the normal distribution function). The purpose of the training phase is then to figure out the parameters of this distribution function (in this case, the mean and standard deviation).

In other words, we use the subset of training data whose output class is class_j. Within this subset, we compute the mean and standard deviation of x1 (the height of the person). Later on, when we want to compute P(x1 = 6.1 feet | class_j), we just apply the distribution function (plugging in the learned mean and standard deviation).
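A small sketch of this idea, assuming a normal distribution for the continuous attribute (the training heights below are made up):

import math

heights_in_class_j = [5.8, 6.0, 6.2, 5.9, 6.1]    # heights of training samples in class_j

mean = sum(heights_in_class_j) / len(heights_in_class_j)
variance = sum((h - mean) ** 2 for h in heights_in_class_j) / len(heights_in_class_j)
std_dev = math.sqrt(variance)

def normal_pdf(x, mean, std_dev):
    return (1.0 / (std_dev * math.sqrt(2 * math.pi))) * \
           math.exp(-((x - mean) ** 2) / (2 * std_dev ** 2))

# P(x1 = 6.1 feet | class_j), approximated by the learned normal distribution
print(normal_pdf(6.1, mean, std_dev))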

Note that the choice of the form of the distribution function is quite arbitrary here and may simply be wrong. So it is important to analyze how x1 affects the output class in order to pick the right distribution function.


Spam Filter Application

Lets walk through an application of the Naive Bayes approach. Here we want to classify a particular email to determine whether it is spam or not.

The possible value of output y is {spam, nonspam}

Given an email: "Hi, how are you doing ?", we need to find ...
max of P(spam | mail) and P(nonspam | mail), which is same as ...
max of P(mail | spam) * P(spam) and P(mail | nonspam) * P(nonspam)

Let's focus on P(mail | spam) * P(spam)

We can view mail as an array of words [w1, w2, w3, w4, w5]
P(mail | spam) =
P(w1='hi' | spam) *
P(w2='how' | spam ^ w1='hi') *
P(w3='are' | spam ^ w1='hi' ^ w2='how') *
...
We make some naive assumptions here
  • Chance of occurrence is independent of preceding words. In other words, P(w2='how' | spam ^ w1='hi') is the same as P(w2='how' | spam)
  • Chance of occurrence is independent of word position. P(w2='how' | spam) is the same as (number of 'how' in spam mail) / (number of all words in spam mail)
With these assumptions ...
P(mail | spam) =
(hi_count_in_spam / total_words_in_spam) *
(how_count_in_spam / total_words_in_spam) *
(are_count_in_spam / total_words_in_spam) *
...
What if we haven't seen the word "how" in the training data? Then the probability becomes zero. Here we adjust the terms such that a reasonable probability is assigned to unseen words.
P(mail | spam) =
((hi_count_in_spam + 1) / (total_words_in_spam + total_vocabulary)) *
((how_count_in_spam + 1) / (total_words_in_spam + total_vocabulary)) *
((are_count_in_spam + 1) / (total_words_in_spam + total_vocabulary)) *
...

What we need is the word count per word/class combination as well as the word count per class. This can be done by feeding a large number of training sample mails labeled with "spam" or "nonspam" into a learning process.
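Here is a minimal sketch of such a spam classifier in Python, using the per-class word counts and the same +1 smoothing described above (the tiny training corpus is made up). In practice you would sum log probabilities instead of multiplying raw probabilities, to avoid numerical underflow on long mails.

from collections import Counter

def train(mails):
    # mails is a list of (text, label) pairs, label in {"spam", "nonspam"}
    word_counts = {"spam": Counter(), "nonspam": Counter()}
    class_counts = Counter()
    vocabulary = set()
    for text, label in mails:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return word_counts, class_counts, vocabulary

def score(text, label, word_counts, class_counts, vocabulary):
    total_words = sum(word_counts[label].values())
    p = class_counts[label] / sum(class_counts.values())          # P(class)
    for word in text.lower().split():
        # +1 smoothing so unseen words don't drive the probability to zero
        p *= (word_counts[label][word] + 1) / (total_words + len(vocabulary))
    return p

training = [("win money now", "spam"), ("cheap money offer", "spam"),
            ("hi how are you doing", "nonspam"), ("are you free for lunch", "nonspam")]
wc, cc, vocab = train(training)

mail = "hi how are you doing"
print(max(["spam", "nonspam"], key=lambda c: score(mail, c, wc, cc, vocab)))   # "nonspam"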

The learning process can also be done in parallel using two rounds of Map/Reduce.


Alternatively, we can also update the counts incrementally as new mail arrives.

Saturday, May 2, 2009

Machine Learning Intuition

As more and more user data is gathered on different web sites (such as e-commerce and social networks), data mining / machine learning techniques become an increasingly important tool to analyze the data and extract useful information out of it.

There is a wide variety of machine learning applications, such as …
  • Recommendation: After you buy a book at Amazon or rent a movie from Netflix, they recommend other items that you may be interested in
  • Fraud detection: To protect its buyers and sellers, an auction site like eBay detects abnormal patterns to identify fraudulent transactions
  • Market segmentation: Product companies divide their market into segments of similar potential customers and design specific marketing campaigns for each segment
  • Social network analysis: By analyzing the user’s social network profile data, a social networking site like Facebook can categorize its users and personalize their experience
  • Medical research: Analyzing DNA patterns, cancer research, diagnosing problems from symptoms
However, machine learning theory involves a lot of math, which is non-trivial for people who don’t have a rigorous math background. Therefore, I am trying to provide an intuitive perspective behind the math.

General Problem

Each piece of data can be represented as a vector [x1, x2, …] where xi are the attributes of the data.

Such attributes can be numeric or categorical (e.g. age is a numeric attribute and gender is a categorical attribute).

There are basically 3 branches of machine learning ...

Supervised learning
  • The main use of supervised learning is to predict an output based on a set of training data. A set of data with the structure [x1, x2 …, y] is presented (in this case y is the output). The learning algorithm will learn (from the training set) how to predict the output y for future, unseen data.
  • When y is numeric, the prediction is called regression. When y is categorical, the prediction is called classification.

Unsupervised learning
  • The main use of unsupervised learning is to discover unknown patterns within data. (e.g. grouping similar data, or detecting outliers).
  • Identifying clusters is a classical scenario of unsupervised learning

Reinforcement learning
  • This is also known as “continuous learning” where the final output is not given. The agent chooses an action based on its current state and is then presented with a reward. The agent learns how to maximize its reward and comes up with a model called an “optimal policy”. A policy is a mapping from “state” to “action” (given that I am at a particular state, what action should I take).

Data Warehouse

Data warehousing is not “machine learning”; it is basically a special way to store your data so that it can easily be grouped in many ways for doing analysis in a manual way.

Typically, data is created by OLTP systems which run the company’s business operations. OLTP captures the “latest state” of the company. Data is periodically snapshotted into the data warehouse for OLAP; in other words, the data warehouse adds a time dimension to the data.

There is an ETL process that extracts data from various sources, cleanses the data, transforms it into the form needed by the data warehouse and then loads it into the data cube.

The data warehouse typically organizes the data as a multi-dimensional data cube based on a "Star schema" (1 fact table + N dimension tables). Each cell contains aggregate data along different (combinations of) dimensions.


OLAP processing involves the following operations (a small code sketch follows the list)
  • Rollup: Aggregate data within a particular dimension (e.g. for the “time” dimension, you can “rollup” the aggregation from “month” into “quarter”)
  • Drilldown: Break down the data within a particular dimension (e.g. for the “time” dimension, you can “drilldown” from “months” into “days”)
  • Slice: Cut a layer out of a particular dimension (e.g. look at all data at “Feb”)
  • Dice: Select a sub data cube (e.g. look at all data at “Jan” and “Feb” as well as products “laptop” and “hard disk”)
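A tiny sketch of these four operations using pandas on a made-up fact table (the column names are hypothetical):

import pandas as pd

fact = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb"],
    "quarter": ["Q1",  "Q1",  "Q1",  "Q1"],
    "product": ["laptop", "hard disk", "laptop", "hard disk"],
    "sales":   [100, 40, 120, 30],
})

# Rollup: aggregate the time dimension from month up to quarter
print(fact.groupby("quarter")["sales"].sum())

# Drilldown: break the data down to (month, product) granularity
print(fact.groupby(["month", "product"])["sales"].sum())

# Slice: cut one layer out of the time dimension (all data at "Feb")
print(fact[fact["month"] == "Feb"])

# Dice: select a sub data cube (Jan/Feb crossed with laptop/hard disk)
print(fact[fact["month"].isin(["Jan", "Feb"]) & fact["product"].isin(["laptop", "hard disk"])])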
The data warehouse can be further diced into specific “data marts” that focus on different areas for further drilldown analysis.

Some Philosophy

To determine the output from a set of input attributes, one way is to study the physics behind them and write a function that transforms the input attributes into the output. However, what if the relationship is unknown? Or the relationship hasn’t been formally specified?

Instead of being based on a sound theoretical model, machine learning tries to make predictions based on previously observed data. There are 2 broad types of learning strategies.

Instance-based learning
  • Also known as lazy learning, the learner remembers all the previously seen examples. When a new piece of input data arrives, it tries to find the best-matched data it has previously seen and uses its output to predict the output of the new data. It has an underlying assumption that if two pieces of data are “similar” in their input attributes, their outputs are also similar.
  • Nearest neighbor is a classical approach for instance-based learning

Model-based learning
Model-based learning is eager learning: it learns a generalized model upfront, whereas lazy learning learns from seen examples at query time. Instead of learning a generic model that fits all observed data, lazy learning can focus its analysis close to the query point. However, getting a comprehensible model with lazy learning is harder, and it also requires a large amount of memory to store all seen data.

Wednesday, October 31, 2007

How bad code is formed

Ignoring the cases where consultants do this purposely to keep their contracts, in general no one intentionally creates unmaintainable code. However, bad code silently creeps in over the period of the product's life cycle.

Manager's mind

Let's write the code quickly. We don't have enough time for the next rollout. If we cannot roll out, the project may die quickly. Let's make long-term maintainability a secondary priority, which we'll improve after the rollout when we have time. By the way, who knows if I'll still be working on this project after a year.
  • Unfortunately, if this rollout is successful, the next release cycle has an even tighter schedule
  • Code is "written once" and "read several hundred times" throughout its lifecycle. The subsequent maintenance cost is several hundred times more than the gain we made in its initial rollout.
I cannot justify any resources spent on refactoring because it doesn't add any features to my next release. The code is working already. You mention improved code maintainability and I don't know how to quantify that.
  • This is how bad code persists and accumulates. After its momentum has grown to a certain point, everyone just accepts the reality that the code is already bad enough that there is no motivation to improve it.
Let's have the developers focus on writing the real code. We'll have the QA folks worry about the testing. This way, they can proceed at full speed in parallel.

Developer's mind

Similar logic is working elsewhere but it doesn't fit exactly what I want. It doesn't provide enough features for my needs and it also provides some features that I don't need. It is not worthwhile to spend my time reading and understanding how it works; let's re-implement my version from scratch.
  • This is how duplicated logic gets into the system. There are many ways to do one thing.
Similar logic is already working elsewhere. It is not worthwhile to spend my time reinventing the wheel. Let's copy it here and modify it for my needs.
  • This is another way duplicated code gets into the system. When you change the code in one place, you need to make corresponding changes in many other places where you have made copies, otherwise some copies may not work correctly.
  • Due to "laziness", variable names are kept the same in the copies, making the copied code even harder to understand.
This code is not broken, please don't make any changes.
  • This is how bad code persists and accumulates. After its momentum has grown to a certain point, everyone just accepts the reality that the code is already bad enough that there is no motivation to improve it.

Of course the code is hard to read because it is solving a complex problem. If the code is easy to read, it must not be doing anything significant. You just need to be smarter when reading the code.

I am not convinced that this is the right way of doing this kind of thing. Therefore, I am not going to follow the convention even though it has been done in many other places. I am going to do it my way.

I don't know why I have to do it this way. But since this is the way we do this kind of things in many other places, I just follow that convention. Is "consistency" a problem ?

Coding for maintainability

One important criterion for judging the value of any serious software product is how maintainable it is over its life cycle. It is important to notice that code is "written once" but "read many times" when people subsequently maintain it (enhancing features, fixing bugs, optimizing performance). Any cost spent in understanding any non-trivial code will be multiplied many times over and added to the overall maintenance budget of the software product.

It is therefore important to spend the effort to make the code as readable as possible. Once you do, you will start to collect your dividends over the product's life cycle.

Lets set some criteria/characteristics of code quality from a maintainability standpoint.

Good code

Just by looking at its signature, you can guess what a class/method/function is doing without looking at its implementation. Very descriptive names for the class/method/function and its parameters/variables are carefully chosen such that their associated semantics are very clear. When you open up its implementation, it is doing exactly what you guessed.

Since the naming is so descriptive, it doesn't need a lot of comments to make the code readable. Comments only appear for code segments that implement a complex algorithm or optimize the performance of hotspots (where code readability is intentionally sacrificed).

The responsibility of a class/method/function is very clear. There is only one way to do a particular thing and it is well-encapsulated within the responsible class/method/function. There is no duplicated code.

Responsibility tends to be distributed evenly so you don't have a package with too many classes or a class with too many instance variables/methods. And you won't have a method that spans more than 2 pages of code.

Bad code

Similar code shows up in many places, and it is hard to know which of these are "similar" and which are "exactly the same".

Extremely long methods (a couple of hundred lines of code within a method), usually accompanied by deeply nested case statements.

Only trivial comments, and a lack of useful comments.

There are many different ways to do one particular thing although they are supposed to achieve the same result

To do a particular thing, the caller needs to make multiple calls to different objects in a certain sequence.

The class/method/function seems to imply certain responsibilities, but when you open up its implementation, you find it is doing something completely different. You quickly lose confidence in your understanding and have to trace down into all the implementation details to see what it is actually doing in every place.

There are some places in the code where you cannot figure out what it is trying to do


In my next blog, I'll discuss how bad code is formed.

Thursday, October 25, 2007

Understand Legacy Code

The first step in supporting a legacy system is to understand the code. One approach is to read through all the lines of code, but quite often this is ineffective: you may waste a lot of time reading dead code.

What I typically do is quickly scan through the package structure of the code, the class names and the public methods. If the code structure is good, then I can get a rough idea about where the logic lies. However, I rely more on analyzing the runtime behavior of the code to determine its logic flow.

A technique that I use frequently is to just run the system and determine the execution path. As a first step, I need to determine the entry points of the system.

Locate the entry points

Usually, the entry points are ...
  • main() method where the system boots
  • background maintenance tasks
  • event-driven callbacks
Finding these entry points is non-trivial, especially if the documentation of the overall system is poor. I have built a tool that allows me to instrument the classes that I am interested in, so that when I start the system, it prints out the entry points as they execute. An entry point is the lowest method of the stack trace that you are interested in. The tool is based on AspectJ, which allows me to inject print statements when certain methods are executed.

After I get an idea about where the entry points are, the next step is to drill down into each entry point to study what it is doing.

Drill down into entry points

The first thing I usually do is read through the code, trying to understand what it is doing. The Eclipse IDE is very effective for navigating through the code to see which method is calling which method. The most frequently used Eclipse features are ...
  • "Open declaration", which jumps to the method being called
  • "References", which lists all the methods that call this method
  • "Call hierarchy", which is a tree of callers which call this method
  • "Type hierarchy", which is a tree of the class hierarchy
After that, I add trace statements into the code to verify my understanding. Consider a trace statement as an enhanced logging mechanism which includes the thread id, the calling stack trace and a message. Trace statements are temporary and their purpose is to help me understand the code. After that, I remove my trace statements. Since I don't want anyone to use the trace statements in the production environment, I wrote a "Trace" class so that I can easily remove it when I am done.

Document the understanding

Draw a UML diagram to describe my understanding, and also take notes about any observed issues and what can be improved.

Legacy System Phenomenon

As mentioned in my earlier blog, the software engineering skills we've learned assume a green-field project where we start on a clean sheet of paper. In this case, we can apply a number of practices to build a solid foundation for future evolution and maintenance.
  1. Test Driven Development will force you to document the external behavior of your system in terms of unit tests
  2. Enforce a coding standard and use a tool to enforce it
  3. Make sure the code is regularly reviewed to ensure readability and modularity
In reality, less than 2% of projects are in this situation. Most likely, we are dealing with a piece of software that is not built on this foundation. In many cases, even when a project starts from a good foundation, it will deteriorate over time because of schedule pressure, lack of engineering skills, people turnover ... etc.

How can one support such a legacy system? By support, I mean the following ...
  • Fix bugs when they occur
  • Tune performance when load increases
  • Add features when new requirements arrive
In order to do these things effectively, we need to first "understand" the legacy code and then "refactor" it. "Understanding" is about how to quickly establish an insight into the inner workings of the legacy code without making any modifications that may accidentally change its external behavior and cause bugs. "Refactoring" is about how to improve the understandability of legacy code by making modifications to the code without changing its external behavior.

In later blogs, I will share my techniques for understanding a piece of code written by someone else and also how to refactor it safely.

Monday, October 22, 2007

"Test" as the "Spec"

A frequently encountered question in enterprise software development is where the hand-off happens from the architect (who designs the software) to the developers (who implement the software). Usually, the architect designs the software at a higher level of abstraction and then communicates her design to the developers, who break it down into a more concrete, detail-level design before turning it into implementation-level code.

The hand-off typically happens via an architecture spec written by the architect, composed of UML class diagrams, sequence diagrams or state transition diagrams ... etc. Based on their understanding of these diagrams, the developers go ahead and write the code.

However, the progression is not as smooth as we expect. There is a gap between the architecture diagrams and the code, and a transformation is required. During such a transformation, there is a possibility of misinterpretation and wrong assumptions. Quite often, the system ends up wrongly implemented due to miscommunication between the developer and the architect. Such miscommunication can be due either to the architect not describing her design clearly enough in the spec, or to the developer not being experienced enough to fill in some left-out details that the architect considers obvious.

One way to mitigate this problem is to have more design review meetings, or code review sessions, to make sure what is implemented correctly reflects the design. Unfortunately, I have found that such review sessions usually don't happen, either because the architect is too busy with other tasks, or because she is reluctant to read the developer's code. It ends up that the implementation doesn't match the design. Quite often, this discrepancy is discovered at a very late stage and leaves no time to fix. While developers start patching the current implementation for bug fixes or new features, the architect loses control of the architecture's evolution.

Is there a way for the architect to enforce her design at the early stage given the following common constraints?
  1. The architect cannot afford frequent progress/checkpoint review meetings
  2. While making sure the implementation complies with the design at a higher level, the architect doesn't want to dictate the low level implementation details
Test As A Specification

The solution is to have the architect write the Unit Tests (e.g. JUnit test classes in Java), which act as the "Spec" of her design.

In this model, the architect focuses on the "interface" aspect and how this system interacts with external parties, such as the client (how this system will be used), as well as the collaborators (how this system uses other systems).



The system will expose a set of "Facade" classes which encapsulate the system's external behavior and act as the entry point for its clients. Therefore, by writing unit tests against these "Facades", the architect can fully specify the external behavior of the system.

A set of "Collaborator" classes is also defined to explicitly capture the interaction of this system with other supporting systems. These "Collaborator" classes are specified in terms of mock objects so that the required behavior of the supporting systems is fully specified. The interaction sequence with the supporting systems is specified via the "expectations" of the mock objects.

The architect specifies the behavior of the "Facade" and "Collaborator" classes in a set of xUnit test cases, which act as the design spec of the system. This way, the architect defines the external behavior of the system while giving enough freedom to the developers to decide on the internal implementation structure. Typically, there are many "Impl Detail" classes to which the Facade delegates. These "Impl Detail" classes will invoke the "Collaborator interface" to get things done in some cases.

Note that the architect is not writing ALL the test cases. Architecture-level Unit Tests are just a small subset of the overall test cases, focused specifically on the architecture-level abstraction. These tests deliberately ignore the implementation details so that their stability is not affected by changes in the implementation logic.

On the other hand, the developers provide a different set of TestCases that covers the "Impl Detail" classes. This set of "Impl Level" TestCases changes whenever the developers change the internal implementation. By keeping these two sets of test cases in separate categories (one possible mechanism is sketched below), they can evolve independently as different aspects of the system change over its life cycle, resulting in a more maintainable system as it evolves.
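For example, JUnit 4's @Category annotation can keep the two sets apart. The marker interfaces and class names in this sketch are mine, and each public type would live in its own file.

import org.junit.Test;
import org.junit.experimental.categories.Categories;
import org.junit.experimental.categories.Category;
import org.junit.runner.RunWith;
import org.junit.runners.Suite;

// Marker interfaces for the two categories of tests
public interface ArchitectureLevel {}
public interface ImplementationLevel {}

// Architect-owned spec test, stable across implementation changes
@Category(ArchitectureLevel.class)
public class UserAuthSpecTest {
    @Test public void externalBehaviorIsSpecifiedHere() { /* ... */ }
}

// Developer-owned impl-detail test; it changes freely with the implementation
@Category(ImplementationLevel.class)
public class LoginSessionManagerTest {
    @Test public void internalSessionBookkeeping() { /* ... */ }
}

// A suite that runs only the architecture-level spec, e.g. as a CI gate
@RunWith(Categories.class)
@Categories.IncludeCategory(ArchitectureLevel.class)
@Suite.SuiteClasses({ UserAuthSpecTest.class, LoginSessionManagerTest.class })
public class ArchitectureSpec {}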

Example: User Authentication

To illustrate, let's go through an example using a User Authentication system.
There may be 40 classes that implement this whole UserAuthSubsystem, but the architecture-level test cases focus only on the Facade classes and specify only the "external behavior" that this subsystem should provide. They don't touch any of the underlying implementation classes, because those are the implementor's choices which the architect doesn't want to constrain.

** User Authentication Subsystem Spec starts here **

Responsibility:

  1. Register User -- register a new user
  2. Remove User -- delete a registered user
  3. Process User Login -- authenticate a user login and activate a user session
  4. Process User Logout -- deactivate an existing user session

Collaborators:

  • Credit Card Verifier -- Tell whether the user name matches the card holder
  • User Database -- Store the user's login name, password and personal information

public class UserAuthSystemTest {

    // Static imports assumed: org.easymock.EasyMock.* and org.junit.Assert.*
    UserDB mockedUserDB;
    CreditCardVerifier mockedCardVerifier;
    UserAuthSubsystem uas;

    @Before
    public void setUp() {
        // Set up the Mock collaborators and inject them into the Facade
        mockedUserDB = createMock(UserDB.class);
        mockedCardVerifier = createMock(CreditCardVerifier.class);
        uas = new UserAuthSubsystem(mockedUserDB, mockedCardVerifier);
    }

    @Test
    public void testUserLogin_withIncorrectPassword() {
        String userName = "ricky";
        String password = "test1234";

        // Define the expected interactions with the Collaborators
        expect(mockedUserDB.checkPassword(userName, password)).andReturn(false);
        replay(mockedUserDB, mockedCardVerifier);

        // Check that the external behavior is correct
        assertFalse(uas.login(userName, password));
        assertNull(uas.getLoginSession(userName));

        // Check the collaboration with the collaborators
        verify(mockedUserDB, mockedCardVerifier);
    }

    @Test
    public void testRegistration_withGoodCreditCard() {
        String userName = "Ricky TAM";
        String password = "testp";
        String creditCard = "123456781234";

        expect(mockedCardVerifier.checkCard(userName, creditCard)).andReturn(true);
        mockedUserDB.addUser(userName, password);   // void method: recording the expected call
        replay(mockedUserDB, mockedCardVerifier);

        uas.registerUser(userName, creditCard, password);

        verify(mockedUserDB, mockedCardVerifier);
    }

    @Test
    public void testUserLogin_withCorrectPassword() { .... }

    @Test
    public void testRegistration_withBadCreditCard() { .... }

    @Test
    public void testUserLogout() { .... }

    @Test
    public void testUnregisterUser() { .... }
}

** User Authentication Subsystem Spec ends here **

Summary

This approach ("Test" as a "Spec") has a number of advantages ...
  • There is no ambiguity about the system's external behavior, and hence no room for miscommunication, since the intended behavior of the system is communicated clearly in code.
  • The architect can write the TestCases at the level of abstraction she chooses. She has full control over what she wants to constrain and where she wants to give freedom.
  • By elevating architect-level test cases to be the spec of the system's external behavior, they become more stable and independent of changes in implementation details.
  • This approach forces the architect to think repeatedly about what the "interface" of the subsystem is and who its collaborators are, so the system design is forced to have a clean boundary.

Open Source Phenomenon

The Open Source Phenomenon has drastically changed the skill set required of software development professionals today.

Traditionally, software development involves not just writing the business logic of your application, but also a lot of plumbing logic (such as data persistence and fault resilience). This plumbing logic is typically independent of your business logic and also quite complicated (although these are well-studied engineering problems). In most projects, a large portion of the engineering effort went into re-inventing solutions for these recurring plumbing problems.

The situation is quite different in our software development world today. Thanks to the Open Source Phenomenon, most of the plumbing logic is available in the form of software libraries or frameworks with source code available. There are different kinds of open source licensing models with different degrees of usage or modification restrictions. Nevertheless, the software development game plan has changed drastically: instead of building all the plumbing from scratch, we can leverage what is out there as open source libraries and frameworks. The Open Source Phenomenon brings huge savings in project cost and also reduces time to market.

However, surviving in the Open Source world requires a different set of software engineering skills that is not taught in college. Most techniques we learn in college assume a green-field development environment; we learn how to design a good system from scratch rather than how to understand or adapt existing code. We learn how to "write code" but not how to "read code".

The skills needed to survive in the Open Source world are quite different, such as ...

"Knowing how to find" is more important than "knowing how to build"

"Google", "Wikipedia" are your friend. Of course, "Apache", "SourceForge" is good place to check. Sometimes I find it is not easy for a hard-core software engineer to look for other people's code to use. Especially before she has certain familiarity of the code base. The mindset need to be changed now. The confidence should be based on its popularity and in some case, you just use it and see how it goes.

"Quick learning" and "start fast"
Looking at usage examples is usually a good way to get a quick understanding of how things work. Then start writing some Unit Test code to get a better feel for the areas you are interested in; one such throwaway "learning test" is sketched below.
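In this example, Apache Commons Lang stands in for whatever library you are evaluating; the test is just a scratch pad to confirm your understanding, not part of the product's test suite.

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import org.apache.commons.lang3.StringUtils;
import org.junit.Test;

public class StringUtilsLearningTest {

    @Test
    public void blankDetection_treatsWhitespaceAndNullAsBlank() {
        assertTrue(StringUtils.isBlank("   "));
        assertTrue(StringUtils.isBlank(null));    // no NullPointerException -- good to know
        assertFalse(StringUtils.isBlank("abc"));
    }

    @Test
    public void capitalize_onlyTouchesTheFirstCharacter() {
        assertEquals("Hello world", StringUtils.capitalize("hello world"));
    }
}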

Be prepared to "switch"
This is very important when you are using less proven open source products. Even with a proven open source product, the level of support may fluctuate in the future, and that is not something under your control. Therefore, your application architecture should try hard to minimize the coupling (and dependencies) to the open source products.

In most cases, you need just a few features out of the whole suite of the open source product. A common technique I use to confine my usage is to define an interface that captures all the API calls I need, and then write an implementation of that interface which depends on the open source product. Now the dependency is localized in the implementation class; whenever I switch products, I just need to rewrite the implementation of the interface. I can even use an IoC (Inversion of Control) mechanism to achieve zero dependencies between my application and the open source product. Here is a sketch of that technique.
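In this sketch, the JsonCodec and GsonJsonCodec names are mine, and Gson stands in for whichever library is being wrapped; each public type would live in its own file.

// JsonCodec.java -- the only type the rest of the application sees; it captures
// just the operations we actually need from any JSON library
public interface JsonCodec {
    String toJson(Object value);
    <T> T fromJson(String json, Class<T> type);
}

// GsonJsonCodec.java -- the single class that depends on the open source product
// (Gson here); switching libraries means rewriting only this adapter
import com.google.gson.Gson;

public class GsonJsonCodec implements JsonCodec {

    private final Gson gson = new Gson();

    public String toJson(Object value) {
        return gson.toJson(value);
    }

    public <T> T fromJson(String json, Class<T> type) {
        return gson.fromJson(json, type);
    }
}

Application classes then receive a JsonCodec through their constructor (the IoC approach mentioned above), so nothing outside GsonJsonCodec ever imports com.google.gson.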

Ability to drill down

Quite often, the open source product provides most of what you need, but not 100%. In other scenarios, you want to make some enhancement or fix a bug. You need to be able to drill down into the code and feel comfortable making changes.

This is a tricky area. I know many good engineers who can write good code, but only very few can pick up code written by other people and become familiar with it. As I said earlier, we are trained to write code, not to read code. In dealing with open source products, code reading skills are very important.

In fact, the skill of reading code is important in general, as most projects are in maintenance mode rather than green-field development mode. I will share code reading techniques in more detail in a later blog, when I talk about doing surgery on legacy code. But here is a quick summary of an important subset that is relevant to dissecting open source products ...
  1. Look at the package structure, class names and method names to get a rough idea of the code structure and data structures.
  2. Put in some trace code that prints out the execution path, then run some examples to verify your understanding.
  3. For further drill-down, put in break points and step through the code.
  4. Document your findings.
  5. Subscribe to the community mailing list and send an email to ask.

Reverse Engineering Tools

There are a couple of tools that analyze an existing code base to extract its design or external behavior. For example, Agitar can analyze your Java bytecode and generate JUnit tests, and Altova can analyze your code to generate UML diagrams.

These reverse engineering tools provide value for legacy applications where you have a piece of running production code but very little information about its design, and very few tests from which you can understand its expected behavior. These tools can quickly analyze the code and tell you its observed behavior: they "discover" what the code is currently doing and give you a better starting point for further analysis.
But a better starting point is all they provide. They are typically unable to distinguish the intended behavior (which sits at a higher level of abstraction) from the actual implementation (which sits at a much lower level). In fact, I doubt that automatically extracting the intended behavior is possible at all. The end result is that they swamp you with a lot of low-level, implementation-specific details (because the actual implementation is all they know), and then you have to manually analyze what is important to you and what is not.
In other words, they give you a different starting point in a different format (e.g. a set of test methods and the class invariants they detect, instead of implementation Java code), but they don't raise your level of abstraction, which I argue has to be done manually anyway. So do they give you a better start? I think it is better, but not much better. By following a set of code surgery and refactoring techniques, we can achieve a much better result. I will share such techniques in later blogs.

Now, are these tools useful for new development? I don't think so. I am a strong believer in TDD (Test-Driven Development), where the test itself should be regarded as the "spec" and so, by definition, has to be written first and written manually by the application designer. In TDD, you write your tests first and write them manually (not generated). You put all the expected behavior into your test methods, and you also put in Mock Objects to test the interactions. You diagram your UML model to capture your major classes and methods at a higher abstraction level.

Unit Testing Async Calls

Most Unit Testing frameworks assume a synchronous call model: the calling thread blocks until the called method returns. So a typical test method looks like the following ...
class SyncServiceTest {

    @Test
    public void testSyncCall() {
        Object input = ...            // test input (elided)
        Object expectedOutput = ...   // expected result (elided)

        SyncService syncService = new SyncService();
        assertEquals(expectedOutput, syncService.call(input));
    }
}

However, this model doesn't fit if the call is asynchronous, i.e. the calling thread is not blocked and the result comes back later on a separate callback thread. For example, consider the following AsyncService, where the call() method returns immediately and the result comes back through a callback interface.
// Executor and ThreadedExecutor are from Doug Lea's util.concurrent library
// (EDU.oswego.cs.dl.util.concurrent); its execute() declares InterruptedException,
// which is why call() propagates it.
class AsyncService {

    public static interface ResultHandler {
        void handleResult(Object result);
    }

    private final ResultHandler resultHandler;

    public AsyncService(ResultHandler resultHandler) {
        this.resultHandler = resultHandler;
    }

    public void call(final Object request) throws InterruptedException {
        // Hand the work off to another thread and return immediately;
        // the result is delivered later through the callback interface
        Executor executor = new ThreadedExecutor();
        executor.execute(new Runnable() {
            public void run() {
                Object result = process(request);   // the actual business logic (not shown)
                resultHandler.handleResult(result);
            }
        });
    }
}
The test method needs to change: you have to block the calling thread until the result comes back from the callback thread. The modified test code looks like the following ...
class AsyncServiceTest {

    // Written by the callback thread, read by the test thread
    volatile Object response;

    @Test
    public void testAsyncCall() throws InterruptedException {
        Object input = ...            // test input (elided)
        Object expectedOutput = ...   // expected result (elided)

        // Once-off latch: release() opens it, attempt(msecs) waits up to msecs for it
        final Latch latch = new Latch();

        AsyncService asyncService =
            new AsyncService(new AsyncService.ResultHandler() {
                public void handleResult(Object result) {
                    response = result;
                    latch.release();
                }
            });

        asyncService.call(input);
        latch.attempt(3000);    // wait up to 3 seconds for the callback
        assertEquals(expectedOutput, response);
    }
}
In this test method, the calling thread waits for the callback thread, and a latch is used for the synchronization. You cannot use Object.wait() and Object.notify() for this purpose because there is a possibility that notify is called before the wait, in which case the waiting thread would wait indefinitely. A Latch is a once-off switch: if release() is called first, then any subsequent attempt() call returns immediately.
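The Latch above appears to be the one from Doug Lea's util.concurrent library. On Java 5 or later, java.util.concurrent.CountDownLatch gives the same once-off behavior (a countDown() that happens before await() makes await() return immediately), so the same test could be sketched as follows; the input/output values are placeholders, since process() is not shown.

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import org.junit.Test;

public class AsyncServiceCountDownLatchTest {

    // Written by the callback thread, read by the test thread
    private volatile Object response;

    @Test
    public void testAsyncCall() throws InterruptedException {
        Object input = "some request";          // placeholder -- depends on process()
        Object expectedOutput = "some result";  // placeholder

        final CountDownLatch latch = new CountDownLatch(1);

        AsyncService service =
            new AsyncService(new AsyncService.ResultHandler() {
                public void handleResult(Object result) {
                    response = result;
                    latch.countDown();          // like Latch.release(): once opened, stays open
                }
            });

        service.call(input);

        // Like Latch.attempt(3000): waits up to 3 seconds, returns false on timeout
        assertTrue("timed out waiting for the callback", latch.await(3, TimeUnit.SECONDS));
        assertEquals(expectedOutput, response);
    }
}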