Thursday, May 08, 2008

It took me some time to form a clear picture of analytical database vendor Infobright, despite an excellent white paper that seems to have since vanished from their Web site. [Note: Per Susan Davis' comment below, they have since reloaded it here.] Infobright’s product, named BrightHouse, confused me because it is a SQL-compatible, columnar database, which makes it sound similar to systems like Vertica and ParAccel (click here for my ParAccel entry).
But it turns out there is a critical difference: while those other products rely on massively parallel (MPP) hardware for scalability and performance, BrightHouse runs on conventional (SMP) servers. The system gains its performance edge by breaking each database column into chunks of roughly 65,000 values apiece, called “data packs”, and reading relatively few packs to resolve most queries.
The trick is that BrightHouse stores descriptive information about each data pack and can often use this information to avoid loading the pack itself. For example, the descriptive information holds minimum and maximum values of data within the pack, plus summary data such as totals. This means that a query involving a certain value range may determine that all or none of the records within a pack are qualified. If all values are out of range, the pack can be ignored; if all values are in range, the summary data may suffice. Only when some but not all of the records within a pack are relevant must the pack itself be loaded from disk and decompressed. According to CEO Miriam Tuerk, this approach can reduce data transfers by up to 90%. The data is also highly compressed when loaded into the packs—by ratios as high as 50:1, although 10:1 is average. This reduces hardware costs and yields even faster disk reads. By contrast, data in MPP columnar systems often takes up as much or more storage space as the source files.
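To make that concrete, here is a rough Python sketch of the general idea. The names and structure are my own invention for illustration, not Infobright's actual code:

# A minimal sketch of pack-level pruning using stored summaries (hypothetical names).
class DataPack:
    """Roughly 65,000 values of one column, plus summary metadata."""
    def __init__(self, values):
        self.values = values          # in the real product this stays compressed on disk
        self.min_val = min(values)
        self.max_val = max(values)
        self.total = sum(values)

def sum_where_less_than(packs, threshold):
    """Sum all values below the threshold while reading as few packs as possible."""
    result = 0
    for pack in packs:
        if pack.min_val >= threshold:
            continue                  # no rows can qualify: ignore the pack entirely
        elif pack.max_val < threshold:
            result += pack.total      # every row qualifies: the stored summary suffices
        else:
            # only partially-qualifying packs must be read and decompressed
            result += sum(v for v in pack.values if v < threshold)
    return result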
This design is substantially more efficient than conventional columnar systems, which read every record in a given column to resolve queries involving that column. The small size of the BrightHouse data packs means that many packs will be totally included or excluded from queries even without their contents being sorted when the data is loaded. This lack of sorting, along with the lack of indexing or data hashing, yields load rates of up to 250 GB per hour. This is impressive for an SMP system, although MPP systems are faster.
You may wonder what happens to BrightHouse when queries require joins across tables. It turns out that even in these cases, the system can use its summary data to exclude many data packs. In addition, the system watches queries as they execute and builds a record of which data packs are related to other data packs. Subsequent queries can use this information to avoid opening data packs unnecessarily. The system thus gains a performance advantage without requiring a single, predefined join path between tables—something that is present in some other columnar systems, though not all of them. The net result of all this is great flexibility: users can load data from existing source systems without restructuring it, and still get excellent analytical performance.
BrightHouse uses the open source MySQL database interface, allowing it to connect with any data source that is accessible to MySQL. According to Tuerk, it is the only version of MySQL that scales beyond 500 GB. Its scalability is still limited, however, to 30 to 50 TB of source data, which would be a handful of terabytes once compressed. The system runs on any Red Hat Linux 5 server—for example, a 1 TB installation runs on a $22,000 Dell. A Windows version is planned for later this year. The software itself costs $30,000 per terabyte of source data (one-time license plus annual maintenance), which puts it toward the low end of pricing for analytical systems.
Infobright was founded in 2005, although development of the BrightHouse engine began earlier. Several production systems were in place by 2007. The system was officially launched in early 2008 and now has about a dozen production customers.
Wednesday, April 09, 2008
Bah, Humbug: Let's Not Forget the True Meaning of On-Demand
I was skeptical the other day about the significance of on-demand business intelligence. I still am. But I’ve also been thinking about the related notion of on-demand predictive modeling. True on-demand modeling – which to me means the client sends a pile of data and gets back a scored customer or prospect list – faces the same obstacle as on-demand BI: the need for careful data preparation. Any modeler will tell you that fully automated systems make errors that would be obvious to a knowledgeable human. Call it the Sorcerer’s Apprentice effect.
Indeed, if you Google “on demand predictive model”, you will find just a handful of vendors, including CopperKey, Genalytics and Angoss. None of these provides the generic “data in, scores out” service I have in mind. There are, however, some intriguing similarities among them. Both CopperKey and Genalytics match the input data against national consumer and business databases. Both Angoss and CopperKey offer scoring plug-ins to Salesforce.com. Both Genalytics and Angoss will also build custom models using human experts.
I’ll infer from this that the state of the art simply does not support unsupervised development of generic predictive models. Either you need human supervision, or you need standardized inputs (e.g., Salesforce.com data), or you must supplement the data with known variables (e.g. third-party databases).
Still, I wonder if there is an opportunity. I was playing around recently with a very simple, very robust scoring method a statistician showed me more than twenty years ago. (Sum of Z-scores on binary variables, if you care.) This did a reasonably good job of predicting product ownership in my test data. More to the point, the combined modeling-and-scoring process needed just a couple dozen lines of code in QlikView. It might have been a bit harder in other systems, given how powerful QlikView is. But it’s really quite simple regardless.
The only requirements are that the input contains a single record for each customer and that all variables are coded as 1 or 0. Within those constraints, any kind of inputs are usable and any kind of outcome can be predicted. The output is a score that ranks the records by their likelihood of meeting the target condition.
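Purely as an illustration, here is one way such a scorer might look in Python. This is my own rough reading of the “sum of Z-scores on binary variables” recipe, with invented field names, not the exact code I wrote in QlikView:

import math

def zscore_weights(records, target):
    # For each binary input, a two-proportion z-statistic comparing the target rate
    # when the variable is 1 against when it is 0.
    weights = {}
    variables = [k for k in records[0] if k != target]
    for var in variables:
        on = [r[target] for r in records if r[var] == 1]
        off = [r[target] for r in records if r[var] == 0]
        if not on or not off:
            weights[var] = 0.0
            continue
        p1, p0 = sum(on) / len(on), sum(off) / len(off)
        p = (sum(on) + sum(off)) / (len(on) + len(off))
        se = math.sqrt(p * (1 - p) * (1 / len(on) + 1 / len(off)))
        weights[var] = (p1 - p0) / se if se else 0.0
    return weights

def score(record, weights):
    # The score is simply the sum of the z-scores for the variables set to 1.
    return sum(w for var, w in weights.items() if record.get(var) == 1)

# Usage: records are dicts of 0/1 flags, one per customer, e.g.
#   {"owns_product_a": 1, "urban": 0, "responded": 1}
# weights = zscore_weights(records, "responded")
# ranked = sorted(records, key=lambda r: score(r, weights), reverse=True)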
Now, I’m fully aware that better models can be produced with more data preparation and human experts. But there must be many situations where an approximate ranking would be useful if it could be produced in minutes, with no prior investment, for a couple hundred dollars. That's exactly what this approach makes possible: since the process is fully automated, the incremental cost is basically zero. Pricing would only need to cover overhead, marketing and customer support.
The closest analogy I can think of among existing products is on-demand customer data integration sites. These also take customer lists submitted over the Internet and automatically return enhanced versions – in their case, IDs that link duplicates, postal coding, and sometimes third-party demographics and other information. Come to think of it, similar services perform on-line credit checks. Those have proven to be viable businesses, so the fundamental idea is not crazy.
Whether on-demand generic scoring is also viable, I can’t really say. It’s not a business I am likely to pursue. But I think it illustrates that on-demand systems can provide value by letting customers do things with no prior setup. Many software-as-a-service vendors stress other advantages: lower cost of ownership, lack of capital investment, easy remote access, openness to external integration, and so on. These are all important in particular situations. But I’d also like to see vendors explore the niches where “no prior setup and no setup cost” offers the greatest value of all.
Thursday, January 31, 2008
QlikView Scripts Are Powerful, Not Sexy
I spent some time recently delving into QlikView’s automation functions, which allow users to write macros to control various activities. These are an important and powerful part of QlikView, since they let it function as a real business application rather than a passive reporting system. But what the experience really did was clarify why QlikView is so much easier to use than traditional software.
Specifically, it highlighted the difference between QlikView’s own scripting language and the VBScript used to write QlikView macros.
I was going to label QlikView scripting as a “procedural” language and contrast it with VBScript as an “object-oriented” language, but a quick bit of Wikipedia research suggests those may not be quite the right labels. Still, whatever the nomenclature, the difference is clear when you look at the simple task of assigning a value to a variable. With QlikView scripts, I use a statement like:
Set Variable1 = '123';
With VBScript using the QlikView API, I need something like:
set v = ActiveDocument.GetVariable("Variable1")
v.SetContent "123",true
That the first option is considerably easier may not be an especially brilliant insight. But the implications are significant, because they mean vastly less training is needed to write QlikView scripts than to write similar programs in a language like VBScript, let alone Visual Basic itself. This in turn means that vastly less technical people can do useful things in QlikView than with other tools. And that gets back to the core advantage I’ve associated with QlikView previously: that it lets non-IT people like business analysts do things that normally require IT assistance. The benefit isn’t simply that the business analysts are happier or that IT gets a reduced workload. It's that the entire development cycle is accelerated because analysts can develop and refine applications for themselves. Otherwise, they'd be writing specifications, handing these to IT, waiting for IT to produce something, reviewing the results, and then repeating the cycle to make changes. This is why we can realistically say that QlikView cuts development time to hours or days instead of weeks or months.
Of course, any end-user tool cuts the development cycle. Excel reduces development time in exactly the same way. The difference lies in the power of QlikView scripts. They can do very complicated things, giving users the ability to create truly powerful systems. These capabilities include all kinds of file manipulation—loading data, manipulating it, splitting or merging files, comparing individual records, and saving the results.
The reason it’s taken me so long to recognize that this is important is that database management is not built into today's standard programming languages. We’ve simply become so used to the division between SQL queries and programs that the distinction feels normal. But reflecting on QlikView script brought me back to the old days of FoxPro and dBase database languages, which did combine database management with procedural coding. They were tremendously useful tools. Indeed, I still use FoxPro for certain tasks. (And not that crazy new-fangled Visual FoxPro, either. It’s especially good after a brisk ride on the motor-way in my Model T. You got a problem with that, sonny?)
Come to think of it, FoxPro and dBase played a similar role in their day to what QlikView offers now: bringing hugely expanded data management power to the desktops of lightly trained users. Their fate was essentially to be overwhelmed by Microsoft Access and SQL Server, which made reasonably priced SQL databases available to end-users and data centers. Although I don’t think QlikView is threatened from that direction, the analogy is worth considering.
Back to my main point, which is that QlikView scripts are both powerful and easy to use. I think they’re an underreported part of the QlikView story, which tends to be dominated by the sexy technology of the in-memory database and the pretty graphics of QlikView reports. Compared with those, scripting seems pedestrian and probably a bit scary to the non-IT users whom I consider QlikView’s core market. I know I myself was put off when I first realized how dependent QlikView was on scripts, because I thought it meant only serious technicians could take advantage of the system. Now that I see how much easier the scripts are than today’s standard programming languages, I consider them a major QlikView advantage.
(Standard disclaimer: although my firm is a reseller for QlikView, opinions expressed in this blog are my own.)
Friday, January 04, 2008
Fitting QlikTech into the Business Intelligence Universe
I’ve been planning for about a month to write about the position of QlikTech in the larger market for business intelligence systems. The topic has come up twice in the past week, so I guess I should do it already.
First, some context. I’m using “business intelligence” in the broad sense of “how companies get information to run their businesses”. This encompasses everything from standard operational reports to dashboards to advanced data analysis. Since these are all important, you can think of business intelligence as providing a complete solution to a large but finite list of requirements.
For each item on the list, the answer will have two components: the tool used, and the person doing the work. That is, I’m assuming a single tool will not meet all needs, and that different tasks will be performed by different people. This all seems reasonable enough. It means that a complete solution will have multiple components.
It also means that you have to look at any single business intelligence tool in the context of other tools that are also available. A tool which seems impressive by itself may turn out to add little real value if its features are already available elsewhere. For example, a visualization engine is useless without a database. If the company already owns a database that also includes an equally-powerful visualization engine, then there’s no reason to buy the stand-alone visualization product. This is why vendors keep expanding their product functionality and why it is so hard for specialized systems to survive. It’s also why nobody buys desk calculators: everyone has a computer spreadsheet that does the same things and more. But I digress.
Back to the notion of a complete solution. The “best” solution is the one that meets the complete set of requirements at the lowest cost. Here, “cost” is broadly defined to include not just money, but also time and quality. That is, a quicker answer is better than a slower one, and a quality answer is better than a poor one. “Quality” raises its own issues of definition, but let’s view this from the business manager’s perspective, in which case “quality” means something along the lines of “producing the information I really need”. Since understanding what’s “really needed” often takes several cycles of questions, answers, and more questions, a solution that speeds up the question-answer cycle is better. This means that solutions offering more power to end-users are inherently better (assuming the same cost and speed), since they let users ask and answer more questions without getting other people involved. And talking to yourself is always easier than talking to someone else: you’re always available, and rarely lose an argument.
In short: the way to evaluate a business intelligence solution is to build a complete list of requirements and then, for each requirement, look at what tool will meet it, who will use that tool, what the tool will cost, and how quickly the work will get done.
We can put cost aside for the moment, because the out-of-pocket expense of most business intelligence solutions is insignificant compared with the value of getting the information they provide. So even though cheaper is better and prices do vary widely, price shouldn’t be the determining factor unless all else is truly equal.
The remaining criteria are who will use the tool and how quickly the work will get done. These come down to pretty much the same thing, for the reasons already described: a tool that can be used by a business manager will give the quickest results. More grandly, think of a hierarchy of users: business managers; business analysts (staff members who report to the business managers); statisticians (specialized analysts who are typically part of a central service organization); and IT staff. Essentially, questions are asked by business managers, and work their way through the hierarchy until they get to somebody who can answer them. Who that person is depends on what tools each person can use. So, if the business manager can answer her own question with her own tools, it goes no further; if the business analyst can answer the question, he does and sends it back to his boss; if not, he asks for help from a statistician; and if the statistician can’t get the answer, she goes to the IT department for more data or processing.
Bear in mind that different users can do different things with the same tool. A business manager may be able to use a spreadsheet only for basic calculations, while a business analyst may also know how to do complex formulas, graphics, pivot tables, macros, data imports and more. Similarly, the business analyst may be limited to simple SQL queries in a relational database, while the IT department has experts who can use that same relational database to create complex queries, do advanced reporting, load data, add new tables, set up recurring processes, and more.
Since a given tool does different things for different users, one way to assess a business intelligence product is to build a matrix showing which requirements each user type can meet with it. Whether a tool “meets” a requirement could be indicated by a binary measure (yes/no), or, preferably, by a utility score that shows how well the requirement is met. Results could be displayed in a bar chart with four columns, one for each user group, where the height of each bar represents the percentage of all requirements those users can meet with that tool. Tools that are easy but limited (e.g. Excel) would have short bars that get slightly taller as they move across the chart. Tools that are hard but powerful (e.g. SQL databases) would have low bars for business users and tall bars for technical ones. (This discussion cries out for pictures, but I haven’t figured out how to add them to this blog. Sorry.)
Things get even more interesting if you plot the charts for two tools on top of each other. Just sticking with Excel vs. SQL, the Excel bars would be higher than the SQL bars for business managers and analysts, and lower than the SQL bars for statisticians and IT staff. The over-all height of the bars would be higher for the statisticians and IT, since they can do more things in total. Generally this suggests that Excel would be of primary use to business managers and analysts, but pretty much redundant for the statisticians and IT staff.
Of course, in practice, statisticians and IT people still do use Excel, because there are some things it does better than SQL. This comes back to the matrices: if each cell has utility scores, comparing the scores for different tools would show which tool is better for each situation. The number of cells won by each tool could create a stacked bar chart showing the incremental value of each tool to each user group. (Yes, I did spend today creating graphs. Why do you ask?)
Now that we’ve come this far, it’s easy to see that assessing different combinations of tools is just a matter of combining their matrices. That is, you compare the matrices for all the tools in a given combination and identify the “winning” product in each cell. The number of cells won by each tool shows its incremental value. If you want to get really fancy, you can also consider how much each tool is better than the next-best alternative, and incorporate the incremental cost of deploying an additional tool.
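For the mechanically inclined, here is a tiny Python sketch of that bookkeeping, with made-up tools, user groups and utility scores:

# Hypothetical utility scores: scores[tool][group][i] says how well user group 'group'
# can meet requirement i with 'tool' (0 = can't do it, 3 = does it easily).
requirements = ["ad hoc query", "standard report", "complex data merge"]
scores = {
    "Excel": {"analyst": [2, 1, 0], "IT": [2, 1, 1]},
    "SQL":   {"analyst": [1, 0, 1], "IT": [3, 2, 3]},
}

def cells_won(scores, requirements):
    """For each tool and user group, count the requirements where that tool has the top score."""
    groups = list(next(iter(scores.values())).keys())
    wins = {tool: {g: 0 for g in groups} for tool in scores}
    for g in groups:
        for i in range(len(requirements)):
            best = max(scores[t][g][i] for t in scores)
            for t in scores:
                if best > 0 and scores[t][g][i] == best:   # ties count for both tools
                    wins[t][g] += 1
    return wins

print(cells_won(scores, requirements))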
Which, at long last, brings us back to QlikTech. I see four general classes of business intelligence tools: legacy systems (e.g. standard reports out of operational systems); relational databases (e.g. in a data warehouse); traditional business intelligence tools (e.g. Cognos or Business Objects; we’ll also add statistical tools like SAS); and Excel (where so much of the actual work gets done). Most companies already own at least one product in each category. This means you could build a single utility matrix, taking the highest score in each cell from all the existing systems. Then you would compare this to a matrix for QlikTech and find cells where QlikView scores higher. Count the number of those cells, highlight them in a stacked bar chart, and you have a nice visual of where QlikTech adds value.
If you actually did this, you’d probably find that QlikTech is most useful to business analysts. Business managers might benefit some from QlikView dashboards, but those aren’t all that different from other kinds of dashboards (although building them in QlikView is much easier). Statisticians and IT people already have powerful tools that do much of what QlikTech does, so they won’t see much benefit. (Again, it may be easier to do some things in QlikView, but the cost of learning a new tool will weigh against it.) The situation for business analysts is quite different: QlikTech lets them do many things that other tools do not. (To be clear: some other tools can do those things, but it takes more skill than the business analysts possess.)
This is very important because it means those functions can now be performed by the business analysts, instead of passed on to statisticians or IT. Remember that the definition of a “best” solution boils down to whichever solution meets business requirements closest to the business manager. By allowing business analysts to perform many functions that would otherwise be passed through to statisticians or IT, QlikTech generates a huge improvement in total solution quality.
Thursday, December 06, 2007
1010data Offers A Powerful Columnar Database
Back in October I wrote here about the resurgent interest in alternatives to standard relational databases for analytical applications. Vendors on my list included Alterian, BlueVenn (formerly SmartFocus), Vertica and QD Technology. Most use some form of a columnar structure, meaning data is stored so the system can load only the columns required for a particular query. This reduces the total amount of data read from disk and therefore improves performance. Since a typical analytical query might read only a half-dozen columns out of hundreds or even thousands available, the savings can be tremendous.
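A toy example of the difference (generic column-store logic, not any particular vendor's format):

# Row store: every query drags whole records off the disk.
rows = [
    {"customer_id": 1, "state": "NY", "revenue": 100.0, "segment": "A"},   # ...plus hundreds more fields
    {"customer_id": 2, "state": "CA", "revenue": 250.0, "segment": "B"},
]

# Column store: each field lives in its own array, so a query touches only what it needs.
columns = {
    "customer_id": [1, 2],
    "state": ["NY", "CA"],
    "revenue": [100.0, 250.0],
    "segment": ["A", "B"],
}

# "Total revenue in NY" reads just two columns, no matter how many hundreds exist.
total = sum(r for r, s in zip(columns["revenue"], columns["state"]) if s == "NY")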
I recently learned about another columnar database, Tenbase from 1010data. Tenbase, introduced in 2000, turns out to be a worthy alternative to better-known columnar products.
Like other columnar systems, Tenbase is fast: an initial query against a 4.3 billion row, 305 gigabyte table came back in about 12 seconds. Subsequent queries against the results were virtually instantaneous, because they were limited to the selected data and that data had been moved into memory. Although in-memory queries will always be faster, Tenbase says reading from disk takes only three times as long, which is a very low ratio. This reflects a focused effort by the company to make disk access as quick as possible.
What’s particularly intriguing is Tenbase achieves this performance without compressing, aggregating or restructuring the input. Although indexes are used in some situations, queries generally read the actual data. Even with indexes, the Tenbase files usually occupy about the same amount of space as the input. This factor varies widely among columnar databases, which sometimes expand file size significantly and sometimes compress it. Tenbase also handles very large data sets: the largest in production is nearly 60 billion rows and 4.6 terabytes. Fast response on such large installations is maintained by adding servers that process queries in parallel. Each server contains a complete copy of the full data set.
Tenbase can import data from text files or connect directly to multi-table relational databases. Load speed is about 30 million observations per minute for fixed width data. Depending on the details, this comes to around 10 gigabytes per hour. Time for incremental loads, which add new data to an existing table, is determined only by the volume of the new data. Some columnar databases essentially reload the entire file during an ‘incremental’ update.
Regardless of the physical organization, Tenbase presents loaded data as if it were in the tables of a conventional relational database. Multiple tables can be linked on different keys and queried. This contrasts with some columnar systems that require all tables be linked on the same key, such as a customer ID.
Tenbase has an ODBC connector that lets it accept standard SQL queries. Results come back as quickly as queries in the system’s own query language. This is also special: some columnar systems run SQL queries much more slowly or won’t accept them at all. The Tenbase developers demonstrated this feature by querying a 500 million row database through Microsoft Access, which feels a little like opening the door to a closet and finding yourself in the Sahara desert.
Tenbase’s own query language is considerably more powerful than SQL. It gives users advanced functions for time-series analysis, which actually allows many types of comparisons between rows in the data set. It also contains a variety of statistical, aggregation and calculation functions. It’s still set-based rather than a procedural programming language, so it doesn't support features like if/then loops. This is one area where some other columnar databases may have an edge.
The Tenbase query interface is rather plain but it does let users pick the columns and values to select by, and the columns and summary types to include in the result. Users can also specify a reference column for calculations such as weighted averages. Results can be viewed as tables, charts or cross tabs (limited to one value per intersection), which can themselves be queried. Outputs can be exported in Excel, PDF, XML, text or CSV formats. The interface also lets users create calculated columns and define links among tables.
Under the covers, the Tenbase interface automatically creates XML statements written to the Tenbase API. Users can view and edit the XML or write their own statements from scratch. This lets them create alternate interfaces for special purposes or simply to improve the esthetics. Queries built in Tenbase can be saved and rerun either in their original form or with options for users to enter new values at run time. The latter feature gives a simple way to build query applications for casual users.
The user interface is browser-based, so no desktop client is needed. Perhaps I'm easily impressed, but I like that the browser back button actually works. This is often not the case in such systems. Performance depends on the amount of data and query complexity but it scales with the number of servers, so even very demanding queries against huge databases can be returned in a few minutes with the right hardware. The servers themselves are commodity Windows PCs. Simple queries generally come back in seconds.
Tenbase clients pay for the system on a monthly basis. Fees are based primarily on the number of servers, which is determined by the number of users, amount of data, types of queries, and other details. The company does not publish its pricing but the figures it mentioned seemed competitive. The servers can reside at 1010data or at the client, although 1010data will manage them either way. Users can load data themselves no matter where the server is located.
Most Tenbase clients are in the securities industry, where the product is used for complex analytics. The company has recently added several customers in retail, consumer goods and health care. There are about 45 active Tenbase installations, including the New York Stock Exchange, Procter & Gamble and Pathmark Stores.
Tuesday, November 27, 2007
Just How Scalable Is QlikTech?
A few days ago, I replied to a question regarding QlikTech scalability. (See What Makes QlikTech So Good?, August 3, 2007) I asked QlikTech itself for more information on the topic but haven’t learned anything new. So let me simply discuss this based on my own experience (and, once again, remind readers that while my firm is a QlikTech reseller, comments in this blog are strictly my own.)
The first thing I want to make clear is that QlikView is a wonderful product, so it would be a great pity if this discussion were to be taken as a criticism. Like any product, QlikView works within limits that must be understood to use it appropriately. No one benefits from unrealistic expectations, even if fans like me sometimes create them.
That said, let’s talk about what QlikTech is good at. I find two fundamental benefits from the product. The first is flexibility: it lets you analyze data in pretty much any way you want, without first building a data structure to accommodate your queries. By contrast, most business intelligence tools must pre-aggregate large data sets to deliver fast response. Often, users can’t even formulate a particular query if the dimensions or calculated measures were not specified in advance. Much of the development time and cost of conventional solutions, whether based in standard relational databases or specialized analytical structures, is spent on this sort of work. Avoiding it is the main reason QlikTech is able to deliver applications so quickly.
The other big benefit of QlikTech is scalability. I can work with millions of records on my desktop with the 32-bit version of the system (maximum memory 4 GB if your hardware allows it) and still get subsecond response. This is much more power than I’ve ever had before. A 64-bit server can work with tens or hundreds of millions of rows: the current limit for a single data set is apparently 2 billion rows, although I don’t know how close anyone has come to that in the field. I have personally worked with tables larger than 60 million rows, and QlikTech literature mentions an installation of 300 million rows. I strongly suspect that larger ones exist.
So far so good. But here’s the rub: there is a trade-off in QlikView between really big files and really great flexibility. The specific reason is that the more interesting types of flexibility often involve on-the-fly calculations, and those calculations require resources that slow down response. This is more a law of nature (there’s no free lunch) than a weakness in the product, but it does exist.
Let me give an example. One of the most powerful features of QlikView is a “calculated dimension”. This lets reports construct aggregates by grouping records according to ad hoc formulas. You might want to define ranges for a value such as age, income or unit price, or create categories using if/then/else statements. These formulas can get very complex, which is generally a good thing. But each formula must be calculated for each record every time it is used in a report. On a few thousand rows, this can happen in an instant, but on tens of millions of rows, it can take several minutes (or much longer if the formula is very demanding, such as on-the-fly ranking). At some point, the wait becomes unacceptable, particularly for users who have become accustomed to QlikView’s typically-immediate response.
As problems go, this isn’t a bad one because it often has a simple solution: instead of on-the-fly calculations, precalculate the required values in QlikView scripts and store the results on each record. There’s little or no performance cost to this strategy since expanding the record size doesn’t seem to slow things down. The calculations do add time to the data load, but that happens only once, typically in an unattended batch process. (Another option is to increase the number and/or speed of processors on the server. QlikTech makes excellent use of multiple processors.)
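In generic terms, the trade-off looks something like this (Python rather than QlikView script, purely to show the shape of the idea; the field names are made up):

def age_band(age):
    return "under 30" if age < 30 else "30 to 49" if age < 50 else "50 plus"

# On the fly: the formula runs for every record, every time a report refreshes.
def report_on_the_fly(records):
    counts = {}
    for r in records:
        band = age_band(r["age"])
        counts[band] = counts.get(band, 0) + 1
    return counts

# Precalculated: do the work once in the batch load and store the result on each record.
def load(records):
    for r in records:
        r["age_band"] = age_band(r["age"])
    return records

def report_precalculated(records):
    counts = {}
    for r in records:
        counts[r["age_band"]] = counts.get(r["age_band"], 0) + 1
    return counts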
The really good news is you can still get the best of both worlds: work out design details with ad hoc reports on small data sets; then, once the design is stabilized, add precalculations to handle large data volumes. This is vastly quicker than prebuilding everything before you can see even a sample. It’s also something that’s done by business analysts with a bit of QlikView training, not database administrators or architects.
Other aspects of formulas and database design also become more important in QlikView as data volumes grow larger. The general solution is the same: make the application more efficient through tighter database and report design. So even though it’s true that you can often just load data into QlikView and work with it immediately, it’s equally true that very large or sophisticated applications may take some tuning to work effectively. In other words, QlikView is not pure magic (any result you want for absolutely no work), but it does deliver much more value for a given amount of work than conventional business intelligence systems. That’s more than enough to justify the system.
Interestingly, I haven’t found that the complexity or over-all size of a particular data set impacts QlikView performance. That is, removing tables which are not used in a particular query doesn’t seem to speed up that query, nor does removing fields from tables within the query. This probably has to do with QlikTech’s “associative” database design, which treats each field independently and connects related fields directly to each other. But whatever the reason, most of the performance slow-downs I’ve encountered seem related to processing requirements.
And, yes, there are some upper limits to the absolute size of a QlikView implementation. Two billion rows is one, although my impression (I could be wrong) is that could be expanded if necessary. The need to load data into memory is another limit: even though the 64-bit address space is effectively infinite, there are physical limits to the amount of memory that can be attached to Windows servers. (A quick scan of the Dell site finds a maximum of 128 GB.) This could translate into more input data, since QlikView does some compression. At very large scales, processing speed will also impose a limit. Whatever the exact upper boundary, it’s clear that no one will be loading dozens of terabytes into QlikView any time soon. It can certainly be attached to a multi-terabyte warehouse, but would have to work with multi-gigabyte extracts. For most purposes, that’s plenty.
While I’m on the topic of scalability, let me repeat a couple of points I made in the comments on the August post. One addresses the notion that QlikTech can replace a data warehouse. This is true in the sense that QlikView can indeed load and join data directly from operational systems. But a data warehouse is usually more than a federated view of current operational tables. Most warehouses include data integration to link otherwise-disconnected operational data. For example, customer records from different systems often can only be linked through complex matching techniques because there is no shared key such as a universal customer ID. QlikView doesn’t offer that kind of matching. You might be able to build some of it using QlikView scripts, but you’d get better results at a lower cost from software designed for the purpose.
In addition, most warehouses store historical information that is not retained in operational systems. A typical example is end-of-month account balance. Some of these values can be recreated from transaction details but it’s usually much easier just to take and store a snapshot. Other data may simply be removed from operational systems after a relatively brief period. QlikView can act as a repository for such data: in fact, it’s quite well suited for this. Yet in such cases, it’s probably more accurate to say that QlikView is acting as the data warehouse than to say a warehouse is not required.
I hope this clarifies matters without discouraging anyone from considering QlikTech. Yes QlikView is a fabulous product. No it won’t replace your multi-terabyte data warehouse. Yes it will complement that warehouse, or possibly substitute for a much smaller one, by providing a tremendously flexible and efficient business intelligence system. No it won’t run itself: you’ll still need some technical skills to do complicated things on large data volumes. But for a combination of speed, power, flexibility and cost, QlikTech can’t be beat.
The first thing I want to make clear is that QlikView is a wonderful product, so it would be a great pity if this discussion were to be taken as a criticism. Like any product, QlikView works within limits that must be understood to use it appropriately. No one benefits from unrealistic expectations, even if fans like me sometimes create them.
That said, let’s talk about what QlikTech is good at. I find two fundamental benefits from the product. The first is flexibility: it lets you analyze data in pretty much any way you want, without first building a data structure to accommodate your queries. By contrast, most business intelligence tools must pre-aggregate large data sets to deliver fast response. Often, users can’t even formulate a particular query if the dimensions or calculated measures were not specified in advance. Much of the development time and cost of conventional solutions, whether based in standard relational databases or specialized analytical structures, is spent on this sort of work. Avoiding it is the main reason QlikTech is able to deliver applications so quickly.
The other big benefit of QlikTech is scalability. I can work with millions of records on my desktop with the 32-bit version of the system (maximum memory 4 GB if your hardware allows it) and still get subsecond response. This is much more power than I’ve ever had before. A 64-bit server can work with tens or hundreds of millions of rows: the current limit for a single data set is apparently 2 billion rows, although I don’t know how close anyone has come to that in the field. I have personally worked with tables larger than 60 million rows, and QlikTech literature mentions an installation of 300 million rows. I strongly suspect that larger ones exist.
So far so good. But here’s the rub: there is a trade-off in QlikView between really big files and really great flexibility. The specific reason is that the more interesting types of flexibility often involve on-the-fly calculations, and those calculations require resources that slow down response. This is more a law of nature (there’s no free lunch) than a weakness in the product, but it does exist.
Let me give an example. One of the most powerful features of QlikView is a “calculated dimension”. This lets reports construct aggregates by grouping records according to ad hoc formulas. You might want to define ranges for a value such as age, income or unit price, or create categories using if/then/else statements. These formulas can get very complex, which is generally a good thing. But each formula must be calculated for each record every time it is used in a report. On a few thousand rows, this can happen in an instant, but on tens of millions of rows, it can take several minutes (or much longer if the formula is very demanding, such as on-the-fly ranking). At some point, the wait becomes unacceptable, particularly for users who have become accustomed to QlikView’s typically immediate response.
As problems go, this isn’t a bad one because it often has a simple solution: instead of on-the-fly calculations, precalculate the required values in QlikView scripts and store the results on each record. There’s little or no performance cost to this strategy since expanding the record size doesn’t seem to slow things down. The calculations do add time to the data load, but that happens only once, typically in an unattended batch process. (Another option is to increase the number and/or speed of processors on the server. QlikTech makes excellent use of multiple processors.)
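To make the trade-off concrete, here is a rough Python sketch—emphatically not QlikView script, just an illustration with invented data—of what happens when an if/then/else banding formula runs against every record at report time versus being computed once during the load and stored on each record.

```python
import random
import time
from collections import defaultdict

# Illustrative only: compare an ad hoc "calculated dimension" (a banding
# formula evaluated at report time) with the same value precalculated
# during the data load. Data and sizes are invented.
random.seed(1)
records = [{"age": random.randint(18, 90), "revenue": random.random() * 100}
           for _ in range(1_000_000)]

def age_band(age):
    # The kind of if/then/else banding a calculated dimension might apply.
    if age < 30:
        return "18-29"
    elif age < 50:
        return "30-49"
    return "50+"

# On the fly: the formula runs for every record, every time the report runs.
start = time.time()
on_the_fly = defaultdict(float)
for r in records:
    on_the_fly[age_band(r["age"])] += r["revenue"]
print("on the fly   :", round(time.time() - start, 2), "seconds")

# Precalculated: the band is computed once in the (batch) load step and
# stored on each record; the report just groups by the stored value.
for r in records:
    r["band"] = age_band(r["age"])

start = time.time()
precalculated = defaultdict(float)
for r in records:
    precalculated[r["band"]] += r["revenue"]
print("precalculated:", round(time.time() - start, 2), "seconds")
```

The exact timings will vary with the formula and the hardware; the point is simply that the formula’s cost moves out of every report refresh and into a single, unattended load step.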
The really good news is you can still get the best of both worlds: work out design details with ad hoc reports on small data sets; then, once the design is stabilized, add precalculations to handle large data volumes. This is vastly quicker than prebuilding everything before you can see even a sample. It’s also something that’s done by business analysts with a bit of QlikView training, not database administrators or architects.
Other aspects of formulas and database design also become more important in QlikView as data volumes grow larger. The general solution is the same: make the application more efficient through tighter database and report design. So even though it’s true that you can often just load data into QlikView and work with it immediately, it’s equally true that very large or sophisticated applications may take some tuning to work effectively. In other words, QlikView is not pure magic (any result you want for absolutely no work), but it does deliver much more value for a given amount of work than conventional business intelligence systems. That’s more than enough to justify the system.
Interestingly, I haven’t found that the complexity or overall size of a particular data set impacts QlikView performance. That is, removing tables which are not used in a particular query doesn’t seem to speed up that query, nor does removing fields from tables within the query. This probably has to do with QlikTech’s “associative” database design, which treats each field independently and connects related fields directly to each other. But whatever the reason, most of the performance slowdowns I’ve encountered seem related to processing requirements.
And, yes, there are some upper limits to the absolute size of a QlikView implementation. Two billion rows is one, although my impression (I could be wrong) is that this could be expanded if necessary. The need to load data into memory is another limit: even though the 64-bit address space is effectively infinite, there are physical limits to the amount of memory that can be attached to Windows servers. (A quick scan of the Dell site finds a maximum of 128 GB.) That ceiling translates into somewhat more input data than the raw memory figure suggests, since QlikView compresses data as it loads it. At very large scales, processing speed will also impose a limit. Whatever the exact upper boundary, it’s clear that no one will be loading dozens of terabytes into QlikView any time soon. It can certainly be attached to a multi-terabyte warehouse, but it would have to work with multi-gigabyte extracts. For most purposes, that’s plenty.
While I’m on the topic of scalability, let me repeat a couple of points I made in the comments on the August post. One addresses the notion that QlikTech can replace a data warehouse. This is true in the sense that QlikView can indeed load and join data directly from operational systems. But a data warehouse is usually more than a federated view of current operational tables. Most warehouses include data integration to link otherwise-disconnected operational data. For example, customer records from different systems often can only be linked through complex matching techniques because there is no shared key such as a universal customer ID. QlikView doesn’t offer that kind of matching. You might be able to build some of it using QlikView scripts, but you’d get better results at a lower cost from software designed for the purpose.
In addition, most warehouses store historical information that is not retained in operational systems. A typical example is end-of-month account balance. Some of these values can be recreated from transaction details but it’s usually much easier just to take and store a snapshot. Other data may simply be removed from operational systems after a relatively brief period. QlikView can act as a repository for such data: in fact, it’s quite well suited for this. Yet in such cases, it’s probably more accurate to say that QlikView is acting as the data warehouse than to say a warehouse is not required.
I hope this clarifies matters without discouraging anyone from considering QlikTech. Yes, QlikView is a fabulous product. No, it won’t replace your multi-terabyte data warehouse. Yes, it will complement that warehouse, or possibly substitute for a much smaller one, by providing a tremendously flexible and efficient business intelligence system. No, it won’t run itself: you’ll still need some technical skills to do complicated things on large data volumes. But for a combination of speed, power, flexibility and cost, QlikTech can’t be beat.
Monday, October 08, 2007
Proprietary Databases Rise Again
I’ve been noticing for some time that “proprietary” databases are making a come-back in the world of marketing systems. “Proprietary” is a loaded term that generally refers to anything other than the major relational databases: Oracle, SQL Server and DB2, plus some of the open source products like MySQL. In the marketing database world, proprietary systems have a long history tracing back to the mid-1980’s MCIF products from Customer Insight, OKRA Marketing, Harte-Hanks and others. These originally used specialized structures to get adequate performance from the limited PC hardware of the era. Their spiritual descendants today are Alterian and BlueVenn (formerly SmartFocus), both with roots in the mid-1990’s Brann Viper system and both having reinvented themselves in the past few years as low cost / high performance options for service bureaus to offer their clients.
Nearly all the proprietary marketing databases used some version of an inverted (now more commonly called “columnar”) database structure. In such a structure, data for each field (e.g., Customer Name) is physically stored in adjacent blocks on the hard drive, so it can be accessed with a single read. This makes sense for marketing systems, and analytical queries in general, which typically scan all contents of a few fields. By contrast, most transaction processes use a key to find a particular record (row) and read all its elements. Standard relational databases are optimized for such transaction processing and thus store entire rows together on the hard drive, making it easy to retrieve their contents.
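For anyone who prefers to see the idea in code, here is a deliberately simplified Python sketch of the two layouts (real engines work with disk blocks and compression, not Python lists): summing one field touches every full record in a row store, but only a single contiguous column in a columnar store.

```python
# Simplified illustration of row versus columnar layout; data is invented.

# Row store: each record's fields are kept together, so reading one field
# for every customer means touching every full record.
rows = [
    {"name": "Ann",  "state": "NY", "balance": 1200.0},
    {"name": "Bob",  "state": "CA", "balance":  350.0},
    {"name": "Cara", "state": "NY", "balance": 8100.0},
]
total_from_rows = sum(r["balance"] for r in rows)      # scans whole records

# Column store: each field is stored contiguously, so an analytical query
# that needs only "balance" reads just that one sequence of values.
columns = {
    "name":    ["Ann", "Bob", "Cara"],
    "state":   ["NY",  "CA",  "NY"],
    "balance": [1200.0, 350.0, 8100.0],
}
total_from_columns = sum(columns["balance"])           # reads one column

assert total_from_rows == total_from_columns
```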
Columnar databases themselves date back at least to mid-1970’s products including Computer Corporation of America Model 204, Software AG ADABAS, and Applied Data Research (now CA) Datacom/DB. All of these are still available, incidentally. In an era when hardware was vastly more expensive, the great efficiency of these systems at analytical queries made them highly attractive. But as hardware costs fell and relational databases became increasingly dominant, they fell by the wayside except for special situations. Their sweet spot of high-volume analytical applications was further invaded by massively parallel systems (Teradata and more recently Netezza) and multi-dimensional data cubes (Cognos Powerplay, Oracle/Hyperion EssBase, etc.). These had different strengths and weaknesses but still competed for some of the same business.
What’s interesting today is that a new generation of proprietary systems is appearing. Vertica has recently gained a great deal of attention due to the involvement of database pioneer Michael Stonebraker, architect of INGRES and POSTGRES. (Click here for an excellent technical analysis by the Winter Corporation; registration required.) QD Technology, launched last year (see my review), isn’t precisely a columnar structure, but uses indexes and compression in a similar fashion. I can’t prove it, but suspect the new interest in alternative approaches is because analytical databases are now getting so large—tens and hundreds of gigabytes—that the efficiency advantages of non-relational systems (which translate into cost savings) are now too great to ignore.
We’ll see where all this leads. One of the few columnar systems introduced in the 1990’s was Expressway (technically, a bit map index—not unlike Model 204), which was purchased by Sybase and is now moderately successful as Sybase IQ. I think Oracle also added some bit-map capabilities during this period, and suspect the other relational database vendors have their versions as well. If columnar approaches continue to gain strength, we can certainly expect the major database vendors to add them as options, even though they are literally orthogonal to standard relational database design. In the meantime, it’s fun to see some new options become available and to hope that costs will come down as new competitors enter the domain of very large analytical databases.
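Since bit map indexes keep coming up, a toy Python illustration of the concept may help: one bit vector per distinct value of a field, so a filter such as state = 'NY' or 'CA' becomes a bitwise OR and a bit count rather than a table scan. Real products compress these vectors heavily; this sketch only shows the principle, and the sample data is invented.

```python
# Toy bit map index: one bit vector per distinct value of a field.
states = ["NY", "CA", "NY", "TX", "CA", "NY"]   # invented sample column

bitmaps = {}
for row_number, value in enumerate(states):
    # Set bit `row_number` in the vector for this value.
    bitmaps[value] = bitmaps.get(value, 0) | (1 << row_number)

# "state = NY OR state = CA" becomes a single bitwise OR of two vectors,
# and counting the qualifying rows is a bit count rather than a table scan.
mask = bitmaps["NY"] | bitmaps["CA"]
print(bin(mask), bin(mask).count("1"))   # 0b110111 -> 5 qualifying rows
```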
Friday, October 05, 2007
Analytica Provides Low-Cost, High-Quality Decision Models
My friends at DM News, which has published my Software Review column for the past fifteen years, unceremoniously informed me this week that they had decided to stop carrying all of their paid columnists, myself included. This caught me in the middle of preparing a review of Lumina Analytica, a pretty interesting piece of simulation modeling software. Lest my research go to waste, I’ll write about Analytica here.
Analytica falls into the general class of software used to build mathematical models of systems or processes, and then to predict the results of a particular set of inputs. Business people typically use such software to understand the expected results of projects such as a new product launch or a marketing campaign, or to forecast the performance of their business as a whole. Such models can also capture the lifecycle of a customer or calculate the results of key performance indicators as linked on a strategy map, if the relationships among those indicators have been defined with sufficient rigor.
When the relationships between inputs and outputs are simple, such models can be built in a spreadsheet. But even moderately complex business problems are beyond what a spreadsheet can reasonably handle: they have too many inputs and outputs and the relationships among these are too complicated. Analytica makes it relatively easy to specify these relationships by drawing them on an “influence diagram” that looks like a typical flow chart. Objects within the chart, representing the different inputs and outputs, can then be opened up to specify the precise mathematical relationships among the elements.
Analytica can also build models that run over a number of time periods, using results from previous periods as inputs to later periods. You can do something like this in a spreadsheet, but it takes a great many hard-coded formulas which are easy to get wrong and hard to change. Analytica also offers a wealth of tools for dealing with uncertainties, such as many different types of probability distributions. These are virtually impossible to handle in a normal spreadsheet model.
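To show what “results from previous periods as inputs to later periods” plus an uncertain input looks like, here is a small Python sketch—not Analytica code, and every figure in it is invented—of a multi-period, multi-segment customer model with a retention rate drawn from a probability distribution, run as a simple Monte Carlo simulation.

```python
import random
import statistics

# Hypothetical three-segment customer model run over twelve periods; the
# retention rate is drawn from a distribution instead of being fixed.
# All figures are invented for illustration.
random.seed(7)
STARTING_CUSTOMERS = {"high": 1_000, "mid": 5_000, "low": 20_000}
MARGIN_PER_PERIOD  = {"high": 40.0,  "mid": 12.0,  "low": 3.0}

def one_run(periods=12):
    customers = dict(STARTING_CUSTOMERS)
    total_margin = 0.0
    for _ in range(periods):
        # Each period starts from the previous period's ending counts.
        retention = min(max(random.normalvariate(0.90, 0.03), 0.0), 1.0)
        for segment in customers:
            customers[segment] *= retention
            total_margin += customers[segment] * MARGIN_PER_PERIOD[segment]
    return total_margin

results = [one_run() for _ in range(5_000)]
cuts = statistics.quantiles(results, n=20)             # 5% steps
print("median total margin :", round(statistics.median(results)))
print("5th-95th percentile :", round(cuts[0]), "-", round(cuts[-1]))
```

Analytica expresses this sort of thing declaratively through its influence diagrams and built-in distributions; the sketch just makes the underlying mechanics visible.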
Apart from spreadsheets, Analytica fits between several niches in the software world. Its influence diagrams resemble the pictures drawn by graphics software: but unlike simple drawing programs, Analytica has actual calculations beneath its flow charts. On the other hand, Analytica is less powerful than process modeling software used to simulate manufacturing systems, call center operations, or other business processes. That software has many very sophisticated features tailored to modeling such flows in detail: for example, ways to simulate random variations in the arrival rates of telephone calls or to incorporate the need for one process step to wait until several others have been completed. It may be possible to do much of this in Analytica, but it would probably be stretching the software beyond its natural limits.
What Analytica does well is model specific decisions or business results over time. The diagram-building approach to creating models is quite powerful and intuitive, particularly because users can build modules within their models, so a single object on a high-level diagram actually refers to a separate, detailed diagram of its own. Object attributes include inputs, outputs and formulas describing how the outputs are calculated. Objects can also contain arrays to handle different conditions: for example, a customer object might use arrays to define treatments for different customer segments. This is a very powerful feature, since it lets an apparently simple model capture a great deal of actual detail.
Setting up a model in Analytica isn’t exactly simple, although it may be about as easy as possible given the inherent complexity of the task. Basically, users place the objects on a palette, connect them with arrows, and then open them up to define the details. There are many options within these details, so it does take some effort to learn how to get what you want. The vendor provides a tutorial and detailed manual to help with the learning process, and offers a variety of training and consulting options. Although it is accessible to just about anyone, the system is definitely oriented towards sophisticated users, providing advanced statistical features and methods that no one else would understand.
The other intriguing feature of Analytica is its price. The basic product costs a delightfully reasonable $1,295. Other versions range up to $4,000, adding the ability to access ODBC data sources, handle very large arrays, and run automated optimization procedures. A server-based version costs $8,000, but only very large companies would need that one.
This pricing is quite impressive. Modeling systems can easily cost tens or hundreds of thousands of dollars, and it’s not clear they provide much more capability than Analytica. On the other hand, Analytica’s output presentation is rather limited—some basic tables and graphs, plus several statistical measures of uncertainty. There’s that statistical orientation again: as a non-statistician, I would have preferred better visualization of results.
In my own work, Analytica could definitely provide a tool for building models to simulate customers’ behaviors as they flow through an Experience Matrix. This is already more than a spreadsheet can handle, and although it could be done in QlikTech it would be a challenge. Similarly, Analytica could be used in business planning and simulation. It wouldn’t be as powerful as a true agent-based model, but could provide an alternative that costs less and is much easier to learn how to build. If you’re in the market for this sort of modeling—particularly if you want to model uncertainties and not just fixed inputs—Analytica is definitely worth a look.
Friday, June 01, 2007
Dashboard Software: Finding More than Flash
I’ve been reading a lot about marketing performance metrics recently, which turns out to be a drier topic than I can easily tolerate—and I have a pretty high tolerance for dry. To give myself a bit of a break without moving too far afield, I decided to research marketing dashboard software. At least that let me look at some pretty pictures.
Sadly, the same problem that afflicts discussions of marketing metrics affects most dashboard systems: what they give you is a flood of disconnected information without any way to make sense of it. Most of the dashboard vendors stress their physical display capabilities—how many different types of displays they provide, how much data they can squeeze onto a page, how easily you can build things—and leave the rest to you. What this comes down to is: they let you make bigger, prettier mistakes faster.
Two exceptions did crop up that seem worth mentioning.
- ActiveStrategy builds scorecards that are specifically designed to link top-level business strategy with lower-level activities and results. They refer to this as “cascading” scorecards and that seems a good term to illustrate the relationship. I suppose this isn’t truly unique; I recollect the people at SAS showing me a similar hierarchy of key performance indicators, and there are probably other products with a cascading approach. Part of this may be the difference between dashboards and scorecards. Still, if nothing else, ActiveStrategy is doing a particularly good job of showing how to connect data with results.
- VisualAcuity doesn’t have the same strategic focus, but it does seek more effective alternatives to the normal dashboard display techniques. As their Web site puts it, “The ability to assimilate and make judgments about information quickly and efficiently is key to the definition of a dashboard. Dashboards aren’t intended for detailed analysis, or even great precision, but rather summary information, abbreviated in form and content, enough to highlight exceptions and initiate action.” VisualAcuity dashboards rely on many small displays and time-series graphs to do this.
Incidentally, if you’re just looking for something different, FYIVisual uses graphics rather than text or charts in a way that is probably very efficient at uncovering patterns and exceptions. It definitely doesn’t address the strategy issue and may or may not be more effective than more common display techniques. But at least it’s something new to look at.
Tuesday, May 15, 2007
Are Visual Sciences and WebSideStory Really the Same Company? (As a matter of fact, yes.)
Last week, WebSideStory announced it was going to become part of the Visual Sciences brand. (The two companies merged in February 2006 but had retained separate identities.)
The general theme of the combined business is “real time analytics”. This is what Visual Sciences has always done, so far as I can recall. It’s more of a departure for WebSideStory, which has its roots in the batch-oriented world of Web log analysis.
But what’s really intriguing is the applications WebSideStory has developed. One is a search system that helps users navigate within a site. Another provides Web content management. A third provides keyword bid management.
Those applications may sound barely relevant, but all are enriched with analytics in ways that make perfect sense. The search system uses analytics to help infer user interests and also lets users control results so the users are shown items that meet business needs. Web content management also includes functions that let business objectives influence the content presented to visitors. Keyword bid management is tightly integrated with subsequent site behavior—conversions and so on—so value can be optimized beyond the cost per click.
Maybe this is just good packaging, but it does seem to me that Visual Sciences has done something pretty clever here: rather than treating analytics and targeting as independent disciplines that are somehow applied to day-to-day Web operations, it has built analytics directly into the operational functions. Given the choice between plain content management and analytics-enhanced content management, why would anyone not choose the latter?
I haven’t really dug into these applications, so my reaction is purely superficial. All I know is they sound attractive. But even this is impressive at a time when so many online vendors are expanding their product lines through acquisitions that seem to have little strategic rationale beyond generally expanding the company footprint.
Wednesday, April 11, 2007
Operational Systems Should Be Designed with Testing in Mind
A direct marketing client pointed out to me recently that it can be very difficult to set up tests in operational systems.
There is no small irony in this. Direct marketers have always prided themselves on their ability to test what they do, in pointed contrast to the barely measurable results of conventional media. But his point was well taken. Although it’s fairly easy to set up tests for direct marketing promotions, testing customer treatments delivered through operational systems such as order processing is much more difficult. Those systems are designed with the assumption that all customers are treated the same. Forcing them to treat selected customers differently can require significant contortions.
This naturally led me to wonder what an operational system would look like if it had been designed from the start with testing in mind. A bit more reflection led me to the multivariate testing systems I have been looking at recently—Optimost, Offermatica, Memetrics, SiteSpect, Vertster. These take control of all customer treatments (within a limited domain), and therefore make delivering a test message no harder or easier than delivering a default message. If we treated them as a template for generic customer treatment systems, which functions would we copy?
I see several (a rough sketch of how these pieces might fit together follows the list):
- segmentation capabilities, which can select customers for particular tests (or for customized treatment in general). You might generalize this further to include business rules and statistical models that determine exactly which treatments are applied to which customers: when you think about it, segmentation is just a special type of business rule.
- customer profiles, which make available all the information needed for segmentation/rules and hold tags identifying customers already assigned to a particular test.
- content management features, which make existing content available to apply in tests. Some systems provide content creation as well, although I think this is a separate specialty that should usually remain external.
- test design functions, which help users create correctly structured tests and then link them to the segmentation rules, data profiles and content needed to execute them. These design functions also cover parameters such as date ranges, preview features for checking that tests are set up correctly, workflow for approvals, and similar administrative features.
- reporting and analysis, so users can easily read test results, understand their implications, and use the knowledge effectively.
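Here is the sketch promised above: a hypothetical Python outline of how such a treatment service might combine segmentation rules, profile tags, a content library and deterministic test assignment. Every name, rule and number in it is invented for illustration; none of the multivariate testing vendors mentioned earlier necessarily works this way.

```python
import hashlib

# Hypothetical outline of an operational treatment service built around the
# functions listed above. Everything here is invented for illustration.

CONTENT = {
    "default_offer": "Standard checkout page",
    "free_shipping": "Checkout page with free-shipping banner",
}

TESTS = {
    "checkout_test_1": {
        "segment": lambda profile: profile.get("orders", 0) >= 2,  # rule
        "arms": {"control": "default_offer", "treatment": "free_shipping"},
        "control_share": 0.5,
    },
}

PROFILES = {"cust_42": {"orders": 3, "tests": {}}}   # customer profile store

def treat(customer_id):
    profile = PROFILES[customer_id]
    for test_id, test in TESTS.items():
        if not test["segment"](profile):
            continue                     # segmentation rule excludes them
        if test_id not in profile["tests"]:
            # Deterministic assignment so a customer always sees one arm;
            # the tag is stored back on the profile for later reporting.
            digest = hashlib.md5(f"{test_id}:{customer_id}".encode()).hexdigest()
            bucket = int(digest, 16) % 1_000 / 1_000
            arm = "control" if bucket < test["control_share"] else "treatment"
            profile["tests"][test_id] = arm
        return CONTENT[test["arms"][profile["tests"][test_id]]]
    return CONTENT["default_offer"]      # default treatment outside all tests

print(treat("cust_42"))
```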
I don’t mean to suggest that existing multivariate testing systems should replace operational systems. The testing systems are mostly limited to Web sites and control only a few portions of a few pages within those sites. They sit on top of general purpose Web platforms which handle many other site capabilities. Using the testing systems to control all pages on a site without an underlying platform would be difficult if not impossible, and in any case isn’t what the testing systems are designed for.
Rather, my point is that developers of operational systems should use the testing products as models for how to finely control each customer experience, whether for testing or simply to tailor treatments to individual customer needs. Starting their design from the perspective of creating a powerful testing environment should enable them to understand more clearly what is needed to build a comprehensive solution, rather than trying to bolt on particular capabilities without grasping the underlying connections.
Thursday, March 01, 2007
Users Want Answers, Not Tools
I hope you appreciated that yesterday’s post about reports within the sample Lifetime Value system was organized around the questions that each report answered, and not around the report contents themselves. (You DID read yesterday’s post, didn’t you? Every last word?) This was the product of considerable thought about what it takes to make systems like that useful to actual human beings.
Personally I find those kinds of reports intrinsically fascinating, especially when they have fun charts and sliders to play with. But managers without the time for leisurely exploration—shall we call it “data tourism”?—need an easier way to get into the data and find exactly what they want. Starting with a list of questions they might ask, and telling them where they will find each answer, is one way of helping out.
Probably a more common approach is to offer prebuilt analysis scenarios, which would be packages of reports and/or recommended analysis steps to handle specific projects. It’s a more sophisticated version of the same notion: figure out what questions people are likely to have and lead them through the process of acquiring answers. There is a faint whiff of condescension to this—a serious analyst might be insulted at the implication that she needs help. But the real question is whether non-analysts would have the patience to work through even this sort of guided presentation. The vendors who offer such scenarios tell me that users appreciate them, but I’ve never heard a user praise them directly.
The ultimate fallback, of course, is to have someone else do the analysis for you. One of my favorite sayings—which nobody has ever found as witty as I do, alas—is that the best user interface ever invented is really the telephone: as in, pick up the telephone and tell somebody else to answer your question. Many of the weighty pronouncements I see about how automated systems can never replace the insight of human beings really come down to this point.
But if that’s really the case, are we just kidding ourselves by trying to make analytics accessible to non-power users? Should we stop trying and simply build power tools to make the real experts as productive as possible? And even if that’s correct, must we still pretend to care about non-power users because they often control the purchase decision?
On reflection, this is a silly line of thought. Business users need to make business decisions and they need to have the relevant information presented to them in ways they can understand. Automated systems make sense but still must run under the supervision of real human beings. There is no reason to try to turn business users into analysts. But each type of user should be given the tools it needs to do its job.