Organizations have complex requirements for their information systems. A period of rapid economic growth can result in insane schedules and demands that a new information system be ready within weeks. Finally, organizations are fickle and have no compunction about changing the requirements mid-stream.
Technical people have traditionally met these challenges ... by arguing over programming tools. The data model can't represent the information that the users need, the application doesn't do what what the users need it to do, and instead of writing code, the "engineers" are arguing about Java versus Lisp versus ML versus C# versus Perl versus VB. If you want to know why computer programmers get paid less than medical doctors, consider the situation of two trauma surgeons arriving at an accident scene. The patient is bleeding profusely. If surgeons were like programmers, they'd leave the patient to bleed out in order to have a really satisfying argument over the merits of two different kinds of tourniquet.
War Story The authors were asked to help Siemens and Boston Consulting Group (BCG) realize a knowledge sharing system for 17,000 telephone switch salespeople spread among 84 countries. This was back in the 1990s when (a) telephone companies were expanding capacity, and (b) corporations invested in information systems as a way of beating competitors. Siemens had spent 6 months working with a Web development contractor that was expert in building HTML pages but had trouble programming SQL. They'd promised to launch the elaborately specified system 6 weeks from our first meeting. We concluded that many of the features that they wanted could be adapted from the source code behind the photo.net online community but that adding the "knowledge repository" would require 4 programmers working full-time for 6 weeks. What's worse, in looking at the specs we decided that the realized system would be unusably complex, especially for busy salespeople. Instead of blindly cranking out the code, we assigned only one programmer to the project, our friend Tracy Adams. She turned the human-readable design notebooks into machine-readable database metadata tables. Tracy proceeded to build a program-to-write-the-program with no visible results. Siemens and BCG were nervous until Week 4 when the completed system was available for testing. "How do you like it?" we asked. "This is the worst information system that we've ever used," they replied. "How do you compare it to your specs?" we asked. "Hmmm... maybe we should simplify the specification," they replied. After two more iterations the system, dubbed "ICN Sharenet" was launched on time and was adopted quickly, credited by Siemens with $122 million in additional sales during its first year of operation. |
If you're programming one Web page at a time, you can switch to the language du jour in search of higher productivity. But you won't achieve significant gains unless you quit writing code for one page at a time. Think about ways to write down a machine-readable description of the application and user experience, then let the computer generate the application automatically.
One thing that we hope you've learned during this course is the value of testing with users and iterative improvement of an application. If an application is machine-generated, you can test it with users, edit the specification based on their feedback, and regenerate the application in a matter of minutes, ready for a new test.
We're going to explore metadata (data about the data model) and automatic code generation in the problem domain of knowledge management.
An organization's second attempt at an information system intended to help beginners and average performers with the expertise of the most capable is typically dubbed knowledge sharing. In a knowledge-sharing system, User A has the ability to put a question in front of the community so that Users B, C, and D can write new material and/or point A to previously authored articles.
If nothing else, a knowledge-sharing system provides a means for employees at the leaf nodes of a hierarchy to converse amongst themselves. Consider the organization depicted in figure 15.1. Suppose that a worker in Singapore has a question that could be answered by a worker in Des Moines. The act of finding the coworker and getting assistance requires going all the way up the hierarchy to the chief executive in London and then back down a different path through the hierarchy to Des Moines. This bottleneck could be eliminated by eliminating the hierarchy. However, most organizations don't want to eliminate their hierarchies. It is the hierarchy that enables the corporation to reduce management complexity by establishing profit-and-loss responsibility at intermediate levels. Better to supplement the hierarchy with an informal mechanism by which the Singapore-based worker can ask for help and the Des Moines-based worker can offer it, i.e., a knowledge-sharing system.
**** insert drawing of corporate hierarchy ****
Your ontology will consist of class definitions and, because a relational database is the source of persistence behind your online community, your implementation of this ontology will consist of SQL table definitions, one table for each class. To assist you in developing a useful ontology for your community, here are a couple of examples.
Let's assume that we'll have at least the following types in our ontology:
language
and one of type person
.
Why not link to the users
table instead? John McCarthy
might not be a registered user of the system. Some of the people
you'll be referencing, e.g., John von Neumann, are dead.
Each object comprises a set of elements. An element is stored in a column. For every object in the system, we want to record the following elements:
- for the
person
type- date_of_birth, title
- for the
language
type- syntax_example, garbage_collection_p (whether the language has automatic storage allocation like Lisp or memory leaks like C), strongly_typed_p, type_inferencing_p, lexical_scoping_p, date_first_used
- for the
publication
type- this is for storing references to books and journal articles so you want all the fields that you'd expect to see when referencing something; include also an abstract field
- for the
data structure
type- complexity_for_insertion, complexity_for_retrieval (varchars containing "O(1)", "O(N)", etc.)
- for the
system
type- examples of systems are "Multics", "Xerox Alto", "TCP/IP", "MIT Lisp Machine", "Apple Macintosh", "Unix", "World Wide Web". Includes fields for date_of_conception, date_of_birth, organization_name, open_source_p. No need to include fields for the names of developers because we can instead use links to objects of type
person
to represent prime developers or promoters.- for the
problem
type- examples of problems are "traveling salesman", "dining philosophers", "sort", "query for inclusion in sorted list". We'll want elements for storing initial conditions and solution criteria. In general, objects of type
problem
will be linked to objects of typealgorithm
(algorithms that solve the problem),publication
(papers that set forth the problem), andperson
(people who were involved in stating or solving the problem)- for the
algorithm
type- examples include "Quicksort" and "binary search" elements for pseudo_code and high_level_explanation. In general, objects of type
algorithm
will be linked to objects of typeproblem
(what need the algorithm addresses),publication
(papers that describe the algorithm or implementations of it), andperson
(people who were involved in developing the algorithm)
For an example of what a completed system of this nature might look like, visit Paul Black's Dictionary of Algorithms, Data Structures, and Problems at http://www.nist.gov/dads/.
Object types include:
- for the
airplane design
type- For each kind of airplane flying, there is one entry in this table. An example might be "Cessna 172" or "Boeing 747". We need elements to specify performance such as stall_speed (how slow you can go before tumbling out of the sky), approach_speech (how fast you should go when coming near the runway to land), and cruise_speed. We want elements such as date_certified, manufacturer_name, and manufacturer_address to describe the design.
- for the
airplane
type- An entry in this table is a specific airplane, very likely a rental machine belonging to a flight school. We want elements such as date_manufactured, ifr_capable_p (legal to fly in the clouds?), and optional_equipment.
- for the
airport
type- We want to know where the airport is: lat_long; elevation; relation_to_city (distance and direction from a named town). We want to know whether the airport is military-only, private, or public. We want to know whether or not the airport has a rotating green/white beacon and runway lights. We want to store the frequencies for weather information, contacting other pilots (if non-towered) or the control tower (if towered), and air traffic control for instrument flight clearances. We need to record runway lengths and conditions. An airport may have several runways, however, thus giving rise to a many-to-one relation, which is why we model runways separately and link them to airports.
- for the
runway
type- number (e.g., "09/27"), length, condition. Note that the runway number implies the magnetic orientation: 09 implies a heading of 090 or landing facing magnetic east; if the wind favors a landing on the same strip of asphalt in the opposite direction, you're on 27, which implies a heading of 270 or due west (36 faces north; 18 faces south).
- for the
navigation aid
type- The U.S. Federal Aviation Administration maintains a nationwide network of Very High Frequency Omni Ranging beacons (VORs). These transmit two signals, one of which is constant in phase regardless of an airplane's bearing to the VOR. The second signal varies in phase as one circles a VOR. Thus a VOR receiver in the airplane can compare the phase of the two signals and determine that an airplane is, for example, on the 123-degree radial from the VOR. If you didn't have a Global Positioning System receiver in your airplane, you'd determine your position on the chart by plotting radials out from two VORs. For a navaid, we need to store its type (could be an old non-directional beacon, which simply puts out an AM radio-style broadcast), frequency, position, and Morse code ID (you want to listen to the dot-dash pattern to make sure that you're receiving the proper navaid).
- for the
restaurant
type- menu_excerpt, hours, distance_from_airport, phone_number, url, email, street_address
- for the
hotel
type- price, distance_from_airport, phone_number, url, email, street_address
For an example of a running system along these lines, visit http://www.airnav.com/airports/ and type in the name or code for your favorite airport.
For ease of evaluation and interaction with your classmates, we suggest placing the user pages at http://yourservername/km/.
Here's an example SQL data model for a metadata repository:
-- note that this is designed for the Oracle 8i/9i RDBMS -- we'll have one row in this table for every object type -- and thus for every new SQL table that gets defined; an -- object type and its database table name are the same; -- Oracle limits schema objects to 30 characters and thus -- we limit a table_name to 21 characters so that we can -- have some freedom to create schema objects whose names -- are prefixed with an object type -- a "pretty name" is a synonym used when presenting pages -- to users; the prettiness could be as simple as replacing -- underscores with spaces or spelling out abbreviations; -- e.g., for an object type of "airplane_design", the pretty -- form might be "airplane design", and pretty_plural -- "airplane designs" create table km_metadata_object_types ( table_name varchar(21) primary key, pretty_name varchar(100) not null, pretty_plural varchar(100) ); -- here is the table for elements that are unique to an object type -- (the housekeeping elements can be defined implicitly in the source -- code for the application generator); there will be one row in -- the metadata table per element create table km_metadata_elements ( metadata_id integer primary key, table_name not null references km_metadata_object_types, column_name varchar(30) not null, pretty_name varchar(100) not null, abstract_data_type varchar(30) not null, -- ie. "text" or "shorttext" "boolean" "user" -- this one is not null except when abstract_data_type is "user" oracle_data_type varchar(30), -- "varchar(4000)" -- e.g., "not null" or "check foobar in ('christof', 'patrick')" extra_sql varchar(4000), -- values are 'text', 'textarea', 'select', 'radio', -- 'selectmultiple', 'checkbox', 'checkboxmultiple', 'selectsql' presentation_type varchar(100) not null, -- e.g., for textarea, this would be "rows=6 cols=60", for select, Tcl list, -- for selectsql, an SQL query that returns N district values -- for email addresses mailto: presentation_options varchar(4000), -- pretty_name is going to be the short prompt, -- e.g., for an update page, but we also need something -- longer if we have to walk the user through a long form entry_explanation varchar(4000), -- if they click for yet more help help_text varchar(4000), -- note that this does NOT translate into a "not null" constraint in Oracle -- if we did this, it would preclude an interface in which users create rows incrementally mandatory_p char(1) check (mandatory_p in ('t','f')), -- ordering in Oracle table creation, 0 would be on top, 1 underneath, etc. sort_key integer, -- ordering within a form, lower number = higher on page form_sort_key integer, -- if there are N forms, starting with 0, to define this object, -- on which does this go? (relevant for very complex objects where -- you need more than one page to submit) form_number integer, -- for full text index include_in_ctx_index_p char(1) check (include_in_ctx_index_p in ('t','f')), -- add forms should be prefilled with the default value default_value varchar(200), check ((abstract_data_type not in ('user') and oracle_data_type is not null) or (abstract_data_type in ('user'))), unique(table_name,column_name) );
text/plain
. You can save this to your local
file system as km-generated.sql
and feed it to your SQL
client when you're satisfied.
In addition to the housekeeping elements that you've defined for your
application, each object table should have an object_id
column. The value of this column should be unique across all of the
tables in the km module, which is easy to do in Oracle if you use a
single sequence to generate all the keys. Given unique object IDs
across types, if you were to add a km_object_registry
table, you'd be able to publish cleaner URLs that pass around only
object IDs rather than object IDs and types.
In addition to the metadata-driven object table definitions, your script should define a generalized mapping table to support links between knowledge objects. Here's an Oracle-syntax example:
Notice that this table allows the users to map an object to any other object in the system, regardless of type.create table km_object_object_map ( object_id_a integer not null, object_id_b integer not null, -- the objects are uniquely identified above but let's save ourselves -- hassle by recording in which tables to find them table_name_a not null references km_metadata_object_types, table_name_b not null references km_metadata_object_types, -- User-entered reason for relating two objects, e.g. -- to distinguish between John McCarthy the developer of -- Lisp and Gerry Sussman and Guy Steele, who added lexical scoping -- in the Scheme dialect map_comment varchar(4000), creation_user not null references users, creation_date date default sysdate not null, primary key (object_id_a, object_id_b) );
For simplicity, assume that associations are bidirectional. Suppose
that a knowledge author associates the Huffman encoding algorithm
(used in virtually every compression scheme, including JPEG) with the
person David A. Huffman (1925-1999; an MIT graduate student at the
time of his invention, which was submitted as a term paper). We
should also interpret that to mean that the person David A. Huffman is
associated with the algorithm for Huffman encoding. This is why the
columns in km_object_object_map
have names such as
"object_id_a" instead of "from_object".
In an Oracle database, the primary key constraint above has the side effect of creating an index that makes it fast to ask the question "What objects are related to Object 17, where Object 17 happens to appear in the A slot?" For efficiency in querying "What objects are related to Object 17, where Object 17 happens to appear in the B slot?", create a concatenated index on the columns in the reverse order from that of the primary key constraint.
The "Trees" chapter of SQL for Web Nerds, at http://philip.greenspun.com/sql/trees, gives some examples of concatenated indices. If you're using Oracle you may want to read the composite indices section of the Performance Guide and Reference manual (online and in the product documentation) and the SQL Reference manual's section on "Create Index".
ddl-generate
, called drop-tables-generate
.
This should generate DROP TABLE statements from the metadata tables.
You probably won't get your data model right the first time, so you
might as well be ready to clear out the RDBMS and start over.
Feed the database management system the results of your data model creation and clean-up scripts until you stop getting error messages.
This traditional approach has some drawbacks. First, it adds a mouse click before the administrator can see any user names. Ideally, you want every page of an application to display information and/or potential actions rather than pure bureaucracy and navigation. Second, and more seriously, this approach doesn't scale very well. When an administrator says "I need to see users who've registered within the last 30 days, who've contributed more than 4 product reviews, and who've bought at least $100 of stuff so that I can spam them with a coupon," another option must be added to the list. Eventually the navigation page groans with choices.
Imagine instead that the very first mouse click takes the administrator to a page that shows all the users who've registered in the last 30 days, in one big long list. At the top are sliders. Each slider controls a dimension, each of which can restrict or expand the number of items in the list. Here are some example dimensions for a community e-commerce site such as amazon.com:
How to implement dimensional controls? Sadly, there is no HTML tag that will generate a little continuous slider. You can simulate a slider by offering, for each dimension, a set of discrete points along the dimension, each of which is a simple text hyperlink anchor. For example, for content quality you might offer "4 or better", "3 or better", "2 or better", "all".
/km/
directory. At the very
least, the index page should display an unordered list of object types
and, next to each type, options to "browse" or "create". You don't
have any information in the database, so you should build a script
called object-create
first. This page will query the metadata
tables to build a data entry form to create a single object of a
particular type.
When your object creation pipeline is done inserting the row into the
database, it should redirect the author's browser to a page where the
object is displayed (name the script object-display
if
you don't have a better idea). Presumably the original author has
authority to edit this object and therefore this page should display
small hyperlinks to edit single fields. All of these links can target
the URL object-edit-element
with different arguments.
The object display page should also summarize all the currently linked
objects and have an "add link" hyperlink whose target is
link-add
.
The page returned by link-add
will look virtually
identical to the index page, i.e., a list of object types. Each
object type can be a hyperlink to a multi-purpose script at
one-type-browse
. When called with only a
table_name
argument, this page will display a table of object
names with dimensional controls at the top. The dimensions should be
"mine|everyone's" and "creation date". The user ought to be able to
click on a table header and sort by that column.
When called with extra arguments, one-type-browse
will
pass those arguments through to object-summarize
, a
script very similar to object-display
, but only showing
enough information that the author can positively identify the object
and with the additional ability to accept arguments for a potential
link, e.g., table_name_a
and object_id_a
.
It is important to create an incentive system that rewards users for exhibiting the desired behavior. At amazon.com, for example, the site owners want users to write a lot of reader reviews. At the same time, they apparently don't want to pay people to write reviews. The solution circa 2003 is to recognize contributors with a "reviewer rank". If a lot of other Amazon users have clicked to say that they found your reviews useful, you may rise above 1000 and a "Top 1000 Reviewer" icon appears next to your name. From the home page of Amazon, navigate to "Friends and favorites" (under "Special Features"). Then, underneath "Explore", click on "Top Reviewers". Notice that some of the top 10 reviewers have written more than 5000 reviews, all free of charge to Amazon!
What makes sense to reward in an online community? We could start with a couple of obvious activities: content authoring and question answering. Every night our system could query the content tables and update user ranks according to how many articles and answers they'd posted into the database. Is it really a good idea to reward users purely on the basis of volume? Shouldn't we give more weight to content that has actually helped people? For example, suppose that there are ten answers to a discussion forum question. It makes sense to give the maximum reward to the author of the answer that the person asking the question felt was most valuable. If a question can be marked "urgent" by the asker, it probably makes sense to give greater rewards to people who answer urgent questions than non-urgent ones. An article is nice, but an article that prompts another user to say "I reused this idea in my area of the organization" is much nicer and should be encouraged with a greater reward.
Modifycreate sequence km_object_view_id; create table km_object_views ( object_view_id integer primary key, -- which user user_id not null references users, -- two columns to specify which object object_id integer not null, table_name varchar(21) not null, view_time timestamp(0) not null, reuse_p char(1) default 'f' check(reuse_p in ('t','f')) );
object-view-one
so that it will insert a row into
the km_object_views
table if and only if there isn't
already a log row for this user/object pair within twenty-four hours. You
can do this with the following procedure:
Let's start with Oracle. Here's an example of an INSERT statement that only has an effect if there isn't already a row in the table:
insert into km_object_views (object_view_id, user_id, object_id, table_name, view_time)
select km_object_view_id.nextval, 227, 891, 'algorithm', current_timestamp(0)
from dual
where 0 = (select count(*)
from km_object_views
where user_id = 227
and object_id = 891
and view_time > current_timestamp - interval '1' day);
The structure of this statement is "insert into KM_OBJECT_VIEWS the
result of querying the 1-row system table DUAL". We're not pulling
any data from the DUAL table, only including constants in the SELECT
list. Nor is the WHERE clause restricting results based on
information in the DUAL table; it is querying KM_OBJECT_VIEWS. This
is a seemingly perverse way to use SQL, but in fact is fairly
conventional because there are no IF statements in standard SQL.
Suppose, however, that two copies of this INSERT start simultaneously. Recall that a transaction processing system provides the ACID guarantees: Atomicity, Consistency, Isolation, and Durability. Oracle's implementation of isolation, "the results of a transaction are invisible to other transactions until the transaction is complete", works by giving each user a virtual version of the database as it was when the transaction started.
Session A | Session B |
---|---|
Sends INSERT to Oracle at system change number ("SCN", a pseudo-time internal to Oracle) 30561.
Oracle counts the rows in km_object_views and finds 0.
Oracle inserts a row into km_object_views at SCN 30567
(took a while for the COUNT(*) to complete; meanwhile other users have
been inserting and updates rows in other tables).
|
Sends INSERT to Oracle at system change number 30562, a tick after
Session A started its transaction but several ticks before Session A
accomplished its insertion.
Oracle, busy with other users, doesn't start counting rows in km_object_views until SCN 30568, after the insert from
Session A. The database, however, will return 0 blocks because it is
presenting Session B with a view of the database as it was at SCN
30562, when the transaction started.
Having found 0 rows in the count, the INSERT proceeds to insert one row, thus creating a duplicate log entry. |
Now consider the same query running in SQL Server:
insert into km_object_views (user_id, object_id, table_name, view_time)
select 227, 891, 'algorithm', current_timestamp
where 0 = (select count(*)
from km_object_views
where user_id = 227
and object_id = 891
and datediff(hour, view_time, current_timestamp) < 24)
There are minor syntatic differences from the Oracle statement above,
but the structure is the same. A new row is inserted only if no
matching rows are found within the last twenty-four hours.
SQL Server achieves the same isolation level as Oracle ("Read Committed"), but in a different way. Instead of creating virtual versions of the database, SQL Server holds exclusive locks during data-modification operations. In the example above, Session B's INSERT cannot begin until Session A's INSERT has completed. Once it is allowed to begin, Session B will see the result of Session A's insert, and will therefore not insert a duplicate row.
More: See the "Understanding Locking in SQL Server" chapter of SQL Server Books Online, the Microsoft SQL Server documentation.
Whenever you are performing logging, it is considerate to do it on the server's time, not the user's. In many Web development environments, you can do this by calling an API procedure that will close the TCP connection to the user, which stops the upper-right browser corner icon from spinning/waving. Meanwhile your thread (IIS, AOLserver, Apache 2) or process (Apache 1.x) is still alive on the server and can run whatever code is necessary to perform the logging. Many Web servers allow you to define filters that run after the delivery of a page to the user.
Help with date/time arithmetic: see the "Dates" chapter of SQL for Web Nerds at http://philip.greenspun.com/sql/dates.
object-view-one
to add a "I reused this knowledge"
button. This should link to object-mark-reused
, a page
that updates the reuse_p
flag of the most recent relevant
row in km_object_views
. The page should raise an error
if it can't find a row to update.
object-view-one
that does its
logging on the server's time, explain the concurrency problem that
arises in Exercise 7 and talk about ways to address it.
Write up your solutions to these non-coding exercises either in your
km module overview document or in a file named
metadata-exercises
in the same directory.
km_object_views
that will make the
code in Exercises 6 and 7 go fast.
/km/admin/statistics
to
show, by day, the number of objects viewed and reused. This report
should be broken down by object type and all the statistics should be
links to "drill-down" pages where the underlying data are exposed,
e.g., which actual users viewed or reused knowledge and when.