2007-03-08

XPath as a first-class citizen

The DOM standard packs great power, one of them that of XPath, for slicing and dicing DOM trees (whether originally from XML or HTML input). Compared to the very good integration of similar javascript language features -- E4X in recent javascript implementations, or the RegExp object, with us since Javascript's earliest days, the DOM XPath interface is a very sad story. I will refrain from telling it here, and primarily focus on how it should be redone, to better serve script authoring. My scope in this post is the web browser, so as not to confuse things with, say, ECMAScript, which is a language used in many non-DOM-centric environments too.

First, we might recognize that an XPath expression essentially is to a DOM Node tree what a RegExp is to a string. It's a coarse but valid comparison, and one that should guide how we use XPath in web programming. From here, let's thus copy good design and sketch the outlines of an XPath class that is as good and versatile for the DOM as RegExp is for strings.

Javascript RegExps are first class objects, with methods tied to them that take a data parameter to operate on. XPaths should be too. Instantiation, using the class constructor, thus looks like:
var re = new RegExp("pattern"[, "flags"]);
var xp = new XPath("pattern"[, "flags"]);
The respective patterns already have their grammars defined and are thus left without further commentary here. RegExp flags are limited to combinations of:
"g"
global match
"i"
case insensitive
"m"
multiline match
The XPath flags would map to their XPathResultType counterparts (found on the XPathResult object) for specifying what properties of the resulting node set you are interested in (if you write docs for that horrible API, please copy this enlightening table):
Nodes wantedBehaviourUnOrderedOrdered
MultipleIteratorUNORDERED_NODE_ITERATOR_TYPE=4ORDERED_NODE_ITERATOR_TYPE=5
MultipleSnapshotUNORDERED_NODE_SNAPSHOT_TYPE=6ORDERED_NODE_SNAPSHOT_TYPE=7
SingleSnapshotANY_UNORDERED_NODE_TYPE=8FIRST_ORDERED_NODE_TYPE=9

that are reducible to permutations of whether you want a single or multiple items, want results sorted in document order or don't care, and if you want a snapshot, or something that just lets you iterate through the matches until you perform your first modification to a match. There are really ten options in all, but NUMBER_TYPE=1, STRING_TYPE=2 and BOOLEAN_TYPE=3 were necessitated by a design flaw we shall not repeat, and ANY_TYPE=0 is the automatic pick between one of those or UNORDERED_NODE_ITERATOR_TYPE=4. Let's have none of that.

Copying some more good design, let's make those options a learnable set of three one-letter flags, over the hostile set of ten types, averaging 31.6 characters worth of typing each (or a single digit, completely devoid of semantic memorability). When desigining flags, we get to pick a default flag-less mode and an override. In RegExp the least expensive case is the default, bells and whistles invokable by flag override; we might, for instance, heed the same criteria (or, probably better still, deciding on what constitutes the most useful default behaviour, and naming the flags after the opposite behaviour instead):
"m"
Multiple nodes
"o"
Ordered nodes
"s"
Snapshot nodes
I briefly mentioned a design error we shouldn't repeat. The DOM document.evaluate, apart from having a long name on its own, and further drowning you in mandatory arguments and 30-to-40 character type names, does not yield results you can use right away as part of a javascript expression. Instead it hands you some ravioli, in the form of an XPathResult object, which you may pry the actual results off, by jumping through a few hoops. This is criminally offensively bad interface design, in innumerable ways. Again, let's not go there.

It might be time we decided on calling conventions, so we have some context to anchor up what the results returned are with. Our XPath object (which, contrary to a result set, makes lots of sense sticking into an object, to keep around for doing additional queries with later on without parsing the path again) has an exec() method, as does RegExp, and it takes zero to two arguments.

xp.exec( contextNode, nsResolver );

The first argument is a context node, from which the expression will resolve. If undefined or null, we resolve against document.documentElement. The context node may be anything accepted as a context node by present day document.evaluate, or an E4X object.

The second argument is, if provided as a function, a namespace resolver (of type XPathNSResolver, just as with the DOM API). If we instead provide an object, do a lookup for namespace prefixes from it by indexing out the value from it, as with an associative array. In the interest of collateral damage control, should the user have cluttered up Object.prototype, we might be best off to only pick up namespaces from it whose nsResolver.hasOwnProperty(nsprefix) yields true.

The return value from this method is similar in intended spirit to XPathResult.ANY_TYPE, but without the ravioli. XPaths yielding number, string or boolean output returns a number, string or boolean. And the rest, which return node sets, return a proper javascript Array of the nodes matched. Or, if for some reason an enhanced object be needed, one which inherits from Array, so that all (native or prototype enhanced) Array methods; shift, splice, push and friends, work on this object, too.

Finally, RegExps enjoy a nice, terse, literal syntax. I would argue that XPaths should, as well. My best proposal (as most of US-ASCII is already allocated) is to piggy-back onto the RegExp literal syntax, but mandate the flag "x" to signify being an XPath expression. Further on, as / is a too common character in XPath to even for a moment consider having to quote it, make the contents of the /.../ containment a '-encased string, following all the common string quoting conventions.

var links_xp = /'.//a[@href]'/xmos;
var posts_xp = /'//div[@class="post"]'/xmos;


for instance, for slicing up a local document.links variant for some part of your DOM tree, and for picking up the root nodes of all posts on this blog page respectively. And it probably already shows that the better default is to make those flags the default behaviour, and name the inverse set instead. Perhaps these?
"s"
Single node only
"u"
Unordered nodes
"i"
Iterable nodes

When requesting a single node, you get either the node that matched, or null. Illegal combinations of flags and XPath expressions ideally yield compile time errors. Being able to instantiate new XPath objects off old ones, given new flags, would be another welcome feature. There probably are additional improvements and clarifications to make. Good ideas shape the future.

2007-03-07

Firefox content type bugfix extension

The not much heard of Open in browser extension is a great improvement over the Firefox deficiency of disallowing user overrides to the Content-Type header passed by web servers. The Content-Type header, if unfamiliar, is what states the data type of files you download, so that your browser can pick a suitable mode of decoding (if it is a JPEG picture, use the native JPEG decoder, for instance, while showing text files as text) and presenting the data.

Web server admins are people too that make mistakes, or occasionally have weird ideas contrary to yours about how to present a file (prompting with a Save As... dialog for plain text, HTML, or images, most frequently), instead of showing it right in the browser, and a native Firefox lets them rule you. This extension grants you the option of choice.

It presently (v1.1) does not allow you to specify any legal content type override, but it handles the basic xxx/yyy types, while considering "text/plain; charset=UTF-8", for instance, illegal. This seems to be a common misconception about Content Types (or MIME types, as they are also commonly called), which I would like to see fade away. interested in references, §14.17 of RFC 2616 (HTTP) states that the leading type/subtype declaration may be followed by any number of {semicolon, attribute=value pair} blocks, so if you are tempted to do validation on legal content type declarations, for some reason, don't disallow those.

Excerpt of the relevant ABNF, if you want to generate a proper grammar validator:

media-type     = type "/" subtype *( ";" parameter )
type = token
subtype = token
parameter = attribute "=" value

attribute = token
value = token | quoted-string
token = 1*<any CHAR except CTLs or separators>

CHAR = <any US-ASCII character (octets 0 - 127)>
separators = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT

CTL = <any US-ASCII control character
(octets 0 - 31) and DEL (127)>

quoted-string = ( <"> *(qdtext | quoted-pair ) <"> )
qdtext = <any TEXT except <">>
<"> = <US-ASCII double-quote mark (34)>

TEXT = <any OCTET except CTLs,
but including LWS>
OCTET = <any 8-bit sequence of data>
CTL = <any US-ASCII control character
(octets 0 - 31) and DEL (127)>
LWS = [CRLF] 1*( SP | HT )

CRLF = CR LF
CR = <US-ASCII CR, carriage return (13)>
LF = <US-ASCII LF, linefeed (10)>
SP = <US-ASCII SP, space (32)>
HT = <US-ASCII HT, horizontal-tab (9)>

quoted-pair = "\" CHAR


If not, and you, say, want to do it with a Javascript regexp instead, here is a free regexp to choke on, if you really do want to do strict content type validity checking by (javascript) regexp, rather than, say, check for validity using a more lax variant, perhaps /.+\/.+/:

var validMIME = /^[^\0- "(),\/:-@\[-\]{}\x80-\xFF]+\/[^\0- "(),\/:-@\[-\]{}\x80-\xFF]+(;[^\0- "(),\/:-@\[-\]{}\x80-\xFF]+=([^\0- "(),\/:-@\[-\]{}\x80-\xFF]+|"(\\[\0-\x7F]|[^\\"\0-\x08\x0B\x0C\x0E-\x1F])*?"))*$/;

As you see, regexps are really horrible tools for doing this kind of thing, though, but with a bit of pain they can do the work. I'd suggest keeping a link to this page in a short comment in your code, should you adopt that monster, in case you will ever have any issues with it, or need to work out why it bugs out. Chances are your IDE does not let you mark up token semantics the way I did above.

2007-03-02

Google Pages command-line upload

Google Pages offers you 100 megabytes of free web storage, where you can put html, images, text, javascript, music, video and pretty much whatever you like. You also get five free sub-domains of your choice to googlepages.com. That's on the plus side.

(I have been toying with Exhibit showcase hacks there, gathering up my Exhibit hacks and mashups as I write them.)

On the minus side, you can presently only drop files in a single directory level, files are typically not served at greased-weasel speed and latency and you have to use either a Firefox or Internet Explorer browser to post them, and using an ajaxy form at that -- no sftp, ftp, webdav or HTTP PUT access in sight. (I also believe I've read about a top number of files per site in the 500 range.)

Anyway, I tried to craft my first shaky ruby hack last week, to get a command line client which would let me upload files in batch. I unfortunately failed to navigate the Google login forms rather miserably (should someone want to point out what I do wrong, or rather how I ought to do instead, the broken code is here; a good-for-nothing dozen-liner of WWW:Mechanize).

So I resorted to the classic semi-automated way: logging in by browser, stealing my own cookies and doing the rest of the work with curl. It works, and is less painful than tossing up fifty-something files by mayhem-clicking an ajax form upload, however comparatively nice they made it with a web default style form. This recipe requires a working knowledge of your shell, an installed curl, and being logged in to Google Pages and having chosen the appropriate site.

Then invoke the cookie stealer bookmarklet and copy the value to your clipboard. I suggest you head over to your shell right away, type export googlecookie='' and paste your cookie between the single quotes.

Head back to your browser window, to invoke the auth token post url stealer bookmarklet, which picks up the form target url. Copy it to your clipboard, head back to the shell and type export googletarget='' (again pasting the value between the single quotes). Now you're set.

To upload a file now, all you need to do is run curl -F file=@$filename --cookie $googlecookie $googletarget and it gets dropped off as it should. And zooming up a whole junkyard of files is no more painful:

zsh> for i in *.png; curl -F file=@$i --cookie $googlecookie $googletarget

It's not pretty, but it is some pain relief, if you're hellbent on playing with free services rather than getting dedicated hosting. It's also a "because it's possible" solution -- and those are fun to craft, once in a while. I'd love to find out what I didn't figure out about taming the Google login form via Ruby, or vice versa. Your competent input most welcome.