Monday, November 13, 2006

Closures Esoterica: completion transparency

The Closures for Java draft specification (currently at v0.3) is the product of a lot of work. Everything in the spec has been discussed and scrutinized before being placed into the specification, and is there for a reason. From your point of view, you have seen snapshots of our specification, with very little explanation of why things are the way they are. Changes from one revision to another may seem arbitrary. I am not writing detailed notes on the progress and evolution of the specification and its prototype, because there is too much to explain and too little time would be left for me to work on the specification and prototype. I am, after all, working on this on my own time. I suspect few people would care for that level of detail anyway. Much of the work on the specification takes place during face-to-face meetings, after which I update the specification to reflect our decisions. Some issues get resolved through email discussions. We've been keeping an eye on the comments on this blog, various message boards, mailing lists, and elsewhere for input, and meeting with interested folks, and that has been a very useful part of the process. Thank you for your help!

I'm preparing the next revision of the specification, and we just resolved an issue by email. Unless you know what the issue is, the change in the upcoming version of the specification will appear totally arbitrary. Because the issue was resolved by email, I have a nice record of the problem and its resolution. Here is an edited excerpt of the start of our discussion:

We'd like programmers to be able to define their own control constructs. One thing that would make this easier would be if programmer-defined control constructs act like built-in ones for the purposes of handling reachability of statements (and "can complete normally"). That's why we added the bottom type java.lang.Unreachable to the specification. But just having that isn't enough. Watch as I try to define a withLock method that is completion transparent.

First, ignoring reachability, the library writer might define

<throws E> void withLock(Lock lock, {=>void throws E} block) throws E { ... }

This achieves exception transparency, but not completion transparency. If the block returns a value, this would work:

<T, throws E> T withLock(Lock lock, {=>T throws E} block) throws E { ... }

This works because a closure that can't complete normally has the result type Unreachable, and the closure conversion lets such a closure be used where {=>T throws E} is expected. In practice, any specific invocation of withLock will either have a block that doesn't return anything (i.e. {=>void}) or one that can't complete normally (i.e. {=>Unreachable}). You might think, therefore, that the library writer need only write these two overloaded methods to achieve completion transparency, and let overload resolution take care of the rest.

Unfortunately it isn't that simple. With these two methods, an invocation of withLock using a closure that can't complete normally could resolve to either of them. That's because the closure conversion can convert a closure that results in Unreachable to an interface whose method returns void. Since both methods are applicable, the compiler must select the most specific one. Neither of these methods is more specific than the other (the two function types map to unrelated interfaces, given our scheme of mapping function types to interfaces), so the invocation is ambiguous.

I don't propose that we mess with the specification for "most specific". I'm afraid that would be a disaster, though maybe you can reassure me that it isn't. Instead, I propose that the closure conversion be allowed to convert a closure that results in void to an interface whose method's return type is java.lang.Void. The generated code would naturally return a null value. Then the library writer would write only the second version, above, and it would work both for the void-returning case and the Unreachable-returning case. I think being able to write control abstractions as a single method (instead of two overloadings) is a significant advantage. Additionally, this API is more flexible because it can be used by programmers to pass a value back through from the closure to the caller.
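
For concreteness, here is a rough Java 5 sketch of what the second overloading might look like if we hand-desugar the function type {=>T throws E} into an ordinary interface. The names Block, invoke, and WithLockSketch are made up for illustration and are not part of the proposal; the point is simply where the null result comes from today.

import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the function type {=>T throws E}.
interface Block<T, E extends Exception> {
    T invoke() throws E;
}

class WithLockSketch {
    static <T, E extends Exception> T withLock(Lock lock, Block<T, E> block) throws E {
        lock.lock();
        try {
            return block.invoke();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        final Lock lock = new ReentrantLock();

        // Value-returning use: T is inferred as String.
        final String s = withLock(lock, new Block<String, RuntimeException>() {
            public String invoke() { return "computed under the lock"; }
        });

        // "void" use: without the proposed conversion, the caller must choose
        // T = Void and return null by hand; the conversion described above would
        // let the compiler supply that null result for a void-returning closure.
        withLock(lock, new Block<Void, RuntimeException>() {
            public Void invoke() {
                System.out.println(s);
                return null;
            }
        });
    }
}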

We discussed this issue and found the proposed resolution our best option. The next version of the proposal will include in the closure conversion a provision allowing conversion of a closure that returns void to an interface type whose method returns java.lang.Void. You can see hints in this email thread that the syntax of a function type has changed slightly (the throws clause has moved inside the curly braces), and that a function type is now specified to be an interface type, rather than having its own type rules. The specification is getting simpler, which is definitely a move in the right direction!

Sunday, November 05, 2006

Reified Generics for Java

Many people are unsatisfied with the restrictions caused by the way generics are implemented in Java. Specifically, they are unhappy that generic type parameters are not reified: they are not available at runtime. Generics are implemented using erasure, in which generic type parameters are simply dropped by the compiler, so they are not present at runtime. That doesn't render generics useless, because you get typechecking at compile time based on the generic type parameters, and also because the compiler inserts casts in the code (so that you don't have to) based on the type parameters.
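
For example, here is a minimal sketch of that compile-time behavior; the class and variable names are made up for illustration.

import java.util.ArrayList;
import java.util.List;

class ErasureExample {
    public static void main(String[] args) {
        List<String> names = new ArrayList<String>();
        names.add("duke");
        // The type argument is checked at compile time; at runtime the object is
        // just a List, and the compiler has inserted the cast on our behalf:
        String first = names.get(0);   // compiled as (String) names.get(0)
        System.out.println(first);
    }
}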

Generics are implemented using erasure as a response to the design requirement that they support migration compatibility: it should be possible to add generic type parameters to existing classes without breaking source or binary compatibility with existing clients. I wrote about this two years ago. Migration compatibility is exploited widely in JDK5; all of the collection classes and interfaces use generics, yet existing code using collections continues to work. Without migration compatibility, the collection APIs could not have been retrofitted to use generics; we would probably have added a separate, new set of collection APIs that use generics. That was the approach used by C# when generics were introduced, but Java did not take this approach because of the huge amount of pre-existing Java code using collections.

While solving one set of problems, erasure adds a set of its own problems. For a type parameter T, you can't write a class literal T.class. You can't use instanceof to test whether an object is of type T. And you can't create an array of T. But it gets worse. You also can't write class literals for generic types like List<String>.class, or test whether an object is an instance of List<String>, or create an array of List<String>. The implementation of generics using erasure also causes Java to have unchecked operations, which are operations that would normally check something at runtime but can't do so because not enough information is available at runtime. For example, a cast to the type List<String> is an unchecked cast, because the generated code checks that the object is a List but doesn't check whether it is the right kind of list.
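
Here is a small sketch of those restrictions, with the rejected lines commented out and the compiler's complaint summarized alongside (the class name is made up):

import java.util.List;

class ErasureRestrictions<T> {
    void examples(Object o) {
        // Class<T> c = T.class;                    // error: no class literal for a type variable
        // boolean b1 = o instanceof T;             // error: cannot test against a type variable
        // T[] a = new T[10];                       // error: generic array creation
        // Class<?> c2 = List<String>.class;        // error: no class literal for a parameterized type
        // boolean b2 = o instanceof List<String>;  // error: illegal generic type for instanceof
        @SuppressWarnings("unchecked")
        List<String> strings = (List<String>) o;    // compiles, but only "is it a List?" is checked at runtime
    }
}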

It isn't too late to add reified generics to Java. There are two basic approaches. The first uses the language as it exists today but adds the generic type information at runtime. In the ideal world, we wait until every bit of Java code in the world has been modified to use generics safely, and then go through a transition in which we start recording type parameters at runtime by using a new javac and VM. There are two main difficulties with this approach. First, it is not compatible. It is not source compatible because you would no longer be allowed to use raw types (i.e., it is impractical to wait until all code has been generified); it is not binary compatible because new clients would invoke new kinds of constructors for generic classes that also record the type parameters, but existing classes do not have those constructors. (A hybrid approach is being investigated in which the system records type parameters only when they are available, but I haven't yet seen a workable proposal along these lines.) The second problem is more insidious: lots of existing code uses generics but uses them in an unsafe way. For example, I have seen code that creates an ArrayList<Object>, but later casts it to a List<String> where the author knows that it only contains objects of type String. I would not be surprised to find such code in the JDK. Such code must fail at runtime when generics are reified, so some existing (but working) code would be broken.
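
The pattern described above looks roughly like the following sketch (the names are made up; the unchecked cast is the spot where reification would start to throw):

import java.util.ArrayList;
import java.util.List;

class UnsafeButWorking {
    // The author "knows" the returned list will only ever hold Strings.
    static Object makeList() {
        List<Object> list = new ArrayList<Object>();
        list.add("alpha");
        list.add("beta");
        return list;
    }

    public static void main(String[] args) {
        @SuppressWarnings("unchecked")
        List<String> strings = (List<String>) makeList(); // unchecked: under erasure nothing is verified here
        System.out.println(strings.get(0));               // works today; with reified generics the cast above
                                                          // would fail, because the object is a List<Object>
    }
}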

The other approach modifies the language so that the declaration of a generic parameter distinguishes reified type parameters from erased ones. It is a pure extension of the language. The existing syntax for generics would continue to designate generics by type erasure, but a newly introduced syntax would be used to declare reified type parameters. Perhaps something like this:

interface NewCollection<class E> extends Collection<E> { ... }
interface NewList<class E> extends NewCollection<E>, List<E> { ... }

The rules for these reified type parameters are straightforward. When using a reified generic class, the type argument has to be a reified type. You would not be allowed to create a NewCollection in its raw form. Code that uses only reified types would never get an unchecked warning from the compiler, because the compiler can always generate the correct checking code. If you have a reference of type Collection that refers to a NewCollection<String>, you would get a runtime error if you attempt to put anything other than a String into it. In this sense reified collections are like java.util.Collections.checkedCollection, only better because the element type can itself be generic and the guarantees are stronger. You can even create an array of E inside an implementation of this type. But there is a disadvantage: the strategy requires a new set of reified Collections APIs parallel to the existing erasure-based ones. This is why C# introduced new collections APIs when they introduced generics.
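
For comparison, the checkedCollection behavior alluded to above can be demonstrated in today's Java; a reified NewCollection<String> would give a similar guarantee at the point of insertion without the wrapper, and a stronger one, as noted above. The class name is made up:

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;

class CheckedAnalogy {
    public static void main(String[] args) {
        Collection<String> strings =
            Collections.checkedCollection(new ArrayList<String>(), String.class);

        // Legacy code holding a raw reference, as erased clients often do:
        Collection raw = strings;
        raw.add("ok");                    // accepted
        raw.add(Integer.valueOf(42));     // ClassCastException at the point of insertion
    }
}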

As the above declaration illustrates, the new reified collection interfaces could be extensions of the existing collection interfaces. Consequently, existing APIs that receive old collections can be passed new collections. For example, you could pass a NewCollection<String> to an API that is declared to receive a Collection<String>. In addition, the new reified collection classes (that is, the implementations) could be extensions of the existing collection classes. That allows the implementation of the old and new collection classes to be shared, and provides a nice migration path. Existing APIs that return old collections can be modified to return new collections. Over time, most libraries can be migrated to use the new APIs without breaking backward compatibility (though it would break unsafe clients). In this way we can have a more gradual migration than would be possible with the first approach.

I don't mean to trivialize the work involved: this requires a significant revision of the VM and language specifications and a significant effort from the teams who implement VMs, Java compilers (i.e. javac and HotSpot), and the core JDK libraries. From the programmer's point of view, however, the language change is small and mostly involves removing restrictions.

Approaching the problem this way has a secondary benefit. Because the new reified collection interfaces don't exist yet, it is possible for methods to be added to them when they are introduced. Perhaps sorting methods belong in NewList? If reified generics are added at the same time as (or after) closures, a number of useful methods could be added to take advantage of closures. Filters, mappers, reducers, and visitors are just some of the ideas.