Sunday, August 24, 2008
Well, when I say fun, what I really mean is lots of pain and misery persuading NetBeans to use alternative versions of javac, such as the one now available here.
Makes for a nice screenshot though:
Thursday, June 26, 2008
Schwartzian Transforms in Java
Libraries like Java's Collections API contain numerous algorithms tuned to give good performance in a theoretical 'typical case', but it can be very easy to forget that these algorithms aren't magical, and have certain requirements of their own if that goal is to be achieved.
On numerous occasions, I've written or encountered code that seemed to work perfectly well during development and testing, but performed very poorly when it had to deal with a larger or more complex data-set, something which often happens only after the code has been put into production use.
Here's a program which lists the contents of a specified directory ordered by size, with the size of a child directory being calculated from the total size of all of its contents:
import java.io.*;
import java.util.*;
public class FileSizes {
public static void main(String[] args) {
File directory = new File(args[0]);
List<File> files = new ArrayList<File>(Arrays.asList(directory.listFiles()));
Collections.sort(files, new Comparator<File>() {
public int compare(File file1, File file2) {
long file1Size = calculateSize(file1);
long file2Size = calculateSize(file2);
if (file1Size == file2Size)
return 0;
return file1Size < file2Size ? -1 : 1;
}
});
for (File file : files)
System.out.println(file);
}
static long calculateSize(File file) {
long size = 0;
if (file.isDirectory())
for (File child : file.listFiles())
size += calculateSize(child);
else
size = file.length();
return size;
}
}
Now, how long does this program take to execute?
Potentially a lot longer than it should, that's how long.
The problem here is that during the sort operation, each item being sorted may be compared against the other items multiple times - exactly how many times depends on the sorting algorithm in use and on the size and ordering of the initial data-set. If that comparison operation involves some expensive calculation, as it does in this rather blatant example, repeating the calculation on every comparison is usually wasted effort. In such cases, it's often preferable to cache the result of the calculation the first time it's performed, and reuse that cached version if it's needed again.
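For instance, the caching could be done directly inside the Comparator - a minimal sketch, in which the sizeCache map and the sizeOf() helper are names introduced here purely for illustration:

final Map<File, Long> sizeCache = new HashMap<File, Long>();
Collections.sort(files, new Comparator<File>() {
    public int compare(File file1, File file2) {
        return sizeOf(file1).compareTo(sizeOf(file2));
    }
    // compute each file's size at most once, however many comparisons the sort makes
    private Long sizeOf(File file) {
        Long size = sizeCache.get(file);
        if (size == null) {
            size = calculateSize(file);
            sizeCache.put(file, size);
        }
        return size;
    }
});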
A common technique used by Perl programmers faced with this problem is to employ something known as the Schwartzian Transform, which looks something like this:
my @sorted_data =
    map  { $_->[0] }
    sort { $a->[1] cmp $b->[1] }
    map  { [$_, expensive_calculation($_)] }
    @unsorted_data;
The idea here is that we start with the unsorted data (read the code from the bottom line up), and apply a mapping function which maps each item to a two-element array containing the original item at index 0 and its associated calculated value at index 1. These are then sorted, using the element at index 1 for the comparisons, and the result is finally fed through another mapping function to extract just the original items, now in the desired order.
We can try something similar in Java. We'll need a mapping method, and let's make it general enough that we can give it any Iterable (such as a List) and a Mapper to apply to each element:
interface Mapper<T,U> {
    U map(T item);
}

<T,U> Iterable<U> map(final Iterable<T> items, final Mapper<? super T, ? extends U> mapper) {
    return new Iterable<U>() {
        public Iterator<U> iterator() {
            // must be final so the anonymous Iterator below can capture it
            final Iterator<T> iter = items.iterator();
            return new Iterator<U>() {
                public boolean hasNext() { return iter.hasNext(); }
                public U next() { return mapper.map(iter.next()); }
                public void remove() { iter.remove(); }
            };
        }
    };
}
To match the Perl example, we'll also want a sort method which sorts and returns a new List rather than modifying the one passed to it:
<T> List<T> sort(Iterable<T> items, Comparator<? super T> comparator) {
    List<T> list = new ArrayList<T>();
    for (T t : items)
        list.add(t);
    Collections.sort(list, comparator);
    return list;
}
Finally we'll need some kind of data structure to hold the item together with its calculated value. A type-safe Pair class will do:
class Pair<A,B> {
    public static <A,B> Pair<A,B> of(A fst, B snd) {
        return new Pair<A,B>(fst, snd);
    }
    private final A fst;
    private final B snd;
    private Pair(A fst, B snd) {
        this.fst = fst;
        this.snd = snd;
    }
    public A fst() { return fst; }
    public B snd() { return snd; }
}
Now we can replace the Collections.sort call in the original example with a Schwartzian Transform:
Iterable<File> sortedFiles =
    map(sort(map(files,
                 new Mapper<File, Pair<File,Long>>() {
                     public Pair<File,Long> map(File f) {
                         return Pair.of(f, calculateSize(f));
                     }
                 }),
             new Comparator<Pair<File,Long>>() {
                 public int compare(Pair<File,Long> p1, Pair<File,Long> p2) {
                     return p1.snd().compareTo(p2.snd());
                 }
             }),
        new Mapper<Pair<File,Long>,File>() {
            public File map(Pair<File,Long> p) {
                return p.fst();
            }
        });
Well, that's pretty horrible, whichever way you try to lay out the code. Still, there's one improvement we could make - since we're just writing a Comparator which extracts the Long values and compares them, we could overload our sort method to accept a Mapper instead of a Comparator:
<T, U extends Comparable<U>> List<T> sort(Iterable<T> items, final Mapper<? super T, U> mapper) {
    List<T> list = new ArrayList<T>();
    for (T t : items)
        list.add(t);
    Collections.sort(list, new Comparator<T>() {
        public int compare(T t1, T t2) { return mapper.map(t1).compareTo(mapper.map(t2)); }
    });
    return list;
}
That allows us to shorten the code a little:
Iterable<File> sortedFiles =
    map(sort(map(files,
                 new Mapper<File, Pair<File,Long>>() {
                     public Pair<File,Long> map(File f) {
                         return Pair.of(f, calculateSize(f));
                     }
                 }),
             new Mapper<Pair<File,Long>,Long>() {
                 public Long map(Pair<File,Long> p) {
                     return p.snd();
                 }
             }),
        new Mapper<Pair<File,Long>,File>() {
            public File map(Pair<File,Long> p) {
                return p.fst();
            }
        });
It's still far too cumbersome though - if I were writing this 'for real' I'd probably have given up by now.
Let's try it using closures instead of anonymous classes:
Iterable<File> sortedFiles =
    map(sort(map(files,
                 {File f => Pair.of(f, calculateSize(f))}),
             {Pair<File,Long> p => p.snd()}),
        {Pair<File,Long> p => p.fst()});
That's much better to my eyes - maybe not quite as succinct as the Perl version, but it's digestible.
The Mapper interface can be replaced by function types, so let's lose it and apply a bit more closures magic to the map and sort methods:
<T,U> Iterable<U> map(Iterable<T> items, {T=>U} mapper) {
    return {=>
        final Iterator<T> iter = items.iterator();
        new Iterator<U>() {
            public boolean hasNext() { return iter.hasNext(); }
            public U next() { return mapper.invoke(iter.next()); }
            public void remove() { iter.remove(); }
        }
    };
}
<T,U extends Comparable<U>> List<T> sort(Iterable<T> items, {T=>U} c) {
    List<T> list = new ArrayList<T>();
    for (T t : items)
        list.add(t);
    Collections.sort(list, {T t1, T t2 => c.invoke(t1).compareTo(c.invoke(t2))});
    return list;
}
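As a quick sanity check, here's what calling this version of map looks like on its own (a hypothetical usage sketch, lazily mapping files to their names):

Iterable<String> names = map(files, {File f => f.getName()});
for (String name : names)
    System.out.println(name);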
Of course we could also tuck all that map/sort/map stuff away in a handy reusable method:
<T,U extends Comparable<U>> Iterable<T> schwartzianSort(Iterable<T> items, {T=>U} mapper) {
    return map(sort(map(items,
                        {T t => Pair.of(t, mapper.invoke(t))}),
                    {Pair<T,U> p => p.snd()}),
               {Pair<T,U> p => p.fst()});
}
Leaving us with:
public static void main(String[] args) {
    File directory = new File(args[0]);
    List<File> files = new ArrayList<File>(Arrays.asList(directory.listFiles()));
    for (File file : schwartzianSort(files, {File f => calculateSize(f)}))
        System.out.println(file);
}
Much nicer!
Sometimes, it's not immediately obvious what a method like that should look like, or how it might be implemented. Breaking the algorithm down into a series of functions, as the Schwartzian Transform does, can be a useful way to approach the problem at hand, but only if the language you're using has sufficiently practical constructs.
updated 29/06/08 to fix a bug in map() pointed out by Konstantin Triger.
updated 03/07/08 to fix a bug in the Comparator pointed out by bjkail.
Thursday, May 22, 2008
Nothing's guaranteed
Kasper B. Graversen's blog entry on 'Exception handling problems in Java' came up in my news reader today, courtesy of dzone. He wants to be able to extract common exception handling code from catch clauses into methods, allowing reusability of that code, and removing clutter from the main logic.
Here's a silly example, replete with dubious practices for the sake of conciseness :)
Food makeLunch() {
    try {
        return kitchen.makeSandwich();
    }
    catch (KitchenClosedException ex) {
        handleException(ex, "Couldn't make a sandwich :(");
    }
    catch (NoBreadException ex) {
        handleException(ex, "Couldn't make a sandwich :(");
    }
    return null;
}

void handleException(Exception ex, String message) throws RuntimeException {
    log.error("Something went wrong", ex);
    throw new RuntimeException(message, ex);
}
Kasper points out that the 'return null;' statement at the end of the makeLunch() method is pointless - although we can see that it will never be executed, the compiler requires it because the signature of the handleException() method only says that it may throw an exception, but does not guarantee it.
There's some discussion of the 'problem' here, including alternative ways to structure the code.
With the BGGA closures prototype though, we can rewrite the example as follows:
Sandwich makeLunch() {
    try {
        return kitchen.makeSandwich();
    }
    catch (KitchenClosedException | NoBreadException ex) {
        handleException(ex, "Couldn't make a sandwich :(");
    }
}

Nothing handleException(Exception ex, String message) throws RuntimeException {
    log.error("Something went wrong", ex);
    throw new RuntimeException(message, ex);
}
There are three changes to the original example here:
- BGGA allows us to catch multiple types of exception with one catch clause.
- The signature of the handleException() method now has a return type of java.lang.Nothing, which guarantees that the method will not 'complete normally' (because it always throws an exception). The compiler can enforce this guarantee, and we can take advantage of it.
- Because of the above change, the 'return null;' is neither required nor allowed, since the compiler now knows that the statement can never be executed.
Which is nice.
Thursday, March 6, 2008
Taking control
I've been back in the world of GUI development in the last few days, which I really enjoy when I get a chance to do it.
It's not too often that I hear other people say that they actually enjoy building GUIs though - many of the developers I've worked with or interviewed have made it very clear that they are much happier writing 'back-end stuff': database persistence, messaging systems, business logic. The kind of thing you can define with use cases, implement, test with Unit and Fit tests, and put a tick in the box when you've met all the requirements.
Which is fair enough - when you've come up with a simple, robust and maintainable solution that meets all the requirements, it can be very satisfying being able to put that tick there.
GUI development can seem so... messy. Vague specifications, awkward coding models and difficult testing are all common problems. And the bugs can bite so much harder.
When a network messaging sub-system goes wrong, there's a fair chance it 'just' means that the data the user's seeing is out of date. Maybe the developer who wrote the messaging code had the presence of mind, and the time, to implement some decent error handling, so you might get to see the problem in log files and figure out what went wrong before the shouting starts.
When a GUI goes awry though, Murphy's Law says it'll Go Bang in the user's face (and probably during a demo, in a way that has never happened before) and you damn well know about it when it happens... you'll get a screenshot as proof if you're lucky.
In the wild and wacky world of multi-threaded, object-oriented, event-driven programming, we find ourselves creating various coding patterns to deal with the complexity:
"Every time I update the model, I need to ensure a flag is set".
"Every time I display some data, I need to check the flag and perform some action depending on its state...".
And so on. Simple rules, designed to introduce invariants, checks, guards, some kind of control into a seemingly non-deterministic model. And in doing so we drag in boilerplate code, with all its inherent problems as it's copied, pasted and maintained over time.
We can often attempt to avoid these problems by pushing this boilerplate down into our existing layers of object abstractions, or by creating new layers to deal with this application-specific logic. But they don't always fit - sometimes because it's like fitting a square peg into a round hole, and sometimes because the layers of classes, interfaces, mapping, delegation, configuration and so forth are already so complex that pushing more in at one end just results in different things dropping out at the other end.
Some programming languages acknowledge this by helping us implement this boilerplate control logic as procedural abstractions instead, providing constructs to take the code that really matters, and pass it to methods which execute it wrapped safely inside our control logic.
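For example, the flag-setting rule above could be captured once as a control abstraction. Here's a minimal sketch in the BGGA closure style used elsewhere on this blog, where the adjusting flag, the model and the updatingModel() method are all invented purely for illustration:

void updatingModel({=>void} body) {
    adjusting = true;           // the invariant we keep forgetting to maintain
    try {
        body.invoke();          // the code that really matters
    }
    finally {
        adjusting = false;
    }
}

// at the call site, the boilerplate disappears:
updatingModel({=>
    model.setData(newData);
});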
I like programming GUIs once in a while - building something that works really slickly can be great fun.
I could really make use of closures and control abstractions right now however. Boilerplate is no fun.
Wednesday, February 6, 2008
Say it again SAM
CICE is designed to make it easier to create instances of SAM types - interfaces and abstract classes which have a Single Abstract Method. The proposal itself also contains the following statement:
At the same time as this facility is added to the language, it would make sense to add a few more single method interface types, such as Predicate, Function, and Builder.
It's no coincidence that the jsr166y.forkjoin framework from Doug Lea and the rest of the jsr166 group is defining a number of interfaces like these for that framework's own needs - in fact the commonality is explicitly acknowledged in the documentation:
Preliminary release note: Some of the declarations in this class are likely to be moved elsewhere in the JDK libraries upon actual release, and most likely will not all be nested in the same class.
I've just uploaded a new build of the CICE/ARM prototype, which includes a number of interfaces derived from those currently defined by jsr166y.
The differences between the interfaces in the prototype, and those provided by jsr166y are:
- I've put them in a new package - java.util.ops
- It includes versions for all primitive types
- All Op types come in two forms - one which can throw checked exceptions, and one which cannot (see the sketch after this list)
- Op types no longer have anything useful to say in their javadoc
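To make the 'two forms' point concrete, a pair of such interfaces might look something like this (hypothetical names for illustration, not the actual java.util.ops declarations):

public interface LongReducer {
    long op(long a, long b);
}

public interface CheckedLongReducer<X extends Throwable> {
    long op(long a, long b) throws X;
}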
All of these changes are possible consequences of removing the interfaces from the context of a specific framework, and in particular, a framework for which only a limited number of these interfaces are useful.
Unfortunately it adds up to a massive increase in the number of interfaces defined, from 132 in jsr166y.forkjoin.Ops, to 1828 in java.util.ops, and it's possible to define even more variations. Ouch :( Kingdom of the Nouns indeed...
Still, it's a straw-man, and no doubt it can be improved. Some thoughts I've had on this are:
- Placing them all in java.util.ops means Comparator is defined in two different packages, so it's not ideal.
- It could be argued that providing versions for all primitive types is overkill. However, without knowing the requirements of future libraries which may make use of them, it's difficult to know which ones can be omitted. Type promotion helps for the parameter types, but it's no use with return types.
- Is it useful providing checked exception versions of all the Op types? Would it be more appropriate to do this for Procedure and friends?
Stephen Colebourne recently compared writing the same snippet of code in BGGA, FCM and CICE, and noted that a CICE can be rather more verbose due to the repetition of type names. His example used an interface called PairMatcher, defined as follows:
public interface PairMatcher<T,U> {
    boolean matches(T arg1, U arg2);
}
Creating an instance of this using earlier versions of the prototype would look something like this:
PairMatcher<String,Integer> pm = PairMatcher<String,Integer>(String str, Integer val) {
    return (str != null && str.length() > 0 && val != null && val > 10);
};
However, it has been pointed out that the parameter types are not really necessary, since there is only one method which a CICE can legally implement. So the latest version of the prototype allows you to do just that:
PairMatcher<String,Integer> pm = PairMatcher<String,Integer>(str, val) {
    return (str != null && str.length() > 0 && val != null && val > 10);
};
I'm not a big fan of that syntax personally - I'd rather lose the type arguments - but it's an option.
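For what it's worth, dropping the type arguments instead might look something like this (purely hypothetical syntax, not supported by the prototype):

PairMatcher<String,Integer> pm = PairMatcher(str, val) {
    return (str != null && str.length() > 0 && val != null && val > 10);
};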
Thoughts very welcome, as usual.
Thursday, January 17, 2008
Exception handling with ARM blocks
One of the big questions in my mind around ARM blocks is exactly how they should behave when exceptions are thrown while disposing of managed resources. It seems likely that we'd want some flexibility in this - there are probably cases where we'd like the ARM block to suppress these exceptions, but others where that would be inappropriate.
The initial version of the prototype only supported the do() { } form of ARM block, and forced the programmer to catch any checked exceptions thrown by the resources' close() methods, which I've found generally leads to wrapping the do() block in a try { ... } catch (IOException e) { ... }, or adding throws IOException to the enclosing method's signature, neither of which is necessarily useful.
Build 03 changes the default semantics of the do() block so that it no longer throws exceptions raised by calling the close() method, and adds support for a try() form of ARM block which can be used when the programmer does wish to explicitly handle such exceptions.
So in the following example using Closeable resource 'fooFactory', there's no longer any need to provide explicit handling of IOExceptions:
do (fooFactory) {
    System.out.println("innocuous");
}
whilst this code forces us to catch exceptions of type FooException, which is a checked exception:
do (fooFactory) {
    Foo foo = fooFactory.createFoo();
    if (foo == null)
        throw new FooException("no foo :(");
}
and the new try() block allows us to do that more neatly:
try (fooFactory) {
    Foo foo = fooFactory.createFoo();
    if (foo == null)
        throw new FooException("no foo :(");
}
catch (FooException e) {
    // something went wrong in the block
}
catch (IOException e) {
    // something went wrong disposing of fooFactory
}
A further option for the try() block would be to suppress any exceptions thrown during automatic disposal which are not explicitly caught, allowing us to choose which ones we wish to handle. This doesn't seem quite so nice if the main body of the try block also throws IOExceptions though.
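Under that option, the earlier example could drop its IOException clause entirely - a sketch of how it might then read:

try (fooFactory) {
    Foo foo = fooFactory.createFoo();
    if (foo == null)
        throw new FooException("no foo :(");
}
catch (FooException e) {
    // the only exception we choose to handle; any IOException thrown
    // while disposing of fooFactory would be silently suppressed
}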
Thoughts welcome.