Code Cop: Word Wrap

Showing posts with label Word Wrap. Show all posts

16 August 2016

Absolute Priority Premise, an Example

Transformation Priority Premise
I would like to start this article with Uncle Bob Martin's Transformation Priority Premise - TPP for short - which he defined in 2011. With the premise Uncle Bob gave an algorithm how to find the next test when doing Test Driven Development. During (classic) TDD we use transformations to change the code to get from red to green. For example when starting with a new method the very first test is usually red because the empty or generated method returns null. We fake the first test by changing that null to return a constant value, e.g. return 5. This transformation is called nil->constant and has a high priority. A later test might force us to add a conditional to enable another fake. This transformation is called unconditional->if and has a medium priority. Replacing the value of a variable, i.e. variable->assignment has a low priority.

According to Uncle Bob, these transformations have a preferred ordering. We should prefer higher priority transformations to pass tests and chose tests in a way that they can be passed with higher priority transformations. When an implementation seems to require a low priority transformation, we could backtrack to see if there is another test to pass first which does not need that transformation. The theme behind the TPP is that as the tests get more specific, the code gets more generic. If you want to know more about TPP, see Uncle Bob's cartoon follow-up on the TPP and Sorting and his talk on the Transformation Priority Premise.

Absolute Priority Premise
In 2012 Micah Martin set out to define some heuristics to compare code objectively and proof that some code is better than some other code. Building on the TPP's priorities he defined the Absolute Priority Premise - APP for short. The APP knows six components, i.e. basic building blocks of code, and assigns them a mass. The building blocks and their weights were

constant, a value in code has mass of 1.
binding, a name or variable has a mass of 1, too.
invocation, calling a method or function - mass 2.
conditional, any form of if, switch or case - mass 4.
loop, for or while loops - mass 5.
assignment, replacing the value of a variable - mass 6.

More complex building blocks have a higher mass. The total mass of a piece of code is the sum of the mass of its respective components. A lower value is better. The best set of specific values of mass are unknown and I will use Micah's weights. If you chose to count different, please let us discuss!

A detailed explanation of the Absolute Priority Premise is given in the two presentations of 8th Light University (8LU) - Part One and Part Two. See also Micah's Coin Changer Kata as a complete example of applying the premise.

Measuring the Mass of Code
I like Micah's idea to measure the mass of code. It might not be a direct indication of readability but simpler code is always better. I wrote six different versions of the Word Wrap kata and was wondering which one would be considered the "best". I should have calculated the mass of these algorithms manually, but I followed Terence Parr's advice, to avoid working by hand five days what I can spend five years of my life automating. I had to create a tool to calculate the mass of Java code - which of course took me much longer, especially as I verified the mass of each algorithm manually to make sure my code worked as expected.

Absolute Priority Counter
I created abpricou, the ABsolute PRIority COUnter. It parses Java source and collects its mass as defined by the Absolute Priority Premise. It is written in Python, because it was Python month when I started working on it. It uses the Python Java Parser plyj. plyj offers a Visitor API for parser events,

class CountingVisitor(m.Visitor):
    def __init__(self):
        super(CountingVisitor, self).__init__()
        ...

Constants and bindings are staight forward, e.g.

    def visit_Literal(self, literal):
        return self._a_constant_value()

    def visit_FormalParameter(self, parameter):
        return self._a_name()

Literals in the code are counted as constants and parameters are only names for values, thus bindings. A method or function counts as a constant for the code it represents and a binding for its name.

    def visit_MethodDeclaration(self, declaration):
        self._a_name()
        return self._code_is_a_constant()

Invocations, conditionals and loops are the same. For example

    def visit_MethodInvocation(self, invocation):
        return self._an_invocation()

    def visit_IfThenElse(self, conditional):
        return self._a_conditional()

    def visit_While(self, loop):
        return self._a_loop()

The only interesting case is the assignment. Not every assignment in Java is counted as an assignment, only re-assignments which modify values are counted. final fields or local variables are just names for expressions similar to parameters.

    def visit_Assignment(self, assignment):
        if ... :
            # code to ignore
            # final field is assigned in constructor
            # final local is assigned in block
        else:
            return self._an_assignment()

Get the tarball and installable egg of the Absolute Priority Counter.

Let's see some code: Word Wrap Kata
As I said before, I developed the counter to calculate the mass of my different implementations of Word Wrap. I was interested in the mass of the algorithm and skipped constructs like class definitions. (I suppose a class definition is a constant for the code and a binding for the name.) I also ignored Annotations, Enumerations and Generics. The first algorithm uses recursion to loop over the blanks to find the end of each line. Its code is

   final char BLANK = ' ';
   final char NEWLINE = '\n';

   String wrapRecursive(String line, int maxLineLen) {
      if (line.length() <= maxLineLen) {
         return line;
      }

      int indexOfBlank = line.lastIndexOf(BLANK, maxLineLen);
      int split;
      int offset;
      if (indexOfBlank > -1) {
         split = indexOfBlank;
         offset = 1;
      } else {
         split = maxLineLen;
         offset = 0;
      }
      return line.substring(0, split) + NEWLINE
           + wrap(line.substring(split + offset), maxLineLen);
  }

including the components

| Component   | Mass  | Count |
| constant    |   1   |   7   |
| binding     |   1   |   8   |
| invocation  |   3   |  10   |
| conditional |   4   |   2   |
| loop        |   5   |   0   |
| assignment  |   6   |   0   |

resulting in a total mass of 53. Further implementations have the following masses:

The tail recursive solution has neither loops nor assignments similar to the recursive one and has a total mass of 71. (The source of this and all further variants is given in Word Wrap Kata Variants.)
The looping variant contains a loop and an assignment resulting in a mass of 68.
The optimised loop which avoids temporary String objects, contains a loop and an assignment and some more invocations. Its mass is 80.
The loop using a buffer saves even more heap allocations and needs more invocations summing up to 105.
The solution using Regular Expressions as shown in the original article has a mass of 55. Merging the parts of the expression into a single expression would save five invocations, but makes the expression less readable. I argue that the Regular Expression itself has some weight, as it contains one conditional ("|" in line 3) and two loops ("(.{1," + maxLineLen + "})" in lines 1 and 4). So a fair weight is 69.
The functional version using flatMap and reduce is quite verbose due the lack of inferred pair types in Java. As plyj does not support Java 8 I was unable to measure its mass.

First Conclusion on Algorithms
The most basic, recursive version of Word Wrap has the least weight. What does it mean? Is it the best version? It is the most compact version without playing Code golf, at least in Java. It does not mutate any variables but it puts the highest load on the Garbage Collector. All the discussed algorithms have different memory and run time performance characteristics. I had hoped for a clear answer what the best version would be and I am not seeing that. I challenge you to leave a comment with a version of Word Wrap with a smaller weight. Would it be considered better code?

The mass of the basic loop version seems too much compared to the recursive version. The current weights favour functional programming by putting a penalty on loops and assignments. Mutating a local variable has a smaller weight than mutating the state of an object. The looping version uses StringBuilder instead of plain String + which needs two more invocations for its construction. Then its components are

| Component        | Mass  | Count |
| constant         |   1   |   8   |
| binding          |   1   |  10   |
| local assignment |   1   |   3   | like a new local binding
| invocation       |   3   |  10   |
| conditional      |   4   |   1   |
| loop             |   5   |   1   |
| assignment       |   6   |   0   |

and its mass is 60 which is a bit more code than the recursive solution.

More Structure
The different wrap() functions above are just algorithms. They are not factored into parts that might change at different speeds over time and definitely violate the SRP. They also break the OCP as it is impossible to change the strategy of splitting without touching the logic of collecting. To address this I wrote another recursive version following Tell, don't ask,

interface Page {
   void renderLine(String lineOfProperLength);
}

class Wrapper {

   final char BLANK = ' ';

   final Page page;
   final int maxLineLen;

   Wrapper(Page page, int maxLineLen) {
      this.page = page;
      this.maxLineLen = maxLineLen;
   }

   void wrap(String line) {
      if (line.length() <= maxLineLen) {
         page.renderLine(line);
         return;
      }

      int indexOfBlank = line.lastIndexOf(BLANK, maxLineLen);
      int split;
      int offset;
      if (indexOfBlank > -1) {
         split = indexOfBlank;
         offset = 1;
      } else {
         split = maxLineLen;
         offset = 0;
      }

      page.renderLine(line.substring(0, split));
      wrap(line.substring(split + offset));
   }
}

which is very similar to the tail recursive solution above, with a total weight of 56. The number is smaller because it does not contain the necessary implementation of Page.renderLine(). The shown code does less, which makes it difficult to compare. On the other hand Page.renderLine() might be used by other parts of the code as well, so it is not strictly a part of Word Wrap.

Moar Structure!1!!
Later I created an extremely factored and decomposed implementation of Word Wrap that separated concepts of rendering, splitting, accumulating lines and hyphenation rules which should fulfil both SRP and OCP. Ignoring the HyphenationRule because that is not covered by the other solutions, its mass is 167 due to many method invocations and parameter names:

| Component   | Mass  | Count |
| constant    |   1   |  17   |
| binding     |   1   |  33   |
| invocation  |   3   |  30   |
| conditional |   4   |   4   |
| loop        |   5   |   1   |
| assignment  |   6   |   1   |

There is a loop and an assignment, at its core this implementation is a looping one. Using the recursive solution I should be able to get rid of the loop and the assignment.

Conclusion
The Absolute Priority Premise counts the different components of a program and sums their weights. If the weights are correct than the program with the smallest mass would be considered the "best" program. This is not true for algorithms like the Word Wrap because the APP ignores features like memory usage or performance optimisation. For general purpose code the validity of the APP is unclear. It just measures the number of code constructs, favouring more compact, functional code. On the other hand the four rules of simple design encourage us to introduce explaining variables to reveal the intent of the code which is more important than fewer elements.

16 July 2013

Word Wrap Kata revisited

Last year in November, my old post about Word Wrap Kata variants suddenly got a lot of visits. Claus told me that well known German blogger Ralf Westphal had linked to it from one of his own articles about Test Driven Development. By using examples from Coding Dojos he explained that using Test Driven Development alone is not enough to get clean code. During the Dojo the participants focused only on the red and green bars and other concerns like design or code readability were neglected. He used my code as examples of solutions got by such a behaviour. While it made me unhappy to serve as negative example, I had to agree with Ralf. When I performed the katas two years ago I explored different algorithms and did not use them to practice TDD. My solutions were technical, instead of reflecting the problem domain.

When I attended a Coding Dojo afterwards I paid close attention and was sad to see he that all Ralf's concerns came true. The group did not even understand the requirements completely, when people already started proposing language features and libraries which would solve the problem quickly. Solve which problem quickly? We should think about the problem first and then create code driven by our tests that is clean, readable and open for future change. Ralf had some ideas how to improve these issues behind programming. Obviously code needs to have many qualities beyond its simple correctness.

So I tried again. I got help from my friend Thomas Sundberg and we worked five remote pairing sessions on the Word Wrap code kata. We performed the kata in the "usual" way, without much thought up front and being driven by our tests. Similar to a Coderetreat, we chose additional constraints: We focused on business related names, SRP and Tell, don't ask. As a kata is a learning exercise, we took everything to the extreme. We literally spent hours discussing if a particular name was reflecting the problem domain, renaming it three times or more until we were satisfied. And we spent at least six pomodoros entirely on refactoring and cleaning up. Probably we could still improve it but we grew tired of the exercise and switched to another kata.

When I see the code now, it looks a bit weird, likely because taking "Tell, don't ask" to the extreme means never asking for any state. So the first thing we need is a class that receives the wrapped output, as we cannot ask for it. Word Wrap is an algorithm inside a word processor when a paragraph of text, which does not contain any newlines, is rendered to a page where it needs to be broken into lines of proper length:

interface Page {
    void renderLine(String lineOfProperLength);
}

In breaking the paragraph into lines of proper length, we see several responsibilities: splitting the paragraph,

interface ParagraphSplitter {
    void wrap(String paragraph);
}

which breaks the stream of text down into words, accumulating the words into lines of proper length,

interface LineAccumulator {
    void addAndHyphenateIfNeeded(String word);
    void addWithoutHyphenation(String part);
    void addCarriageReturn();
}

and maybe a hyphenation rule to determine if a word which is too long can be broken into shorter pieces,

interface HyphenationRule {
    void doHyphenate(String word, LineAccumulator lineAccumulator);
}

We did not start with this design but arrived at it when removing duplication and multiple responsibilities mercilessly ;-) The obvious implementation of ParagraphSplitter is to split on each blank, e.g.

class SeparateWordsOnBlanks implements ParagraphSplitter {

    private final LineAccumulator accumulator;

    // constructor omitted

    public void wrap(String paragraph) {
        for (String word : paragraph.split(BLANK)) {
            accumulator.addAndHyphenateIfNeeded(word);
        }
        accumulator.addCarriageReturn();
    }
}

which is not exciting at all. The heavy lifting is done by the accumulator which checks if adding the current word would exceed the maximum line length, invokes the hyphenation and adds blanks where needed. Whenever a line is complete it is rendered to the page.

class LineLengthAccumulator implements LineAccumulator {

    private final Page page;
    private final HyphenationRule hyphenation;
    private final int maximumLineLength;
    private StringBuilder currentLine = new StringBuilder();

    // constructor omitted

    public void addAndHyphenateIfNeeded(String word) {
        if (exceedingMaximumLineLength(SPACE, word)) {
            hyphenation.doHyphenate(word, this);
            return;
        }
        insertSpaceIfNeeded();
        appendToCurrentLine(word);
    }

    public void addWithoutHyphenation(String part) {
        if (exceedingMaximumLineLength(EMPTY, part)) {
            addCarriageReturn();
        }
        appendToCurrentLine(part);
    }

    private boolean exceedingMaximumLineLength(String separator, String word) {
        return currentLine.length() + separator.length() + word.length() > maximumLineLength;
    }

    public void addCarriageReturn() {
        if (currentLineHasWords()) {
            renderCurrentLine();
            lineFeed();
        }
    }

    // some private methods omitted

}

There are a bunch of unit tests using a NoneHyphenationRule or different anonymous mock rules to drive the functionality of the LineLengthAccumulator. For example

@Test
public void shouldNotWrap() {
    Page mockOutput = mock(Page.class);

    LineAccumulator accumulator = new LineLengthAccumulator(mockOutput, 78);
    accumulator.addAndHyphenateIfNeeded("This");
    accumulator.addAndHyphenateIfNeeded("is");
    accumulator.addCarriageReturn();

    verify(mockOutput).renderLine("This is");
    verify(mockOutput, times(1)).renderLine(anyString());
}

The HyphenationRule needs to know its accumulator and the other way round, which creates a conceptual cycle. Additionally we need a second method addWithoutHyphenation in the LineAccumulator to avoid endless loops during hyphenation. This is the result of following "Tell, don't ask" to the letter. Maybe HyphenationRule would be better off just returning the hyphenated word.

class SplitOnCamelCase implements HyphenationRule {

    public void doHyphenate(String word, LineAccumulator lineAccumulator) {
        if (foundAnUpperCaseLetterIn(word)) {
            String remainingWords = splitFirstSyllableFrom(word, lineAccumulator);
            doHyphenate(remainingWords, lineAccumulator);
        } else {
            lineAccumulator.addWithoutHyphenation(word);
        }
    }

    // private methods omitted
}

To unit test the hyphenation rules, a mock accumulator is used to verify the proper syllables.

@Test
public void shouldHyphenateWords() {
    LineAccumulator mockAccumulator = mock(LineAccumulator.class);
    HyphenationRule strategy = new SplitOnCamelCase();

    strategy.doHyphenate("ShortWord", mockAccumulator);

    InOrder inOrder = inOrder(mockAccumulator);
    inOrder.verify(mockAccumulator).addWithoutHyphenation("Short");
    inOrder.verify(mockAccumulator).addWithoutHyphenation("Word");
    verify(mockAccumulator, times(2)).addWithoutHyphenation(anyString());
}

I am pleased with our result, although it took us too long to complete it, which is a sign how much practice we still need ;-) See its full source code here. You might still find one or another thing that is not optimal and never will be but in comparison with my first version this Word Wrap expresses its problem domain more clearly and is easy to understand even without digging down into all the details.

12 August 2011

Word Wrap Kata Variants

It's time for some exercise - time for a code kata. I like the simple ones because they don't take much time and still provide a certain amount of training. After having done the Prime Factors Kata more than 20 times using Java, Ruby, C#, Turbo Pascal, BASIC and even Forth, I feel like trying a new one for a change. I choose the Word Wrap Kata, also by Uncle Bob. Unlike Prime Factors, Word Wrap seems to be less popular, there are only a few experiences with it published, e.g. Word Wrap using Python.

Recursive
The kata's task is to write a function that, like a word processor, breaks the line by replacing the last space in a line with a newline. The most straight forward solution to this is IMHO the recursive one. I only need to consider the first blank or forced break and call myself with the remaining text.

public String wrap(String line, int maxLineLen) {
   if (line.length() <= maxLineLen) {
      return line;
   }
   int indexOfBlank = line.lastIndexOf(BLANK, maxLineLen);
   int split;
   int offset;
   if (indexOfBlank > -1) {
      split = indexOfBlank;
      offset = 1;
   } else {
      split = maxLineLen;
      offset = 0;
   }
   return line.substring(0, split) + NEWLINE +
      wrap(line.substring(split + offset), maxLineLen);
}

Usually kata is not about the final solution, but about the process to get there. Still I want to compare different solutions, so let's analyse this one. If the line needs to be split n times (into n+1 shorter lines) then this solution creates 3*n String objects and additional n StringBuilders for string concatenation. ... 4*n objects are created.

Tail Recursive
To be tail recursive a function's last statement must be the recursive call.

public String wrap(String line, int maxLineLen) {
   StringBuilder accumulator = new StringBuilder();
   wrap(line, maxLineLen, accumulator);
   return accumulator.toString();
}
private void wrap(String remainingLine, int maxLineLen, StringBuilder accumulator) {
   if (remainingLine.length() <= maxLineLen) {
      accumulator.append(remainingLine);
      return;
   }
   int indexOfBlank = remainingLine.lastIndexOf(BLANK, maxLineLen);
   int split;
   int offset;
   if (indexOfBlank > -1) {
      split = indexOfBlank;
      offset = 1;
   } else {
      split = maxLineLen;
      offset = 0;
   }
   accumulator.append(remainingLine.substring(0, split));
   accumulator.append(NEWLINE);
   wrap(remainingLine.substring(split + offset), maxLineLen, accumulator);
}

This solution creates the new String in the very end. It needs 2 Strings per new line and only one StringBuilder and a final String to return it. ... 2*n+2 objects are created.

Loop
A tail recursive function can be rewritten to reuse the stack frame transforming it into a plain loop. As Java does not support that optimisation, I do it by hand.

public String wrap(String line, int maxLineLen) {
   StringBuilder accumulator = new StringBuilder();
   String remainingLine = line;
   while (remainingLine.length() > maxLineLen) {
      int indexOfBlank = remainingLine.lastIndexOf(BLANK, maxLineLen);
      int split;
      int offset;
      if (indexOfBlank > -1) {
         split = indexOfBlank;
         offset = 1;
      } else {
         split = maxLineLen;
         offset = 0;
      }
      accumulator.append(remainingLine.substring(0, split));
      accumulator.append(NEWLINE);
      remainingLine = remainingLine.substring(split + offset);
   }
   accumulator.append(remainingLine);
   return accumulator.toString();
}

The loopy :-) solution creates the same number of objects as the tail recursive one, but has reduced call overhead. ... Still 2*n+2 objects are created.

Optimised Loop
Let's optimise away the splitting of the remaining line because it gets split again in the next call, so all these Strings are only temporarily used.

public String wrap(String line, int maxLineLen) {
   StringBuilder accumulator = new StringBuilder();
   int pos = 0;
   while (pos + maxLineLen < line.length()) {
      int indexOfBlank = line.lastIndexOf(BLANK, pos + maxLineLen);
      int split;
      int offset;
      if (indexOfBlank > pos - 1) {
         split = indexOfBlank;
         offset = 1;
      } else {
         split = pos + maxLineLen;
         offset = 0;
      }
      accumulator.append(line.substring(pos, split));
      accumulator.append(NEWLINE);
      pos = split + offset;
   }
   accumulator.append(line.substring(pos));
   return accumulator.toString();
}

Now only one String is created per new line, one for the last remaining part and one after the final concatenation. ... n+3 objects are created.

Using a Buffer
If I could get rid of half of the String splitting, why not drop the other half too?

public String wrap(String line, int maxLineLen) {
   StringBuilder accumulator = new StringBuilder();
   accumulator.append(line);
   int pos = 0;
   int inserted = 0;
   while (pos + maxLineLen < line.length()) {
      int indexOfBlank = line.lastIndexOf(BLANK, pos + maxLineLen);
      if (indexOfBlank > pos - 1) {
         accumulator.setCharAt(inserted + indexOfBlank, NEWLINE);
         pos = indexOfBlank + 1;
      } else {
         accumulator.insert(inserted + pos + maxLineLen, NEWLINE);
         pos = pos + maxLineLen;
         inserted++;
      }
   }
   return accumulator.toString();
}

Only one StringBuilder and one String are created. ... 2 objects are created. This is definitely the most garbage collector friendly solution because it creates only one temporary object, the StringBuilder.

Copying Characters
If there are blanks in line then all characters get copied once in append() and all solutions behave similar. (The method substring() does not copy characters, it just creates a new String with different pointers in the character array of the original String.) But if there are no blanks in line then the last solution copies all characters after the split point around one or more times. Copying large numbers of characters might be slower than allocating an object, especially as the JVM/Hotspot is optimised for short lived, small objects.

Resizing the StringBuilder
For all solutions using an explicit StringBuilder another optimisation is to avoid automatic resizing. The size of accumulator must be large enough to contain all additional newlines. If a line has a blank then it's replaced, so no new character is added. Only when a line contains no blank, then a newline is inserted. This can happen up to lineLen / maxLineLen times. Flooring (rounding down in integer division) is the right thing because after the last part, e.g. remainingLine, nothing is added.

...
   StringBuilder accumulator =
      new StringBuilder(calcMaxSize(line.length(), maxLineLen));
...
private int calcMaxSize(int lineLen, int maxLineLen) {
   int maxCharsAdded = lineLen / maxLineLen;
   return lineLen + maxCharsAdded;
}

Note that all these optimisations are highly theoretical. In a typical web or database application it doesn't matter at all if a few temporary objects are created or not. Remember that "Early optimisation is the root of all evil" (Donald Knuth). I would stick with the first solution because it's the shortest and easy to understand.

Bonus Round

public String wrap(String line, int maxLineLen) {
   return line.replaceAll("([^ ]{" + maxLineLen + "})" + // 1
      "(?=[^ ])" +                                       // 2
      "|" +                                              // 3
      "(.{1," + maxLineLen + "})" +                      // 4
      " ",                                               // 5
      "$1$2" + NEWLINE);
}

This solution is even shorter, just one line using a Regular Expression. It replaces the split points with a newline or adds one. The pattern matches areas of line which (1) contain exact maxLineLen characters that are not blanks, (2) which must be followed by a character that's not a blank (and which is not consumed) (3) or it matches areas of (4) one up to maxLineLen characters (5) which are followed by a blank. The match is replaced with the first (1) and second (3) groups together with a newline. The single blank (5) is not a member of any group and is dropped.

(Get the source.)

Code Cop

16 August 2016

Absolute Priority Premise, an Example

16 July 2013

Word Wrap Kata revisited

12 August 2011

Word Wrap Kata Variants

About Me

In Public

More Creations

Content Series

Community

Labels

Archive