Code Cop: Assembly

Showing posts with label Assembly. Show all posts

20 March 2021

11 Years of Prime Factors Kata

In this post I want to celebrate the Prime Factors Code Kata. Prime Factors is a small coding exercise first used by Robert C. Martin in 2005. It is my favourite code kata and it has been almost nine years since I last wrote about it - time to change that. Weird enough, the first code kata I ever worked on - outside of university assignments that turned out to be katas later - was Roman Numerals in 2007. The first time I did the Prime Factors was during Christmas holidays 2009 in Java and Ruby:

import java.util.ArrayList;
import java.util.List;

public class PrimeFactors {
  public static List<Integer> generate(int n) {
    List<Integer> primes = new ArrayList<Integer>();

    for (int candidate = 2; candidate <= n; candidate++) {
      for (; n % candidate == 0; n /= candidate) {
        primes.add(candidate);
      }
    }

    return primes;
  }
}

Now the Java code is exactly the code Robert Martin showed, I was following his example. The Ruby version from back then looks pretty similar, just using while instead of for.

module PrimeFactors
  def generate(n)
    prime_factors = []

    candidate = 2
    while n > 1 do
      while n % candidate == 0 do
        prime_factors << candidate
        n /= candidate
      end
      candidate += 1
      candidate = n if candidate > Math.sqrt(n) # performance fix
    end

    prime_factors
  end
end

If you are wondering how I am still able to find the code, I organise my code katas to allow lookup and comparison. Since then I did the kata 141 times and it has many uses.

Learn a language
Prime Factors is one of the first pieces of code I write - Test Driven of course - when revisiting old languages, like Commodore BASIC or looking at a new language, like Forth, using Gforth 0.7:

: prime_factors ( n -- n1 n2 n3 n4 )
  DUP 1 > IF           \ test for ?DO
    DUP 2 ?DO
      BEGIN
        DUP I MOD 0 =  \ test candidate I
      WHILE
        I SWAP I /     \ add candidate, reduce number
      REPEAT
    LOOP
  THEN
  DUP 1 = IF DROP THEN ;

T{ 1 prime_factors -> }T
T{ 2 prime_factors -> 2 }T
T{ 3 prime_factors -> 3 }T
T{ 4 prime_factors -> 2 2 }T
T{ 6 prime_factors -> 2 3 }T
T{ 8 prime_factors -> 2 2 2 }T
T{ 9 prime_factors -> 3 3 }T

Gforth came with a modified testing framework based on John Hayes S1I's tester.fs, defining the functions T{, -> and }T for testing. Note that the given function prime_factors is not realistic as the number of returned arguments is not known by the caller.

When I had a look at Scala, of course I did Prime Factors:

import java.lang.Math.sqrt

object PrimeFactors {
  def generate(number: Int): List[Int] = {

    def fold(current: Pair[Int, List[Int]], candidate: Int): Pair[Int, List[Int]] = {
      if (current._1 % candidate == 0)
        fold((current._1 / candidate, candidate :: current._2), candidate)
      else
        current
    }

    val (remainder, factors) =
      (2 to sqrt(number).intValue).foldLeft((number, List[Int]()))(fold)

    if (remainder > 1)
      (remainder :: factors).reverse
    else
      factors.reverse
  }
}

This is crazy. The code creates a sequence of all candidate primes and folds it starting from left by dividing by the candidate recursively, appending to the begin of the list, which is cheap. Because of that the list is reversed at the end. I have no idea why I created this, probably I was playing around with foldLeft. This is not a good example, please do not copy it. After all these years, the Forth solution seems easier to grasp. ;-)

So which languages are missing? PowerShell looks much like my PHP (shown below) and my Python Prime Factors looks similar too, just with Python specific range(2, number + 1) and //= inside. And of course JavaScript is missing:

PrimeFactors = function() {
  this.factors = [];
};

PrimeFactors.prototype.generate = function(num) {
  var candidate;
  for (candidate = 2; candidate <= num; candidate += 1) {
    num = this.tryCandidate(num, candidate);
  }
  return this.factors;
};

PrimeFactors.prototype.tryCandidate = function(num, candidate) {
  while (num % candidate === 0) {
    num = this.reduceByFactor(num, candidate);
  }
  return num;
};

PrimeFactors.prototype.reduceByFactor = function(num, factor) {
  this.factors.push(factor);
  return num / factor;
};

Isn't that lovely? Again this is not good code, please do not copy it. At least I showed some creativity using prototype functions.

Learn a testing framework
Using TDD to write some known code is a perfect way to learn more about a testing framework. So I explored XSLTunit using Prime Factors in XSLT or NUnit in C#:

using NUnit.Framework;

[TestFixture]
public class PrimeFactorsTest
{
  [TestCase(new int[0], 1)]
  [TestCase(new int[] { 2 }, 2)]
  [TestCase(new int[] { 3 }, 3)]
  [TestCase(new int[] { 2, 2 }, 4)]
  [TestCase(new int[] { 2, 2, 2 }, 8)]
  [TestCase(new int[] { 3, 3 }, 9)]
  public void TestFactors(int[] expected, int number)
  {
    CollectionAssert.AreEqual(expected, PrimeFactors.Generate(number));
  }

  [Test]
  [Timeout(100)]
  public void TestLarge()
  {
    CollectionAssert.AreEqual(new int[] { int.MaxValue },
                              PrimeFactors.Generate(int.MaxValue));
  }
}

Test your own testing framework
Sometimes, when I need to create my own unit testing framework, e.g. TPUnit for old Turbo Pascal, assert-scm (Scheme R5RS) or ASM Unit for Windows Assembly, I use Prime Factors as test cases:

_prime_factors:
  mov     esi, 0          ; esi = number of factors
  mov     edi, ebx        ; edi = address of factors
  mov     ecx, eax        ; ecx = current number
  mov     ebx, 1          ; ebx = candidate

  jmp .not_diviseable

.loop_over_candidates:
  inc     ebx             ; next candidate

.break_if_candidate_is_larger_than_square:
; if candidate * candidate <= number then try candidate
  mov     eax, ebx
  mul     ebx
  cmp     eax, ecx
  jbe     .try_candidate

; else number is a (large) prime and we are done
  mov     [edi + esi * register_size], ecx
  add     esi, 1
  jmp     .done

.try_candidate:
; if number % candidate == 0 then add candidate
  mov     eax, ecx
  xor     edx, edx
  div     ebx
  cmp     edx, 0          ; remainder is 0
  jne     .not_diviseable

.is_diviseable:
  mov     [edi + esi * register_size], ebx
                          ; store candidate in factors
  add     esi, 1          ; we found a factor
  mov     ecx, eax        ; number is remainder of division
  jmp     .try_candidate  ; try current candidate again

.not_diviseable:
; if number > 1 then try next candidate
  cmp     ecx, 1
  jne     .loop_over_candidates

.done:
; return number of factors
  mov     eax, esi
  ret

This piece of assembly calcultes the prime factors of the number passed in EAX into in the dword array address EBX.

TDD Demo
In 2012 I started practising Prime Factors as kata performance, minimising the number of keys I pressed. I ran it around 50 times. In the end I used the practice to calm down when I was anxious - it was like mediation. I still have to perform it somewhere, adding music and all... I have used it demoing TDD in uncounted presentations - actually around 40: during my university guest lectures, user group meetings and at my clients. Most demos were in Java and some in PHP,

<?php
class PrimeFactors {
  static function generate($n) {
    $factors = [];
    for ($candidate = 2; $candidate <= $n; $candidate += 1) {
      while ($n % $candidate == 0) {
        $factors[]= $candidate;
        $n /= $candidate;
      }
    }
    return $factors;
  }
}

and a single demo of test driving R code,

primeFactors <- function(number) {
  factors <- vector(mode="integer")

  candidate <- 2
  while (candidate <= sqrt(number)) {
    while (number %% candidate == 0) {
      factors <- c(factors, candidate)
      number <- number / candidate
    }
    candidate = candidate + 1
  }

  if (number > 1) {
    factors <- c(factors, number)
  }

  factors
}

It seems, most programming languages look the same. The last sentence is not true for NATURAL, Cobol's cousin, which is ugly. I will not show it here as it would destroy this lovely post.

Conclusion
By writing this post, I learned that I still need to create Prime Factors in the programming languages Go, Kotlin, OpenOffice Basic, Oracle PL/SQL and of course TypeScript - I could - and I will, it is just a matter of time. So Prime Factors - in fact any small enough code kata - is a great tool for exploring, studying and practising programming languages, testing frameworks, IDE tools and Test Driven Development in general. I will leave you with my latest addition to my collection of Prime Factors, using C99. Have fun!

#include <math.h>

typedef struct {
  unsigned char count;
  unsigned int values[31];
} PrimeFactors;

void PrimeFactors_init(PrimeFactors* factors)
{
  (*factors).count = 0;
}

void PrimeFactors_add(PrimeFactors* factors, const unsigned int factor)
{
  int count = (*factors).count;
  (*factors).values[count] = factor;
  (*factors).count = count + 1;
}

void generate(const unsigned int number, PrimeFactors* factors)
{
  PrimeFactors_init(factors);

  unsigned int remaining = number;
  for (unsigned int candidate = 2; candidate <= sqrtl(remaining); candidate += 1) {
    while (remaining % candidate == 0) {
      PrimeFactors_add(factors, candidate);
      remaining /= candidate;
    }
  }

  if (remaining > 1) {
    PrimeFactors_add(factors, remaining);
  }
}

21 June 2020

Conway's Squasher Coroutine

Since some time I am researching coroutines. Wikipedia says that a coroutine is a computer program component that generalises subroutines, allowing execution to be suspended and resumed. [...] Subroutines are special cases of coroutines. When subroutines are invoked, execution begins at the start, and once a subroutine exits, it is finished; an instance of a subroutine only returns once, and does not hold state between invocations. By contrast, coroutines can exit at any time, and execution may later return to these points. [...] A coroutine instance holds state, and varies between invocations. Coroutines are useful for many things. For example the Kotlin programming language lists several uses cases including asynchronous computations, generators and implementing state machines in a sequential way. More and more languages provide them.

The Paper
According to Donald Knuth, Melvin E. Conway coined the term coroutine in 1958 when he applied it to the construction of an assembly program. He first published it 1963 in his paper Design of separable transition-diagram compiler. Conway states that the coroutine idea was concurrently developed by him and by Joel Erdwinn. The paper is about implementing a COBOL compiler, the first two pages introduce the coroutine together with its implementation in assembly for the Burroughs model 220, a late vacuum-tube computer.

Code Example
Conway shows an example, the asterisk squasher subroutine, in both subroutine and coroutine form. The mentioned assembly commands are unknown to me and the flowcharts are not 100% readable, so I decided to reimplement the example in modern assembly to see what is going on. The asterisk squasher is a program which reads cards and writes the string of characters it finds, column 1 of card 1 to column 80 of card 1, then column 1 of card 2, and so on, with the following wrinkle: every time there are adjacent asterisks they will be paired off from the left and each "**" will be replaced by the single character "^". This operation is done with the exponentiation operator of FORTRAN and COBOL.

Asterisk Squasher Subroutine
The problem is that out - the output area of the SQUASHER routine - only holds a single byte, which means the routine can only return and subsequentially write a single character. When looking for the second asterisk, the routine reads a second element and thus needs somehow to return it on the next call unless it was a "*". Conway continues that the simple requirement of compressing asterisks requires the introduction of a switch which bifurcates the subroutine and selects the part to be used at each call depending on the situation at the last call. The reason for the switch is that each call of the subroutine must result in output of exactly one character. The second read character is stored in a (module) global variable t2 until the next call. Here is its flowchart:

Figure 1: Asterisk squasher subroutine with environment

I use my trusted Windows assembly setup for Intel IA-32 assembly. The assembly is in NASM format. First define the data:

card_len equ    80

switch: resd    1
%define ON      dword 1
%define OFF     dword 0

i:      resd    1
card:   resb    card_len

t1:     resd    1
t2:     resd    1

out:    resd    1

Let's warm up with the code for RDCRD, which reads 80 characters of a card. The code strictly follows the flowchart above.

RDCRD:
        mov     eax, [i]
        cmp     eax, card_len
        jne     .exit

        mov     [i], dword 0

        ; Omitted using Windows system library functions
        ; _GetStdHandle@4 and _ReadFile@20 to read 80
        ; characters into [card].

.exit:
        ret

The main method is SQUASHER with the switch variable.

SQUASHER:
        mov     eax, [switch]
        cmp     eax, OFF
        je      .off
.on:
        mov     eax, [t2]
        mov     [out], eax

        mov     [switch], OFF
        jmp     .exit

.off:
        call    RDCRD

        mov     esi, [i]
        xor     eax, eax
        mov     al, [card + esi]
        mov     [t1], eax

        inc     esi
        mov     [i], esi

        mov     eax, [t1]               ; redundant, value still in register
        cmp     eax, '*'
        jne     .not_equal_ast

.equal_ast:
        call    RDCRD

        mov     esi, [i]                ; redundant, value still in register
        xor     eax, eax
        mov     al, [card + esi]
        mov     [t2], eax

        inc     esi
        mov     [i], esi

        mov     eax, [t2]               ; redundant, value still in register
        cmp     eax, '*'
        jne     .not_equal_second_ast

.equal_second_ast:
        mov     [t1], dword '^'
        jmp     .not_equal_ast

.not_equal_second_ast:
        mov     [switch], ON
        jmp     .not_equal_ast

.not_equal_ast:                         ; label 1
        mov     eax, [t1]
        mov     [out], eax

.exit:
        ret

There is nothing special happening in the code above. There are the two main branches .on and .off. To run the whole program on a single test card I need a WRITE and a main method. WRITE iterates a card and prints the result of each call to SQUASHER to standard out.

WRITE:
.loop:
        call    SQUASHER

        ; Omitted using Windows system library functions
        ; _GetStdHandle@4 and _WriteFile@20 to write 1
        ; character from [out].

        mov     eax, [i]
        cmp     eax, card_len
        jne     .loop

        ret

; ----------------------------------------
        global  _main
_main:
        ; set up switch
        mov     [switch], OFF

        ; set up global data
        mov     [i], dword card_len

        call    WRITE

        push    0
        call    _ExitProcess@4

Asterisk Squasher as Coroutine
Conway writes further in his paper: The coroutine approach to the same problem accomplishes the switching job implicitly by use of the subroutine calling sequence. When coroutines A and B are connected so that A sends items to B, B runs for a while until it encounters a read command, which means it needs something from A. Then control is transferred to A until it wants to "write," whereupon control is returned to B at the point where it left off. In this situation A is the SQUASHER and B is the WRITE coroutine. The following flowchart shows SQUASHER when both it and the using program (WRITE) are coroutines.

Figure 2: Asterisk squasher as coroutine

Coroutine in Assembly?
While some architectures provide support for coroutines, according to the Art of Assembly Language 80x86 CPUs do not provide a COCALL instruction. Conway describes the coroutine linkage on the Burroughs 220 as follows: The UNCONDITIONAL BRANCH instruction BUN A works by placing its address A into the P-register (the Instruction Pointer). The STORE P instruction STP B places the contents of P plus one into the address part of the contents of location B. The standard subroutine call is

1:      STP EXIT                        ; store 3 into target address of address EXIT
2:      BUN ENTRANCE                    ; jump to address ENTRANCE
3:

where location EXIT contains a BUN instruction whose address will be altered by the STP instruction in the call. (So this is self-modifying code and the return of the subroutine is then EXIT: BUN 3 or something like that.) A pair of subroutines becomes a pair of coroutines by adding to each an isolated BUN instruction which we can call its router, and by changing the addresses of the STP and BUN calls as follows: when coroutine A calls coroutine B the call is

        STP AROUTER
        BUN BROUTER

That seems a small change for the Burroughs but how could I recreate that?

Coroutine in Assembly!
The BUN is a JMP on Intel. Unfortunately not all instructions are one byte, so using the current IP + 1 as done by STP does not work. I tried to follow the original idea with a short macro. When coroutine A calls coroutine B the call is coro A B,

%macro coro 2
        mov     ebx, %%_cnt             ; basically IP + 1
        mov     [route_%1], ebx         ; store it in target of router A
        jmp     %2                      ; jump to router B
%%_cnt: nop
%endmacro

route_SQUASHER: resd 1
route_WRITE:    resd 1

SQUASHER:
        jmp     [route_SQUASHER]

WRITE:
        jmp     [route_WRITE]

This needs storage for the route of each coroutine. The call of the macro does not look so good as it needs the name of the current coroutine as first argument for the macro to use the right route storage. If each coroutine would be its own module, I could drop that and just use [route]. The router variables [route_*] will be initialised at startup. The code of the two routines SQUASHER and WRITE changes slightly, strictly following the second flowchart above.

; router
SQUASHER:
        jmp     [route_SQUASHER]

; coroutine
SQUASHER_CORO:                          ; label 1
        call    RDCRD

        mov     esi, [i]
        xor     eax, eax
        mov     al, [card + esi]
        mov     [t1], eax

        inc     esi
        mov     [i], esi

        mov     eax, [t1]               ; redundant, value still in register
        cmp     eax, '*'
        jne     .not_equal_ast

.equal_ast:
        call    RDCRD

        mov     esi, [i]                ; redundant, value still in register
        xor     eax, eax
        mov     al, [card + esi]
        mov     [t2], eax

        inc     esi
        mov     [i], esi

        mov     eax, [t2]               ; redundant, value still in register
        cmp     eax, '*'
        jne     .not_equal_second_ast

.equal_second_ast:
        mov     [t1], dword '^'
        jmp     .not_equal_ast

.not_equal_second_ast:
        mov     eax, [t1]
        mov     [out], eax
        coro    SQUASHER, WRITE

        mov     eax, [t2]
        mov     [out], eax
        coro    SQUASHER, WRITE

        jmp     SQUASHER_CORO

.not_equal_ast:                         ; label 2
        mov     eax, [t1]
        mov     [out], eax
        coro    SQUASHER, WRITE

        jmp     SQUASHER_CORO

; ----------------------------------------
; router
WRITE:
        jmp     [route_WRITE]

; coroutine
WRITE_CORO:
.loop:
        coro    WRITE, SQUASHER

        ; Omitted using Windows system library functions
        ; _GetStdHandle@4 and _WriteFile@20 to write 1
        ; character from [out].

        mov     eax, [i]
        cmp     eax, card_len
        jne     .loop

        ret

; ----------------------------------------
        global  _main
_main:
        ; set up coroutine routers
        mov     ebx, SQUASHER_CORO
        mov     [route_SQUASHER], ebx

        mov     ebx, WRITE_CORO
        mov     [route_WRITE], ebx

        ; set up global data
        mov     [i], dword card_len

        call    WRITE

        push    0
        call    _ExitProcess@4

Conslusion
I hope that clarifies how the first coroutine presented in 1963 worked. Re-coding the example was a lot of work and in the end it paid off because everything is clear now, even more so after I wrote this blog post about it. This is the golden triangle of learning: theory (e.g. reading about it), practice (e.g. creating it) and finally teaching (documenting and sharing it).

Full Source Code
After I published this article Dmitry Kandalov pushed me to make the code available on GitHub. And then we went to port it to macOS, so he could play with it as well. To compile and run the code you need GNU make, NASM and a linker depending to your platform. Thanks to Samir Talwar's Smoke, it even comes with tests. Give it a try: Conway's Squasher Coroutine @ GitHub.

7 August 2015

How to Unit-Test Assembly

now i just need a pirate and a waterfall...

After my last trip into Assembly I got the feedback that I should have used TDD, especially as it was supposed to be a code kata. (Thank you Emmanuel Gaillot for reminding me of my duties.) And yes, I should have. But as I said, I could not find any real unit testing support, at least not without resorting to C or C++ unit testing frameworks. On the other hand, creating an xUnit implementation is a great exercise to get to know a language. So I decided to create my own, minimal unit testing framework for Windows IA-32 Assembly using NASM. (The following code uses stdcall call convention because it is used in the Microsoft Win32 API anyway.)

Failure
So what are the most required features for unit testing support? I believe the most important feature is to mark a test as failed. I also want to stop executing this test in case of failure, like JUnit does. This is the fail() method,

%macro fail 0
        log msg_failed, msg_failed_len

        leave
        ret 0 ; test method never has any arguments
%endmacro

msg_failed:     db 'FAILED'
msg_failed_len  equ $ - msg_failed

In fact it is a NASM macro which gets copied into each invocation. For simplicity, it does not support a message argument right now, hence zero macro parameters. It uses the log macro, which prints the given string to Standard Out, just like _printGrid of my Assembly Minesweeper. Then it resets the stack frame (leave) and returns from the current function. I expect all test methods to have no arguments, and fail's code gets copied into each test method, this skips further execution of the test method and returns immediately to the calling code, i.e. the test runner. Adding the count of failed tests in fail is straight forward but I will leave that for later.

Assertion
Next I need assertions to check my expectations, at least an assert_equals,

%macro assert_equals 2
        cmp     %1, %2
        je      .%1_%2_end
        fail
.%1_%2_end:
        call    _show_progress
%endmacro

This has to be a macro as well so fail works correctly as explained above. But as the macro gets expanded into the calling code, the local label .%1_%2_end has to be unique. While different assertions are possible, a specific assertion like assert_equals eax, 2 can only be used once per test method. This is a weird constraint but I do not mind as my tests tend to have a single assertion per test method anyway. (Later I changed the macro to use a macro local label %%_end which removed this problem.)

The function _show_progress just prints a single dot to show the progress during test execution. It needs to preserve all registers because it is called in the middle of my test code.

_show_progress:
        push    eax
        push    ebx
        push    ecx
        push    edx

        push    dot_len
        mov     eax, dot
        push    eax
        call    _print

        pop     edx
        pop     ecx
        pop     ebx
        pop     eax

        ret

dot:    db      '.'
dot_len equ     1

Test Cases
For my test cases I want descriptive names. Using long, regular labels for function names and the assertion macros I write my first test

_should_add_one_and_one_to_be_two:
        create_local_variables 0

        mov     eax, 1
        add     eax, 1
        assert_equals eax, 2

        leave
        ret

The test succeeds and prints a single dot to Standard Out. The create_local_variables macro creates the stack frame and local variables if needed. If a test fails, e.g.

_should_add_one_and_two_to_be_three:
        create_local_variables 0

        mov     eax, 1
        add     eax, 2
        assert_equals eax, 2 ; (6)

        ; not reached
        hlt

        leave
        ret

it prints FAILED and stops execution in line 6.

Test Runner
Finally I want to run all my tests one by one. In this simple case I just call the test methods from the main entry point, followed by printing DONE at the end.

global  _main

_main:

        ; welcome message
        log     msg_hello, msg_hello_end - msg_hello

        call    _should_add_one_and_one_to_be_two
        call    _should_add_one_and_two_to_be_three
        ; (10)

        ; completion message
        log     msg_done, msg_done_end - msg_done

        jmp     _exit

msg_hello:      db 'HELLO asmUnit'
msg_hello_end:

msg_done:       db 'DONE'
msg_done_end:

This gets the job done but I am not happy with it. I need to add new test method invocations by hand in line 10. By seeing the new test fail I cannot forget to add the call of the new test method, but automatic test case discovery would be more convenient.

Before and After
What else does a unit testing framework need? For complex cases, before and after test execution hooks might be necessary. A before macro defines the proper label and the begin_test macro just calls it.

%macro before 0-1 0
        %ifdef ctx_before
                %error "before used more than once"
        %endif

        %define ctx_before, 1
_before_hook:
        create_local_variables %1
%endmacro

%macro begin_test 0-1 0
        create_local_variables %1
        %ifdef ctx_before
                call _before_hook
        %endif
%endmacro

The after and end_test macros work accordingly, just for the test clean-up. The actual test code stays the same, the macro calls just changed.

_should_add_one_and_one_to_be_two:
        begin_test

        mov     eax, 1
        add     eax, 1
        assert_equals eax, 2

        end_test

While this works well, I am unhappy with all the (NASM specific) macros in the code. There remains little real Assembly, but I guess that is the way to go. There are also plenty of missing features but this minimal first version gets me started. I will see how it works out in my next Assembly kata. (The complete source of asmUnit is here.)

7 July 2015

Minesweeper "Near the Metal"

Minesweeper
The Minesweeper code kata is a little programming exercise focusing on a subset of the well-known Minesweeper game that comes with a certain operation system. Its assignment is to write a program which accepts an input of arbitrary mine fields, with mine locations marked, and produces the full mine field with all hint values calculated. For example, given the following 4 times 4 mine field

4 4
*...
....
.*..
....

, the same mine field including the hint numbers would look like

The Challenge
Recently the local functional programming group, the Lambdaheads, announced their Minesweeper Challenge. Being a functional programming group, most of the presented solutions used Haskell, F# and other functional languages. I participated and showed them my highly factored object orientated solution in Java, which surely looked as strange to them as their Haskell code was for me.

After seeing a solution in plain C, one participant joked that someone should use Assembly to show an even more low-level implementation. I agreed, it seemed like a great idea at that time, almost as cool as implementing Game of Life in XSLT. As usual I could not let got of the crazy idea. After my first steps in Windows 32 Assembly I was ready for a more complex task.

What about testing?
Although there are people test driving their Assembly, I could not find real unit testing support. The official way to go is to use C/C++ unit testing frameworks and write tests against the global functions in the generated OBJ files. I did not want to learn C this day and created (sort of) an integration test by feeding some mine field into Minesweeper's Standard Input and checking the output manually. Yes, I could have created a minimalist unit testing framework, which is a great exercise to get to know a new language by itself and I have done so in the past, just to keep the flow of TDD. I will come back to this topic later. (Thank you Emmanuel for reminding me of my duties ;-)

Getting Feedback
I wanted to get as much feedback as possible. Feedback is critical when working in an environment as unforgiving as Assembly. In the beginning my integration test did not help me and I split the implementation of my Minesweeper into four parts: Reading Input -> Creating Hints -> Preparing Output -> Printing Results. I decided to work backwards so I could use the result printing to give me feedback about the other stages.

Printing Results
In the beginning the mine field's grid would be a constant string.

max_dimension equ 99

; temp data
grid:   times max_dimension * max_dimension db 'x'
width:  dd      4
height: dd      4

Starting with Hello World I incrementally added to the output until _printGrid would print grid line by line to Standard Out. Then I added support for mine fields of any dimension, using width and height.

_printGrid:
        ; local variables, starting memory[ebp - 4]
numberOfBytesWritten equ -1 * register_size

        enter_method 1                  ; stack frame (macro)
        push    esi
        push    edi

        ; get handle for stdout
        push    STD_OUTPUT_HANDLE
        call    _GetStdHandle@4
        mov     ebx, eax                ; hStdOut

.loop_all_lines:                        ; for (edi = height; edi > 0; edi--)
        mov     esi, 0                  ; address of current line
        mov     edi, [height]           ; loop counter

.loop:
        ; write the next line
        ; BOOL WriteFile( hstdOut, message, length(message), &bytes, null);
        push    NULL
        lea     edx, [ebp + numberOfBytesWritten]
        push    edx
        push    dword [width]
        lea     eax, [grid + esi]
        push    eax
        push    ebx                     ; hstdOut
        call    _WriteFile@20

        ; write a new-line
        push    NULL
        lea     edx, [ebp + numberOfBytesWritten]
        push    edx
        push    len_cr
        push    cr
        push    ebx
        call    _WriteFile@20

.next:
        add     esi, max_dimension      ; go down to next line
        sub     edi, 1                  ; edi--
        jnz     .loop                   ; edi > 0

        pop     edi                     ; drop stack frame
        pop     esi
        exit_method
        ret

cr:
        db      13, 10                  ; Windows \n\r
len_cr  equ     $ - cr

The code above just loops all the grid's lines in esi and prints them followed by a Windows new-line each.

Creating Hints
Next I made the grid mutable by moving it into the data segment and put a real mine field into it.

section .bss

max_dimension equ 4
last_field_in_grid equ max_dimension * max_dimension - 1

grid:   resb    max_dimension * max_dimension
width:  resd    1
height: resd    1

...

_main:

        ; temp copy
        mov     [width], dword 3        ; 3x3 inside 4x4 total space
        mov     [height], dword 3

        mov     eax, [fake_grid + 0]
        mov     [grid + 0], eax
        ...

...

; test cases for solving the game
fake_grid:
        db      '---0---0---00000'      ; empty
;        db      'x--0---0---00000'      ; upper left clip
;        db      '--x0---0---00000'      ; upper right clip

Using _printGrid and these "test cases", I developed _solveGrid to count the number of mines each field is adjacent to.

_solveGrid:
        push    esi
        push    edi

.loop_all_fields_backwards:             ; for (edi = last_field_in_grid; edi >= 0; edi--)
        mov     edi, last_field_in_grid

.loopFields:
        cmp     [grid + edi], byte is_bomb ; is it a bomb? (or incremented bomb field)
        jb      .nextField              ; branch if less than bomb

        ; it is bomb, increment all neighbouring hints
.loop_all_neighbours_backwards:         ; for (ecx = 7; ecx >= 0; ecx--)
        mov     ecx, 7                  ; 8 neighbour coordinates to test

.loopNeighbours:
        mov     esi, edi                ; add to current bomb's position...
        add     esi, dword [around + ecx * register_size]
        cmp     esi, last_field_in_grid ; is the neighbour position valid?
        ja      .nextNeighbour          ; neighbour position is outside

        add     [grid + esi], byte 1    ; else add bomb count for neighbour

.nextNeighbour:
        sub     ecx, 1                  ; ecx--
        jnc     .loopNeighbours         ; ecx >= 0

.nextField:
        sub     edi, 1                  ; edi--
        jnc     .loopFields             ; edi >= 0

        pop     edi
        pop     esi
        ret

is_bomb equ     'x'

around: dd      - max_dimension - 1
        dd      - max_dimension
        dd      - max_dimension + 1
        dd      - 1
        dd        1
        dd        max_dimension - 1
        dd        max_dimension
        dd        max_dimension + 1

_solveGrid iterates all fields in grid using edi, even the ones outside the mine field. If the field is a bomb (contains an 'x'), The code loops ecx for all eight neighbours and increments their field indexed by esi. This adds one to byte value that is there, even if it is a bomb. That works as long as the bomb symbol has an ASCII value of at least 9 higher than the symbol of the empty space. In the edges of the field not all neighbours exist, so esi is clipped by last_field_in_grid which prevents array overflow. Interestingly there is no need to test for underflow, i.e. negative indices in esi, because these are also skipped by branch instruction ja. (The actual bit value of negative values is above last_field_in_grid.)

Preparing Output
After counting the adjacent bombs, grid contains the needed information but is not ready to be printed. For example hints need to be numbers.

_formatGrid:
        push    esi
        push    edi

.loop_all_fields_backwards:             ; for (edi = last_field_in_grid; edi >= 0; edi--)
        mov     edi, last_field_in_grid

.loopFields:
        cmp     [grid + edi], byte is_bomb ; is it a bomb?
        jnb      .formatBomb            ; bombs are increased as well, so we check for >=

.formatHint:                            ; it is a hint, need to make it a number
        add     [grid + edi], byte emptyToHint
        jmp     .nextField

.formatBomb:                           ; it is a bomb, reset symbol
        mov     [grid + edi], byte bomb

.nextField:
        sub     edi, 1                  ; edi--
        jnc     .loopFields             ; edi >= 0

        pop     edi
        pop     esi
        ret

is_empty equ    '-'
emptyToHint equ '0' - is_empty
bomb    equ     'B'

This resets bombs to their symbol and converts hints into decimal numbers. emptyToHint is the difference between the input character of the original mine field and the wanted hint digits. For example, if a field was incremented three times it contains '-' + 3 + '0' - '-' = '0' + 3 = '3' now, which is exactly what is need.

Reading Input
Finally I reached the fourth part of the implementation, reading the mine field from Standard Input. Reading the actual grid was similar to writing it, just the opposite direction of data flow, e.g. using _ReadFile@20 system call instead of _WriteFile@20 and skipping the new-line instead of writing it. More interesting was reading the first line containing the dimensions of the grid because these contain decimal numbers of arbitrary length.

_readDigit:
        ; parameters
fileReadHandle equ 4 + 1 * register_size
        ; local variables
numberOfBytesRead equ -1 * register_size
byteBuffer        equ -2 * register_size
number            equ -3 * register_size

        enter_method 3

        mov     [ebp + number], dword 0
.read_next_character:
        lea     edx, [ebp + numberOfBytesRead] ; lpNumberOfBytesRead
        lea     eax, [ebp + byteBuffer] ; lpBuffer
        mov     ebx, [ebp + fileReadHandle] ; hstdIn

        ; read a single character
        push    NULL
        push    edx
        push    dword 1
        push    eax
        push    ebx
        call    _ReadFile@20

        mov     eax, [ebp + byteBuffer]
        and     eax, 0xff
        cmp     eax, '0'
        jb      .finished
        sub     eax, '0'

        ; multiply by 10 and add new number
        mov     ebx, [ebp + number]
        shl     ebx, 1                  ; * 2
        mov     ecx, ebx
        shl     ecx, 2                  ; * 8
        add     ebx, ecx                ; = * 10

        add     ebx, eax
        mov     [ebp + number], ebx
        jmp     .read_next_character

.finished:
        mov     eax, [ebp + number]

        exit_method
        ret

_readDigit reads bytes from its input until it finds something that is white-space. In fact everything that has an ASCII value less than '0' is considered white-space. The currently read value is converted to a number by subtracting the ASCII value of '0' and added as next digit. Of course this is extremely simplified, as extra white-space, Windows new-lines or unexpected characters completely mess up the logic. Fortunately this is just a kata and no production code ;-)

Completed
The reading of the input completed the Minesweeper Kata in Assembly. To avoid repetition I separated the Standard In/Out handles from their actual usage and provided them as arguments to the functions. The final main entry point looked like that:

global  _main

_main:

        call    _getStdInFileHandle
        push    eax                     ; hstdOut 3 times for next calls
        push    eax

        ; read width
        push    eax
        call    _readDigit
        add     esp, register_size      ; clean up parameter
        mov     [width], eax

        ; read height
        ; 2nd eax still pushed
        call    _readDigit
        add     esp, register_size
        mov     [height], eax

        ; read grid
        ; 3rd eax still pushed
        call    _readGrid
        add     esp, register_size

        call    _solveGrid
        call    _formatGrid

        call    _getStdOutFileHandle

        push    eax                     ; hstdOut
        call    _printGrid
        add     esp, register_size

        jmp     _exit

If you want to see the whole kata, the complete source in here.

Done?
The kata is finished, but there are several issues remaining. Its design was influenced by my need for feedback through the console output. I am wondering how an implementation would look like if I had followed a strict TDD approach. I guess it would be less coupled to system calls at least. Also I am unhappy with its internal coupling. All four major functions depend on the grid and its internal structure, e.g. they need to know max_dimension. Following D.L. Parnas' criteria to decompose systems, it would be favourable to decouple the data from the different algorithms. The grid could be wrapped in its own module (OBJ) allowing only API access, which would add indirection, a lot of subroutine calls and the need to copy data around. This feels against the spirit of raw Assembly. It seems I will have to revisit Minesweeper again.

3 July 2015

Hello World Windows 32 Assembly

During my time at university I wrote a lot of Assembly for the 80x86 but stopped doing so when I moved into commercial software development. Recently - just 20 years later ;-) - I felt an urge to go back and refresh my memories. This was not the first time I felt retro, but this time I did not know hot to get started. I did not know anything about current tooling and had never created a native (Windows) application before. Usually the code I write today, be it in Java, C#, Ruby or otherwise, relies on underlying interpreters or virtual machines. (Yes I know, some people still write C and C++...)

Getting Started
Just by coincidence, one of the user groups in Vienna, the We Love Programming Languages group, dedicated one of its meetups to Assembly. There Angeliki Chrysochou showed a Hello World in Assembly which gave me a good idea what I had to do. During the following research I discovered how to write Hello World in Assembly under Windows on StackOverflow, which gave me all the information I needed to get started, or so I thought. In the end it still took me several hours to create a Hello World that worked, which is why I decided to write this summary.

Language
The modern Intel IA-32 flavour of 80x86 Assembly looks pretty much like the one I worked with in the 90's. I had no problem with that. Of course CPU architectures got more complex and you have to watch out for far more things when optimising for performance, e.g. branch prediction, pipeline stalls and so forth, but the general idea stayed the same.

The Assembler
Angeliki used NASM, the Netwide Assembler in her presentation and it looked good, so I tried it as well. It worked great and I did not check out other tools (like MASM). The command to translate (does Assembly get compiled?) an ASM file for Windows with NASM is

nasm -fwin32 helloworld.asm

This creates an OBJ file which needs to be linked.

Operation System Calls
As soon as you want to do anything regarding input, output or even exiting the current application, you need operation system calls. You have to search the MSDN Windows API Index for the calls you need and have to decorate them according to the Windows Application Binary Interface (ABI). (You can also use C functions, but that would not be pure Assembly, wouldn't it.) The decorated function name has to be external and its parameters are put on the stack in CDECL order. (In fact Windows 32 system calls are syscall.) The source of the shortest NASM Windows application is

global _main
    extern  _ExitProcess@4

    section .text
_main:
    push    0
    call    _ExitProcess@4

It took me time to figure out why ExitProcess is _ExitProcess@4 but MessageBox is _MessageBoxA@16.

The Linker
Finally the generated OBJ code has to be combined with the used system libraries into a single unified executable program. I found several ways for doing this under Windows. The native Windows way is to use link.exe from Microsoft Visual C++, e.g.

link /entry:main helloworld.obj /subsystem:console /nodefaultlib kernel32.lib

The kernel32.lib is necessary for the ExitProcess to be found. Getting the linker was a bit of a hassle because you need to install Visual C++ or at least some Windows SDK to get it. I did not want to do that, but I remembered that I had used the command line part or VC 6.0 aka VC98 to compile native extensions for Ruby 1.8.6 before the Ruby DevKit became popular. After recovering VC98.zip from the depths of my hard-disc, I enabled it by calling its vcvars32.bat. This sets the PATH, INCLUDE and LIB environment variables. For example the kernel32.lib is located in the library folder. As far as I know (from StackOverflow), vcvars32.bat is still available in VC++. I guess I would need a newer Kernel library to use the latest Windows 32 functions, but this was enough for Hello World and I was able to link my OBJ.

Another way to link under Windows is to use Unix tool ports like MinGW, a minimalist development environment for native Microsoft Windows applications. MinGW 32 comes with ld.exe, e.g.

ld -o helloworld.exe helloworld.obj -e_main -lkernel32
      --enable-stdcall-fixup %SYSTEMROOT%\system32\kernel32.dll

Again I did not want to download and install huge tools, I just wanted to create a little Assembly Hello World. This should not be so hard. As I said before, MinGW is used to compile native C extensions for Ruby since 1.8.7, and I just used the one that came with the Ruby DevKit. After setting its devkitvars.bat I was able to create my helloworld.exe.

There exist standalone linkers like GoLink which might work, but I did not check any of them after I had success with both link and ld. But in general I would prefer something small, like NASM which is just a single executable. (Edit: Yes Golink does work and is just a single executable, just what I need.)

Execution
One beauty of Assembly is the size of created application. I remember one of my smallest DOS applications was around 20 bytes (!) in total, a COM application which did not require any linking. Agreed it did not do much, just turned off Num-Lock, but it was useful to me at that time. The Hello World's object file is 445 bytes and the executable helloworld.exe created by ld is around 4kB, the one created by link 12 kB. (I did not check for any settings to optimise for size, remove debug information etc. - anyway the size of the compiled program does not matter.)

Code
Here is the complete Assembly source code of the pure Windows IA-32 Hello World, pretty much like it was answered by caffiend,

STD_OUTPUT_HANDLE equ -11
%define NULL    dword 0

    extern  _GetStdHandle@4
    extern  _WriteFile@20
    extern  _ExitProcess@4

    global _main

    section .text

_main:
    ; local variable
bytesWritten equ -4
    mov     ebp, esp
    sub     esp, 4

    ; hStdOut = GetstdHandle(STD_OUTPUT_HANDLE)
    push    STD_OUTPUT_HANDLE
    call    _GetStdHandle@4
    mov     ebx, eax

    ; WriteFile(hstdOut, message, length(message),
    ;           &bytesWritten, null);
    push    NULL
    lea     eax, [ebp + bytesWritten]
    push    eax
    push    (message_end - message)
    push    message
    push    ebx
    call    _WriteFile@20

    ; ExitProcess(0)
    push    0
    call    _ExitProcess@4

    ; never here
    hlt
message:
    db      'Hello, World', 10
message_end: