Showing posts with label Math. Show all posts
Showing posts with label Math. Show all posts

Monday, 14 June 2010

Project Euler 59: Cracking an XOR code

Code breaking is about as glamorous as it gets for a Geek. So when it comes to Project Euler Problem 59, think 24, with Jack Bauer unscathed amidst a hail of bullets, memory stick in one hand, terrorist’s neck throttled in the other, threatening terrible retribution to the bad guy’s family if the secret goes to the grave with him. Terrorist rasps out,

“There’s a file … cipher1.txt … ASCII format … urrrh … XOR … 3 … letter … password … lower case … all English”

What would Chloe O’Brian make of that when she received the file over a secure socket from Jack’s Smart Phone? Being the bright girl she is, I’ve no doubt that she’d fire up MonoDevelop (only the dumb FBI would use Visual Studio and Windoze) and write the following:

var encryptedBytes = File.ReadAllText(dataPath)
                        .Split(',')
                        .Select(byte.Parse)
                        .ToArray();

That would get her the encrypted data from the file as a sequence of bytes. Now to generate all possible passwords:

const byte FirstChar = (byte)'a';
const byte LastChar = (byte)'z';

var allPossiblePasswords = from a in FirstChar.To(LastChar)
                           from b in FirstChar.To(LastChar)
                           from c in FirstChar.To(LastChar)
                           select new[] { a, b, c };

XOR encryption requires that the password be repeated cyclically, abcabcabc.... So given a sequence containing the characters of password, how about a Repeat extension method that repeats the sequence ad infinitum? (if you look at the following code and hear the Infinite Loop alarms go off, remember that iterators are lazily evaluated):

public static IEnumerable<T> Repeat<T>(this IEnumerable<T> sequence)
{
    while (true)
    {
        foreach (var item in sequence)
        {
            yield return item;
        }
    }
}

While she’s at it, she’ll need to Xor two sequences together:

public static IEnumerable<byte> Xor(this IEnumerable<byte> sequenceA, IEnumerable<byte> sequenceB)
{
    return sequenceA.Zip(sequenceB, (left, right) => (byte)(left ^ right));
}

This uses the Zip method, new in .Net 4.0, which merges two sequences together, allowing elements from left and right sequences to be processed in pairs (if you’re surprised, as I was, that the (byte) cast is necessary, see this question on Stack Overflow).

Now she can try out all the passwords, seeing what they give her if she uses them to decrypt the text. Note that Chloe has already adopted .Net 4.0 and the PLINQ extensions, so she’s using AsParallel() to max out her 128 Core workstation. (ConvertAsciiBytesToString is another extension method that simply delegates to ASCIIEncoding.GetString).

var possibleDecryptions = from trialPassword in allPossiblePasswords.AsParallel()
                          let repeatedPassword = trialPassword.Repeat()
                          select encryptedBytes
                                          .Xor(repeatedPassword)
                                          .ConvertAsciiBytesToString();

But she’s hardly going to spend the next 24 hours trawling through the possible decryptions to pick out the one in recognisable English from the 17575 in gobledygook. So she needs some means of assessing them automatically. This is where Chloe is forever in my debt. She read last weeks episode of Functional Fun, so she knows about my LanguageAnalyser that uses frequency analysis to determine how likely a sample of text is to be written in a particular language.

Rather than using the analyser to determine which of several languages the sample is written in, Chloe can use her prior knowledge that the coded message is written in English, and determine which of the possible decryptions has letter frequencies that most closely match those in English.

var languageAnalyser = new LanguageAnalyser();

var decryptionsWithDistanceToEnglish = from decryptedString in possibleDecryptions
                                       select new
                                                  {
                                                      DecryptedString = decryptedString,
                                                      Distance = languageAnalyser.DetermineCloseness(decryptedString, LanguageProfiles.English)
                                                  };

var mostLikelyDecryption = decryptionsWithDistanceToEnglish.MinItem(p => p.Distance);

And that's it. Thanks to Linguistic Geometry and LINQ, Chloe can report the decrypted (and surprisingly biblical!) message to Jack, so that he can get on with saving the world.

Ahem. Before we got carried away we were trying to solve Project Euler Problem 59, and that asks for the sum of the Ascii values of the decrypted message. So we need one more step:

var sumOfAsciiValues = mostLikelyDecryption.DecryptedString
                        .ConvertStringToAsciiBytes()
                        .Aggregate(0, (runningTotal, next) => runningTotal += next);

The full code is shown below, and is also available on MSDN Code Gallery.

namespace ProjectEuler
{
    [EulerProblem(59, Title="Using a brute force attack, can you decrypt the cipher using XOR encryption?")]
    public class Problem59
    {
        public long Solve()
        {
            const byte FirstChar = 97;
            const byte LastChar = 122;

            var dataPath = @"ProblemData\Problem59.txt";

            var encryptedBytes = File.ReadAllText(dataPath)
                                    .Split(',')
                                    .Select(byte.Parse)
                                    .ToArray();

            var allPossiblePasswords = from a in FirstChar.To(LastChar)
                                       from b in FirstChar.To(LastChar)
                                       from c in FirstChar.To(LastChar)
                                       select new[] { a, b, c };

            var possibleDecryptions = from trialPassword in allPossiblePasswords.AsParallel()
                                      let repeatedPassword = trialPassword.Repeat()
                                      select encryptedBytes
                                          .Xor(repeatedPassword)
                                          .ConvertAsciiBytesToString();

            var languageAnalyser = new LanguageAnalyser();

            var decryptionsWithDistanceToEnglish = from decryptedString in possibleDecryptions
                                                   select new
                                                              {
                                                                  DecryptedString = decryptedString,
                                                                  Distance = languageAnalyser.DetermineCloseness(decryptedString, LanguageProfiles.English)
                                                              };

            var mostLikelyDecryption = decryptionsWithDistanceToEnglish.MinItem(p => p.Distance);

            var sumOfAsciiValues = mostLikelyDecryption.DecryptedString
                .ConvertStringToAsciiBytes()
                .Aggregate(0, (runningTotal, next) => runningTotal += next);

            return sumOfAsciiValues;
        }
    }
}

Thursday, 10 June 2010

Practical Linq #5: Guessing the language of a piece of text

Google Chrome is magic: getting ready for my summer holiday to Brugge, I hit the city’s homepage. A little bar popped up at the top of the page: “This page is in Dutch. Would you like to translate it?”

image

How does Chrome know that? I suppose the domain name “.be” is one clue, but they speak French as well as Dutch in Belgium, so that by itself is not enough. Chrome does the same trick for pages written in a whole raft of languages, from Afrikaans to Yiddish.

I don’t know what Google’s secret sauce is, but I can show you one simple and surprisingly effective technique with an elegant implementation using LINQ in C#.

Studies in Linguistic Geometry

It is built around the observation that in each sufficiently large sample of text in a given language, letters of the alphabet will all occur with roughly the same frequencies. In English for example, the letter ‘e’ tops the popularity charts, making up about 13% of the average body of text. ‘t’ and ‘a’ are next, accounting for 9% and 8%. By contrast, in Portuguese, ‘a’ is the most frequent, at 15%, with ‘e’ and ‘o’ close behind.

So letter frequencies can be used as a kind of signature, or DNA profile, for a language. But given a sample of text, how do we decide which language signature it matches most closely? This is where we switch from linguistics to geometry.

In 2-dimensional space, we calculate the distance between two points using Pythagoras’ theorem. image The same thing works for two points in the real, 3-d, world. The distance from (x1, y1, z1) to (x2, y2, z2) is

image

But why stop there? What about the distance between two points in 26-dimensional space? You guessed it!

image

26-d space? Why bring that into the equation? Well think about those language signatures, giving the frequency of occurrence of each letter. You can reinterpret those signatures as points in 26-d space: the point representing the signature for English would be at 0.13 along the ‘e’ axis, 0.09 along the ‘t’ axis, 0.08 along the ‘a’ axis, and so on.

This then gives us our straightforward method of determining which language our sample text belongs to. We do a frequency analysis on its letters, then imagine the result of that analysis as a point in 26-d space, and work out which language signature it lies closest to using Pythagoras. Easy!

So how does it look in LINQ?

Linq to Linguistics

We start off with the frequency analysis (taking care to normalise the text to lower-case):

IEnumerable<CharacterFrequency> CalculateCharacterFrequencies(string sample)
{
    var characterFrequencies = sample
        .Select(char.ToLower)
        .GroupBy(c => c)
        .Select(group => new CharacterFrequency
        {
            Character = group.Key,
            Frequency = (double)group.Count() / sample.Length
        });

    return characterFrequencies;
}

Next we introduce a LanguageSignature class to hold the signature of each language:

[DebuggerDisplay("Language = { Language }")]
public class LanguageSignature
{
    private IDictionary<char, double> _frequencies;

    public LanguageSignature(string language, IDictionary<char, double> characterFrequencies)
    {
        Language = language;
        _frequencies = characterFrequencies;
    }

    public string Language { get; protected set; }

    public double GetFrequency(char character)
    {
        double frequency;
        return _frequencies.TryGetValue(character, out frequency) ? frequency : 0;
    }
}

Here are some snippets of sample signatures, data courtesy of Wikipedia

public static class LanguageSignatures
{
    public static LanguageSignature English =
        new LanguageSignature(
            "English",
            new Dictionary<char, double>
                {
                    {'a', 0.08167},
                    {'b', 0.01492},
                    {'c', 0.02782},
                    /// etc.
                });

    public static LanguageSignature French =
        new LanguageSignature(
            "French",
            new Dictionary<char, double>
                {
                    {'a', 0.07636},
                    {'b', 0.00901},
                    // etc.
                });

    
}

Now we do our sums in 26-d space, ignoring any letters that we don’t have frequency data for (note that Sqrt is a little extension method I created on the Double class that simply calls Math.Sqrt):

double CalculateDistanceFromSignature(LanguageSignature signature, IEnumerable<CharacterFrequency> characterFrequencies)
{
    var distance = characterFrequencies
        .Where(characterFrequency => signature.GetFrequency(characterFrequency.Character) > 0)
        .Select(characterFrequency
            => Math.Pow(characterFrequency.Frequency - signature.GetFrequency(characterFrequency.Character), 2))
        .Sum()
        .Sqrt();
    return distance;
}

Now we’re ready to determine which language our sample is closest to:

public LanguageSignature DetermineMostLikelyLanguage(string sample, IEnumerable<LanguageSignature> signatures)
{
    var characterFrequencies = CalculateCharacterFrequencies(sample);

    var closestLanguage = signatures.Select(signature => new
    {
        Language = signature,
        Distance = CalculateDistanceFromSignature(signature, characterFrequencies)
    })
    .MinItem(languageDistance => languageDistance.Distance);

    return closestLanguage.Language;
}

MinItem is an extension method I knocked up that returns the item from a sequence that yields the smallest value from the expression you supply (in this case, the smallest distance):

public static T MinItem<T, TCompare>(this IEnumerable<T> sequence, Func<T, TCompare> comparatorSelector) where TCompare : IComparable<TCompare>
{
    var minItem = sequence.Aggregate(
        sequence.First(), 
        (current, min) => comparatorSelector(current).CompareTo(comparatorSelector(min)) < 0 ? current : min);

    return minItem;
}

And we’re done.

Testing, Testing, Un, deux, trois

To prove it, I created a little test that analysed samples of texts in several languages taken from Project Gutenberg (the test uses MbUnit’s very handy XmlData feature to read the different test cases from my xml file).

public class LanguageAnalyserTests
{
    [Test]
    [XmlData("//Sample", FilePath = "ProblemData\\LanguageSamples.xml")]
    public void TestAnalysis([Bind("Text")]string text, [Bind("@Language")]string expectedLanguage)
    {
        var analyser = new LanguageAnalyser();
        var determinedLanguage = analyser.DetermineMostLikelyLanguage(
            text
            new[] { LanguageSignatures.English, LanguageSignatures.French, LanguageSignatures.German, LanguageSignatures.Dutch, LanguageSignatures.Spanish, LanguageSignatures.Italian });

        Assert.AreEqual(expectedLanguage, determinedLanguage.Language);
    }
}

In my test I’ve got 9 samples in German, English, French, Dutch, Spanish and Italian, extracted at random from the Gutenberg archives and each sample is correctly identified by my analyser. I also did a little experiment to see how much text it needed to correctly distinguish the samples. Limiting the samples to 600 characters caused some confusion between Spanish and English, but 700 characters was sufficient in my tests to enable the analyser to pick the right language for all samples.

A Challenge

I actually developed this LanguageAnalyser to help me solve a problem on Project Euler. Which one was it? Answers in the comments please, or by email, and bonus marks to anybody who can guess what my solution looks like. Check back Monday for the answer, and for full source code to today’s post.

Update: The answer's here, and the source code is now on MSDN Code Gallery

Saturday, 23 January 2010

Simulating Snakes and Ladders with a PLINQ Turbo boost

“Daddy, will you play a game with me?”

I’d been left in charge, whilst my wife had some well-earned time to herself. Baby brother had been bounced to sleep, and now big sister was wanting some attention.

“Sure! What shall we play?”

My daughter fished around in the cupboard, then thumped a box on the table. My heart sank. “Snakes and Ladders”, she announced.

Snakes and Ladders

For the benefit of any reader who has the good fortune never to have thus occupied their children, Snakes and Ladders is a game played on 10 x 10 board wherein players take it in turns to move across the board boustrophedonically (that is, left to right then right to left – I’ve been itching to work that word into a blog post), on each go moving the number of spaces indicated by the throw of a die. The winner is the first person to reach the last space on the board. The only modicum of excitement to be found in the game is provided by the eponymous snakes and ladders, which connect certain squares on the board. Land on a square at the foot of a ladder, and you zoom ahead to the square at its top. But beware the squares where snakes lie in wait: land here, and you must slide back down the board to the snake’s tail.

You see then, why I wasn’t thrilled at my daughter’s choice: I don’t expect much from a children’s game, but it is nice to be sure that it will finish before one’s brain goes into power-save mode. And no mathematician would be prepared to give such a guarantee about this game: it is perfectly possible to find yourself within a throw of winning, but the next moment unceremoniously snaking your way back to the beginning, to start all over again.

So as I fixed my grin in place and started rolling the die, I got to wondering: how long till we can play something else? How many turns does a game of Snakes and Ladders last, on average? By the time I’d slid down my first snake, I knew how I could find out: I would set my computer playing snakes and ladders, over and over, and instruct it to count how many turns each game took. Apply a little high-school Maths and Stats to the results, and I’d have my answer.

I will confess the game board received little of my attention from then on, as code began to assemble in the editor in my mind (which has syntax highlighting, but is lacking Intellisense - CommonSense too, my wife would tell you). When I realised that I could use PLINQ to parallelise the code and speed up the simulation, my fingers began itching to get at the keyboard.

Here’s how the code turned out, once I fired up Visual Studio.

Modelling the Game

First, a fluent interface for defining the game board:

var board = new BoardBuilder()
            .Length(100)
            .Ladder().From(1).To(38)
            .Ladder().From(4).To(14)
            .Ladder().From(9).To(31)
            .Snake().From(16).To(6)
            .Ladder().From(21).To(42)
            .Ladder().From(28).To(84)
            .Ladder().From(36).To(44)
            .Snake().From(47).To(26)
            .Snake().From(49).To(11)
   //...
   .Build()

Then the game logic. The GameBoard built by the BoardBuilder consists of a number of Places. Each Place has a reference to the Place following it on the board, and if it is at the foot of a ladder or the head of a snake then it has a reference to the Place at the other end of the link. Most of the logic of the game simulation is contained in the MoveOn method. This method is defined recursively. It is passed the numberOfPlaces that the player should move, and it returns a reference to the Place they end up at, taking into account any snakes or ladders leaving the final square. For simplicity, I’ve decided that players cannot move beyond the last space on the board – they simply get stuck there, whatever they throw.

    class Place
    {
        private Place _ladderTo;
        private Place _chuteTo;
        private Place _nextPlace;

  // ...

        public Place MoveOn(int numberOfPlaces)
        {
            Contract.Requires(numberOfPlaces >= 0, "numberOfPlaces must be positive");

            if (numberOfPlaces > 0)
            {
                if (IsLastSpace)
                {
                    return this;
                }
                else
                {
                    return _nextPlace.MoveOn(numberOfPlaces - 1);
                }
            }

            if (_ladderTo != null)
            {
                return _ladderTo;
            }
            else if (_chuteTo != null)
            {
                return _chuteTo;
            }
            else
            {
                return this;
            }
        }

        // ...
    }
}

Running the Simulation

At last, the simulation:

var gameLengths
    = 1.To(numberOfTrials)
    .Select(trial =>
        {
            var die = new Die();
            return board.
           FirstPlace
           .Unfold(place => place.MoveOn(die.Roll()))
           .TakeWhile(place => !place.IsLastPlace)
           .Count();
        })
    .ToList();

Not much to look at, is it? I start by generating a simple sequence, 1.To(numberOfTrials), and for each item in that sequence I instruct the computer to play a LINQified game of Snakes and ladders, and to count how many moves it takes to get to the end of the board. Each game starts by creating a new Die (would be handy if we could do that in real life – we’re always losing ours). Then, using the Unfold method, seeded with the first place on the board, we generate a sequence of all the spaces landed on during that game. We pull items out of this sequence until we hit the last place on the board, at which point we count how many moves the game took.

Analysing the Results

Of course, we want to do some analysis on the results. LINQ makes it trivial to find the average length of a game:

var averageLength = games.Average();

This turns out to be 35.8 moves.

More interesting is the frequency distribution of game lengths which tells us how often we can expect to play games of a given length.

var frequencyDistribution =
    gameLengths
   .GroupBy(gameLength => gameLength)
   .OrderBy(gameLengthGroup => gameLengthGroup.Key)
   .Select(gameLengthGroup => gameLengthGroup.Key + "," + gameLengthGroup.Count() + "," + gameLengthGroup.Count() / (double)numberOfTrials)
   .ToList();
System.IO.File.WriteAllLines("SnakesAndLadderResult.csv", frequencyDistribution);

gameLengths is the sequence containing the length of each simulated game. This query is counting how many games of each length were encountered in the simulation by grouping together all games of the same length. After ordering the groups, shortest game length first, it creates a string for each group, listing the length, the absolute number of games of that length and the frequency of that length of game. File.WriteAllLines is a neat method that (from .Net 4.0 onwards) takes an IEnumerable<string> and writes each string as a new line to a file, giving us an instant CSV file from which Excel can draw a pretty chart: SnakesAndLadders_FrequenciesOfGameLengths

As you can see at a glance, the shortest game I can hope for takes 7 moves, but that’s pretty unlikely to happen – about once in a thousand games. The most common length of game encountered in the simulation was 19. Now let’s transform this into a cumulative frequency chart (which shows how often games take a given number of moves or fewer): imageThis shows that in at least half of all games I’ll need to throw the dice 28 times or more before getting to the end. Worse, there’s a 10% chance that it’ll be 68 moves or more before I beat those slippery snakes.

The PLINQ Part

Where does PLINQ come into all this? And what is PLINQ anyway? PLINQ is an extension to LINQ, part of the Parallel Tasks library that is shipping as part of .Net 4.0. You know how you splashed out for that quad-core, hyper-threading enabled processor and geeked out over all those little processor utilisation charts in Task Manager – then noticed that your CPU rarely hits more than 12.5% total utilisation? Well PLINQ helps you put those lazy cores to work. MultiCoreTaskManagerWatch closely:

var gamesLengths
    = 1.To(numberOfTrials)
    .AsParallel()
    .Select(trial =>
        {
            var die = new Die();
            return board.
        FirstPlace
        .Unfold(place => place.MoveOn(die.Roll()))
        .TakeWhile(place => !place.IsLastPlace)
        .Count();
        })
    .ToList();

Spot it? Check out line 3. AsParallel(), that’s all you need to add. It takes a standard Linq-to-Objects query, and splits the work across multiple threads. On my bog-standard dual-core home laptop, that change took the time to run 1 million iterations down from 58 seconds to 36. That’s a 1.5x speed-up – by changing one line of code! And the best thing is, the speed-up increases as the number of cores in your machine goes up. On my Core i7 Quad core box at work, the simulation took 7 seconds in its parallelised form – and 9 seconds without the AsParallel magic! Still a 1.3x speed up – but overshadowed by the sheer awesomeness of the Nehalem architecture!

I should point out one subtlety here. You might have been wondering why it was necessary to create a new Die for every game? Why not share a die between games, like you do in the real world? The reason is the multiple threads introduced by the AsParallel() call. The Roll() method on Die calls through to Random.Next() under the covers, and the Next method isn’t thread-safe.

The easiest fix then, is to give each game its own Die. But there’s another thing: with all those threads beavering away simultaneously there’s a good chance that several Die will be created at the same time, each initializing their own instance of Random at the same time. Since Random by default seeds itself from the system clock, this would result in identical games being played – not exactly representative of the real world.

So what I do is have a static instance of Random which I use to seed all the others, taking care to take a lock on it of course:

class Die
{
    private Random _random;
    private static Random _seedGenerator = new Random();

    public Die()
    {
        lock (_seedGenerator)
        {
            _random = new Random(_seedGenerator.Value.Next());
        }
    }

    public int Roll()
    {
        return _random.Next(1, 7);
    }
}

Try it for yourself

As always the complete source code is available for download from MSDN Code Gallery. Try it out, and let me know what you think.

Monday, 8 June 2009

Are Biology Students devolving?

As if she wasn't busy enough bringing up our daughter, and growing the new addition to the family, my wife is a part-time private tutor to several A-Level Biology students (A-Level, for those not from the UK, being the final hurdle students have to clear before University). These students must all have passed their GCSEs before being accepted on to A-Level courses, with exams including core subjects like English and Maths. But you'd never know it from the answers they give as my wife goes over practice exam questions with them.

Take percentages. One question involved converting a percentage to a frequency: if 5% of the population have the recessive form of gene X, how many is that out of 100? Blank stare.

Then there was a question about interpreting Echocardiogram (ECG) traces. You’ll have seen one these in the movies if ever there was a hospital death scene. As the patient breathes his last, the camera invariably pans to an ECG monitor at his bedside. The sound-track turns somber, the sun is blotted out by a cloud, and on the monitor the peaks and troughs of heart activity subside to the infamous flat line.

On an ECG, the trace shows activity up the vertical axis and time along the horizontal, so to calculate the heart rate (as the question required) you begin by measuring the distance between peaks on the chart to get the duration of each beat. The student could do that - counting blocks was within her skill set. But the next step was to convert that to a rate in beats per minute.

ECG Monitor: Photo Credit, Wikipedia

"It's not fair that they ask Maths questions in a Biology paper!", complained the student. My wife was taken aback for a moment – the girl apparently wasn’t joking.

"So what do you do now?", my wife prompted once she’d recovered.

"I don't know", began the student; then with sudden inspiration: "It's got something to do with 60 hasn't it?".

"Well, yes, it has,", my wife said encouragingly. "You know how long each beat lasts. Now you need to work out how many of those beats will fit in a minute. So what do you do?"

"Ah, I've got it now". The student's face lights up. My wife smiles hopefully. "You divide the time by 60!"

Tuesday, 3 March 2009

Happy Holidays

How will you be celebrating square root day?