Playing with Ruzzle in Python

Ruzzle is becoming a popular game on smartphones and tablets. Inspired by Boggle, it consists of finding as many words as possible in a 4x4 grid of letters. Here I'll show how one can easily generate and solve grids automatically in Python.

Generating grids

A grid is an \(N \times M\) matrix of letters (\(N=M=4\) here). The letters are randomly sampled according to some probability distribution. We'll see how to generate grids with a reasonably large number of possible words.

Sampling letters

The easiest way of sampling letters is to use a uniform distribution over the alphabet. However, uncommon letters will appear as frequently as the most common letters, which will yield awkward grids with very few existing words. What we can do is take the frequency of letters across all existing words in a given language, and sample letters according to this distribution. The frequency list can be found on Wikipedia for example (here for English).

I created two text files: one with the list of all letters in decreasing order of frequency, and one with the corresponding frequency of each letter. These files can easily be opened with NumPy's loadtxt.
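For example (the file names here are my assumption):

from numpy import loadtxt, cumsum, digitize
from numpy.random import rand

letters = loadtxt('letters.txt', dtype=str)   # the 26 letters, most frequent first
frequencies = loadtxt('frequencies.txt')      # the corresponding frequencies, summing to 1

Then, to sample letters according to this distribution, we can use the following code: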

frequencies_cum = cumsum(frequencies)
dig = digitize(rand(count), frequencies_cum)
grid = letters[dig].reshape((rows, columns))

Here, letters is a 26-long vector with the letters, frequencies is a 26-long vector with the letter frequencies, count is the number of letters to sample, and rows and columns are the number of rows and columns in the grid. The idea is to partition the interval \([0,1]\) into 26 boxes, each box size being equal to the corresponding letter's frequency. By sampling uniform values in \([0,1]\) and getting the corresponding boxes in which they appear, we obtain random values between 0 and 25 that correspond to random letters respecting the frequencies. Mathematically, this method is called inverse transform sampling.

The cumsum function yields the cumulative probability distribution, and digitize implements its inverse: for each sampled value, it finds the index of the box it falls into. The dig variable therefore contains random integer indices between 0 and 25, and grid is finally a NumPy array with the random letters.
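Here is a toy example on a three-letter alphabet to make the mechanism concrete (the values are made up for illustration):

from numpy import array, cumsum, digitize

letters3 = array(['e', 'a', 'z'])
freqs3 = array([0.5, 0.4, 0.1])
cum = cumsum(freqs3)                # [0.5, 0.9, 1.0]
samples = array([0.2, 0.6, 0.95])   # three uniform draws in [0, 1)
letters3[digitize(samples, cum)]    # ['e', 'a', 'z']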

It would be possible to extend this generation method by taking second-order statistics into account (i.e. the frequency of each pair of successive letters across all words) and generating each letter conditionally on its neighbors. However, this would be much more complicated and probably overkill for small grids!
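For the curious, here is a minimal sketch of what the sampling step could look like; the bigram matrix is hypothetical and would have to be estimated from a word list, with bigram[i, j] the probability that letter j follows letter i and each row summing to 1:

from numpy import cumsum, digitize
from numpy.random import rand

def sample_next_letter(bigram, current):
    # sample the index of the next letter given the current letter's index,
    # using the same inverse transform sampling trick as above
    cum = cumsum(bigram[current])
    return digitize(rand(1), cum)[0]

Each letter would then be drawn conditionally on its left neighbor instead of independently.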

Using the IPython notebook

Now, we can define a simple Python class for generating a grid and displaying a nice representation in the IPython notebook. The principle is to create a _repr_html_ method in the class so that an HTML table is displayed in the notebook. Here is an example of such a class:

class Grid(object):
    def __init__(self, rows, columns):
        self.rows = rows
        self.columns = columns
        self.count = self.rows * self.columns
        # generating the grid
        frequencies_cum = cumsum(frequencies)
        dig = digitize(rand(self.count), frequencies_cum)
        self.grid = letters[dig].reshape((self.rows, self.columns))

    def _repr_html_(self):
        style_td = """
        width: 50px;
        height: 50px;
        font-size: 24pt;
        text-align: center;
        vertical-align: middle;
        text-transform: uppercase;
        font-weight: bold;
        background: #eee;
        """
        html = '<table>\n'
        for row in xrange(self.rows):
            html += '<tr>\n'
            for column in xrange(self.columns):
                html += '<td style="{0:s}">{1:s}</td>\n'.format(
                    style_td, self.grid[row, column])
            html += '</tr>\n'
        html += '</table>'
        return html

The __init__ constructor just contains the grid generation code we described above. The interesting part is the _repr_html_ method: we define an HTML table with some basic CSS styles and return the code. Then, in the IPython notebook, displaying a grid is as simple as this:
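g = Grid(4, 4)
g  # the last expression of a notebook cell is displayed with its rich representation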

The rich display feature can also be used to display SVG, PNG, JPEG, LaTeX or JSON representations. In the future, it will be possible to write custom JavaScript extensions, and we can expect to see rich representations using libraries such as D3, ThreeJS, WebGL, etc. I can't even imagine the incredibly cool stuff we're going to see in the notebook in the months and years to come.

Solving grids

Now that we can generate grids, how about solving them? I won't describe how to implement the game itself in Python, but rather how to write a robot that solves a grid automatically.

Using a dictionary

The first step is to find a dictionary with the list of all possible words in a given language. For French, I found this dictionary with 336,531 words; it's a few megabytes in size. I had to get rid of the accents, using the following code snippet (found here):

import unicodedata
with open('dictionary_accents.txt', 'r') as f:
    dictionary = f.read()
# decode the raw bytes to Unicode (this is Python 2 code)
if isinstance(dictionary, str):
    dictionary = unicode(dictionary, 'utf8', 'ignore')
# decompose the accented characters and drop the accents
dictionary = unicodedata.normalize('NFKD', dictionary).encode('ascii', 'ignore')
with open('dictionary.txt', 'w') as f:
    f.write(dictionary.lower())

Then, the dictionary can simply be loaded in Python with NumPy's loadtxt function:

dictionary = loadtxt('dictionary.txt', dtype=str)

Using an efficient data structure for the dictionary

We're going to write an algorithm that automatically solves a grid by finding all possible words according to the dictionary. It's a computationally intensive task, even for small grids, since the number of paths explodes combinatorially. Naive algorithms and data structures won't work even on 4x4 grids, so I'll detail the simplest techniques that allow us to solve a 4x4 grid almost instantaneously (a few tens of milliseconds on an old laptop).

The algorithm starts from each letter and recursively goes through all neighbors in the grid, checking at every step whether the current word exists in the dictionary. The number of paths is huge, and the dictionary contains several hundred thousand words, so looking up each candidate word by scanning the whole dictionary is infeasible. The solution is to use a more efficient data structure than a flat list of words.

The data structure I chose is a trie. It is a tree-like structure that is particularly well suited here: it offers a very efficient way of checking whether a word appears in the dictionary, and whether it is the prefix of at least one existing word. The latter point is crucial, because it lets us know in advance when an exploration path is a dead end, i.e. when no word can be found by appending letters to the current word. In that case, the solving algorithm backtracks and directly tries other paths.

In this tree, the root corresponds to the empty string, every internal node corresponds to a prefix, and every leaf corresponds to an existing word in the dictionary (in the implementation below, a word that is also the prefix of a longer word is marked with a special terminal key).

A trie

I found a code snippet on StackOverflow implementing a trie in Python. It is particularly simple, because nested Python dictionaries are a natural fit for tries.

_end = '_end_'
def make_trie(*words):
    root = dict()
    for word in words:
        current_dict = root
        for letter in word:
            current_dict = current_dict.setdefault(letter, {})
        # mark the end of a word with a special terminal key
        current_dict[_end] = _end
    return root

This function takes any number of words as arguments and converts them into a trie. To get the subtrie of all words starting with sta, we just use trie['s']['t']['a'].
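For example, here is the nested-dictionary structure of a tiny two-word trie:

trie = make_trie('bar', 'bat')
# {'b': {'a': {'r': {'_end_': '_end_'}, 't': {'_end_': '_end_'}}}}

Now, here is the function to efficiently check if a word is in the dictionary: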

def in_trie(trie, word):
    current_dict = trie
    for letter in word:
        if letter in current_dict:
            current_dict = current_dict[letter]
        else:
            return False
    # the string is a word only if the final node has the terminal key
    return _end in current_dict

This function walks down the tree following the letters of the word, and returns True if the final node is marked as the end of a word.

Here is the function to check if a string is a prefix of at least one word in the dictionary:

def prefix_in_trie(trie, word):
    current_dict = trie
    for letter in word:
        if letter in current_dict:
            current_dict = current_dict[letter]
        else:
            return False
    return True
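As a quick sanity check, here is how the two functions behave on a small toy trie:

trie = make_trie('bar', 'bat', 'barbue')
in_trie(trie, 'bar')          # True: 'bar' is a complete word
in_trie(trie, 'barb')         # False: 'barb' is only a prefix
prefix_in_trie(trie, 'barb')  # True: 'barbue' starts with 'barb'
prefix_in_trie(trie, 'bra')   # False: this path can be abandoned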

Solving algorithm

Now, here is the solving algorithm. We start from each letter in the grid. Then, we check 1) whether the current word is in the dictionary, and 2) whether the path is a dead end, i.e. whether no more words can be found along the current path. If everything is fine, we go through all neighbors of the current position and recursively apply the same function to the extended paths. The exploration corresponds to a depth-first search in a graph.

We use several data structures. First, words is the list of all words found so far, initially empty. The current path is stored in positions, a list of (i, j) tuples, which lets us avoid revisiting an already visited position within a given path. The trie built from the dictionary is assumed to be available in a variable named trie.

Here is the code:

def get_word(grid, positions):
    """Get the word corresponding to a path (list of positions)."""
    return ''.join([grid.grid[(i, j)] for (i, j) in positions])

neighbors = [(0, 1), (1, 0), (0, -1), (-1, 0), (1, 1), (-1, -1), (1, -1), (-1, 1)]

def explore(grid, positions, words):
    # process the current word
    word = get_word(grid, positions)
    # check if the word is in the dictionary
    if len(word) >= 2 and in_trie(trie, word) and word not in words:
        words.append(word)
    # stop if this path is a dead end, i.e. no more words are possible
    if not prefix_in_trie(trie, word):
        return
    # go through all neighbors of the last position
    pos = positions[-1]
    for neighbor in neighbors:
        npos = (pos[0] + neighbor[0], pos[1] + neighbor[1])
        # check that the neighbor is inside the grid
        if 0 <= npos[0] < grid.rows and 0 <= npos[1] < grid.columns:
            # avoid self-intersections
            if npos not in positions:
                # we create a copy of the list positions instead of
                # updating the same list!
                npositions = positions + [npos]
                # explore the new path
                explore(grid, npositions, words)

def find_words(grid):
    """Return all possible words in a grid."""
    words = []
    for row in xrange(grid.rows):
        for column in xrange(grid.columns):
            explore(grid, [(row, column)], words)
    # sort words by decreasing order of length
    return sorted(words, key=len, reverse=True)
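Putting everything together, solving a freshly generated grid looks like this (assuming letters, frequencies and dictionary have been loaded as described above):

g = Grid(4, 4)
trie = make_trie(*dictionary)  # build the trie once from the whole word list
words = find_words(g)
words[:10]  # the longest words found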

This algorithm turns out to be fast enough on 4x4 grids thanks to the trie data structure: a naive implementation with a flat NumPy array for the dictionary takes several minutes, versus a few tens of milliseconds with the trie.

As an example, here is the list of words found on the grid shown above (in French):

barbue, reacs, ruera, barbu, cabus, ubacs, scare, jura, jure, reac, crue, crus, urus, abus, ruer, bacs, busc, bure, brus, cuba, cura, cure, cabs, guru, grue, ubac, suca, surs, rea, eau, cru, ure, are, rue, rua, rus, bar, bac, bus, bue, bru, cab, car, gus, suc, sur, ra, re, eu, cl, au, ru, bu, ca, us, su.

Finally, the full code, including the dictionary, can be downloaded here.