Archive for November, 2012

Choosing data structures

I decided earlier in the week that I wanted to write an implementation of a general Markov chain in C; this is not particularly hard, but it does raise some interesting questions along the way—in particular, about data structures. A Markov chain really consists of two nested data structures: the set of the nodes in the list, and the set of outbound edges from each node; these structures serve vastly different purposes, as the order in which the nodes in the Markov chain are accessed depends on how edges are selected, but edges need to be sorted in a standard order so as to match how a Markov process functions. Accordingly, I found myself basing my decision on a few criteria:

Markov nodes:

  • Order of access not absolutely required to be stable
  • Key most likely in a totally ordered set
  • Minimizing access time a handy thought

Edge nodes:

  • Order of access paramount. Needs to remain stable.
  • Edges absolutely in a totally ordered set
  • Access time less important than order.

Based on these criteria, I arrived at two vastly different structures. For the Markov nodes, that “minimizing access time” led directly to thinking about trees, and thinking about how motion through a Markov chain happens led to considering a splay tree. A splay tree, if you’re unaware, is a binary search tree which has only one operation: splay. When you splay the tree, you modify its structure, bringing the node you are looking for (or the closest match) to the root of the tree. This has the handy property of keeping all the most frequently used nodes right in the top few levels of the tree, and still maintains an amortized O(log(n))—even with the slight probability that repeated splay operations will lead to it becoming a linked list.

To the contrary, the data structure I selected for the edges is a linked list, sorted from largest key to smallest key; in this case, I required that the structure remain stable, and furthermore that it have a particular order, with every node in a known position with regards to its value. Although search is an O(n) operation, it is doubtful a graph will have so many connections as to make that prohibitive, while at the same time maintaining a safe, regular, easy to traverse structure when running find_next_state().

By the by, the current code can be found in my Bitbucket repository, in markov.c and markov.h, respectively. They’re currently under rapid development and are nowhere near release-ready; I’m expecting the API to change significantly over the next few days.