Tries and HAT tries and PHP?
There is a plethora of data structures in the Computer Programming discourse: everything from the quintessential primitives — integers, floats, strings, and the like — to composite types, amalgams of the aforestated primitives — linked lists, objects, and arrays, among others — are essential in crafting a variety of metastable applications (staple of the craft). Tries are composite types that do not necessarily have the same prominence in programming parlance as say, an array or record/tuple but have significant viability that stems from their uniqueness.
A trie is a tree structure with a single empty parent node and multiple character descendants that each share a common prefix. Tries are principally designed to store strings and aptly transcend their primary purpose. It is the said idiosyncratic behavior that validates the synonym, prefix tree. Like many things in technology parlance, tries exist as a genealogy — there exist multiple variants of the structure distinguishable by technical nuance.
The most basic form of trie is a tree containing multiple child nodes (leaves) — each containing an 8-bit-sized character. Storage of data in a trie is such that key-constituting characters exist in branching character hierarchies — as individual nodes that descend from one another. The final node in a branch — that which corresponds to the penultimate string element — marks the end of a key. Per the aforedescribed rubric, storing a set of keys chain, chest, chime, chat, and box yields a structure that resembles that shown below.
Central to the allure of the trie is the fast prefix-driven searching. Each branch traversal is akin to treading a unique path. Per the example above, all keys with the ch prefix are easily derivable as branches whose precursor nodes are c and h. Tries, however, despite the aforedescribed potencies, are inherently flawed: they are space-intensive — for they grow in size on each addition of new nodes and consequently, branches — and are not cache-affable. The saving grace is that they are potent enough to provide reasonable worst-case lookup and search performance.
The solution to the space and caching inefficiencies of the most primitive trie structure is another entry in the line of trie — the HAT trie. First proposed by Nikolas Askitis and Ranjan Sinha, the HAT-trie is a cache-conscious, space-efficient data structure — a tapestry of ideas. The HAT trie mitigates space inefficiencies by existing, structurally, as a modified tree structure — with dynamically allocatable buckets. The anatomy of the HAT trie resembles that of the Burst trie — itself a robust data structure but, one also affected by being cache unaffable. Burst trie uniqueness is such that the data structure compresses trie nodes into node-parented buckets — capable of spawning/bursting into new node-parented buckets — each containing a set of string fragments.
Bursting is, in burst tries, more a result of exceeding bucket capacity than it is a function controllable by some heuristic. That is not the case with HAT tries which, contrarily offer the said heuristic as a discretionary burst threshold: one that can prove useful in controlling iteration speeds (typically, the stipulation is — the higher the threshold, the slower the iteration and vice-versa). Further still, the buckets present in the HAT trie structure are not linked lists like they are in burst trie anatomies, but rather cache-conscious array hashes — analogous to PHP hashtables. The former entity is susceptible to the pointer-chasing problem, a real caching encumbrance, while the latter proves invaluable in so far as existing as a dynamically allocable container.
All in all, it is safe to posit that the HAT trie is just a more efficient burst trie: one that combines the compression achievable from burst mechanics, and cache-consciousness of array hashes. To visually convey the theory of HAT trie storage, the diagrammatic representation to follow — that based on keys mentioned earlier in the text — should suffice. The tree traversal mechanism is the same as before, despite the presence of different anatomical structures — buckets. Fragments of the keys chime, chest, chat, chain, and box — ime, est, at, ain, and ox would — per HAT trie stipulations — exist in contiguous hash blocks (buckets) parented by nodes ch, and b respectively.
The natural corollary to the preceding exposition on trie theory is actual implementation. Per the title of this post, the target language is everyone’s and my personal favorite, PHP. One of the more underrated aspects of the language is its extensibility, regardless of whether the expansion of its potencies is via FFI or language extension. php-trie-ext, a module I recently put together, services the aforestated implementation need. A description of the module’s offerings is summarizable as an offering of two APIs — one a simple trie structure and the other, a wrapper around Thibaut Goetghebuer’s HAT trie header-only library. It differs from robust userland solutions such as Mark Baker’s and Abel Zhou’s respective tries and trie-tree projects as its purpose is infusing trie and HAT trie structures — written in C++ — into PHP.
Now, extending PHP is itself a verbose subscience and suitably a topic for another expositive text, so, describing vividly, the nuts and bolts of the said extension is not exactly the most feasible thing for the remainder of this article. Demonstrating usage of tries in the PHP userland, however, is. A quick glance at the API reveals features with which many developers are likely familiar. First is the array-like malleability, which validates the snippet to follow.
php-trie-ext’s Trie and HAT trie classes offer some semblance of persistence. Those interested in the immutability provided by persistent data structures are likely to take pleasure in using the module’s API, which also features map, filter, and merge (concat in a lot of contexts) methods — commonplace in Functional Programming.
Last, and ultimately most importantly, are the prefix search faculties bundled into the extension, that showcase the lookups discussed earlier in the article.
In conclusion, tries are incredibly useful as string storage containers and, with some additional plumbing, a viable option as key-value stores. There are not too many trie libraries in PHP that offer a combination of trie such as the one present in the php-trie-ext module, which I vehemently recommend to anyone intent on giving tries a shot.