Learning Tries with Dinosaurs
A cool thing about being a software engineer is that there is always something new to learn. Although I was familiar with some common data structures like Lists, Stacks and Trees, I was guilty of not knowing about Tries.
As the name suggests, Tries are essentially a type of Tree. They are optimized for searching and retrieving keys organized in hierarchical structures: each node would store a key value and have a varying - although limited - number of children that represent valid successors in the hierarchy. The type of key values and their respective structures can differ, but a common use case is to store a dictionary of words:
Each node represents an alphabet letter, and by following any branch it is possible to find all interrelated words in a given language. A possible application would be searching all the words in a dictionary that start with a common prefix. Why are Tries essential for this task? Well, it becomes quickly self-evident that searching words with shared prefixes is a daunting task when using alternative data structures. For instance, although Hashmaps allow for value retrieval in constant time, there is no way to find all the hashes of related keys. On the other hand, it would be horribly slow to search in a List. So what could we do instead? The superpower of Tries is that searching related values can be done in logarithmic time — sometimes in nearly constant time.
Programming with Dinosaurs
Storing language dictionaries is not the only possible use case for Tries. In fact, I can think of another class of hierarchically organized symbols: taxonomy ranks. To put it simply for non-experts like me, taxonomy ranks are a hierarchical system used by botanists, zoologists and paleontologists to classify any living being. By using this system, it is possible to build a bush-like structure that shows the relationships between different species. Of course, this gives me a perfect excuse to introduce Dinosaurs into the mix while studying Tries! It sounds like a win-win to me.
Data structure and operations
Let’s implement our Taxonomy Trie. Any given Rank (aka a Trie node) will have a set of children Ranks that represent the smaller rungs in the Dino taxonomy and will feature an
isSpecies marker that will tell us if we reached a leaf of the tree of life:
Of course, we need an entry point for our Taxonomy structure, providing a root Rank element and some basic operations to create hierarchies and search for values:
isSpecies flag will allow for searching for exact species matches; as an alternative, we can just verify if the given rank is present in the structure.
DinoTries in action
Consider this group of lovely
Sauropodomorpha (long-neck dinosaurs related to the
Although a quick search in a paleontology reference book would suggest that these dinos are interrelated, it is still difficult to grasp their exact position in the taxonomy. However, a Trie data structure quickly reveals the underlying relations:
Dinosauria └── Saurischia └── Sauropodomorpha └── Prosauropoda └── Anchisauria ├── Anchisauridae │ ├── ankylosaurus polyzelous🦕 │ └── ammosaurus major🦕 ├── aardonyx celestae🦕 └── Melanorosauridae ├── melanorosaurus readi🦕 └── riojasaurus incertus🦕
An algorithmic expedition
Imagine being a paleontologist on the verge of a breakthrough discovery. You just examined a new fossil that changes everything you used to believe about
Anchisauria. You only need to perform a quick check about the
Anchisauridae family representatives: after all, you don’t want to meet the disdain of your fellow paleontologist for a gross mistake. You quickly insert the query into your field terminal, hoping that the underlying algorithm won’t take ages before giving a satisfactory answer.
For a Taxonomy Trie like ours, searching for all
Anchisauridae or any other clade would be implemented like this:
Your fame as a paleontologist is secured as you rapidly find out about the
Anchisauridae family members:
Anchisauridae ├── ankylosaurus polyzelous └── ammosaurus major