The basic structure is a compressed ternary search tree of the suggestions, where nodes (prefixes) with the same completions are merged into one node, and where each node corresponding to a suggestion stores the weight of the suggestion and a reference to the suggestion string. Only the first character of a node is stored explicitly; the other characters are read from the corresponding suggestion string or (if the node does not correspond to a suggestion) from a suggestion string that is referenced instead. In addition to that basic structure, each node in the tree holds a precomputed "suggestion list": a rank-ordered array of references to the nodes of the k highest-weighted suggestions that start with the prefix corresponding to the node.
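To make the node layout concrete, here is a minimal sketch of such a node. All names (SketchNode, source, start, end, label) are hypothetical and chosen for illustration only, and the convention that a weight of -1 marks a node without a suggestion is an assumption, not something stated above.

```java
/** Illustrative node of a compressed ternary search tree (hypothetical names). */
final class SketchNode {
    final String source;   // suggestion string the node's characters are read from
    final int start;       // index in source of the node's first character
    final int end;         // index in source just past the node's last character
    final int weight;      // weight of the suggestion; -1 (assumed) if the node is not a suggestion
    final char firstChar;  // the only character stored explicitly in the node

    // Precomputed suggestion list: nodes of the highest-weighted completions, best first.
    SketchNode[] list = new SketchNode[0];

    // Ternary search tree links: smaller first character, next characters, larger first character.
    SketchNode left, mid, right;

    SketchNode(String source, int start, int end, int weight) {
        this.source = source;
        this.start = start;
        this.end = end;
        this.weight = weight;
        this.firstChar = source.charAt(start);
    }

    /** Rebuilds the node's characters: the stored first one plus the rest from the referenced string. */
    String label() {
        return firstChar + source.substring(start + 1, end);
    }

    boolean isSuggestion() {
        return weight >= 0;
    }
}
```

With the suggestions "canada" and "canal", for example, the shared prefix "cana" could be represented by a node that is not itself a suggestion and borrows its characters from "canada".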
For each suggestion inserted into the tree, at most one new node is added and at most one existing node is split into two nodes. A tree with n suggestions therefore has at most 2n - 1 nodes. But what is the total length of the suggestion lists in the tree? The answer is easier when we do not look at a ternary search tree but at a simpler trie data structure where the child nodes of a node are not arranged as a binary search tree. If such a tree has 2n - 1 nodes, then each internal node has exactly two child nodes. Consequently, if all leaf nodes are at the same depth, the tree has n suggestion lists of length 1, n/2 suggestion lists of length 2, n/4 suggestion lists of length 4, and so on until the maximum list length of k is reached; each of the remaining n/k - 1 nodes above that level also holds a list of length k. Assuming k is a power of two, this gives a total list length of n + 2(n/2) + 4(n/4) + ... + k(n/k) + k(n/k - 1) = (log2(k) + 2) n - k, which is approximately (log2(k) + 2) n.
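The estimate can be checked numerically. The following sketch sums the list lengths level by level for a perfectly balanced binary trie and compares the result with the closed form (log2(k) + 2) n - k. The class and method names are made up for this illustration, and both n and k are assumed to be powers of two with k no larger than n.

```java
/** Checks the total suggestion list length of a perfectly balanced binary trie (illustrative). */
final class SuggestionListLength {

    /** Sums min(k, number of leaves below the node) over all 2n - 1 nodes, one level at a time. */
    static long totalListLength(long n, long k) {
        long total = 0;
        // leavesBelow = 1 is the leaf level; leavesBelow = n is the root.
        for (long leavesBelow = 1; leavesBelow <= n; leavesBelow *= 2) {
            long nodesAtLevel = n / leavesBelow;
            total += nodesAtLevel * Math.min(k, leavesBelow);
        }
        return total;
    }

    /** Closed form (log2(k) + 2) n - k, valid when n and k are powers of two and k <= n. */
    static long closedForm(long n, long k) {
        int log2k = 63 - Long.numberOfLeadingZeros(k);
        return (log2k + 2) * n - k;
    }

    public static void main(String[] args) {
        System.out.println(totalListLength(1024, 8)); // 5112
        System.out.println(closedForm(1024, 8));      // 5112
    }
}
```

For n = 1024 and k = 8, both methods give 5112, against the approximation (log2(8) + 2) * 1024 = 5120.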
Ternary search trees are much less sensitive to insertion order than binary search trees. Even in the worst case, when the suggestions are inserted in lexicographic order, performance is usually only slightly degraded. The reason is that the tree as a whole does not degenerate into a linked list; only the small binary search trees within the ternary search tree do. For best performance, however, the suggestions should be inserted in random order. For large n, this practically always produces a balanced tree in which going left or right cuts the search space roughly in half.
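The effect of insertion order on a single binary search tree can be illustrated with a small experiment. The sketch below uses a plain unbalanced binary search tree of strings as a stand-in for one of the small search trees inside the ternary search tree; it is not the tree described here, and every name in it is hypothetical. Shuffling the input with java.util.Collections.shuffle before insertion is one simple way to obtain a random order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Compares the height of a plain BST after sorted and after shuffled insertion (illustrative). */
final class InsertionOrderDemo {

    private static final class Node {
        final String key;
        Node left, right;
        Node(String key) { this.key = key; }
    }

    /** Inserts the keys in the given order and returns the height of the tree, counted in nodes. */
    static int heightAfterInserting(List<String> keys) {
        Node root = null;
        int height = 0;
        for (String key : keys) {
            int depth = 1;
            if (root == null) {
                root = new Node(key);
            } else {
                Node n = root;
                while (true) { // iterative, so a degenerate tree cannot overflow the call stack
                    int c = key.compareTo(n.key);
                    if (c == 0) break; // duplicate key: nothing to insert
                    depth++;
                    if (c < 0) {
                        if (n.left == null) { n.left = new Node(key); break; }
                        n = n.left;
                    } else {
                        if (n.right == null) { n.right = new Node(key); break; }
                        n = n.right;
                    }
                }
            }
            height = Math.max(height, depth);
        }
        return height;
    }

    /** Returns count distinct keys in lexicographic order. */
    static List<String> sortedKeys(int count) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            keys.add(String.format("s%04d", i));
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String> keys = sortedKeys(1000);
        System.out.println("sorted:   " + heightAfterInserting(keys)); // degenerates to a list
        Collections.shuffle(keys, new Random(42));
        System.out.println("shuffled: " + heightAfterInserting(keys));
    }
}
```

Inserting the 1000 keys in sorted order produces a tree of height 1000, while the shuffled order yields a far shallower tree. In a ternary search tree the damage from sorted input is bounded, because each small search tree holds at most as many nodes as there are distinct characters at that position.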
If a suggestion is removed and the corresponding node has no middle child but a left child and a right child, the node is replaced with either the leftmost node from its right subtree or the rightmost node from its left subtree. To preserve the balance of the tree, the choice is made at random.
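The random choice between the two replacement candidates can be sketched on a plain binary search tree of characters. This is a simplified illustration with hypothetical names, not the actual removal code: it has no middle children, and it copies the replacement key into the removed node instead of relinking nodes, which gives the same tree shape.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Plain BST of characters whose removal picks successor or predecessor at random (illustrative). */
final class RandomizedBstRemoval {

    static final class Node {
        char key;
        Node left, right;
        Node(char key) { this.key = key; }
    }

    private final Random random;
    Node root;

    RandomizedBstRemoval(long seed) { this.random = new Random(seed); }

    void insert(char key) { root = insert(root, key); }

    private Node insert(Node n, char key) {
        if (n == null) return new Node(key);
        if (key < n.key) n.left = insert(n.left, key);
        else if (key > n.key) n.right = insert(n.right, key);
        return n;
    }

    void remove(char key) { root = remove(root, key); }

    private Node remove(Node n, char key) {
        if (n == null) return null;
        if (key < n.key) { n.left = remove(n.left, key); return n; }
        if (key > n.key) { n.right = remove(n.right, key); return n; }
        if (n.left == null) return n.right; // zero or one child: splice the node out
        if (n.right == null) return n.left;
        if (random.nextBoolean()) {
            // Replace with the leftmost node of the right subtree.
            Node successor = n.right;
            while (successor.left != null) successor = successor.left;
            n.key = successor.key;
            n.right = remove(n.right, successor.key);
        } else {
            // Replace with the rightmost node of the left subtree.
            Node predecessor = n.left;
            while (predecessor.right != null) predecessor = predecessor.right;
            n.key = predecessor.key;
            n.left = remove(n.left, predecessor.key);
        }
        return n;
    }

    /** Returns the keys in ascending order. */
    List<Character> inOrder() {
        List<Character> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(Node n, List<Character> out) {
        if (n == null) return;
        collect(n.left, out);
        out.add(n.key);
        collect(n.right, out);
    }
}
```

Always choosing the same side would gradually skew the tree toward the other side over many removals; picking at random avoids that systematic bias.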
This implementation is not synchronized. If multiple threads access a tree concurrently, and at least one of the threads modifies the tree, it must be synchronized externally. This is typically accomplished by synchronizing on some object that naturally encapsulates the tree.
@version 1 August 2013