2015-11-14
Monq.jfa faster again
The implementation of a deterministic finite automaton (DFA) has
one central component: transition tables. Each
node of the DFA contains one. It maps the incoming character to
an outgoing edge, or rather to another node of the DFA, namely
the target node of the transition by this character. The fastest
implementation is an array indexed by the incoming
character. For a character ch
and a transition
table trans
this is just one array access:
trans[ch]
Many transition tables are, however, extremely sparse with sometimes only a handful characters leading out of a node. In this case, using the full array of $2^{16}$ character values is an extreme waste of main memory. From the beginning monq optimized these array based transition tables such they only contain entries for the range of characters between the first and the last character used. Retrieving the transition then looks like
trans[ch-firstChar]
which is only minimally slower than storing the full array. But when only a few characters between the first and the last used character indeed have a transition, this still is a big waste of main memory. Therefore there was always a second implementation for the transition table, that only stores the available transitions and searches for the selected one with a binary search.
For every node, the decision between the two implementations was based on an estimated memory footprint. And the smaller one was always chosen.
In version 1.4 of monq I introduced a parameter that allows to trade memory usage for speed. If it is set high enough, no transition table will use a binary search. And the good news: my performance tests show that for the use case of searching a number of words in a long text, a speedup between 20% and 40% is possible.
no. words | 3 | 4 | 11 | 17 | 26 | 35 |
---|---|---|---|---|---|---|
$\Delta t$ regex | 9 | 12 | 28 | 42 | 68 | 82 |
$\Delta t$ monq | 40 | 47 | 49 | 54 | 55 | 56 |
$\Delta t$ monq with StringBuilder |
24 | 24 | 26 | 27 | 29 | 30 |
$\Delta t$ monq trading memory for speed | 21 | 22 | 23 | 25 | 25 | 25 |