This series has since been revised and combinedLast time we met, you learned about Thompson's Construction - an algorithm for building NFAs from regular expressions. I also presented a "starter" implementation of a basic NFA class in C++. In this article, you'll see the connection between these two. I will present a complete C++ implementation of Thompson's Construction. Additionally, you will see how to represent regular expressions with parse trees, and get acquainted with the code that builds complete NFAs from these regular expressions. [size="5"]Implementing Thompson's Construction Thompson's Construction, as you probably remember (If you don't, it is a good time to skim through the previous article) tells us how to build NFAs from trivial regular expressions and then compose them into more complex NFAs. Let's start with the basics: [size="5"]The simplest regular expression The most basic regular expression is just some single character, for example a. The NFA for such a regex is:
Here is the implementation:
// Builds a basic, single input NFA
//
NFA build_nfa_basic(input in)
{
NFA basic(2, 0, 1);
basic.add_trans(0, 1, in);
return basic;
}
Just to remind you about our NFA implementation: The first line of the function creates a new (with no transitions yet) NFA of size 2 (that is: with 2 states), and sets state 0 to be the initial state and state 1 to be the final state. The second line adds a transition to the NFA that says "in moves from state 0 to state 1". That's it - a simple regex, a simple construction procedure.
Note that this procedure is suited for the construction of an eps transition as well (it was presented separately in the last article's discussion of Thompson's Construction).
[size="5"]Some changes to the NFA class
Last article's implementation of a simple NFA class was just a starting point and we have quite a few changes to make.
First of all, we need direct access to all of the class's data. Instead of providing get and set accessors (which I personally dislike), all of the class members ([font="Courier New"]size[/font], [font="Courier New"]initial[/font], [font="Courier New"]final[/font] and [font="Courier New"]trans_table[/font]) have been made public.
Recall what I told you about the internal representation inside the NFA class - it's a matrix representing the transition table of a graph. For each [font="Courier New"]i[/font] and [font="Courier New"]j[/font], [font="Courier New"]trans_table[arrayi][/arrayi][j][/font] is the input that takes the NFA from state [font="Courier New"]i[/font] to state [font="Courier New"]j[/font]. It's [font="Courier New"]NONE[/font] if there's no such transition (hence a lot of space is wasted - the matrix representation, while fast, is inefficient in memory).
Several new operations were added to the NFA class for our use in the NFA building functions. Their implementation can be found in nfa.cpp (one of the files in the .zip that comes with this article). For now, try to understand how they work (it's really simple stuff), later you'll see why we need them for the implementation of various Thompson Construction stages. It may be useful to have the code of nfa.cpp open in an editor, and to follow the code for the operations while reading these explanations. Here they are:
- [font="Courier New"]append_empty_state[/font] - I want to append a new, empty state to the NFA. This state will have no transitions to it and no transitions from it. If this is the transition table before the appending (a sample table with size 5 - states 0 to 4): Then this is the table after the appending: The shaded cells are the transitions of the original table (be they empty or not), and the white cells are the new table cells - containing [font="Courier New"]NONE[/font].
- [font="Courier New"]shift_states[/font] - I want to rename all NFA's states, shifting them "higher" by some given number. For instance, if I have 5 states numbered 0 - 4, and I want to have the same states, just named 2 - 6, I will call [font="Courier New"]shift_states(2)[/font], and will get the following table:
- [font="Courier New"]fill_states[/font] - I want to copy the states of some other NFA into the first table cells of my NFA. For instance, if I take the shifted table from above, and fill its first two states with a new small NFA, I will get (the new NFA's states are sky blue):
Note that using [font="Courier New"]fill_states[/font] after [font="Courier New"]shift_states[/font] is not incidental. These two operations were created to be used together - to concatenate two NFAs. You'll see how they are employed shortly. Now I will explain how the more complex operations of Thompson's Construction are implemented. You should understand how the operations demonstrated above work, and also have looked at their source code (a good example of our NFA's internal table manipulation). You may still lack the "feel" of why these operations are needed, but this will soon be covered. Just understand how they work, for now.
[size="5"]Implementing alternation: a|b
Here is the diagram of NFA alternation from the previous article:
NFA a = build_nfa_basic('a'); NFA b = build_nfa_basic('b'); NFA alt = build_nfa_alter(a, b); NFA str = build_nfa_star(alt); NFA sa = build_nfa_concat(str, a); NFA sab = build_nfa_concat(sa, b); NFA sabb = build_nfa_concat(sab, b);
With these steps completed, sabb is the NFA representing (a|b)*abb . Note how simple it's to build NFAs this way! There's no need to specify individual transitions like we did before. In fact, it's not necessary to understand NFAs at all - just build the desired regex from its basic blocks, and that's it. [size="5"]Even closer to complete automation Though it has now become much simpler to construct NFAs from regular expressions, it's still not as automatic as we'd like it to be. One still has to explicitly specify the regex structure. A useful representation of structure is an expression tree. For example, the construction above closely reflects this tree structure:typedef enum {CHR, STAR, ALTER, CONCAT} node_type; // Parse node // struct parse_node { parse_node(node_type type_, char data_, parse_node* left_, parse_node* right_) : type(type_), data(data_), left(left_), right(right_) {} node_type type; char data; parse_node* left; parse_node* right; };
This is a completely normal definition of a binary tree node that contains data and some type by which it is identified. Let us ignore, for the moment, the question of how such trees are built from regexes (if you're very curious - it's all in regex_parse.cpp), and instead think about how to build NFAs from such trees. It's very straight forward since the parse tree representation is natural for regexes. Here is the code of the [font="Courier New"]tree_to_nfa[/font] function from regex_parse.cpp:NFA tree_to_nfa(parse_node* tree) { assert(tree); switch (tree->type) { case CHR: return build_nfa_basic(tree->data); case ALTER: return build_nfa_alter(tree_to_nfa(tree->left), tree_to_nfa(tree->right)); case CONCAT: return build_nfa_concat(tree_to_nfa(tree->left), tree_to_nfa(tree->right)); case STAR: return build_nfa_star(tree_to_nfa(tree->left)); default: assert(0); } }
Not much of a rocket science, is it? The power of recursion and trees allows us to build NFAs from parse trees in just 18 lines of code... [size="5"]From a regex to a parse tree If you've already looked at regex_parse.cpp, you surely noted that it contains quite a lot of code, much more than I've show so far. This code is the construction of parse trees from actual regexes (strings like (a|b)*abb). I really hate to do this, but I won't explain how this regex-to-tree code works. This article series is complicated enough without getting into parsing. You'll just have to believe me it works (or study the code - it's there!). As a note, the parsing technique I employ to turn regexes into parse trees is called Recursive Descent parsing. It's an exciting topic and there is plenty of information available on it, if you are interested. [size="5"]We've come a long way... Let's see what we have accomplished: we have implemented the complete process of taking regular expressions and turning them into NFAs - automatically! Our program turns the regexes into parse trees, walks these parse trees and creates NFAs from them using Thompson's Construction. To see this in action, compile the files nfa.cpp and regex_parse.cpp together (make sure nfa.h resides in the same directory). This would be the command for those who use GCC: [font="Courier New"]g++ -o regex2nfa regex_parse.cpp nfa.cpp [/font] regex_parse.cpp contains a small main function that takes the first argument and displays the NFA resulting from it. For instance, running: [font="Courier New"]regex2nfa "(a|b)*abb" [/font] Displays: [font="Courier New"]This NFA has 11 states: 0 - 10 The initial state is 0 The final state is 10 Transition from 0 to 1 on input EPS Transition from 0 to 7 on input EPS Transition from 1 to 2 on input EPS Transition from 1 to 4 on input EPS Transition from 2 to 3 on input a Transition from 3 to 6 on input EPS Transition from 4 to 5 on input b Transition from 5 to 6 on input EPS Transition from 6 to 1 on input EPS Transition from 6 to 7 on input EPS Transition from 7 to 8 on input a Transition from 8 to 9 on input b Transition from 9 to 10 on input b[/font] Which is, if you recall, the exact same NFA we dutifully crafted by hand for (a|b)*abb in the previous article. Now, we can generate it automatically! [size="5"]A note about the source code Along with this article, comes a .zip file containing all the code. It currently consists of 3 files: nfa.h, nfa.cpp and regex_parse.cpp (which contains a main() function). All the code, like all my future code for the series is in the public domain - you may do anything you wish with it. If you have any problem compiling and running it, feel free to contact me, I'll help. I'd also be quite happy to receive feedback and bug reports. [hr](C) Copyright by Eli Bendersky, 2003. All rights reserved.