Greetings! I'm about to step into a somewhat larger topic that does not seem to be related to games directly, but it actually is. The topic covers automata, virtual machines and compilers (and not just them!). It is too large to fit into a single article, so it will be split across several; after each article I will provide code that you can run and that is easy to understand. What knowledge do you need to follow this article? You definitely need to know something about programming. Some knowledge of automata and assembly is also welcome. And of course you need your favourite programming language to work in. In this first article I would like to demonstrate the basics behind simple math expressions: the very basic theory behind compilers, assemblers and computation machines.
Formal Definitions
Language Definition
In the beginning we need to define our language. For a start it is going to be fairly simple (hopefully we are going to extend it further), and our expressions will consist only of numbers (integer values), operators (for the beginning we're fine with +, -, * and /) and parentheses.
Backus-Naur Form
For the language definition we are going to use the so-called BNF (Backus-Naur Form), which is a notation for describing context-free grammars. A grammar in this form is a set of simple rules written as:
<ident> ::= <expression>
Where the value on the left side is a non-terminal symbol, and on the right side there is a sequence of terminal symbols (if any) and non-terminal symbols (if any). On the right side we can have multiple alternatives separated by | (which means we can replace the non-terminal on the left with any one of the alternatives on the right).
Parts of an expression can be enclosed in '[' and ']' brackets followed by '*' (which means that the enclosed part is repeated N times, where N >= 0), by '+' (which means that it is repeated N times, where N >= 1), or by '^N' where N is a number (which means that it is repeated exactly N times). An example language is:
<integer> ::= [0..9]+
<add_operation> ::= + | -
<mul_operation> ::= * | /
<expression> ::= <term> [<add_operation> <term>]*
<term> ::= <factor> [<mul_operation> <factor>]*
<factor> ::= (<expression>) | <integer>
<program> ::= [<expression>]*
Automata
With this grammar we can generate any expression that is valid in the language, which is great, but... we do not need to generate all valid inputs of our language - that is left to the user. We need to accept the language, and during that process generate code that is simple enough for some sort of virtual machine to execute.
As the language is context-free, it is possible to use a recursive descent parser to accept (i.e. it is a valid expression) or reject (i.e. it is an invalid expression) the input. I'm not going to write too much about parsers in this series - I challenge you to try writing them. Of course you are free to use Flex/Bison or any other generator (but I do not recommend this unless you know exactly how you would parse something by hand - these production tools are awesome, but only once you understand how the actual parsing works).
What is a recursive descent parser? It is a top-down parser built from mutually recursive procedures, where each procedure typically implements one (and sometimes more) of the productions of the grammar it recognizes. The exciting thing about this is that the resulting code closely resembles the BNF!
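To make that resemblance concrete, here is a minimal recursive descent recognizer for the example grammar, sketched in Python. The function names and the regex-based tokenization are my own choices, not part of the accompanying project:

```python
import re

def accepts(src):
    """Returns True if src is a valid expression of the grammar, False otherwise."""
    # Split the input into integer literals, operators and parentheses.
    tokens = re.findall(r"\d+|[+\-*/()]", src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat():
        nonlocal pos
        pos += 1

    def factor():          # <factor> ::= (<expression>) | <integer>
        if peek() == "(":
            eat()
            expression()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            eat()
        elif peek() is not None and peek().isdigit():
            eat()
        else:
            raise SyntaxError("expected integer or '('")

    def term():            # <term> ::= <factor> [<mul_operation> <factor>]*
        factor()
        while peek() in ("*", "/"):
            eat()
            factor()

    def expression():      # <expression> ::= <term> [<add_operation> <term>]*
        term()
        while peek() in ("+", "-"):
            eat()
            term()

    try:
        expression()
        return pos == len(tokens)  # every token must be consumed
    except SyntaxError:
        return False
```

Note how each procedure is a direct transcription of one grammar rule; the article's compiler below has exactly this shape, plus code emission.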
Compilation and Execution
Compilation is the process of producing machine code from some other language. Let me present a simple abstraction here; let us have a 'program':
"Compute three plus four"
Such a sentence is, well... not really useful; we need to translate it into something we can work with. One such formal language is math. We translate the sentence into:
3+4
We know a way to compute this (there are multiple computation models for it); let me present one:
MOVE reg0 3
MOVE reg1 4
ADD reg0 reg1
PRINT reg0
Having a machine with 2 registers and 3 instructions, where:
- Instruction MOVE takes the 2nd argument (a value) and moves it into the register given by the 1st argument.
- Instruction ADD takes the value stored in the register given by the 2nd argument and adds it to the value stored in the register given by the 1st argument.
- Instruction PRINT prints the value in the register given by the 1st argument onto the screen.
The execution of such a program works as follows:
INSTRUCTION   | REGISTERS | SCREEN
(start)       | [-, -]    | []
MOVE reg0 3   | [3, -]    | []
MOVE reg1 4   | [3, 4]    | []
ADD reg0 reg1 | [7, 4]    | []
PRINT reg0    | [7, 4]    | [7]
Such an approach is known as imperative execution (we are giving the computing machine orders about what to do). This paradigm has one large advantage over other styles of execution: it is rather simple to build a machine that executes an imperative language.
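The three-instruction machine above can be sketched in a few lines of Python (parsing register names like reg0 by their numeric suffix is my own convention):

```python
def run(program):
    """Executes a list of MOVE/ADD/PRINT instructions on a 2-register machine."""
    regs = [None, None]   # reg0, reg1
    screen = []
    for line in program:
        op, *args = line.split()
        if op == "MOVE":      # MOVE regN value -> regN = value
            regs[int(args[0][3:])] = int(args[1])
        elif op == "ADD":     # ADD regA regB -> regA += regB
            regs[int(args[0][3:])] += regs[int(args[1][3:])]
        elif op == "PRINT":   # PRINT regN -> write regN to the screen
            screen.append(regs[int(args[0][3:])])
    return regs, screen

regs, screen = run(["MOVE reg0 3", "MOVE reg1 4", "ADD reg0 reg1", "PRINT reg0"])
# regs == [7, 4], screen == [7]
```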
Simple calculator
Let us begin with something simple - and what is the simplest language we can imagine? A calculator, of course! Our goal is to be able to handle 4 operators (+, -, * and /), parentheses and integer numbers. So, let us formally define the language in BNF.
<integer> ::= [0..9]+
<add_operation> ::= + | -
<mul_operation> ::= * | /
<expression> ::= <term> [<add_operation> <term>]*
<term> ::= <factor> [<mul_operation> <factor>]*
<factor> ::= (<expression>) | <integer>
<program> ::= [<expression>]*
Note: this language is exactly the one from the previous example.
Compiler
These rules are to be rewritten into source code; I'm going to provide pseudo-code here (you can see full working code in the accompanying project).
Let us start with Integer. What do we want to do when we hit an integer? That is simple - we want to put its value into a register (i.e. somewhere where we can work with it further):
Note that in our assembly we move the right value or register into the left one. Also, after each operation the result is stored in the left register.
Integer()
{
    printLine("mov.reg.i32 r0 %s", GetValue());
}
Where GetValue is a function that takes the next token from the stream and validates that it is an integer (containing only the digits 0..9). Such a function can look like the following:
GetValue()
{
    string output;
    int i = 0;
    string token = GetNextToken(); // Reads the current token and moves to the next one
    do
    {
        if (token[i] in [0..9])
        {
            output += token[i];
        }
        else
        {
            Expected("Expected a digit");
        }
        i++;
    }
    while (i < token.length);
    return output;
}
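For reference, a token stream like the one GetNextToken walks through can be produced in one line; this regex-based approach is my own shortcut, not necessarily how the accompanying project tokenizes:

```python
import re

def tokenize(src):
    # Splits the input into integer literals, operators and parentheses,
    # ignoring whitespace between tokens.
    return re.findall(r"\d+|[+\-*/()]", src)

tokenize("(3 + 4 * (7 - 2)) / 2")
# ['(', '3', '+', '4', '*', '(', '7', '-', '2', ')', ')', '/', '2']
```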
Note: for '+' (i.e. at least one repetition) we use a do-while construct. I'm intentionally going to skip the add_operation and mul_operation rules and jump straight to Factor.
Factor()
{
    if (NextToken is LEFT-PARENTHESIS)
    {
        Match(LEFT-PARENTHESIS); // Eats current token, matches it against the argument and moves to the next token
        Expression();
        Match(RIGHT-PARENTHESIS);
    }
    else
    {
        Integer();
    }
}
This was quite obvious - inside parentheses we always have another expression, outside we have just an integer. The following two, Term and Expression, are the interesting ones:
Term()
{
    Factor();
    while (NextToken is MULTIPLICATION | DIVISION)
    {
        printLine("push.i32 r0");
        if (NextToken is MULTIPLICATION)
        {
            Match(MULTIPLICATION); // Eats current token, matches it against the argument and moves to the next token
            Factor();
            printLine("pop.i32 r1");
            printLine("mul.i32 r0 r1");
        }
        else if (NextToken is DIVISION)
        {
            Match(DIVISION);
            Factor();
            printLine("pop.i32 r1");
            printLine("div.i32 r1 r0");
            printLine("mov.reg.reg r0 r1");
        }
        else if (NextToken is NULL) // If there is no next token
        {
            // Crash compilation and print an error
            Expected("Expected multiplication or division operation");
        }
    }
}
Expression()
{
    Term();
    while (NextToken is ADDITION | SUBTRACTION)
    {
        printLine("push.i32 r0");
        if (NextToken is ADDITION)
        {
            Match(ADDITION); // Eats current token, matches it against the argument and moves to the next token
            Term();
            printLine("pop.i32 r1");
            printLine("add.i32 r0 r1");
        }
        else if (NextToken is SUBTRACTION)
        {
            Match(SUBTRACTION);
            Term();
            printLine("pop.i32 r1");
            printLine("sub.i32 r0 r1");
            printLine("neg.i32 r0");
        }
        else if (NextToken is NULL) // If there is no next token
        {
            // Crash compilation and print an error
            Expected("Expected addition or subtraction operation");
        }
    }
}
What happens here? I know it can seem confusing at first, but we are doing exactly what the BNF rules tell us to do. Note that we handled add_operation/mul_operation here in the while condition, and based upon which operation we encountered we do different things. We have actually started working with operator precedence (that is why we always push the register value onto the stack and pop it before working with it); the rest should be clear.
The addition and subtraction instructions are inside Expression because they have lower priority than the multiplication and division handled in Term (in the recursion, the deeper we are, the higher the operator precedence) - so resolving parentheses and actual values has the highest precedence, multiplication and division are in the middle, and addition and subtraction have the lowest precedence.
I know this is not the easiest thing to understand, and I highly encourage you to run the compiler in debug mode so you can actually see what is happening (and that precedence is resolved correctly this way).
When we implement the compiler like this and run it on an input like:
(3 + 4 * (7 - 2)) / 2
We obtain the following assembly:
mov.reg.i32 r0 3
push.i32 r0
mov.reg.i32 r0 4
push.i32 r0
mov.reg.i32 r0 7
push.i32 r0
mov.reg.i32 r0 2
pop.i32 r1
sub.i32 r0 r1
neg.i32 r0
pop.i32 r1
mul.i32 r0 r1
pop.i32 r1
add.i32 r0 r1
push.i32 r0
mov.reg.i32 r0 2
pop.i32 r1
div.i32 r1 r0
mov.reg.reg r0 r1
Let us demonstrate what the execution would look like (by 'stack' we mean a LIFO stack):
mov.reg.i32 r0 3 // Registers [ 3, -] Stack []
push.i32 r0 // Registers [ 3, -] Stack [ 3]
mov.reg.i32 r0 4 // Registers [ 4, -] Stack [ 3]
push.i32 r0 // Registers [ 4, -] Stack [ 3, 4]
mov.reg.i32 r0 7 // Registers [ 7, -] Stack [ 3, 4]
push.i32 r0 // Registers [ 7, -] Stack [ 3, 4, 7]
mov.reg.i32 r0 2 // Registers [ 2, -] Stack [ 3, 4, 7]
pop.i32 r1 // Registers [ 2, 7] Stack [ 3, 4]
sub.i32 r0 r1 // Registers [-5, 7] Stack [ 3, 4]
neg.i32 r0 // Registers [ 5, 7] Stack [ 3, 4]
pop.i32 r1 // Registers [ 5, 4] Stack [ 3]
mul.i32 r0 r1 // Registers [20, 4] Stack [ 3]
pop.i32 r1 // Registers [20, 3] Stack []
add.i32 r0 r1 // Registers [23, 3] Stack []
push.i32 r0 // Registers [23, 3] Stack [23]
mov.reg.i32 r0 2 // Registers [ 2, 3] Stack [23]
pop.i32 r1 // Registers [ 2, 23] Stack []
div.i32 r1 r0 // Registers [ 2, 11] Stack []
mov.reg.reg r0 r1 // Registers [11, 11] Stack []
I recommend checking a few more examples and hand-computing the register and stack values as you go. It will definitely help you understand how operator precedence works here and how the calculator works.
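If you would rather let the machine do the hand-computation, a straightforward interpreter for the textual assembly can be sketched like this (Python, names of my own choosing; as discussed below, the actual virtual machine works on binary code instead of strings):

```python
def execute(asm):
    """Interprets the calculator assembly and returns the final value of r0."""
    regs = {"r0": 0, "r1": 0}
    stack = []
    for line in asm:
        op, *args = line.split()
        if op == "mov.reg.i32":
            regs[args[0]] = int(args[1])
        elif op == "mov.reg.reg":
            regs[args[0]] = regs[args[1]]
        elif op == "push.i32":
            stack.append(regs[args[0]])
        elif op == "pop.i32":
            regs[args[0]] = stack.pop()
        elif op == "add.i32":
            regs[args[0]] += regs[args[1]]
        elif op == "sub.i32":
            regs[args[0]] -= regs[args[1]]
        elif op == "neg.i32":
            regs[args[0]] = -regs[args[0]]
        elif op == "mul.i32":
            regs[args[0]] *= regs[args[1]]
        elif op == "div.i32":
            # C-style integer division truncating toward zero
            regs[args[0]] = int(regs[args[0]] / regs[args[1]])
    return regs["r0"]
```

Feeding it the 19-instruction listing above yields 11, matching the hand-traced run.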
Assembler
While I have already mentioned how the computation works, I don't really recommend writing a virtual machine that processes strings. We can do better!
Performing a very simple assembly step and storing everything in binary form is a far better way - all you need is to think up opcodes for your instructions, assign IDs to your registers and store all these numbers in binary form.
Note that this way you could actually assemble into x86 or x64 if you wished, but that is not necessary. Implementing such an assembler is very straightforward, and I recommend looking into the accompanying source code; even though the implementation there is not highly efficient, it is easy to understand.
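As a sketch of such an encoding, here is one possible scheme in Python. The opcode numbers, register IDs and the 4-byte little-endian immediates are entirely my own invention for illustration, not the encoding used in the accompanying project:

```python
# Invented opcode and register ID assignment (illustrative only).
OPCODES = {"mov.reg.i32": 0, "mov.reg.reg": 1, "push.i32": 2, "pop.i32": 3,
           "add.i32": 4, "sub.i32": 5, "mul.i32": 6, "div.i32": 7, "neg.i32": 8}
REGS = {"r0": 0, "r1": 1}

def assemble(asm):
    """Encodes textual assembly lines into a flat byte string."""
    code = bytearray()
    for line in asm:
        op, *args = line.split()
        code.append(OPCODES[op])
        for a in args:
            if a in REGS:
                code.append(REGS[a])
            else:
                # store an immediate as a 4-byte little-endian signed integer
                code += int(a).to_bytes(4, "little", signed=True)
    return bytes(code)

assemble(["mov.reg.i32 r0 3"])  # b'\x00\x00\x03\x00\x00\x00'
```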
Virtual Machine
Up to this point we could only execute the code "by hand", but that ends now! What does our virtual machine need? It needs some memory where we can store the stack and the actual code to be executed, and at least 4 registers - our code works with 2 (r0 and r1), and we need 2 more: one for the stack pointer (as we need the stack as temporary memory for our computation) and one for the instruction pointer (so we know what we are executing right now).
Now we read our application (in binary) into memory, place the instruction pointer at the address where our application begins, and the stack pointer right after the end of our application in memory. E.g.:
|O|U|R|_|A|P|P|L|I|C|A|T|I|O|N|_|S|O|U|R|C|E|_|C|O|D|E| | | | | | | | | | | | | |
 ^                                                     ^
 IP                                                    SP
Where IP represents the memory address the instruction pointer points to, and SP represents the memory address the stack pointer points to - in this case IP=0 and SP=27. As each sub-computation stores its result in r0 (register 0), the result of the whole computation will also end up in register 0. Note that in the accompanying source code I actually print both registers once the computation finishes.
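The fetch-decode-execute loop driven by an instruction pointer can be illustrated with the following self-contained sketch; the one-byte opcodes and 8-bit immediates here are invented for illustration and cover only a tiny subset of the instruction set:

```python
def run_binary(code):
    """Executes a tiny invented bytecode: mov.reg.i32 (0), push (2), pop (3), add (4)."""
    memory = bytearray(code) + bytearray(32)   # program image, then stack space
    ip, sp = 0, len(code)                      # IP at the start, SP right after the code
    regs = [0, 0]                              # r0, r1
    while ip < len(code):
        op = memory[ip]                        # fetch
        if op == 0:                            # mov.reg.i32 reg imm8
            regs[memory[ip + 1]] = memory[ip + 2]
            ip += 3
        elif op == 2:                          # push.i32 reg
            memory[sp] = regs[memory[ip + 1]]
            sp += 1
            ip += 2
        elif op == 3:                          # pop.i32 reg
            sp -= 1
            regs[memory[ip + 1]] = memory[sp]
            ip += 2
        elif op == 4:                          # add.i32 regA regB
            regs[memory[ip + 1]] += regs[memory[ip + 2]]
            ip += 3
    return regs[0]

# 3 + 4 compiled by hand: mov r0 3; push r0; mov r0 4; pop r1; add r0 r1
run_binary(bytes([0, 0, 3, 2, 0, 0, 0, 4, 3, 1, 4, 0, 1]))  # → 7
```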
Conclusion
This article showed basic principles of automata theory, formal languages and compilers. I know it wasn't much, and creating something useful out of this will take a lot more effort and time. Compilers aren't easy stuff, and there are not many recent articles about them. If time allows, I'd be very glad to continue step by step, adding more and more features to this little compiler and eventually making it useful. Next time, I promise to talk about types!
Accompanying source code demonstrating a sample implementation: https://github.com/Zgragselus/Compiler/tree/v1.0
Changelog:
29th September 2020 - Reformatted the article; the BNF was wrong; added a link to the GitHub repository, which contains a proper project with a proper tag.
30th March 2020 - Reformatted article for new GameDev.net formatting
Notes:
Mirror posted on my site - https://www.otte.cz/new/page/articles/post/2