A person can always use a common off-the-shelf scripting language, such as Lua, Python, Squirrel, or JavaScript, but sometimes there may be reason to implement one's own scripting language or interpreter, whether for practical reasons, as a learning experience, or just for fun.
Writing a compiler or interpreter can also be an interesting experience. (Granted, a person may ask themselves whether implementing something like this is actually a good idea, but the assumption here is that this much has already been decided.)
But what sort of interpreter, for what sort of language? While there may be many possible choices in the design process, not all choices necessarily carry equal weight.
There may also be a range of relative complexity and performance, all based on what a person wants to do.
For example, anywhere from a simple interpreter directly parsing and executing lines of code line-by-line, to producing highly optimized machine-code with performance on par with that of traditional static compilers.
A person may also realize that, for their uses, they never need to go much beyond a parser and directly executing the syntax tree, and this may well be a good option.
General Design Tradeoffs
There may be a number of tradeoffs in the design of both a scripting language and the underlying interpreter. I will attempt to address a few of them here.
Syntax
There are many directions a person can go with syntax.
Often, this ends up boiling down to one of several major alternatives:
- Traditional C-family style syntax
- Usually a specialized and often minimalistic syntax
- Pretty much everything else...
C style syntax has the advantage of being, at this point, nearly ubiquitous and readily understood by most experienced programmers. There is no confusion, say, over what control-flow constructs exist when a person can just pull out a generic "for()" or "while()" loop and call it done. "How do I exit this loop early? 'break;' Yeah, I knew that..."
This can help to avoid much of the learning curve of a language, so a programmer new to the language can spend less time reading documentation, and get more directly to writing code.
However, some people regard this style of syntax to be cumbersome or awkward.
And, clearly, there are common cases which may need to exist in a language which aren't suitably addressed by the existing available languages (after all, the person is probably implementing their own custom language for some reason). Likewise, language design may be, in a way, a chance to express ideas or preferences which aren't readily available in many other languages, and there is nothing saying a person has to do everything exactly like every other common language.
However, for nearly everything which is commonly done, there may be a good reason it is done that way, and sometimes it is worthwhile to try to understand why things are as they often are, and how they might be improved upon, rather than simply making them different for the sake of being different (does being different really make it better?).
Another common alternative to C-style syntax is a trend towards minimalism, which usually seeks some minimal set of syntactic constructions in which to express the language. This is sometimes equated with simplicity in the general sense: because the syntax is minimal, it is assumed to also be simple, both in concept and implementation. This is, however, very often not the case.
A parser for a C-like syntax is not necessarily that much more complicated than, say, one for a Lisp-like syntax, and at the level of the compiler itself these differences may in fact become largely insignificant, making the choice of syntax more of an aesthetic tradeoff than a functional one.
A drawback is that very often these sorts of minimalist syntax designs will be very one-off, and may be very opaque to someone not already familiar with the language. While a counter-claim can be made about the opaqueness of the traditional C style syntax, it does have an advantage: pretty much everyone with any real programming experience is likely to have already seen something similar.
Sometimes they may also introduce other issues, such as there being no obvious or good way to format the code, or a dependency on specific editors or similar to make it usable, which isn't generally a good thing.
However, a more subtle lesson may lurk in this matter:
Often, these minimal syntax designs are much more orthogonal than the traditional C style syntax. It is like asking the question: "Why are braces optional around the statement of an 'if()' block, but not around a function body?" Likewise for the whole matter of the difference between statements and expressions, or the need for an explicit return statement. Or, for example, why there may be a hard line between executable code and declarative data (can we also work with the syntax as part of an expression?).
A person could instead, for example, see it another way: statements and expressions are one and the same, merely differing in context, with statements potentially also having a return value in the same way as expressions.
Other relevant questions may be, for example, why types or identifiers can't also be used as values... But these are more a matter of semantics than of syntax.
Sometimes, we can't have everything, and it may be a tradeoff along the lines of what is more useful. For example, is it more useful to be able to easily and clearly express commonly used constructions, or to be able to work with executable code as if it were data?
Objects, Anyone?
Class/Instance Object Systems are pretty much the standard solution to the matter of objects, but these are not the only solution. For example, a language may use prototype objects instead, or possibly an engine-specific 'entity' type as the primary, or sole, type of compound object.
Within class/instance, there is the tradeoff of single inheritance plus interfaces versus a multiple-inheritance model. Single inheritance has the advantage of being simpler to implement (since every child class simply appends its contents onto the parent class), and is "generally good enough" for common situations. Interfaces and default methods can also address many of the limitations of single inheritance.
Multiple Inheritance adds some possibilities, but also a lot of potential complexities, and a few "ugly issues" regarding things like language semantics. For example, what happens when a child class derives from two parents which both derive from a common parent? What happens if both parents provide an implementation of a given method? ...
Prototype objects are another possibility, where every object is essentially an unstructured bag of fields and methods which may be modified freely. Instead of class inheritance, we may instead use delegation, where each object may in turn link to other objects which implement other fields and behaviors.
Another option is also using class/instance, but making the ability to dynamically add fields or delegate to other objects be available as optional features.
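As a rough sketch of the prototype-object idea (all names here are hypothetical, and a real VM would likely use variant/boxed values rather than raw pointers), an object might just be a list of named slots plus a delegation link:

#include <string.h>
#include <stddef.h>

/* Hypothetical prototype-style object: an unstructured bag of named
   slots, with an optional delegation link to another object. */
typedef struct Slot_s Slot;
typedef struct ProtoObj_s ProtoObj;

struct Slot_s {
    char *name;     /* field or method name */
    void *value;    /* slot value (a variant/boxed value in a real VM) */
    Slot *next;
};

struct ProtoObj_s {
    Slot *slots;          /* this object's own fields and methods */
    ProtoObj *delegate;   /* object consulted when a slot is missing here */
};

/* Look up a slot locally, then walk the delegation chain. */
void *Obj_GetSlot(ProtoObj *obj, const char *name)
{
    Slot *s;
    for (; obj; obj = obj->delegate)
        for (s = obj->slots; s; s = s->next)
            if (!strcmp(s->name, name))
                return s->value;
    return NULL;   /* not found anywhere in the chain */
}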
Static vs Dynamic Types
Static types allow superior performance, but may place limits on the kinds of things which may be readily done, and by themselves may require going through contortions to build dynamic data structures (how do you build a structure which may hold several types of item if the language only ever allows a single type of item per variable?).
Dynamic types are easier initially, and can be optimized via type inference, but a VM using type inference is typically more complex than one using a simplistic static type system. By themselves, dynamic types may introduce the problem of endlessly checking types at runtime (reducing performance), and may hinder the compiler's ability to detect or warn about "obvious" programming errors.
While it may seem like dynamic typing eliminates the need to deal with type issues, more often it simply shifts them from being a compile-time problem to being a run-time problem, and may in effect obscure holes or deficiencies in the type system, resulting in problematic edge-case behaviors. For a behavior to work, it either needs to exist statically (within the compiler itself) or dynamically (within the runtime).
A more reasonable tradeoff may be having both: statically typed values, plus a 'variant' type or similar which implements dynamically typed semantics. A drawback of doing this is that the implementor may need to deal with some of the issues of both static and dynamic type systems in order to make it work correctly. It may also require implementing alternate versions of many of the same operations (such as one operation for adding two integers as statically typed values, and another for adding two integers as variant values).
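As a rough sketch of the duplication this implies (the names and the struct-with-tag representation here are hypothetical, not from any particular VM): the statically typed add works directly on raw integers, while the variant add has to check tags at runtime:

#include <stdint.h>

/* Hypothetical variant value: an explicit tag plus a payload. */
typedef enum { VT_INT, VT_DOUBLE, VT_OBJ } VTag;
typedef struct {
    VTag tag;
    union { int64_t i; double d; void *obj; } u;
} Variant;

/* Statically typed add: the compiler already knows both operands are ints. */
int64_t AddI(int64_t a, int64_t b) { return a + b; }

/* Dynamically typed add: tags are checked (and mismatches handled) at runtime. */
Variant VariantAdd(Variant a, Variant b)
{
    Variant r;
    if (a.tag == VT_INT && b.tag == VT_INT)
        { r.tag = VT_INT; r.u.i = a.u.i + b.u.i; }
    else if (a.tag == VT_DOUBLE && b.tag == VT_DOUBLE)
        { r.tag = VT_DOUBLE; r.u.d = a.u.d + b.u.d; }
    else
        { /* mixed or unsupported operands: coerce or signal an error here */
          r.tag = VT_OBJ; r.u.obj = 0; }
    return r;
}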
There are differences in how to best represent these type-systems as well. For example, in a VM such as the JVM, each static type is represented via a "signature", which is an ASCII representation of the type (the .NET CLR also uses signatures, but they are binary). Each primitive type may then be assigned a sequence of one or more characters, and complex types may be identified via a character-sequence followed by a qualified name for the resource.
For example (partly derived from the IA64 C++ ABI's name-mangling scheme):
- 'a': signed char / byte (8-bit)
- 'b': bool (8-bit, packed 1-bit)
- 'c': char (8-bit)
- 'h': unsigned char / byte (8-bit)
- 's': short (16-bit)
- 't': unsigned short (16-bit)
- 'w': char (16-bit)
- 'i': int (32-bit)
- 'j': unsigned int (32-bit)
- 'x': long long / int64
- 'y': unsigned long long / uint64
- 'n': int128 (128-bit)
- 'o': unsigned int128 (128-bit)
- 'f': float (32-bit)
- 'd': double (64-bit)
- 'g': float128 (128-bit)
- 'v': void
- ...
Then, we might run out of single lower-case letters, and start using letters like A/B/C/D/G or similar as prefixes to define additional types.
Likewise, we could define complex types like:
- Uqname; : Named extension type.
- Xqname; : Struct or value-class.
- Lqname; : Normal class.
- Ptype : Pointer to type.
- Rtype : Reference to type.
- Qtype : Unsized array of type.
- Asize;type : Sized array of type.
- ...
A question would then be: what about function signatures? One nifty idea is putting the argument types in parentheses, with the return type as a suffix. For example, void foo(int x, float y); would have the signature (if)v, and void (*foo)(int x, float y) would be P(if)v.
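As a small sketch of building such a signature string (the type enum and helper names here are made up for illustration, and only a few primitive types are handled):

/* Hypothetical primitive-type enum and its signature characters. */
typedef enum { T_INT, T_UINT, T_FLOAT, T_DOUBLE, T_VOID } TypeId;

char TypeSigChar(TypeId t)
{
    switch (t) {
    case T_INT:    return 'i';
    case T_UINT:   return 'j';
    case T_FLOAT:  return 'f';
    case T_DOUBLE: return 'd';
    case T_VOID:   return 'v';
    }
    return '?';
}

/* Build a "(args)ret" style signature; for void foo(int, float)
   this produces "(if)v". */
void BuildFuncSig(char *dst, TypeId ret, TypeId *args, int nargs)
{
    int i, n = 0;
    dst[n++] = '(';
    for (i = 0; i < nargs; i++)
        dst[n++] = TypeSigChar(args[i]);
    dst[n++] = ')';
    dst[n++] = TypeSigChar(ret);
    dst[n] = 0;
}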
A related idea is doing something similar for modifier flags and attributes, which can be more flexible and extensible than a fixed-size flags field, and potentially more convenient than representing such things with explicit data structures. Granted, it may be necessary to unpack these into flags and data structures prior to use.
For dynamic or variant types, it may be instead more useful to include the type in the value itself. One common way to do this is via tags.
For example, one idea here would be to use 64 bit values, and then use the high order 4 bits as a tag, for example:
- 0, Pointer (Positive)
- 1, Integer (Positive, Fixnum)
- 2, Integer (Positive, Fixnum)
- 3, Double (Positive, Flonum)
- 4, Double (Positive, Flonum)
- 5, -
- 6, -
- 7, Subdivided Tag Space.
- 8, Flonum2 (Double >>> 4)
- 9, -
- 10, -
- 11, Double (Negative, Flonum)
- 12, Double (Negative, Flonum)
- 13, Integer (Negative, Fixnum)
- 14, Integer (Negative, Fixnum)
- 15, Pointer (Negative)
In this scheme, it can be noted that pointers are very easy to convert (all valid pointer values, say, on x86 or x86-64, are also valid references of the pointer type). Most of the common Double range also maps exactly with no loss of precision (apart from very large/small values, which lose 4 bits).
This also allows integer values of up to 62 bits.
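As a rough sketch of working with such references (the exact bit assignments here are just one way of reading the table above, and the fixnum bias in particular is an assumption for illustration rather than a fixed part of the scheme):

#include <stdint.h>

typedef uint64_t VRef;                       /* 64-bit tagged reference */
#define VREF_TAG(v)  ((int)((v) >> 60))      /* high-order 4 bits */

/* Pointers (tags 0 and 15) are stored unchanged, so conversion is trivial. */
int   VRef_IsPtr(VRef v)    { int t = VREF_TAG(v); return t == 0 || t == 15; }
void *VRef_ToPtr(VRef v)    { return (void *)(intptr_t)v; }
VRef  VRef_FromPtr(void *p) { return (VRef)(intptr_t)p; }

/* Fixnums (tags 1/2 positive, 13/14 negative): one possible encoding is to
   bias the integer by 2^60 so that it lands in the fixnum tag ranges,
   giving roughly a 62-bit signed range. */
#define VREF_FIX_BIAS  ((int64_t)1 << 60)

int VRef_IsFixnum(VRef v)
    { int t = VREF_TAG(v); return (t >= 1 && t <= 2) || (t >= 13 && t <= 14); }
VRef VRef_FromFixnum(int64_t i)
    { return (VRef)((i >= 0) ? (i + VREF_FIX_BIAS) : (i - VREF_FIX_BIAS)); }
int64_t VRef_ToFixnum(VRef v)
    { int64_t i = (int64_t)v; return (i >= 0) ? (i - VREF_FIX_BIAS) : (i + VREF_FIX_BIAS); }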
This will still require the use of another mechanism for identifying the type of on-heap objects.
One possible strategy here is to have the reference point to the data payload of the object, but store the object-type header directly preceding this data (if using a custom MM/GC, this may be equivalent to the GC's memory-object header). If we know we are dealing with a pointer to the start of an object's payload, this allows quickly retrieving the type from the header.
For Example:
((((ObjectHeader *)ptr)-1)->typeID)
In cases where we already know the type, we can ignore it, and quickly get at the payload data with little extra effort. In cases where we don't know for certain that we are looking at an object pointer, careful implementation can allow reasonably fast heap lookups.
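As a sketch of what this can look like (the header layout here is hypothetical; a real MM/GC header would carry more bookkeeping):

#include <stdlib.h>

/* Hypothetical object header stored directly before the object payload. */
typedef struct {
    int typeID;    /* index into a table of object types */
    int size;      /* payload size (or other MM/GC bookkeeping) */
} ObjectHeader;

/* Allocate an object, returning a pointer to its payload. */
void *Obj_Alloc(int typeID, int size)
{
    ObjectHeader *hdr = malloc(sizeof(ObjectHeader) + size);
    hdr->typeID = typeID;
    hdr->size = size;
    return hdr + 1;           /* payload begins immediately after the header */
}

/* Recover an object's type given a pointer to its payload. */
int Obj_TypeID(void *ptr)
{
    return (((ObjectHeader *)ptr) - 1)->typeID;
}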
Another popular tag strategy is to put the tag bits in the low-order bits of a pointer, and instead mandate that any object pointers be aligned (say, to 4 or 8 bytes). However, this limits the ability to have pointers to unaligned data (such as pointing at characters in a string table), but makes some sense on 32-bit targets.
Another tagging strategy is also to make use of the fact that on current typical OS's for 32-bit targets, the high 1GB of address space is inaccessible. This part of the address space could then be instead used for encoding integer and floating point values, with the minor drawback of limiting them to 28 bits. The advantage here, however, is that this doesn't mandate pointer alignment.
However, it may also make sense to just "bite the bullet" and always use 64-bit references for variant-type values (including on 32-bit x86), primarily due to its ability to hold the full range of 32-bit integers, full precision floats and nearly full precision doubles, and only rarely needing to box long values. A person may well find that the increased cost of wasted space for pointers is more than offset by the space savings of not having to box things like doubles.
Stack vs Registers
Stacks are simpler to generate code for, but are slightly less efficient for an interpreter, and may require more work when it comes to efficiently generating code in a straightforward JIT.
However, it isn't particularly difficult to convert stack-machine code into register-machine code as needed, and a simple interpreter or JIT for a stack machine isn't particularly complicated. This opens up the possibility of using a stack machine as the high-level IL, with a register machine as an intermediate stage (or for actual execution).
It is also possible to directly generate code for a register machine model.
In such an interpreter, registers may or may not be the same thing as function arguments and local variables. In the latter case, the registers may just be temporary variables within the local environment (in contrast to having the registers in their own space and, for example, using load and store operations to move values between registers and local variables). Some of this may get a little more complicated if lexical variables are introduced.
In a stack machine, each operation will typically work on the item(s) on the stack, pushing any results back onto the stack, whereas in a register machine operations generally take their source and destination operands directly with the opcode.
For example, in a stack machine, ADD_I may transform '2, 3' on the stack into '5'.
In a register machine, ADD_I may instead be given arguments, as in:
ADD_I R3, R1, R2
Which adds the contents of R1 and R2 and stores the results in R3.
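As a small side-by-side illustration (the opcode names here are made up), a statement like 'z=x+y;' might compile roughly like this under each model:

    ; stack machine
    LOAD  x       ; push x
    LOAD  y       ; push y
    ADD_I         ; pop two values, push their sum
    STORE z       ; pop the result into z

    ; register machine (locals mapped onto registers)
    ADD_I Rz, Rx, Ry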
Another minor difference may be dealing with item types. In a stack machine, at least in concept, each stack item needs to be able to hold any type of value, whereas a register only needs to hold values of whatever type it is assigned.
Another tradeoff may be the level of abstraction for various operations. For example, do we access a struct or object's fields via direct pointer operations in the IR, or do we instead provide operations to load/store these fields, leaving it up to the operation to figure out details such as the offset of the field within the object?
An advantage of having dedicated operations is that it avoids specializing on object layouts too early in the process, potentially allowing the object layout to change without having to change the compiler code for accessing the object (the compiler may instead just know the name and type signature of the field). More so, providing an explicit operation may actually be faster.
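As a sketch of such a field-load operation (the structures and names here are hypothetical, and a minimal register file stands in for whatever interpreter context is actually used): the front-end records only the field's name and signature, and the byte offset is resolved and cached at run time:

#include <stdint.h>

/* Minimal register file standing in for the interpreter context. */
typedef union { int64_t i; double d; void *p; } Reg;
typedef struct { Reg regs[256]; /* ... ip, op, etc. ... */ } Context;

/* Field-load operation: the offset is resolved lazily by name and cached,
   so the object layout stays out of the compiler proper. */
typedef struct {
    const char *fieldName;   /* e.g. "x", with a signature such as "i" */
    int cachedOffset;        /* -1 until resolved against the object's type */
    int dstReg, objReg;
} OpLoadField;

int ResolveFieldOffset(void *obj, const char *name);  /* consults type info */

void ExecLoadFieldI(Context *ctx, OpLoadField *op)
{
    void *obj = ctx->regs[op->objReg].p;
    if (op->cachedOffset < 0)
        op->cachedOffset = ResolveFieldOffset(obj, op->fieldName);
    ctx->regs[op->dstReg].i = *(int32_t *)((char *)obj + op->cachedOffset);
}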
Lexical Scope and Closures
Sometimes we may want lexical scope and closures in a language. In this case, we need not only to refer to the current scope, but also to the captured enclosing scope.
One option here is to make locals and arguments simply be part of the lexical environment, using a similar representation regardless of whether or not anything is captured, possibly using a single index value to refer both to the current scope and to any parent scopes. However, this makes more of a mess of using the local environment for registers.
Another option is to use a 2D index, which is omitted for local variables and arguments:
LXLOAD_I R0, 1, 5 //Lexical Load Level 1 Index 5 into R0
This makes lexical scoping the special case, and may provide fast/simple access to locals.
Another alternative is to instead make registers be their own thing, and use explicit variable LOAD/STORE operations.
LLOAD_I R0, A5 //local load Arg5 into R0
However, this introduces additional operations and may potentially reduce performance.
Another question may be regarding whether or not captured variables should themselves be special, or if closures will simply capture the entire parent binding frame (Simple, but may waste memory and prevent destruction of non-captured bindings).
One option here is to make captured locals special, using LXLOAD/LXSTORE for these bindings, in contrast to other non-captured bindings.
Typically, we may also need an explicit operation to capture the current environment when creating a closure, and potentially to clean up the environment in the non-capture case. It may also make sense to handle more efficiently the cases of lambdas which do not capture bindings, and maybe also those which can be determined not to outlive the parent scope.
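As a sketch of the data involved (the layout and names are hypothetical), each captured lexical level might be a binding frame with a link to its parent, and a closure pairs the captured frames with the code to run:

#include <stdint.h>

/* Hypothetical binding frame: one per captured lexical level. */
typedef union { int64_t i; double d; void *p; } Value;

typedef struct BindFrame_s BindFrame;
struct BindFrame_s {
    BindFrame *parent;   /* enclosing lexical level */
    int count;           /* number of captured bindings at this level */
    Value *slots;        /* the captured bindings themselves */
};

/* LXLOAD-style access: walk 'level' parents, then index into that frame. */
Value Env_LexLoad(BindFrame *env, int level, int index)
{
    while (level-- > 0)
        env = env->parent;
    return env->slots[index];
}

/* A closure pairs the code to run with the environment captured at creation. */
typedef struct {
    void *code;          /* function, bytecode, or trace to execute */
    BindFrame *env;      /* captured environment chain */
} Closure;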
While() we're at it
Assuming we go beyond simply executing ASTs or similar, another question may be how best to handle execution.
A typical answer is that we compile to bytecode. But bytecode isn't the end of the story, since we will probably need to execute it somehow.
A common and simple answer here is the use of a "while()" loop and a big "switch()" statement, decoding each operation on every iteration of the loop.
rs=0; ip=ctx->ip;
while(!rs)
{
    op=*ip++;                /* fetch the next opcode byte */
    switch(op)
    {
    ...
    case OP_ADD_I:
        /* decode destination and source registers, then add */
        c=*ip++; a=*ip++; b=*ip++;
        ctx->regs[c].i=ctx->regs[a].i+ctx->regs[b].i;
        break;
    ...
    }
}
While this works pretty well, there may be potentially faster options.
For example:
Each instruction is a small struct containing a function pointer (to the instruction-specific logic) and maybe some data, with the function returning the next instruction.
Then, we can have an inner interpreter loop something like:
void Ctx_Run(Context *ctx)
{
    Opcode *cur;
    cur=ctx->op;
    while(cur)
        { cur=cur->fcn(ctx, cur); }
}
With opcode functions something like:
Opcode *Op_AddI(Context *ctx, Opcode *op)
{
    /* add the two source registers, store into the destination register */
    ctx->regs[op->c].i=ctx->regs[op->a].i+ctx->regs[op->b].i;
    return op->next;         /* hand back the next instruction to execute */
}
And, we may observe potentially faster raw execution speeds than with a while loop and a switch.
We can potentially use a while loop and switch to build these instruction chains from the input bytecode, treating these chains primarily as an intermediate structure.
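As a rough sketch of that decoding step (reusing the Context, OP_ADD_I, and Op_AddI names from the snippets above; the Opcode fields and the bytecode operand layout are assumptions):

#include <stdlib.h>

/* Opcode struct matching the threaded-code functions above (fields assumed). */
typedef struct Opcode_s Opcode;
struct Opcode_s {
    Opcode *(*fcn)(Context *ctx, Opcode *op);   /* instruction-specific logic */
    int a, b, c;                                /* operand registers */
    Opcode *next;                               /* following instruction */
};

/* Decode a bytecode stream into a linked chain of Opcode structs. */
Opcode *Ctx_DecodeOps(Context *ctx, unsigned char *ip, unsigned char *end)
{
    Opcode *first = NULL, *last = NULL, *op;
    while (ip < end) {
        op = calloc(1, sizeof(Opcode));
        switch (*ip++) {
        case OP_ADD_I:
            op->fcn = Op_AddI;
            op->c = *ip++; op->a = *ip++; op->b = *ip++;
            break;
        /* ... other opcodes ... */
        }
        if (last) last->next = op; else first = op;
        last = op;
    }
    return first;
}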
Taken a little further, a person may make another observation: in the vast majority of cases, an operation simply, naively, and always returns the following instruction. A speedup can be gained by recognizing these cases and grouping the instructions into "traces", which operate in a loop like the above (each trace then returns the next trace).
The trace itself may then contain an unrolled version of the above loop:
Trace *Tr_ExecuteBasic5(Context *ctx, Trace *tr)
{
    Opcode *op;
    /* run 5 straight-line opcodes without checking for an exit in between */
    op=tr->op;
    op=op->fcn(ctx, op);
    op=op->fcn(ctx, op);
    op=op->fcn(ctx, op);
    op=op->fcn(ctx, op);
    op=op->fcn(ctx, op);
    return tr->next;
}
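As a rough sketch of how such traces might be built from the opcode chain (the Trace struct, the end-of-trace test, and the executor table are all assumptions for illustration):

#include <stdlib.h>

/* Hypothetical trace: a run of straight-line opcodes plus an unrolled executor. */
typedef struct Trace_s Trace;
struct Trace_s {
    Trace *(*run)(Context *ctx, Trace *tr);   /* e.g. Tr_ExecuteBasic5 */
    Opcode *op;       /* first opcode of the run */
    int count;        /* number of straight-line opcodes in the run */
    Trace *next;      /* default successor trace */
};

int Op_EndsTrace(Opcode *op);   /* placeholder: does this op branch/call/return? */
extern Trace *(*Tr_ExecuteTable[])(Context *ctx, Trace *tr);  /* indexed by count */

Trace *Tr_Build(Opcode *op)
{
    Trace *tr = calloc(1, sizeof(Trace));
    tr->op = op;
    while (op && tr->count < 5 && !Op_EndsTrace(op))
        { tr->count++; op = op->next; }
    tr->run = Tr_ExecuteTable[tr->count];   /* a count of 5 selects Tr_ExecuteBasic5 */
    return tr;
}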
Getting The JITters
The next step, in the quest for high execution speeds, may be going and writing a JIT. This can range from fairly simple to much more complex.
A simple strategy is to produce "call-threaded code", which basically just means that the "traces" above are spit out as a chain of call instructions (or similar) wrapped in a function call/return sequence. This allows getting a little closer to the execution speed of native code without significantly increasing complexity.
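As a very rough sketch of the idea (x86-64, System V calling convention; assumes writable+executable pages are allowed, reuses the Context/Opcode sketches above, and skips all error handling), each opcode in a straight-line trace is spit out as a "load arguments, call the opcode function" sequence:

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Tiny byte emitter for the code buffer. */
typedef struct { unsigned char *buf; int pos; } Emitter;
void EmitBytes(Emitter *e, const void *p, int n)
    { memcpy(e->buf + e->pos, p, n); e->pos += n; }
void EmitU64(Emitter *e, uint64_t v) { EmitBytes(e, &v, 8); }

/* Emit: mov rdi, arg0; mov rsi, arg1; mov rax, fcn; call rax */
void EmitCall2(Emitter *e, void *fcn, void *arg0, void *arg1)
{
    static const unsigned char movRdi[] = {0x48, 0xBF};
    static const unsigned char movRsi[] = {0x48, 0xBE};
    static const unsigned char movRax[] = {0x48, 0xB8};
    static const unsigned char callRax[] = {0xFF, 0xD0};
    EmitBytes(e, movRdi, 2); EmitU64(e, (uint64_t)arg0);
    EmitBytes(e, movRsi, 2); EmitU64(e, (uint64_t)arg1);
    EmitBytes(e, movRax, 2); EmitU64(e, (uint64_t)fcn);
    EmitBytes(e, callRax, 2);
}

/* Compile a straight-line run of opcodes to call-threaded code:
   one call per opcode, wrapped in a minimal prologue/epilogue. */
typedef void (*TraceFn)(void);
TraceFn JitCompileTrace(Context *ctx, Opcode *op, int count)
{
    static const unsigned char prologue[] = {0x48, 0x83, 0xEC, 0x08};  /* sub rsp, 8 */
    static const unsigned char epilogue[] = {0x48, 0x83, 0xC4, 0x08,   /* add rsp, 8 */
                                             0xC3};                    /* ret */
    Emitter e;
    e.buf = mmap(0, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    e.pos = 0;
    EmitBytes(&e, prologue, sizeof(prologue));
    for (; count > 0; count--, op = op->next)
        EmitCall2(&e, (void *)op->fcn, ctx, op);   /* emits a call to op->fcn(ctx, op) */
    EmitBytes(&e, epilogue, sizeof(epilogue));
    return (TraceFn)e.buf;
}

Executing the buffer is then just a matter of calling the returned function pointer. The return values of the opcode functions are simply ignored here, since a straight-line trace already knows which opcode comes next.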
The next step up from this is to directly handle certain instruction sequences, like instead of emitting a call to 'AddI', directly emitting the machine-code sequence for adding the integers and storing off the results. Then lots of other common operations can be given similar treatment, while maybe still relying on the interpreter for handling more complex operations.
Beyond this, a person might start getting into things like register allocation and peephole optimization, ... But, something like this is getting a bit more advanced (There be dragons here...).
There are various ways to handle emitting the machine-code sequences, ranging from direct options, like emitting byte sequences into a buffer, to slightly more glossing over it (say, using an ASCII "command notation" mixing hex values with shorthand notation for generating other sequences of bytes), to something like using a full assembler.
Interesting Points
I have done a lot of this before, in some cases having good experiences, and in other cases enduring much pain, spending time on things only to have them turn out to be useless, or watching them fall apart or turn out badly with little ability to make them better. But in some ways, this is just life sometimes.
For example, my first real code generator became very ambitious, with me trying to write essentially a full-featured code generator like one used in a "real" compiler. This didn't turn out well; ultimately I was unable to fully debug the thing, and it has since been dropped from the project.
However, with a much simpler JIT, I can get speeds not drastically slower than native C, and sometimes this is good enough. Do we really need to compete performance-wise with optimized C compiler output? Not necessarily.
Conclusion
Whether or not a person actually goes and does something like this is up to them. It can be a fun and interesting experience, or it can quite literally end up eating years of a person's life; this much is up to them...
Not every project or language needs a complex interpreter with a JIT and so on, and sometimes something simpler, like directly parsing or executing commands, or simply parsing and executing a syntax tree, may well turn out to be sufficient for the task at hand.
Article Update Log
2013-04-04: Writing article