Writing parsers by hand, e.g. with regular expressions, has already been discussed. To show you the next step, here is a little example that uses flex to generate a scanner. The scanner produces tokens, which you then process in a hand-written parser.
You probably don't want to do that now, but it never hurts to see what it would look like :)
Compile with
$ flex scanner.l
$ g++ lex.yy.c main.cpp
and run with "./a.out input_file.txt". (flex generates lex.yy.c from scanner.l.)
Scanner specification, which maps pieces of input text to token numbers (e.g. "=" becomes the TK_EQUAL value): scanner.l
%{
int line = 1;
char *text;
#include "tokens.h"
%}
%%
= { return TK_EQUAL; }
NewObject { return KW_NEWOBJECT; }
End { return KW_END; }
[A-Za-z][A-Za-z0-9]* { text = yytext; return TK_IDENTIFIER; }
[0-9]+ { text = yytext; return TK_NUMBER; }
[ \t\r] ;
\n { line++; }
. { printf("Unrecognized character 0x%02x\n", yytext[0]); }
%%
int yywrap() {
    return 1; // No further input files after the first one.
}
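To make it a bit more concrete what these rules do, here is a rough hand-written C++ equivalent. It is only a sketch under some assumptions: MiniScanner and its members are names I made up, it scans a string instead of a file, and it silently skips unrecognized characters instead of printing a message. The scanner that flex generates works from generated tables instead, but it should return the same token sequence for the same input.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Same token numbers as in tokens.h, repeated so the sketch stands alone.
enum Tokens { TK_EOF, TK_EQUAL, TK_IDENTIFIER, TK_NUMBER, KW_NEWOBJECT, KW_END };

// Hypothetical hand-written scanner over an in-memory string.
struct MiniScanner {
    std::string input;    // Text being scanned.
    std::size_t pos;      // Index of the next unread character.
    int line;             // Current line number, starting at 1.
    std::string text;     // Text of the last identifier or number.

    explicit MiniScanner(const std::string &s) : input(s), pos(0), line(1) {}

    // Return the next token, or TK_EOF when the input is exhausted.
    int next() {
        while (pos < input.size()) {
            unsigned char c = input[pos];
            if (c == '\n') { line++; pos++; continue; }              // Count lines.
            if (c == ' ' || c == '\t' || c == '\r') { pos++; continue; }
            if (c == '=') { pos++; return TK_EQUAL; }
            if (std::isalpha(c)) {
                // Take the longest run of letters and digits, then check
                // whether that whole word is a keyword.
                std::size_t start = pos;
                while (pos < input.size() &&
                       std::isalnum(static_cast<unsigned char>(input[pos]))) pos++;
                text = input.substr(start, pos - start);
                if (text == "NewObject") return KW_NEWOBJECT;
                if (text == "End") return KW_END;
                return TK_IDENTIFIER;
            }
            if (std::isdigit(c)) {
                std::size_t start = pos;
                while (pos < input.size() &&
                       std::isdigit(static_cast<unsigned char>(input[pos]))) pos++;
                text = input.substr(start, pos - start);
                return TK_NUMBER;
            }
            pos++; // Unrecognized character: skipped in this sketch.
        }
        return TK_EOF;
    }
};
```

For an input such as "NewObject foo", a line "bar = 42" and a line "End", calling next() repeatedly yields KW_NEWOBJECT, TK_IDENTIFIER, TK_IDENTIFIER, TK_EQUAL, TK_NUMBER, KW_END and finally TK_EOF. Note how much of this is bookkeeping that the six flex rules express in one line each.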
Glue file to share the common definitions: tokens.h
#ifndef TOKENS_H
#define TOKENS_H
#include <stdio.h> // For FILE, so the header does not depend on include order.
extern int line;
extern char *text;
extern FILE *yyin; // Owned by the generated scanner.
int yylex();
enum Tokens {
    TK_EOF, // Must be 0, the value yylex() returns at end of input.
    TK_EQUAL,
    TK_IDENTIFIER,
    TK_NUMBER,
    KW_NEWOBJECT,
    KW_END,
};
#endif
Main program file, with the actual parser: main.cpp
#include <cstdio>
#include <cstdlib>
#include <string>
#include "tokens.h"
bool parse()
{
    int tok;
    tok = yylex();
    if (tok == TK_EOF) return true; // Empty input is accepted.
    if (tok != KW_NEWOBJECT) {
        printf("Expected NewObject at line %d\n", line);
        return false;
    }
    tok = yylex();
    if (tok != TK_IDENTIFIER) {
        printf("Expected object name after NewObject at line %d\n", line);
        return false;
    }
    printf("Found object name \"%s\" at line %d\n", text, line);
    for (;;) {
        tok = yylex();
        if (tok == KW_END) break; // End of the field list.
        if (tok != TK_IDENTIFIER) {
            printf("Expected field key at line %d\n", line);
            return false;
        }
        std::string key = text; // Save the key before it gets overwritten by the field value.
        tok = yylex();
        if (tok != TK_EQUAL) {
            printf("Expected equal sign at line %d\n", line);
            return false;
        }
        tok = yylex();
        if (tok == TK_IDENTIFIER) {
            printf("Found a field with a named value: \"%s :: %s\"\n", key.c_str(), text);
        } else if (tok == TK_NUMBER) {
            printf("Found a field with a number: \"%s :: %d\"\n", key.c_str(), atoi(text));
        } else {
            printf("Unknown field value at line %d\n", line);
            return false;
        }
        // And loop for the next "key = value"
    }
    tok = yylex();
    if (tok != TK_EOF) {
        printf("EOF expected at line %d\n", line);
        return false; // Trailing input after End is an error too.
    }
    return true;
}
int main(int argc, char *argv[])
{
    FILE *handle = (argc == 2) ? fopen(argv[1], "r") : NULL;
    if (handle == NULL) {
        printf("File could not be opened\n");
        exit(1);
    }
    yyin = handle; // Give handle to the scanner.
    bool result = parse();
    fclose(handle);
    return result ? 0 : 1;
}
The parser (in main.cpp) just prints the values, but of course you could also store them in some data structure.
If you think the "parse" function is a bit repetitive, it is. You can step up to a parser generator like bison to get rid of it, and gain a lot of additional recognizing power at the same time.
I don't have a working example with a parser generator (it needs some new code, like the class definitions, and a bit of additional glue code), but the core of the parser input specification would look like this:
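As a sketch of what such a data structure could look like: ParsedObject and its members below are hypothetical names, not part of the code above. The idea is that parse() would fill one of these instead of calling printf, e.g. obj.add_named(key, text) in the TK_IDENTIFIER branch and obj.add_number(key, text) in the TK_NUMBER branch.

```cpp
#include <cstdlib>
#include <map>
#include <string>

// Hypothetical container for one parsed "NewObject ... End" block.
struct ParsedObject {
    std::string name;                                 // Name after NewObject.
    std::map<std::string, std::string> named_fields;  // key = identifier
    std::map<std::string, int> number_fields;         // key = number

    // Store a field whose value is an identifier.
    void add_named(const std::string &key, const std::string &value) {
        named_fields[key] = value;
    }

    // Store a field whose value is a number; takes the digit text as the
    // scanner delivers it and converts it the same way parse() does.
    void add_number(const std::string &key, const char *digits) {
        number_fields[key] = std::atoi(digits);
    }
};
```

Both methods copy their arguments into std::string or int right away, which matters here because the scanner's text pointer is only valid until the next yylex() call.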
Program : KW_NEWOBJECT TK_IDENTIFIER Fields KW_END
    {
        $$ = new Program($2, $3);
    }
Fields : Field
    {
        $$ = std::list<Field *>();
        $$.push_back($1);
    }
Fields : Fields Field
    {
        $$ = $1;
        $$.push_back($2);
    }
Field : TK_IDENTIFIER TK_EQUAL TK_IDENTIFIER
    {
        $$ = new NameField($1, $3);
    }
Field : TK_IDENTIFIER TK_EQUAL TK_NUMBER
    {
        $$ = new NumberField($1, $3);
    }
You just write the sequences that you want to match, and what code should be executed. The parser generator generates the recognizer that reads tokens from the scanner, and calls your code when appropriate.
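The class definitions that these actions assume could look roughly like the following. This is a guess at a reasonable shape, not tested against bison: the members and constructor signatures are my assumptions, and I also assume the scanner hands over identifiers as std::string and numbers as int, which needs glue code (the semantic value type declarations) that is not shown here.

```cpp
#include <list>
#include <string>

// Hypothetical base class for a "key = value" field.
struct Field {
    std::string key;
    explicit Field(const std::string &k) : key(k) {}
    virtual ~Field() {}
};

// Field whose value is an identifier.
struct NameField : Field {
    std::string value;
    NameField(const std::string &k, const std::string &v) : Field(k), value(v) {}
};

// Field whose value is a number.
struct NumberField : Field {
    int value;
    NumberField(const std::string &k, int v) : Field(k), value(v) {}
};

// Hypothetical root object: the object name plus all its fields.
struct Program {
    std::string name;
    std::list<Field *> fields;

    Program(const std::string &n, const std::list<Field *> &f) : name(n), fields(f) {}

    // Program takes ownership of the Field pointers, so copying is disabled.
    Program(const Program &) = delete;
    Program &operator=(const Program &) = delete;

    ~Program() {
        for (Field *f : fields) delete f;
    }
};
```

Building the same structure by hand, for example with fields.push_back(new NumberField("size", 42)) followed by Program prog("foo", fields), is what the grammar actions do for you, one rule at a time.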