1. The design of a toy program

Suppose we want to design a program that does the following. It reads and carries out user commands, much like our Assignment 1. The commands have the following forms:

exit
bye
quit
freq_count [fn1] [fn2] ... [fnk]      // just a list of file names
foreground [color] "a string" // [color] is either red, cyan, or blue

The first three commands are equivalent: the program terminates. The command freq_count asks our program to output the character frequency counts for each file in the list of file names [fn1], ..., [fnk]. The command foreground asks our program to output the string inside the double quotes using the color specified by [color], which is one of the three words "red", "cyan", or "blue".

How would we go about designing such a program? I would like you to compare the final product that comes out of this lecture note with the design you had for Assignment 1, and think about where you made a poor design decision, or where my design was poor. What do I mean by "poor design"? This is a loaded term. One major criterion for a good design is the following: can the design be extended almost effortlessly to accommodate changes in the functionality specification? Real programs are never static. Users change their minds about what they want out of a program all the time. For example, what if we need to add more commands:

bf "a string" [fn] // same as in your Assignment 1
mp "a string" [fn] // same as in your Assignment 1
background [color] "a string"
fgandbg [fg-color] [bg-color] "a string"

Will our design allow for a relatively straightforward code update without many changes to the existing code? We would also like to handle user errors, such as when the syntax of the line entered is incorrect (a missing closing quote, a missing argument, etc.). Through this example, I would like to introduce several key concepts and even design patterns, such as struct, class, lexical analysis (a very elegant and useful concept with lots of theoretical and practical implications), a crude form of the delegate pattern, and the powerful function pointer construct.

2. A Lexer class

In the toy program described above, a fundamental task is to break up an input string such as

foreground blue "this is blue"

into "tokens": foreground, blue, and this is blue. The first two tokens are identifiers, one for the command name and the other for a color; the last token is of a string type, where we consider all characters enclosed in a pair of double quotes to be part of the same token. Scanning the input and breaking it up into tokens is called lexical analysis, which has an extremely rich theory behind it. (Almost any compiler textbook devotes sections to this topic. I'd recommend Modern Compiler Implementation in C.) The idea of a lexer is relatively simple:

it tells us whether or not there are more tokens in the input
we can ask it to return the next token, or return an error token (for example, when there is an opening double quote without a matching closing quote)
plus a couple of other operations if we want them, such as assigning a new input, etc.

2.1. Interface of a class

We will design a Lexer class with the following interface:

[sourcecode language="cpp"]
/*
 * *****************************************************************************
 * file name  : Lexer.h
 * author     : Hung Q. Ngo
 * description: a simple lexical analyzer
 * *****************************************************************************
 */
#ifndef _LEXER_H
#define _LEXER_H

#include <string>

enum token_types_t {
    IDENT,  // a sequence of alphanumeric characters and _, starting with alpha
    STRING, // sequence of characters between " ", no escape
    ENDTOK, // end of string/file, no more token
    ERRTOK  // unrecognized token
};

struct Token {
    token_types_t type;
    std::string value;
    Token(token_types_t tt=ENDTOK, std::string val="") : type(tt), value(val) {}
};

/**
 * -----------------------------------------------------------------------------
 * the Lexer class:
 * - take a string to be scanned as input
 * - scan for tokens and return one at a time
 * -----------------------------------------------------------------------------
 */
class Lexer {
public:
    // constructor with a default parameter
    Lexer(std::string str="")
        : input_str(str), cur_pos(0), in_err(false), separators(" \t\n") { }

    // a couple of modifiers
    void set_input(std::string); // set a new input,
    void restart();              // move cursor to the beginning, restart

    Token next_token();     // returns the next token
    bool has_more_token();  // are there more token(s)?

private:
    std::string input_str;  // the input string to be scanned
    size_t      cur_pos;    // current position in the input string
    bool        in_err;     // are we in the error state?
    std::string separators; // set of separators; *not* the best option!
};

#endif
[/sourcecode]

A C++ class is a user-defined type which has two parts: the interface and the implementation. The interface, typically written in a header file such as Lexer.h, tells the world how to use the class, but it contains very little (if any) information about how the class is implemented. This is a form of information hiding, a cornerstone of good software engineering. (Side note: in C++ terminology the "interface" of a class is formally the class "definition", which can be confusing because it reads more like a "declaration"; the bodies of the member functions are defined separately.)

A class typically consists of member functions and data members. The member functions (methods) perform operations and algorithms on the data. For example, in a Lexer class, the crucial piece of data is the input string we want to scan, and the important operations we want to perform on the input are set_input(), next_token(), has_more_token(), and restart(), which are hopefully self-explanatory. In order to support the operations, we might need some other data members, such as cur_pos, which stores the current position in the input that we are scanning, or in_err, which indicates whether we found an erroneous token in the input.

We would like the "users" of our class (i.e. fellow programmers) to operate on the data via the functions we provide, without interfering with the data or other internal functions. Again, this is information hiding/encapsulation; one way to achieve the effect is to use access specifiers such as public or private. Members that are private cannot be accessed by code outside the class. (For example, if the user modified the cur_pos variable wildly, then we would not be able to scan the input correctly!)
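
To make the point about access specifiers concrete, here is a minimal sketch that is separate from the Lexer code. The class Cursor and the helper read_all() are made up for this illustration. The position is private, so the only way to move it is through advance(), which preserves the invariant that the position never runs past the end of the text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical example class (not part of the lecture's code):
// a cursor that walks over a string and can never run past its end.
class Cursor {
public:
    explicit Cursor(std::string text) : text_(text), pos_(0) {}

    // true when every character has been read
    bool at_end() const { return pos_ >= text_.length(); }

    // returns the current character and advances by one;
    // the caller is expected to check at_end() first
    char advance() {
        assert(!at_end());
        return text_[pos_++];
    }

private:
    std::string text_;  // the text being scanned
    std::size_t pos_;   // invariant: pos_ <= text_.length()
};

// Reads every character using the public interface only.
std::string read_all(Cursor c) {
    std::string out;
    while (!c.at_end()) out += c.advance();
    return out;
}
```

Outside the class, a statement such as c.pos_ = 100; is rejected by the compiler, which is exactly the protection we want for cur_pos in the Lexer.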

2.2. Constructors

In the interface for the Lexer class, there is a member function which has a special format:

[sourcecode language="cpp"]
// constructor
Lexer(std::string str="")
    : input_str(str), cur_pos(0), in_err(false), separators(" \t\n") { }
[/sourcecode]

This is called a constructor of the class, which is invoked whenever an object of the class is created (e.g., when a variable of the class type is declared and defined): Lexer my_lexer; A constructor that can be called without arguments is called a default constructor; it initializes the data members of the object with some semantically meaningful values. Our constructor qualifies because its only parameter has a default value. For example, when we define string my_str; without any parameter, my_str is initialized with the empty string. When we define string my_str("David Blaine"), the string is initialized with the value "David Blaine". In our case, if we define Lexer my_lexer; then the input string is set to be empty, and the other variables are initialized appropriately. When we say Lexer my_lexer("This \"is a\" test"); the input string is This "is a" test, which has three tokens: This, is a, and test.

The internal data member separators is a string which contains all characters that can serve as separators for the tokenization process: the space ' ' character, the tab '\t' character, and the end of line '\n' character. When the lexer sees one of these characters, it knows that the end of a token has been reached; separators are not part of any token.
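
As a small side sketch of the same constructor idiom, consider the hypothetical class Prompt below (the class and its members are invented for this example). One constructor with a default argument serves both as the default constructor and as the one-argument constructor, and the member initializer list after the colon sets up every data member.

```cpp
#include <cassert>
#include <string>

// Hypothetical class, used only to illustrate constructors.
class Prompt {
public:
    // One constructor covers both cases thanks to the default argument:
    //   Prompt p;        -> text_ is "> "
    //   Prompt q("$ ");  -> text_ is "$ "
    Prompt(std::string text = "> ") : text_(text), times_shown_(0) {}

    // returns the prompt text and counts how often it was requested
    std::string show() { ++times_shown_; return text_; }
    int times_shown() const { return times_shown_; }

private:
    std::string text_;
    int times_shown_;
};

// Small usage check; call it from main() to try the class.
void prompt_demo() {
    Prompt p;                 // default argument is used
    Prompt q("$ ");           // explicit argument
    assert(p.show() == "> ");
    assert(q.show() == "$ ");
    assert(p.times_shown() == 1);
}
```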

2.3. The `struct` type

A struct type is similar to a class type, except that all members of a struct are public by default. We typically use a struct type for classes which have only simple data fields accessible by the users of the class. In our example, we have a Token structure which has only two fields: the token type and the token value.
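
The difference in default access can be seen in a short sketch. The types ColorS and ColorC are hypothetical and exist only for this comparison: the struct exposes its fields directly, while the class (whose members are private by default) has to open up access explicitly.

```cpp
#include <cassert>
#include <string>

// Hypothetical types for illustration only.
struct ColorS {            // members are public by default
    std::string name;
    int code;
};

class ColorC {             // members are private by default
    std::string name;
    int code;
public:                    // access must be opened up explicitly
    ColorC(std::string n, int c) : name(n), code(c) {}
    int get_code() const { return code; }
};

// Small usage check; call it from main() to try both types.
void struct_demo() {
    ColorS s = {"red", 31};   // fields can be set directly
    s.code = 91;              // fine: 'code' is public in the struct
    ColorC c("red", 31);
    // c.code = 91;           // would not compile: 'code' is private
    assert(s.code == 91);
    assert(c.get_code() == 31);
}
```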

2.4. The driver for a class

Let's say an oracle has implemented a class for us; we can test the class by writing simple "driver programs" to examine the class operations.

[sourcecode language="cpp"]
// lextest.cpp : a simple driver for the Lexer class
#include <iostream>
#include "Lexer.h"
using namespace std;

void print_tokens(Lexer lexer)
{
    Token tok;
    while (lexer.has_more_token()) {
        tok = lexer.next_token();
        switch (tok.type) {
            case IDENT:
                cout << "Identifier: " << tok.value << endl;
                break;
            case STRING:
                cout << "String: " << tok.value << endl;
                break;
            case ENDTOK:
                cout << "END of string token -- should not reach here\n";
                break;
            case ERRTOK:
                cout << "ERROR Token" << endl;
                return;
        }
    }
}

int main()
{
    Lexer lexer("This \"is a good\" \t test");
    print_tokens(lexer);
    lexer.set_input("This \"is a \" test and \"an error");
    print_tokens(lexer);
    return 0;
}
[/sourcecode]

The output of the above driver will be

[sourcecode language="bash"]
Identifier: This
String: is a good
Identifier: test
Identifier: This
String: is a 
Identifier: test
Identifier: and
ERROR Token
[/sourcecode]

2.5. The Lexer class implementation

[sourcecode language="cpp"]
/**
 * *****************************************************************************
 * file name  : Lexer.cpp
 * author     : Hung Q. Ngo
 * description: implementation of Lexer interface
 * *****************************************************************************
 */
#include "Lexer.h"
#include <string>
using namespace std;

/**
 * -----------------------------------------------------------------------------
 * scan and return the next token
 * cur_pos then points to one position right past the token
 * the token type is set to ERRTOK on error, at that point the member
 * variable in_err will be set to true
 * -----------------------------------------------------------------------------
 */
Token Lexer::next_token()
{
    Token ret;
    size_t last;

    if (in_err) {
        ret.type = ERRTOK;
        ret.value = "";
        return ret;
    }

    // if not in error state, the default token is the ENDTOK
    ret.type = ENDTOK;
    ret.value = "";

    if (has_more_token()) {
        last = cur_pos; // input_str[last] is a non-space char
        if (input_str[cur_pos] == '"') {
            cur_pos++;
            while (cur_pos < input_str.length() && input_str[cur_pos] != '"')
                cur_pos++;
            if (cur_pos < input_str.length()) {
                ret.type = STRING;
                ret.value = input_str.substr(last+1, cur_pos-last-1);
                cur_pos++; // move past the closing "
            } else {
                in_err = true;
                ret.type = ERRTOK;
                ret.value = "";
            }
        } else {
            while (cur_pos < input_str.length() &&
                   separators.find(input_str[cur_pos]) == string::npos &&
                   input_str[cur_pos] != '"') {
                cur_pos++;
            }
            ret.type = IDENT;
            ret.value = input_str.substr(last, cur_pos-last);
        }
    }
    return ret;
}

/**
 * -----------------------------------------------------------------------------
 * set a new input string, restart
 * -----------------------------------------------------------------------------
 */
void Lexer::set_input(string str)
{
    input_str = str;
    restart();
}

/**
 * -----------------------------------------------------------------------------
 * skip over separators; return true if at least one more token remains
 * -----------------------------------------------------------------------------
 */
bool Lexer::has_more_token()
{
    while (cur_pos < input_str.length() &&
           separators.find(input_str[cur_pos]) != string::npos) {
        cur_pos++;
    }
    return (cur_pos < input_str.length());
}

/**
 * -----------------------------------------------------------------------------
 * restart from the beginning, reset error states
 * -----------------------------------------------------------------------------
 */
void Lexer::restart()
{
    cur_pos = 0;
    in_err = false;
}
[/sourcecode]

3. Function pointers and map

So far so good. We can use the lexer to scan and tokenize the input line. The first token tells us which command we are supposed to run. Let us start with two simple types of commands:

[exit | bye | quit]
foreground [red | green | blue] "the string"

Our idea is the following: we construct an "array" of functions, indexed by the command names "exit", "bye", "quit", and "foreground". Given a command name (which is the value of the first token we got from the lexer), we use the name to index into the array of functions and call the corresponding function. Since the array is not indexed by non-negative integers as usual, this special type of array is called an associative array. As in Java, an associative array in C++ is implemented with the map data type. We need a map which associates a string (the command name) with a function. But functions themselves cannot be stored as values in a container; we will store function pointers instead.
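
Before we put function pointers into a map, here is a minimal sketch of a map used as an associative array with plain integer values. The function color_code() and its table are made up for this illustration; the numbers are the standard ANSI foreground color codes. Note that find() only looks a key up, whereas operator[] would silently insert a missing key.

```cpp
#include <cassert>
#include <map>
#include <string>

// Returns the ANSI foreground code for a color name, or -1 if unknown.
// The function name and table are illustrative, not from the lecture's code.
int color_code(const std::string& name) {
    std::map<std::string, int> codes;   // indexed by string, not by integer
    codes["red"]   = 31;
    codes["green"] = 32;
    codes["blue"]  = 34;

    // find() does not insert; operator[] would create a missing key
    std::map<std::string, int>::iterator it = codes.find(name);
    if (it == codes.end()) return -1;
    return it->second;
}

// Small usage check; call it from main() to try the lookup.
void color_demo() {
    assert(color_code("blue") == 34);
    assert(color_code("pink") == -1);
}
```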

3.1. Define a function pointer variable

To define a function pointer variable, we use the following syntax; basically, we take the function prototype and in place of the function name we put (* func_pointer_name).

[sourcecode language="cpp"]
// fp.cpp
#include <iostream>
using std::cout;
using std::endl;

int add(int x, int y) { return x+y; }
int sub(int x, int y) { return x-y; }

int main()
{
    // define fp to be a pointer to a function which takes
    // two int parameters and returns an int
    int (*fp)(int, int);

    fp = &add;
    cout << fp(6,3) << endl; // get 9
    fp = &sub;
    cout << fp(6,3) << endl; // get 3
    return 0;
}
[/sourcecode]
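
Function pointers can also be passed as arguments and stored in ordinary arrays, which is one step away from storing them in a map. The following sketch reuses add and sub from fp.cpp; the names apply() and fp_demo() are assumptions made for this example.

```cpp
#include <cassert>

int add(int x, int y) { return x + y; }
int sub(int x, int y) { return x - y; }

// apply() receives the operation to perform as a function pointer parameter
int apply(int (*op)(int, int), int x, int y) {
    return op(x, y);   // same as (*op)(x, y)
}

// Small usage check; call it from main() to try it.
void fp_demo() {
    // an ordinary array of function pointers, indexed by integers;
    // the '&' in front of a function name is optional
    int (*ops[2])(int, int) = { add, sub };
    assert(ops[0](6, 3) == 9);
    assert(ops[1](6, 3) == 3);
    assert(apply(sub, 10, 4) == 6);
}
```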

3.2. Define a function pointer type

To define a function pointer type, preparing ourselves for the map, we use typedef.

[sourcecode language="cpp"]
/*
 * *****************************************************************************
 * file name  : cmd.h
 * author     : Hung Ngo
 * description: codes to implement user's commands
 *              illustrates the use of function pointers for late binding
 * *****************************************************************************
 */
#ifndef _CMD_H
#define _CMD_H

#include <string>
#include "Lexer.h"

/**
 * cmd_handler_t is a function pointer type, pointing to a function which takes
 * a Lexer as an argument and returns nothing
 */
typedef void (*cmd_handler_t)(Lexer);

/*
 * -----------------------------------------------------------------------------
 * print_color() prints a string token (if any) from the Lexer in color
 * bye() simply terminates the program
 * -----------------------------------------------------------------------------
 */
void print_color(Lexer);
void bye(Lexer);

#endif
[/sourcecode]

The two commands can be implemented as follows.

[sourcecode language="cpp"]
/**
 * *****************************************************************************
 * file name   : cmd.cpp
 * author      : Hung Ngo
 * description : definitions of simple toy commands
 * *****************************************************************************
 */
#include <iostream>
#include <cstdlib> // for exit()
#include "cmd.h"
#include "Lexer.h"
#include "term_control.h"
#include "error_handling.h" // will be provided below
using namespace std;

/**
 * -----------------------------------------------------------------------------
 * expects two Tokens: IDENT and STRING
 * prints the STRING using the color specified by the IDENT token
 * -----------------------------------------------------------------------------
 */
void print_color(Lexer lexer)
{
    Token t1, t2;

    // the following is very clumsy; need a "tokenizer" member function
    if (lexer.has_more_token()) {
        t1 = lexer.next_token();
        if (lexer.has_more_token()) {
            t2 = lexer.next_token();
            if (!lexer.has_more_token()) {
                // this means we're probably ok
                if (t1.type == IDENT && t2.type == STRING) {
                    if (t1.value == "red") {
                        cout << term_cc(RED) << t2.value << endl;
                        return;
                    } else if (t1.value == "green") {
                        cout << term_cc(GREEN) << t2.value << endl;
                        return;
                    } else if (t1.value == "blue") {
                        cout << term_cc(BLUE) << t2.value << endl;
                        return;
                    }
                }
            }
        }
    }
    error_return("Syntax error, use: foreground [red|green|blue] \"str\"");
}

/**
 * -----------------------------------------------------------------------------
 * terminates the program; reports a syntax error if any argument is given
 * -----------------------------------------------------------------------------
 */
void bye(Lexer lexer)
{
    if (lexer.has_more_token()) {
        error_return("Syntax error: use bye/exit/quit\n");
    } else {
        exit(0);
    }
}
[/sourcecode]

3.3. Defining and using a C++ associative array (i.e. map)

Now, we define a map indexed by command names. Each command name maps to a function pointer of the type cmd_handler_t. Hopefully you can appreciate the simplicity of the following design:

[sourcecode language="cpp"]
// cmd_test.cpp : test simple command handling with map
#include <iostream>
#include <map>
#include "Lexer.h"
#include "term_control.h"
#include "error_handling.h"
#include "cmd.h"
using namespace std;

/**
 * ----------------------------------------------------------------------------------
 * prints a prompt. Note the manipulator 'flush' which forces the prompt to be
 * printed right away
 * ----------------------------------------------------------------------------------
 */
void prompt()
{
    cout << term_cc(BLUE) << "> " << term_cc() << flush;
}

/**
 * ----------------------------------------------------------------------------------
 * the body
 * ----------------------------------------------------------------------------------
 */
int main()
{
    map<string, cmd_handler_t> cmd_map;

    // simply add all commands to the map
    cmd_map["foreground"] = &print_color;
    cmd_map["exit"]       = &bye;
    cmd_map["bye"]        = &bye;
    cmd_map["quit"]       = &bye;

    // read inputs, call appropriate function from the map
    Lexer lex;
    string line;
    Token tok;
    while (cin) {
        prompt();
        getline(cin, line);
        lex.set_input(line);
        if (!lex.has_more_token()) continue;

        tok = lex.next_token();
        if (tok.type != IDENT) {
            error_return("Syntax error\n");
            continue;
        }

        if (cmd_map.find(tok.value) != cmd_map.end()) {
            cmd_map[tok.value](lex);
        } else {
            error_return("Unknown command");
        }
    }
    return 0;
}
[/sourcecode]
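
To see why this design extends easily, here is a simplified, self-contained sketch of the same dispatch idea. It is not the lecture's code: the handler type handler_t takes the rest of the line as a plain string and returns a description of what would be done, rather than taking a Lexer, and all names (do_foreground, do_background, make_table, dispatch) are invented for this example. The point is that supporting a new command such as background amounts to writing one handler and adding one line to the table; the dispatching code does not change.

```cpp
#include <cassert>
#include <map>
#include <string>

// Simplified handler type for this sketch: takes the rest of the line and
// returns a description of the action (the lecture's handlers take a Lexer).
typedef std::string (*handler_t)(const std::string&);

std::string do_foreground(const std::string& args) { return "fg:" + args; }
std::string do_background(const std::string& args) { return "bg:" + args; }

// Builds the dispatch table; adding a command touches only this function.
std::map<std::string, handler_t> make_table() {
    std::map<std::string, handler_t> table;
    table["foreground"] = &do_foreground;
    table["background"] = &do_background;   // the new command: one extra line
    return table;
}

// Looks up the command and calls its handler, or reports an unknown command.
std::string dispatch(const std::string& cmd, const std::string& args) {
    static const std::map<std::string, handler_t> table = make_table();
    std::map<std::string, handler_t>::const_iterator it = table.find(cmd);
    if (it == table.end()) return "unknown command";
    return it->second(args);
}

// Small usage check; call it from main() to try the table.
void dispatch_demo() {
    assert(dispatch("background", "red") == "bg:red");
    assert(dispatch("nosuch", "") == "unknown command");
}
```

In the real program the same step would be one more cmd_map[...] assignment in main() plus a new handler in cmd.cpp.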

4. Some error handling routines

I wrote a few error reporting routines which might be useful. The files term_control.h and term_control.cpp were provided in the previous lecture and will not be repeated here.

[sourcecode language="cpp"]
/*
 * *****************************************************************************
 * file name   : error_handling.h
 * author      : Hung Ngo
 * description : error reporting functions
 * *****************************************************************************
 */
#ifndef _ERROR_HANDLING_H
#define _ERROR_HANDLING_H

#include <string>

void error_quit(std::string);
void error_return(std::string);
void print_warning(std::string);
void note(std::string);

#endif
[/sourcecode]

and here is the implementation of the error reporting routines

[sourcecode language="cpp"]
/**
 * *****************************************************************************
 * file name   : error_handling.cpp
 * author      : Hung Ngo
 * description : error reporting functions
 * *****************************************************************************
 */
#include <iostream>
#include <string>
#include <cstdlib> // for exit()
#include "term_control.h"
#include "error_handling.h"

using std::cerr;
using std::endl;
using std::string;

/**
 * -----------------------------------------------------------------------------
 * report an error and quit
 * -----------------------------------------------------------------------------
 */
void error_quit(string msg)
{
    cerr << term_cc(RED) << "** FATAL ERROR: " << msg << endl << term_cc();
    exit(1); // exit on error
}

/**
 * -----------------------------------------------------------------------------
 * report an error and return
 * -----------------------------------------------------------------------------
 */
void error_return(string msg)
{
    cerr << term_cc(RED) << "** ERROR **\n" << msg << endl << term_cc();
}

/**
 * -----------------------------------------------------------------------------
 * report a warning
 * -----------------------------------------------------------------------------
 */
void print_warning(string msg)
{
    cerr << term_cc(YELLOW) << "== Warning ==\n" << msg << endl << term_cc();
}

/**
 * -----------------------------------------------------------------------------
 * print a note to myself
 * -----------------------------------------------------------------------------
 */
void note(string msg)
{
    cerr << term_cc(MAGENTA) << "-- Note --\n" << msg << endl << term_cc();
}
[/sourcecode]

There is also a Makefile for all these. See this page, which contains the source code and the Makefile. Download them all to the same directory. Running make cmd_test produces an executable called cmd_test; running make lextest produces the lextest executable.