How to Compile C/C++/Java Expressions: A Stack-Language Semantics (2007 update) By KWR for CSE305, expanding notes pre-1995 by Stuart Shapiro and Kulbir Arora ------------------------------------------------------------------------------- Besides the usual arithmetic operators, C, C++, and Java also consider assignment (=) to be a binary /operator/, likewise += and so on. They also have unary pre- and post- inc/decrement operators, and these bind tightest of all. All of these require the target to be an /lvalue/, i.e. a variable or array element or some other designator of a storage object. That said, every expression E has a unique translation into "stack code" defined as follows: 1. Parse E and build the expression tree (not the parse tree) for E. Leave the following tightest-binding operators in their original form as unary operation nodes with "x" underneath---parens help for clarity: (x[i]), (&x), (++x), (--x), (x++), (x--) 2. Label every leaf according to whether it is an "rvalue" or an "lvalue". Recall that lvalues are on the left-hand side of assignment operations and under pre- and post- ++ and --. Array accesses x[i] can be either lvalue or rvalue. Propagate this information up the tree as needed. 3. Do a postorder transversal to convert E to Postfix form. Change (++x) to "x pre++" and (x++) to "x post++", and similar with --. Maybe write "x post*" for "*x". Now we can translate E into stack-based bytecode. 4. In an lvalue context, a variable name x is translated to itself ("x"), and x[i] is translated as "x s i lo - * +", where s is the size in bytes of the array element type (= sizeof(a[i])) and lo is the lowindex. In an rvalue context, x is translated as "x fetch", and "x[i]" as "x s i lo - * + fetch". For 0-based arrays this is just x s i * + fetch If i is a variable not a constant then replace i by i fetch 5. "(&x)" may only occur in an rvalue context, and is also translated "x". Pointer * (placed in postfix in step 1.) may occur in either context and is translated as "fetch". In an lvalue context, *x (or x post*) becomes "x fetch", while in an rvalue context, it becomes x fetch fetch. C and C++ allow "pointer arithmetic", but we do not cover it here. 6. x pre++ = "x x fetch 1 + store", x pre-- = "x x fetch 1 - store". x post++ = "x fetch x x fetch 1 + store pop", x post-- is similar. 7. Semicolon ; translates as "pop"; assignment = translates as "store"; any subexpression x += E is translated the same way as x = x + E. E.g.: x += 5 is translated as "x dup fetch 5 + store", and x[i] += 5 as "x s i 1 - * + dup fetch 5 + store. Semantics of Execution, with stack growing to the right: () Variable names stand for their binding addresses, literals for themselves. () x: [...] => [...][x]; literal: [...] => [...][lit] (so "push" is tacit). () dup: [...][top] ==> [...][top][top] () pop: [...][top] ==> [...] () store: [...][x][top] => [...][top]; x now has value top. Leaving "[top]" on the stack enables us to read from the middle identifier "y" in a "multiple assignment" such as "x = y = z;" though marked as lvalue. () fetch: [...][x] => [...][v]; v = value of storage object x. () Binary operators op: [...][2nd][top] ==> [...][2nd op top] Note: subtraction [...][a][b] - is a-b, not b-a! () Unary operators unop: [...][top] => [...][unop(top)]. 8. Later we will extend this language to compile if-then-else and for-loops by postulating an Instruction Counter (IC), which *always* holds the memory address of the stored machine-language instruction being evaluated. Commands: () here: [...] => [...][value of IC]. () goto: [...][top] => [...], IC := top. () jmpif0: [...][2nd][0] => [...], IC := 2nd; [...][2nd][non-zero] => [...] (That is, no change to IC if top != 0). The stack language is now /complete/ in the sense that theoretically any numerical program can be compiled into it. Examples (parentheses are removed in real postfix, shown for readability only): () x = 0; x = x++; translates as x 0 store pop x (x fetch x x fetch 1 + store pop) store pop. On execution: ^^x has value 0 after first "pop", with stack empty. Then: [] => [x] => [x][x] => [x][0] =>2 [x][0][x][x] => [x][0][x][0] => [x][0][x][0][1] => [x][0][x][1] => [x][0][1] (x now has value 1) => [x][0] (finally done executing "x++", now process last "store pop") => [0] => [], and x winds up with value 0. While Java gives x = 0, all gcc/g++ options give 1! () x = 0; x = ++x; is: x 0 store pop x x x fetch 1 + store store pop Yes, it has "x" 3 times in a row! All agree on y = 1 here. () a = b + (c = d / b++) - 1; Parsed as a = ((b + (c = (d / (b++)))) - 1); When we translate into Postfix, use "_l" to mark lvalues. Postfix: a_l b c_l d (b++) / = + 1 - = ; Stack-language translation: a b fetch c d fetch (b fetch b b fetch 1 + store pop) / store + 1 - store pop ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ Execution on the arguments b = 1, c = 10, d = 100 shown in "Sebesta7p334.java", with stack shown at places marked ^ above: [a][1][c][100] => [a][1][c][100][1][b][1][1] => [a][1][c][100][1][2] (b = 2) => [a][1][c][100][1] => [a][1][c][100] (since 100 / 1 = 100) => [a][1][100] (now c = 100) => [a][101] => [a][101][1] => [a][100] => [] and a ends up with the final value 100. This agrees with Java, and disagrees with Sebesta's "register-based" translation on page 334 (7th ed., toward the end of the "Expressions" chapter). () For a final example, suppose we have the extreme-looking assignment statement **p = &q + **r; This gets translated as **p = &q + **r; --> p fetch fetch q r fetch fetch fetch + store pop Just note that & subtracts a fetch, * represents a fetch, and variables in /rvalue/ contexts contribute a fetch by themselves. Note also that &p = ...; can't happen, because the "&p" to the left of = would call for -1 fetches!