Suggested Solutions

Chapter 1 - Introduction

  1. A translator transforms a program from one representation to another, often a more primitive one. The input to a translator is written in the source language; the output is written in the object language.

    A compiler is a translator whose source language is a high-level language, e.g. Pascal, Ada or FORTRAN, and whose object language is a machine- or assembler language.

    An interpreter does not translate the source program but interprets it directly – no object code is generated. In practice, some translation to an intermediate form, which is then interpreted, is usually performed.

    An assembler is a translator whose source language is symbolic machine code (assembler code) and whose output is machine code.

    A preprocessor, finally, translates from an extended high-level language to a high-level language and is perhaps best characterised by simple syntactic transformations, e.g. macro expansion.

  2. A high-level language enables simpler and more natural programming, readable code, shorter development time, more efficient error diagnosis and increased portability.

  3. The compiler can be coarsely divided into an analysis part and a synthesis part. The analysis part breaks the source program down into an internal structure which is then used to build the object program.

    The analysis part consists of three phases (see the course book for a picture):

    Lexical analysis

    Groups the source program’s character sequence into tokens.

    Syntactic analysis

    Determines whether the token sequence from the scanner is grammatically correct. If this is the case, a parse tree is constructed.

    Semantic analysis

    Checks, for example, that variables are declared, types match, parameters to the functions are correct in type and number etc.

    The synthesis starts from some kind of internal representation and can comprise the following phases:

    Intermediate code generation

    The parse tree is broken down into a linear representation with explicit jumps, e.g. Polish notation or quadruples.

    Code optimization

    Performs machine-independent optimization, i.e. produces a faster and/or more compact internal form.

    Code generation

    Transforms the internal form to assembler or machine code, performs register allocation and carries out, if possible, machine-dependent optimizations.

    In addition a symbol table manager is required whose task is to store those symbols which appear in the source program and help other phases during compilation, e.g. by arranging the symbols in lexical levels (scope).

  4. The number of passes is the number of times the source program, possibly in some internal representation, is analysed in its entirety. A compiler which is divided into several passes can be written as a collection of independent modules and can manage larger source programs, as the whole internal form does not need to be stored in memory. A multi-pass compiler is probably slower, but certain languages require several passes because of forward references; e.g. an assembler usually calculates the values of labels during the first pass and generates code in the second, when the locations of the jump instructions are known. Pascal is an example of a language which can be compiled with a one-pass compiler; you have to, for example, forward-declare functions and procedures which are called before they have been defined, as in the sketch below.
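
    A minimal sketch (illustrative, not from the course material): thanks to the forward declaration the compiler has already seen odd's heading when the call inside even is reached, so a single pass suffices:

    program forwarddemo(output);

    function odd(n : integer) : boolean; forward;

    function even(n : integer) : boolean;
    begin
      if n = 0 then
        even := true
      else
        even := odd(n - 1)       (* legal: odd's heading is already known here *)
    end;

    function odd;                (* the heading was given at the forward declaration *)
    begin
      if n = 0 then
        odd := false
      else
        odd := even(n - 1)
    end;

    begin
      writeln(even(10))
    end.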

  5. A compiler-compiler is a tool for the compiler constructor. The ideal compiler-compiler produces a complete compiler given a specification of the source language and target machine. It is more common, however, that a certain phase in the compiler has been automated, e.g. you can produce a parser from a description of the source language’s grammar. At present lexical and syntactic analysis are automated, and recently it has become possible to formally specify code generators. What remains is to be able to efficiently generate the part of the compiler which performs semantic analysis and translates the source program to an internal form.

Chapter 3 - Lexical Analysis

  1. The scanner’s task is to read the input stream and group the characters into tokens (basic symbols) which are then returned to the parser. Examples of tokens include integers, identifiers and left parentheses. It is practical to let the scanner perform the conversion of digit strings to integers and the installation of identifiers in the symbol table. The scanner also throws out redundant sequences of characters, e.g. blanks and comments.

  2. It is possible to construct a parser without a scanner, but then the grammar has to describe how integers appear, where spaces and comments can appear etc., which results in a slower parser and a grammar which is difficult to read.

  3. An alphabet \Sigma is a finite set of symbols; A string, x \in \Sigma^*, is a finite sequence of symbols from an alphabet; the length of a string, \vert x \vert, is the number of symbols in the string; the empty string \epsilon has the property \vert \epsilon \vert = 0; concatenation of two strings x and y is achieved by simply joining the two strings together: xy; the string exponent x^n is equivalent to n concatenations of x with itself:

    x^n =
        \left\{ \begin{array}{ll}
                        \epsilon & \mbox{if $n=0$} \\
                        x x^{(n-1)} & \mbox{if $n \geq 1$}
                \end{array}
        \right.

  4. A language L \subset \Sigma^* is a subset of all the strings that can be formed from an alphabet. A language can be a finite or infinite set (\Sigma^* is, of course, infinite). Concatenation of two languages L and M is denoted as LM and is defined as LM = \lbrace xy \vert x \in L \wedge y \in M \rbrace. The closure of a language L is denoted L^* and is defined as \bigcup_{i=0}^{i=\infty} L^i. The positive closure, L^+, is defined as \bigcup_{i=1}^{i=\infty} L^i.

  5. AB = {a0,a1,b0,b1},
    A* = {\epsilon,a,b,aa,ab,ba,bb,aaa,aab,aba,abb,...}, and
    A+ = {a,b,aa,ab,ba,bb,aaa,aab,aba,abb,...}.
  6. Regular expressions and their correspondences in a language:

    Regular expression r          Language L_r
    \epsilon                      \lbrace \epsilon \rbrace
    a \in \Sigma                  \lbrace a \rbrace
    Union: (r) \vert (s)          L_r \cup L_s
    Concatenation: (r)(s)         L_r L_s
    Repetition: (r)^*             L_r^*
    Repetition: (r)^+             L_r^+

    Example:

    Regular expression Language
    a \lbrace a \rbrace
    a^* \lbrace \epsilon, a, aa, aaa, \ldots \rbrace
    (a \vert b) \lbrace a,b \rbrace
    (a \vert b)^+ \lbrace a,b,aa,ab,ba,bb,aab, \ldots \rbrace
    (a \vert ba)^* \lbrace (a \vee ba)^i \vert i \geq 0 \rbrace
  7. A deterministic finite automaton (DFA) is a machine representation of a regular expression. It takes a string and answers yes or no depending on whether the string is included in the language or not. Formally a DFA is defined by a finite set of states S where one of these, s_0 \in S, is the start state and a subset of them, F \subset S, are final states. Furthermore, the machine has an alphabet \Sigma and a transition function f: \Sigma \times S \rightarrow S. The machine is initially in s_0, reads a symbol c \in \Sigma, and then goes to the next state f(c, s_0), reads the next symbol etc. The string is accepted if the machine finishes in a state s \in F.

    A transition graph is a graphical representation of the machine; states are nodes and edges marked with symbols illustrate the transition function. The start state s_0 is marked (as in the figure) and the final state has double rings.

    digraph diesel_11_activation_ex1 {
  node [shape=circle, fontname="Courier New"];
  rankdir=LR;
  start[shape=plaintext];
  s0[label=<s<SUB>0</SUB>>];
  s1[label=<s<SUB>1</SUB>>];
  s2[shape=doublecircle, label=<s<SUB>2</SUB>>];
  start -> s0;
  s0 -> s1 [label="a"];
  s0 -> s0 [label="b"];
  s1 -> s1 [label="a"];
  s1 -> s2 [label="b"];
  s2 -> s1 [label="a"];
  s2 -> s0 [label="b"];
}

    Transition graph for (a \vert b)^*ab, S=\lbrace s_0, s_1, s_2 \rbrace, \Sigma = \lbrace a, b \rbrace.

    A transition table is also just another way of viewing the machine; columns and rows contain symbols and states, respectively:

    Transition table for (a \vert b)^*ab
      a b final state?
    s_0 s_1 s_0 no
    s_1 s_1 s_2 no
    s_2 s_1 s_0 yes
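
    The machine can be coded directly from the table; a minimal sketch in Pascal (reading one line of a's and b's is an assumption of the example, and any symbol other than 'a' is treated as 'b'):

    program dfademo(input, output);
    var s : 0..2;                  (* current state, s0..s2 *)
        c : char;
    begin
      s := 0;
      while not eoln do begin
        read(c);
        case s of                  (* one row of the transition table per state *)
          0 : if c = 'a' then s := 1 else s := 0;
          1 : if c = 'a' then s := 1 else s := 2;
          2 : if c = 'a' then s := 1 else s := 0
        end
      end;
      if s = 2 then                (* s2 is the only final state *)
        writeln('accept')
      else
        writeln('reject')
    end.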
  8. A non-deterministic finite automaton, NFA, differs from a DFA in the following respects:

    • A DFA has no edges labelled with the empty string \epsilon; an NFA can have such edges.
    • A DFA has, for an arbitrary state, exactly one outgoing edge per symbol [1]; an NFA can have arbitrarily many.
  9. A scanner is a program which is constructed by (1) describing the tokens using regular expressions; (2) constructing a DFA (possibly by first constructing an NFA and then transforming it into a DFA); (3) drawing transition diagrams and/or (4) encoding the diagram as a table or a program.

Chapter 4 - Syntax Analysis

  1. A grammar is used to describe a language’s syntax. A CFG (context-free grammar) is a quadruple G = < N, \Sigma, P, S > where N is the set of non-terminals, \Sigma is the set of terminals, P is the set of productions of the form A \rightarrow \alpha, A \in N, \alpha \in (N \cup \Sigma)^*, and S \in N is the start symbol. Example:

    <\lbrace S, A, B \rbrace,
  \lbrace 0,1,2,3,4,5,6,7,8,9 \rbrace,
  \lbrace
    S \rightarrow A,
    A \rightarrow A B \mid B,
    B \rightarrow 0 \vert 1 \vert 2 \vert 3 \vert 4 \vert 5
                \vert 6 \vert 7 \vert 8 \vert 9
  \rbrace,
  S
>

    A BNF-grammar is a context-free grammar (even if the notation can vary in the literature). The non-terminal symbols appear to the left of “\rightarrow” while the right side can be arbitrary strings of terminal and non-terminal symbols. The union N \cup \Sigma of terminal and non-terminal symbols is sometimes called the vocabulary.

    You say that a grammar G generates a language L(G) which simply consists of all the strings that can be derived from the grammar, L(G) = \lbrace \alpha \vert S \mathop \Rightarrow\limits^{\mbox{\tiny\rm +}} \alpha, \alpha \in \Sigma^* \rbrace. In the example above the language consisting of all the natural numbers is generated.

  2. If we have a production A \rightarrow \delta we can perform a direct derivation \gamma A \theta \mathop \Rightarrow \gamma \delta \theta. The \mathop \Rightarrow\limits^{\mbox{\tiny\rm *}} symbol is used to show that zero or more direct derivations have been made. If in each derivation step you expand the non-terminal which appears furthest to the left with one of its right sides, it is said that you are performing a left derivation. A right derivation is defined analogously. A string \alpha is a sentential form if it can be derived from the start symbol S, i.e. S \mathop \Rightarrow\limits^{\mbox{\tiny\rm *}} \alpha, \alpha \in V^* where V is the vocabulary. A sentence is a sentential form which only consists of terminal symbols, i.e. w is a sentence if S \mathop \Rightarrow\limits^{\mbox{\tiny\rm +}} w, w \in \Sigma^*. A parse tree is a graphical description of a derivation. The start symbol is the root, the terminal symbols are leaves and the non-terminal symbols are internal nodes. Each new level represents a direct derivation step. You cannot tell from a parse tree whether a left or a right derivation has been performed.

  3. A grammar is ambiguous if a sentence has several parse trees. Ambiguous grammars should be avoided as you do not want several interpretations of a program.

  4. Ambiguous grammar:

    E_1 \rightarrow E_1 IMP E_1 \vert E_2
    E_2 \rightarrow E_2 EQU E_2 \vert E_3
    E_3 \rightarrow E_3 OR E_3 \vert E_4
    E_4 \rightarrow E_4 AND E_4 \vert E_5
    E_5 \rightarrow NOT E_5 \vert E_6
    E_6 \rightarrow (E_1) \vert ident

    Unambiguous grammar (left associative):

    E_1 \rightarrow E_1 IMP E_2 \vert E_2
    E_2 \rightarrow E_2 EQU E_3 \vert E_3
    E_3 \rightarrow E_3 OR E_4 \vert E_4
    E_4 \rightarrow E_4 AND E_5 \vert E_5
    E_5 \rightarrow NOT E_5 \vert E_6
    E_6 \rightarrow (E_1) \vert ident

    Unambiguous grammar (right associative):

    E_1 \rightarrow E_2 IMP E_1 \vert E_2
    E_2 \rightarrow E_3 EQU E_2 \vert E_3
    E_3 \rightarrow E_4 OR E_3 \vert E_4
    E_4 \rightarrow E_5 AND E_4 \vert E_5
    E_5 \rightarrow NOT E_5 \vert E_6
    E_6 \rightarrow (E_1) \vert ident

    Derivation of a OR b AND NOT c using the left associative grammar:

    E_1 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} E_2 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} E_3 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} E_3 OR E_4 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} E_4 OR E_4 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} E_5 OR E_4 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}}
    E_6 OR E_4 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} a OR E_4 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} a OR E_4 AND E_5 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} a OR E_5 AND E_5 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}}
    a OR E_6 AND E_5 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} a OR b AND E_5 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} a OR b AND NOT E_5 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}}
    a OR b AND NOT E_6 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} a OR b AND NOT c

    Derivation of (x IMP y) EQU z AND NOT u using the left associative grammar:

    E_1 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} E_2 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} E_2 EQU E_3 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} E_3 EQU E_3 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} E_4 EQU E_3 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}}
    E_5 EQU E_3 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} E_6 EQU E_3 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} (E_1) EQU E_3 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} (E_1 IMP E_2) EQU E_3 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}}
    (E_2 IMP E_2) EQU E_3 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} (E_3 IMP E_2) EQU E_3 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} (E_4 IMP E_2) EQU E_3 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}}
    (E_5 IMP E_2) EQU E_3 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} (E_6 IMP E_2) EQU E_3 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} (x IMP E_2) EQU E_3 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}}
    (x IMP E_3) EQU E_3 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} (x IMP E_4) EQU E_3 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} (x IMP E_5) EQU E_3 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}}
    (x IMP E_6) EQU E_3 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} (x IMP y) EQU E_3 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} (x IMP y) EQU E_4 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}}
    (x IMP y) EQU E_4 AND E_5 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} (x IMP y) EQU E_5 AND E_5 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}}
    (x IMP y) EQU E_6 AND E_5 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} (x IMP y) EQU z AND E_5 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}}
    (x IMP y) EQU z AND NOT E_5 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} (x IMP y) EQU z AND NOT E_6 \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}}
    (x IMP y) EQU z AND NOT u

  5. digraph figure4 {
  node [shape=none, fontname="Courier New"];
  rankdir=TB;
  ordering="out";
  E1_1[label=<E<sub>1</sub>>];
  E2_1[label=<E<sub>2</sub>>];
  E3_1[label=<E<sub>3</sub>>];
  E3_2[label=<E<sub>3</sub>>];
  E4_2[label=<E<sub>4</sub>>];
  E5_2[label=<E<sub>5</sub>>];
  E6_2[label=<E<sub>6</sub>>];
  E4_3[label=<E<sub>4</sub>>];
  E4_4[label=<E<sub>4</sub>>];
  E5_4[label=<E<sub>5</sub>>];
  E6_4[label=<E<sub>6</sub>>];
  E5_5[label=<E<sub>5</sub>>];
  E5_6[label=<E<sub>5</sub>>];
  E6_6[label=<E<sub>6</sub>>];
  E1_1 -> E2_1;
  E2_1 -> E3_1;
  E3_1 -> E3_2;
  E3_2 -> E4_2;
  E4_2 -> E5_2;
  E5_2 -> E6_2;
  E6_2 -> a;
  E3_1 -> OR;
  E3_1 -> E4_3;
  E4_3 -> E4_4;
  E4_4 -> E5_4;
  E5_4 -> E6_4;
  E6_4 -> b;
  E4_3 -> AND;
  E4_3 -> E5_5;
  E5_5 -> NOT;
  E5_5 -> E5_6;
  E5_6 -> E6_6;
  E6_6 -> c;
}

    Parse tree for a OR b AND NOT c

    digraph figure4 {
  node [shape=none, fontname="Courier New"];
  rankdir=TB;
  ordering="out";
  E1_1[label=<E<sub>1</sub>>];
  E2_1[label=<E<sub>2</sub>>];
  E2_2[label=<E<sub>2</sub>>];
  E3_2[label=<E<sub>3</sub>>];
  E4_2[label=<E<sub>4</sub>>];
  E5_2[label=<E<sub>5</sub>>];
  E6_2[label=<E<sub>6</sub>>];
  E1_3[label=<E<sub>1</sub>>];
  E2_3[label=<E<sub>2</sub>>];
  E3_3[label=<E<sub>3</sub>>];
  E4_3[label=<E<sub>4</sub>>];
  E5_3[label=<E<sub>5</sub>>];
  E6_3[label=<E<sub>6</sub>>];
  E2_4[label=<E<sub>2</sub>>];
  E3_4[label=<E<sub>3</sub>>];
  E4_4[label=<E<sub>4</sub>>];
  E5_4[label=<E<sub>5</sub>>];
  E6_4[label=<E<sub>6</sub>>];
  E3_5[label=<E<sub>3</sub>>];
  E4_5[label=<E<sub>4</sub>>];
  E4_6[label=<E<sub>4</sub>>];
  E5_6[label=<E<sub>5</sub>>];
  E6_6[label=<E<sub>6</sub>>];
  E5_7[label=<E<sub>5</sub>>];
  E6_7[label=<E<sub>6</sub>>];
  E1_4[label=<E<sub>1</sub>>];
  E5_8[label=<E<sub>5</sub>>];
  E1_1 -> E2_1;
  E2_1 -> E2_2;
  E2_2 -> E3_2;
  E3_2 -> E4_2;
  E4_2 -> E5_2;
  E5_2 -> E6_2;
  E6_2 -> "(";
  "(" -> E1_3;
  E1_3 -> E2_3;
  E2_3 -> E3_3;
  E3_3 -> E4_3;
  E4_3 -> E5_3;
  E5_3 -> E6_3;
  E6_3 -> x;


  E6_2 -> IMP;
  E6_2 -> ")";
  ")" -> E2_4;
  E2_4 -> E3_4;
  E3_4 -> E4_4;
  E4_4 -> E5_4;
  E5_4 -> E6_4;
  E6_4 -> y;

  E2_2 -> EQU;
  E2_2 -> E3_5;
  E3_5 -> E4_5;
  E4_5 -> E4_6;
  E4_6 -> E5_6;
  E5_6 -> E6_6;
  E6_6 -> z;

  E4_5 -> AND;
  E4_5 -> E5_7;
  E5_7 -> NOT;
  E5_7 -> E6_7;
  E6_7 -> u;
}

    Parse tree for (x IMP y) EQU z AND NOT u

  6. A parser (syntax analyser) determines whether the incoming sequence of tokens forms a structure which is legal according to the definition of the language. The parser accepts a string w if it is a sentence, i.e. it can be derived from the start symbol S, i.e. S \mathop \Rightarrow\limits^{\mbox{\tiny\rm +}} w, w \in \Sigma^*. The parser also has to call semantic functions and handle syntax errors as well as generating error messages.

  7. Canonical derivation is another name for right derivation. A left (right) sentential form is a sentential form which appears in a left (right) derivation. A handle is two things: a production A \rightarrow \beta and a position in a sentential form where the production has been applied. If S \mathop \Rightarrow\limits_{\mbox{\tiny\rm rm}}^{\mbox{\tiny\rm *}} \alpha A w \mathop \Rightarrow\limits_{\mbox{\tiny\rm rm}} \alpha \beta w, then the production A \rightarrow \beta and the position after \alpha form the handle of \alpha \beta w. A handle is used to get the previous sentential form in a right derivation. The canonical reduction sequence is also called reverse rightmost derivation. You get this sequence by starting with a sentence w and then successively eliminating handles until you reach the start symbol S:

    w = \gamma_n \mathop \Leftarrow\limits_{\mbox{\tiny\rm rm}} \gamma_{n-1} \mathop \Leftarrow\limits_{\mbox{\tiny\rm rm}} …\mathop \Leftarrow\limits_{\mbox{\tiny\rm rm}} \gamma_1 \mathop \Leftarrow\limits_{\mbox{\tiny\rm rm}} \gamma_0 = S

  8. In top-down parsing you start with the start symbol and try to derive (\mathop \Rightarrow\limits_{\mbox{\tiny\rm rm}} or \mathop \Rightarrow\limits_{\mbox{\tiny\rm lm}} ) the sentence which is given by the incoming token sequence. The parse tree will therefore grow downwards (root first, leaves last). In bottom-up parsing you go the other way and start with the incoming token sequence, which you try to reduce (reverse right derivation, \mathop \Leftarrow\limits_{\mbox{\tiny\rm rm}} ) to the start symbol. The parse tree will therefore be constructed upwards (leaves first, root last).

  9. The input sequence was a c b a b. A bottom-up parser would have applied the reductions in the following order: 4 2 3 4 2 1 1.

  10. A shift-reduce parser uses a stack. The parser works by shifting symbols from the incoming token sequence onto the top of the stack until a handle (i.e. a complete right side of some rule) lies on top of the stack (never deeper inside it). The handle is then reduced to its corresponding left side (i.e. a non-terminal). If there is no more input and only the start symbol is on the stack, the string is accepted; otherwise a syntax error has occurred.

  11. Stack                  Input        Action
    0                      a c b a b $  shift 2
    0 a 2                  c b a b $    shift 4
    0 a 2 c 4              b a b $      reduce A → ε
    0 a 2 c 4 A 7          b a b $      shift 3
    0 a 2 c 4 A 7 b 3      a b $        reduce S → b
    0 a 2 c 4 A 7 S 8      a b $        reduce A → c A S
    0 a 2 A 5              a b $        shift 2
    0 a 2 A 5 a 2          b $          reduce A → ε
    0 a 2 A 5 a 2 A 5      b $          shift 3
    0 a 2 A 5 a 2 A 5 b 3  $            reduce S → b
    0 a 2 A 5 a 2 A 5 S 6  $            reduce S → a A S
    0 a 2 A 5 S 6          $            reduce S → a A S
    0 S 1                  $            accept
  12. Assume that we do not have any look-ahead and that we try the right sides from left to right. At the beginning we have no choice other than to expand the first production. Then we choose the production \{string\} \rightarrow a and the following tree arises:

    digraph figure5 {
  node [shape=none, fontname="Courier New"];
  rankdir=TB;
  onestring[label="<onestring>"];
  string[label="<string>"];
  onestring -> string;
  string -> a;
  end[label="$"];
  onestring -> end;
}

    From here we cannot continue, so we backtrack and try the next production, \{string\} \rightarrow a \{endpart\}. We then have a choice again and select \{endpart\} \rightarrow b, ending up in the following situation:

    digraph figure6 {
  node [shape=none, fontname="Courier New"];
  rankdir=TB;
  ordering="out";
  onestring[label="<onestring>"];
  string[label="<string>"];
  endpart[label="<endpart>"];
  onestring -> string;
  string -> a;
  string -> endpart;
  endpart -> b;
  end[label="$"];
  onestring -> end;
}

    Obviously we have just made a wrong choice, so we try \{endpart\} \rightarrow e \{endpart\}. We continue to expand \{endpart\}, again try \{endpart\} \rightarrow b, and get stuck. Just as before we change the last choice to \{endpart\} \rightarrow e \{endpart\}; the last thing we do is take a chance on \{endpart\} \rightarrow b again, which succeeds, and thus we are finished:

    digraph figure7 {
  node [shape=none, fontname="Courier New"];
  rankdir=TB;
  ordering="out";
  onestring[label="<onestring>"];
  string[label="<string>"];
  endpart1[label="<endpart>"];
  endpart2[label="<endpart>"];
  endpart3[label="<endpart>"];
  e1[label="e"];
  e2[label="e"];
  onestring -> string;
  string -> a;
  string -> endpart1;
  endpart1 -> e1;
  endpart1 -> endpart2;
  endpart2 -> e2;
  endpart2 -> endpart3;
  endpart3 -> b;
  end[label="$"];
  onestring -> end;
}
  13. A left recursive production (e.g. E \rightarrow E + T) can not be implemented directly:

    procedure E;
    begin
      E;                        (* parse the left operand: E immediately calls E *)
      if token = '+' then begin
        scan;
        (* ⋮ *)
      end
    end;
    

    An infinite loop arises as we call the same procedure without consuming any token. Left recursion can be eliminated by transforming the production to right-recursive form:

    A \rightarrow   A \alpha_1 \vert A \alpha_2 \vert \ldots \vert A \alpha_m \vert \beta_1 \vert \beta_2 \vert \ldots \vert \beta_n

    (where \beta_i can not have A at the beginning) is transformed to:

    A \rightarrow \beta_1 A' \vert \beta_2 A' \vert \ldots \vert \beta_n A'
    A' \rightarrow \alpha_1 A' \vert \alpha_2 A' \vert \ldots \vert \alpha_m A' \vert \epsilon

    You can also rewrite the grammar in iterative form (EBNF notation) so that E \rightarrow E + T is rewritten as E \rightarrow T \lbrace + T \rbrace, where \lbrace \ldots \rbrace is coded as a while loop, as in the sketch below.
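
    A minimal sketch of the iterative form, assuming token, scan and a procedure T as in the parser of exercise 16:

    procedure E;                 (* E → T { + T } *)
    begin
      T;
      while token = '+' do begin
        scan;
        T
      end
    end;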

  14. Backtracking takes a lot of time and makes error management and the choice of productions difficult. You must also be prepared to “undo” semantic actions (e.g. insertions into the symbol table or generated code).

  15. A recursive descent parser is a top-down parser. It has one token look-ahead which means that the current token is expected to come next in one of the right sides you are working on. If no right side matches, you have a syntax error; if several right sides match, you have ambiguity. By definition a recursive descent parser can not use backtracking.

    In an implementation you must first eliminate left recursion and left factorise the right side (alternatively you can rewrite the grammar in EBNF form). Each nonterminal is then implemented as a procedure where each right side of the nonterminal is considered. You will find an example in the solution to the next exercise.


  16. program parser(input, output);
    
    label 99;
    
    var token : char;
    
    procedure S; forward;
    procedure A; forward;
    
    procedure scan;
    begin
      read(token);
    end;
    
    procedure S;
    begin
      if (token = 'a') then begin (* S → a A S *)
        scan;
        A;
        S;
      end else if (token = 'b') then begin (* S → b *)
        scan;
      end else begin
        writeln('expected a,b');
        goto 99;
      end;
    end;
    
    procedure A;
    begin
      if (token = 'c') then begin (* A → c A S *)
        scan;
        A;
        S;
      end else (* A → ε *)
        ;
    end;
    
    begin
      scan; (* one token lookahead *)
      S;
    99 : end.
    
  17. An LR parser is a shift-reduce parser. It works on three data structures: the input (tokens), which is read one symbol at a time; a stack of symbol/state pairs (the symbol is not actually needed); and two parse tables, ACTION and GOTO (see the figure in the book). The LR parser driver is always the same; the tables can be generated automatically by a parser generator given a grammar for the language. The ACTION table contains information about the next action, i.e. shift, reduce, accept or error. The GOTO table specifies the next state after a reduce action.

    Advantages:

    • can manage, in general, all deterministic context-free languages
    • is more general (no rewriting of the grammar is needed)
    • parses quickly
    • discovers errors as soon as this is possible

    Disadvantages:

    • it is difficult to fit in the semantics (only when reduce is performed)
    • almost hopeless to generate the tables by hand
  18. A configuration represents the current state of the LR parser and therefore consists of the stack, the current token and the remaining input.

    The start configuration is ( s_0, a_1~a_2 \ldots a_n \vdash ) where s_0 is the start state and a_1 is the current token. (\vdash is an end mark.) Assume that the current configuration is

    ( s_0~X_1~s_1~X_2~s_2 \ldots X_m~s_m, a_i~a_{i+1} \ldots a_n \vdash )

    s_0 \ldots s_m are states, X_1 \ldots X_m are vocabulary symbols and a_i \ldots a_n are unread tokens.

    If ACTION\lbrack s_m, a_i \rbrack = shift s we get the configuration

    ( s_0~X_1~s_1~X_2~s_2 \ldots X_m~s_m~a_i~s, a_{i+1} \ldots a_n \vdash )

    If ACTION\lbrack s_m, a_i \rbrack = reduce A \rightarrow \beta we get

    ( s_0~X_1~s_1~X_2~s_2 \ldots X_{m-r}~s_{m-r}~A~s, a_i~a_{i+1} \ldots a_n \vdash )

    where r = \vert \beta \vert and s = GOTO\lbrack s_{m-r}, A\rbrack

    The other two cases, accept and error, mean that either the parser is finished or that an error has occurred.

  19. The two tables are named ACTION and GOTO. For a given state s_m and a token a_i: ACTION\lbrack s_m, a_i \rbrack can be one of the four actions: shift, reduce, accept or error:

    {\rm ACTION}\lbrack s_m, a_i \rbrack =
        \left\{ \begin{array}{l}
      \mbox{{\sl shift\/} $s$} \\
      \mbox{{\sl reduce\/} $A \rightarrow \beta$} \\
      \mbox{{\sl accept\/}} \\
      \mbox{{\sl error\/}}
                \end{array}
        \right.

    The GOTO table is used after each reduce action to get the next state, i.e. GOTO\lbrack s_{m-r}, A \rbrack where r = \vert \beta \vert is the length of the right side of the production which was reduced.

    The LR parser therefore has the following appearance:

    done := false;
    while not done do begin
      case ACTION[s[m], a[i]] of
        shift s:
          (* push a[i] and the new state s onto the stack *)
          configuration := (* see the question above *);
        reduce p: (* p: A → β *)
          callSemanticRoutineForRule(p);
          (* remove the handle β from the stack and replace it with A *)
          configuration := (* see the question above *);
          s := GOTO[s[m-r], A]; (* r = |β| *)
        accept:
          done := true;
        error:
          errorRoutine();
      end;
    end;
    
  20. Stack Input Action
    0 ( id + id ) * id $ shift 4
    0 ( 4 id + id ) * id $ shift 5
    0 ( 4 id 5 + id ) * id $ reduce F → id
    0 ( 4 F 3 + id ) * id $ reduce T → F
    0 ( 4 T 2 + id ) * id $ reduce E → T
    0 ( 4 E 8 + id ) * id $ shift 6
    0 ( 4 E 8 + 6 id ) * id $ shift 5
    0 ( 4 E 8 + 6 id 5 ) * id $ reduce F → id
    0 ( 4 E 8 + 6 F 3 ) * id $ reduce T → F
    0 ( 4 E 8 + 6 T 9 ) * id $ reduce E → E + T
    0 ( 4 E 8 ) * id $ shift 11
    0 ( 4 E 8 ) 11 * id $ reduce F → ( E )
    0 F 3 * id $ reduce T → F
    0 T 2 * id $ shift 7
    0 T 2 * 7 id $ shift 5
    0 T 2 * 7 id 5 $ reduce F → id
    0 T 2 * 7 F 10 $ reduce T → T * F
    0 T 2 $ reduce E → T
    0 E 1 $ accept
  21. A viable prefix is a prefix of a right sentential form which does not contain any symbols to the right of the handle.

    An LR(0) item of a production P is P with a dot \bullet somewhere in the right side.

    The canonical collection of LR(0) items for a grammar is a number of sets of LR(0) items. These sets form the states of a so-called simple LR parser and can be considered as a DFA for recognising viable prefixes (the individual items can be seen as the states of a corresponding NFA).

    Extending the grammar is something we must do so that we know for certain when accept has occurred. We do this by introducing a new start symbol S' and a corresponding production

    S' \rightarrow S $

    where $ marks the end of the string. A reduce by this production means accept.

    A shift-reduce conflict can arise when you try to produce the ACTION table for an LR parser. It occurs when you have the following two items in a certain item-set:

    • <A> → α ●
    • <B> → γ ● a β

    where a is a terminal. Here you can choose between reducing by the first production or shifting a in the second.

    A reduce-reduce conflict occurs when you have the following two items in a certain item-set:

    • <A> → α ●
    • <B> → β α ●

    Here we have more than one so-called complete item and we do not know which production to reduce.

  22. Each state has a number of LR(0) items. Assume that we are working on state i. If there is an item:

    <A> → α ● a β

    and an edge marked with the terminal a goes from state i to state j, set ACTION[i, a] = shift j.

    If there is a complete item:

    <A> → α ●

    set ACTION[i, a] = reduce x where x is the production number for A → α and a is either every terminal (LR(0) table) or just those included in FOLLOW(A) (SLR(1) table). If it is the start symbol we are reducing, then of course ACTION[i, a] = accept. If none of the above apply, then ACTION[i, a] = error.

    The GOTO table is simple to fill in. If an edge from state i to j is marked with the nonterminal A, set GOTO[i, A] = j.


  23. digraph figure8 {
  node [shape=box, fontname="Courier New"];
  forcelabels=true;
  // rankdir=TB;
  // ordering="out";

  start[shape=none, label=""];

  S0[xlabel="0:",label="\
  <S'> → ● <S>\l\
  <S> → ● 1 <S> 1\l\
  <S> → ● 0 <S> 0\l\
  <S> → ● 2\l"
  ];

  S1[xlabel="1:",label="\
  <S'> → <S> ●\l"
  ];

  S2[xlabel="2:",label="\
  <S> → 1 ● <S> 1\l\
  <S> → ● 1 <S> 1\l\
  <S> → ● 0 <S> 0\l\
  <S> → ● 2\l"
  ];

  S3[xlabel="3:",label="\
  <S> → 1 <S> ● 1\l"
  ];

  S4[xlabel="4:",label="\
  <S> → 1 <S> 1 ●\l"
  ];

  S5[xlabel="5:",label="\
  <S> → 0 ● <S> 0\l\
  <S> → ● 1 <S> 1\l\
  <S> → ● 0 <S> 0\l\
  <S> → ● 2\l"
  ];

  S6[xlabel="6:",label="\
  <S> → 0 <S> ● 0\l"
  ];

  S7[xlabel="7:",label="\
  <S> → 0 <S> 0 ●\l"
  ];

  S8[xlabel="8:",label="\
  <S> → 2 ●\l"
  ];

  start -> S0 [label=" start "];
  S0 -> S1 [label="<S>"];
  S0 -> S2 [label=" 1 "];
  S0 -> S5 [label=" 0 "];
  S0 -> S8 [label=" 2 "];

  S2 -> S2 [label=" 1 "];
  S2 -> S3 [label="<S>"];
  S2 -> S5 [label=" 0 "];
  S2 -> S8 [label=" 2 "];

  S3 -> S4 [label=" 1 "];

  S5 -> S2 [label=" 1 "];
  S5 -> S5 [label=" 0 "];
  S5 -> S6 [label="<S>"];
  S5 -> S8 [label=" 2 "];

  S6 -> S7 [label=" 0 "];
}

    With FOLLOW(S) = {$, 1, 0} we get:

      $ 1 0 2 S
    0   S2 S5 S8 1
    1 A        
    2   S2 S5 S8 3
    3   S4      
    4 R1 R1 R1    
    5   S2 S5 S8 6
    6     S7    
    7 R2 R2 R2    
    8 R3 R3 R3    
  24. They deal with different classes of languages; LR(0) is the “weakest” and LR(k) the “strongest”. In other words LR(0) is more restrictive – it simply cannot parse certain languages. This is because an LR(0) parser lacks lookahead while an LR(k) parser has k tokens of lookahead.

    Another difference is that a powerful LR parser often has a significantly bigger table than a weaker variant.

  25. Many rows (states) in the ACTION table have identical content, so space can be saved by letting such rows share storage. The table below,

      $ 1 0 2 S
    0   S2 S5 S8 1
    1 A        
    2   S2 S5 S8 3
    3   S4      
    4 R1 R1 R1    
    5   S2 S5 S8 6
    6     S7    
    7 R2 R2 R2    
    8 R3 R3 R3    

    can be stored like this:

    digraph figure9 {
  node [shape=record, fontname="Courier New"];
  forcelabels=true;
  // splines=polyline;
  rankdir=LR;
  // ordering="out";

  table[label="<p0> 0|<p1> 1|<p2> 2|<p3> 3|<p4> 4|<p5> 5|<p6> 6|<p7> 7|<p8> 8"];

  table:p0 -> S0;
  table:p1 -> S1;
  table:p2 -> S0;
  table:p3 -> S3;
  table:p4 -> S4;
  table:p5 -> S0;
  table:p6 -> S6;
  table:p7 -> S7;
  table:p8 -> S8;

  S0[label="{-|S2|S5|S8}"];
  S1[label="{A|-|-|-}"];
  S3[label="{-|S4|-|-}"];
  S4[label="{R1|R1|R1|-}"];
  S6[label="{-|-|S7|-}"];
  S7[label="{R2|R2|R2|-}"];
  S8[label="{R3|R3|R3|-}"];
}

    Other compaction methods are based on “default actions” and replacing error actions by reduce actions to join more states together.

Chapter 5 - Syntax-Directed Translation

  1. By a syntax-directed translation schema we mean that with each rule in the grammar we associate a so-called semantic routine. The semantic routine is executed when reduce occurs (bottom-up parser) or during expansion (top-down parser).

    A semantic action is just a bit of code whose normal task is to calculate attributes which are associated with nonterminals, e.g.:

    E \rightarrow E_1 + E_2 \lbrace E.val := E_1.val + E_2.val \rbrace

    Here val is an attribute of E and the semantic action synthesises the value for the left side when the rule is reduced.

    The semantic stack can be seen as an extension of the parse stack with an extra field per attribute where the current right side’s attributes are stored. On reduction these values are popped and the left side’s attributes are calculated and pushed back.

  2. In top-down analysis a semantic rule is called when an expansion is performed. In bottom-up analysis a semantic rule is called when a reduction is performed.


  3. S → E       { print E.val }
    E → E₁ + T  { E.val := E₁.val + T.val }
      | T       { E.val := T.val }
    T → T₁ * F  { T.val := T₁.val * F.val }
      | F       { T.val := F.val }
    F → ( E )   { F.val := E.val }
      | integer { F.val := lexval }
    
    procedure semantic(rule : integer);
    begin
      case rule of
        1 : writeln(semstk[top]);
        2 : semstk[top-2] := semstk[top] + semstk[top-2];
        3 : ;
        4 : semstk[top-2] := semstk[top] * semstk[top-2];
        5 : ;
        6 : semstk[top-2] := semstk[top-1];
        7 : semstk[top] := lexval;
      end;
    end;
    
  4. See the lecture notes and the course book for some important control structures.

Chapter 6 - Intermediate-Code Generation

  1. Optimizations are easier to make on an internal form. The internal form is not tied to any particular machine, and furthermore it can form the basis for interpretation.

  2. In infix notation the operators are between the operands and parentheses and rules for priority and associativity are also needed to unambiguously interpret an expression.

    In postfix notation the operator comes after the operands — the operators come in the order they will be evaluated; the operands come in the same order as in infix notation. Neither parentheses, priority nor associativity rules are needed.

    An abstract syntax tree corresponds to a reduced parse tree. An abstract syntax tree contains no redundant information, e.g. no parentheses are needed as the structure is clear from the tree.

    Quadruples are of the form

    op arg1 arg2 res

    where op is the operation you want to perform, arg1 and arg2 are the arguments and res denotes the result. The last three are usually references to the symbol table.

    Triples are of the form

    op arg1 arg2

    and differ from quadruples in that the result of an operation is not written explicitly but represented by the position of the triple. The position can thus serve as an argument in some other triple.

  3. The advantage of triples is that you do not need temporary variables in the symbol table. The disadvantage is that it is hard work to move triples as you also have to change the references to them.

  4. Perform a post-order traversal of the tree, emitting each node's symbol after its subtrees have been emitted.
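
    A minimal sketch of the traversal (the node type is an assumption; printing each symbol after its subtrees yields postfix):

    type tree = ^node;
         node = record
                  sym : char;            (* operator or operand *)
                  left, right : tree     (* nil for leaves *)
                end;

    procedure postfix(t : tree);
    begin
      if t <> nil then begin
        postfix(t^.left);
        postfix(t^.right);
        write(t^.sym, ' ')               (* the operator comes last *)
      end
    end;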

  5. The stack is used for storing intermediate results. The expression is evaluated from left to right. If a number appears, it is pushed on top of the stack; if a variable appears, its address is pushed on top of the stack; if an operator appears, it is applied to the n uppermost stack elements, where n is the operator’s arity, and the result replaces them.

    Example: 34 + (-4 - 3 * 5)

    Step  Stack          Input
    1     ⊣              34 4 @ 3 5 * - + ⊢
    2     ⊣ 34           4 @ 3 5 * - + ⊢
    3     ⊣ 34 4         @ 3 5 * - + ⊢
    4     ⊣ 34 -4        3 5 * - + ⊢
    5     ⊣ 34 -4 3      5 * - + ⊢
    6     ⊣ 34 -4 3 5    * - + ⊢
    7     ⊣ 34 -4 15     - + ⊢
    8     ⊣ 34 -19       + ⊢
    9     ⊣ 15           ⊢

    Note

    @ denotes unary minus.
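
    A minimal sketch of the evaluator's core (the stack size, and the assumption that numbers arrive already converted to integers, are illustrative):

    var stk : array [1..100] of integer;
        top : integer;                   (* initialised to 0 *)

    procedure push(v : integer);
    begin
      top := top + 1;
      stk[top] := v
    end;

    procedure apply(op : char);
    begin
      case op of
        '@' : stk[top] := -stk[top];     (* unary minus: arity 1 *)
        '+' : begin stk[top-1] := stk[top-1] + stk[top]; top := top - 1 end;
        '-' : begin stk[top-1] := stk[top-1] - stk[top]; top := top - 1 end;
        '*' : begin stk[top-1] := stk[top-1] * stk[top]; top := top - 1 end
      end
    end;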

    • Postfix code:

      z x 20 + 15 * y 2 + a / - :=
      
      a 10 < L₁ JEQZ
        b 5 < L₂ JEQZ x a := L₃ JUMP
        L₂: x a b + := L₃ JUMP
      L₁: x a b - :=
      L₃:
      
      L₁: x 10 < L₂ JEQZ
        f f 10 x * + :=
        x x 1 - :=
        L₁ JUMP
      L₂:
      
    • Abstract syntax trees:

      digraph figure15 {
  node [shape=none, fontname="Courier New"];
  rankdir=TB;
  ordering="out";
  add1[label="+"];
  add2[label="+"];
  ":=" -> z;
  ":=" -> "-";
  "-" -> "*";
  "*" -> add1;
  add1 -> x;
  add1 -> 20;
  "*" -> 15;
  "-" -> "/";
  "/" -> add2;
  add2 -> y;
  add2 -> 2;
  "/" -> a;
} digraph figure16 {
  node [shape=none, fontname="Courier New"];
  rankdir=TB;
  ordering="out";
  if1[label="if"];
  if2[label="if"];
  lt1[label="<"];
  lt2[label="<"];
  ass1[label=":="];
  ass2[label=":="];
  ass3[label=":="];
  a1[label="a"];
  a2[label="a"];
  a3[label="a"];
  a4[label="a"];
  b1[label="b"];
  b2[label="b"];
  b3[label="b"];
  x1[label="x"];
  x2[label="x"];
  x3[label="x"];
  if1 -> lt1;
  lt1 -> a1;
  lt1 -> 10;
  if1 -> if2;
  if2 -> lt2;
  lt2 -> b1;
  lt2 -> 5;
  if2 -> ass1;
  ass1 -> x1;
  ass1 -> a2;
  if2 -> ass2;
  ass2 -> x2;
  ass2 -> "+";
  "+" -> a3;
  "+" -> b2;
  if1 -> ass3;
  ass3 -> x3;
  ass3 -> "-";
  "-" -> a4;
  "-" -> b3;
} digraph figure17 {
  node [shape=none, fontname="Courier New"];
  rankdir=TB;
  ordering="out";
  while -> "<";
  ass1[label=":="];
  ass2[label=":="];
  x1[label="x"];
  x2[label="x"];
  x3[label="x"];
  x4[label="x"];
  f1[label="f"];
  f2[label="f"];
  "10v1"[label="10"];
  "10v2"[label="10"];
  "<" -> x1;
  "<" -> "10v1";
  while -> ";"
  ";" -> ass1;
  ass1 -> f1;
  ass1 -> "+";
  "+" -> f2;
  "+" -> "*";
  "*" -> "10v2";
  "*" -> x2;
  ";" -> ass2;
  ass2 -> x3;
  ass2 -> "-";
  "-" -> x4;
  "-" -> "-1";
}
    • Quadruples:

        op arg_1 arg_2 res
      1 + x 20 T_1
      2 * T_1 15 T_2
      3 + y 2 T_3
      4 / T_3 a T_4
      5 - T_2 T_4 T_5
      6 := T_5 z
        op arg_1 arg_2 res
      1 < a 10 T_1
      2 JEQZ T_1 (10)
      3 < b 5 T_2
      4 JEQZ T_2 (7)
      5 := a x
      6 JUMP (12)
      7 + a b T_3
      8 := T_3 x
      9 JUMP (12)
      10 - a b T_4
      11 := T_4 x
      12        
        op arg_1 arg_2 res
      1 < x 10 T_1
      2 JEQZ T_1 (9)
      3 * 10 x T_2
      4 + f T_2 T_3
      5 := T_3 f
      6 - x 1 T_4
      7 := T_4 x
      8 JUMP (1)
      9        
    • Triples

        op arg_1 arg_2
      1 + x 20
      2 * (1) 15
      3 + y 2
      4 / (3) a
      5 - (2) (4)
      6 := (5) z
        op arg_1 arg_2
      1 < a 10
      2 JEQZ (1) (10)
      3 < b 5
      4 JEQZ (3) (7)
      5 := a x
      6 JUMP (12)
      7 + a b
      8 := (7) x
      9 JUMP (12)
      10 - a b
      11 := (10) x
      12      
        op arg_1 arg_2
      1 < x 10
      2 JEQZ (1) (9)
      3 * 10 x
      4 + f (3)
      5 := (4) f
      6 - x 1
      7 := (6) x
      8 JUMP (1)
      9      

Chapter 7 - Run-Time Environments

  1. In the symbol table information is gathered about the symbols which appear in a program. The information is gathered during the analysis phase and is used during the synthesis phase.

  2. Name, scope, declared type, address, number of parameters and their types for functions/procedures, array dimensions, etc.

  3. You can either decide on an upper limit for the length of the names and store the string itself with other information in the symbol table. Another alternative is to store an index to a large string table where all the names are stored, separated by some suitable character such as a blank. The latter method does not limit the number of significant characters in an identifier.

  4. If the number of dimensions is fixed, information can be stored in a record, but if the number of dimensions is variable, you must have a list of \{lower limit, upper limit\} pairs, as in the sketch below.
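
    A minimal sketch of such a symbol-table record (the type and field names, and the maximum number of dimensions, are assumptions):

    const MAXDIM = 8;
    type limits = record
                    lower, upper : integer
                  end;
         arrayinfo = record
                       ndim : 1..MAXDIM;                   (* number of dimensions *)
                       dim  : array [1..MAXDIM] of limits  (* one pair per dimension *)
                     end;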


  5.             Good                          Bad
    Lists        Simple to implement.          Long search times.
                 Uses little memory.
                 Easy to introduce scoping.
    Trees        Easy to introduce scoping.    Should be balanced.
                 Acceptable search time.
                 Uses little memory.
    Hash table   Very fast.                    Difficult to introduce scoping.
                                               Extra space for tables.

    If you have high demands on searching and inserting performance, you ought to choose the hash table. With not quite so high demands you can choose a tree-based implementation. Linear lists are mostly a sign of total indifference, laziness and ignorance.

    The choice of data-structure is also influenced by characteristics of the language, e.g. if nested blocks are not allowed, scope information can be omitted.

  6. In the example below a binary tree is used for storing the symbols. Note that left and right edges matter in a sorted binary tree.

    digraph figure10a {
  node [shape=none, fontname="Courier New"];
  rankdir=TB;
  ordering="out";
  D;
} digraph figure10b {
  node [shape=none, fontname="Courier New"];
  rankdir=TB;
  ordering="out";
  D -> " ";
  D -> G;
} digraph figure10c {
  node [shape=none, fontname="Courier New"];
  rankdir=TB;
  ordering="out";
  D -> A;
  D -> G;
} digraph figure10d {
  node [shape=none, fontname="Courier New"];
  rankdir=TB;
  ordering="out";
  D -> A;
  A -> " ";
  A -> B;
  D -> G;
} digraph figure10e {
  node [shape=none, fontname="Courier New"];
  rankdir=TB;
  ordering="out";
  D -> A;
  A -> " ";
  A -> B;
  D -> G;
  G -> E;
  G -> "  ";
} digraph figure10f {
  node [shape=none, fontname="Courier New"];
  rankdir=TB;
  ordering="out";
  D -> A;
  A -> " ";
  A -> B;
  D -> G;
  G -> E;
  G -> "  ";
  E -> "   ";
  E -> F;
} digraph figure10g {
  node [shape=none, fontname="Courier New"];
  rankdir=TB;
  ordering="out";
  D -> A;
  A -> " ";
  A -> B;
  B -> "    ";
  B -> C;
  D -> G;
  G -> E;
  G -> "  ";
  E -> "   ";
  E -> F;
}
  7. By using a suitably large hash table you can index the symbol table in practically constant time. To determine whether a name s is in the symbol table you hash the name with a suitable hash function h. The result, h(s), is then used as an index into the hash table, which contains an index into the symbol table where the record for s may be. Collisions occur when several names hash to the same value. In a closed hash table, linear probing, for example, is used to handle collisions. The search time can be adjusted by varying the size of the hash table.

    An example of what the symbol tables look like is given in the answer to the next exercise.

    A hash function should spread the keys well and be quick to compute. Examples of good hash functions (e.g. hashpjw) can be found in the course book; a simple sketch is given below.
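
    A minimal sketch of a simple hash function (not hashpjw itself, which relies on bit operations; the fixed-length name type and the table size are assumptions):

    const TABSIZE = 211;               (* a prime spreads the keys better *)
    type name = packed array [1..16] of char;

    function hash(var s : name) : integer;
    var h, i : integer;
    begin
      h := 0;
      for i := 1 to 16 do
        h := (h * 31 + ord(s[i])) mod TABSIZE;
      hash := h
    end;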

  8. The following three figures show what the symbol tables look like at L1, L2 and L3. It is assumed that CAESAR and DAVID have the same hash value. A record in the symbol table contains information on names, block number and type as well as a back link for collision management.

    digraph figure11 {
  node [shape=none, fontname="Courier New"];
  rankdir=LR;
  ordering="out";
  forcelabels=true;
  graph[ranksep=0.5,nodesep=0.5,engine=neato];
  overlap=false;

  hash[shape=record, label="<B> BERTIL|<E> ERIK|<A> ADAM|<C> CAESAR/DAVID"];
  symbol[shape=record, label="{<A1> ADAM|1|<A1t>INTEGER}|{<B1> BERTIL|1|<B1t>INTEGER}|{<A2>ADAM|2|<A2t>REAL}|{<C2> CAESAR|2|<C2t>REAL}|{<B2> BERTIL|2|<B2t> BOOLEAN}|{<D2> DAVID|2|<D2t> BOOLEAN}"];
  block[shape=record, label="<1> 1|<2> 2"];

  Az[shape=point];
  symbol:B2t -> Az[arrowhead=none];
  Az -> symbol:B1t;

  Az -> block [style=invis];

  block:1 -> symbol:A1t;
  block:2 -> symbol:B2t;
  hash:B -> symbol:B2;
  hash:A -> symbol:A1;
  hash:C -> symbol:D2;
  hash -> block [style=invis];
  symbol -> block [style=invis];

} digraph figure12 {
  node [shape=none, fontname="Courier New"];
  rankdir=LR;
  ordering="out";
  forcelabels=true;
  graph[ranksep=0.5,nodesep=0.5];
  overlap=false;

  hash[shape=record, label="<B> BERTIL|<E> ERIK|<A> ADAM|<C> CAESAR/DAVID"];
  symbol[shape=record, label="{<A1> ADAM|1|<A1t>INTEGER}|{<B1> BERTIL|1|<B1t>INTEGER}|{<A2>ADAM|2|<A2t>REAL}|{<C2> CAESAR|2|<C2t>REAL}|{<B2> BERTIL|2|<B2t> BOOLEAN}|{<D2> DAVID|2|<D2t> BOOLEAN}|{<C3> CAESAR|3|<C3t>INTEGER}|{<E3> ERIK|3|<E3t>INTEGER}"];
  block[shape=record, label="<1> 1|<2> 2|<3> 3"];

  Az[shape=point];
  symbol:B2t -> Az[arrowhead=none];
  Az -> symbol:B1t;

  Aa[shape=point];
  symbol:C3t -> Aa[arrowhead=none];
  Aa -> symbol:D2t;

  Az -> block [style=invis];
  Aa -> block [style=invis];

  block:1 -> symbol:A1t;
  block:2 -> symbol:B2t;
  block:3 -> symbol:C3t;
  hash:E -> symbol:E3;
  hash:B -> symbol:B2;
  hash:A -> symbol:A1;
  hash:C -> symbol:C3;
  hash -> block [style=invis];
  symbol -> block [style=invis];

} digraph figure13 {
  node [shape=none, fontname="Courier New"];
  rankdir=LR;
  ordering="out";
  forcelabels=true;
  graph[ranksep=0.5,nodesep=0.5];
  overlap=false;

  hash[shape=record, label="<B> BERTIL|<E> ERIK|<A> ADAM|<C> CAESAR/DAVID"];
  symbol[shape=record, label="{<A1> ADAM|1|<A1t>INTEGER}|{<B1> BERTIL|1|<B1t>INTEGER}|{<A2>ADAM|2|<A2t>REAL}|{<C2> CAESAR|2|<C2t>REAL}|{<B2> BERTIL|2|<B2t> BOOLEAN}|{<D2> DAVID|2|<D2t> BOOLEAN}|{<C3> CAESAR|3|<C3t>INTEGER}|{<E3> ERIK|3|<E3t>INTEGER}"];
  block[shape=record, label="<1> 1|<2> 2|<3> 3"];

  block:1 -> symbol:A1t;
  hash:B -> symbol:B1;
  hash:A -> symbol:A1;
  hash -> block [style=invis];
  symbol -> block [style=invis];

}
  9. Arrays with a fixed number of elements can be stored directly in the activation record. Dynamic arrays have variable limits and the size is thus unknown at compilation time. A dope vector for these is stored in the activation record. It contains the array’s current limits and a pointer to the space where the array is actually stored which may be above the activation record or in the heap.

  10. call by reference means that you transfer pointers to the actual parameters and thus m will point to the space where i+1 is stored and n will point to where i is stored. The result will be i=3.

    call by value means that you copy the actual parameters’ values and that odd can not affect the caller’s parameters. The result will be i=3.

    call by value/result means that you copy the values of the actual parameters to the formal parameters, run the procedure and then copy the value of the formal parameters to the actual parameters. The result will be i=4.

    call by name is based on textual substitution; m and n will be exchanged for i+1 and i before odd is executed. The result will be i=5.

  11. A thunk is a procedure without arguments which is sent as a representative for an actual parameter in a call-by-name strategy. When the thunk is evaluated, it produces the parameter’s rvalue or lvalue depending on which side it appears in an assignment.

  12. Static memory allocation means that the size of all data must be known at compilation time. Recursion and dynamic variables are not allowed and no runtime support is required. All data can be referenced by absolute addresses. As activation records, memory allocation etc. are not required, the program will run faster and use less space.

    Example: FORTRAN.

  13. Dynamic memory allocation is needed when you do not know the size of data at compilation time, e.g. dynamic arrays, or when there is recursion. When recursion is possible you can no longer unambiguously map names to memory addresses as several instances of a procedure (with accompanying local variables) can be active at the same time.

  14. Stack allocation means that all data belonging to a block (procedure) is gathered together into an activation record. At a procedure call a new activation record is allocated on the stack. When the procedure returns, the activation record is removed. This technique must be used when there is recursion, e.g. for the Pascal language.

    The activation record contains local data for the procedure, the return address, parameters, pointer to the previous activation record, references to non-local variables, dope-vectors etc.

    One way to reference non-local variables is to use a display. A display is quite simply

    display : array [1..MAX_NESTING] of ^activation_record;
    

    where MAX_NESTING is an upper limit on how deeply you can nest blocks. The table is maintained so that display[i] always points to the activation record of the surrounding block at nesting depth i. Variables are then referenced by display[i].offset, where i is known at compilation time and offset describes where the variable is in the activation record.

    A display must be updated when a procedure is called, e.g. by the called procedure saving the old value in the table, which is then replaced by a pointer to its own activation record. When the procedure is done, the old value is restored, as in the sketch below.
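
    A minimal sketch of the update, with an assumed record layout and helper procedures:

    const MAX_NESTING = 10;                     (* illustrative *)
    type arptr = ^actrec;
         actrec = record
                    saveddisplay : arptr        (* the old display entry *)
                    (* return address, parameters, locals, ... *)
                  end;
    var display : array [1..MAX_NESTING] of arptr;

    procedure enter(i : integer; ar : arptr);   (* call of a procedure at depth i *)
    begin
      ar^.saveddisplay := display[i];           (* save the old entry ... *)
      display[i] := ar                          (* ... and install the new record *)
    end;

    procedure leave(i : integer; ar : arptr);   (* return from the procedure *)
    begin
      display[i] := ar^.saveddisplay            (* restore the old entry *)
    end;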

  15. Heap allocation means that memory is allocated dynamically in the heap area during execution and released either explicitly by the programmer (dispose in Pascal) or by a garbage collector (LISP).

    The task of a garbage collector is to release space in the heap which is no longer used.

    If memory is allocated and released over a longer time, the free memory will be fragmented in many small blocks. This can mean that a memory block can not be allocated even though there is enough free memory in the system. A good garbage collector therefore usually compacts space so that fragmentation is minimised.

  16. See the section on “Memory management in FORTRAN” in the lecture notes.

  17. Static link, dynamic link, return address, parameters, dope-vectors and space for local variables.

  18. See the section on “Dynamic arrays and block structures in ALGOL” in the lecture notes.


  19. Todo

    Should be pointers from ● to ole, etc

      display stack
    2 ?  
    1 ole
      display stack
    2 dole
    1 ole

    Note: when a procedure P at block depth i is called, display[i] must be saved in P’s activation record and display[i] updated to point to P’s activation record. When P is finished, the old value is restored.

  20. Variables in ole are reached by display[1].offset, variables in dole are reached by display[2].offset and global variables are reached by display[0].offset, where offset is the variable’s position in the activation record.

    Optimizations can be made, e.g. global variables do not need to be allocated in the stack, but can be written out as an absolute address.

Chapter 8 - Code Generation

  1. A compiler can generate

    • absolute code which means that the code is placed directly in memory and execution starts directly. Separate compilation is impossible so this method works best for small programs.
    • relocatable code which means that the code can be saved on a file and can be linked with other programs.
    • assembly code which requires assembling, linking and loading but which makes code generation much simpler (you avoid having to manage forward references).
  2. Choice of instructions, the order of the instructions and register use.

  3. See the section on “Code generation from a labelled tree” in the course book or “Code generation for DAG” in the lecture notes.

  4. Peephole optimization means that you try to make local code improvements, usually at assembly level. You can imagine a small window of instructions within which you try to replace “stupid” instruction sequences by something equivalent which requires less space or time.

    The types of optimizations you can perform are (1) finding redundant load and store instructions; (2) recognising a pattern which matches a special instruction; (3) algebraic simplifications (e.g. addition of 0 to a register); (4) replacing an expensive operation by something equivalent and cheaper (e.g. multiplication by 2 can be replaced by a shift instruction) and (5) short-circuiting jump instructions. A sketch of rule (1) is given below.
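
    A minimal sketch of rule (1), on an assumed linear representation of the instructions: a LOAD of the same register from the same address immediately after a STORE is redundant and can be deleted.

    const MAXCODE = 1000;
    type opcode = (LOADOP, STOREOP, ADDOP, JUMPOP);   (* hypothetical op set *)
         instr = record
                   op : opcode;
                   reg, addr : integer
                 end;
    var code : array [1..MAXCODE] of instr;

    procedure peephole(var n : integer);              (* n = number of instructions *)
    var i, j : integer;
    begin
      i := 1;
      while i < n do
        if (code[i].op = STOREOP) and (code[i+1].op = LOADOP) and
           (code[i].reg = code[i+1].reg) and (code[i].addr = code[i+1].addr)
        then begin
          for j := i + 1 to n - 1 do                  (* delete the redundant LOAD *)
            code[j] := code[j+1];
          n := n - 1
        end else
          i := i + 1
    end;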

Chapter 9 - Machine-Independent Optimizations

  1. Algorithm optimization means replacing a slow algorithm with a faster one. Local optimizations are only made within a basic block without information from other blocks. Loop-optimization aims to minimise the number of operations in the loop. Global optimization occurs when the whole program is analysed in its entirety to, for example, remove variables which are never used, not perform calculations whose results are never used and remove code which will never be executed.

  2. A basic block is a sequence of operations with an entry and an exit. No jump instructions may appear within the block (with the exception of the very last instruction).

    • constant folding:

      Constant sub-expressions are calculated at compilation time. Example:

      N = 5;
      i := N + 2;

      can be replaced by:

      N = 5;
      i := 7;
    • Elimination of common sub-expressions:

      Avoid performing the same calculation more than once. Example:

      d := d + c*b;
      a := d + c*b;

      can be replaced by:

      t := c*b;
      d := d + t;
      a := d + t;
    • reduction in strength:

      Replace an expensive operation by something cheaper. Example:

      x := y^2;

      can be replaced by:

      x := y*y;

  3. Moving loop invariants: an expression whose value is not changed as long as the program is in a loop can be calculated before the loop (see the sketch after this item).

    Elimination of induction variables: An induction variable i is a variable which in a loop always increases or decreases its value by a constant c. When i is an index to an array a, you have the chance to eliminate i and instead step forward a by c elements. In general you eliminate the induction variables which are only used for calculating other induction variables.
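
    A small before/after sketch of moving a loop invariant (the variables a, x, y, n and the compiler-introduced temporary t are illustrative):

    (* Before: x * y is recomputed in every iteration *)
    for i := 1 to n do
      a[i] := x * y + i;

    (* After: the invariant has been moved out of the loop *)
    t := x * y;
    for i := 1 to n do
      a[i] := t + i;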

  4. Can you perform machine-independent optimizations or will the compiler’s portability suffer? Can you use the debugger when optimization has been performed? If not, can you switch off optimization? Is the language’s semantics retained, or can, for example, a carefully planned numeric expression be rearranged so that overflow occurs? How much memory are you willing to sacrifice for speed?

Chapter X - Bootstrapping

  1. See answer to question 5, Chapter 1.

  2. Cross-compilation means that the compiler runs on one machine but generates code to be run on another machine.

  3. The notation ^kC^o_i denotes a compiler C written in the language i which compiles the source language k to the object language o. You usually also use a so-called T-diagram:

    Todo

    figure18

    Bootstrapping can now be defined as the method you use to get a compiler ^xC^y_x, i.e. implemented in the language which is to be compiled.

    In the first exercise we have ^FC^A_? at our disposal and we are to deliver ^PC^A_A, where F, P and A denote FORTRAN, Pascal and A-assembler respectively. One way of doing this is to first write a Pascal-compiler in FORTRAN which produces A-assembler: ^PC^A_F. If we run this through ^FC^A_? we get ^PC^A_A. Now we “only” need to rewrite the compiler in our own Pascal, i.e. ^PC^A_P and run it through ^PC^A_A and we get the final product ^PC^A_A.

    Todo

    figure19

    The next step, moving the compiler to machine B, can be performed in approximately the same way by generating code for B-assembler as in the figure below:

    Todo

    figure20

[1] Even though we do not usually draw all the edges in the DFA which lead to a so-called error state.