CSM 562 Compiler Construction Assignment Kaunda PG8589621, M Phil Computer Science, KNUST

The document discusses compilers, grammars, and parsing. It explains that a grammar defines the rules that determine if a sentence is part of a language. Grammars start with high-level sentence types and define lower levels. For example, the equality rule states an equality statement is an expression followed by "==" followed by another expression. Expressions can be variables, integers, or strings. Variables are sequences of alphanumeric characters. Grammars fully define a language by breaking it down into logical components and their relationships through a system of rules.

Kwame Nkrumah University of Science and Technology, KNUST

COMPILER CONSTRUCTION ASSIGNMENT

KAUNDA, Ismail Ibn Ahmed


M.Phil. Computer Science, KNUST
 Student ID: 20818866
 Exam No: PG8589621

Sunday, September 18, 2022


Contents
Explain Assembler, Linkers and Loaders
Assembler
Linkers and loaders
What is grammar?
Given that A is a Regular Expression Show that (A*)* = A*, for every A
Assignment 5 (Midsem)
Determine whether the string 11101 is in each of these sets.
Design a top-down parser in Java
What are the advantages in code generation from a linear intermediate representation rather than directly from the syntax tree?
Find a deterministic Finite Automaton that recognizes each of the following sets.
Advantage in code generation from linear IR rather than syntax tree
Describe the techniques of peephole optimization that can be performed effectively and suggest a strategy for its implementation.
Elimination of redundant instructions
Improvements in flow of control
Algebraic simplifications
Use of machine idioms
Produce a state transition diagram (finite state machine)
Simple state transition diagram for comment
Elaborate state transition diagram for comment
Implementing the recognizer in java
Construct a grammar for arithmetic expression
Design a lexer class in Java
Compare and contrast: one-pass compiler, two-pass compiler, and multi-pass compiler.
What are the pros and cons of introducing Intermediate Representation (IR) in a compiler?
Question 4
What is an attribute grammar?
How can an attribute grammar be used to support semantic analysis?

Assignment 1:

Explain Assembler, Linkers and Loaders

Assembler

An assembler is a program that converts assembly language source code into machine code, usually an object file that a linker then turns into an executable program. In this it is comparable to a compiler for a high-level language, which also translates source code into a lower-level form.

However, assembly language is different because each line of code typically corresponds to a single low-level CPU opcode.

Assembly language is usually written one instruction per line: a mnemonic such as mov followed by its operands. The individual instructions (like mov) directly correspond to specific opcodes in the instruction set of the target CPU.
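To make the one-line-to-one-opcode correspondence concrete, here is a minimal sketch of an assembler for a made-up machine. The class name ToyAssembler, the numeric opcode values and the one-byte operand format are all assumptions chosen for illustration; no real CPU is encoded this way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy assembler for a hypothetical machine: every source line becomes
// exactly one opcode followed by one operand.
final class ToyAssembler {
    // Hypothetical opcode table; the numeric values are invented.
    private static final Map<String, Integer> OPCODES =
            Map.of("mov", 0x01, "add", 0x02, "jmp", 0x03);

    static List<Integer> assemble(String source) {
        List<Integer> bytes = new ArrayList<>();
        for (String line : source.split("\n")) {
            line = line.trim();
            if (line.isEmpty()) continue;              // skip blank lines
            String[] parts = line.split("\\s+");
            Integer opcode = OPCODES.get(parts[0]);
            if (opcode == null || parts.length != 2) {
                throw new IllegalArgumentException("bad instruction: " + line);
            }
            bytes.add(opcode);                         // one line -> one opcode
            bytes.add(Integer.parseInt(parts[1]));     // followed by its operand
        }
        return bytes;
    }

    public static void main(String[] args) {
        // prints [1, 7, 2, 1, 3, 0]
        System.out.println(assemble("mov 7\nadd 1\njmp 0"));
    }
}
```

A real assembler also handles labels, several operand formats and object-file layout, which this sketch leaves out on purpose.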

Linkers and loaders
The basic job of any linker or loader is simple: it binds more abstract names to more concrete names, which permits programmers to write code using the more abstract names. That is, it takes a name written by a programmer, such as getline, and binds it to "the location 612 bytes from the beginning of the executable code in module iosys." Or it may take a more abstract numeric address, such as "the location 450 bytes beyond the beginning of the static data for this module," and bind it to a numeric address.

Linking vs. loading

Linkers and loaders perform several related but conceptually separate actions:

1. Program loading: Copy a program from secondary storage (which since
about 1968 invariably means a disk) into main memory so it’s ready to run. In
some cases loading just involves copying the data from disk to memory, in
others it involves allocating storage, setting protection bits, or arranging for
virtual memory to map virtual addresses to disk pages.
2. Relocation: Compilers and assemblers generally create each file of object
code with the program addresses starting at zero, but few computers let you
load your program at location zero. If a program is created from multiple
subprograms, all the subprograms have to be loaded at non-overlapping
addresses. Relocation is the process of assigning load addresses to the
various parts of the program, adjusting the code and data in the program to
reflect the assigned addresses. In many systems, relocation happens more
than once. It’s quite common for a linker to create a program from multiple
subprograms, and create one linked output program that starts at zero, with
the various subprograms relocated to locations within the big program. Then
when the program is loaded, the system picks the actual load address and
the linked program is relocated as a whole to the load address.
3. Symbol resolution: When a program is built from multiple subprograms, the
references from one subprogram to another are made using symbols; a main
program might use a square root routine called sqrt, and the math library
defines sqrt. A linker resolves the symbol by noting the location assigned to
sqrt in the library, and patching the caller’s object code so the call
instruction refers to that location.
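Relocation and symbol resolution can be sketched together in a few lines. The following is only an illustration under stated assumptions: the ToyLinker and ObjectModule names, the module sizes, the symbol offsets and the load address 4096 are all invented, and real object files carry far more information.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy linker: every module was assembled as if it started at address 0.
final class ToyLinker {
    // A hypothetical object module: its size in bytes and the symbols it
    // defines, each at an offset relative to the start of the module.
    static final class ObjectModule {
        final String name;
        final int size;
        final Map<String, Integer> defs;

        ObjectModule(String name, int size, Map<String, Integer> defs) {
            this.name = name;
            this.size = size;
            this.defs = defs;
        }
    }

    // Relocation: lay the modules out back to back from loadAddress.
    // Symbol resolution: turn each module-relative offset into an absolute address.
    static Map<String, Integer> link(List<ObjectModule> modules, int loadAddress) {
        Map<String, Integer> symbols = new LinkedHashMap<>();
        int base = loadAddress;
        for (ObjectModule m : modules) {
            for (Map.Entry<String, Integer> d : m.defs.entrySet()) {
                symbols.put(d.getKey(), base + d.getValue());
            }
            base += m.size;   // the next module starts where this one ends
        }
        return symbols;
    }

    public static void main(String[] args) {
        ObjectModule main = new ObjectModule("main", 100, Map.of("main", 0));
        ObjectModule math = new ObjectModule("math", 40, Map.of("sqrt", 12));
        // prints {main=4096, sqrt=4208}
        System.out.println(link(List.of(main, math), 4096));
    }
}
```

A call to sqrt in the main module would then be patched to refer to address 4208, which is the base of the math module (4096 + 100) plus the offset 12 of sqrt inside it.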

Although there’s considerable overlap between linking and loading, it’s reasonable to
define a program that does program loading as a loader, and one that does symbol
resolution as a linker. Either can do relocation, and there have been all-in-one linking
loaders that do all three functions.

Assignment 3

What is grammar?
A grammar is the set of rules that determine whether a given sentence is part of a language or not.

For example, you can imagine there’s a rule in Java’s grammar called “the equality condition rule”. This rule says that an equality condition is written in Java by writing an expression, followed by the == special symbol, followed by another expression.

Then, there must be another rule that says what an expression is. And so on for each
logical piece of the language (expressions, variables, keywords, etc.).

Looking deeper, you can expect the rules of a grammar to start by defining very high-level types of sentences and then to work down to lower and lower levels. Let me use the previous example again to clarify this concept.

We know there is a rule about the equality condition. It says an equality statement must be an expression, followed by ==, followed by another expression. We can write it as
eq_statement : expression '==' expression

This is the highest level of sentence.

However, the rule is not complete in itself: the grammar must also specify what an expression is.

Let’s assume an expression can be a variable, or a string, or an integer. Of course,
real programming languages have a more complex definition of expression, but let’s
continue with the simple one. So we would have
expression : variable | int | string

You can see that the character | is used to express alternatives (a logical OR). The grammar is not done yet, because it also has to clarify what a variable is.

So, let’s assume a variable is a sequence of characters in a-z, A-Z and digits in 0-9, with no limit on its length. You would express this with a regular expression rule, which is
variable : [a-zA-Z0-9]*

The * symbol after the square brackets means that any of the symbols inside the brackets can be repeated zero or more times.

Again, real programming languages have more complex definitions for variables, but
I am hoping this simple example clarifies the concept.

There are still two pieces of a previous rule that haven’t been specified
yet: int and string.

Let’s assume that an int is any non-empty sequence of digits, and string is any
sequence of characters in a-z, A-Z. Of course in real applications you would have
many more characters (for instance, ?!+-, etc.), but let’s keep it simple for the
moment. So, you are going to have two more rules:
int : [0-9]+

string : [a-zA-Z]*

At this point we have gone as deep as we can in the definition of an equality condition. Why? Because these rules cannot be expanded any further. We say that single characters such as a-z, A-Z and the digits 0-9 are the terminal symbols or, more formally, the characters of the alphabet of the grammar.
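The rules above can be tried out directly with java.util.regex. This is a minimal sketch, not a real parser: the class name EqualityGrammar and the method names are assumptions, and whitespace is not handled.

```java
import java.util.regex.Pattern;

// The toy grammar from the text, written with java.util.regex.
final class EqualityGrammar {
    static final Pattern VARIABLE = Pattern.compile("[a-zA-Z0-9]*");
    static final Pattern INT = Pattern.compile("[0-9]+");
    static final Pattern STRING = Pattern.compile("[a-zA-Z]*");

    // expression : variable | int | string
    static boolean isExpression(String s) {
        return VARIABLE.matcher(s).matches()
            || INT.matcher(s).matches()
            || STRING.matcher(s).matches();
    }

    // equality statement : expression '==' expression
    static boolean isEquality(String s) {
        int at = s.indexOf("==");
        if (at < 0) return false;
        return isExpression(s.substring(0, at))
            && isExpression(s.substring(at + 2));
    }

    public static void main(String[] args) {
        System.out.println(isEquality("x1==42"));   // true
        System.out.println(isEquality("x1=42"));    // false: no == symbol
        System.out.println(isEquality("a==b+c"));   // false: + is not in the alphabet
    }
}
```

Because the variable rule uses *, even the empty string counts as an expression here, so isEquality("==") also returns true. A real grammar would require at least one character, for example by using + instead of *.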

Assignment 4

Given that A is a regular expression, show that (A*)* = A* for every A


Solution
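Taking the statement to be the standard identity (A*)* = A*, one way to argue it is to show inclusion in both directions on the languages involved. This is a sketch: L stands for the language L(A) denoted by A, and the names w_i, u_{i,j}, k and m_i are introduced here only for the argument.

```latex
Let $L = L(A)$. We show $(L^*)^* = L^*$.

($\supseteq$) For any language $X$ we have $X \subseteq X^*$, since every
string of $X$ is a concatenation of one string of $X$. Taking $X = L^*$
gives $L^* \subseteq (L^*)^*$.

($\subseteq$) Let $w \in (L^*)^*$. Then $w = w_1 w_2 \cdots w_k$ for some
$k \ge 0$ with each $w_i \in L^*$, and each
$w_i = u_{i,1} u_{i,2} \cdots u_{i,m_i}$ with $m_i \ge 0$ and every
$u_{i,j} \in L$. Hence
\[
  w = u_{1,1} \cdots u_{1,m_1} \, u_{2,1} \cdots u_{k,m_k}
\]
is a concatenation of $m_1 + \cdots + m_k$ strings of $L$, so $w \in L^*$
(for $k = 0$ we get $w = \varepsilon \in L^*$).

Both inclusions hold, so $(L^*)^* = L^*$, that is, $(A^*)^* = A^*$.
```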

Assignment 5 (Midsem)
Question 2 (b)

Determine whether the string 11101 is in each of these sets.

a) This set contains all bit strings, so the answer is YES.
b) This set contains all strings consisting of any number of 1's, followed by any
number of 0's, followed by any number of 1's. Since 11101 is such a string, the
answer is YES.
c) Any string belonging to this set must start 110, and 11101 does not, so the answer
is NO.
d) All the strings in this set must in particular have even length. The given string
has odd length, so the answer is NO.
e) The answer is YES. Just take one copy of each of the strings 111 and 0, together
with the required string 1.
f) The answer is YES. This is because 11 from the first set and 101 from the second
set gives the string 11101.

Question 2 (a)

Design a top-down parser in Java

// Recursive-descent (top-down) parser for the grammar implied by the methods below:
//   S     -> '$' tail1 | W
//   tail1 -> empty | S
//   W     -> 'a' tail2
//   tail2 -> W 'b' 'b' | 'b' 'b'
// The input is terminated by a '.' that main() appends.
public class Parser
{
    // S -> '$' tail1 | W
    private static String parseS(StringBuilder toParse)
    {
        switch (toParse.charAt(0))
        {
            case '$':
                match(toParse, '$');
                return "(S $ " + parseTail1(toParse) + ")";

            case 'a':
                return "(S " + parseW(toParse) + ")";

            default:
                System.err.println("parse error");
                return "error";
        }
    }

    // tail1 -> empty | S   (empty is chosen when the end marker '.' comes next)
    private static String parseTail1(StringBuilder toParse)
    {
        switch (toParse.charAt(0))
        {
            case '.':
                return "(tail1)";

            case 'a':
            case '$':
                return "(tail1 " + parseS(toParse) + ")";

            default:
                System.err.println("parse error");
                return "error";
        }
    }

    // W -> 'a' tail2
    private static String parseW(StringBuilder toParse)
    {
        match(toParse, 'a');
        return "(W a " + parseTail2(toParse) + ")";
    }

    // tail2 -> W 'b' 'b' | 'b' 'b'
    private static String parseTail2(StringBuilder toParse)
    {
        switch (toParse.charAt(0))
        {
            case 'a':
                String wPart = parseW(toParse);
                match(toParse, 'b');
                match(toParse, 'b');
                return "(tail2 " + wPart + " b b)";

            case 'b':
                match(toParse, 'b');
                match(toParse, 'b');
                return "(tail2 b b)";

            default:
                System.err.println("parse error");
                return "error";
        }
    }

    // Consume the expected character c, or report an error and consume nothing.
    private static void match(StringBuilder s, char c)
    {
        if (s.length() > 0 && s.charAt(0) == c)
        {
            s.deleteCharAt(0);
        }
        else
        {
            System.err.println("parse error " + s + " " + c);
        }
    }

    public static void main(String[] args)
    {
        if (args.length != 1)
        {
            System.err.println("usage: java Parser <string>");
            return;
        }

        // Append the end marker so that charAt(0) is always defined.
        StringBuilder buf = new StringBuilder(args[0] + ".");

        String parseTree = parseS(buf);
        match(buf, '.');

        System.out.println(parseTree);
    }
}

Question 2 (c)

What are the advantages in code generation from a linear intermediate representation rather than directly from the syntax tree?
Mostly you can't do interesting optimizations at the AST level, because you need
information about how data flows from one part of the program to another. While data
flow is implicit in the meaning of the AST, it isn't easily determined by inspecting just
the AST, which is why people building compilers and optimizers build other program
representations (including symbol tables, control flow graphs, reaching definitions,
data flow and SSA forms, etc.).

Having a parser for a language is the easy part of analyzing or manipulating that language. You need all that other machinery to do a good job.

One advantage of applying optimizations to the AST is that it may reduce the
execution time of some back-end optimization passes. However, I believe these
optimizations need to be applied sparingly, because they may hinder
further optimizations of the code.

Question 2 (b)

Find a deterministic finite automaton that recognizes each of the following sets.

We want to accept only the string 0. Let s1 be the only final state, where we reach s1
on input 0 from the start state s0. Make a "graveyard" state s2, and have all other
transitions (there are five of them in all) lead there.

This uses the same idea as in part (a), but we need a few more states. The graveyard
state is s4. See the picture for details.

In the picture of our machine, we show a transition to the graveyard state whenever
we encounter a 0. The only final state is s2, which we reach after 11 and remain at as
long as the input consists just of 1's.
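Two of the machines described above can be written as transition tables and run by a small driver. This is a sketch under assumptions: the class name Dfa and its factory methods are invented, and for the last machine the graveyard state is numbered 3 here because the text does not name it.

```java
// Table-driven DFA over the alphabet {0, 1}; state 0 is the start state.
final class Dfa {
    private final int[][] next;        // next[state][symbol], symbol is 0 or 1
    private final boolean[] accepting; // accepting[state]

    Dfa(int[][] next, boolean[] accepting) {
        this.next = next;
        this.accepting = accepting;
    }

    boolean accepts(String input) {
        int state = 0;
        for (char c : input.toCharArray()) {
            if (c != '0' && c != '1') return false;   // not a bit string
            state = next[state][c - '0'];
        }
        return accepting[state];
    }

    // Accepts only the string "0": s1 is final, s2 is the graveyard.
    static Dfa onlyZero() {
        return new Dfa(new int[][] {{1, 2}, {2, 2}, {2, 2}},
                       new boolean[] {false, true, false});
    }

    // Accepts 11, 111, 1111, ...: s2 is final, any 0 leads to the graveyard.
    static Dfa twoOrMoreOnes() {
        return new Dfa(new int[][] {{3, 1}, {3, 2}, {3, 2}, {3, 3}},
                       new boolean[] {false, false, true, false});
    }

    public static void main(String[] args) {
        System.out.println(onlyZero().accepts("0"));         // true
        System.out.println(onlyZero().accepts("00"));        // false
        System.out.println(twoOrMoreOnes().accepts("111"));  // true
        System.out.println(twoOrMoreOnes().accepts("110"));  // false
    }
}
```

In the first table, five of the six entries point at the graveyard state 2, which matches the count of five transitions given in the text.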

Question 2 (C)

Advantage in code generation from linear IR rather than syntax tree

In one word: optimization.

Each representation you use is good for a particular thing. An AST is not very easy to
use for some kinds of optimization (such as data-flow analysis), while other representations, such as SSA,
are very good for that kind of task.

“Very good” here means two things:

a) It is easy to get the information required
b) It is possible to get the information required

Some optimization passes can be applied directly over the AST. Some can be applied
over machine code. Many optimization passes can be applied over neither the AST nor
machine code.
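To show what a linear representation looks like next to a tree, here is a minimal sketch that lowers a small expression tree into three-address code. The class names Lowering, Var and BinOp and the temporary names t1, t2 are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Lowers a tiny expression tree into linear three-address code.
final class Lowering {
    interface Expr {}

    static final class Var implements Expr {
        final String name;
        Var(String name) { this.name = name; }
    }

    static final class BinOp implements Expr {
        final char op;
        final Expr left, right;
        BinOp(char op, Expr left, Expr right) {
            this.op = op;
            this.left = left;
            this.right = right;
        }
    }

    private final List<String> code = new ArrayList<>();
    private int temps = 0;

    // Emits instructions for e and returns the name that holds its value.
    String lower(Expr e) {
        if (e instanceof Var) return ((Var) e).name;
        BinOp b = (BinOp) e;
        String l = lower(b.left);
        String r = lower(b.right);
        String t = "t" + (++temps);                    // fresh temporary
        code.add(t + " = " + l + " " + b.op + " " + r);
        return t;
    }

    List<String> code() { return code; }

    public static void main(String[] args) {
        // a + b * c
        Expr tree = new BinOp('+', new Var("a"),
                new BinOp('*', new Var("b"), new Var("c")));
        Lowering lowering = new Lowering();
        lowering.lower(tree);
        // prints [t1 = b * c, t2 = a + t1]
        System.out.println(lowering.code());
    }
}
```

In the linear form every intermediate value has a name and every instruction lists the names it reads and writes, so definitions and uses can be collected by scanning the list. That is the information data-flow analysis needs and the tree does not expose directly.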

Question 3 (a)

Describe the techniques of peephole optimization that can be performed effectively and suggest a strategy for its implementation.
Code generators in most compilers operate locally, producing at best locally
optimal object code fragments that may not be efficient when juxtaposed.

The situation can be improved by peephole optimization techniques.

These techniques are:

Elimination of redundant instructions


Code generators often produce redundant instructions which can be deleted without
any side effect. There may be load and store instructions that act on the same data
unit even before its value has been modified. Such redundant load and store
instructions may be deleted. An unlabeled instruction sequence following an
unconditional jump instruction is unreachable and may be deleted too. A test or a
compare instruction not followed by a conditional branch instruction may also be
deleted. An inconsequential instruction sequence, such as addition of zero or two
consecutive negations, may be deleted too.

Elimination of redundant instructions summary

a) Removal of redundant load and store instructions
b) Removal of unreachable instructions
c) Removal of useless test and compare instructions
d) Removal of inconsequential instructions

Improvements in flow of control


An unconditional or a conditional jump instruction produced by a code generator
may have another jump instruction as its target. The flow of control of the program
can be improved by suitably changing the target address of such a jump instruction,
and in some cases the second jump instruction may even be deleted. Moreover, if a
conditional jump instruction is preceded by a compare or a negation instruction,
then the two instructions may be replaced by a new conditional jump instruction
that preserves the logic.

Improvements in flow of control summary

1. Elimination and coalescing of jump instructions
2. Modification of comparisons

Algebraic simplifications
If a program uses an expression consisting of two or more constants and no
variables, then that expression may be evaluated at compile time. Some
arithmetic instructions can be replaced by much simpler instructions when
specific operands occur. For example, x^2 can be implemented as x * x
and, if a square root instruction is available, x^0.5 may be implemented as sqrt(x).
Similarly, multiplication by a power of 2 may be implemented as a left shift, division by
a power of 2 may be implemented as a right shift (for unsigned or non-negative operands), addition of 1 may be implemented
as increment, subtraction of 1 may be implemented as decrement, and
multiplication or division by -1 may be implemented as negation. Logic expressions
may be simplified at compile time using Boolean algebra, for example
De Morgan's laws. Additionally, arithmetic instructions may be reordered, taking
advantage of the commutative and associative properties of the operations, to
facilitate further algebraic simplification.

Algebraic simplifications summary

1. Evaluation of constant expressions (constant folding)
2. Modifying instructions to simpler forms (operator strength reduction)
3. Simplification of logic expressions
4. Reordering arithmetic instructions

Use of machine idioms


To obtain an efficient object program, the machine-specific features of the target
machine need to be exploited at some stage.

A typical target machine supports several addressing modes. Choosing an
appropriate addressing mode can help in optimizing data transfer instructions, and
arithmetic and logical instructions. Target machines often provide some kind of
special instructions. For example, a target machine may provide an instruction to
duplicate the data unit at the top of the stack. Such special instructions should be
used to obtain an efficient object program.

Use of machine idioms summary

1. Addressing optimizations
2. Using special instructions
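One possible implementation strategy is to slide a small window over the generated instructions, apply a fixed set of rewrite rules to whatever is in the window, and repeat until nothing changes. The sketch below does this for three of the techniques above. The instruction syntax (ADD, STORE, LOAD, JMP, labels ending in a colon) belongs to a hypothetical machine invented for the example, and the rules assume that ADD sets no condition flags.

```java
import java.util.ArrayList;
import java.util.List;

// Peephole pass over a hypothetical instruction set, two instructions at a time.
final class Peephole {
    static List<String> optimize(List<String> input) {
        List<String> code = new ArrayList<>(input);
        boolean changed = true;
        while (changed) {               // one rewrite can expose another
            changed = false;
            for (int i = 0; i < code.size(); i++) {
                String cur = code.get(i);
                String next = i + 1 < code.size() ? code.get(i + 1) : "";
                if (cur.startsWith("ADD ") && cur.endsWith(", 0")) {
                    code.remove(i);     // adding zero is inconsequential
                    changed = true;
                    break;
                }
                if (cur.startsWith("STORE ")
                        && next.equals("LOAD " + cur.substring(6))) {
                    code.remove(i + 1); // the value is still in the register
                    changed = true;
                    break;
                }
                if (cur.startsWith("JMP ")
                        && next.equals(cur.substring(4) + ":")) {
                    code.remove(i);     // jump to the very next instruction
                    changed = true;
                    break;
                }
            }
        }
        return code;
    }

    public static void main(String[] args) {
        List<String> code = List.of(
                "LOAD R1, a", "ADD R1, 0", "STORE R1, x", "LOAD R1, x",
                "JMP L1", "L1:", "RET");
        // prints [LOAD R1, a, STORE R1, x, L1:, RET]
        System.out.println(optimize(code));
    }
}
```

Restarting the scan after each rewrite is simple rather than fast. A production pass would usually keep the window position and only back up by the window size.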

Question 3 (b)

Produce a state transition diagram (finite state machine)

Produce a state transition diagram (finite state machine) for a recognizer for a
comment in a high-level language where a comment starts with /* and ends with */.
Suggest how this recognizer could be implemented in software.

Simple state transition diagram for comment

Elaborate state transition diagram for comment

Implementing the recognizer in java

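import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Recognizer for a comment of the form /* ... */, driven by a transition table.
class CommentRecognizer
{
    // States: 0 start, 1 seen "/", 2 inside the comment body,
    // 3 seen "*" inside the body, 4 comment closed (accepting), 5 error.
    // Columns: 0 for '/', 1 for '*', 2 for any other character.
    static final int[][] q = {
        {1, 5, 5},
        {5, 2, 5},
        {2, 3, 2},
        {4, 3, 2},
        {5, 5, 5},
        {5, 5, 5}
    };

    static boolean isComment(String st)
    {
        int s = 0;
        for (int i = 0; i < st.length(); i++)
        {
            char c = st.charAt(i);
            int column = (c == '/') ? 0 : (c == '*') ? 1 : 2;
            s = q[s][column];
        }
        return s == 4;
    }

    public static void main(String[] args) throws IOException
    {
        BufferedReader obj = new BufferedReader(new InputStreamReader(System.in));
        System.out.println("Enter string:");
        String st = obj.readLine();
        if (st != null && isComment(st))
        {
            System.out.println("This string is a comment");
        }
        else
        {
            System.out.println("This string is not a comment");
        }
    }
}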

Question 3 (d)

Construct a grammar for arithmetic expression

(i) Assuming left-to-right associativity and precedence of "*" over "+", the
grammar is shown below.
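A common textbook grammar with these properties, given here as an assumption rather than as the original figure, is E -> E + T | T, T -> T * F | F, F -> ( E ) | digit. The sketch below parses that grammar by recursive descent and prints the grouping it implies. The class name ExprGrammar and the restriction to single digits are choices made for this illustration.

```java
// Grammar assumed for this sketch (left-recursive textbook form):
//   E -> E '+' T | T
//   T -> T '*' F | F
//   F -> '(' E ')' | digit
final class ExprGrammar {
    private final String src;
    private int pos = 0;

    private ExprGrammar(String src) { this.src = src; }

    // Returns the input with parentheses that make the grouping explicit.
    static String group(String src) {
        ExprGrammar p = new ExprGrammar(src);
        String tree = p.expr();
        if (p.pos != src.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return tree;
    }

    // E -> T ('+' T)* : the loop folds to the left, giving left associativity.
    private String expr() {
        String left = term();
        while (peek() == '+') {
            pos++;
            left = "(" + left + "+" + term() + ")";
        }
        return left;
    }

    // T -> F ('*' F)* : sits below E, so '*' binds tighter than '+'.
    private String term() {
        String left = factor();
        while (peek() == '*') {
            pos++;
            left = "(" + left + "*" + factor() + ")";
        }
        return left;
    }

    private String factor() {
        char c = peek();
        if (c == '(') {
            pos++;
            String inner = expr();
            if (peek() != ')') throw new IllegalArgumentException("expected ')'");
            pos++;
            return inner;
        }
        if (Character.isDigit(c)) {
            pos++;
            return String.valueOf(c);
        }
        throw new IllegalArgumentException("unexpected input at " + pos);
    }

    private char peek() {
        return pos < src.length() ? src.charAt(pos) : '\0';
    }

    public static void main(String[] args) {
        System.out.println(group("1+2+3*4"));   // ((1+2)+(3*4))
    }
}
```

The left-recursive rules cannot be coded directly as recursive calls, so each is rewritten as a loop. The loop still builds the tree from the left, which preserves left-to-right associativity.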

Question 4 (a)

Design a lexer class in Java

Design a lexer class in Java that scans statements from a source program, breaks
them into tokens, and stores them in a symbol table. A lexical error message should
be displayed if an invalid token is encountered during scanning.

Lexer class

import java.io.*;

// Released under the GNU General Public License Version 2, June 1991.

public class Lexical // Lexical processor of symbols
{
    InputStream inp;

    // lexical state variables
    int sy = -1;                  // code of the current symbol
    char ch = ' ';                // current input character
    byte[] buffer = new byte[1];  // one-byte read buffer
    boolean eof = false;          // true once the input is exhausted
    String theWord = "<?>";       // text of the most recent word
    int theInt = 666;             // value of the most recent numeral

    public static final int // symbol codes...
        word = 0,
        numeral = 1,
        open = 2,    // (
        close = 3,   // )
        plus = 4,    // +
        minus = 5,   // -
        times = 6,   // *
        over = 7,    // /
        eofSy = 8;

    public static final String[] Symbol = new String[]
    {
        "<word>", "<numeral>",
        "(", ")",
        "+", "-", "*", "/",
        "<eof>"
    };//Symbol

    public Lexical(InputStream inp) // constructor
    {
        this.inp = inp;
        insymbol();
    }//constructor

    // The accessors below define what a Lexical object is.
    public int sy() { return sy; }
    public boolean eoi() { return sy == eofSy; }
    public String theWord() { return theWord; }
    public int theInt() { return theInt; }

    public void insymbol() // insymbol
    // get the next symbol from the input stream
    {
        if (sy == eofSy) return;
        while (ch == ' ') getch(); // skip white space

        if (eof) sy = eofSy;

        else if (Character.isLetter(ch)) // words
        {
            StringBuilder w = new StringBuilder();
            while (Character.isLetterOrDigit(ch))
            {
                w.append(ch);
                getch();
            }
            theWord = w.toString();
            sy = word;
        }

        else if (Character.isDigit(ch)) // numbers
        {
            theInt = 0;
            while (Character.isDigit(ch))
            {
                theInt = theInt * 10 + (ch - '0');
                getch();
            }
            sy = numeral;
        }

        else // special symbols
        {
            int ch2 = ch;
            getch();
            switch (ch2)
            {
                case '+': sy = plus; break;
                case '-': sy = minus; break;
                case '*': sy = times; break;
                case '/': sy = over; break;
                case '(': sy = open; break;
                case ')': sy = close; break;
                default: error("bad symbol");
            }
        }
    }//insymbol

    void getch() // getch
    // NB. changes variable ch as a side-effect.
    {
        ch = '.';
        if (sy == eofSy) return;
        try
        {
            int n = 0;
            if (inp.available() > 0) n = inp.read(buffer);
            if (n <= 0) eof = true; else ch = (char) buffer[0];
        }
        catch (Exception e)
        {
            eof = true; // treat a failed read as end of input
        }
        if (ch == '\n' || ch == '\t') ch = ' ';
    }//getch

    void skipRest()
    {
        if (!eof) System.out.print("skipping to end of input...");
        int n = 0;
        while (!eof)
        {
            if (n % 80 == 0) System.out.println(); // break line
            System.out.print(ch);
            n++;
            getch();
        }
        System.out.println();
    }//skipRest

    public void error(String msg) // error
    {
        System.out.println("\nError: " + msg +
            " sy=" + sy + " ch=" + ch +
            " theWord=" + theWord + " theInt=" + theInt);
        skipRest();
        System.exit(1);
    }//error

    // the following main() allows Lexical to be tested in isolation
    public static void main(String[] argv)
    {
        System.out.println("--- Testing Lexical, L.Allison, CSSE, "
            + "Monash Uni, .au ---");
        for (int i = 0; i < argv.length; i++) // command line params if any
            System.out.print("argv[" + i + "]=" + argv[i] + "\n");
        Lexical lex = new Lexical(System.in);
        while (!lex.eoi())
        {
            int sy = lex.sy();
            System.out.print(sy + ": ");
            if (sy == Lexical.word) System.out.print(lex.theWord());
            if (sy == Lexical.numeral) System.out.print(lex.theInt());
            System.out.println(",");
            lex.insymbol();
        }
        System.out.println("--- end ---");
    }//main

}//Lexical class

Question 4(b)

Compare and contrast: one-pass compiler, two-pass compiler, and multi-pass compiler.

If you look at compilers internally, they operate in phases. Typical phases might be:

* Parse and syntax check
* Build symbol tables
* Perform semantic sanity check
* Determine control flow
* Determine data flow
* Generate some "intermediate" language (representing abstract instructions)
* Optimize the intermediate language
* Generate machine code from the optimized language

The exact set of phases varies from compiler to compiler. Each of
these steps pushes the program representation closer to the final machine code.

An N-pass compiler (single-pass, two-pass, or multi-pass) would bundle one or
more of these steps into a single pass.

A single-pass compiler bundles all the phases into one: it emits assembly (or binary code)
directly during parsing, without building an intermediate representation (IR) of the code,
such as an abstract syntax tree (AST).

For example, a hypothetical compiler based on a parser generator (like GNU Bison)
can emit assembly directly inside the semantic actions of grammar productions.
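The same emit-while-parsing idea can be sketched in plain Java rather than Bison. This is an illustration only: the class name OnePass, the space-separated input format and the stack-machine mnemonics PUSH, ADD and SUB are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// One-pass translation: instructions are emitted while parsing,
// and no tree or other intermediate representation is ever built.
final class OnePass {
    // Input: integers separated by + or -, with spaces between tokens.
    static List<String> compile(String src) {
        List<String> out = new ArrayList<>();
        String[] tokens = src.trim().split("\\s+");
        int i = 0;
        out.add("PUSH " + Integer.parseInt(tokens[i++]));    // first operand
        while (i < tokens.length) {
            String op = tokens[i++];
            if (i >= tokens.length) {
                throw new IllegalArgumentException("missing operand");
            }
            // The "semantic action": emit code as soon as a construct is seen.
            out.add("PUSH " + Integer.parseInt(tokens[i++]));
            if (op.equals("+")) out.add("ADD");
            else if (op.equals("-")) out.add("SUB");
            else throw new IllegalArgumentException("bad operator " + op);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [PUSH 1, PUSH 2, ADD, PUSH 3, SUB]
        System.out.println(compile("1 + 2 - 3"));
    }
}
```

Each instruction is emitted with knowledge of only the tokens seen so far, which is exactly why a compiler built this way cannot use information that appears later in the source.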

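Now, consider code in which a definition refers to a type, Edge, that is declared only later in the source, as happens with two mutually recursive type definitions.

Such code cannot be compiled with a single-pass compiler, as the compiler has no
knowledge of Edge when it is first encountered. This applies to any
mutually recursive definitions, or even just plain forward references that you may
encounter, unless the language requires forward declarations.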

It goes further. A single-pass compiler is, quite understandably, missing a lot of the
context for each token in the program, which severely limits the range of possible
optimisations that can be done. Let’s take, for example, a function foo() that stores the constant 42 in memory and then simply returns it.

A single-pass compiler has no way to determine that storing 42 in memory is useless, and will generate the machine code verbatim, while even the simplest
multi-pass compiler will reduce the function to one that just returns 42.

A multi-pass compiler might go even further and actually replace every single
invocation of foo() in the final code with the constant 42, something that a single-pass
compiler is, again, completely unable to do, because it is missing a lot of the context.
In summary, single-pass compilers are much simpler than multi-pass compilers, but they are also significantly less powerful, and cannot handle
mutual recursion.
Question 4 (c)

What are the pros and cons of introducing Intermediate Representation (IR)
in a compiler?
Pros

1. If a compiler translates the source language to its target machine language without the option of generating intermediate code, then a full native compiler is required for each new machine.
2. Intermediate code eliminates the need for a new full compiler for every unique machine by keeping the analysis portion the same for all the compilers.
3. It becomes easier to improve code performance by applying code optimization techniques to the intermediate code.
4. IR helps to separate the language-dependent front end from the hardware-dependent back end. That way you don’t have to write a second back end if you write another front end for a different language, or another front end if you port the same language to a different architecture.
5. The intermediate code can be used as the input to an interpreter, which can then double as a debugger; assuming that you will be writing the compiler in its own language after version

Cons

1. Introducing an IR can affect compilation speed and makes the compiler's architecture more complex.


Question 4
What is an attribute grammar?
An attribute grammar is a means of attaching semantics to a context-free grammar,
so it can help specify both the syntax and the semantics of a programming language.
An attribute grammar (when viewed over a parse tree) can pass values or information
among the nodes of the tree.

How can an attribute grammar be used to support semantic analysis?

By defining attributes that represent inferences about each point in the tree
representing the program, and by defining, per grammar rule (that is, per tree node type),
computations that combine already computed attributes to infer new values.

The value of doing this on a per-grammar-rule basis is that it gives a very good way to
approach the problem of global analysis using many individually local
computations, which are easy to understand.

As an example, one might define an attribute called “Type” with the intention of
computing Type for every expression tree node. Then an attribute rule for a “+” node
might combine the Type values for the children of the “+” node to compute the type
of the result of adding the operands.
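The Type example can be sketched as a synthesized attribute computed bottom-up over the tree. The typing rules used here (int + int gives int, string + string gives string, anything else is an error) and all class names are assumptions made for the illustration, not rules of any particular language.

```java
// A synthesized "Type" attribute computed bottom-up over an expression tree.
final class TypeAttr {
    enum Type { INT, STRING, ERROR }

    interface Node {
        Type type();   // the attribute value at this node
    }

    static final class IntLit implements Node {
        public Type type() { return Type.INT; }
    }

    static final class StrLit implements Node {
        public Type type() { return Type.STRING; }
    }

    static final class Plus implements Node {
        final Node left, right;

        Plus(Node left, Node right) {
            this.left = left;
            this.right = right;
        }

        // Attribute rule for "+": combine the Type values of the children.
        public Type type() {
            Type l = left.type();
            Type r = right.type();
            if (l == Type.INT && r == Type.INT) return Type.INT;
            if (l == Type.STRING && r == Type.STRING) return Type.STRING;
            return Type.ERROR;   // mixed or already erroneous operands
        }
    }

    public static void main(String[] args) {
        Node ok = new Plus(new IntLit(), new Plus(new IntLit(), new IntLit()));
        Node bad = new Plus(new IntLit(), new StrLit());
        System.out.println(ok.type());    // INT
        System.out.println(bad.type());   // ERROR
    }
}
```

Each rule looks only at a node and its children, yet evaluating the root yields a check of the whole expression, which is the local-to-global effect described above.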
