The OC compiler was written as a quarter long project for CMPS104A - Fundamentals of Compiler Design during Fall of 2015.

All source code [HERE](https://github.com/manmeet3/OC-compiler)
SYNOPSIS:
oc [-ly] [-@ flag ...] [-D string] program.oc

OPTIONS:
-@ flags Call set_debugflags, and use DEBUGF and DEBUGSTMT for debugging.
-D string Pass this option and its argument to cpp.
-D__OCLIB_OH__ to suppress inclusion of the code from
oclib.oh when testing a program. -l Debug yylex() with yy_flex_debug = 1
-y Debug yyparse() with yydebug = 1

    OUTPUT ----------------- FILENAME
    String set ---------------- prog_name.str
    Scanned tokens ------- prog_name.tok
    Abstract syntax tree - prog_name.ast
    Symbol table ----------- prog_name.sym
    Interm. language ----- prog_name.oil
The process was broken down into 5 stages:
I. Preprocessor and token generation
II. Lexical Analyzer (using flex)
III. LALR(1) Parser (using bison)
IV. Symbols and Type Checking
V. Intermediate Language Emission

I. Preprocessor and string set generation

The first part of the compiler consisted of writing a main program for the language oc. Included in it was a string set ADT which was used to create string sets based on the output from the C preprocessor. These sets are written to a file with .str suffix and the specified filename suffix.

The main program called a test harness for the string set ADT which works as follows: after filtering the input through the C preprocessor, a line is read using fgets(3) and tokenized using strtok_r(3) with the string "*\t\n". After that, the main program calls the string set ADT operation to dump the string set into its trace file. The purpose of the string set is to keep track of strings in a unique manner. If, for example, "abc" is entered multiple times, it appears only once in the table. And so, instead of using strcmp(3), one can just compare pointers.

II. Lexical Analyzer

The second part consisted of augmenting the string table from part I to a scanner writtein in flex. The string sets from the part I are identified as tokens, based on the regex identified in the flex "scanner.l" file.

The tokens in the oc language include:
Special symbols :

[] ( ) [ ] { } ; , . = == != < <= > >= + - * / % !

Reserved words :

void bool char int string struct if else while return false true null ord chr new

Identifiers: Any sequence of upper or lower-case ASCII letters, digits, and underscores, but may not begin with a digit.
Integer constants: consist of any sequence of decimal digits. (No octa, hex or floating point support.)

Char constants consist of a pair of single quote marks with a single character or escape between them.

String constants consist of a pair of double quote marks with zero or more characters or escapes between them

Output directives from C preprocessor must be scanned and use to error print locations in the source code.

These tokens are written into a .tok file.
#cat foo.tok
2 16.003 264 TOK_KW_RETURN (return)
2 16.010 61 '=' (=)
2 20.008 258 TOK_IDENT (hello)
2 20.010 271 TOK_LIT_INT (1234)
2 25.002 123 '{' ({})
2 26.008 272 TOK_LIT_STRING ("beep")

III. LALR Parser

Part three consisted of writing an LALR parser using bison.
HTML5 Icon
These are the grammatical rules that the parser follows. In the implementation, they are translated to bison acceptable form. The bison acceptable form includes explicitly enumerating all possible rules with operators in them. Bison operator precedence declarations reduce the number of necessary rules. HTML5 Icon

This table shows operator precedence and associativity. It contains a superset of information used in the %left and %right declarations in the bison file.

Construction of the abstract syntax tree is in a way such that all operators and operator-like tokens are the parents of their operands, which are adopted as their children. The children may be leaf nodes (identifiers or constants) or other expressions. Constants and identifiers are always leaf nodes. In general, interior nodes may have an arbitrarily large number of children.

More generally, bison reads tokens created by the scanner and pushes them onto a stack along with their semantic values. The stack is called the parser stack. Pushing a token is traditionally called shifting. And this shifting takes place based on the grammar rules. Lastly, shift reduce conflicts are some of the most common errors while parsing that can result in the parser getting stuck in a loop.

The following is a partial output of a .ast file that the abstract syntax tree is written to.

#cat foo.ast
ROOT "" 0.0.0
| FUNCTION "" 0.1.0
| | INT "int" 0.1.0
| | | DECLID "fac" 0.1.4
| | PARAM "(" 0.1.8
| | | INT "int" 0.1.9
| | | | DECLID "n" 0.1.13
| | BLOCK "{" 0.1.15
| | | VARDECL "=" 0.2.9
| | | | INT "int" 0.2.3
| | | | | DECLID "f" 0.2.7
.
.
.

IV. Symbols and Type Checking

The fourth part consisted of creating a symbol table and type checking. The symbol table maintains all of the identifiers in an oc program so they can be looked up when referenced. Additionally, it also maintains scope and range information that determines from where a symbol may be referenced. Symbols in oc are of three types:
  1. Type Names: Consist of reserved words [void, bool, char, int, string] and identifier names defined via struct definitions. These all have a global score and are entered into type name table.
  2. Field Names: These are identifiers that may be used immediately following the dot (.) operator for field selection. The same name may be used in different structs without conflict.
  3. Function and Variable Names: These identifiers are entered into the identifier symbol tables. All functions have global scope, whereas variables may have global or local scope, nested arbitrarily deep. Local scopes override global scopes and disjoint scopes may contain variables with the same names.
There are also, three categories of types in oc and each has groups of types within it. Briefly, they are
  1. void is neither primitive nor reference.
  2. Primitive types are represented by inline inside structs and local or global variables.
  3. Reference types are pointers to objects on the heap. They can be local, global, or fields. All object types reside on the heap. Pointer arithmetic is prohibited.
Type Checking involves a post-order depth-first traversal of the AST created in part III. The following names are used:
primitive is any primitive type
base is any type that can be used as a base type for an array
any is either primitive or reference

The rules for type checking are listed as follows:
a) Two types are compatible if they are the same type or if one type is any and the other is null.
b) When the right side of a production is empty, there are no type attributes. Only expressions have type attributes, not statements.
c) Assignment (=) results in inheriting the type of its left operand.
d) Fields following a selector have the field attribute but no type attribute.
e) Identifiers have the type attributes that they derive from the symbol table.
f) Field selection sets the selector (.) attribute as follows : The left operand must. be a struct type or an error message is generated.
g) For a CALL, evaluate the types of each of the arguments, and look up the function in the identifier table. Then verify that the arguments are compatible with the parameters and error if not.
h) The expression operand of both if and while must be of type bool. i) The indexing operator for an array returns the address of one of its elements, and for a string, the address of one of its characters.

The following is a partial output of a .sym file that the symbol table is written to.

#cat foo.sym
node (0.1.7) {0} struct "node"
 foo (0.1.18) field {node} int
 link (0.1.27) field {node} struct "node"
func (0.2.5) {0} struct "node" function
  head (0.2.15) {1} struct "node" variable lval param
 length (0.2.24) {1} int variable lval param
  a (0.3.7) {1} int variable lval
  b (0.3.14) {1} string variable lval
  c (0.3.24) {1} struct "node" variable lval
.
.
.

V. Intermediate Code Generation

An intermediate language is a very low level language used by a compiler to perform optimizations and other changes before emitting final assembly language for some particular machine. It generally matches common assembly language state- ment semantics, but in a typeful manner.

The intermediate langugae here looks very much like C, except that each oil statement should be capable of being translated into a single assembly language instruction. The basic grammer used in an oil program is the same metanotation as was used in part III.

HTML5 Icon

Program and Function Structure

  1. Structure definitions come first with the struct keyword and closing brace left-aligned, and the fields indented.
  2. Then, all string constants are listed in the order they appear in the program. Global declarations are all left-aligned.
  3. Next are all the global variable declarations.
  4. Finally, functions are emitted with one parameter per line. The function names and the two braces are left aligned, as are labels, but everything else is indented.

Statements

Statements are all indented, except for labels, and all declarations of local variables must be initialized.
  1. A label is emitted unindented on a line by itself. It consists of a keyword from the source language followed by the coordinates taken from the node
  2. A goto statement may be conditional or unconditional. If conditional, must be preceded by an if and an operand
  3. A return statement may or may not have an operand.
  4. An assignment statement takes an expression as an operand. A type is present if the identifier has not previously been declared.
  5. A function may only have operands in parentheses, but may have as many as appropriate.

Expressions

Expressions are simple and restricted to three-address code.
  1. Binary operators are the same as in the source program where 0 is false, anything else true.
  2. The unary operators are also the same
  3. Address computation for indexing and field selection is taken care of by the type checker.
  4. Bool data type is replaced by char

Name Mangling

HTML5 Icon

Code Emission

Code is emitted from the AST in part 3 with the benefit of the symbol table and type checker. Since the suffix oil is not recognized by gcc, in order to make it compile, the below shell script can be used, given the name of executable.

#!/bin/sh -x
# $Id: ocl,v 1.3 2012-11-16 20:57:44-08 - - $
for file in $*
do
gcc -g -o $file -x c $file.oil oclib.c
done

The following is a sample output of an oil program that the intermediate code is written to.

#cat foo.oil
int __n;
int __fac (
  int _12_n)
{
  int _12_f = 1;
while_5_2_3:;
  char b1 = _12_n > 1;
  if (!b1) goto break_5_2_3;
  int i1 = _12_f * _12_n;
  _12_f = i1;
  int i2 = _12_n - 1;
  _12_n = i2;
  goto while_5_2_3;
break_5_2_3:
  return _12_f
}
void __ocmain (void)
{
  __n = 1;
while_5_11_0:;
  char b2 = __n <= 5;
  if (!b2) goto break_5_11_0;
  int i3 = __fac (__n);
  __puti (i3);
  __endl();
  int i4 = __n + 1;
  __n = i4;
  goto while_5_11_0;
break_5_11_0:;
}
HTML5 Icon Supported operations include: file creation, directory traversal, file reads, writes and seeks, etc. Soft or hard links are not supported
comments powered by Disqus