Chap 11
The development of lexical analysis and parsing tools has been an important area of
research in computer science. This work has produced the lexer and parser generators
lex and yacc, whose worthy scions ocamllex and ocamlyacc are presented in this chapter.
These two tools are the de facto standard for implementing lexers and parsers, but there
are other tools, like streams or the regular expression library Str, that may be adequate
for applications which do not need a powerful analysis.
The need for such tools is especially acute in the area of state-of-the-art programming
languages, but other applications can profit from such tools: for example, database
systems offering the possibility of issuing queries, or spreadsheets defining the contents
of cells as the result of the evaluation of a formula. More modestly, it is common to
use plain text files to store data; for example system configuration files or spreadsheet
data. Even in such limited cases, processing the data involves some amount of lexical
analysis and parsing.
In all of these examples the problem that lexical analysis and parsing must solve is
that of transforming a linear character stream into a data item with a richer structure:
a string of words, a record structure, the abstract syntax tree for a program, etc.
All languages have a set of vocabulary items (lexicon) and a grammar describing how
such items may be combined to form larger items (syntax). For a computer or program
to be able to correctly process a language, it must obey precise lexical and syntac-
tic rules. A computer does not have the detailed semantic understanding required to
resolve ambiguities in natural language. To work around the limitation, computer lan-
guages typically obey clearly stated rules without exceptions. The lexical and syntactic
structure of such languages has received formal definitions that we briefly introduce in
this chapter before introducing their uses.
288 Chapter 11 : Tools for lexical analysis and parsing
Chapter Structure
This chapter introduces the tools of the Objective Caml distribution for lexical analysis
and parsing. The latter normally supposes that the former has already taken place. In
the first section, we introduce a simple tool for lexical analysis provided by module
Genlex. Next we give details about the definition of sets of lexical units by introducing
the formalism of regular expressions. We illustrate their behavior within module Str
and the ocamllex tool. In section two we define grammars and give details about
sentence production rules for a language to introduce two types of parsing: bottom-up
and top-down. They are further illustrated by using Stream and the ocamlyacc tool.
These examples use context-free grammars. We then show how to carry out contextual
analysis with Streams. In the third section we go back to the example of a BASIC
interpreter from page 159, using ocamllex and ocamlyacc to implement the lexical
analysis and parsing functions.
Lexicon
Lexical analysis is the first step in character string processing: it segments character
strings into a sequence of words also known as lexical units or lexemes.
Module Genlex
This module provides a simple primitive allowing the analysis of a string of characters
using several categories of predefined lexical units. These categories are distinguished
by type:
# type token =
    Kwd of string
  | Ident of string
  | Int of int
  | Float of float
  | String of string
  | Char of char ;;
Constructor Ident designates the category of identifiers: strings
of letters and digits, including underscore (_) or apostrophe ('). Such a string should
not start with a digit. We also consider as identifiers (for this module at least) strings
containing operator symbols, such as +, *, > or =. Finally, constructor Kwd defines the
category of keywords containing distinguished identifiers or special characters (specified
by the programmer when invoking the lexer).
The only variant of the token type controlled by parameters is that of keywords. The
following primitive allows us to create a lexical analyser (lexer) taking as keywords the
list passed as first argument to it.
# Genlex.make_lexer ;;
- : string list -> char Stream.t -> Genlex.token Stream.t = <fun>
The result of applying make_lexer to a list of keywords is a function taking as input
a stream of characters and returning a stream of lexical units (of type token).
Thus we can easily obtain a lexer for our BASIC interpreter. We declare the set of
keywords:
# let keywords =
    [ "REM"; "GOTO"; "LET"; "PRINT"; "INPUT"; "IF"; "THEN";
      "-"; "!"; "+"; "-"; "*"; "/"; "%";
      "="; "<"; ">"; "<="; ">="; "<>";
      "&"; "|" ] ;;
Function line_lexer takes as input a string of characters and returns the corresponding
stream of lexemes.
Use of Streams
We can carry out the lexical analysis “by hand” by directly manipulating streams.
The following example is a lexer for arithmetical expressions. Function lexer takes
a character stream and returns a stream of lexical units of type lexeme Stream.t.
Spaces, tabs and newline characters are removed. To simplify, we do not consider
variables or negative integers.
Function lexint carries out the lexical analysis for the portion of a stream describing
an integer constant. It is called by function lexer when lexer finds a digit on the
input stream. Function lexint then consumes all consecutive digits to obtain the
corresponding integer value.
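The division of labour between lexer and lexint described above can be sketched as follows. This is only an illustration under stated assumptions: it works on a string and returns a list rather than a stream (so it needs nothing beyond the standard library), and the constructors Lint and Lsymbol are assumed here, since the definition of type lexeme is not reproduced in this section.

```ocaml
(* Assumed lexeme type for this sketch. *)
type lexeme = Lint of int | Lsymbol of string

let is_digit c = '0' <= c && c <= '9'

(* lexint s i acc consumes the consecutive digits of s starting at index i,
   accumulating their value in acc; it returns the value and the index
   of the first character that is not a digit. *)
let rec lexint s i acc =
  if i < String.length s && is_digit s.[i]
  then lexint s (i + 1) (10 * acc + Char.code s.[i] - Char.code '0')
  else (acc, i)

(* lexer s returns the lexemes of s in order; spaces, tabs and newlines
   are skipped, and lexint is called as soon as a digit is found. *)
let lexer s =
  let n = String.length s in
  let rec loop i acc =
    if i >= n then List.rev acc
    else match s.[i] with
      | ' ' | '\t' | '\n' -> loop (i + 1) acc
      | '0' .. '9' ->
          let (v, j) = lexint s i 0 in
          loop j (Lint v :: acc)
      | '+' | '-' | '*' | '/' | '(' | ')' as c ->
          loop (i + 1) (Lsymbol (String.make 1 c) :: acc)
      | c -> failwith (Printf.sprintf "unexpected character %c" c)
  in
  loop 0 []
```

For instance, lexer "1*(2+30)" yields the lexemes for 1, *, (, 2, +, 30 and ), in that order.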
Regular Expressions
Let’s abstract a bit and consider the problem of lexical units from a more theoretical
point of view. (Translators’ note: from an academic standpoint, the proper term would
have been “Rational Expressions”; we chose the term “regular” to follow the programmers’
tradition.)
From this point of view, a lexical unit is a word. A word is formed by concatenating
items in an alphabet. For our purposes, the alphabet we are considering is a subset
of the ASCII characters. Theoretically, a word may contain no characters (the empty
word) or just a single character. The theoretical study of the assembly of lexical items
(lexemes) from members of an alphabet has brought about a simple formalism known
as regular expressions.
Definition A regular expression defines a set of words. For example, a regular ex-
pression could specify the set of words that are valid identifiers. Regular expressions
are specified by a few set-theoretic operations. Let M and N be two sets of words.
Then we can specify:
1. the union of M and N , denoted by M | N .
2. the complement of M , denoted by ^M . This is the set of all words not in M .
3. the concatenation of M and N . This is the set of all the words formed by placing
a word from M before a word from N . We denote this set simply by M N .
4. the set of words formed by a finite sequence of words in M , denoted M +.
5. for syntactic convenience, we write M ? to denote the set of words in M , with
addition of the empty word.
Individual characters denote the singleton set of words containing just that character.
Expression a | b | c thus describes the set containing three words: a, b and c. We
will use the more compact syntax [abc] to define such a set. As our alphabet is ordered
(by the ASCII code order) we can also define intervals. For example, the set of digits
can be written: [0-9]. We can use parentheses to group expressions.
If we want to use one of the operator characters as a character in a regular expression,
it should be preceded by the escape character \. For example, (\*)* denotes the set
of sequences of stars (the postfix * stands for zero or more repetitions, that is, M* is (M+)?).
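To make the set-theoretic reading concrete, here is a small sketch of a membership test for regular expressions. It is not how Str or ocamllex work internally (they build automata); it uses derivatives, a technique chosen here only because it fits in a few lines. The type regexp and the functions nullable, deriv, range and matches are names introduced for this illustration, and complement is left out for brevity.

```ocaml
type regexp =
  | Empty                      (* the empty set: matches no word *)
  | Eps                        (* the set containing only the empty word *)
  | Chr of char                (* a single character *)
  | Alt of regexp * regexp     (* M | N *)
  | Seq of regexp * regexp     (* M N *)
  | Plus of regexp             (* M+ *)
  | Opt of regexp              (* M? *)

(* nullable r is true when the empty word belongs to the set described by r. *)
let rec nullable = function
  | Empty | Chr _ -> false
  | Eps | Opt _ -> true
  | Alt (m, n) -> nullable m || nullable n
  | Seq (m, n) -> nullable m && nullable n
  | Plus m -> nullable m

(* deriv c r describes the words w such that c followed by w belongs to r. *)
let rec deriv c = function
  | Empty | Eps -> Empty
  | Chr d -> if c = d then Eps else Empty
  | Alt (m, n) -> Alt (deriv c m, deriv c n)
  | Seq (m, n) ->
      let first = Seq (deriv c m, n) in
      if nullable m then Alt (first, deriv c n) else first
  | Plus m -> Seq (deriv c m, Opt (Plus m))   (* M+ = M (M+)? *)
  | Opt m -> deriv c m

(* matches r s tests whether the whole string s is a word of r. *)
let matches r s =
  let current = ref r in
  String.iter (fun c -> current := deriv c !current) s;
  nullable !current

(* range lo hi builds the interval [lo-hi] as a union of characters. *)
let range lo hi =
  let rec go c acc =
    if c < Char.code lo then acc
    else go (c - 1) (Alt (Chr (Char.chr c), acc))
  in
  go (Char.code hi) Empty

let digit = range '0' '9'                       (* [0-9] *)
let integer = Plus digit                        (* [0-9]+ *)
let signed = Seq (Opt (Chr '-'), integer)       (* -?[0-9]+ *)
```

With these definitions, matches integer "2024" holds, whereas matches integer "" and matches integer "20a" do not.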
Recognition While regular expressions are a useful formalism in their own right, we
usually wish to implement a program that determines whether a string of characters (or
one of its substrings) is a member of the set of words described by a regular expression.
For that we need to translate the formal definition of the set into a recognition and
expression processing program. In the case of regular expressions such a translation can
be automated. Such translation techniques are carried out by module Genlex, by library
Str (described in the next section) and by the ocamllex tool that we introduce in
the following two sections.
Usually the lexical description files are given the extension .mll. To obtain an
Objective Caml source from a file lex_file.mll, you type the command
ocamllex lex_file.mll
A file lex_file.ml is generated containing the code for the corresponding analyzer.
This file can then be compiled with other modules of an Objective Caml application.
For each set of lexical analysis rules there is a corresponding function taking as input a
lexical buffer (of type Lexing.lexbuf) and returning the value defined by the semantic
actions. Consequently, all actions in the same rule must produce values of the same
type.
The general format for an ocamllex file is
{
header
}
let ident = regexp
...
rule ruleset1 = parse
regexp { action }
| ...
| regexp { action }
and ruleset2 = parse
...
and ...
{
trailer-and-end
}
Both sections “header” and “trailer-and-end” are optional. They contain Objective
Caml code defining types, functions, etc. needed for processing. The code in the last
section can use the lexical analysis functions that will be generated by the middle sec-
tion. The declaration list preceding the rule definition allows the user to give names to
some regular expressions. They can later be invoked by name in the definition of rules.
Example Let’s revisit our BASIC example. We will want to refine the type of lexical
units returned. We will once again define function lexer (as we did on page 163) with
the same type of output (lexeme), but taking as input a buffer of type Lexing.lexbuf.
{
(* drop the first and last characters of s, i.e. the enclosing quotes *)
let string_chars s =
  String.sub s 1 ((String.length s)-2) ;;
}
The translation of this file by ocamllex returns function lexer of type Lexing.lexbuf
-> lexeme. We will see later how to use such a function in conjunction with syntactic
analysis (see page 305).
Syntax
Thanks to lexical analysis, we can split up input streams into more structured units:
lexical units. We still need to know how to assemble these units so that they amount
to syntactically correct sentences in a given language. The syntactic assembly rules
are defined by grammar rules. This formalism was originally developed in the field of
linguistics, and has proven immensely useful to language-theoretical mathematicians
and computer scientists in that field. We have already seen on page 160 an instance of
a grammar for the Basic language. We will resume this example to introduce the basic
concepts for grammars.
Grammar
Formally, a grammar is made up of four elements:
1. A set of symbols called terminals. Such symbols represent the lexical units of
the language. In Basic, the lexical units (terminals) are: the arithmetic operators
and the logical and relational symbols (+, &, <, <=, ..), the keywords of the language
(GOTO, PRINT, IF, THEN, ..), integers (integer units) and variables (variable units).
2. A set of symbols called non-terminals. Such symbols stand for syntactic terms of
the language. For example, a Basic program is composed of lines (and thus we
have the term Line), a line may contain an Expression, etc.
3. A set of so-called production rules. These describe how terminal and non-terminal
symbols may be combined to produce a syntactic term. A Basic line is made up
of a number followed by an instruction. This is expressed in the following rule:
Line ::= integer Instruction
For any given term, there may be several alternative ways to form that term. We
separate the alternatives with the symbol | as in
Exp ::= integer        (R1)
      | Exp + Exp      (R2)
      | Exp * Exp      (R3)
      | ( Exp )        (R4)
where (R1) (R2) (R3) and (R4) are the names given to our rules. After lexical analysis,
the expression 1*(2+3) becomes the sequence of lexemes:
integer * ( integer + integer )
To analyze this sentence and recognize that it really belongs to the language of arith-
metic expressions, we are going to use the rules from right to left: if a subexpression
matches the right-side member of a rule, we replace it with the corresponding left-side
member and we re-run the process until reducing the expression to the non-terminal
start (here Exp). Here are the stages of such an analysis; the rule used at each stage
is noted on the right:
integer * ( integer + integer )
   ←− Exp * ( integer + integer )    (R1)
   ←− Exp * ( Exp + integer )        (R1)
   ←− Exp * ( Exp + Exp )            (R1)
   ←− Exp * ( Exp )                  (R2)
   ←− Exp * Exp                      (R4)
   ←− Exp                            (R3)
Starting from the last line containing only Exp and following the arrows upwards we
read how our expression could be produced from the start rule Exp: therefore it is a
well-formed sentence of the language defined by the grammar.
Top-down Parsing
The analysis of the expression 1*(2+3) introduced in the previous paragraph is not
unique: it could also have started by reducing integers from right to left, which would
have permitted rule (R2) to reduce 2+3 from the beginning instead. These two ways to
proceed constitute two types of analysis: top-down parsing (right-to-left) and bottom-up
parsing (left-to-right). The former is easily realizable with lexeme streams using
module Stream. Bottom-up parsing is that carried out by the ocamlyacc tool. It uses
an explicit stack mechanism like the one already described for the parsing of Basic
programs. The choice of parsing type is significant, as top-down analysis may or may
not be possible given the form of the grammar used to specify the language.
A Simple Case
The canonical example for top-down parsing is the prefix notation of arithmetic
expressions defined by:
Expr ::= integer
       | + Expr Expr
       | * Expr Expr
In this case, knowing the first lexeme is enough to decide which production rule can
be used. This immediate predictability obviates managing the parse stack explicitly
by instead using the stack of recursive calls in the parser. Therefore, it is very easy to
write a program implementing top-down analysis using the features in modules Genlex
and Stream. Function infix_of is an example; it takes a prefix expression and returns
its equivalent infix expression.
# let lexer s =
    let ll = Genlex.make_lexer ["+";"*"]
    in ll (Stream.of_string s) ;;
val lexer : string -> Genlex.token Stream.t = <fun>
# let rec stream_parse s =
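Since the first lexeme alone decides which rule applies, a top-down parser for prefix expressions can be written as a single recursive function. The following sketch is not the stream-based definition of infix_of; it assumes a list of tokens instead of a stream, and its token type merely mirrors the two Genlex constructors that matter here.

```ocaml
(* Assumed token type: the Int and Kwd cases of Genlex.token. *)
type token = Int of int | Kwd of string

(* parse_expr toks reads one prefix expression at the head of toks.
   It returns the fully parenthesised infix form of that expression
   together with the tokens that were not consumed. *)
let rec parse_expr = function
  | Int n :: rest -> (string_of_int n, rest)
  | Kwd ("+" | "*" as op) :: rest ->
      (* the operator announces exactly two sub-expressions *)
      let (left, rest) = parse_expr rest in
      let (right, rest) = parse_expr rest in
      ("(" ^ left ^ op ^ right ^ ")", rest)
  | _ -> failwith "syntax error"

(* infix_of toks succeeds only if the whole input is one expression. *)
let infix_of toks =
  match parse_expr toks with
  | (s, []) -> s
  | _ -> failwith "trailing tokens"
```

The recursive calls play the role of the parse stack: each pending operator waits on the call stack until its two operands have been read.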
The usual grammar for arithmetical expressions on page 296 is not directly suitable for
top-down analysis: it does not satisfy any of the above-stated criteria. To be able to use
top-down parsing, we must reformulate the grammar so as to suppress left-recursion
and non-determinism in the rules. For arithmetic expressions, we may use, for instance:
Expr ::= Atom NextExpr
NextExpr ::= + Atom | - Atom | * Atom | / Atom | ε
Atom ::= integer | ( Expr )
Note that the use of the empty word ε in the definition of NextExpr is compulsory
if we want a single integer to be an expression.
Our grammar allows the implementation of the following parser which is a simple
translation of the production rules. This parser produces the abstract syntax tree of
arithmetic expressions.
# let rec rest = parser
    (* NextExpr: an operator followed by an atom, or the empty word *)
    [< 'Lsymbol "+"; e2 = atom >] -> Some (PLUS,e2)
  | [< 'Lsymbol "-"; e2 = atom >] -> Some (MINUS,e2)
  | [< 'Lsymbol "*"; e2 = atom >] -> Some (MULT,e2)
  | [< 'Lsymbol "/"; e2 = atom >] -> Some (DIV,e2)
  | [< >] -> None
  and atom = parser
    (* Atom: an integer or a parenthesized expression *)
    [< 'Lint i >] -> ExpInt i
  | [< 'Lsymbol "("; e = expr ; 'Lsymbol ")" >] -> e
  and expr s =
    (* Expr: an atom followed by NextExpr *)
    match s with parser
      [< e1 = atom >] ->
        match rest s with
          None -> e1
        | Some (op,e2) -> ExpBin(e1,op,e2) ;;
val rest : lexeme Stream.t -> (bin_op * expression) option = <fun>
val atom : lexeme Stream.t -> expression = <fun>
val expr : lexeme Stream.t -> expression = <fun>
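The same three functions can be written without the stream-parser syntax, which may help to see what each production does with the input. This is a sketch under assumptions: it consumes a list of lexemes and returns the unread remainder explicitly, and the three type definitions are restated here as assumed, since only their constructors appear in the parser above.

```ocaml
type lexeme = Lint of int | Lsymbol of string
type bin_op = PLUS | MINUS | MULT | DIV
type expression = ExpInt of int | ExpBin of expression * bin_op * expression

let op_of_symbol = function
  | "+" -> Some PLUS | "-" -> Some MINUS
  | "*" -> Some MULT | "/" -> Some DIV
  | _ -> None

(* Atom ::= integer | ( Expr ) *)
let rec atom = function
  | Lint i :: toks -> (ExpInt i, toks)
  | Lsymbol "(" :: toks ->
      (match expr toks with
       | (e, Lsymbol ")" :: toks) -> (e, toks)
       | _ -> failwith "expected )")
  | _ -> failwith "expected an integer or ("

(* NextExpr ::= operator Atom | empty word.
   When no operator is found, nothing is consumed. *)
and rest = function
  | Lsymbol s :: toks as all ->
      (match op_of_symbol s with
       | Some op -> let (e2, toks) = atom toks in (Some (op, e2), toks)
       | None -> (None, all))
  | toks -> (None, toks)

(* Expr ::= Atom NextExpr *)
and expr toks =
  let (e1, toks) = atom toks in
  match rest toks with
  | (None, toks) -> (e1, toks)
  | (Some (op, e2), toks) -> (ExpBin (e1, op, e2), toks)
```

As with the grammar it follows, this parser reads at most one operator per parenthesis level: on the lexemes of 1+2+3 it returns the tree for 1+2 and leaves the last two lexemes unread.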
The problem with using top-down parsing is that it forces us to use a grammar which is
very restricted in its form. Moreover, when the object language is naturally described
with a left-recursive grammar (as in the case of infix expressions) it is not always trivial
to find an equivalent grammar (i.e. one defining the same language) that satisfies the
requirements of top-down parsing. This is the reason why tools such as yacc and
ocamlyacc use a bottom-up parsing mechanism which allows the definition of more
natural-looking grammars. We will see, however, that not everything is possible with
them, either.
Bottom-up Parsing
On page 165, we introduced intuitively the actions of bottom-up parsing: shift and
reduce. With each of these actions the state of the stack is modified. We can deduce
from this sequence of actions the grammar rules, provided the grammar allows it, as
in the case of top-down parsing. Here, also, the difficulty lies in the non-determinism
of the rules which prevents choosing between shifting and reducing. We are going to
illustrate the inner workings of bottom-up parsing and its failures by considering those
pervasive arithmetic expressions in postfix and prefix notation.
The Good News The simplified grammar for postfix arithmetic expressions is:
Expr ::= integer
       | Expr Expr +
       | Expr Expr *
This grammar is dual to that of prefix expressions: it is necessary to wait until the
end of each analysis to know which rule has been used, but then one knows exactly
what to do. In fact, the bottom-up analysis of such expressions resembles quite closely
a stack-based evaluation mechanism. Instead of pushing the results of each calculation,
we simply push the grammar symbols. The idea is to start with an empty stack, then
obtain a stack which contains only the start symbol once the input is used up. The
modifications to the stack are the following: when we shift, we push the current
terminal (the lexeme just read); if we may reduce, it is because the topmost elements
of the stack match the right-hand member of a rule (in reverse order), in which case
we replace these elements by the corresponding left-hand non-terminal.
Figure 11.2 illustrates how bottom-up parsing processes expression: 1 2 + 3 * 4 +.
The input lexical unit is underlined. The end of input is noted with a $ sign.
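The shift and reduce steps just described can be sketched as a recognizer for postfix expressions. This is an illustration only, not the automaton that ocamlyacc generates: the types terminal and symbol and the three functions are names assumed for the sketch, and reduction is attempted greedily after every shift, which is safe for this particular grammar.

```ocaml
type terminal = Integer | Plus | Times       (* lexical units *)
type symbol = T of terminal | Expr           (* grammar symbols kept on the stack *)

(* reduce stack applies one rule if its right-hand member is on top of the
   stack. The head of the list is the top of the stack, so right-hand
   members appear in reverse order. *)
let reduce = function
  | T Integer :: below -> Some (Expr :: below)            (* Expr ::= integer *)
  | T (Plus | Times) :: Expr :: Expr :: below ->
      Some (Expr :: below)                                (* Expr ::= Expr Expr op *)
  | _ -> None

(* reduce_all reduces as long as some rule applies. *)
let rec reduce_all stack =
  match reduce stack with
  | Some stack' -> reduce_all stack'
  | None -> stack

(* recognize input starts with an empty stack, shifts each lexeme in turn,
   and accepts when the stack holds only the start symbol at end of input. *)
let recognize input =
  let final =
    List.fold_left (fun stack t -> reduce_all (T t :: stack)) [] input in
  final = [Expr]
```

On the lexemes of 1 2 + 3 * 4 + the stack never grows beyond three symbols and ends as the single symbol Expr, so the input is accepted.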
The Bad News The difficulty of migrating the grammar into the recognition pro-
gram is determining which type of action to apply. We will illustrate this difficulty with
three examples which generate three types of indeterminacy.
The first example is a grammar for expressions using only addition:
E0 ::= integer      (R1)
     | E0 + E0      (R2)
The indeterminacy in this grammar stems from rule (R2). Let’s suppose the following
situation: the stack contains E0 + E0 and the next input lexeme is +.
In such a case, it is impossible to determine whether we have to shift and push the +
or to reduce using (R2) both E0’s and the + in the stack. We are in the presence of
a shift-reduce conflict. This is because expression integer + integer + integer can be
produced in two ways by right-derivation.
First way:  E0 −→ E0 + E0              (R2)
               −→ E0 + integer         (R1)
               −→ E0 + E0 + integer    (R2)
               etc.
Second way: E0 −→ E0 + E0              (R2)
               −→ E0 + E0 + E0         (R2)
               −→ E0 + E0 + integer    (R1)
               etc.
The expressions obtained by each derivation may look similar from the point of view
of expression evaluation:
(integer + integer) + integer and integer + (integer + integer)
but different for building a syntax tree (see figure 6.3 on page 166).
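The two readings can be written down as values to see that they differ as trees while agreeing in value. The type tree and the function eval are introduced only for this sketch, and the integers 1, 2 and 3 are arbitrary.

```ocaml
type tree = Int of int | Add of tree * tree

(* Reducing with (R2) as soon as possible groups to the left... *)
let left_tree = Add (Add (Int 1, Int 2), Int 3)      (* (1 + 2) + 3 *)

(* ...whereas shifting the second + first groups to the right. *)
let right_tree = Add (Int 1, Add (Int 2, Int 3))     (* 1 + (2 + 3) *)

let rec eval = function
  | Int n -> n
  | Add (a, b) -> eval a + eval b
```

Both trees evaluate to the same integer because addition is associative, but structural comparison distinguishes them, which is why the parser needs a rule to choose between shifting and reducing.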
The second instance of a grammar generating a conflict between shifting and reducing
has the same type of ambiguity: an implicit parenthesizing. But contrary to the previous
case, the choice between shifting and reducing modifies the meaning of the parsed
expression. Let’s consider the following grammar:
E1 ::= integer      (R1)
     | E1 + E1      (R2)
     | E1 * E1      (R3)
We find in this grammar the above-mentioned conflict both for + and for *. But there
is an added conflict between + and *. Here again, an expression may be produced in
two ways. There are two right-hand derivations of
integer + integer * integer
First way:  E1 −→ E1 * E1              (R3)
               −→ E1 * integer         (R1)
               −→ E1 + E1 * integer    (R2)
               etc.
Second way: E1 −→ E1 + E1              (R2)
               −→ E1 + E1 * E1         (R3)
               −→ E1 + E1 * integer    (R1)
               etc.
There is now but a single way to reach the production sequence integer + integer *
integer: using rule (R1).
The third example concerns conditional instructions in programming languages. A language
such as Pascal offers two conditionals: if .. then and if .. then .. else.
Let’s imagine the following grammar:
Instr ::= if Exp then Instr                (R1)
        | if Exp then Instr else Instr     (R2)
        | etc.
and suppose that the top of the stack holds if Exp then Instr while the next input
lexeme is else.
We cannot decide whether the first elements in the stack relate to conditional (R1), in
which case it must be reduced, or to the first Instr in rule (R2), in which case it must
be shifted.
Besides shift-reduce conflicts, bottom-up parsing may also generate reduce-reduce con-
flicts.
We now introduce the ocamlyacc tool which uses the bottom-up parsing technique
and may find these conflicts.
General format The syntax description files for ocamlyacc use extension .mly by
convention and they have the following structure:
%{
header
%}
declarations
%%
rules
%%
trailer-and-end
The rule format is:
non-terminal : symbol ... symbol { semantic action }
| ...
| symbol ... symbol { semantic action }
;
A symbol is either a terminal or a non-terminal. Sections “header” and “trailer-and-end”
play the same role as in ocamllex with the only exception that the header is only
visible by the rules and not by declarations. In particular, this implies that module
openings (open) are not taken into consideration in the declaration part and the types
must therefore be fully qualified.
Semantic actions Semantic actions are pieces of Objective Caml code executed
when the parser reduces the rule they are associated with. The body of a semantic
action may reference the components of the right-hand term of the rule. These are
numbered from left to right starting with 1. The first component is referenced by $1,
the second by $2, etc.
Start Symbols We may declare several start symbols in the grammar, by writing
in the declaration section:
%start non-terminal ... non-terminal
For each of them a parsing function will be generated. We must precisely note, always
in the declaration section, the output type of these functions:
%type <output-type> non-terminal
Lexical units Grammar rules make reference to lexical units, the terminals or ter-
minal symbols in the rules.
Certain lexical units, like identifiers, represent a set of (character) strings. When we
find an identifier we may be interested in recovering its character string. We specify
in the parser that these lexemes have an associated value by enclosing the type of this
value between < and >:
in which case it is pointless to declare a symbol which represents them: they are directly
processed by the parser without passing through the lexer. In the interest of uniformity,
we do not advise this procedure.
%left PLUS
%left MULT
Two operators declared on the same line have the same precedence.
• -b name: the generated Objective Caml files are name.ml and name.mli;
• -v: create a file with extension .output containing the rule numbering, the states in
the automaton recognizing the grammar and the sources of conflicts.
Joint usage with ocamllex We may compose both tools ocamllex and ocamlyacc
so that the transformation of a character stream into a lexeme stream is the input to
the parser. To do this, type lexeme should be known to both. This type is defined in
the files with extensions .mli and .ml generated by ocamlyacc from the declaration of
the tokens in the matching file with extension .mly. The .mll file imports this type;
ocamllex translates this file into an Objective Caml function of type Lexing.lexbuf
-> lexeme. The example on page 307 illustrates this interaction and describes the
different phases of compilation.
Contextual Grammars
Parsers generated by ocamlyacc process languages produced by so-called context-free
grammars. A parser for such a grammar does not depend on previously processed
syntactic values to process the next lexeme. This is not the case of the language L
described by the following formula:
L ::= w C w, with w ∈ (A|B)*
where A, B and C are terminal symbols.
Function parse_w1 builds the memorizing function for the first w under the guise of a
list of atomic stream parsers (i.e. for a single token):
The result of the function returned by parse_w1 is simply the character string containing
the parsed lexical unit.
Function parse_w2 takes as argument a list built by parse_w1 to compose each of its
elements into a single parsing function:
# let rec parse_w2 l =
    match l with
      p :: pl -> (parser [< x = p; l = (parse_w2 pl) >] -> x^l)
    | [] -> parser [<>] -> "" ;;
val parse_w2 : ('a Stream.t -> string) list -> 'a Stream.t -> string = <fun>
The result of applying parse_w2 will be the string representing subword w. By
construction, function parse_w2 will not be able to recognize anything but the subword
visited by parse_w1.
Using the ability to name intermediate results in streams, we write the recognition
function for the words in the language L:
# let parse_L = parser [< l = parse_w1 ; 'C; r = (parse_w2 l) >] -> r ;;
val parse_L : token Stream.t -> string = <fun>
Here are two small examples. The first results in the string surrounding C, the second
fails because the words surrounding C are different:
# parse_L [< 'A; 'B; 'B; 'C; 'A; 'B; 'B >];;
- : string = "abb"
# parse_L [< 'A; 'B; 'C; 'B; 'A >];;
Uncaught exception: Stream.Error("")
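The dependence on context can also be seen in a version that does not use streams. The sketch below is a simplification rather than a transcription of parse_w1 and parse_w2: instead of building a list of parsers, it remembers the first subword as a list of tokens and compares it with what follows C. The type token and the names split_at_c, string_of_token and parse_l are assumptions made for this illustration.

```ocaml
type token = A | B | C

(* split_at_c toks returns the tokens before the first C and those after it,
   or None when toks contains no C. *)
let rec split_at_c = function
  | [] -> None
  | C :: after -> Some ([], after)
  | t :: rest ->
      (match split_at_c rest with
       | Some (before, after) -> Some (t :: before, after)
       | None -> None)

let string_of_token = function A -> "a" | B -> "b" | C -> "c"

(* parse_l toks recognizes w C w: the second subword must be equal to the
   remembered first one. It returns w as a string, or fails. *)
let parse_l toks =
  match split_at_c toks with
  | Some (w, w') when w = w' -> String.concat "" (List.map string_of_token w)
  | _ -> failwith "not a word of L"
```

What the parser accepts after C depends on what it read before C, which is exactly the information a context-free parser does not keep.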
Basic Revisited
We now want to use ocamllex and ocamlyacc to replace function parse on page 169
for Basic by some functions generated from files specifying the lexicon and syntax of
the language.
To do this, we may not re-use as-is the type of lexical units that we have defined. We
will be forced to define a more precise type which permits us to distinguish between
operators, commands and keywords.
We will also need to isolate the type declarations describing abstract syntax within
a file basic_types.mli. This will contain the declaration of type phrase and all
types needed by it.
%}
Precedence rules between operators once again take the values assigned by functions
priority_uop and priority_binop defined when first giving the grammar for our
Basic (see page 160).
%right Lneg
%left Land Lor
%left Lequal Lrel
%left Lmod
%left Lplus Lminus
%left Lmult Ldiv
%nonassoc Lop
Symbol Lop will be used to process unary minus. It is not a terminal in the grammar,
but a “pseudo non-terminal” which allows overloading of operators when two uses of
an operator should not receive the same precedence depending on context. This is the
case with the minus symbol (-). We will reconsider this point once we have specified
the rules in the grammar.
Since the start symbol is line, the function generated will return the syntax tree for
the parsed line.
%start line
%type <Basic_types.phrase> line
Grammar rules are decomposed into three non-terminals: line for a line; inst for
an instruction in the language; exp for expressions. The action associated with each
rule simply builds the corresponding abstract syntax tree.
%%
line :
Lint inst Leol { Line {num=$1; inst=$2} }
| Lcmd Leol { phrase_of_cmd $1 }
;
inst :
Lrem { Rem $1 }
| Lgoto Lint { Goto $2 }
| Lprint exp { Print $2 }
| Linput Lident { Input $2 }
| Lif exp Lthen Lint { If ($2, $4) }
| Llet Lident Lequal exp { Let ($2, $4) }
;
exp :
Lint { ExpInt $1 }
| Lident { ExpVar $1 }
| Lstring { ExpStr $1 }
| Lneg exp { ExpUnr (NOT, $2) }
| exp Lplus exp { ExpBin ($1, PLUS, $3) }
| exp Lminus exp { ExpBin ($1, MINUS, $3) }
| exp Lmult exp { ExpBin ($1, MULT, $3) }
| exp Ldiv exp { ExpBin ($1, DIV, $3) }
| exp Lmod exp { ExpBin ($1, MOD, $3) }
| exp Lequal exp { ExpBin ($1, EQUAL, $3) }
| exp Lrel exp { ExpBin ($1, (bin_op_of_rel $2), $3) }
| exp Land exp { ExpBin ($1, AND, $3) }
| exp Lor exp { ExpBin ($1, OR, $3) }
| Lminus exp %prec Lop { ExpUnr(OPPOSITE, $2) }
| Lpar exp Rpar { $2 }
;
%%
These rules do not call for particular remarks except:
exp :
...
| Lminus exp %prec Lop { ExpUnr(OPPOSITE, $2) }
It concerns the use of unary -. Keyword %prec that we find in it declares that this rule
should receive the precedence of Lop (here the highest precedence).
| '\n' { Leol }
| '!' { Lneg }
| '&' { Land }
| '|' { Lor }
| '=' { Lequal }
| '%' { Lmod }
| '+' { Lplus }
| '-' { Lminus }
| '*' { Lmult }
| '/' { Ldiv }
Note that we isolated symbol = which is used in both expressions and assignments.
Only two of these regular expressions need further remarks. The first concerns comment
lines ("REM" [^ '\n']*). This rule recognizes keyword REM followed by an arbitrary
number of characters other than '\n'. The second remark concerns character strings
('"' [^ '"']* '"') considered as sequences of characters different from " and contained
between two ".
Compiling, Linking
The compilation of the lexer and parser must be carried out in a definite order. This
is due to the mutual dependency between the declaration of lexemes. To compile our
example, we must enter the following sequence of commands:
ocamlc -c basic_types.mli
ocamlyacc basic_parser.mly
ocamllex basic_lexer.mll
ocamlc -c basic_parser.mli
ocamlc -c basic_lexer.ml
ocamlc -c basic_parser.ml
This generates files basic_lexer.cmo and basic_parser.cmo, which may be
linked into an application.
We now have at our disposal all the material needed to reimplement the application.
We suppress all types and all functions in paragraphs “lexical analysis” (on page 163)
and “parsing” (on page 165) of our Basic application; in function one_command (on
page 174), we replace expression
match parse (input_line stdin) with
with
match line lexer (Lexing.from_string ((input_line stdin)^"\n")) with
Note that we must put back at the end of the line the character '\n'
which function input_line had filtered out. This is necessary because the '\n' character
indicates the end of a command line (Leol).
Exercises
Evaluator
We will use ocamlyacc to implement an expression evaluator. The idea is to perform
the evaluation of expressions directly in the grammar rules.
We choose a (completely parenthesized) prefix arithmetic expression language with
variable arity operators. For example, expression (ADD e1 e2 .. en) is equivalent to
e1 + e2 + .. + en. Plus and times operators are right-associative and subtraction
and division are left-associative.
1. Define in file opn_parser.mly the parsing and evaluation rules for an expression.
Summary
This chapter has introduced several Objective Caml tools for lexical analysis (lexing)
and syntax analysis (parsing). We explored (in order of occurrence):
• module Str to filter regular expressions;
• module Genlex to easily build simple lexers;
• the ocamllex tool, a typed integration of the lex tool;
• the ocamlyacc tool, a typed integration of the yacc tool;
• the use of streams to build top-down parsers, including contextual parsers.
Tools ocamllex and ocamlyacc were used to define a parser for the language Basic
more easily maintained than that introduced on page 159.
To Learn More
The reference book on lexical analysis and parsing is known affectionately as the
“dragon book”, a reference to the book’s cover illustration. Its real name is Compil-
ers: principles, techniques and tools ([ASU86]). It covers all aspects of compiler design
and implementation. It explains clearly the construction of automata matching a given
context-free grammar and the techniques to minimize it. The tools lex and yacc are
described in-depth in several books, a good reference being [LMB92]. The interesting
features of ocamllex and ocamlyacc with respect to their original versions are the inte-
gration of the Objective Caml language and, above all, the ability to write typed lexers
and parsers. With regard to streams, the research report by Michel Mauny and Daniel
de Rauglaudre [MdR92] gives a good description of the operational semantics of this
extension. On the other hand, [CM98] shows how to build such an extension. For a
better integration of grammars within the Objective Caml language, or to modify the
grammars of the latter, we may also use the camlp4 tool found at:
Link: http://caml.inria.fr/camlp4/