
Context-Free Grammar (CFG)

Dr. Nadeem Akhtar


Assistant Professor
Department of Computer Science & IT
The Islamia University of Bahawalpur

PhD – Computer Science


IRISA – University of South Brittany – Bretagne - FRANCE.
Introduction
There are four important components in a grammatical
description of a language:
1. There is a finite set of symbols that form the strings
of the language being defined. We call this alphabet
the terminals, or terminal symbols.
2. There is a finite set of variables, also called sometimes
non-terminals or syntactic categories. Each variable
represents a language; i.e., a set of strings.
3. One of the variables represents the language being
defined; it is called the start symbol. Other variables
represent auxiliary classes of strings that are used to
help define the language of the start symbol.
Introduction
4. There is a finite set of productions or rules that represent the
recursive definition of a language. Each production consists of:
(a) A variable that is being (partially) defined by the production. This
variable is often called the head of the production.
(b) The production symbol →
(c) A string of zero or more terminals and variables. This string, called
the body of the production, represents one way to form strings in
the language of the variable of the head. In so doing, we leave
terminals unchanged and substitute for each variable of the body
any string that is known to be in the language of that variable.

The four components just described form a context-free grammar, or
just grammar, or CFG. We shall represent a CFG by its four
components, that is, G = (V, T, P, S), where V is the set of variables,
T the terminals, P the set of productions, and S the start symbol.
Context-Free Grammar (CFG)
A CFG has four components
1) A set of terminal symbols, sometimes referred to as “tokens”.
Terminals are the elementary symbols of the language defined by
the grammar.
2) A set of non-terminals sometimes called “syntactic variables”. Each
non-terminal represents a set of strings of terminals
In stmt → if (expr) stmt else stmt
stmt and expr are non-terminals.
3) A set of productions
Each production consists of a non-terminal, called head or left side
of the production, an arrow, and a sequence of terminals and/or
non-terminals, called the body or right side of the production.
4) A designation of one of the non-terminals as the start symbol
(head)
Formal definition of Context-Free
Grammar
A context-free grammar is a 4-tuple (V, Σ, R, S),
where
1. V is a finite set called the variables,
2. Σ is a finite set, disjoint from V, called the
terminals,
3. R is a finite set of rules, with each rule being a
variable and a string of variables and terminals,
and
4. S ∈ V is the start variable.
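A minimal sketch of this definition in Python. The class name CFG, its field names and the check helper are illustrative assumptions, not part of the slides; the sample grammar is A → 0A1 | B, B → #.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CFG:
    variables: frozenset   # V
    terminals: frozenset   # Σ
    rules: tuple           # R: pairs (head, body), body is a tuple of symbols
    start: str             # S

    def check(self) -> bool:
        """Return True if the tuple satisfies the formal definition."""
        symbols = self.variables | self.terminals
        return (
            self.variables.isdisjoint(self.terminals)        # Σ disjoint from V
            and self.start in self.variables                  # S ∈ V
            and all(head in self.variables                    # each head is a variable
                    and all(sym in symbols for sym in body)   # body over V ∪ Σ
                    for head, body in self.rules)
        )

# Sample grammar: A → 0A1 | B, B → #
G1 = CFG(
    variables=frozenset({"A", "B"}),
    terminals=frozenset({"0", "1", "#"}),
    rules=(("A", ("0", "A", "1")), ("A", ("B",)), ("B", ("#",))),
    start="A",
)
print(G1.check())
```

Storing each body as a tuple of symbols keeps multi-character symbol names possible; an empty tuple can stand for an ε body.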
Regular Languages
• Closed under Union, Concatenation and
Closure (∗)
• Recognizable by finite-state automata
• Denoted by Regular Expressions
• Generated by Regular Grammars
Context-Free Grammars
• More general productions than regular
grammars
S→w
where w is any string of terminals and non-
terminals
• What languages do these grammars generate?
Grammar 1: S → (A)
A → ε | aA | ASA
Grammar 2: S → ε | aSb
Context-free languages more
general than regular languages
• {a^n b^n | n ≥ 0} is not regular
‣ but it is context-free

• Why are they called “context-free”?


‣ Context-sensitive grammars allow more than
one symbol on the LHS of productions
– xAy → x(S)y can only be applied to the non-
terminal A when it is in the context of x and y
Context-free grammars are widely
used for programming languages

From the definition of Algol-60:


procedure_identifier::= identifier.
actual_parameter::= string_literal | expression | array_identifier | switch_identifier | procedure_identifier.
letter_string::= letter | letter_string letter.
parameter_delimiter::= "," | ")" letter_string ":" "(".
actual_parameter_list::= actual_parameter | actual_parameter_list parameter_delimiter actual_parameter.
actual_parameter_part::= empty | "(" actual_parameter_list ")".
function_designator::= procedure_identifier actual_parameter_part.
Example
adding_operator::= "+" | "−" .
multiplying_operator::= "×" | "/" | "÷" .
primary::= unsigned_number | variable | function_designator | "(" arithmetic_expression ")".
factor::= primary | factor power primary.
term::= factor | term multiplying_operator factor.
simple_arithmetic_expression::= term | adding_operator term |
simple_arithmetic_expression adding_operator term.
if_clause::= if Boolean_expression then.
arithmetic_expression::= simple_arithmetic_expression |
if_clause simple_arithmetic_expression else arithmetic_expression.

if a<0 then U+V else if a*b < 17 then U/V else if k <> y then V/U
else 0
Example derivation in a Grammar
• Grammar: start symbol is A
A → aAa
A→B
B → bB
B→ε

• Sample Derivation:
A ⇒ aAa ⇒ aaAaa ⇒ aaaAaaa ⇒ aaaBaaa ⇒ aaabBaaa
⇒ aaabbBaaa ⇒ aaabbaaa

• Language?
Derivations in Tree Form
Example CFG
• An example of a context-free grammar, which we call G1 .
A → 0A1
A→B
B→#

A grammar consists of a collection of substitution rules, called productions. Each rule appears as a line in the
grammar, comprising a symbol and a string separated by an arrow. The symbol is
called a variable.

The string consists of variables and terminals.


The variable symbols often are represented by capital letters.
The terminals are analogous to the input alphabet and often are represented by
lowercase letters, numbers, or special symbols.
One variable is designated as the start variable. It usually occurs on the left-hand side
of the topmost rule.
Grammar G1 contains three rules. G1 's variables are A and B, where A is the start
variable. Its terminals are 0, 1, and #.
Example CFG
Grammar is used to describe a language by
generating each string of that language in the
following manner.
1. Write down the start variable. It is the variable
on the left-hand side of the top rule, unless
specified otherwise.
2. Find a variable that is written down and a rule
that starts with that variable. Replace the
written down variable with the right-hand side
of that rule.
3. Repeat step 2 until no variables remain.
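The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every symbol is a single character, the leftmost variable is always the one replaced, and the caller supplies which rule to use at each step. The names derive, G1_RULES and choices are hypothetical.

```python
def derive(rules, start, choices):
    """Apply rules to the leftmost variable, one choice per step.

    rules:   dict mapping a variable to a list of bodies (strings)
    choices: for each step, the index of the body to use
    Returns the list of sentential forms, starting with the start variable.
    """
    form = start                      # Step 1: write down the start variable
    steps = [form]
    for choice in choices:
        # Step 2: find the leftmost variable that is written down ...
        pos = next(i for i, sym in enumerate(form) if sym in rules)
        # ... and replace it with the right-hand side of the chosen rule.
        body = rules[form[pos]][choice]
        form = form[:pos] + body + form[pos + 1:]
        steps.append(form)
    return steps

# A → 0A1 | B, B → #
G1_RULES = {"A": ["0A1", "B"], "B": ["#"]}
steps = derive(G1_RULES, "A", [0, 0, 0, 1, 0])
print(" => ".join(steps))
```

Choosing the leftmost variable is one possible convention; the steps on the slide allow any written-down variable to be replaced.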
Example CFG
Context-free grammar, G1 .
A → 0A1
A→B
B→#

• Grammar G1 generates the string 000#111.

The sequence of substitutions to obtain a string is
called a derivation. A derivation of string 000#111 in
grammar G1 is
A ⇒ 0A1 ⇒ 00A11 ⇒ 000A111 ⇒ 000B111 ⇒ 000#111
Parse tree for 000#111 in G1
Example CFG
All strings generated in this way constitute the
language of the grammar.
We write L(G1) for the language of grammar G1 .
Grammar G1 shows that L(G1) is {0^n#1^n | n ≥ 0}. The case n = 0 is included because A ⇒ B ⇒ # derives the string # on its own.
Any language that can be generated by some
context-free grammar is called a context-free
language (CFL).
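To make the idea of "all strings generated in this way" concrete, the following sketch enumerates every terminal string of bounded length derivable from A → 0A1 | B, B → #. It is a brute-force illustration, not an efficient algorithm, and it assumes single-character symbols and no ε bodies; the function name generate and the length bound are hypothetical.

```python
from collections import deque

def generate(rules, start, max_len):
    """Return all terminal strings of length <= max_len derivable from start.

    Assumes no rule body is empty, so sentential forms never shrink and any
    form longer than max_len can be discarded.
    """
    seen, found = {start}, set()
    queue = deque([start])
    while queue:
        form = queue.popleft()
        pos = next((i for i, s in enumerate(form) if s in rules), None)
        if pos is None:          # no variables left: a string of the language
            found.add(form)
            continue
        for body in rules[form[pos]]:
            new = form[:pos] + body + form[pos + 1:]
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return found

G1_RULES = {"A": ["0A1", "B"], "B": ["#"]}
strings = generate(G1_RULES, "A", 7)
print(sorted(strings, key=len))
```

Every string found has the form 0^n#1^n, which agrees with the description of L(G1).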
Context Free Grammar (CFG)
Example
Java if-else statement
if (expression) statement else statement
stmt → if (expr) stmt else stmt
The arrow may be read as “can have the form”.
Such a rule is called a production
Context Free Grammar (CFG)
Example
list → list + digit
list → list - digit
list → digit
digit → 0|1|2|3|4|5|6|7|8|9

list → list + digit | list – digit | digit

The terminals are + - 0 1 2 3 4 5 6 7 8 9


Context Free Grammar (CFG)
Example
Function call
call → id(optparams)
optparams → params | ε
params → params, param | param
Context Free Grammar (CFG)
• Example
• Operators on the same line have the same
associativity and precedence
left-associative: + -
left-associative: * /
Two non-terminals expr and term for the two
levels of precedence, non-terminal factor for
generating basic units in expressions
Context Free Grammar (CFG)
factor → digit | (expr)
Binary operators * and / have the highest precedence
term → term * factor
| term / factor
| factor
Similarly, expr
expr → expr + term
| expr - term
| term
The resulting grammar is therefore
expr → expr + term | expr – term | term
term → term * factor | term / factor | factor
factor → digit | (expr)
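A small evaluator shows how this grammar fixes precedence and associativity. Note the swapped-in technique: the grammar is left-recursive, which a plain recursive-descent parser cannot follow directly, so each left-recursive rule is implemented here as a loop; this keeps the left-associative grouping. The function name evaluate and its helpers are hypothetical, and digits are single characters as in the grammar.

```python
def evaluate(text):
    """Evaluate a string of single digits, + - * / and parentheses."""
    tokens = [ch for ch in text if not ch.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():                      # factor → digit | (expr)
        tok = advance()
        if tok == "(":
            value = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

    def term():                        # term → term * factor | term / factor | factor
        value = factor()
        while peek() in ("*", "/"):
            op = advance()
            rhs = factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def expr():                        # expr → expr + term | expr - term | term
        value = term()
        while peek() in ("+", "-"):
            op = advance()
            rhs = term()
            value = value + rhs if op == "+" else value - rhs
        return value

    result = expr()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return result

print(evaluate("9-5-2"), evaluate("9+5*2"), evaluate("(9+5)*2"))
```

Because term is called from inside expr, every * and / is grouped before the surrounding + or -, which is exactly what the two grammar levels express.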
Context Free Grammar (CFG)
Example:
A grammar for a subset of Java statement

stmt → id = expression;
| if(expression) stmt
| if(expression) stmt else stmt
| while(expression) stmt
| do stmt while (expression);
| {stmts}

stmts → stmts stmt
| ε
Context Free Grammar (CFG)
Example:
Grammar for statement blocks and conditional
statements:

stmt → if expr then stmt else stmt
| if expr then stmt
| begin stmtList end

stmtList → stmt; stmtList | stmt


Context Free Grammar (CFG)
• Exercise
Consider the context-free grammar
S → SS+ | SS* | a

(a) Show how the string aa+a* can be generated by
this grammar
(b) Construct a parse tree for this string
(c) What language does this grammar generate?
Justify your answer.
Parse Trees
A → XYZ
then a parse tree may have an interior node
labelled A with three children labeled X, Y, and
Z
Parse Trees
Formally, given a context-free grammar, a parse-tree according
to the grammar is a tree with the following properties:
(1) The root is labelled by the start symbol
(2) Each leaf is labelled by a terminal or by ε
(3) Each interior node is labelled by a non-terminal
(4) If A is the non-terminal labelling some interior node and
X1,X2,…,Xn are the labels of the children of that node from
left to right, then there must be a production
A → X1X2…Xn
Here, X1, X2, …, Xn each stand for a symbol that is either a terminal or
a non-terminal.
As a special case, if A → ε is a production, then a node labelled A
may have a single child labelled ε
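The four properties can be checked mechanically. The sketch below is one possible encoding, with assumed names (Node, tree_yield, is_parse_tree) and with rules given as a dict from each non-terminal to a list of bodies, each body a tuple of symbols.

```python
from dataclasses import dataclass, field

EPSILON = "ε"

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def tree_yield(node):
    """Concatenate the leaf labels from left to right, skipping ε."""
    if not node.children:
        return "" if node.label == EPSILON else node.label
    return "".join(tree_yield(child) for child in node.children)

def is_parse_tree(node, rules, start):
    """Check properties (1)-(4) for a tree against rules {head: [bodies]}."""
    def ok(n):
        if not n.children:                       # (2) leaf: terminal or ε
            return n.label not in rules
        body = tuple(c.label for c in n.children)
        return (n.label in rules                 # (3) interior: non-terminal
                and body in rules[n.label]       # (4) A → X1 X2 … Xn exists
                and all(ok(c) for c in n.children))
    return node.label == start and ok(node)      # (1) root is the start symbol

# Rules A → 0A1 | B, B → # and a tree whose yield is 0#1
RULES = {"A": [("0", "A", "1"), ("B",)], "B": [("#",)]}
tree = Node("A", [Node("0"), Node("A", [Node("B", [Node("#")])]), Node("1")])
print(tree_yield(tree), is_parse_tree(tree, RULES, "A"))
```

An ε production would be written with the body (EPSILON,), matching the special case of a single child labelled ε.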
CFG vs. Regex
• CFGs are a more powerful notation than
regexes
– Every construct that can be described by a regex
can also be described by the CFG, but not vice-
versa
– Every regular language is a context-free language,
but not vice versa.
CFG vs. Regex
Regex: (a|b)*abb
Grammar:
A0 → aA0 | bA0 | aA1
A1 → bA2
A2 → bA3
A3 → ε
Both describe the same language: the set of strings of a's and b's ending with abb
CFG vs. Regex
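• Language L = {a^n b^n | n ≥ 1} can be described by a grammar but not by a
regex
• Suppose L was defined by some regex
– We could construct a DFA with a finite number of states, say k, to accept L
– For an input beginning with more than k a's, the DFA must enter some state si twice: a path labelled a^i leads from the start state s0 to si, and a path labelled a^(j-i), with j > i, leads from si back to si
– a^i b^i is in the language, so there is a path labelled b^i from si to an accepting state f
– The path a^j b^i is therefore also possible: from s0 to si on a^i, around the loop on a^(j-i), then from si to f on b^i
– This DFA accepts both a^i b^i and a^j b^i, a contradiction because a^j b^i is not in L
• A DFA cannot count, i.e., keep track of the number of a's before it sees the b's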
NFA To CFG Conversion
• We can mechanically construct the CFG from an
NFA
• Converting the NFA for (a|b)*abb into CFG
– For each state i of the NFA, create a non-terminal Ai
– If i has a transition to j on input a, add Ai → aAj
– If i has a transition to j on input ε, add Ai → Aj
– If i is an accepting state, add Ai → ε
– If i is the start state, make Ai the start symbol of the
grammar.
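The construction rules above translate almost line by line into code. In this sketch the function name nfa_to_cfg and the data layout are hypothetical, and the transition list for the (a|b)*abb NFA (states 0 to 3, start 0, accepting 3) is an assumption chosen to match a four-state machine.

```python
def nfa_to_cfg(transitions, start, accepting):
    """Build right-linear productions from an NFA.

    transitions: iterable of (state, symbol, next_state); symbol "" means ε
    Returns (start_symbol, productions) with productions as (head, body)
    pairs; an empty body stands for ε.
    """
    def name(state):
        return f"A{state}"           # one non-terminal Ai per state i

    productions = []
    for state, symbol, nxt in transitions:
        # Ai → a Aj for an input transition, Ai → Aj for an ε-transition
        productions.append((name(state), symbol + name(nxt)))
    for state in sorted(accepting):
        productions.append((name(state), ""))    # Ai → ε
    return name(start), productions

# Assumed NFA for (a|b)*abb
NFA = [(0, "a", 0), (0, "b", 0), (0, "a", 1), (1, "b", 2), (2, "b", 3)]
start_symbol, productions = nfa_to_cfg(NFA, start=0, accepting={3})
for head, body in productions:
    print(head, "→", body or "ε")
```

Every production has at most one non-terminal, and it sits at the right end of the body, so the result is a regular (right-linear) grammar.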
BNF: Meta-Syntax for CFGs
• <postal-address> ::= <name-part> <street-address>
<zip-part>
• <name-part> ::= <personal-part> <last-name>
<opt-jr-part> <EOL>
| <personal-part> <name-part>
• <personal-part> ::= <first-name> | <initial> "."
• <street-address> ::= <house-num> <street-name>
<opt-apt-num> <EOL>
• <zip-part> ::= <town-name> "," <state-code>
<ZIP-code> <EOL>
• <opt-jr-part> ::= "Sr." | "Jr." | <roman-numeral> | ""
Left-Most Derivation Parse Tree

Right-Most Derivation Parse Tree

Ambiguous Grammar
• A grammar can have more than one parse tree
generating a given string of terminals. Such a
grammar is said to be ambiguous.

• Equivalently, a grammar is ambiguous if some terminal string is the
yield of more than one parse tree.
Ambiguous Grammar
If a grammar generates the same string in several
different ways, we say that the string is derived
ambiguously in that grammar. If a grammar
generates some string ambiguously we say that
the grammar is ambiguous.

A string w is derived ambiguously in context-free
grammar G if it has two or more different leftmost
derivations. Grammar G is ambiguous if it
generates some string ambiguously.
Ambiguous Grammar
Consider grammar G2:
<EXPR> → <EXPR> + <EXPR>
| <EXPR> × <EXPR>
| ( <EXPR> ) | a

This grammar generates the string a+a×a ambiguously.
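One way to see the ambiguity is to count parse trees. The sketch below is specific to the grammar E → E+E | E×E | (E) | a, with E standing for <EXPR> and the multiplication sign written as the character × (an assumption about notation). The function name count_parse_trees is hypothetical.

```python
from functools import lru_cache

def count_parse_trees(s):
    """Count parse trees of s under E → E+E | E×E | (E) | a."""
    @lru_cache(maxsize=None)
    def count(i, j):                  # number of trees deriving s[i:j] from E
        if j - i == 1:
            return 1 if s[i] == "a" else 0        # E → a
        total = 0
        if j - i >= 3 and s[i] == "(" and s[j - 1] == ")":
            total += count(i + 1, j - 1)          # E → (E)
        for k in range(i + 1, j - 1):             # root operator at position k
            if s[k] in "+×":
                total += count(i, k) * count(k + 1, j)
        return total
    return count(0, len(s))

print(count_parse_trees("a+a×a"), count_parse_trees("(a+a)×a"))
```

A count above 1 means the string is derived ambiguously; adding parentheses to the input removes the choice of root operator and brings the count back to 1.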
Ambiguous Grammar
The two parse trees for the string a+a×a in
grammar G2
Ambiguity
• Example
String → String + String | String – String
|0|1|2|3|4|5|6|7|8|9
Ambiguity
E → E + E | E ∗ E | ( E ) | id
String id * id + id has the following two parse trees

Enforces precedence of * over + Doesn’t enforce this precedence


Dealing with Ambiguity
• The most direct way is to re-write the
grammar unambiguously
E → E + E | E ∗ E | ( E ) | id

E → E' + E | E'
E' → id * E' | id | ( E ) * E' | ( E )
Enforces precedence of * over +
Example
E → E' + E | E'
E' → id * E' | id | ( E ) * E' | ( E )
[Figure: parse trees for id + id * id and for id * id + id under this grammar]
Example
Another Ambiguous Grammar

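• S → x
• S → y
• S → z
• S → S + S
• S → S - S
• S → S * S
• S → S / S
• S → (S)
Generates two parse trees for x + y * z

Rewrite it as:
T → x
T → y
T → z
T → (S)
S → S + T
S → S - T
S → S * T
S → S / T
S → T
The rewritten grammar removes the ambiguity by making every operator left-associative. All four operators still share one precedence level, so x + y * z is parsed as (x + y) * z; giving * precedence over + requires a separate level for * and /.

TRY DIFFERENT INPUTS AT HOME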


Ambiguity – The Dangling Else
E → if E then E
| if E then E else E
| OTHER

• This is an ambiguous grammar

Dangling Else
E → if E then E
| if E then E else E | OTHER
• The expression
if E1 then if E2 then E3 else E4
has two parse trees
[Figure: two parse trees, one attaching the else to the outer if, the other attaching it to the inner if]
With which 'THEN' should the 'ELSE' be associated?
Dangling Else

[Figure: the parse tree that attaches the else to the inner if]
Typically we want this parse tree
'Else' matches the closest unmatched 'Then'
Dangling Else
E → matchedIF // all THEN are matched
| unmatchedIF // some THEN is unmatched
matchedIF → if E then matchedIF else matchedIF
| OTHER
unmatchedIF → if E then E
| if E then matchedIF else unmatchedIF
Dangling Else
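• Consider again the expression
if E1 then if E2 then E3 else E4
[Figure: two candidate parse trees. Left: if E then E, where the inner E is if E then MIF else MIF. Right: if E then MIF else MIF, where the then-branch would have to be if E then E.]
Left: a valid parse tree for an unmatchedIF
Right: not valid because the THEN branch is not a matchedIF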
Ambiguity
There are no general techniques for handling
ambiguity
It is impossible to automatically convert an
ambiguous grammar into an unambiguous one
If used sensibly, ambiguity can simplify the
grammar
• Disambiguation Rules: Instead of re-writing the
grammar, we can
– Use the ambiguous grammar
– Along with disambiguation rules.
Disambiguation Rules
• Precedence and Associativity Declarations
• %left: all tokens following this declaration are
left-associative
• %right: all tokens following this declaration are
right-associative
• Precedence is established by the order of the
%left and %right declarations
• %left '+' '-'
• %right '*' '/'
– '*' is declared after '+', so it has a higher precedence than '+', and '1+2*3' would
be evaluated as '1+(2*3)'
• %nonassoc: the specified operators are non-associative and may not be
chained, e.g., %nonassoc '>' '<'.
Associativity Example
E → E + E | int

Input int + int + int

%left +
Precedence Example
E → E + E | E * E | int

Input int + int * int

%left +
%left *
Associativity of Operators
• The operator + associates to the left
An operand with + signs on both sides of it
belongs to the operator to its left.

In most programming languages the four
arithmetic operators, addition, subtraction,
multiplication, and division are left
associative.
Right Associative Operator
• The operator = associates to the right
right → letter = right | letter
letter → a | b |…| z

Parse tree for 9 - 5 - 2 grows down towards the
left, whereas parse tree for a=b=c grows
down towards the right
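The two tree shapes can be reproduced with nested tuples. This is only an illustration: the function names parse_left and parse_right are hypothetical, tokens are single characters, and the input is assumed to be well formed (operand, operator, operand, and so on).

```python
def parse_left(tokens):
    """Left-associative grouping: the tree grows down towards the left."""
    tree = tokens[0]
    for i in range(1, len(tokens), 2):
        # Everything parsed so far becomes the left operand of the next operator.
        tree = (tree, tokens[i], tokens[i + 1])
    return tree

def parse_right(tokens):
    """right → letter = right | letter: the tree grows down towards the right."""
    if len(tokens) == 1:
        return tokens[0]
    # The rest of the input becomes the right operand of the first operator.
    return (tokens[0], tokens[1], parse_right(tokens[2:]))

print(parse_left(list("9-5-2")))
print(parse_right(list("a=b=c")))
```

Reading the tuples inside out gives (9 - 5) - 2 for the left-associative case and a = (b = c) for the right-associative case.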
Precedence of Operators
Associativity rules for + and * apply to
occurrences of the same operator
Rule
* has higher precedence than + if * takes its
operands before + does
• Multiplication and division have higher
precedence than addition and subtraction.
• 9 + 5 * 2 and 9 * 5 + 2 are equivalent to 9 + (5 * 2)
and (9 * 5) + 2
Example
Grammar that defines simple arithmetic expressions
In this grammar the terminal symbols are id + - * / ( )
The non-terminal symbols are expression, term and factor; and
expression is the start symbol

expression → expression + term
expression → expression - term
expression → term
term → term * factor
term → term / factor
term → factor
factor → (expression)
factor → id
Example
The above grammar can be written concisely
as:

E → E + T | E – T | T
T → T * F | T / F | F
F → (E) | id
Derivations Using a Grammar
We apply the productions of a CFG to infer that certain
strings are in the language of a certain variable.
There are two approaches to this inference.
The more conventional approach is to use the rules
from body to head. That is, we take strings known to
be in the language of each of the variables of the
body, concatenate them, in the proper order, with
any terminals appearing in the body, and infer that
the resulting string is in the language of the variable
in the head. This procedure is called Recursive
inference.
Derivations Using a Grammar
• There is another approach to defining the
language of a grammar, in which we use the
productions from head to body. We expand the
start symbol using one of its productions (i.e.,
using a production whose head is the start
symbol). We further expand the resulting
string by replacing one of the variables by the
body of one of its productions, and so on, until
we derive a string consisting entirely of
terminals. The language of the grammar is all
strings of terminals that we can obtain in this
way. This use of grammars is called derivation.
Example
Let us explore a more complex CFG that
represents (a simplification of) expressions in
a typical programming language. First we shall
limit ourselves to the operators + and *,
representing addition and multiplication. We
shall allow arguments to be identifiers, but
instead of allowing the full set of typical
identifiers (letters followed by zero or more
letters and digits), we shall allow only the
letters a and b and the digits 0 and 1. Every
identifier must begin with a or b, which may be
followed by any string in {a, b, 0, 1}*.
Example
We need two variables in this grammar. One, which we
call E, represents expressions. It is the start symbol
and represents the language of expressions we are
defining. The other variable, I, represents identifiers.
Its language is actually regular; it is the language of
the regular expression
(a | b)(a | b | 0 | 1)*
However, we shall not use regular expressions directly
in grammars. Rather we use a set of productions
that say essentially the same thing as this regular
expression.
Example
A context-free grammar for simple expressions

1. E → I
2. E → E+E
3. E → E*E
4. E → (E)

5. I → a
6. I → b
7. I → Ia
8. I → Ib
9. I → I0
10. I → I1
Example
The grammar for expressions is stated
formally as G = ({E, I}, T, P, E), where T is the
set of symbols {+, *, (,), a, b, 0, 1} and P is the
set of productions shown above.
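Productions 5 to 10 are meant to say the same thing as the regular expression (a | b)(a | b | 0 | 1)*. The sketch below compares the two descriptions by brute force on short strings; it is a sanity check rather than a proof, and the names IDENT_RE, derives_identifier and agree are hypothetical.

```python
import re
from itertools import product

IDENT_RE = re.compile(r"(a|b)(a|b|0|1)*")

def derives_identifier(s):
    """Decide membership in L(I) using productions 5-10 directly."""
    if s in ("a", "b"):                       # I → a | b
        return True
    # I → Ia | Ib | I0 | I1: strip the last symbol and recurse on the rest
    return len(s) > 1 and s[-1] in "ab01" and derives_identifier(s[:-1])

# Compare both descriptions on every string over {a, b, 0, 1} up to length 4.
agree = all(
    derives_identifier("".join(chars)) == bool(IDENT_RE.fullmatch("".join(chars)))
    for n in range(5)
    for chars in product("ab01", repeat=n)
)
print(agree)
```

The productions build an identifier from the left end outward, which is why the check peels symbols off the right end.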
Tree Traversal
• Tree traversals are used for describing
attribute evaluation and for specifying the
execution of code fragments in a translation
scheme.
• A traversal of a tree starts at the root and
visits each node of the tree in some order.
Depth-First Traversal
• Depth-first traversal starts at the root and
recursively visits the children of each node in
any order, not necessarily from left to right. It
is called "depth-first" because it visits an
unvisited child of a node whenever it can, so it
visits nodes as far away from the root (as
"deep") as quickly as it can.
Depth-First Traversal
• The procedure visit(N) in Fig. is a depth-first
traversal that visits the children of a node in
left-to-right order, as shown in Fig.
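The figure is not reproduced in this text. As a hedged sketch of what such a procedure usually looks like, the code below visits the children of a node from left to right and then acts on the node itself (a postorder visit). The tree encoding as (label, children) pairs and the action callback are assumptions made for this illustration.

```python
def visit(node, action):
    """Depth-first traversal: visit children left to right, then the node."""
    label, children = node
    for child in children:
        visit(child, action)
    action(label)          # runs after all children, i.e. in postorder

# Parse tree for 0#1 under A → 0A1 | B, B → #, written as (label, children).
tree = ("A", [("0", []), ("A", [("B", [("#", [])])]), ("1", [])])
order = []
visit(tree, order.append)
print(order)
```

Acting on a node only after its children have been visited is the order needed when a value at a node depends on values computed at its children.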
Depth-First Traversal
Depth-First Traversal
• Synthesized attributes can be evaluated
during any bottom-up traversal, that is, a
traversal that evaluates attributes at a node
after having evaluated attributes at its
children.
• In general, with both synthesized and
inherited attributes, the matter of evaluation
order is quite complex
Questions
• CFG representing the Regular Expression a+
A → a | aA

• CFG representing the Regular Expression b*
B → ε
B → bB
Questions
• CFG representing the Regular Expression a*b+
(i.e. start with any number of a's followed by a non-zero number of b's)
S → R | aS
R → b | bR

• A CFG representing the Regular Expression ab+a
(i.e. start with a, followed by a non-zero number of b's, and end with a)
S → aRa
R → b | bR
Questions
Every construct described by a Regular Expression can also
be described by a CFG. Consider the following regular
expression:

(a|b)*abb where Σ = {a, b}


Create an equivalent CFG of the above regular expression

A0 → aA0 | bA0 | aA1
A1 → bA2
A2 → bA3
A3 → ε
