

Curriculum

  • Context Free Grammars (CFG), Concept of parsing, Parsing Techniques, Top-Down Parsers: Introduction, Predictive Parsing - Removal of left recursion, Removal of left factoring, Recursive Descent Parsing, Predictive LL(k) Parsing Using Tables, Bottom-Up Parsing: Introduction, Shift-Reduce Parsing Using the ACTION/GOTO Tables, Table Construction, SLR(1), LR(1), and LALR(1) Grammars, Introduction to YACC Tool & YACC file specification, Error detection and recovery in YACC.

In Short
  • LL(1) - Matched | Stack | Input | Action
  • LR(0)... - Stack | Input | Action - Remember to create the augmented grammar: \(S' \rightarrow .S\)

Role of the Parser:

  • Takes tokens from the lexical analyzer as input and verifies that the string of token names can be generated by the grammar of the source language.
  • Conceptually, constructs a parse tree and passes it ahead for well-formed programs.
  • Practically, the parse tree need not be constructed explicitly, since checking and translation actions can be interspersed with parsing.
  • The parser is part of the front end of the compiler.

Three General Types of Parsers:

  1. Universal
    • Cocke-Younger-Kasami algorithm and Earley's algorithm can parse any grammar.
    • However, they are too inefficient to use in production compilers.
  2. Top-down
    • Build parse trees from the top(root) to the bottom(leaves).
  3. Bottom-up
    • Build parse trees from the bottom(leaves) to the top(root) .
  • In both top-down and bottom-up methods, the input is scanned from left to right, one symbol at a time.

Programming Errors

  • Lexical Error
    • Misspelled identifiers, keywords, operators, etc.
  • Syntactic Error
    • misplaced semicolons, extra/missing braces.
    • Detection is very quick: the parser can detect an error as soon as it sees a prefix of the input that cannot be completed to form a string of the language.
  • Semantic Error
    • Type mismatch between operator and operand
  • Logical Error
    • Incorrect reasoning, misled logic, etc.
    • Not caught by parser, lexical analyzer or compiler in general

Error Recovery Strategies

  • Panic Mode Recovery
    • Parser discards input symbols until a token from a selected set of synchronising tokens is found.
    • Synchronising tokens are usually delimiters, such as semicolon or }, whose role in the source program is clear and unambiguous.
    • Skips a considerable amount of input without checking it for additional errors.
    • Guaranteed to not go in an infinite loop!
  • Phrase Level Recovery
    • Parser may perform local correction on remaining input.
      • Replace prefix of remaining input such that it can continue parsing.
    • Typical local corrections:
      • Replace comma with semicolon
      • Remove extra semicolons
      • Insert missing semicolon
    • Used in several error-repairing compilers, as it can correct any input string.
    • Drawback
      • Difficulty in coping with situations where the actual error occurred before the point of detection.
      • A missing parenthesis or brace might affect a whole block of code (e.g. a missing closing brace in an if clause), but phrase-level correction only repairs the point of detection, not the block-wide structure.
      • if (x > 10
        y = x + 5;
        z = x - 3;
      • The above code block has an error in the condition, but it will only be detected at the first semicolon.
      • When the recovery mechanism tries to fix this, it might insert the missing parenthesis to get if (x > 10), but the repaired if will then cover only the y = x + 5 line and not the z = x - 3 line.
      • If the error had other cascading effects, fixing the immediate phrase won't always resolve the subsequent errors caused by the original mistake.
  • Error Productions
    • Productions for common mistakes to generate appropriate error diagnostics.
  • Global Correction
    • There are algorithms for choosing a minimal sequence of changes to obtain a globally least-cost correction.
    • Given an input string \(x\) and a grammar \(G\), these algorithms find a parse tree for a related string \(y\) such that the number of insertions, deletions, and substitutions required to transform \(x\) into \(y\) is as small as possible.
    • Very Costly to implement in terms of time and space
    • Also, a closest correct program may not be what the programmer had in mind.

CFG

  • consists of terminals, nonterminals, a start symbol, and productions
  • CFG in Brief
  • CFG Notation Conventions
    • Terminals
      • lowercase letters early in the alphabet, such as \(a,\ b,\ c\)
      • operators symbols
      • punctuation symbols
      • Digits 0...9
      • bold strings -> if,id (keywords that represent a single terminal symbol)
    • Non Terminals
      • Upper case letters \(A,\ B,\ C\)
      • Lowercase/italic names -> \(expr,\ term,\ stmt\)
    • The above conventions are guidelines and are not strictly followed.

Derivations, Ambiguity

  • Derivations
    • bottom-up parsing is related to a class of derivations known as "rightmost" derivations
    • Suppose '\(A \rightarrow \gamma\)' and we have, \(\alpha A \beta \Rightarrow \alpha \gamma \beta\) .
      • The symbol \(\Rightarrow\) means "derives in one step"
      • The symbol \(\overset{*}{\Rightarrow}\) means "derives in 0 or more steps"
      • The symbol \(\overset{+}{\Rightarrow}\) means "derives in 1 or more steps"
  • Language Generated by Grammar
    • The language generated by a grammar \(L(G)\) is the set of all sentences that can be derived from the start symbol.
    • Language generated by a grammar is said to be a context-free language.
  • Parse trees and derivations
    • Leaves of a parse tree are labelled by terminals or non-terminals and are read from left to right.
    • This reading, done at any step during parsing, constitutes a sentential form[^0] called the yield or frontier.
  • Ambiguity
    • Grammar that produces more than one parse tree for some sentence is said to be ambiguous
    • More than one leftmost derivation \((\underset{lm}{\Rightarrow})\) or more than one rightmost derivation \((\underset{rm}{\Rightarrow})\) present for the same sentence.
    • Ambiguity confuses parsers, so it is desirable to make the grammar unambiguous.
  • Grammars are more powerful than Regular Expressions
    • Every Regular expression construct can be described by a grammar, but not vice versa.
    • Every Regular language is a context free language, but not vice versa.
    • Regex to Grammar (via the NFA that recognises the regex)
      1. For each NFA state \(i\), create a non-terminal \(A_i\)
      2. If
        1. state \(i\) has transition to state \(j\) on input \(a\), add production \(A_i \rightarrow aA_j\)
        2. state \(i\) goes to state \(j\) on input \(\epsilon\), add production \(A_i \rightarrow A_j\)
      3. If \(i\) is an accepting state, add \(A_i \rightarrow \epsilon\)
      4. If \(i\) is start state, make \(A_i\) the start symbol of the grammar.
  • A grammar can count two items but not three.
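The four NFA-to-grammar steps above can be sketched in Python. This is a minimal illustration under my own encoding choices (a transition dict keyed by (state, symbol), with '' standing for an ε-move); the example NFA, recognising strings over {a, b} that end in ab, is also my own.

```python
def nfa_to_grammar(transitions, accepting, start):
    """transitions: dict (state, symbol) -> list of target states; '' is an epsilon move."""
    productions = []
    for (i, sym), targets in sorted(transitions.items()):
        for j in targets:
            if sym == '':                              # step 2.2: A_i -> A_j
                productions.append((f"A{i}", f"A{j}"))
            else:                                      # step 2.1: A_i -> a A_j
                productions.append((f"A{i}", f"{sym}A{j}"))
    for i in accepting:                                # step 3: A_i -> epsilon
        productions.append((f"A{i}", "ε"))
    return f"A{start}", productions                    # step 4: A_start is start symbol

# NFA for strings ending in "ab": 0 loops on a/b, 0 --a--> 1, 1 --b--> 2 (accepting)
trans = {(0, 'a'): [0, 1], (0, 'b'): [0], (1, 'b'): [2]}
start, prods = nfa_to_grammar(trans, accepting={2}, start=0)
for head, body in prods:
    print(f"{head} -> {body}")
```

The resulting right-linear grammar generates exactly the language of the NFA, which is why every regular language is context-free.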

Ambiguity Elimination

  • Eliminating Ambiguity
    • // todo #todo

Elimination of Left Recursion

  • Convert left-recursive productions from \(A\rightarrow\ A\alpha\ |\ \beta\) to
    • \(A \rightarrow\ \beta A^{'}\)
    • \(A^{'}\rightarrow\ \alpha A^{'}\ |\ \epsilon\)
  • Algorithm to Eliminate Left Recursion:
    arrange the nonterminals in some order A1, A2, ..., An
        for ( each i from 1 to n ) {
            for ( each j from 1 to i - 1 ) {
                replace each production of the form Ai -> Aj γ by
                Ai -> δ1 γ | δ2 γ | ... | δk γ,
                where Aj -> δ1 | δ2 | ... | δk are the current Aj-productions
            }
            eliminate the immediate left recursion among the Ai-productions
            (using the A -> βA', A' -> αA' | ε transformation above)
        }
    

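The immediate-recursion step, i.e. the \(A\rightarrow \beta A'\), \(A'\rightarrow \alpha A'\ |\ \epsilon\) rewrite from the formula above, can be sketched in Python. The grammar encoding (dict of head to a list of bodies, each body a tuple of symbols) is my own illustrative choice.

```python
def eliminate_immediate_left_recursion(head, bodies):
    """Rewrite A -> Aα1 | ... | Aαm | β1 | ... | βn into
       A -> β1 A' | ... | βn A'  and  A' -> α1 A' | ... | αm A' | ε."""
    recursive = [b[1:] for b in bodies if b and b[0] == head]   # the α parts
    rest      = [b for b in bodies if not b or b[0] != head]    # the β parts
    if not recursive:
        return {head: bodies}                  # no immediate left recursion
    new = head + "'"
    return {
        head: [b + (new,) for b in rest],                  # A  -> β A'
        new:  [a + (new,) for a in recursive] + [("ε",)],  # A' -> α A' | ε
    }

# Example: E -> E + T | T  becomes  E -> T E'  and  E' -> + T E' | ε
print(eliminate_immediate_left_recursion("E", [("E", "+", "T"), ("T",)]))
```

A full implementation would wrap this inside the two nested loops of the algorithm above, substituting Aj-bodies into Ai-productions first.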

Left Factoring

  • We defer the decision by expanding \(A\) to \(\alpha A^{'}\)
  • Example,
    • If \(A \rightarrow\ \alpha \beta_1 \ | \ \alpha \beta_2\), we would have to choose between \(\beta_1\) and \(\beta_2\) at the current state itself.
    • We want to delay this decision, as we might not be able to tell which production to expand at the current moment, so we defer.
    • How? \(A \rightarrow \alpha A^{'}\) and then \(A^{'} \rightarrow \ \beta_1 \ | \ \beta_2\)
    • Here we move the decision into a completely new non-terminal; by the time \(A^{'}\) is expanded, the input for \(\alpha\) has already been processed.
  • Algorithm
    • For each non-terminal A,
      • find the longest prefix \(\alpha\) common to two or more of its alternatives.
      • Find all of the A-productions, \(A \rightarrow \alpha \beta_1\ | \alpha \beta_2 \ | ... \ | \ \alpha \beta_n \ | \gamma\) , where \(\gamma\) represents the remaining productions that do not share the common prefix \(\alpha\), and replace with:
        • \(A \rightarrow \alpha A^{'}\ | \gamma\)
        • \(A^{'} \rightarrow \beta_1 \ | \beta_2 \ | ... | \beta_n\)
    • Repeat the above step till no non-terminal's productions share a common prefix.
  • Need for this? We sometimes need to see more input before deciding whether to expand to \(\beta_1\) or \(\beta_2\). At such times deferring helps!
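One round of the left-factoring step above can be sketched in Python; the grammar representation (bodies as tuples of symbols) is my own choice, and the worked example is the classic dangling-else grammar \(S \rightarrow iEtS\ |\ iEtSeS\ |\ a\).

```python
def common_prefix(a, b):
    """Longest common prefix of two symbol tuples."""
    p = []
    for x, y in zip(a, b):
        if x != y:
            break
        p.append(x)
    return tuple(p)

def left_factor_once(head, bodies):
    """Factor the longest prefix shared by two or more alternatives of one non-terminal."""
    best = ()
    for i in range(len(bodies)):
        for j in range(i + 1, len(bodies)):
            p = common_prefix(bodies[i], bodies[j])
            if len(p) > len(best):
                best = p
    if not best:
        return {head: bodies}                              # nothing to factor
    new = head + "'"
    factored = [b[len(best):] or ("ε",) for b in bodies if b[:len(best)] == best]
    gamma    = [b for b in bodies if b[:len(best)] != best]
    return {head: [best + (new,)] + gamma,                 # A  -> α A' | γ
            new:  factored}                                # A' -> β1 | β2 | ...

# S -> i E t S | i E t S e S | a  becomes  S -> i E t S S' | a,  S' -> ε | e S
print(left_factor_once("S", [("i","E","t","S"), ("i","E","t","S","e","S"), ("a",)]))
```

As the notes say, this step would be repeated until no alternatives of any non-terminal share a common prefix.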

Top Down Parsing

  • Key problem \(\rightarrow\) determining the production to be applied for a non-terminal
  • Parsers that backtrack (general recursive descent) are not seen frequently, as they are too costly and not practical; practical recursive-descent parsers are written to be predictive.

FIRST and FOLLOW

  • \(FIRST(\alpha)\) is set of terminals that occur at the beginning of the strings derived from \(\alpha\)
  • Finding \(FIRST(X)\)

    • If X is a terminal, \(FIRST(X) = \{X\}\)
    • If X is a non-terminal and \(X \rightarrow Y_1Y_2...Y_k\) is a production, then:
      • Add everything in \(FIRST(Y_1)\) except \(\epsilon\) to \(FIRST(X)\); if \(\epsilon\) is in \(FIRST(Y_1)\), also add \(FIRST(Y_2)\) except \(\epsilon\), and so on.
      • If \(\epsilon\) is in \(FIRST(Y_i)\) for all \(i\) from 1 to \(k\), add \(\epsilon\) to \(FIRST(X)\)
    • If \(X \rightarrow \epsilon\) is a production, add \(\epsilon\) to \(FIRST(X)\)
  • \(FOLLOW(A)\) set of terminals that occur immediately to the right of A, in some sentential form.

    • If A is the rightmost symbol in some sentential form, then \(\$\) is in the \(FOLLOW(A)\) set. (Remember \(\$\) marks end of input)
  • Finding \(FOLLOW(X)\)
    • Place \(\$\) in \(FOLLOW(S)\) where S is the start symbol
      • why? Because after the string derived from the start symbol, there is ultimately always the end of input
    • Production \(A \rightarrow \alpha B \beta\), then everything in \(FIRST(\beta)\) except \(\epsilon\) is in \(FOLLOW(B)\)
    • Production \(A \rightarrow \alpha B\) or \(A \rightarrow \alpha B \beta\) where \(FIRST(\beta)\) contains \(\epsilon\), then everything in \(FOLLOW(A)\) is in \(FOLLOW(B)\)
      • Since A ends with B here, whatever can follow A can also follow B.
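The FIRST and FOLLOW rules above can be computed as a fixed point: keep applying the rules until no set grows. A sketch in Python, using the standard expression grammar as the worked example (the grammar encoding, with "ε" as an explicit symbol, is my own choice):

```python
EPS = "ε"
G = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), (EPS,)],
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), (EPS,)],
    "F":  [("(", "E", ")"), ("id",)],
}

def first_of_seq(seq, FIRST):
    """FIRST of a string of grammar symbols."""
    out = set()
    for sym in seq:
        f = FIRST.get(sym, {sym})      # a terminal is its own FIRST
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)                       # every symbol in seq can vanish
    return out

def compute_first(G):
    FIRST = {A: set() for A in G}
    changed = True
    while changed:
        changed = False
        for A, bodies in G.items():
            for body in bodies:
                f = first_of_seq(body, FIRST)
                if not f <= FIRST[A]:
                    FIRST[A] |= f
                    changed = True
    return FIRST

def compute_follow(G, start, FIRST):
    FOLLOW = {A: set() for A in G}
    FOLLOW[start].add("$")             # $ goes into FOLLOW(start symbol)
    changed = True
    while changed:
        changed = False
        for A, bodies in G.items():
            for body in bodies:
                for i, B in enumerate(body):
                    if B not in G:
                        continue       # FOLLOW is defined for non-terminals only
                    tail = first_of_seq(body[i + 1:], FIRST)
                    add = tail - {EPS}                 # FIRST(β) minus ε
                    if EPS in tail:
                        add |= FOLLOW[A]               # β nullable: add FOLLOW(A)
                    if not add <= FOLLOW[B]:
                        FOLLOW[B] |= add
                        changed = True
    return FOLLOW

FIRST = compute_first(G)
FOLLOW = compute_follow(G, "E", FIRST)
print(FIRST)
print(FOLLOW)
```

For this grammar the fixed point gives the textbook sets, e.g. \(FIRST(E) = \{(, id\}\) and \(FOLLOW(F) = \{+, *, ), \$\}\).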

LL(1) Grammars
  • LL1?
    • first L stands for Left to right scanning
    • Second L for producing a leftmost derivation
    • 1 for using one input symbol of lookahead at each step
  • No left-recursive or ambiguous grammar can be LL(1)
  • A grammar G is LL(1) iff, whenever \(A \rightarrow \alpha\ |\ \beta\) are two distinct productions:
    • for no terminal \(a\) do both \(\alpha\) and \(\beta\) derive strings beginning with \(a\).
    • At most one of \(\alpha\) and \(\beta\) can derive empty string
    • If \(\beta \overset{*}{\Rightarrow} \epsilon\), then \(\alpha\) does not derive any string beginning with a terminal in \(FOLLOW(A)\) and vice versa!
      • The rule says that if A can turn into ε (like A→β), then other ways to replace A (like A→α) should not produce any strings that start with terminals found in the FOLLOW set of A.
      • Why? It helps avoid confusion while parsing
  • LL1 Parsers - Predictive Parsers
    • Predicts which production rule to use, by looking up ahead upto 1 symbol using FIRST and FOLLOW functions
    • Production \(A \rightarrow \alpha\) is chosen if the next input symbol a is in \(FIRST(\alpha)\).
    • Only complication: when \(\alpha = \epsilon\) or \(\alpha \overset{*}{\Rightarrow}\epsilon\)
      • In that case we should also choose \(A\rightarrow \alpha\) if the current input symbol is in \(FOLLOW(A)\), or if \(\$\) has been reached and \(\$\) is in \(FOLLOW(A)\)
  • Predictive Parser Table Generation Method
    • For each production \(A \rightarrow \alpha\) do:
      • For each terminal \(a\) in \(FIRST(\alpha)\)
        • add \(A \rightarrow \alpha\) to M[A, \(a\)]
      • If \(\epsilon\) is in \(FIRST(\alpha)\), then
        • for each terminal \(b\) in \(FOLLOW(A)\), add \(A \rightarrow \alpha\) to M[A, \(b\)]
        • if \(\$\) is in \(FOLLOW(A)\), add \(A \rightarrow \alpha\) to M[A, \(\$\)]
    • An empty entry in the table is used to represent error
    • In Short
      • Calculate First & Follow of the Grammar
      • Columns are terminals (plus \(\$\)), rows are non-terminals
      • For every production, for each terminal in the FIRST of its body, except \(\epsilon\)
        • add to table[NonTerminal, that terminal] = the respective production
      • For \(\epsilon\) cases
        • When a production produces \(\epsilon\), we place the production under every terminal in the FOLLOW of the head (LHS) of the production
      • REMEMBER,
        • all \(\epsilon\) productions are placed under FOLLOW sets
        • Remaining productions are placed under the FIRSTs.
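The table-filling rules above can be shown on a toy grammar of my own choosing, \(S \rightarrow aA\), \(A \rightarrow bA\ |\ \epsilon\), with its FIRST and FOLLOW sets written out by hand rather than computed:

```python
EPS = "ε"
G = {"S": [("a", "A")], "A": [("b", "A"), (EPS,)]}
# FIRST of each production body and FOLLOW of each head, by hand:
FIRST_BODY = {("a", "A"): {"a"}, ("b", "A"): {"b"}, (EPS,): {EPS}}
FOLLOW = {"S": {"$"}, "A": {"$"}}

M = {}
for A, bodies in G.items():
    for body in bodies:
        for a in FIRST_BODY[body] - {EPS}:   # non-ε firsts go under FIRST
            M[(A, a)] = body
        if EPS in FIRST_BODY[body]:          # ε-productions go under FOLLOW
            for b in FOLLOW[A]:
                M[(A, b)] = body

for (A, a), body in sorted(M.items()):
    print(f"M[{A}, {a}] = {A} -> {' '.join(body)}")
```

Every (non-terminal, terminal) pair that gets no entry is an error cell, exactly as the notes describe.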

Non Recursive Predictive Parsing

  • Table driven parser has four components
    • Input buffer (string to be parsed followed by $)
    • Stack containing sequence of grammar symbols
    • Parsing table
    • Output stream
  • Parser is controlled by a program, that considers X (the symbol on top of stack) and a (the current input symbol)
  • Algorithm
    • Initial state
      • \(w\$\) in input buffer
      • start symbol \(S\) of Grammar on top of stack, above \(\$\)
    • \(ip \rightarrow\) first symbol of \(w\$\); \(X \rightarrow\) top of stack symbol
          while X != $:
              if (X == a):
                  pop(); ip++;                       // matched a terminal, advance input
              else if (X is a terminal):
                  error();                           // terminal on stack does not match input
              else if (M[X,a] is an error entry):
                  error();
              else if (M[X,a] = X -> Y1Y2..Yk):
                  output the production;
                  pop();
                  push Yk, Yk-1, ... Y1 on stack, with Y1 on top
              X = top of stack symbol again
      
  • Table preparation (4 columns):
    • Matched
      • the string matched so far
    • Stack
      • $ (bottom of stack) on the RHS
      • Place the start symbol on top of $, e.g. the stack starts as E$
      • Top of stack denoted by X.
    • Input
      • Input string
      • current character denoted by a
    • Action
      • output PRODUCTION
      • match TERMINAL
    • Continue the above steps till both the stack and the input are exhausted, i.e. both contain only \(\$\)
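The while-loop above can be made runnable. This sketch reuses the toy grammar \(S \rightarrow aA\), \(A \rightarrow bA\ |\ \epsilon\) with a hand-built table; both are my own illustrative choices, not from the notes.

```python
EPS = "ε"
# Hand-built LL(1) table M[(non-terminal, lookahead)] -> production body
M = {("S", "a"): ("a", "A"),
     ("A", "b"): ("b", "A"),
     ("A", "$"): (EPS,)}

def parse(tokens, start="S"):
    stack = ["$", start]                 # start symbol on top, above $
    tokens = tokens + ["$"]
    ip, output = 0, []
    while stack[-1] != "$":
        X, a = stack[-1], tokens[ip]
        if X == a:                       # matched a terminal, advance input
            stack.pop(); ip += 1
        elif (X, a) in M:                # expand by the predicted production
            body = M[(X, a)]
            output.append((X, body))
            stack.pop()
            for sym in reversed(body):   # push the body, leftmost symbol on top
                if sym != EPS:
                    stack.append(sym)
        else:                            # terminal mismatch or blank table entry
            raise SyntaxError(f"unexpected {a!r} with {X!r} on top of stack")
    if tokens[ip] != "$":
        raise SyntaxError("input not fully consumed")
    return output                        # the leftmost derivation, step by step

for head, body in parse(["a", "b", "b"]):
    print(head, "->", " ".join(body))
```

The productions come out in leftmost-derivation order, which is exactly the Output column of the 4-column table the notes describe.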

Error Recovery

  • Error detected when, terminal on top of stack does not match next input symbol
Panic Mode
  • Skip symbols on the input until a token from a selected set of synchronising tokens appears.
  • Add synch entries to the parsing table: for each non-terminal \(X\), mark the blank entries under the terminals in \(FOLLOW(X)\) as synch.
  • If the terminal on top of the stack does not match the input symbol, pop it and continue.
  • The issue here is that panic mode by itself does not produce informative error messages.
Phrase Level Recovery
  • Fill the blank table entries with pointers to error routines.
  • These routines may change, insert, or delete a symbol and issue the appropriate error message.
  • Altering stack symbols or pushing new symbols onto the stack is questionable though; why? It might lead to an infinite loop.

Bottom Up Parsing

  • Parse tree construction begins at leaves and works its way up
  • KEY decision \(\rightarrow\) when to reduce, and which production to apply, as the parser proceeds.

Table Format

  • 3 Columns
    • Stack
      • $0 on LHS
      • initialise with 0
    • Input
      • Input string from current string onwards
    • Action
      • Shift
      • Reduce by ...
  • Accept (acc) when the augmented grammar's start production \(S' \rightarrow S\) is reduced

Reductions

  • Think of Bottom up parsing as process of "reducing" string \(w\) to start Symbol \(S\) of Grammar
  • At each \(reduction\) step, a specific substring matching the body of a production is replaced by the non-terminal at the head of that production

Handle Pruning

  • A "handle" is a substring that matches the body of a production, and its reduction represents one step along the reverse of rightmost derivation
  • The leftmost substring that matches the body of some production need not be a handle.
  • Why a "handle" and not "the handle"? The grammar might be ambiguous

Shift-Reduce Parsing

  • form of bottom-up parsing where,
    • Stack \(\rightarrow\) Grammar Symbols
    • Input buffer \(\rightarrow\) rest of the string to be parsed
  • We use \(\$\) to mark bottom of the stack (on the left side here), and the right end of the input.
  • During a left-to-right scan of input string
    • parser shifts input symbols onto the stack, until it is ready to reduce some string \(\beta\) on top of the stack
    • Then reduces \(\beta\) to head of the appropriate production
    • Repeat this cycle until it has detected an error or until the stack contains the start symbol and input is empty
  • A shift-reduce parser can take 4 actions:
    • Shift - Shift next input symbol to top of stack
    • Reduce - right end of the string to be reduced must be at top of stack.
    • Accept
    • Error
  • Conflicts
    • shift/reduce conflict
    • reduce/reduce conflict
REDUCE Step
  • Number the original productions! r1, r2, r3, ...
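The shift/reduce/accept cycle can be sketched with hand-built ACTION/GOTO tables for the toy grammar 1: \(S \rightarrow (S)\), 2: \(S \rightarrow a\), augmented with \(S' \rightarrow S\). The grammar and its tables are my own worked example, not taken from the notes.

```python
PRODS = {1: ("S", 3), 2: ("S", 1)}           # rule number -> (head, body length)
# ACTION[(state, lookahead)] = ("s", next state) | ("r", rule) | ("acc", None)
ACTION = {
    (0, "("): ("s", 2), (0, "a"): ("s", 3),
    (1, "$"): ("acc", None),
    (2, "("): ("s", 2), (2, "a"): ("s", 3),
    (3, ")"): ("r", 2), (3, "$"): ("r", 2),
    (4, ")"): ("s", 5),
    (5, ")"): ("r", 1), (5, "$"): ("r", 1),
}
GOTO = {(0, "S"): 1, (2, "S"): 4}

def parse(tokens):
    stack = [0]                              # stack of states, $0 on the left
    tokens = tokens + ["$"]
    ip, trace = 0, []
    while True:
        act = ACTION.get((stack[-1], tokens[ip]))
        if act is None:
            raise SyntaxError(f"no action for {tokens[ip]!r} in state {stack[-1]}")
        kind, arg = act
        if kind == "s":                      # shift: consume token, push state
            trace.append(f"shift {tokens[ip]}")
            stack.append(arg); ip += 1
        elif kind == "r":                    # reduce: pop |body| states, then GOTO
            head, n = PRODS[arg]
            del stack[len(stack) - n:]
            stack.append(GOTO[(stack[-1], head)])
            trace.append(f"reduce by r{arg}")
        else:                                # accept
            trace.append("accept")
            return trace

print(parse(["(", "a", ")"]))
```

The trace column matches the Action column of the 3-column LR table format above: a sequence of shifts and reduces ending in accept.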
SLR(1)
  • Same as LR(0), but additionally compute \(FOLLOW\) of the non-terminals.
  • The only change is a restriction on the reduce entries.
  • For each completed (reduced) item \(A \rightarrow \alpha.\), place the reduce entry only under the terminals in the FOLLOW of the production head \(A\).
CLR(1)
  • aka, LR(1)
  • Used by Java, C, etc. parser
  • Most powerful parser
  • Lookaheads are not taken from the grammar-wide \(FOLLOW\) sets, but are computed per item while building the item sets.
  • Every itemset \(I_j\) has its own production items, each carrying its own lookahead set, e.g.
    • \(S^{'}\rightarrow S. \ , \ \$\)
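The item-set construction these table-driven parsers share can be sketched via the CLOSURE and GOTO operations over LR(0) items (CLR(1) items would additionally carry the lookahead shown above; this sketch omits it). The grammar \(S \rightarrow (S)\ |\ a\) and the item encoding (head, body, dot position) are my own illustration.

```python
G = {"S": [("(", "S", ")"), ("a",)]}

def closure(items, G):
    """Add B -> .γ for every item with the dot just before a non-terminal B."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot in list(items):
            if dot < len(body) and body[dot] in G:
                for b in G[body[dot]]:
                    item = (body[dot], b, 0)
                    if item not in items:
                        items.add(item)
                        changed = True
    return items

def goto(items, X, G):
    """Advance the dot over X in every item that allows it, then close."""
    moved = {(h, b, d + 1) for h, b, d in items if d < len(b) and b[d] == X}
    return closure(moved, G)

I0 = closure({("S'", ("S",), 0)}, G)       # start from the augmented item S' -> .S
print(sorted(I0))
print(sorted(goto(I0, "(", G)))
```

Repeatedly applying goto to newly discovered item sets yields the whole LR(0) automaton, from which the ACTION/GOTO tables are filled in.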

FootNotes:

[^0]: sentential form: any valid sequence of grammar symbols derivable from the start symbol, whether it is fully derived (a sentence) or an intermediate form.