Lexical Analysis

---¶

Read input characters from src program, group into lexemes and produce as output a sequence of tokens for each lexeme in source program.
Parser calls the lexical analyzer to get tokens.
Stripping out comments and white-space, correlating error messages generated by the compiler with the src program are some of the tasks performed by lexical analyzer.
Lexical Analyzer is divided into two processes:
1. Scanning - comment deletion, white space compaction
2. Lexical Analysis - sequence of tokens produced here

token - <token_name, attribute_value> stored in symbol table. attribute_value is optional. Parser processes the token_name symbols.
Pattern is a description for the acceptable lexeme form. Example,
number -> [0-9]*[\.[0-9]]? the REGEX accepts floating point and integer numbers.
lexeme -> sequence of characters that matches the pattern for a token and is identified by lexical analyzer.

To get the lexeme, we often have to look ahead of the current lexeme to match the pattern.
A two-buffer scheme handles large lookaheads.
Two buffers each of size n are loaded to save time; save time? Yes, from having to read from a buffer and waiting for the next buffer. Solution, pre-fetch and have two buffers.
If eof occurs at the end of buffer, it denotes end of buffer. If eof occurs anywhere else it denotes end of the input.
Sentinels
- The sentinel is a special character that cannot be part of the source program - eof.
- For each character read we have two tests
  - Is end of buffer?
  - What character is read?

\(prefix\) \(\rightarrow\) remove 0 or more symbols from the end of string
\(suffix \rightarrow\) remove 0 or more symbols from the beginning
\(substring\rightarrow\) string after deleting any prefix/suffix
\(subsequence\rightarrow\) string after deleting 0 or more not necessarily subsequent characters (deleting any characters in the string)
\(proper\ prefix/suffix/substring\rightarrow\) not \(\epsilon\) or not equal to the original string.

Built recursively out of smaller REGEX
Each regular expressions \(r\) denotes a language \(L(r)\)
Algebraic Laws for Regular Expressions!
Extensions of REGEX:
- One of more instances \(r^{+}\)
  - \(a^{+}\) is {a,aa,aaa,aaaa,...}
- Zero or one instance \(r?\)
  - \(a?\) is {\(\epsilon\) | a}
- Character classes \([a_1a_2...a_n]\)
  - \([abc]\) \(\leftrightarrow\) \(a|b|c\)
  - \([a-z]\) \(\leftrightarrow\) \(a|b|...|z|\)
Normal strategy is to take the longest prefix of the input that matches any pattern as a lexeme.

Collection of nodes/circles called states.
- Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.