This article covers some very basic concepts and algorithms for lexical analysis.

Regular Expressions

Alphabet and Language

Language Operations

Definition of Regular Expression

It is defined in recursion. Base:

Induction: if r and s are regular exprs,

Nondeterministic Finite Automaton (NFA)

Regular Expr to NFA

A regular expression can be expressed as an NFA. An NFA is formally a 5-tuple, $(Q,\Sigma,\Delta,q_0,F)$, which:

From the initial state, an NFA maps each state $s$ to a set $S$ of succeeding states. One can recursively construct the corresponding NFA from a regular expression, in a bottom-up manner:

Regular Expr to NFA

Evaluate an Expression Using NFA

For NFA, the basic idea is to maintain a set of reachable states $S$. In each iteration, use the current symbol $c$ to expand $S=\epsilon-closure(move(S,c))$. Finally, check whether an accepting state is inside $S$.

  S = epsilon_closure(s_0);
  c = next_char();
  while(c != eof) {
    S = epsilon_closure(move(S, c));
    c = next_char();
  }
  return S intersect F is not empty ? True : False;

Deterministic Finite Automaton (DFA)

DFA is a special case of NFA. A DFA has these additional properties:

Evaluate an Expression Using DFA

It is trivial to evaluate using a DFA, since the maintained state is deterministic at any time. Follow edges out from the states, until a dead state or a final state is met:

  s = s_0;
  c = next_char();
  while(c != eof) {
    s = move(s, c);
    c = next_char();
  }
  return s is in F ? True : False;

NFA to DFA

We can easily construct an NFA that each node is paired with another in the expression. This is hard for DFA, however. To get an equivalent DFA, the key idea is to group all states that can be reached at the same time as one DFA node.

First, We have a $S=\epsilon-closure(n_0)$ that each state inside it can be the first symbol of the expression. Then, for each input symbol $c$, we compute the next group of states $S’=\epsilon-closure(move(S,c))$ and mark it as a new node(if not before). Finally, link $S$ and $S’$ with $c$.

  Dstates.insert(epsilon_closure(s_0));
  while(Dstates is not empty) {
    T = Dstates.pop();
    if(visited[T] == True)
      continue;
    else
      visited[T] = True;
    
    for(each input symbol a) {
      U = epsilon_closure(move(T, a));
      Dstates.insert(U);
      Dtran[T, a] = U;
    }
  }

Minimizing a DFA

Multiple equivalent DFAs are possible, but only one is optimal. It means some nodes in a DFA can be combined: if two nodes respond equally for any input symbol, we can bind them together with the input edges unioned but the out edges unchanged.

Therefore, it is easy to distinguish them by repeat tests with each input symbol. Ultimately you will get groups of nodes, which nodes inside a group can be merged but are distinguishable between groups.