Tuesday, February 21, 2012

The Real Reason for Lexing

All too often I see introductory texts say that parsing is really a two phase process:
  1. lexing
  2. (actual) parsing
And they say that phase 1 is done for efficiency reasons. Which is right, but only in an indirect manner. The reality of the situation is that people need(ed) to do parsing using only 1 token of lookahead. If you don't split your stream into tokens, even trivial languages cannot be parsed deterministically: e.g. seeing an "i" you don't know (yet) if what follows is an if-then-else statement or the identifier "iamalongidentifier". However, you can hack around this problem by first tokenizing/scanning/lexing the input stream. Then the lookahead is the entire string "iamalongidentifier" (or "if"), which solves the determinacy problem. There you have it: the real reason for this (historical) split-up into phases.

No comments:

Post a Comment