Lexical analysis
In the case of a programming language, the categories include identifiers, operators, grouping symbols, and data types. For example, in the source code of a computer program, an assignment statement might be converted into a lexical token stream in which whitespace is suppressed and special characters have no value (a minimal tokenizer sketch appears below).

Tokenization is particularly difficult for languages written in scriptio continua, which exhibit no word boundaries, such as Ancient Greek, Chinese,[4] or Thai. Even in space-delimited languages there are many edge cases, such as contractions, hyphenated words, emoticons, and larger constructs such as URIs (which for some purposes may count as single tokens). In the simplest case, a raw input of, say, 43 characters must be explicitly split into its 9 tokens with a given space delimiter (i.e., matching the string " " or the regular expression /\s{1}/). Some ways to address the more difficult problems include developing more complex heuristics, querying a table of common special cases, or fitting the tokens to a language model that identifies collocations in a later processing step.

Generally, lexical grammars are context-free, or almost so, and thus require no looking back or ahead, or backtracking, which allows a simple, clean, and efficient implementation. In some languages, however, the lexeme creation rules are more complex and may involve backtracking over previously read characters. In order to construct a token, the lexical analyzer also needs a second stage, the evaluator, which goes over the characters of the lexeme to produce a value (sketched below).

However, lexers can sometimes include some complexity, such as phrase structure processing to make input easier and simplify the parser. One example is semicolon insertion, where semicolons are part of the formal phrase grammar of the language but may not be found in the input text, since the lexer can insert them. A more complex example is the lexer hack in C, where the token class of a sequence of characters cannot be determined until the semantic analysis phase, since typedef names and variable names are lexically identical but constitute different token classes. In the hack, the lexer therefore calls the semantic analyzer (say, the symbol table) and checks whether the sequence requires a typedef name.

Lexers may be written by hand, but tokens are often defined by regular expressions, which are understood by a lexical analyzer generator such as lex, or by handcoded equivalent finite-state automata. Such generators are a form of domain-specific language, taking in a lexical specification – generally regular expressions with some markup describing the tokens allowed in the input stream – and emitting a lexer. Some generators emit an engine that directly jumps to follow-up states via goto statements rather than consulting a transition table, and it is in general difficult to hand-write analyzers that perform better than the engines these tools generate. Lexers may nonetheless be written partly or fully by hand, either to support more features or for performance.
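As a concrete illustration of a token stream and of regex-defined tokens, here is a minimal tokenizer sketch in Python. It is not lex itself: the categories, the patterns, and the input string x = a + b * 2; are illustrative assumptions, and the named-group alternation merely stands in for a lex-style specification.

```python
import re

# A minimal sketch of a regex-defined tokenizer. Categories, patterns, and
# the input below are illustrative; each token class is given by a named
# regular expression, mirroring a lex-style specification in miniature.
TOKEN_SPEC = [
    ("NUMBER",     r"\d+"),           # integer literals
    ("IDENTIFIER", r"[A-Za-z_]\w*"),  # names
    ("OPERATOR",   r"[=+\-*/]"),      # single-character operators
    ("SEPARATOR",  r"[();]"),         # grouping symbols and statement ends
    ("WHITESPACE", r"\s+"),           # matched, then suppressed
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Yield (category, lexeme) pairs; whitespace is suppressed."""
    for match in MASTER_RE.finditer(source):
        if match.lastgroup != "WHITESPACE":
            yield (match.lastgroup, match.group())

print(list(tokenize("x = a + b * 2;")))
# [('IDENTIFIER', 'x'), ('OPERATOR', '='), ('IDENTIFIER', 'a'),
#  ('OPERATOR', '+'), ('IDENTIFIER', 'b'), ('OPERATOR', '*'),
#  ('NUMBER', '2'), ('SEPARATOR', ';')]
```

Matching whitespace explicitly and then discarding it is a common convenience; a production lexer would additionally report characters that match no pattern as errors.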
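For the simple space-delimited case described above, the split really can be just the stated delimiter. The sentence below is an illustrative 43-character, 9-word input:

```python
import re

# Splitting on the given space delimiter: the string " ", or equivalently
# the regular expression /\s{1}/. The sentence is illustrative: 43 raw
# characters split into 9 word tokens.
text = "The quick brown fox jumps over the lazy dog"
tokens = re.split(r"\s{1}", text)
print(len(text), len(tokens))  # 43 9
# tokens == ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
```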
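The evaluator stage can be sketched in the same vein. The categories and conversion rules here are assumptions for illustration: a string literal's evaluator strips the quotes, while a separator's characters carry no value.

```python
# A sketch of the evaluator: it goes over the characters of the lexeme to
# produce a value. Categories and rules are illustrative assumptions.
def evaluate(category, lexeme):
    if category == "NUMBER":
        return int(lexeme)       # e.g. the lexeme "2" becomes the integer 2
    if category == "STRING":
        return lexeme[1:-1]      # strip the quotes (escape handling omitted)
    if category == "IDENTIFIER":
        return lexeme            # the name itself serves as the value
    return None                  # operators/separators: the category suffices

pairs = [("IDENTIFIER", "x"), ("OPERATOR", "="), ("NUMBER", "2")]
tokens = [(cat, evaluate(cat, lex)) for cat, lex in pairs]
# [('IDENTIFIER', 'x'), ('OPERATOR', None), ('NUMBER', 2)]
```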
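Semicolon insertion can be sketched as a post-pass over the token stream. The rule below (append a semicolon at a line break when the line's last token could end a statement) is a deliberately simplified assumption, loosely modeled on rules like Go's rather than any language's exact specification.

```python
# Simplified semicolon insertion: the inserted token is part of the phrase
# grammar but never appears in the input text. The trigger set is an
# illustrative assumption.
STATEMENT_ENDERS = {"IDENTIFIER", "NUMBER"}

def insert_semicolons(lines_of_tokens):
    out = []
    for line in lines_of_tokens:
        out.extend(line)
        if line and line[-1][0] in STATEMENT_ENDERS:
            out.append(("SEPARATOR", ";"))  # supplied by the lexer
    return out

insert_semicolons([[("IDENTIFIER", "x"), ("OPERATOR", "="), ("NUMBER", "2")]])
# -> [('IDENTIFIER', 'x'), ('OPERATOR', '='), ('NUMBER', '2'), ('SEPARATOR', ';')]
```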
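Finally, the control flow of the lexer hack can be sketched as follows; the symbol-table representation and the function name are hypothetical, standing in for the semantic analyzer's real bookkeeping.

```python
# The C "lexer hack" in miniature: typedef names and variable names are
# lexically identical, so the lexer asks the semantic analyzer's symbol
# table which token class to emit. The table shown is hypothetical.
typedef_names = {"size_t"}   # maintained by the semantic analyzer

def classify_name(lexeme):
    if lexeme in typedef_names:
        return ("TYPE_NAME", lexeme)    # the sequence requires a typedef name
    return ("IDENTIFIER", lexeme)

classify_name("size_t")  # -> ('TYPE_NAME', 'size_t')
classify_name("count")   # -> ('IDENTIFIER', 'count')
```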