This divide-and-conquer approach is usually just called parsing, so the distinction between lexing and parsing is well encapsulated. Nevertheless, the Xtext grammar language honors both aspects: it is possible to define (parser) rules that are processed by the parser, and it is also possible to define terminal rules, which are handled by the lexer. So when should I use parser rules and when should I use terminal rules?
Production Rules (also: Parser Rules)

The obvious case, listed just for the sake of completeness: production rules yield an instance in the abstract syntax tree. They can only be implemented by the parser, so there is no question whether to use terminals instead. Production rules are the most common rules in almost every Xtext grammar.
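For illustration, a minimal production rule could look like the following sketch (the rule and feature names Entity, Feature and name are made up for this example):

```xtext
// A production rule: parsing this creates an instance of the
// (hypothetical) EClass 'Entity' in the abstract syntax tree.
Entity:
    'entity' name=ID '{'
        features+=Feature*
    '}';
```

Each assignment such as name=ID populates a structural feature of the created instance.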
Data Type Rules

Those are a completely different thing, even though they are handled by the parser, too: where ordinary parser rules produce instances of EClasses, data type rules return data types (you did not guess that, did you?). Data types in the sense of Xtext and its usage of the Eclipse Modeling Framework are basically primitive Java types, Strings or other common types like BigDecimal or enums. The parser will not create those on its own but rather pass the consumed tokens as a string to a value converter. The language developer is responsible for converting that string to a data type.
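A typical sketch of a data type rule is a qualified name (the exact shape of the rule is an assumption here, not taken from this text):

```xtext
// A data type rule: the parser matches the tokens and passes the
// concatenated text as a string to a value converter, which turns
// it into the actual data type (here: a plain String).
QualifiedName:
    ID ('.' ID)*;
```

Because the parser drives the matching, such a rule can safely span several tokens without confusing the lexer.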
Terminal Rules

Terminal rules are essentially the same as data type rules if you only consider the interface to the grammar. Internally they are completely different, since they are not processed by the parser but by the lexer. The consequences are quite severe if you want to get a working grammar. But one step after the other: as already mentioned, terminal rules can return the very same things as data type rules. That is, they yield Strings, ints or other primitives. But since they are handled by the lexer, they are not quite as powerful as data type rules.
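A terminal rule declares its returned type after the keyword returns; this sketch resembles the INT terminal shipped with Xtext's default Terminals grammar:

```xtext
// A terminal rule: matched by the lexer, returns a primitive int.
terminal INT returns ecore::EInt:
    ('0'..'9')+;
```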
Implementation Aspects

The lexer is a pretty dumb component which is generated in a way that weighs performance over intuitive behavior. Where the parser generator produces nice error messages in case of ambiguous data type rules, conflicting terminals are mostly resolved by a first-come-first-served principle. For terminals, it is crucial to get them in the right order. Consider the following terminal rules:
terminal ID: ('a'..'z') ('a'..'z'|'0'..'9')*;
terminal CHARS: ('a'..'z'|'_')+;

The ID rule shall consume something that starts with a lowercase letter and is followed by lowercase letters or digits. The CHARS rule is pretty close to that one: it shall match a sequence that contains only lowercase letters or underscores. The problem with these is that the matched sequences are not mutually exclusive. If you take the input abc as an example, it will be matched as an ID given the two rules above. As soon as you switch the order of the declarations, the sequence abc will all of a sudden be returned as CHARS. That is one thing you have to be aware of if you use terminal rules. But there is more to keep in mind.
Terminal rules are applied without any contextual information, which has some interesting implications. On the plus side, it is easily possible to run the lexer on partial input - entry points can be computed almost trivially, and there is no such thing as an entry rule as there is for the parser. But the disadvantages have to be taken into account, too. The lexer does not care about the characters that are still to come in the input sequence: everything that matches will be consumed - and not reverted (as of Antlr 3.2). To explain what that means, let's take another example:
terminal DECIMAL: INT '.' INT;
terminal INT: '0'..'9'+;

These rules define decimal numbers in a hypothetical language that also supports method invocation. At first glance, things seem to work fine: 123 is consumed as an INT while 123.456 will be a DECIMAL. But the devil is in the details. Let's try to parse the string 123.toString(). The lexer will find an INT 123 - so far, so good. Now it sees a dot, which is expected by the terminal rule DECIMAL. The lexer will consume the dot and try to read an INT afterwards - which is not present. Now it will simply fail for the DECIMAL rule but never revert the dot character that was consumed almost by accident. The lexer creates an invalid token sequence for the parser, and the method call cannot be read successfully. That is because the lexer simply does not know about things like the expectation in the current parser state. Attempts to define decimals, qualified names or other more complex strings like dates as terminals are very error prone, but they can often be implemented quite easily by means of data type rules:

DECIMAL: INT '.' INT;
terminal INT: '0'..'9'+;