Wednesday, November 14, 2012

Xtext Corner #9 - About Keywords, Again

In the last weeks, I compiled some information about proper usage of keywords and generally about terminals in Xtext:
  • Keywords may help to recover from parse errors in a sense that they guide the parser.
  • It's recommended to use libraries instead of a hard wired keyword-ish representation for some built in language features.
  • Data type rules are the way to go if you want to represent complex syntactical concepts as atomic values in the AST.
In addition to these hints, there is one particular issue that arises quite often in the Xtext forum. People often wonder why their grammar does not work properly for some input files but perfectly well for others. What it boils down to in many of these examples is this:
Spaces are evil!
This seems to be a bold statement but let me explain why I think that keywords should never contain space characters. I'm assuming you use the default terminals but actually this fits for almost all terminal rules that I've seen so far. There is usually a concept of an ID which is defined similar to this:

terminal ID: 
  ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9')*;

IDs and Keywords

IDs start with a character followed by an arbitrary number of additional characters or digits. And keywords usually look quite similar to an ID. No surprises so far. Now let's assume a keyword definition like 'some' 'input' compared to 'some input'. What happens if the lexer encounters an input sequence 'som ' is the following. It'll start to consume the leading 's' and has not yet decided which token to emit, since it could become a keyword or an identifier. Same for the 'o' and the 'm'. The trailing space is the character where it can finally decide that 'som ' contains two tokens: an identifier and a whitespace token. So far so good.

Let the Parser Fail - For Free

Now comes the tricky part since the user continues to type an 'e' after the 'm': 'some '. Again, the lexer starts with the 's' and continues to consume the 'o', 'm' and 'e'. No decision was made yet: it could still be an ID or the start of the keyword 'some input'. The next character is a space, and that's the crucial part here: If grammar contains a keyword 'some input', the space is expected since it is part of the keyword. Now, the lexer has only one valid alternative. After the space it is keen on consuming an 'i', 'n', 'p', 'u' and 't'. Unfortunately, there is no 'i' in the parsed text since the lexer already reached the end of the file.

As already mentioned in an earlier post, the lexer will never roll back to the token 'some' in order to create an ID token and a subsequent whitespace. In fact the space character was expected as part of single token so it was safe to consume it. Instead of rolling back and creating two tokens, the lexer will emit an error token which cannot be handled by the parser. Even though the text appeared to be a perfectly valid ID followed by a whitespace, the parser will fail. That's why spaces in keywords are considered harmful.

In contrast, the variant with two split keywords of the grammar works fine. Here, the user is free to apply all sorts of formatting to the two adjacent keywords, any number of spaces, line breaks or even comments can appear between them, are valid and handled well by the parser. If you are concerned about the convenience in the editor - after all, a single keyword with a space seems to be more user friendly in the content assistant - I recommend to tweak that one instead of using an error prone grammar definition.

2 comments:

Aaron Digulla said...

Shouldn't that be "current parser technologies are evil"?

Spaces are a great way to visually align input.

The problem starts when (lazy?) developers cut corners and create lexers and parsers which aren't smart enough to handle common cases like dangling else or mixing grammars (i.e. when you want to embed SQL into a Java grammar).

Sebastian Zarnekow said...

Aaron,

visual alignment should not be tied to the semantics of the code. Instead that's a matter of the pretty printer or formatter - and of course it's something that depends on the habits of the developer.

I doubt that there are better solutions for the rather complex problem of mixing languages than a divide and conquer approach so I don't consider that cutting corners but a pragmatic approach to solve most of the common problems.

Regards,
Sebastian