- Keywords may help to recover from parse errors in a sense that they guide the parser.
- It's recommended to use libraries instead of a hard wired keyword-ish representation for some built in language features.
- Data type rules are the way to go if you want to represent complex syntactical concepts as atomic values in the AST.
Spaces are evil!
IDs and KeywordsIDs start with a character followed by an arbitrary number of additional characters or digits. And keywords usually look quite similar to an ID. No surprises so far. Now let's assume a keyword definition like 'some' 'input' compared to 'some input'. What happens if the lexer encounters an input sequence 'som ' is the following. It'll start to consume the leading 's' and has not yet decided which token to emit, since it could become a keyword or an identifier. Same for the 'o' and the 'm'. The trailing space is the character where it can finally decide that 'som ' contains two tokens: an identifier and a whitespace token. So far so good.
Let the Parser Fail - For FreeNow comes the tricky part since the user continues to type an 'e' after the 'm': 'some '. Again, the lexer starts with the 's' and continues to consume the 'o', 'm' and 'e'. No decision was made yet: it could still be an ID or the start of the keyword 'some input'. The next character is a space, and that's the crucial part here: If grammar contains a keyword 'some input', the space is expected since it is part of the keyword. Now, the lexer has only one valid alternative. After the space it is keen on consuming an 'i', 'n', 'p', 'u' and 't'. Unfortunately, there is no 'i' in the parsed text since the lexer already reached the end of the file.
As already mentioned in an earlier post, the lexer will never roll back to the token 'some' in order to create an ID token and a subsequent whitespace. In fact the space character was expected as part of single token so it was safe to consume it. Instead of rolling back and creating two tokens, the lexer will emit an error token which cannot be handled by the parser. Even though the text appeared to be a perfectly valid ID followed by a whitespace, the parser will fail. That's why spaces in keywords are considered harmful.
In contrast, the variant with two split keywords of the grammar works fine. Here, the user is free to apply all sorts of formatting to the two adjacent keywords, any number of spaces, line breaks or even comments can appear between them, are valid and handled well by the parser. If you are concerned about the convenience in the editor - after all, a single keyword with a space seems to be more user friendly in the content assistant - I recommend to tweak that one instead of using an error prone grammar definition.