Parsec and Left Recursive Grammars

The traditional way to specify the syntax of a programming language is to write a Context Free Grammar (CFG) in Extended Backus-Naur Form (EBNF). Briefly, this form consists of a series of rules, each describing the form of some syntax element. For example, we might have a rule for postfix expression in C:

postfix-expression:
   primary-expression
   | postfix-expression '[' expression ']'
   | postfix-expression '(' argument-expression-list ')'
   | postfix-expression '.' identifier
   ⋮

This says that a postfix expression can be a primary expression (constant, identifier, or expression in parentheses) or it can be a postfix expression followed by some operator. An expression like

rectangle.bottomLeft.x

is thus parsed as

((primary-expression) '.' identifier) '.' identifier

The rule for postfix-expression is known as left recursive because postfix-expression appears as the leftmost symbol in some of the productions. This causes a problem for most left-to-right parsers. When they attempt to parse a postfix expression, they might first attempt to parse it as a primary-expression. If that fails, they might attempt to parse it as an array subscript. To do this, they first need to parse a postfix-expression. This leads to an infinite recursion, where the parser continues to nest deeper and deeper, while never consuming any input.

The traditional solution is to rewrite our rule into a right recursive form. In this case, we notice that every postfix-expression begins with a primary-expression, followed by a (possibly empty) chain of subscripts, member accesses and function calls. We can thus rewrite the grammar as

postfix-expression: primary-expression postfix-expression'

postfix-expression': empty
                   | '[' expression ']' postfix-expression'
                   | '(' argument-expression-list ')' postfix-expression'
                   | '.' identifier postfix-expression'
                   ⋮

Unfortunately, this changes the associativity of the operations. Instead of being left associative, they are now right associative, and our example would be parsed as

primary-expression ('.' identifier ('.' identifier))

The solution to this is somewhat dependent on the particulars of your parser. I have been using Parsec for parsing. The left recursive grammar would be written as

data PostfixExpression = Primary PrimaryExpression
                       | ArraySubscript PostfixExpression Expression
                       | FunctionCall PostfixExpression [ArgumentExpression]
                       | Member PostfixExpression Identifier
                       ⋮
                       
postfixExpression =
    (try $ liftM2 ArraySubscript postfixExpression $ brackets expression)
    <|> (try $ liftM2 FunctionCall postfixExpressioon $ parens $ many argumentExpression)
    <|> (try $ liftM2 Member postfixExpression (dot >> identifier))
    ⋮
    <|> liftM Primary primaryExpression

This builds an abstract syntax tree and nicely matches the structure of the grammar. Unfortunately, it doesn't complete (as we have seen). The advice on most blogs I have seen for dealing with left recursion is to use the chainl1 combinator. This would be a nice solution, but it assumes all the subexpressions have the same type as the result of combining them. chainl1 can be adapted to handle the case of subexpressions of type a and a result of type b. This was useful for handling arithmetic expressions.

chainl1' :: Parser a -> Parser (b -> a -> b) -> (a -> b) -> Parser b
chainl1' parser op baseConstructor = parser >>= (rest . baseConstructor)
    where rest context = do
                             f <- op
                             y <- parser
                             rest $ f context y
                         <|> return context

data AdditiveExpression = MultiplicativeExpression MultiplicativeExpression
                        | Add AdditiveExpression MultiplicativeExpression
                        | Subtract AdditiveExpression MultiplicativeExpression

parsePlusMinus = (reservedOp '+' >> return Add) <|> (reservedOp '-' >> return Subtract)
additiveExpression = chainl1' multiplicativeExpression parsePlusMinus MultiplicativeExpression

This works beautifully for defining the arithmetic of a language, which is often many levels deep and each level has similar structure to all the others. Postfix expressions lack this regular structure, so we need a slightly more involved approach. It might be possible to make an even more general chain combinator, which associates each operator parser with an expression parser, but it would only be used in this location, so it is simpler to just handle this case independently. The structure is similar to chainl1', but with several alternatives

postfixExpression = Primary primaryExpression >>= rest
    where rest context =
                         (brackets expression >>= rest . ArraySubscript context)
                         <|> (parens expression >>= rest . FunctionCall context)
                         <|> (dot >> identifier >>= rest . Member context)
                         <|> context

This runs nicely and still preserves clarity.

Coming up with this solution made showed just how valuable having the source for a solution that does something similar to what you want is, and how valuable having a package management system (Hackage) that can show source is.

Edit: I messed up the final version of postfixExpression. The post was written before I had completed the loop inherent in the C expression type, and consequently couldn't test the parser. This is a case where implicit typing bit me. I knew rest should have been

rest :: PostfixExpression -> ParsecT String u Identity PostfixExpression

but hadn't annotated it. My original version was taking a PasecT as its argument, which lead back to the looping problem.