Expressions
XL expressions are pretty similar to any old expressions you'd find in any old language (or in any new language, for that matter.) The stuff we do here will be pretty much the same in any parser or compiler you write in PCCTS, so pay attention! I know you're falling asleep after reading thirty-odd pages of drivel, but hang with me for a little longer!
There are a few things to notice about using PCCTS when you write expressions. The big one is that PCCTS is an LL(k) parser-generator. That means several things, but the biggest (with regard to expressions) is that left recursion is not allowed. See some compiler books for the details; in a nutshell
a : a PLUS a ;
would generate code like
a() { a(); match(PLUS); a(); }
which would obviously recursively call itself until you run out of stack space. There are several ways around this. You have probably seen the following type of code used to describe parts of expressions:
a : a PLUS a | t ;
To get rid of the left-recursion that renders LL(k) grammars helpless, you can use some wonderfully-mechanical algorithms to turn it into
a : a1 a_tail ; a1 : t ; a_tail : PLUS t a_tail | // nothing ;
Yuck! It works, but how long did you have to think about it to understand what it means?
Notice that the mechanically-generated code utilizes tail-recursion (recursion as the last element of a rule, also know as right-recursion) to keep adding (PLUS t) to the end of the "a" thing we are creating. Eventually, it ends with the
empty (epsilon) production in place of a_tail
Thinking about right-recursion, recall that right recursion can always be replaced very easily by a loop. (From where do you recall this? I dunno. Some programming class.) In PCCTS, we have the advantage of being able to use ()* and ()+ closures using EBNF notation. In other words, we can easily represent loops. So, a rule like
a_tail : PLUS t a_tail | // nothing ;
can be re-written as
a_tail : (PLUS t)* ;
which is much better so far. Now look what we have:
a : a1 a_tail ; a1 : t ; a_tail : (PLUS t)* ;
Notice something about a_tail
now? Without the tail recursion, it's only used in one place, so we can substitute it. And while we're at it, let's substitute a1
in place as well, yielding
a : t (PLUS t)* ;
When we look at that, we see that that's exactly what we wanted. Basically, PLUS is associative, so the order in which you add things really doesn't matter. You can keep saying "PLUS something else" at the end of our PLUS expression. The above notation is actually very good, because it matches what your brain does as it adds new parts to the expression as it reads left-to-right.
The moral of the above nonsense is that an expression-like rule such as
a : a OPERATOR | t ;
can be re-written
a : t (OPERATOR t)* ;
Now that we're armed, lets start organizing for our attack on XL's expression rules!
Precedence
In the description of XL above, we defined that the operators have the following precedences:
Boolean negation | not |
Unary adding operators | + - |
Multiplying operators | * / mod |
Binary adding operators | + - |
Relational operators | = /= < <= > >= |
Logical operators | and or |
What exactly does this mean? Well, it means that if we see a not somewhere in an expression, we should resolve it before we resolve any other operators (unless, of course, there are specific precedences specified by using parentheses.) So if we had an expression like
a and x or not y and 2 + -q * s < 10
we would first evaluate the "not" (let's use parentheses to specify evaluation grouping)
a and x or (not y) and 2 + -q * s < 10
then take care of the unary -
a and x or (not y) and 2 + (-q) * s < 10
then the *
a and x or (not y) and 2 + ((-q) * s) < 10
then the binary +
a and x or (not y) and(2 + ((-q) * s)) < 10
then the relational <
a and x or (not y) and((2 + ((-q) * s)) < 10)
then the boolean and and or
(((a and x) or (not y)) and ((2 + ((-q) * s)) < 10))
Whoa! Is that right? In most languages, and has a higher precedence than or. But in our precedence chart for XL, and and or have the same precedence. So while we might expect to see the evaluation proceed like
((a and x) or ((not y) and ((2 + ((-q) * s)) < 10)))
that's not how it goes, because and and or have the same precedence, we have to just read them left to right. (In this situation, XL's design is less intuitive because of the preponderance of other languages that give and a higher precedence than or. But that was the language designer's choice, and we'll live with it.)
So what do we do with these precedences? Basically
- each level of precedence represents a rule in the grammar
- Each operator in that precedence level is one alternative in that rule
- The rules will nest with the highest precedence operators being the most-deeply nested rules
- Each precedence level will only imbed the next higher precedence level.
- The highest precedence level will only use "primitive" expression elements, such as variable references and literals.
So lets start at the top of the precedence chart. I'm going to go out on a limb and avoid the terms term, factor, simple expression. I could never remember which was supposed to be which from a mathematics point of view. Instead, I'll follow Sun's lead in their Java grammar, and use more descriptive names.
We need to start with our primitive expression elements. In XL, these are constant values, variable references, and parenthesized expressions. Remember, parenthesizing an expression basically makes it the most important thing that is currently being evaluated.
primitiveElement : constantValue | variableReference | LPAREN expression RPAREN ;
So far so good. Now, let's just walk down the precedence chart. First, boolean negation:
booleanNegationExpression : (NOT)* primitiveElement ;
Notice that the NOT is optional here, and you can have as many of them as you like. That is because we want to be able to just pass any expression element through a rule without modifying it, enabling us to have an expression just be a primitiveElement
if that's what the user wants. Next, we have unary PLUS and MINUS:
// in the grammar signExpression : (PLUS|MINUS)* booleanNegationExpression ;
Something I like to do (and this is strictly style preference) is to define a token class for each operator set, which would make this:
// in the scanner specification #tokclass ADD_OP {PLUS MINUS} // in the grammar signExpression : (ADD_OP)* booleanNegationExpression ;
which I personally find a bit more readable, but you might think differently. Just another possibility. Notice that the ( )* can evaluate to "no operators," meaning that booleanNegationExpression
can be passed straight through
this precedence level without any modification.
Let us have as many plusses or minuses in front of our expression so far. Next, we have our multiplying operators, *, / and mod:
// in the scanner specification #tokclass MULT_OP {TIMES DIV MOD} // in the grammar multiplyingExpression : signExpression (MULT_OP signExpression)* ;
Which lets us keep tacking on multiplying operators ad nauseum.
Notice that this precedence rule imbedded the previous rule, and again, can let it just pass signExpression
right on through without modifying it. By now, things should be seeming a bit more clear. Next, we move onto the binary adding
operations:
// use same ADD_OP in the scanner specification // in the grammar addingExpression : multiplyingExpression (ADD_OP multiplyingExpression)* ;
Will we have a conflict between an ADD_OP acting as a binary add operator and one acting as a unary operator? No. The reason is that if we see one in between two partial expressions, it must be a binary operator. If we see another right after it (and any number of them after that) they must be working (covertly) as unary operators. Next we have the relational operators:
// in the scanner specification #tokclass RELATIONAL_OP {EQUALS NOT_EQUALS GT GTE LT LTE} // in the grammar relationalExpression : addingExpression (RELATIONAL_OP addingExpression)* ;
Finally, we get to the operators with the lowest precedence. At this level, we can call the result a full expression. This level encompasses and and or operators:
// in the scanner specification #tokclass BOOLEAN_OP {AND OR} // in the grammar expression : relationalExpression (BOOLEAN_OP relationalExpression)* ;
Tack on as many and/or partial expressions as you like...
Now that we've gotten through expressions, it really doesn't seem so bad. The precedence table actually made it easier, rather than more confusing.
At this point, we have specified the entire grammar for an XL parser. We'll see the whole grammar together soon, but first we need to discuss the code that holds it all together.