=head1 A Pegex Calculator
When you look around the web for stuff about parsing, you inevitably find a
bunch of examples about how to write a precedence parser for mathematical
expressions.
=over
=item L
=item L
=item L
=item L
=back
This tutorial is the Pegex version of that. Pegex actually comes with an
examples directory that contains two arithmetic expression parser/evaluator
caclulator programs:
=over
=item L
=item L
=back
They both do the same thing but using different parsing approaches. We'll
cover both of them in detail. I hope you'll find that Pegex handles operator
precedence parsing elegantly.
=head1 The Problem
Precedence parsers are interesting because you need to deal with operators
that have the same precedence, different precedences and both left and right
associativity.
Consider the equation:
1 - 2 ^ 3 ^ 4 + 5 * 6 / 7
Normal precedence and associativity rules make this the same as:
(1 - (2 ^ (3 ^ 4)) + ((5 * 6) / 7))
Our calculator should produce the same result for both. Note that this means we
will be parsing numbers, 5 operators, parentheses, and separating whitespace.
Here's an example of the calculator program running:
> perl examples/calculator1.pl
Enter an equation: 1+2*3
1+2*3 = 7
Enter an equation: (1 + 2) * 3
(1 + 2) * 3 = 9
Enter an equation:
=head1 The Solutions
Most of the solutions that you'll read about on the web, involve (or assume) a
lexing/tokenizing step before parsing. Pegex always parses an input stream
directly, pulling out "tokens" that it needs using regex captures. So the parse
happens as one operation, which has many advantages.
But how do we get the operator precedence rules into this? Well, we have 2
different ways:
=head2 calculator1.pl - Operatator Precedence Climbing
Our first example calculator uses what is known as the Operator Precedence
Climbing method. See:
L.
This is basically a clever technique of specifying our grammar rules such that
they imply precedence. Here's the pegex grammar from the code:
expr: add_sub
add_sub: mul_div+ % /~([])~/
mul_div: exp+ % /~([])~/
exp: token+ % /~~/
token: /~~/ expr /~~/ | number
number: /~(?+)~/
It's a little bit wonky but it works. It says that any expression is an
add/subtract and that an add/subtract is really a multiply/divide etc. Finally
after the last operator comes the number token and the parens.
It feels a bit backwards. One of the biggest drawbacks of PCM is that it
becomes less and less efficient with more and more operators. It needs to go
through the whole tree, just to find each number.
But it works and the code is minimal. The receiver class gets the numbers in
the correct order, immediately evaluates the answer and returns the answer for
each level. Whatever the return value of the final operation is, becomes the
result of the parse. Here's the receiver class:
{
package Calculator;
use base 'Pegex::Tree';
sub got_add_sub {
my ($self, $list) = @_;
$self->flatten($list);
while (@$list > 1) {
my ($a, $op, $b) = splice(@$list, 0, 3);
unshift @$list, ($op eq '+') ? ($a + $b) : ($a - $b);
}
@$list;
}
sub got_mul_div {
my ($self, $list) = @_;
$self->flatten($list);
while (@$list > 1) {
my ($a, $op, $b) = splice(@$list, 0, 3);
unshift @$list, ($op eq '*') ? ($a * $b) : ($a / $b);
}
@$list;
}
sub got_exp {
my ($self, $list) = @_;
$self->flatten($list);
while (@$list > 1) {
my ($a, $b) = splice(@$list, -2, 2);
push @$list, $a ** $b;
}
@$list;
}
}
As you can see, it has an action method for each level or precedence.
It loops over the expression, evaluating it. Whether it loops from
left to right or right to left depends on the associativity that we
want to use.
Our runner code looks like this:
while (1) {
print "\nEnter an equation: ";
my $input = <>;
chomp $input;
last unless length $input;
calc($input);
}
sub calc {
my $expr = shift;
my $calculator = pegex($grammar, receiver => 'Calculator');
my $result = eval { $calculator->parse($expr) };
print $@ || "$expr = $result\n";
}
And that's the whole thing. We have a working calculator as specced!
However the real point of this is to explore good parsing techniques,
and the PCM leaves us wanting to try something more efficient. Let's
try another approach...
=head2 calculator2.pl - Shunting Yard Algorithm
An age old way of parsing expressions is to somehow get the numbers and
operators into an RPN (Reverse Polish Notation) stack, which is each operand
follow by its operator. Once in that form, precedence and associativity are
accounted for.
For example:
1 / 2 - ( -3 * 4 )
becomes:
1, 2, /, -3, 4, *, -
To evaluate an RPN you pop off an operator and then attempt to pop
off and operand. If the operand is another operator you recurse. When
you have 2 operands you do the operation and put the result back on
the stack. When there is only 1 element on the stack, you are done.
That's your result.
Let's look at our new grammar in C:
expr: operand (operator operand)*
operator: /~([])~/
operand: num | /~~/ expr /~~/
num: /~(?+)~/
This is much easier to understand. We are just parsing out the
tokens. In a (very real) sense, we are using Pegex as a lexer.
Now let's look at the receiver class:
{
package Calculator;
use base 'Pegex::Tree', 'Precedence';
my $operator_precedence_table = {
'+' => {p => 1, a => 'l'},
'-' => {p => 1, a => 'l'},
'*' => {p => 2, a => 'l'},
'/' => {p => 2, a => 'l'},
'^' => {p => 3, a => 'r'},
};
sub got_expr {
my ($self, $expr) = @_;
$self->precedence_rpn($expr, $operator_precedence_table);
}
}
This is also much simpler. There's only one method. What's going on?
Well the secret is that I put the code to turn the tokens into RPN in
a separate base class called C.
This is an implementation of Edsger Dijkstra's famous Shunting-yard
Algorithm from 1961! It's only 20 lines of Perl. I won't include it
inline here, but have a look at it for yourself.
L
The Shunting-yard algorithm simply takes a list of expression tokens
and transforms them into an RPN stack. It uses information from a
precedence/associativity table like the one above.
Unlike C where we evaluated as we parsed,
C creates an RPN which is akin to an AST. In other
words, it's more like something an actually language compiler would
do.
But we are writing a calculator and we still need to evaluate this pupy. I
changed the runner code to look like this:
sub calc {
my $expr = shift;
my $calculator = pegex($grammar, receiver => 'Calculator');
my $rpn = eval { $calculator->parse($expr) };
my $result = RPN::evaluate($rpn);
print $@ || "$expr = $result\n";
}
As you can see I also moved the evaluator code into a
separate/reusable module: C. This is another 30
lines of Perl. Please take a look at it.
L
So overall, this second solution was a bit more code, but also feels
more solid on several levels.
=head1 Conclusion
Pegex strives to be the nicest and most reusable way to write new parsers.
Operator precedence parsers are a necessary part of parsing mathematical
expressions and computer languages. This tutorial showed you 2 ways to do it.
As the demands for Pegex grow, we may see even more ways to do it.