package yagg::Tutorial; # This is only a container for the yagg tutorial to allow viewing # tutorial-style documentation by way of the perldoc utility. 1; __END__ =head1 NAME yagg::Tutorial - Tutorial for yagg =head1 SYNOPSIS # To use the generator ./yagg -m nonterminals.yg terminals.lg ./output/progs/generate 5 =head1 OVERVIEW This tutorial will show you how to use B, by way of two examples. In the first example, we create a simple logical expression generator from scratch. In the second example, we create a more sophisticated logical expression generator from existing parser/lexer input files, such as those used by YACC/Bison and LEX/FLEX. These examples, plus another more sophisticated fault tree generator are included with the distribution in the F directory. It is assumed that the reader knows a little about formal grammars. Ideally, the reader would have some experience writing grammars for input to parser generators like YACC and Bison. =head1 EXAMPLE: GENERATOR IMPLEMENTED FROM SCRATCH Let's say that we want to create a logical expression generator that generates expressions containing binary operators, parentheses, negation, and atomic propositions. Examples might be I<-(p or q)>, I

-q>, and I

. Here is a simple grammar that would create such expressions: wfe : wfe binary_operator wfe | "(" wfe ")" | unary_operator wfe | atomic ; binary_operator : "<=>" | "and" | "or" | "=>" ; unary_operator : "-" ; atomic : "p" | "q" | "r" ; You can place the above grammar verbatim into a "language grammar" input file, adding a line containing C<%%> at the top of the file. This is enough for the generator to do its job. Running the generator as: $ yagg -r2 foo.yg will generate the generator, compile it, and run it to create expressions of length 2. =head2 Input Files Another way to write the input files is to put the productions for terminals such as C<"("> and C into a "terminal generator" file. This approach is more similar to the approach taken by tools such as Lex and YACC, which separate the lexical analysis of tokens from the grammar-based parsing. Here's the resulting grammar file Here's the grammar file, F: %% wfe : wfe BINARY_OPERATOR wfe | LEFT_PAREN wfe RIGHT_PAREN | UNARY_OPERATOR wfe | ATOMIC ; Here we're using upper-case as a convention to indicate the terminals. Notice that the strings associated with C, C, and C have been removed. They are located in the terminal grammar file, F: %% ( "<=>" | "and" | "or" | "=>" ) return BINARY_OPERATOR; "-" return UNARY_OPERATOR; "(" return LEFT_PAREN; ")" return RIGHT_PAREN; ( "p" | "q" | "r" ) return ATOMIC; This file is similar to a LEX input file, but instead of regular expressions you must provide one of several terminal formats. (We'll describe these shortly.) For example, the last rules says that an atomic proposition is either I

, I, or I. =head2 Generating and Running the Generator Next, use these input files to generate the generator: $ yagg -m logical_expression.yg logical_expression.lg And then run the generator: $ ./output/progs/generator 3 You can use the B<-r> flag to run the generator for a given length. You can also use the B<-o> flag to place the output in a different directory. B<-o> can specify a directory on a remote machine using any of the B formats. In this case, B<-m> and B<-r> will execute on the remote machine instead of the local one. =head2 Terminal Generation Formats You'll notice that the generator prints out a number of very similar expressions, such as I

versus I. You can control how terminals are generated by using one of several terminal specification formats: =over =item Simple String (C<"p">) A simple string is used wherever the terminal appears. =item Simple Alternation (C<( "p" | "q" )>) When such a terminal appears, multiple versions will be generated, one for each alternative. =item Equivalence Alternation (C<[ "p" | "q" ]>) Terminals of this type are treated as belonging to an equivalence class. That is, a "q" is only generated if the current string already has a "p". This is useful for terminals that represent things like variable names. It's easier to explain equivalence with an example. Suppose you use the simple alternation: C<"p" | "q"> for a terminal C. The output for C will be: p p p q q p q q In this case, the first and last generated string are basically the same because they both have identical elements. Similarly, the second and third generated string are the same since they both have different elements. As far as our logical expressions example, we only really care about the first two cases. We can use the equivalence alternation form to get what we want: C<[ "p" | "q" ]>. In this case, the output will be: p p p q =item Equivalence Generator (C<[ "p#" ]>) This form is the same as the equivalence alternation, except that the names are generated as they are needed by substituting "C<#>" in the string with a number starting at 1, and the number of generated strings is unbounded. For example, if you use C<[ "p#" ]> for C, the output for C will be: p1 p1 p1 p2 =back If we want to remove the generation of strings with semantically redundant atomic propositions, we would need to modify the C terminal specification like so: [ "p" | "q" | "r" ] return ATOMIC; If you now rerun the B and string generator: $ yagg -r 3 logical_expression.yg logical_expression.lg you will see that the strings have been constrained. Also note that the generation is faster this time---the comilation process only needs to recompile the implementation of the C generator component. =head2 More Information This example is in the F directory of the distribution. See the C file. =head1 EXAMPLE: GENERATOR IMPLEMENTED FROM AN EXISTING PARSER In this second example, let's say that you've already created a parser for your format using LEX and YACC, or FLEX and Bison. If you have, you probably already have the following: =over =item * A F input file for LEX/FLEX =item * A F input file for YACC/Bison =item * An include file for headers needed by the parser and lexer =item * Extra files for custom data types and whatnot needed for parsing =back The following instructions tell you how to convert them into input files for B. In this example, I will use the example parser located in the F directory of the distribution. (The F<.y> and F<.l> files are in the F subdirectory.) The resulting files are in the F directory, along with a F file. (By the way, if you have problems or have to alter the steps while working with your own files, please let me know so that I can update this tutorial.) =head2 Prepare your terminal F file =over =item 1. Copy F to F =item 2. Delete everything in the definitions section, except for an C<%{ ... %}> and C<%option prefix=...>, if they are there. =item 3. In the rules section: =over =item 1. Delete everything in the action blocks except any C statements =item 2. Delete any start conditions from the rules =item 3. Delete any rules with empty action blocks =item 4. Replace any regular expressions in the rules with appropriate generators. (See above.) =back =item 4. In the code header block, comment out the #include for the parser-generated header file. (This would normally by F, but you may have renamed it.) =item 5. If you used the B<-Pfoo> flag to change the symbol prefix from I, you'll need to add an C<%option prefix=foo>. =back =head2 Prepare your nonterminal F file =over =item 6. Copy F to F =item 7. For any rule that has a side-effect, add another "unaction" block immediately following that reverses the effect. You don't need to worry about any changes to C<$$>---they will be handled automatically. If you have an undo block, and if any of the positional parameters such as C<$1> are pointers, you only need to delete them if (1) they would normally be deleted in the action block, and (2) if you actually use them in the unaction block. If you compare the F<.yg> file to the original F<.y> file, you'll see that I added only one undo block, because all the other action blocks only manipulated C<$$>. =item 8. Any functions that you call should not have side effects. If they do, you'll need to reverse them in the unaction block(s). =back =head2 Prepare your include file and additional files =over =item 9. Create a directory like F. =item 10. Put all your source code in that directory, being careful to create proper subdirectories if you need them. For example, if you have C<#include "module/header.h">, you'd want to put F in F. (You don't however, need to change the include to C<#include> at all. The default C++ extension is .cc--you may need to rename your files. (Alternatively, you can copy the GNUmakefile and edit the value of CCFILEEXT.) =back =head2 Generate the generator =over =item 11. Run the following command to create the C++ generator code. ./yagg -f -u input_user_code_myproject yacc.yg lex.lg =back =head2 Build the generator =over =item 12. By default, the output is placed in the F directory. Change to that directory and run GNU make to compile it. cd output make =back You can do this automatically by giving B the -m flag =head2 Run the generator =over =item 13. Generate the strings by running the F program in the F directory, giving it a positive integer for the length of the strings to generate. ./progs/generate 9 =back Specify any length for the generated strings you want. An error message will be displayed if you choose too small a number. =head2 More Information This example is in the F directory of the distribution. See the F file. =head1 CONTROLLING GENERATED STRINGS The most common problem you will encounter is exponential explosion of strings. If B produces strings that you think it should not, first check that the grammar actually says what you think it says. For example, consider this grammar, which generates strings of "a" and "b" in any order (see F in F): node_list : node_list node | node ; node : "a" | "b" ; Obviously we could move C to the terminal file and use equivalence classes to limit the generated strings. But for now, let's assume explore ways to limit the strings using the grammar only. One way is to force all "a"'s to precede all "b"'s. We can do this by modifying the grammar (see F in F): node_list : a_list b_list | a_list | b_list ; a_list : a_list "a" | "a" ; b_list : b_list "b" | "b" ; to change an unordered list of "a"'s and "b"'s so that "b"'s will follow "a"'s. Another approach might be to keep the same grammar but add context-sensitive checks (see F in F). In order to do this, we need to pass the names of the nodes back up the grammar productions, which means that we need to define some return types and add action blocks: // Include some declarations that we need %{ #include #include using namespace std; %} // Define the return types %union { list text_list; string text; } // Define the grammar production return types %type node_list %type node %% node_list : node_list node { if ($$.size() > 0 && $$.back() == "b" && $2 == "a") yyerror("\"a\" can't follow \"b\"!"); else { $$.push_back($2); } } | node { $$.push_back($1); }; node : "a" { $$ = $1; } | "b" { $$ = $1; } ; In this case, we check the node_list as we build it to make sure that we never add an "a" after a "b". The call to C is how the generator knows that the string is invalid. Note that I didn't need to write any unaction blocks because the actions did not change any state other than C<$$>. Which method is better? I would use the latter only for constraints which truly are context sensitive. Restructuring the grammar also results in faster string generation because there is no time wasted generating strings that will only be invalidated later. NOTE: this grammar takes advantage of a feature of the generator--that the C<%union> isn't really used as a union, so you can define types of different (and dynamic) sizes. Bad things would happen if you tried to use this grammar with YACC or Bison. To fix it, you would need to change the C<%union> types to pointers and add C and C to the action blocks, and change "a" and "b" to tokens. See F in F for these changes. You can also compare this to the F file for the parser in the F subdirectory. For a more realistic example of limiting the generated strings, compare the F and F examples in the F directory of the distribution. =head1 CONTROLLING OUTPUT FORMATTING To control output, first create a F directory. Run yagg once so that it creates the F directory. Copy F to F. Modify the C function, which inserts a space by default between strings and adds a newline at the end. Now, when you run yagg, add the B<-u user_code> option so that your modified F will override the normal generated one. If you want more control, you can remove the whitespace insertion in C, then make whitespace an explicit part of your grammar. =head1 AUTHOR David Coppit =cut