This is flex.info, produced by makeinfo version 4.5 from flex.texi.

INFO-DIR-SECTION Programming
START-INFO-DIR-ENTRY
* flex: (flex). Fast lexical analyzer generator (lex replacement).
END-INFO-DIR-ENTRY


   The flex manual is placed under the same licensing conditions as the
rest of flex:

   Copyright (C) 1990, 1997 The Regents of the University of California.
All rights reserved.

   This code is derived from software contributed to Berkeley by Vern
Paxson.

   The United States Government has rights in this work pursuant to
contract no. DE-AC03-76SF00098 between the United States Department of
Energy and the University of California.

   Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

  1. Redistributions of source code must retain the above copyright
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the
     distribution.
   Neither the name of the University nor the names of its contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.

   THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

File: flex.info, Node: Top, Next: Copyright, Prev: (dir), Up: (dir)

flex
****

   This manual describes `flex', a tool for generating programs that
perform pattern-matching on text. The manual includes both tutorial and
reference sections.

   This edition of `The flex Manual' documents `flex' version 2.5.33.
It was last updated on 20 February 2006.

* Menu:

* Copyright::
* Reporting Bugs::
* Introduction::
* Simple Examples::
* Format::
* Patterns::
* Matching::
* Actions::
* Generated Scanner::
* Start Conditions::
* Multiple Input Buffers::
* EOF::
* Misc Macros::
* User Values::
* Yacc::
* Scanner Options::
* Performance::
* Cxx::
* Reentrant::
* Lex and Posix::
* Memory Management::
* Serialized Tables::
* Diagnostics::
* Limitations::
* Bibliography::
* FAQ::
* Appendices::
* Indices::

 --- The Detailed Node Listing ---

Format of the Input File

* Definitions Section::
* Rules Section::
* User Code Section::
* Comments in the Input::

Scanner Options

* Options for Specifing Filenames::
* Options Affecting Scanner Behavior::
* Code-Level And API Options::
* Options for Scanner Speed and Size::
* Debugging Options::
* Miscellaneous Options::

Reentrant C Scanners

* Reentrant Uses::
* Reentrant Overview::
* Reentrant Example::
* Reentrant Detail::
* Reentrant Functions::

The Reentrant API in Detail

* Specify Reentrant::
* Extra Reentrant Argument::
* Global Replacement::
* Init and Destroy Functions::
* Accessor Methods::
* Extra Data::
* About yyscan_t::

Memory Management

* The Default Memory Management::
* Overriding The Default Memory Management::
* A Note About yytext And Memory::

Serialized Tables

* Creating Serialized Tables::
* Loading and Unloading Serialized Tables::
* Tables File Format::

FAQ

* When was flex born?::
* How do I expand \ escape sequences in C-style quoted strings?::
* Why do flex scanners call fileno if it is not ANSI compatible?::
* Does flex support recursive pattern definitions?::
* How do I skip huge chunks of input (tens of megabytes) while using flex?::
* Flex is not matching my patterns in the same order that I defined them.::
* My actions are executing out of order or sometimes not at all.::
* How can I have multiple input sources feed into the same scanner at the same time?::
* Can I build nested parsers that work with the same input file?::
* How can I match text only at the end of a file?::
* How can I make REJECT cascade across start condition boundaries?::
* Why cant I use fast or full tables with interactive mode?::
* How much faster is -F or -f than -C?::
* If I have a simple grammar cant I just parse it with flex?::
* Why doesnt yyrestart() set the start state back to INITIAL?::
* How can I match C-style comments?::
* The period isnt working the way I expected.::
* Can I get the flex manual in another format?::
* Does there exist a "faster" NDFA->DFA algorithm?::
* How does flex compile the DFA so quickly?::
* How can I use more than 8192 rules?::
* How do I abandon a file in the middle of a scan and switch to a new file?::
* How do I execute code only during initialization (only before the first scan)?::
* How do I execute code at termination?::
* Where else can I find help?::
* Can I include comments in the "rules" section of the file?::
* I get an error about undefined yywrap().::
* How can I change the matching pattern at run time?::
* How can I expand macros in the input?::
* How can I build a two-pass scanner?::
* How do I match any string not matched in the preceding rules?::
* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
* Is there a way to make flex treat NULL like a regular character?::
* Whenever flex can not match the input it says "flex scanner jammed".::
* Why doesnt flex have non-greedy operators like perl does?::
* Memory leak - 16386 bytes allocated by malloc.::
* How do I track the byte offset for lseek()?::
* How do I use my own I/O classes in a C++ scanner?::
* How do I skip as many chars as possible?::
* deleteme00::
* Are certain equivalent patterns faster than others?::
* Is backing up a big deal?::
* Can I fake multi-byte character support?::
* deleteme01::
* Can you discuss some flex internals?::
* unput() messes up yy_at_bol::
* The | operator is not doing what I want::
* Why can't flex understand this variable trailing context pattern?::
* The ^ operator isn't working::
* Trailing context is getting confused with trailing optional patterns::
* Is flex GNU or not?::
* ERASEME53::
* I need to scan if-then-else blocks and while loops::
* ERASEME55::
* ERASEME56::
* ERASEME57::
* Is there a repository for flex scanners?::
* How can I conditionally compile or preprocess my flex input file?::
* Where can I find grammars for lex and yacc?::
* I get an end-of-buffer message for each character scanned.::
* unnamed-faq-62::
* unnamed-faq-63::
* unnamed-faq-64::
* unnamed-faq-65::
* unnamed-faq-66::
* unnamed-faq-67::
* unnamed-faq-68::
* unnamed-faq-69::
* unnamed-faq-70::
* unnamed-faq-71::
* unnamed-faq-72::
* unnamed-faq-73::
* unnamed-faq-74::
* unnamed-faq-75::
* unnamed-faq-76::
* unnamed-faq-77::
* unnamed-faq-78::
* unnamed-faq-79::
* unnamed-faq-80::
* unnamed-faq-81::
* unnamed-faq-82::
* unnamed-faq-83::
* unnamed-faq-84::
* unnamed-faq-85::
* unnamed-faq-86::
* unnamed-faq-87::
* unnamed-faq-88::
* unnamed-faq-90::
* unnamed-faq-91::
* unnamed-faq-92::
* unnamed-faq-93::
* unnamed-faq-94::
* unnamed-faq-95::
* unnamed-faq-96::
* unnamed-faq-97::
* unnamed-faq-98::
* unnamed-faq-99::
* unnamed-faq-100::
* unnamed-faq-101::
* What is the difference between YYLEX_PARAM and YY_DECL?::
* Why do I get "conflicting types for yylex" error?::
* How do I access the values set in a Flex action from within a Bison action?::

Appendices

* Makefiles and Flex::
* Bison Bridge::
* M4 Dependency::

Indices

* Concept Index::
* Index of Functions and Macros::
* Index of Variables::
* Index of Data Types::
* Index of Hooks::
* Index of Scanner Options::


File: flex.info, Node: Copyright, Next: Reporting Bugs, Prev: Top, Up: Top

Copyright
*********


   The flex manual is placed under the same licensing conditions as the
rest of flex:

   Copyright (C) 1990, 1997 The Regents of the University of California.
All rights reserved.

   This code is derived from software contributed to Berkeley by Vern
Paxson.

   The United States Government has rights in this work pursuant to
contract no. DE-AC03-76SF00098 between the United States Department of
Energy and the University of California.

   Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

  1. Redistributions of source code must retain the above copyright
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the
     distribution.
   Neither the name of the University nor the names of its contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.

   THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

File: flex.info, Node: Reporting Bugs, Next: Introduction, Prev: Copyright, Up: Top

Reporting Bugs
**************

   If you have problems with `flex' or think you have found a bug,
please send mail detailing your problem to
<flex-help@lists.sourceforge.net>. Patches are always welcome.


File: flex.info, Node: Introduction, Next: Simple Examples, Prev: Reporting Bugs, Up: Top

Introduction
************

   `flex' is a tool for generating "scanners". A scanner is a program
which recognizes lexical patterns in text. The `flex' program reads
the given input files, or its standard input if no file names are
given, for a description of a scanner to generate. The description is
in the form of pairs of regular expressions and C code, called "rules".
`flex' generates as output a C source file, `lex.yy.c' by default,
which defines a routine `yylex()'. This file can be compiled and
linked with the flex runtime library to produce an executable. When
the executable is run, it analyzes its input for occurrences of the
regular expressions. Whenever it finds one, it executes the
corresponding C code.


File: flex.info, Node: Simple Examples, Next: Format, Prev: Introduction, Up: Top

Some Simple Examples
********************

   First some simple examples to get the flavor of how one uses `flex'.

   The following `flex' input specifies a scanner which, when it
encounters the string `username', will replace it with the user's login
name:


     %%
     username    printf( "%s", getlogin() );

   By default, any text not matched by a `flex' scanner is copied to
the output, so the net effect of this scanner is to copy its input file
to its output with each occurrence of `username' expanded. In this
input, there is just one rule. `username' is the "pattern" and the
`printf' is the "action". The `%%' symbol marks the beginning of the
rules.

   Here's another simple example:


             int num_lines = 0, num_chars = 0;

     %%
     \n      ++num_lines; ++num_chars;
     .       ++num_chars;

     %%
     int main()
             {
             yylex();
             printf( "# of lines = %d, # of chars = %d\n",
                     num_lines, num_chars );
             return 0;
             }

   This scanner counts the number of characters and the number of lines
in its input. It produces no output other than the final report on the
character and line counts. The first line declares two globals,
`num_lines' and `num_chars', which are accessible both inside `yylex()'
and in the `main()' routine declared after the second `%%'. There are
two rules, one which matches a newline (`\n') and increments both the
line count and the character count, and one which matches any character
other than a newline (indicated by the `.' regular expression).

   A somewhat more complicated example:


     /* scanner for a toy Pascal-like language */

     %{
     /* atoi() and atof(), used below, are declared here */
     #include <stdlib.h>
     %}

     DIGIT    [0-9]
     ID       [a-z][a-z0-9]*

     %%

     {DIGIT}+    {
                 printf( "An integer: %s (%d)\n", yytext,
                         atoi( yytext ) );
                 }

     {DIGIT}+"."{DIGIT}*        {
                 printf( "A float: %s (%g)\n", yytext,
                         atof( yytext ) );
                 }

     if|then|begin|end|procedure|function        {
                 printf( "A keyword: %s\n", yytext );
                 }

     {ID}        printf( "An identifier: %s\n", yytext );

     "+"|"-"|"*"|"/"   printf( "An operator: %s\n", yytext );

     "{"[^{}\n]*"}"     /* eat up one-line comments */

     [ \t\n]+          /* eat up whitespace */

     .           printf( "Unrecognized character: %s\n", yytext );

     %%

     int main( int argc, char **argv )
         {
         ++argv, --argc;  /* skip over program name */
         if ( argc > 0 )
                 yyin = fopen( argv[0], "r" );
         else
                 yyin = stdin;

         yylex();
         return 0;
         }

   These are the beginnings of a simple scanner for a language like
Pascal. It identifies different types of "tokens" and reports on what
it has seen.

   The details of this example will be explained in the following
sections.


File: flex.info, Node: Format, Next: Patterns, Prev: Simple Examples, Up: Top

Format of the Input File
************************

   The `flex' input file consists of three sections, separated by a
line containing only `%%'.


     definitions
     %%
     rules
     %%
     user code

* Menu:

* Definitions Section::
* Rules Section::
* User Code Section::
* Comments in the Input::


File: flex.info, Node: Definitions Section, Next: Rules Section, Prev: Format, Up: Format

Format of the Definitions Section
=================================

   The "definitions section" contains declarations of simple "name"
definitions to simplify the scanner specification, and declarations of
"start conditions", which are explained in a later section.

   Name definitions have the form:


     name definition

   The `name' is a word beginning with a letter or an underscore (`_')
followed by zero or more letters, digits, `_', or `-' (dash). The
definition is taken to begin at the first non-whitespace character
following the name and to continue to the end of the line. The
definition can subsequently be referred to using `{name}', which will
expand to `(definition)'. For example,


     DIGIT    [0-9]
     ID       [a-z][a-z0-9]*

   defines `DIGIT' to be a regular expression which matches a single
digit, and `ID' to be a regular expression which matches a letter
followed by zero-or-more letters-or-digits. A subsequent reference to


     {DIGIT}+"."{DIGIT}*

   is identical to


     ([0-9])+"."([0-9])*

   and matches one-or-more digits followed by a `.' followed by
zero-or-more digits.
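
   As a rough illustration of this textual expansion, the C sketch
below replaces each `{name}' reference with its parenthesized
definition. It is not how `flex' itself is implemented: `struct
name_def' and `expand_names()' are hypothetical names introduced only
for this example, and the sketch ignores quoting, so a `{' inside a
quoted string is treated like any other.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One entry of a hypothetical name-definition table. */
struct name_def {
    const char *name;
    const char *definition;
};

/*
 * Expand every {name} reference in `pattern' into "(definition)" and
 * write the result to `out'.  References to unknown names, and braces
 * that never close, are copied through unchanged.  Returns 0 on
 * success, -1 if `out' is too small.
 */
int expand_names(const char *pattern, const struct name_def *defs,
                 size_t ndefs, char *out, size_t outsize)
{
    size_t used = 0;

    assert(pattern != NULL && out != NULL && outsize > 0);
    out[0] = '\0';

    while (*pattern != '\0') {
        const char *end = (*pattern == '{') ? strchr(pattern, '}') : NULL;
        const char *def = NULL;
        size_t i;
        int n;

        if (end != NULL) {
            /* Look the text between the braces up in the table. */
            size_t len = (size_t)(end - pattern - 1);
            for (i = 0; i < ndefs; i++) {
                if (strlen(defs[i].name) == len &&
                    strncmp(defs[i].name, pattern + 1, len) == 0) {
                    def = defs[i].definition;
                    break;
                }
            }
        }

        if (def != NULL) {
            /* Known name: emit the definition wrapped in parentheses. */
            n = snprintf(out + used, outsize - used, "(%s)", def);
            pattern = end + 1;
        } else {
            /* Anything else: copy one character through. */
            n = snprintf(out + used, outsize - used, "%c", *pattern);
            pattern++;
        }
        if (n < 0 || (size_t)n >= outsize - used)
            return -1;
        used += (size_t)n;
    }
    return 0;
}
```

   With a table holding `DIGIT' and `ID' as defined above, expanding
`{DIGIT}+"."{DIGIT}*' yields `([0-9])+"."([0-9])*'. The parentheses
keep a trailing operator such as `+' applied to the whole definition
rather than to its last element.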

   An unindented comment (i.e., a line beginning with `/*') is copied
verbatim to the output up to the next `*/'.

   Any _indented_ text or text enclosed in `%{' and `%}' is also copied
verbatim to the output (with the %{ and %} symbols removed). The %{
and %} symbols must appear unindented on lines by themselves.

   A `%top' block is similar to a `%{' ... `%}' block, except that the
code in a `%top' block is relocated to the _top_ of the generated file,
before any flex definitions (1). The `%top' block is useful when you
want certain preprocessor macros to be defined or certain files to be
included before the generated code. The single characters, `{' and
`}' are used to delimit the `%top' block, as shown in the example below:


     %top{
         /* This code goes at the "top" of the generated file. */
         #include <stdint.h>
         #include <inttypes.h>
     }

   Multiple `%top' blocks are allowed, and their order is preserved.

   ---------- Footnotes ----------

   (1) Actually, `yyIN_HEADER' is defined before the `%top' block.


File: flex.info, Node: Rules Section, Next: User Code Section, Prev: Definitions Section, Up: Format

Format of the Rules Section
===========================

   The "rules" section of the `flex' input contains a series of rules
of the form:


     pattern   action

   where the pattern must be unindented and the action must begin on
the same line. *Note Patterns::, for a further description of patterns
and actions.

   In the rules section, any indented or %{ %} enclosed text appearing
before the first rule may be used to declare variables which are local
to the scanning routine and (after the declarations) code which is to be
executed whenever the scanning routine is entered. Other indented or
%{ %} text in the rule section is still copied to the output, but its
meaning is not well-defined and it may well cause compile-time errors
(this feature is present for POSIX compliance. *Note Lex and Posix::,
for other such features).

   Any _indented_ text or text enclosed in `%{' and `%}' is copied
verbatim to the output (with the %{ and %} symbols removed). The %{
and %} symbols must appear unindented on lines by themselves.


File: flex.info, Node: User Code Section, Next: Comments in the Input, Prev: Rules Section, Up: Format

Format of the User Code Section
===============================

   The user code section is simply copied to `lex.yy.c' verbatim. It
is used for companion routines which call or are called by the scanner.
The presence of this section is optional; if it is missing, the second
`%%' in the input file may be skipped, too.


File: flex.info, Node: Comments in the Input, Prev: User Code Section, Up: Format

Comments in the Input
=====================

   Flex supports C-style comments, that is, anything between /* and */
is considered a comment. Whenever flex encounters a comment, it copies
the entire comment verbatim to the generated source code. Comments may
appear just about anywhere, but with the following exceptions:

   * Comments may not appear in the Rules Section wherever flex is
     expecting a regular expression. This means comments may not appear
     at the beginning of a line, or immediately following a list of
     scanner states.

   * Comments may not appear on an `%option' line in the Definitions
     Section.

   If you want to follow a simple rule, then always begin a comment on a
new line, with one or more whitespace characters before the initial
`/*'. This rule will work anywhere in the input file.

   All the comments in the following example are valid:


     %{
     /* code block */
     %}

     /* Definitions Section */
     %x STATE_X

     %%
         /* Rules Section */
     ruleA   /* after regex */ { /* code block */ } /* after code block */
             /* Rules Section (indented) */
     <STATE_X>{
     ruleC   ECHO;
     ruleD   ECHO;
     %{
     /* code block */
     %}
     }
     %%
     /* User Code Section */


File: flex.info, Node: Patterns, Next: Matching, Prev: Format, Up: Top

Patterns
********

   The patterns in the input (see *Note Rules Section::) are written
using an extended set of regular expressions. These are:

`x'
     match the character 'x'

`.'
     any character (byte) except newline

`[xyz]'
     a "character class"; in this case, the pattern matches either an
     'x', a 'y', or a 'z'

`[abj-oZ]'
     a "character class" with a range in it; matches an 'a', a 'b', any
     letter from 'j' through 'o', or a 'Z'

`[^A-Z]'
     a "negated character class", i.e., any character but those in the
     class. In this case, any character EXCEPT an uppercase letter.

`[^A-Z\n]'
     any character EXCEPT an uppercase letter or a newline

`r*'
     zero or more r's, where r is any regular expression

`r+'
     one or more r's

`r?'
     zero or one r's (that is, "an optional r")

`r{2,5}'
     anywhere from two to five r's

`r{2,}'
     two or more r's

`r{4}'
     exactly 4 r's

`{name}'
     the expansion of the `name' definition (*note Format::).

`"[xyz]\"foo"'
     the literal string: `[xyz]"foo'

`\X'
     if X is `a', `b', `f', `n', `r', `t', or `v', then the ANSI-C
     interpretation of `\x'. Otherwise, a literal `X' (used to escape
     operators such as `*')

`\0'
     a NUL character (ASCII code 0)

`\123'
     the character with octal value 123

`\x2a'
     the character with hexadecimal value 2a

`(r)'
     match an `r'; parentheses are used to override precedence (see
     below)

`rs'
     the regular expression `r' followed by the regular expression `s';
     called "concatenation"

`r|s'
     either an `r' or an `s'

`r/s'
     an `r' but only if it is followed by an `s'. The text matched by
     `s' is included when determining whether this rule is the longest
     match, but is then returned to the input before the action is
     executed. So the action only sees the text matched by `r'. This
     type of pattern is called "trailing context". (There are some
     combinations of `r/s' that flex cannot match correctly. *Note
     Limitations::, regarding dangerous trailing context.)

`^r'
     an `r', but only at the beginning of a line (i.e., when just
     starting to scan, or right after a newline has been scanned).

`r$'
     an `r', but only at the end of a line (i.e., just before a
     newline). Equivalent to `r/\n'.

     Note that `flex''s notion of "newline" is exactly whatever the C
     compiler used to compile `flex' interprets `\n' as; in particular,
     on some DOS systems you must either filter out `\r's in the input
     yourself, or explicitly use `r/\r\n' for `r$'.

`<s>r'
     an `r', but only in start condition `s' (see *Note Start
     Conditions:: for discussion of start conditions).

`<s1,s2,s3>r'
     same, but in any of start conditions `s1', `s2', or `s3'.

`<*>r'
     an `r' in any start condition, even an exclusive one.

`<<EOF>>'
     an end-of-file.

`<s1,s2><<EOF>>'
     an end-of-file when in start condition `s1' or `s2'

   Note that inside of a character class, all regular expression
operators lose their special meaning except escape (`\') and the
character class operators, `-', `]', and, at the beginning of the
class, `^'.
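
   The two lengths involved in trailing context (the `r/s' entry
above) can be sketched in C. The sketch below is only an illustration
under simplifying assumptions: `r' and `s' are plain literal strings
rather than regular expressions, and `struct tc_match' and
`match_trailing_context()' are hypothetical names, not part of `flex'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Result of trying a trailing-context rule r/s at the start of input. */
struct tc_match {
    size_t compared_len; /* length used when competing for longest match */
    size_t token_len;    /* length of the text the action actually sees  */
};

/*
 * Try the rule r/s, where r and s are literal strings.  On success
 * returns 1 and fills in *m; returns 0 if the input does not start
 * with r immediately followed by s.
 */
int match_trailing_context(const char *r, const char *s,
                           const char *input, struct tc_match *m)
{
    size_t rlen, slen;

    assert(r != NULL && s != NULL && input != NULL && m != NULL);
    rlen = strlen(r);
    slen = strlen(s);

    /* The second comparison only runs once r has matched, so
       input + rlen is known to lie inside the input string. */
    if (strncmp(input, r, rlen) != 0 ||
        strncmp(input + rlen, s, slen) != 0)
        return 0;

    m->compared_len = rlen + slen; /* s counts toward the longest match */
    m->token_len = rlen;           /* but is returned to the input      */
    return 1;
}
```

   For the rule `foo/bar' and the input `foobarbaz', the sketch reports
a compared length of 6 and a token length of 3, so the action would see
only `foo' and scanning would resume at `bar'.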

   The regular expressions listed above are grouped according to
precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence (see special note on the
precedence of the repeat operator, `{}', under the documentation for
the `--posix' POSIX compliance option). For example,


     foo|bar*

   is the same as


     (foo)|(ba(r*))

   since the `*' operator has higher precedence than concatenation, and
concatenation higher than alternation (`|'). This pattern therefore
matches _either_ the string `foo' _or_ the string `ba' followed by
zero-or-more `r''s. To match `foo' or zero-or-more repetitions of the
string `bar', use:


     foo|(bar)*

   And to match a sequence of zero or more repetitions of `foo' and
`bar':


     (foo|bar)*
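
   These precedence rules can be checked experimentally. The C sketch
below uses the POSIX `regcomp()' and `regexec()' functions as a
stand-in for `flex', on the assumption that POSIX extended regular
expressions give `*', concatenation, and `|' the same relative
precedence. `matches_whole()' is a hypothetical helper written for
this example.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/*
 * Return 1 if `text' as a whole matches the POSIX extended regular
 * expression `pattern', 0 if it does not, and -1 if the pattern is too
 * long or fails to compile.  The pattern is wrapped in ^( ... )$ so
 * that only whole-string matches count.
 */
int matches_whole(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    assert(pattern != NULL && text != NULL);

    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;

    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

   Under that assumption, `matches_whole("foo|bar*", "barrr")' returns
1 while `matches_whole("foo|bar*", "barbar")' returns 0; with
`foo|(bar)*' the two results are reversed.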

   In addition to characters and ranges of characters, character classes
can also contain "character class expressions". These are expressions
enclosed inside `[:' and `:]' delimiters (which themselves must appear
between the `[' and `]' of the character class. Other elements may
occur inside the character class, too). The valid expressions are:


     [:alnum:] [:alpha:] [:blank:]
     [:cntrl:] [:digit:] [:graph:]
     [:lower:] [:print:] [:punct:]
     [:space:] [:upper:] [:xdigit:]

   These expressions all designate a set of characters equivalent to the
corresponding standard C `isXXX' function. For example, `[:alnum:]'
designates those characters for which `isalnum()' returns true - i.e.,
any alphabetic or numeric character. Some systems don't provide
`isblank()', so flex defines `[:blank:]' as a blank or a tab.

   For example, the following character classes are all equivalent:


     [[:alnum:]]
     [[:alpha:][:digit:]]
     [[:alpha:][0-9]]
     [a-zA-Z0-9]
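
   The equivalence can be spot-checked with the `<ctype.h>' functions
that the expressions are defined in terms of. The following C sketch
assumes the default "C" locale and an ASCII character set (so that
`a'-`z', `A'-`Z' and `0'-`9' are each contiguous); the function names
are invented for this example.

```c
#include <assert.h>
#include <ctype.h>

/* Hand-written membership test for the explicit class [a-zA-Z0-9]. */
static int in_explicit_alnum(int c)
{
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9');
}

/*
 * Return 1 if [[:alnum:]], [[:alpha:][:digit:]] and [a-zA-Z0-9] all
 * agree on whether the 7-bit character code c is a member.
 */
int alnum_classes_agree(int c)
{
    int by_alnum, by_union, by_ranges;

    assert(c >= 0 && c <= 127);
    by_alnum = isalnum(c) != 0;
    by_union = isalpha(c) != 0 || isdigit(c) != 0;
    by_ranges = in_explicit_alnum(c);
    return by_alnum == by_union && by_alnum == by_ranges;
}

/* [:blank:] as the text describes it: a blank or a tab. */
int is_flex_blank(int c)
{
    return c == ' ' || c == '\t';
}
```

   Calling `alnum_classes_agree()' for every code from 0 to 127 is
expected to return 1 each time under the stated assumptions; in other
locales `isalnum()' may accept additional characters.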

   Some notes on patterns are in order.

   * If your scanner is case-insensitive (the `-i' flag), then
     `[:upper:]' and `[:lower:]' are equivalent to `[:alpha:]'.

   * Character classes with ranges, such as `[a-Z]', should be used with
     caution in a case-insensitive scanner if the range spans upper or
     lowercase characters. Flex does not know if you want to fold all
     upper and lowercase characters together, or if you want the
     literal numeric range specified (with no case folding). When in
     doubt, flex will assume that you meant the literal numeric range,
     and will issue a warning. The exception to this rule is a
     character range such as `[a-z]' or `[S-W]' where it is obvious
     that you want case-folding to occur. Here are some examples with
     the `-i' flag enabled:

          Range     Result      Literal Range         Alternate Range
          `[a-t]'   ok          `[a-tA-T]'
          `[A-T]'   ok          `[a-tA-T]'
          `[A-t]'   ambiguous   `[A-Z\[\\\]_`a-t]'    `[a-tA-T]'
          `[_-{]'   ambiguous   `[_`a-z{]'            `[_`a-zA-Z{]'
          `[@-C]'   ambiguous   `[@ABC]'              `[@A-Z\[\\\]_`abc]'

   * A negated character class such as the example `[^A-Z]' above
     _will_ match a newline unless `\n' (or an equivalent escape
     sequence) is one of the characters explicitly present in the
     negated character class (e.g., `[^A-Z\n]'). This is unlike how
     many other regular expression tools treat negated character
     classes, but unfortunately the inconsistency is historically
     entrenched. Matching newlines means that a pattern like `[^"]*'
     can match the entire input unless there's another quote in the
     input.

   * A rule can have at most one instance of trailing context (the `/'
     operator or the `$' operator). The start condition, `^', and
     `<<EOF>>' patterns can only occur at the beginning of a pattern,
     and, like `/' and `$', cannot be grouped inside
     parentheses. A `^' which does not occur at the beginning of a
     rule or a `$' which does not occur at the end of a rule loses its
     special properties and is treated as a normal character.

   * The following are invalid:


          foo/bar$
          <sc1>foo<sc2>bar

     Note that the first of these can be written `foo/bar\n'.

   * The following will result in `$' or `^' being treated as a normal
     character:


          foo|(bar$)
          foo|^bar

     If the desired meaning is a `foo' or a
     `bar'-followed-by-a-newline, the following could be used (the
     special `|' action is explained below, *note Actions::):


          foo      |
          bar$     /* action goes here */

     A similar trick will work for matching a `foo' or a
     `bar'-at-the-beginning-of-a-line.
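
   The note above on negated character classes, which match newlines
unless `\n' is listed, can be imitated with the standard C function
`strcspn()', which measures a run of characters _not_ in a given set.
This is only an analogy for a pattern of the form `[^...]*', not a
description of how `flex' matches; `negated_class_span()' is a
hypothetical wrapper.

```c
#include <assert.h>
#include <string.h>

/*
 * Length of the text that a pattern of the form [^...]* would match at
 * the start of `input', where `excluded' lists the characters inside
 * the negated class.  strcspn() stops only at a character that appears
 * in `excluded', so a newline is skipped over like any other character
 * unless it is listed.
 */
size_t negated_class_span(const char *input, const char *excluded)
{
    assert(input != NULL && excluded != NULL);
    return strcspn(input, excluded);
}
```

   For the input `abc\ndef"ghi' (with `\n' standing for a real
newline), excluding only `"' gives a span of 7, which runs straight
across the newline, while excluding both `"' and `\n' gives a span of 3.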


File: flex.info, Node: Matching, Next: Actions, Prev: Patterns, Up: Top

How the Input Is Matched
************************

   When the generated scanner is run, it analyzes its input looking for
strings which match any of its patterns. If it finds more than one
match, it takes the one matching the most text (for trailing context
rules, this includes the length of the trailing part, even though it
will then be returned to the input). If it finds two or more matches of
the same length, the rule listed first in the `flex' input file is
chosen.
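
   The selection rule can be sketched in C. In the sketch below each
rule is represented by a function that reports how many characters it
matches at the start of the input; this representation, and the names
`matcher_fn', `match_if()', `match_id()' and `choose_rule()', are
assumptions made for the example and say nothing about how the
generated scanner is actually implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A rule reports how many leading characters of the input it matches. */
typedef size_t (*matcher_fn)(const char *input);

/* Stand-in for the pattern `if': matches 2 characters or none. */
size_t match_if(const char *input)
{
    return strncmp(input, "if", 2) == 0 ? 2 : 0;
}

/* Stand-in for the pattern [a-z][a-z0-9]* (ASCII assumed). */
size_t match_id(const char *input)
{
    size_t n = 0;

    if (input[0] < 'a' || input[0] > 'z')
        return 0;
    while ((input[n] >= 'a' && input[n] <= 'z') ||
           (input[n] >= '0' && input[n] <= '9'))
        n++;
    return n;
}

/*
 * Choose the rule to apply: the longest match wins, and among equally
 * long matches the rule listed first wins.  Returns the rule index and
 * stores the match length in *match_len, or returns -1 with a length
 * of 0 if no rule matches.
 */
int choose_rule(const matcher_fn rules[], size_t nrules,
                const char *input, size_t *match_len)
{
    int best = -1;
    size_t best_len = 0;
    size_t i;

    assert(rules != NULL && input != NULL && match_len != NULL);

    for (i = 0; i < nrules; i++) {
        size_t len = rules[i](input);
        /* Strictly greater, so an earlier rule keeps a tie. */
        if (len > best_len) {
            best = (int)i;
            best_len = len;
        }
    }
    *match_len = best_len;
    return best;
}
```

   With the rules listed as `match_if' then `match_id', the input
`if x' selects rule 0 (both match two characters, and the first listed
wins) while `iffy' selects rule 1 (the longer match wins). A return
value of -1 corresponds to the case where no rule matches.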

   Once the match is determined, the text corresponding to the match
(called the "token") is made available in the global character pointer
`yytext', and its length in the global integer `yyleng'. The "action"
corresponding to the matched pattern is then executed (*note
Actions::), and then the remaining input is scanned for another match.

   If no match is found, then the "default rule" is executed: the next
character in the input is considered matched and copied to the standard
output. Thus, the simplest valid `flex' input is:


     %%

   which generates a scanner that simply copies its input (one
character at a time) to its output.

   Note that `yytext' can be defined in two different ways: either as a
character _pointer_ or as a character _array_. You can control which
definition `flex' uses by including one of the special directives
`%pointer' or `%array' in the first (definitions) section of your flex
input. The default is `%pointer', unless you use the `-l' lex
compatibility option, in which case `yytext' will be an array. The
advantage of using `%pointer' is substantially faster scanning and no
buffer overflow when matching very large tokens (unless you run out of
dynamic memory). The disadvantage is that you are restricted in how
your actions can modify `yytext' (*note Actions::), and calls to the
`unput()' function destroy the present contents of `yytext', which can
be a considerable porting headache when moving between different `lex'
versions.

   The advantage of `%array' is that you can then modify `yytext' to
your heart's content, and calls to `unput()' do not destroy `yytext'
(*note Actions::). Furthermore, existing `lex' programs sometimes
access `yytext' externally using declarations of the form:


     extern char yytext[];

   This definition is erroneous when used with `%pointer', but correct
for `%array'.

   The `%array' declaration defines `yytext' to be an array of `YYLMAX'
characters, which defaults to a fairly large value. You can change the
size by simply #define'ing `YYLMAX' to a different value in the first
section of your `flex' input. As mentioned above, with `%pointer'
yytext grows dynamically to accommodate large tokens. While this means
your `%pointer' scanner can accommodate very large tokens (such as
matching entire blocks of comments), bear in mind that each time the
scanner must resize `yytext' it also must rescan the entire token from
the beginning, so matching such tokens can prove slow. `yytext'
presently does _not_ dynamically grow if a call to `unput()' results in
too much text being pushed back; instead, a run-time error results.

   Also note that you cannot use `%array' with C++ scanner classes
(*note Cxx::).


File: flex.info, Node: Actions, Next: Generated Scanner, Prev: Matching, Up: Top

Actions
*******

   Each pattern in a rule has a corresponding "action", which can be
any arbitrary C statement. The pattern ends at the first non-escaped
whitespace character; the remainder of the line is its action. If the
action is empty, then when the pattern is matched the input token is
simply discarded. For example, here is the specification for a program
which deletes all occurrences of `zap me' from its input:


     %%
     "zap me"

   This example will copy all other characters in the input to the
output since they will be matched by the default rule.
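
   For comparison, the behavior of this scanner can be written out by
hand in C. The sketch below is not what `flex' generates;
`delete_occurrences()' is a hypothetical function that mimics the two
things the scanner does: discard the matched token (the empty action)
and copy any other character through (the default rule).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy `in' to `out', dropping every occurrence of `victim'.
 * Returns 0 on success, or -1 if `out' (of size `outsize') is too
 * small; `out' is NUL-terminated in both cases.
 */
int delete_occurrences(const char *in, const char *victim,
                       char *out, size_t outsize)
{
    size_t vlen, used = 0;

    assert(in != NULL && victim != NULL && out != NULL && outsize > 0);
    vlen = strlen(victim);

    while (*in != '\0') {
        if (vlen > 0 && strncmp(in, victim, vlen) == 0) {
            in += vlen;              /* empty action: discard the token */
        } else {
            if (used + 1 >= outsize) {
                out[used] = '\0';
                return -1;           /* no room for char plus NUL */
            }
            out[used++] = *in++;     /* default rule: copy one character */
        }
    }
    out[used] = '\0';
    return 0;
}
```

   Applied to the text `please zap me now', it produces `please  now'
with two adjacent blanks, since only the matched token is removed.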
939
940 Here is a program which compresses multiple blanks and tabs down to a
941single blank, and throws away whitespace found at the end of a line:
942
943
944 %%
945 [ \t]+ putchar( ' ' );
946 [ \t]+$ /* ignore this token */
947
948 If the action contains a `}', then the action spans till the
949balancing `}' is found, and the action may cross multiple lines.
950`flex' knows about C strings and comments and won't be fooled by braces
951found within them, but also allows actions to begin with `%{' and will
952consider the action to be all the text up to the next `%}' (regardless
953of ordinary braces inside the action).
954
955 An action consisting solely of a vertical bar (`|') means "same as
956the action for the next rule". See below for an illustration.
957
958 Actions can include arbitrary C code, including `return' statements
959to return a value to whatever routine called `yylex()'. Each time
960`yylex()' is called it continues processing tokens from where it last
961left off until it either reaches the end of the file or executes a
962return.
963
964 Actions are free to modify `yytext' except for lengthening it
965(adding characters to its end-these will overwrite later characters in
966the input stream). This however does not apply when using `%array'
967(*note Matching::). In that case, `yytext' may be freely modified in
968any way.
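
   As a sketch of an in-place modification that stays within this
rule, the helper below lower-cases a token without changing its length.
The name `fold_token()' is made up for this illustration and is not
part of `flex'.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical helper, not part of flex: lower-case `leng' characters
   of `text' in place.  The token keeps its length, so nothing beyond
   its end is overwritten. */
void fold_token( char *text, int leng )
{
    int i;

    assert( text != 0 && leng >= 0 );
    for ( i = 0; i < leng; ++i )
        text[i] = (char) tolower( (unsigned char) text[i] );
}
```

   An action could call it as `fold_token( yytext, yyleng );'.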

   Actions are free to modify `yyleng' except they should not do so if
the action also includes use of `yymore()' (see below).

   There are a number of special directives which can be included
within an action:

`ECHO'
     copies `yytext' to the scanner's output.

`BEGIN'
     followed by the name of a start condition places the scanner in the
     corresponding start condition (see below).

`REJECT'
     directs the scanner to proceed on to the "second best" rule which
     matched the input (or a prefix of the input). The rule is chosen
     as described above in *Note Matching::, and `yytext' and `yyleng'
     set up appropriately. It may either be one which matched as much
     text as the originally chosen rule but came later in the `flex'
     input file, or one which matched less text. For example, the
     following will both count the words in the input and call the
     routine `special()' whenever `frob' is seen:


                      int word_count = 0;
          %%

          frob        special(); REJECT;
          [^ \t\n]+   ++word_count;

     Without the `REJECT', any occurrences of `frob' in the input would
     not be counted as words, since the scanner normally executes only
     one action per token. Multiple uses of `REJECT' are allowed, each
     one finding the next best choice to the currently active rule. For
     example, when the following scanner scans the token `abcd', it will
     write `abcdabcaba' to the output:


          %%
          a        |
          ab       |
          abc      |
          abcd     ECHO; REJECT;
          .|\n     /* eat up any unmatched character */

     The first three rules share the fourth's action since they use the
     special `|' action.

     `REJECT' is a particularly expensive feature in terms of scanner
     performance; if it is used in _any_ of the scanner's actions it
     will slow down _all_ of the scanner's matching. Furthermore,
     `REJECT' cannot be used with the `-Cf' or `-CF' options (*note
     Scanner Options::).

     Note also that unlike the other special actions, `REJECT' is a
     _branch_. Code immediately following it in the action will _not_
     be executed.

`yymore()'
     tells the scanner that the next time it matches a rule, the
     corresponding token should be _appended_ onto the current value of
     `yytext' rather than replacing it. For example, given the input
     `mega-kludge' the following will write `mega-mega-kludge' to the
     output:


          %%
          mega-    ECHO; yymore();
          kludge   ECHO;

     First `mega-' is matched and echoed to the output. Then `kludge'
     is matched, but the previous `mega-' is still hanging around at the
     beginning of `yytext' so the `ECHO' for the `kludge' rule will
     actually write `mega-kludge'.
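
   The fall-back order that `REJECT' follows in the `abcd' example
above can be modelled in a few lines of plain C. This is only a toy
model under stated assumptions, not the way `flex' implements `REJECT':
the rules are hard-coded, every literal rule is assumed to carry the
action `ECHO; REJECT;', and the names `literal_rules' and
`scan_with_reject()' are made up for the illustration.

```c
#include <stddef.h>
#include <string.h>

/* Toy model only.  Rules 0..3 are the literal patterns of the example,
   listed in file order; each has the action "ECHO; REJECT;".  The
   final rule `.|\n' is modelled by the "++pos" at the end of the loop. */
static const char *const literal_rules[] = { "a", "ab", "abc", "abcd" };
enum { N_LITERALS = 4 };

/* Scan `input' and collect in `out' (capacity `outsz') what the ECHO
   actions would print.  `out' is always NUL-terminated. */
char *scan_with_reject( const char *input, char *out, size_t outsz )
{
    size_t used = 0;
    size_t pos = 0;
    size_t len = strlen( input );
    int r;

    if ( outsz == 0 )
        return out;
    out[0] = '\0';

    while ( pos < len )
    {
        /* Candidates are tried longest match first.  The literals are
           stored shortest first, so walk the table backwards. */
        for ( r = N_LITERALS - 1; r >= 0; --r )
        {
            size_t n = strlen( literal_rules[r] );

            if ( strncmp( input + pos, literal_rules[r], n ) != 0 )
                continue;            /* rule does not match here */

            if ( used + n < outsz )  /* ECHO */
            {
                memcpy( out + used, input + pos, n );
                used += n;
                out[used] = '\0';
            }
            /* REJECT: go on to the next best candidate */
        }

        /* Every literal rejected, so the catch-all rule finally
           accepts one character and throws it away. */
        ++pos;
    }
    return out;
}
```

   Calling `scan_with_reject( "abcd", out, sizeof out )' leaves
`abcdabcaba' in `out', the same text the real scanner is stated to
write.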

   Two notes regarding use of `yymore()'. First, `yymore()' depends on
the value of `yyleng' correctly reflecting the size of the current
token, so you must not modify `yyleng' if you are using `yymore()'.
Second, the presence of `yymore()' in the scanner's action entails a
minor performance penalty in the scanner's matching speed.

   `yyless(n)' returns all but the first `n' characters of the current
token back to the input stream, where they will be rescanned when the
scanner looks for the next match. `yytext' and `yyleng' are adjusted
appropriately (e.g., `yyleng' will now be equal to `n'). For example,
on the input `foobar' the following will write out `foobarbar':


     %%
     foobar    ECHO; yyless(3);
     [a-z]+    ECHO;

   An argument of 0 to `yyless()' will cause the entire current input
string to be scanned again. Unless you've changed how the scanner will
subsequently process its input (using `BEGIN', for example), this will
result in an endless loop.

   Note that `yyless()' is a macro and can only be used in the flex
input file, not from other source files.

   `unput(c)' puts the character `c' back onto the input stream. It
will be the next character scanned. The following action will take the
current token and cause it to be rescanned enclosed in parentheses.


     {
     int i;
     /* Copy yytext because unput() trashes yytext */
     char *yycopy = strdup( yytext );
     unput( ')' );
     for ( i = yyleng - 1; i >= 0; --i )
         unput( yycopy[i] );
     unput( '(' );
     free( yycopy );
     }

   Note that since each `unput()' puts the given character back at the
_beginning_ of the input stream, pushing back strings must be done
back-to-front.
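
   The back-to-front rule can be seen in a small stand-alone model of
a pushback stack. The functions `toy_unput()', `toy_input()' and
`toy_unput_string()' are hypothetical stand-ins written for this
sketch; they are not the scanner's own `unput()' and `input()'.

```c
#include <assert.h>
#include <string.h>

/* Toy stand-ins for the scanner's pushback, NOT flex's own unput()
   and input().  The most recently pushed character is read first. */
enum { PUSHBACK_MAX = 64 };
static char pushback[PUSHBACK_MAX];
static int  pushback_len = 0;

void toy_unput( int c )
{
    /* Too much pushback is treated as a hard error in this model. */
    assert( pushback_len < PUSHBACK_MAX );
    pushback[pushback_len++] = (char) c;
}

/* Returns the next pushed-back character, or -1 when none is left. */
int toy_input( void )
{
    return pushback_len > 0 ? (unsigned char) pushback[--pushback_len] : -1;
}

/* Push back a whole string so that it is read again in its original
   order: the last character goes first. */
void toy_unput_string( const char *s )
{
    size_t i = strlen( s );

    while ( i > 0 )
        toy_unput( s[--i] );
}
```

   After `toy_unput_string( "abc" )', successive calls to `toy_input()'
return `a', `b' and `c' in that order.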

   An important potential problem when using `unput()' is that if you
are using `%pointer' (the default), a call to `unput()' _destroys_ the
contents of `yytext', starting with its rightmost character and
devouring one character to the left with each call. If you need the
value of `yytext' preserved after a call to `unput()' (as in the above
example), you must either first copy it elsewhere, or build your
scanner using `%array' instead (*note Matching::).

   Finally, note that you cannot put back `EOF' to attempt to mark the
input stream with an end-of-file.

   `input()' reads the next character from the input stream. For
example, the following is one way to eat up C comments:


     %%
     "/*"        {
                 register int c;

                 for ( ; ; )
                     {
                     while ( (c = input()) != '*' &&
                             c != EOF )
                         ;    /* eat up text of comment */

                     if ( c == '*' )
                         {
                         while ( (c = input()) == '*' )
                             ;
                         if ( c == '/' )
                             break;    /* found the end */
                         }

                     if ( c == EOF )
                         {
                         error( "EOF in comment" );
                         break;
                         }
                     }
                 }

   (Note that if the scanner is compiled using `C++', then `input()' is
instead referred to as `yyinput()', in order to avoid a name clash with
the `C++' stream by the name of `input'.)

   `YY_FLUSH_BUFFER()' flushes the scanner's internal buffer so that
the next time the scanner attempts to match a token, it will first
refill the buffer using `YY_INPUT()' (*note Generated Scanner::). This
action is a special case of the more general `yy_flush_buffer()'
function, described below (*note Multiple Input Buffers::).

   `yyterminate()' can be used in lieu of a return statement in an
action. It terminates the scanner and returns a 0 to the scanner's
caller, indicating "all done". By default, `yyterminate()' is also
called when an end-of-file is encountered. It is a macro and may be
redefined.


File: flex.info, Node: Generated Scanner, Next: Start Conditions, Prev: Actions, Up: Top

The Generated Scanner
*********************

   The output of `flex' is the file `lex.yy.c', which contains the
scanning routine `yylex()', a number of tables used by it for matching
tokens, and a number of auxiliary routines and macros. By default,
`yylex()' is declared as follows:


     int yylex()
         {
         ... various definitions and the actions in here ...
         }

   (If your environment supports function prototypes, then it will be
`int yylex( void )'.) This definition may be changed by defining the
`YY_DECL' macro. For example, you could use:


     #define YY_DECL float lexscan( a, b ) float a, b;

   to give the scanning routine the name `lexscan', returning a float,
and taking two floats as arguments. Note that if you give arguments to
the scanning routine using a K&R-style/non-prototyped function
declaration, you must terminate the definition with a semi-colon (;).
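
   With a prototype-style definition the same routine needs no
trailing semi-colon. The sketch below mimics by hand how the generated
file uses `YY_DECL' as the header of the scanning routine; the function
body is only a stand-in for the code `flex' would generate.

```c
/* In a scanner this #define would go in the definitions section,
   inside %{ ... %}. */
#define YY_DECL float lexscan( float a, float b )

/* The generated file places the scanning code after YY_DECL, much as
   written out by hand here. */
YY_DECL
{
    /* Stand-in body for the sketch; flex would put the matching loop
       and the rule actions here. */
    return a + b;
}
```

   Code in other files that calls `lexscan()' needs a matching
declaration, for example `extern float lexscan( float a, float b );'.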

   `flex' generates `C99' function definitions by default. However,
`flex' does have the ability to generate obsolete, er, `traditional',
function definitions. This is to support bootstrapping gcc on old
systems. Unfortunately, traditional definitions prevent us from using
any standard data types smaller than int (such as short, char, or bool)
as function arguments. For this reason, future versions of `flex' may
generate standard C99 code only, leaving K&R-style functions to the
historians. Currently, if you do *not* want `C99' definitions, then
you must use `%option noansi-definitions'.

   Whenever `yylex()' is called, it scans tokens from the global input
file `yyin' (which defaults to stdin). It continues until it either
reaches an end-of-file (at which point it returns the value 0) or one
of its actions executes a `return' statement.

   If the scanner reaches an end-of-file, subsequent calls are undefined
unless either `yyin' is pointed at a new input file (in which case
scanning continues from that file), or `yyrestart()' is called.
`yyrestart()' takes one argument, a `FILE *' pointer (which can be
NULL, if you've set up `YY_INPUT' to scan from a source other than
`yyin'), and initializes `yyin' for scanning from that file.
Essentially there is no difference between just assigning `yyin' to a
new input file or using `yyrestart()' to do so; the latter is available
for compatibility with previous versions of `flex', and because it can
be used to switch input files in the middle of scanning. It can also
be used to throw away the current input buffer, by calling it with an
argument of `yyin'; but it would be better to use `YY_FLUSH_BUFFER'
(*note Actions::). Note that `yyrestart()' does _not_ reset the start
condition to `INITIAL' (*note Start Conditions::).

   If `yylex()' stops scanning due to executing a `return' statement in
one of the actions, the scanner may then be called again and it will
resume scanning where it left off.
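
   A caller therefore usually drives the scanner from a loop that
stops on the 0 returned at end-of-file. The following is a minimal
sketch of such a loop. So that it runs without `flex', it takes the
scanning routine as a function pointer and comes with a stand-in named
`fake_yylex()'; that name, `drain_scanner()' and the token codes are
all made up for the illustration.

```c
/* Hypothetical token codes that rule actions might `return'. */
enum { TOK_WORD = 1, TOK_NUMBER = 2 };

/* Call the scanning routine until it returns 0 (end-of-file).  Each
   call resumes where the previous one left off.  Returns the number
   of tokens seen. */
int drain_scanner( int (*scan)( void ) )
{
    int count = 0;

    while ( scan() != 0 )
        ++count;
    return count;
}

/* Stand-in for a generated yylex(): hands out three tokens, then 0. */
int fake_yylex( void )
{
    static const int tokens[] = { TOK_WORD, TOK_NUMBER, TOK_WORD, 0 };
    static int next = 0;
    int tok = tokens[next];

    if ( tok != 0 )
        ++next;
    return tok;
}
```

   With a real scanner linked in, the call would be
`drain_scanner( yylex )'.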

   By default (and for purposes of efficiency), the scanner uses
block-reads rather than simple `getc()' calls to read characters from
`yyin'. The nature of how it gets its input can be controlled by
defining the `YY_INPUT' macro. The calling sequence for `YY_INPUT()'
is `YY_INPUT(buf,result,max_size)'. Its action is to place up to
`max_size' characters in the character array `buf' and return in the
integer variable `result' either the number of characters read or the
constant `YY_NULL' (0 on Unix systems) to indicate `EOF'. The default
`YY_INPUT' reads from the global file-pointer `yyin'.

   Here is a sample definition of `YY_INPUT' (in the definitions
section of the input file):


     %{
     #define YY_INPUT(buf,result,max_size) \
         { \
         int c = getchar(); \
         result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
         }
     %}

   This definition will change the input processing to occur one
character at a time.
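
   A `YY_INPUT' that delivers whole blocks follows the same calling
sequence. The sketch below reads from a string held in memory. The
variables `src_text' and `src_pos' and the harness `read_all()' are
assumptions made for this example; the harness merely stands in for the
scanner's own refill logic so that the macro can be tried without
`flex'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifndef YY_NULL
#define YY_NULL 0   /* provided by the generated scanner; repeated here
                       so that the sketch compiles on its own */
#endif

/* Hypothetical input source: a NUL-terminated string and a cursor. */
static const char *src_text = "";
static size_t      src_pos  = 0;

/* Copy up to max_size characters into buf and store the count in
   result, or YY_NULL once the string is used up. */
#define YY_INPUT(buf,result,max_size) \
    { \
    size_t left_ = strlen( src_text + src_pos ); \
    size_t n_ = left_ < (size_t) (max_size) ? left_ : (size_t) (max_size); \
    memcpy( (buf), src_text + src_pos, n_ ); \
    src_pos += n_; \
    (result) = n_ > 0 ? (int) n_ : YY_NULL; \
    }

/* Harness for the sketch: read all of `text' through YY_INPUT in
   pieces of at most `chunk' characters, concatenating them in `out'
   (which must hold strlen(text) + 1 characters).  Returns the number
   of reads that delivered data. */
int read_all( const char *text, size_t chunk, char *out )
{
    char buf[16];
    int result;
    int reads = 0;
    size_t used = 0;

    assert( chunk > 0 && chunk <= sizeof buf );
    src_text = text;
    src_pos = 0;

    for ( ; ; )
    {
        YY_INPUT( buf, result, chunk );
        if ( result == YY_NULL )
            break;
        memcpy( out + used, buf, (size_t) result );
        used += (size_t) result;
        ++reads;
    }
    out[used] = '\0';
    return reads;
}
```

   In a scanner only the two variables and the `#define' would go in
the definitions section, inside `%{ ... %}'.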

   When the scanner receives an end-of-file indication from `YY_INPUT',
it then checks the `yywrap()' function. If `yywrap()' returns false
(zero), then it is assumed that the function has gone ahead and set up
`yyin' to point to another input file, and scanning continues. If it
returns true (non-zero), then the scanner terminates, returning 0 to
its caller. Note that in either case, the start condition remains
unchanged; it does _not_ revert to `INITIAL'.

   If you do not supply your own version of `yywrap()', then you must
either use `%option noyywrap' (in which case the scanner behaves as
though `yywrap()' returned 1), or you must link with `-lfl' to obtain
the default version of the routine, which always returns 1.
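
   As an illustration of supplying your own version, here is a sketch
of a `yywrap()' that steps through a list of file names. The list
pointer `next_file' and the helper `set_input_files()' are hypothetical
names introduced for the example. `yyin' is defined here only so that
the sketch compiles by itself; in a real scanner it comes from the
generated file.

```c
#include <stdio.h>

/* In a real scanner yyin is defined by the generated lex.yy.c and
   would only be declared here (`extern FILE *yyin;').  It is defined
   in this sketch so that the code compiles on its own. */
FILE *yyin;

/* Hypothetical NULL-terminated list of the files still to be scanned,
   for example taken from argv. */
static char **next_file = 0;

void set_input_files( char **files )
{
    next_file = files;
}

/* Called by the scanner at end-of-file.  Returns 0 after pointing
   yyin at the next file that can be opened, or 1 when no input is
   left. */
int yywrap( void )
{
    if ( yyin != 0 && yyin != stdin )
        fclose( yyin );    /* done with the previous file */
    yyin = 0;

    while ( next_file != 0 && *next_file != 0 )
    {
        yyin = fopen( *next_file++, "r" );
        if ( yyin != 0 )
            return 0;      /* more input: scanning continues */
    }
    return 1;              /* all done: yylex() returns 0 */
}
```

   Files that cannot be opened are skipped silently in this sketch; a
real program would probably want to report them.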

   For scanning from in-memory buffers (e.g., scanning strings), see
*Note Scanning Strings::. *Note Multiple Input Buffers::.

   The scanner writes its `ECHO' output to the `yyout' global (default,
`stdout'), which may be redefined by the user simply by assigning it to
some other `FILE' pointer.