1 | This is flex.info, produced by makeinfo version 4.5 from flex.texi.
|
---|
2 |
|
---|
3 | INFO-DIR-SECTION Programming
|
---|
4 | START-INFO-DIR-ENTRY
|
---|
5 | * flex: (flex). Fast lexical analyzer generator (lex replacement).
|
---|
6 | END-INFO-DIR-ENTRY
|
---|
7 |
|
---|
8 |
|
---|
9 | The flex manual is placed under the same licensing conditions as the
|
---|
10 | rest of flex:
|
---|
11 |
|
---|
12 | Copyright (C) 1990, 1997 The Regents of the University of California.
|
---|
13 | All rights reserved.
|
---|
14 |
|
---|
15 | This code is derived from software contributed to Berkeley by Vern
|
---|
16 | Paxson.
|
---|
17 |
|
---|
18 | The United States Government has rights in this work pursuant to
|
---|
19 | contract no. DE-AC03-76SF00098 between the United States Department of
|
---|
20 | Energy and the University of California.
|
---|
21 |
|
---|
22 | Redistribution and use in source and binary forms, with or without
|
---|
23 | modification, are permitted provided that the following conditions are
|
---|
24 | met:
|
---|
25 |
|
---|
26 | 1. Redistributions of source code must retain the above copyright
|
---|
27 | notice, this list of conditions and the following disclaimer.
|
---|
28 |
|
---|
29 | 2. Redistributions in binary form must reproduce the above copyright
|
---|
30 | notice, this list of conditions and the following disclaimer in the
|
---|
31 | documentation and/or other materials provided with the
|
---|
32 | distribution.
|
---|
33 | Neither the name of the University nor the names of its contributors
|
---|
34 | may be used to endorse or promote products derived from this software
|
---|
35 | without specific prior written permission.
|
---|
36 |
|
---|
37 | THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
|
---|
38 | WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
|
---|
39 | MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
|
---|
40 |
|
---|
41 | File: flex.info, Node: Top, Next: Copyright, Prev: (dir), Up: (dir)
|
---|
42 |
|
---|
43 | flex
|
---|
44 | ****
|
---|
45 |
|
---|
46 | This manual describes `flex', a tool for generating programs that
|
---|
47 | perform pattern-matching on text. The manual includes both tutorial and
|
---|
48 | reference sections.
|
---|
49 |
|
---|
50 | This edition of `The flex Manual' documents `flex' version 2.5.33.
|
---|
51 | It was last updated on 20 February 2006.
|
---|
52 |
|
---|
53 | * Menu:
|
---|
54 |
|
---|
55 | * Copyright::
|
---|
56 | * Reporting Bugs::
|
---|
57 | * Introduction::
|
---|
58 | * Simple Examples::
|
---|
59 | * Format::
|
---|
60 | * Patterns::
|
---|
61 | * Matching::
|
---|
62 | * Actions::
|
---|
63 | * Generated Scanner::
|
---|
64 | * Start Conditions::
|
---|
65 | * Multiple Input Buffers::
|
---|
66 | * EOF::
|
---|
67 | * Misc Macros::
|
---|
68 | * User Values::
|
---|
69 | * Yacc::
|
---|
70 | * Scanner Options::
|
---|
71 | * Performance::
|
---|
72 | * Cxx::
|
---|
73 | * Reentrant::
|
---|
74 | * Lex and Posix::
|
---|
75 | * Memory Management::
|
---|
76 | * Serialized Tables::
|
---|
77 | * Diagnostics::
|
---|
78 | * Limitations::
|
---|
79 | * Bibliography::
|
---|
80 | * FAQ::
|
---|
81 | * Appendices::
|
---|
82 | * Indices::
|
---|
83 |
|
---|
84 | --- The Detailed Node Listing ---
|
---|
85 |
|
---|
86 | Format of the Input File
|
---|
87 |
|
---|
88 | * Definitions Section::
|
---|
89 | * Rules Section::
|
---|
90 | * User Code Section::
|
---|
91 | * Comments in the Input::
|
---|
92 |
|
---|
93 | Scanner Options
|
---|
94 |
|
---|
95 | * Options for Specifing Filenames::
|
---|
96 | * Options Affecting Scanner Behavior::
|
---|
97 | * Code-Level And API Options::
|
---|
98 | * Options for Scanner Speed and Size::
|
---|
99 | * Debugging Options::
|
---|
100 | * Miscellaneous Options::
|
---|
101 |
|
---|
102 | Reentrant C Scanners
|
---|
103 |
|
---|
104 | * Reentrant Uses::
|
---|
105 | * Reentrant Overview::
|
---|
106 | * Reentrant Example::
|
---|
107 | * Reentrant Detail::
|
---|
108 | * Reentrant Functions::
|
---|
109 |
|
---|
110 | The Reentrant API in Detail
|
---|
111 |
|
---|
112 | * Specify Reentrant::
|
---|
113 | * Extra Reentrant Argument::
|
---|
114 | * Global Replacement::
|
---|
115 | * Init and Destroy Functions::
|
---|
116 | * Accessor Methods::
|
---|
117 | * Extra Data::
|
---|
118 | * About yyscan_t::
|
---|
119 |
|
---|
120 | Memory Management
|
---|
121 |
|
---|
122 | * The Default Memory Management::
|
---|
123 | * Overriding The Default Memory Management::
|
---|
124 | * A Note About yytext And Memory::
|
---|
125 |
|
---|
126 | Serialized Tables
|
---|
127 |
|
---|
128 | * Creating Serialized Tables::
|
---|
129 | * Loading and Unloading Serialized Tables::
|
---|
130 | * Tables File Format::
|
---|
131 |
|
---|
132 | FAQ
|
---|
133 |
|
---|
134 | * When was flex born?::
|
---|
135 | * How do I expand \ escape sequences in C-style quoted strings?::
|
---|
136 | * Why do flex scanners call fileno if it is not ANSI compatible?::
|
---|
137 | * Does flex support recursive pattern definitions?::
|
---|
138 | * How do I skip huge chunks of input (tens of megabytes) while using flex?::
|
---|
139 | * Flex is not matching my patterns in the same order that I defined them.::
|
---|
140 | * My actions are executing out of order or sometimes not at all.::
|
---|
141 | * How can I have multiple input sources feed into the same scanner at the same time?::
|
---|
142 | * Can I build nested parsers that work with the same input file?::
|
---|
143 | * How can I match text only at the end of a file?::
|
---|
144 | * How can I make REJECT cascade across start condition boundaries?::
|
---|
145 | * Why cant I use fast or full tables with interactive mode?::
|
---|
146 | * How much faster is -F or -f than -C?::
|
---|
147 | * If I have a simple grammar cant I just parse it with flex?::
|
---|
148 | * Why doesnt yyrestart() set the start state back to INITIAL?::
|
---|
149 | * How can I match C-style comments?::
|
---|
150 | * The period isnt working the way I expected.::
|
---|
151 | * Can I get the flex manual in another format?::
|
---|
152 | * Does there exist a "faster" NDFA->DFA algorithm?::
|
---|
153 | * How does flex compile the DFA so quickly?::
|
---|
154 | * How can I use more than 8192 rules?::
|
---|
155 | * How do I abandon a file in the middle of a scan and switch to a new file?::
|
---|
156 | * How do I execute code only during initialization (only before the first scan)?::
|
---|
157 | * How do I execute code at termination?::
|
---|
158 | * Where else can I find help?::
|
---|
159 | * Can I include comments in the "rules" section of the file?::
|
---|
160 | * I get an error about undefined yywrap().::
|
---|
161 | * How can I change the matching pattern at run time?::
|
---|
162 | * How can I expand macros in the input?::
|
---|
163 | * How can I build a two-pass scanner?::
|
---|
164 | * How do I match any string not matched in the preceding rules?::
|
---|
165 | * I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
|
---|
166 | * Is there a way to make flex treat NULL like a regular character?::
|
---|
167 | * Whenever flex can not match the input it says "flex scanner jammed".::
|
---|
168 | * Why doesnt flex have non-greedy operators like perl does?::
|
---|
169 | * Memory leak - 16386 bytes allocated by malloc.::
|
---|
170 | * How do I track the byte offset for lseek()?::
|
---|
171 | * How do I use my own I/O classes in a C++ scanner?::
|
---|
172 | * How do I skip as many chars as possible?::
|
---|
173 | * deleteme00::
|
---|
174 | * Are certain equivalent patterns faster than others?::
|
---|
175 | * Is backing up a big deal?::
|
---|
176 | * Can I fake multi-byte character support?::
|
---|
177 | * deleteme01::
|
---|
178 | * Can you discuss some flex internals?::
|
---|
179 | * unput() messes up yy_at_bol::
|
---|
180 | * The | operator is not doing what I want::
|
---|
181 | * Why can't flex understand this variable trailing context pattern?::
|
---|
182 | * The ^ operator isn't working::
|
---|
183 | * Trailing context is getting confused with trailing optional patterns::
|
---|
184 | * Is flex GNU or not?::
|
---|
185 | * ERASEME53::
|
---|
186 | * I need to scan if-then-else blocks and while loops::
|
---|
187 | * ERASEME55::
|
---|
188 | * ERASEME56::
|
---|
189 | * ERASEME57::
|
---|
190 | * Is there a repository for flex scanners?::
|
---|
191 | * How can I conditionally compile or preprocess my flex input file?::
|
---|
192 | * Where can I find grammars for lex and yacc?::
|
---|
193 | * I get an end-of-buffer message for each character scanned.::
|
---|
194 | * unnamed-faq-62::
|
---|
195 | * unnamed-faq-63::
|
---|
196 | * unnamed-faq-64::
|
---|
197 | * unnamed-faq-65::
|
---|
198 | * unnamed-faq-66::
|
---|
199 | * unnamed-faq-67::
|
---|
200 | * unnamed-faq-68::
|
---|
201 | * unnamed-faq-69::
|
---|
202 | * unnamed-faq-70::
|
---|
203 | * unnamed-faq-71::
|
---|
204 | * unnamed-faq-72::
|
---|
205 | * unnamed-faq-73::
|
---|
206 | * unnamed-faq-74::
|
---|
207 | * unnamed-faq-75::
|
---|
208 | * unnamed-faq-76::
|
---|
209 | * unnamed-faq-77::
|
---|
210 | * unnamed-faq-78::
|
---|
211 | * unnamed-faq-79::
|
---|
212 | * unnamed-faq-80::
|
---|
213 | * unnamed-faq-81::
|
---|
214 | * unnamed-faq-82::
|
---|
215 | * unnamed-faq-83::
|
---|
216 | * unnamed-faq-84::
|
---|
217 | * unnamed-faq-85::
|
---|
218 | * unnamed-faq-86::
|
---|
219 | * unnamed-faq-87::
|
---|
220 | * unnamed-faq-88::
|
---|
221 | * unnamed-faq-90::
|
---|
222 | * unnamed-faq-91::
|
---|
223 | * unnamed-faq-92::
|
---|
224 | * unnamed-faq-93::
|
---|
225 | * unnamed-faq-94::
|
---|
226 | * unnamed-faq-95::
|
---|
227 | * unnamed-faq-96::
|
---|
228 | * unnamed-faq-97::
|
---|
229 | * unnamed-faq-98::
|
---|
230 | * unnamed-faq-99::
|
---|
231 | * unnamed-faq-100::
|
---|
232 | * unnamed-faq-101::
|
---|
233 | * What is the difference between YYLEX_PARAM and YY_DECL?::
|
---|
234 | * Why do I get "conflicting types for yylex" error?::
|
---|
235 | * How do I access the values set in a Flex action from within a Bison action?::
|
---|
236 |
|
---|
237 | Appendices
|
---|
238 |
|
---|
239 | * Makefiles and Flex::
|
---|
240 | * Bison Bridge::
|
---|
241 | * M4 Dependency::
|
---|
242 |
|
---|
243 | Indices
|
---|
244 |
|
---|
245 | * Concept Index::
|
---|
246 | * Index of Functions and Macros::
|
---|
247 | * Index of Variables::
|
---|
248 | * Index of Data Types::
|
---|
249 | * Index of Hooks::
|
---|
250 | * Index of Scanner Options::
|
---|
251 |
|
---|
252 |
|
---|
253 | File: flex.info, Node: Copyright, Next: Reporting Bugs, Prev: Top, Up: Top
|
---|
254 |
|
---|
255 | Copyright
|
---|
256 | *********
|
---|
257 |
|
---|
258 |
|
---|
259 | The flex manual is placed under the same licensing conditions as the
|
---|
260 | rest of flex:
|
---|
261 |
|
---|
262 | Copyright (C) 1990, 1997 The Regents of the University of California.
|
---|
263 | All rights reserved.
|
---|
264 |
|
---|
265 | This code is derived from software contributed to Berkeley by Vern
|
---|
266 | Paxson.
|
---|
267 |
|
---|
268 | The United States Government has rights in this work pursuant to
|
---|
269 | contract no. DE-AC03-76SF00098 between the United States Department of
|
---|
270 | Energy and the University of California.
|
---|
271 |
|
---|
272 | Redistribution and use in source and binary forms, with or without
|
---|
273 | modification, are permitted provided that the following conditions are
|
---|
274 | met:
|
---|
275 |
|
---|
276 | 1. Redistributions of source code must retain the above copyright
|
---|
277 | notice, this list of conditions and the following disclaimer.
|
---|
278 |
|
---|
279 | 2. Redistributions in binary form must reproduce the above copyright
|
---|
280 | notice, this list of conditions and the following disclaimer in the
|
---|
281 | documentation and/or other materials provided with the
|
---|
282 | distribution.
|
---|
283 | Neither the name of the University nor the names of its contributors
|
---|
284 | may be used to endorse or promote products derived from this software
|
---|
285 | without specific prior written permission.
|
---|
286 |
|
---|
287 | THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
|
---|
288 | WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
|
---|
289 | MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
|
---|
290 |
|
---|
291 | File: flex.info, Node: Reporting Bugs, Next: Introduction, Prev: Copyright, Up: Top
|
---|
292 |
|
---|
293 | Reporting Bugs
|
---|
294 | **************
|
---|
295 |
|
---|
296 | If you have problems with `flex' or think you have found a bug,
|
---|
297 | please send mail detailing your problem to
|
---|
298 | <flex-help@lists.sourceforge.net>. Patches are always welcome.
|
---|
299 |
|
---|
300 |
|
---|
301 | File: flex.info, Node: Introduction, Next: Simple Examples, Prev: Reporting Bugs, Up: Top
|
---|
302 |
|
---|
303 | Introduction
|
---|
304 | ************
|
---|
305 |
|
---|
306 | `flex' is a tool for generating "scanners". A scanner is a program
|
---|
307 | which recognizes lexical patterns in text. The `flex' program reads
|
---|
308 | the given input files, or its standard input if no file names are
|
---|
309 | given, for a description of a scanner to generate. The description is
|
---|
310 | in the form of pairs of regular expressions and C code, called "rules".
|
---|
311 | `flex' generates as output a C source file, `lex.yy.c' by default,
|
---|
312 | which defines a routine `yylex()'. This file can be compiled and
|
---|
313 | linked with the flex runtime library to produce an executable. When
|
---|
314 | the executable is run, it analyzes its input for occurrences of the
|
---|
315 | regular expressions. Whenever it finds one, it executes the
|
---|
316 | corresponding C code.
|
---|
317 |
|
---|
318 |
|
---|
319 | File: flex.info, Node: Simple Examples, Next: Format, Prev: Introduction, Up: Top
|
---|
320 |
|
---|
321 | Some Simple Examples
|
---|
322 | ********************
|
---|
323 |
|
---|
324 | First some simple examples to get the flavor of how one uses `flex'.
|
---|
325 |
|
---|
326 | The following `flex' input specifies a scanner which, when it
|
---|
327 | encounters the string `username', will replace it with the user's login
|
---|
328 | name:
|
---|
329 |
|
---|
330 |
|
---|
331 | %%
|
---|
332 | username printf( "%s", getlogin() );
|
---|
333 |
|
---|
334 | By default, any text not matched by a `flex' scanner is copied to
|
---|
335 | the output, so the net effect of this scanner is to copy its input file
|
---|
336 | to its output with each occurrence of `username' expanded. In this
|
---|
337 | input, there is just one rule. `username' is the "pattern" and the
|
---|
338 | `printf' is the "action". The `%%' symbol marks the beginning of the
|
---|
339 | rules.
|
---|
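As a rough illustration of what this scanner does, the following C
sketch applies the same two behaviors by hand: the single rule replaces
`username', and all other text is copied through, as the default rule
would do. It is not the code that `flex' generates. The function name
`expand_username' and the caller-supplied `login' argument (standing in
for `getlogin()') are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `in' to `out', replacing each occurrence of "username" with
   `login'.  Returns 0 on success, -1 if `out' is too small.
   Hypothetical helper; a hand-written stand-in for the scanner. */
static int expand_username(const char *in, const char *login,
                           char *out, size_t outsz)
{
    static const char pattern[] = "username";
    const size_t plen = sizeof pattern - 1;
    size_t o = 0;

    assert(in != NULL && login != NULL && out != NULL);
    if (outsz == 0)
        return -1;

    while (*in != '\0') {
        if (strncmp(in, pattern, plen) == 0) {
            /* The one rule: the pattern matched, so run the "action". */
            size_t n = strlen(login);
            if (o + n >= outsz)
                return -1;
            memcpy(out + o, login, n);
            o += n;
            in += plen;
        } else {
            /* Default rule: unmatched text is copied unchanged. */
            if (o + 1 >= outsz)
                return -1;
            out[o++] = *in++;
        }
    }
    out[o] = '\0';
    return 0;
}
```

Called with the input `hello username!' and the login `vern', the
sketch writes `hello vern!' into the output buffer.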
340 |
|
---|
341 | Here's another simple example:
|
---|
342 |
|
---|
343 |
|
---|
344 | int num_lines = 0, num_chars = 0;
|
---|
345 |
|
---|
346 | %%
|
---|
347 | \n ++num_lines; ++num_chars;
|
---|
348 | . ++num_chars;
|
---|
349 |
|
---|
350 | %%
|
---|
351 | int main()
|
---|
352 | {
|
---|
353 | yylex();
|
---|
354 | printf( "# of lines = %d, # of chars = %d\n",
|
---|
355 | num_lines, num_chars );
|
---|
356 | }
|
---|
357 |
|
---|
358 | This scanner counts the number of characters and the number of lines
|
---|
359 | in its input. It produces no output other than the final report on the
|
---|
360 | character and line counts. The first line declares two globals,
|
---|
361 | `num_lines' and `num_chars', which are accessible both inside `yylex()'
|
---|
362 | and in the `main()' routine declared after the second `%%'. There are
|
---|
363 | two rules, one which matches a newline (`\n') and increments both the
|
---|
364 | line count and the character count, and one which matches any character
|
---|
365 | other than a newline (indicated by the `.' regular expression).
|
---|
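The effect of the two rules can be sketched in plain C, without `flex',
to make the counting explicit. This is only an approximation of the
scanner's behavior on an in-memory buffer; the names `struct counts' and
`count_text' are hypothetical and do not appear in any generated code.

```c
#include <assert.h>
#include <stddef.h>

/* Result of one pass over the text (hypothetical type). */
struct counts {
    int lines;
    int chars;
};

/* Apply the two rules to `len' bytes of `text':
     \n   ++num_lines; ++num_chars;
     .    ++num_chars;                                  */
static struct counts count_text(const char *text, size_t len)
{
    struct counts c = { 0, 0 };

    assert(text != NULL);
    for (size_t i = 0; i < len; i++) {
        if (text[i] == '\n')
            ++c.lines;      /* only the newline rule counts lines */
        ++c.chars;          /* both rules count the character     */
    }
    return c;
}
```

For the eight-byte text consisting of `one', a newline, `two' and a
newline, the sketch reports 2 lines and 8 characters.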
366 |
|
---|
367 | A somewhat more complicated example:
|
---|
368 |
|
---|
369 |
|
---|
370 | /* scanner for a toy Pascal-like language */
|
---|
371 |
|
---|
372 | %{
|
---|
373 | /* need this for the call to atof() below */
|
---|
374 | #include <math.h>
|
---|
375 | %}
|
---|
376 |
|
---|
377 | DIGIT [0-9]
|
---|
378 | ID [a-z][a-z0-9]*
|
---|
379 |
|
---|
380 | %%
|
---|
381 |
|
---|
382 | {DIGIT}+ {
|
---|
383 | printf( "An integer: %s (%d)\n", yytext,
|
---|
384 | atoi( yytext ) );
|
---|
385 | }
|
---|
386 |
|
---|
387 | {DIGIT}+"."{DIGIT}* {
|
---|
388 | printf( "A float: %s (%g)\n", yytext,
|
---|
389 | atof( yytext ) );
|
---|
390 | }
|
---|
391 |
|
---|
392 | if|then|begin|end|procedure|function {
|
---|
393 | printf( "A keyword: %s\n", yytext );
|
---|
394 | }
|
---|
395 |
|
---|
396 | {ID} printf( "An identifier: %s\n", yytext );
|
---|
397 |
|
---|
398 | "+"|"-"|"*"|"/" printf( "An operator: %s\n", yytext );
|
---|
399 |
|
---|
400 | "{"[^{}\n]*"}" /* eat up one-line comments */
|
---|
401 |
|
---|
402 | [ \t\n]+ /* eat up whitespace */
|
---|
403 |
|
---|
404 | . printf( "Unrecognized character: %s\n", yytext );
|
---|
405 |
|
---|
406 | %%
|
---|
407 |
|
---|
408 | int main( int argc, char **argv )
|
---|
411 | {
|
---|
412 | ++argv, --argc; /* skip over program name */
|
---|
413 | if ( argc > 0 )
|
---|
414 | yyin = fopen( argv[0], "r" );
|
---|
415 | else
|
---|
416 | yyin = stdin;
|
---|
417 |
|
---|
418 | yylex();
|
---|
419 | }
|
---|
420 |
|
---|
421 | This is the beginning of a simple scanner for a language like
|
---|
422 | Pascal. It identifies different types of "tokens" and reports on what
|
---|
423 | it has seen.
|
---|
424 |
|
---|
425 | The details of this example will be explained in the following
|
---|
426 | sections.
|
---|
427 |
|
---|
428 |
|
---|
429 | File: flex.info, Node: Format, Next: Patterns, Prev: Simple Examples, Up: Top
|
---|
430 |
|
---|
431 | Format of the Input File
|
---|
432 | ************************
|
---|
433 |
|
---|
434 | The `flex' input file consists of three sections, separated by a
|
---|
435 | line containing only `%%'.
|
---|
436 |
|
---|
437 |
|
---|
438 | definitions
|
---|
439 | %%
|
---|
440 | rules
|
---|
441 | %%
|
---|
442 | user code
|
---|
443 |
|
---|
444 | * Menu:
|
---|
445 |
|
---|
446 | * Definitions Section::
|
---|
447 | * Rules Section::
|
---|
448 | * User Code Section::
|
---|
449 | * Comments in the Input::
|
---|
450 |
|
---|
451 |
|
---|
452 | File: flex.info, Node: Definitions Section, Next: Rules Section, Prev: Format, Up: Format
|
---|
453 |
|
---|
454 | Format of the Definitions Section
|
---|
455 | =================================
|
---|
456 |
|
---|
457 | The "definitions section" contains declarations of simple "name"
|
---|
458 | definitions to simplify the scanner specification, and declarations of
|
---|
459 | "start conditions", which are explained in a later section.
|
---|
460 |
|
---|
461 | Name definitions have the form:
|
---|
462 |
|
---|
463 |
|
---|
464 | name definition
|
---|
465 |
|
---|
466 | The `name' is a word beginning with a letter or an underscore (`_')
|
---|
467 | followed by zero or more letters, digits, `_', or `-' (dash). The
|
---|
468 | definition is taken to begin at the first non-whitespace character
|
---|
469 | following the name and continuing to the end of the line. The
|
---|
470 | definition can subsequently be referred to using `{name}', which will
|
---|
471 | expand to `(definition)'. For example,
|
---|
472 |
|
---|
473 |
|
---|
474 | DIGIT [0-9]
|
---|
475 | ID [a-z][a-z0-9]*
|
---|
476 |
|
---|
477 | defines `DIGIT' to be a regular expression which matches a single
|
---|
478 | digit, and `ID' to be a regular expression which matches a letter
|
---|
479 | followed by zero-or-more letters-or-digits. A subsequent reference to
|
---|
480 |
|
---|
481 |
|
---|
482 | {DIGIT}+"."{DIGIT}*
|
---|
483 |
|
---|
484 | is identical to
|
---|
485 |
|
---|
486 |
|
---|
487 | ([0-9])+"."([0-9])*
|
---|
488 |
|
---|
489 | and matches one-or-more digits followed by a `.' followed by
|
---|
490 | zero-or-more digits.
|
---|
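The substitution described above, where `{name}' expands to
`(definition)', can be sketched as a plain textual rewrite in C. This is
a simplified illustration, not `flex''s own implementation: it ignores
quoting and escapes, and the names `struct name_def', `find_def' and
`expand_names' are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* One name definition, e.g. { "DIGIT", "[0-9]" } (hypothetical type). */
struct name_def {
    const char *name;
    const char *definition;
};

/* Return the definition for the `len'-byte name, or NULL if unknown. */
static const char *find_def(const struct name_def *defs, size_t ndefs,
                            const char *name, size_t len)
{
    for (size_t i = 0; i < ndefs; i++)
        if (strlen(defs[i].name) == len &&
            strncmp(defs[i].name, name, len) == 0)
            return defs[i].definition;
    return NULL;
}

/* Copy `pattern' to `out', replacing each known {name} with
   (definition).  Braces that do not start a name, such as the repeat
   count in r{2,5}, are copied unchanged.  Returns 0 on success, -1 if
   `out' is too small. */
static int expand_names(const char *pattern, const struct name_def *defs,
                        size_t ndefs, char *out, size_t outsz)
{
    size_t o = 0;

    assert(pattern != NULL && out != NULL);
    if (outsz == 0)
        return -1;

    while (*pattern != '\0') {
        const char *close = NULL;
        const char *def = NULL;

        if (pattern[0] == '{' &&
            (isalpha((unsigned char)pattern[1]) || pattern[1] == '_')) {
            close = strchr(pattern, '}');
            if (close != NULL)
                def = find_def(defs, ndefs, pattern + 1,
                               (size_t)(close - pattern - 1));
        }

        if (def != NULL) {
            size_t n = strlen(def);
            if (o + n + 2 >= outsz)     /* '(' + definition + ')' */
                return -1;
            out[o++] = '(';
            memcpy(out + o, def, n);
            o += n;
            out[o++] = ')';
            pattern = close + 1;
        } else {
            if (o + 1 >= outsz)
                return -1;
            out[o++] = *pattern++;
        }
    }
    out[o] = '\0';
    return 0;
}
```

The parentheses added around each definition are what keep a following
operator, such as the `+' after `{DIGIT}', applied to the whole
definition rather than to its last element.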
491 |
|
---|
492 | An unindented comment (i.e., a line beginning with `/*') is copied
|
---|
493 | verbatim to the output up to the next `*/'.
|
---|
494 |
|
---|
495 | Any _indented_ text or text enclosed in `%{' and `%}' is also copied
|
---|
496 | verbatim to the output (with the %{ and %} symbols removed). The %{
|
---|
497 | and %} symbols must appear unindented on lines by themselves.
|
---|
498 |
|
---|
499 | A `%top' block is similar to a `%{' ... `%}' block, except that the
|
---|
500 | code in a `%top' block is relocated to the _top_ of the generated file,
|
---|
501 | before any flex definitions (1). The `%top' block is useful when you
|
---|
502 | want certain preprocessor macros to be defined or certain files to be
|
---|
503 | included before the generated code. The single characters, `{' and
|
---|
504 | `}' are used to delimit the `%top' block, as shown in the example below:
|
---|
505 |
|
---|
506 |
|
---|
507 | %top{
|
---|
508 | /* This code goes at the "top" of the generated file. */
|
---|
509 | #include <stdint.h>
|
---|
510 | #include <inttypes.h>
|
---|
511 | }
|
---|
512 |
|
---|
513 | Multiple `%top' blocks are allowed, and their order is preserved.
|
---|
514 |
|
---|
515 | ---------- Footnotes ----------
|
---|
516 |
|
---|
517 | (1) Actually, `yyIN_HEADER' is defined before the `%top' block.
|
---|
518 |
|
---|
519 |
|
---|
520 | File: flex.info, Node: Rules Section, Next: User Code Section, Prev: Definitions Section, Up: Format
|
---|
521 |
|
---|
522 | Format of the Rules Section
|
---|
523 | ===========================
|
---|
524 |
|
---|
525 | The "rules" section of the `flex' input contains a series of rules
|
---|
526 | of the form:
|
---|
527 |
|
---|
528 |
|
---|
529 | pattern action
|
---|
530 |
|
---|
531 | where the pattern must be unindented and the action must begin on
|
---|
532 | the same line. *Note Patterns::, for a further description of patterns
|
---|
533 | and actions.
|
---|
534 |
|
---|
535 | In the rules section, any indented or %{ %} enclosed text appearing
|
---|
536 | before the first rule may be used to declare variables which are local
|
---|
537 | to the scanning routine and (after the declarations) code which is to be
|
---|
538 | executed whenever the scanning routine is entered. Other indented or
|
---|
539 | %{ %} text in the rule section is still copied to the output, but its
|
---|
540 | meaning is not well-defined and it may well cause compile-time errors
|
---|
541 | (this feature is present for POSIX compliance. *Note Lex and Posix::,
|
---|
542 | for other such features).
|
---|
543 |
|
---|
544 | Any _indented_ text or text enclosed in `%{' and `%}' is copied
|
---|
545 | verbatim to the output (with the %{ and %} symbols removed). The %{
|
---|
546 | and %} symbols must appear unindented on lines by themselves.
|
---|
547 |
|
---|
548 |
|
---|
549 | File: flex.info, Node: User Code Section, Next: Comments in the Input, Prev: Rules Section, Up: Format
|
---|
550 |
|
---|
551 | Format of the User Code Section
|
---|
552 | ===============================
|
---|
553 |
|
---|
554 | The user code section is simply copied to `lex.yy.c' verbatim. It
|
---|
555 | is used for companion routines which call or are called by the scanner.
|
---|
556 | The presence of this section is optional; if it is missing, the second
|
---|
557 | `%%' in the input file may be skipped, too.
|
---|
558 |
|
---|
559 |
|
---|
560 | File: flex.info, Node: Comments in the Input, Prev: User Code Section, Up: Format
|
---|
561 |
|
---|
562 | Comments in the Input
|
---|
563 | =====================
|
---|
564 |
|
---|
565 | Flex supports C-style comments, that is, anything between /* and */
|
---|
566 | is considered a comment. Whenever flex encounters a comment, it copies
|
---|
567 | the entire comment verbatim to the generated source code. Comments may
|
---|
568 | appear just about anywhere, but with the following exceptions:
|
---|
569 |
|
---|
570 | * Comments may not appear in the Rules Section wherever flex is
|
---|
571 | expecting a regular expression. This means comments may not appear
|
---|
572 | at the beginning of a line, or immediately following a list of
|
---|
573 | scanner states.
|
---|
574 |
|
---|
575 | * Comments may not appear on an `%option' line in the Definitions
|
---|
576 | Section.
|
---|
577 |
|
---|
578 | If you want to follow a simple rule, then always begin a comment on a
|
---|
579 | new line, with one or more whitespace characters before the initial
|
---|
580 | `/*'. This rule will work anywhere in the input file.
|
---|
581 |
|
---|
582 | All the comments in the following example are valid:
|
---|
583 |
|
---|
584 |
|
---|
585 | %{
|
---|
586 | /* code block */
|
---|
587 | %}
|
---|
588 |
|
---|
589 | /* Definitions Section */
|
---|
590 | %x STATE_X
|
---|
591 |
|
---|
592 | %%
|
---|
593 | /* Rules Section */
|
---|
594 | ruleA /* after regex */ { /* code block */ } /* after code block */
|
---|
595 | /* Rules Section (indented) */
|
---|
596 | <STATE_X>{
|
---|
597 | ruleC ECHO;
|
---|
598 | ruleD ECHO;
|
---|
599 | %{
|
---|
600 | /* code block */
|
---|
601 | %}
|
---|
602 | }
|
---|
603 | %%
|
---|
604 | /* User Code Section */
|
---|
605 |
|
---|
606 |
|
---|
607 | File: flex.info, Node: Patterns, Next: Matching, Prev: Format, Up: Top
|
---|
608 |
|
---|
609 | Patterns
|
---|
610 | ********
|
---|
611 |
|
---|
612 | The patterns in the input (see *Note Rules Section::) are written
|
---|
613 | using an extended set of regular expressions. These are:
|
---|
614 |
|
---|
615 | `x'
|
---|
616 | match the character 'x'
|
---|
617 |
|
---|
618 | `.'
|
---|
619 | any character (byte) except newline
|
---|
620 |
|
---|
621 | `[xyz]'
|
---|
622 | a "character class"; in this case, the pattern matches either an
|
---|
623 | 'x', a 'y', or a 'z'
|
---|
624 |
|
---|
625 | `[abj-oZ]'
|
---|
626 | a "character class" with a range in it; matches an 'a', a 'b', any
|
---|
627 | letter from 'j' through 'o', or a 'Z'
|
---|
628 |
|
---|
629 | `[^A-Z]'
|
---|
630 | a "negated character class", i.e., any character but those in the
|
---|
631 | class. In this case, any character EXCEPT an uppercase letter.
|
---|
632 |
|
---|
633 | `[^A-Z\n]'
|
---|
634 | any character EXCEPT an uppercase letter or a newline
|
---|
635 |
|
---|
636 | `r*'
|
---|
637 | zero or more r's, where r is any regular expression
|
---|
638 |
|
---|
639 | `r+'
|
---|
640 | one or more r's
|
---|
641 |
|
---|
642 | `r?'
|
---|
643 | zero or one r's (that is, "an optional r")
|
---|
644 |
|
---|
645 | `r{2,5}'
|
---|
646 | anywhere from two to five r's
|
---|
647 |
|
---|
648 | `r{2,}'
|
---|
649 | two or more r's
|
---|
650 |
|
---|
651 | `r{4}'
|
---|
652 | exactly 4 r's
|
---|
653 |
|
---|
654 | `{name}'
|
---|
655 | the expansion of the `name' definition (*note Format::).
|
---|
656 |
|
---|
657 | `"[xyz]\"foo"'
|
---|
658 | the literal string: `[xyz]"foo'
|
---|
659 |
|
---|
660 | `\X'
|
---|
661 | if X is `a', `b', `f', `n', `r', `t', or `v', then the ANSI-C
|
---|
662 | interpretation of `\x'. Otherwise, a literal `X' (used to escape
|
---|
663 | operators such as `*')
|
---|
664 |
|
---|
665 | `\0'
|
---|
666 | a NUL character (ASCII code 0)
|
---|
667 |
|
---|
668 | `\123'
|
---|
669 | the character with octal value 123
|
---|
670 |
|
---|
671 | `\x2a'
|
---|
672 | the character with hexadecimal value 2a
|
---|
673 |
|
---|
674 | `(r)'
|
---|
675 | match an `r'; parentheses are used to override precedence (see
|
---|
676 | below)
|
---|
677 |
|
---|
678 | `rs'
|
---|
679 | the regular expression `r' followed by the regular expression `s';
|
---|
680 | called "concatenation"
|
---|
681 |
|
---|
682 | `r|s'
|
---|
683 | either an `r' or an `s'
|
---|
684 |
|
---|
685 | `r/s'
|
---|
686 | an `r' but only if it is followed by an `s'. The text matched by
|
---|
687 | `s' is included when determining whether this rule is the longest
|
---|
688 | match, but is then returned to the input before the action is
|
---|
689 | executed. So the action only sees the text matched by `r'. This
|
---|
690 | type of pattern is called "trailing context". (There are some
|
---|
691 | combinations of `r/s' that flex cannot match correctly. *Note
|
---|
692 | Limitations::, regarding dangerous trailing context.)
|
---|
693 |
|
---|
694 | `^r'
|
---|
695 | an `r', but only at the beginning of a line (i.e., when just
|
---|
696 | starting to scan, or right after a newline has been scanned).
|
---|
697 |
|
---|
698 | `r$'
|
---|
699 | an `r', but only at the end of a line (i.e., just before a
|
---|
700 | newline). Equivalent to `r/\n'.
|
---|
701 |
|
---|
702 | Note that `flex''s notion of "newline" is exactly whatever the C
|
---|
703 | compiler used to compile `flex' interprets `\n' as; in particular,
|
---|
704 | on some DOS systems you must either filter out `\r's in the input
|
---|
705 | yourself, or explicitly use `r/\r\n' for `r$'.
|
---|
706 |
|
---|
707 | `<s>r'
|
---|
708 | an `r', but only in start condition `s' (see *Note Start
|
---|
709 | Conditions:: for discussion of start conditions).
|
---|
710 |
|
---|
711 | `<s1,s2,s3>r'
|
---|
712 | same, but in any of start conditions `s1', `s2', or `s3'.
|
---|
713 |
|
---|
714 | `<*>r'
|
---|
715 | an `r' in any start condition, even an exclusive one.
|
---|
716 |
|
---|
717 | `<<EOF>>'
|
---|
718 | an end-of-file.
|
---|
719 |
|
---|
720 | `<s1,s2><<EOF>>'
|
---|
721 | an end-of-file when in start condition `s1' or `s2'
|
---|
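The trailing context entry (`r/s') in the list above is the easiest one
to misread, so here is a small C sketch of its semantics. It is
restricted to literal strings for `r' and `s', which is a simplification
made for this example; real patterns are regular expressions. The
function name `match_trailing_context' is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Try to match literal `r' followed by literal `s' at the start of
   `input'.  On success, return the length of the text the action would
   see (the part matched by `r' only); the text matched by `s' stays in
   the input.  Return -1 if there is no match. */
static long match_trailing_context(const char *input,
                                   const char *r, const char *s)
{
    size_t rlen, slen;

    assert(input != NULL && r != NULL && s != NULL);
    rlen = strlen(r);
    slen = strlen(s);

    if (strncmp(input, r, rlen) != 0)
        return -1;                      /* `r' itself does not match  */
    if (strncmp(input + rlen, s, slen) != 0)
        return -1;                      /* `r' is not followed by `s' */
    return (long)rlen;                  /* only `r' is consumed       */
}
```

With `r' set to `foo' and `s' set to a newline, the sketch behaves like
`foo$': it reports a three-character match and leaves the newline to be
scanned next.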
722 |
|
---|
723 | Note that inside of a character class, all regular expression
|
---|
724 | operators lose their special meaning except escape (`\') and the
|
---|
725 | character class operators, `-', `]', and, at the beginning of the
|
---|
726 | class, `^'.
|
---|
727 |
|
---|
728 | The regular expressions listed above are grouped according to
|
---|
729 | precedence, from highest precedence at the top to lowest at the bottom.
|
---|
730 | Those grouped together have equal precedence (see special note on the
|
---|
731 | precedence of the repeat operator, `{}', under the documentation for
|
---|
732 | the `--posix' POSIX compliance option). For example,
|
---|
733 |
|
---|
734 |
|
---|
735 | foo|bar*
|
---|
736 |
|
---|
737 | is the same as
|
---|
738 |
|
---|
739 |
|
---|
740 | (foo)|(ba(r*))
|
---|
741 |
|
---|
742 | since the `*' operator has higher precedence than concatenation, and
|
---|
743 | concatenation higher than alternation (`|'). This pattern therefore
|
---|
744 | matches _either_ the string `foo' _or_ the string `ba' followed by
|
---|
745 | zero-or-more `r''s. To match `foo' or zero-or-more repetitions of the
|
---|
746 | string `bar', use:
|
---|
747 |
|
---|
748 |
|
---|
749 | foo|(bar)*
|
---|
750 |
|
---|
751 | And to match a sequence of zero or more repetitions of `foo' and
|
---|
752 | `bar':
|
---|
753 |
|
---|
754 |
|
---|
755 | (foo|bar)*
|
---|
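The three groupings can be checked experimentally. The sketch below uses
the POSIX `regcomp()' and `regexec()' functions rather than `flex'
itself, on the assumption that POSIX extended regular expressions give
repetition, concatenation and alternation the same relative precedence
for these simple patterns. The helper name `matches_whole' is
hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Return 1 if `pattern' (a POSIX extended regular expression) matches
   all of `text', 0 if it does not, and -1 on error. */
static int matches_whole(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    assert(pattern != NULL && text != NULL);

    /* Anchor the pattern so that it must match the entire string. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Under that assumption, `foo|bar*' accepts `barrr' but not `barbar',
`foo|(bar)*' accepts `barbar' but not `foobar', and `(foo|bar)*' accepts
`foobarfoo'.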
756 |
|
---|
757 | In addition to characters and ranges of characters, character classes
|
---|
758 | can also contain "character class expressions". These are expressions
|
---|
759 | enclosed inside `[:' and `:]' delimiters (which themselves must appear
|
---|
760 | between the `[' and `]' of the character class. Other elements may
|
---|
761 | occur inside the character class, too). The valid expressions are:
|
---|
762 |
|
---|
763 |
|
---|
764 | [:alnum:] [:alpha:] [:blank:]
|
---|
765 | [:cntrl:] [:digit:] [:graph:]
|
---|
766 | [:lower:] [:print:] [:punct:]
|
---|
767 | [:space:] [:upper:] [:xdigit:]
|
---|
768 |
|
---|
769 | These expressions all designate a set of characters equivalent to the
|
---|
770 | corresponding standard C `isXXX' function. For example, `[:alnum:]'
|
---|
771 | designates those characters for which `isalnum()' returns true - i.e.,
|
---|
772 | any alphabetic or numeric character. Some systems don't provide
|
---|
773 | `isblank()', so flex defines `[:blank:]' as a blank or a tab.
|
---|
774 |
|
---|
775 | For example, the following character classes are all equivalent:
|
---|
776 |
|
---|
777 |
|
---|
778 | [[:alnum:]]
|
---|
779 | [[:alpha:][:digit:]]
|
---|
780 | [[:alpha:][0-9]]
|
---|
781 | [a-zA-Z0-9]
|
---|
782 |
|
---|
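The equivalences listed above follow from the C library's own definitions. The sketch below is illustrative only and assumes the default "C" locale and an ASCII character set; the function names are hypothetical. It compares `isalnum()' with `isalpha()' and `isdigit()', spells out the range `[a-zA-Z0-9]', and shows the fallback definition of `[:blank:]' described above.

```c
#include <ctype.h>

/* Return 1 if isalnum() agrees with isalpha() || isdigit() for every
   unsigned char value, i.e. [[:alnum:]] equals [[:alpha:][:digit:]]. */
int alnum_is_alpha_or_digit(void)
{
    int c;

    for (c = 0; c <= 255; ++c)
        if ((isalnum(c) != 0) != (isalpha(c) != 0 || isdigit(c) != 0))
            return 0;
    return 1;
}

/* Membership test for the explicit range [a-zA-Z0-9] (ASCII assumed). */
int in_ascii_alnum_range(int c)
{
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9');
}

/* The fallback meaning of [:blank:] given in the text: a blank or a tab. */
int blank_or_tab(int c)
{
    return c == ' ' || c == '\t';
}
```

In other locales `isalpha()' may accept additional characters, in which case `[[:alnum:]]' and `[a-zA-Z0-9]' are no longer the same set.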
783 | Some notes on patterns are in order.
|
---|
784 |
|
---|
785 | * If your scanner is case-insensitive (the `-i' flag), then
|
---|
786 | `[:upper:]' and `[:lower:]' are equivalent to `[:alpha:]'.
|
---|
787 |
|
---|
788 | * Character classes with ranges, such as `[a-Z]', should be used with
|
---|
789 | caution in a case-insensitive scanner if the range spans upper or
|
---|
790 | lowercase characters. Flex does not know if you want to fold all
|
---|
791 | upper and lowercase characters together, or if you want the
|
---|
792 | literal numeric range specified (with no case folding). When in
|
---|
793 | doubt, flex will assume that you meant the literal numeric range,
|
---|
794 | and will issue a warning. The exception to this rule is a
|
---|
795 | character range such as `[a-z]' or `[S-W]' where it is obvious
|
---|
796 | that you want case-folding to occur. Here are some examples with
|
---|
797 | the `-i' flag enabled:
|
---|
798 |
|
---|
799 | Range    Result      Literal Range         Alternate Range
|
---|
800 | `[a-t]'  ok          `[a-tA-T]'
|
---|
801 | `[A-T]'  ok          `[a-tA-T]'
|
---|
802 | `[A-t]'  ambiguous   `[A-Z\[\\\]_`a-t]'    `[a-tA-T]'
|
---|
803 | `[_-{]'  ambiguous   `[_`a-z{]'            `[_`a-zA-Z{]'
|
---|
804 | `[@-C]'  ambiguous   `[@ABC]'              `[@A-Z\[\\\]_`abc]'
|
---|
805 |
|
---|
806 | * A negated character class such as the example `[^A-Z]' above
|
---|
807 | _will_ match a newline unless `\n' (or an equivalent escape
|
---|
808 | sequence) is one of the characters explicitly present in the
|
---|
809 | negated character class (e.g., `[^A-Z\n]'). This is unlike how
|
---|
810 | many other regular expression tools treat negated character
|
---|
811 | classes, but unfortunately the inconsistency is historically
|
---|
812 | entrenched. Matching newlines means that a pattern like `[^"]*'
|
---|
813 | can match the entire input unless there's another quote in the
|
---|
814 | input.
|
---|
815 |
|
---|
816 | * A rule can have at most one instance of trailing context (the `/'
|
---|
817 | operator or the `$' operator). The start condition, `^', and
|
---|
818 | `<<EOF>>' patterns can only occur at the beginning of a pattern,
|
---|
819 | and, like `/' and `$', cannot be grouped inside
|
---|
820 | parentheses. A `^' which does not occur at the beginning of a
|
---|
821 | rule or a `$' which does not occur at the end of a rule loses its
|
---|
822 | special properties and is treated as a normal character.
|
---|
823 |
|
---|
824 | * The following are invalid:
|
---|
825 |
|
---|
826 |
|
---|
827 | foo/bar$
|
---|
828 | <sc1>foo<sc2>bar
|
---|
829 |
|
---|
830 | Note that the first of these can be written `foo/bar\n'.
|
---|
831 |
|
---|
832 | * The following will result in `$' or `^' being treated as a normal
|
---|
833 | character:
|
---|
834 |
|
---|
835 |
|
---|
836 | foo|(bar$)
|
---|
837 | foo|^bar
|
---|
838 |
|
---|
839 | If the desired meaning is a `foo' or a
|
---|
840 | `bar'-followed-by-a-newline, the following could be used (the
|
---|
841 | special `|' action is explained below, *note Actions::):
|
---|
842 |
|
---|
843 |
|
---|
844 | foo |
|
---|
845 | bar$ /* action goes here */
|
---|
846 |
|
---|
847 | A similar trick will work for matching a `foo' or a
|
---|
848 | `bar'-at-the-beginning-of-a-line.
|
---|
849 |
|
---|
850 |
|
---|
851 | File: flex.info, Node: Matching, Next: Actions, Prev: Patterns, Up: Top
|
---|
852 |
|
---|
853 | How the Input Is Matched
|
---|
854 | ************************
|
---|
855 |
|
---|
856 | When the generated scanner is run, it analyzes its input looking for
|
---|
857 | strings which match any of its patterns. If it finds more than one
|
---|
858 | match, it takes the one matching the most text (for trailing context
|
---|
859 | rules, this includes the length of the trailing part, even though it
|
---|
860 | will then be returned to the input). If it finds two or more matches of
|
---|
861 | the same length, the rule listed first in the `flex' input file is
|
---|
862 | chosen.
|
---|
863 |
|
---|
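The selection rule just described (longest match wins, and the earliest rule wins a tie) can be sketched in a few lines of C. This is a simplified model, not the table-driven matcher that `flex' generates: each "rule" here is a literal string rather than a regular expression, and the name `pick_rule' is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Return the index of the rule that wins at the start of `input', or
   -1 if no rule matches.  The length of the winning match is stored
   in *match_len when match_len is not NULL. */
int pick_rule(const char *const *rules, size_t nrules,
              const char *input, size_t *match_len)
{
    int best = -1;
    size_t best_len = 0;
    size_t i;

    for (i = 0; i < nrules; ++i) {
        size_t len = strlen(rules[i]);

        /* The strict '>' keeps the earlier rule when lengths are equal. */
        if (len > best_len && strncmp(input, rules[i], len) == 0) {
            best = (int)i;
            best_len = len;
        }
    }
    if (match_len != NULL)
        *match_len = best_len;
    return best;
}
```

Given the rules `if', `iffy', `i', and a second `iffy', the input `iffy!' selects rule 1 (the first `iffy') with a match length of 4, while the input `if(' selects rule 0.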
864 | Once the match is determined, the text corresponding to the match
|
---|
865 | (called the "token") is made available in the global character pointer
|
---|
866 | `yytext', and its length in the global integer `yyleng'. The "action"
|
---|
867 | corresponding to the matched pattern is then executed (*note
|
---|
868 | Actions::), and then the remaining input is scanned for another match.
|
---|
869 |
|
---|
870 | If no match is found, then the "default rule" is executed: the next
|
---|
871 | character in the input is considered matched and copied to the standard
|
---|
872 | output. Thus, the simplest valid `flex' input is:
|
---|
873 |
|
---|
874 |
|
---|
875 | %%
|
---|
876 |
|
---|
877 | which generates a scanner that simply copies its input (one
|
---|
878 | character at a time) to its output.
|
---|
879 |
|
---|
880 | Note that `yytext' can be defined in two different ways: either as a
|
---|
881 | character _pointer_ or as a character _array_. You can control which
|
---|
882 | definition `flex' uses by including one of the special directives
|
---|
883 | `%pointer' or `%array' in the first (definitions) section of your flex
|
---|
884 | input. The default is `%pointer', unless you use the `-l' lex
|
---|
885 | compatibility option, in which case `yytext' will be an array. The
|
---|
886 | advantage of using `%pointer' is substantially faster scanning and no
|
---|
887 | buffer overflow when matching very large tokens (unless you run out of
|
---|
888 | dynamic memory). The disadvantage is that you are restricted in how
|
---|
889 | your actions can modify `yytext' (*note Actions::), and calls to the
|
---|
890 | `unput()' function destroy the present contents of `yytext', which can
|
---|
891 | be a considerable porting headache when moving between different `lex'
|
---|
892 | versions.
|
---|
893 |
|
---|
894 | The advantage of `%array' is that you can then modify `yytext' to
|
---|
895 | your heart's content, and calls to `unput()' do not destroy `yytext'
|
---|
896 | (*note Actions::). Furthermore, existing `lex' programs sometimes
|
---|
897 | access `yytext' externally using declarations of the form:
|
---|
898 |
|
---|
899 |
|
---|
900 | extern char yytext[];
|
---|
901 |
|
---|
902 | This definition is erroneous when used with `%pointer', but correct
|
---|
903 | for `%array'.
|
---|
904 |
|
---|
905 | The `%array' declaration defines `yytext' to be an array of `YYLMAX'
|
---|
906 | characters, which defaults to a fairly large value. You can change the
|
---|
907 | size by simply #define'ing `YYLMAX' to a different value in the first
|
---|
908 | section of your `flex' input. As mentioned above, with `%pointer'
|
---|
909 | `yytext' grows dynamically to accommodate large tokens. While this means
|
---|
910 | your `%pointer' scanner can accommodate very large tokens (such as
|
---|
911 | matching entire blocks of comments), bear in mind that each time the
|
---|
912 | scanner must resize `yytext' it also must rescan the entire token from
|
---|
913 | the beginning, so matching such tokens can prove slow. `yytext'
|
---|
914 | presently does _not_ dynamically grow if a call to `unput()' results in
|
---|
915 | too much text being pushed back; instead, a run-time error results.
|
---|
916 |
|
---|
917 | Also note that you cannot use `%array' with C++ scanner classes
|
---|
918 | (*note Cxx::).
|
---|
919 |
|
---|
920 |
|
---|
921 | File: flex.info, Node: Actions, Next: Generated Scanner, Prev: Matching, Up: Top
|
---|
922 |
|
---|
923 | Actions
|
---|
924 | *******
|
---|
925 |
|
---|
926 | Each pattern in a rule has a corresponding "action", which can be
|
---|
927 | any arbitrary C statement. The pattern ends at the first non-escaped
|
---|
928 | whitespace character; the remainder of the line is its action. If the
|
---|
929 | action is empty, then when the pattern is matched the input token is
|
---|
930 | simply discarded. For example, here is the specification for a program
|
---|
931 | which deletes all occurrences of `zap me' from its input:
|
---|
932 |
|
---|
933 |
|
---|
934 | %%
|
---|
935 | "zap me"
|
---|
936 |
|
---|
937 | This example will copy all other characters in the input to the
|
---|
938 | output since they will be matched by the default rule.
|
---|
939 |
|
---|
940 | Here is a program which compresses multiple blanks and tabs down to a
|
---|
941 | single blank, and throws away whitespace found at the end of a line:
|
---|
942 |
|
---|
943 |
|
---|
944 | %%
|
---|
945 | [ \t]+ putchar( ' ' );
|
---|
946 | [ \t]+$ /* ignore this token */
|
---|
947 |
|
---|
948 | If the action contains a `{', then the action spans till the
|
---|
949 | balancing `}' is found, and the action may cross multiple lines.
|
---|
950 | `flex' knows about C strings and comments and won't be fooled by braces
|
---|
951 | found within them, but also allows actions to begin with `%{' and will
|
---|
952 | consider the action to be all the text up to the next `%}' (regardless
|
---|
953 | of ordinary braces inside the action).
|
---|
954 |
|
---|
955 | An action consisting solely of a vertical bar (`|') means "same as
|
---|
956 | the action for the next rule". See below for an illustration.
|
---|
957 |
|
---|
958 | Actions can include arbitrary C code, including `return' statements
|
---|
959 | to return a value to whatever routine called `yylex()'. Each time
|
---|
960 | `yylex()' is called it continues processing tokens from where it last
|
---|
961 | left off until it either reaches the end of the file or executes a
|
---|
962 | return.
|
---|
963 |
|
---|
964 | Actions are free to modify `yytext' except for lengthening it
|
---|
965 | (adding characters to its end; these will overwrite later characters in
|
---|
966 | the input stream). This however does not apply when using `%array'
|
---|
967 | (*note Matching::). In that case, `yytext' may be freely modified in
|
---|
968 | any way.
|
---|
969 |
|
---|
970 | Actions are free to modify `yyleng' except they should not do so if
|
---|
971 | the action also includes use of `yymore()' (see below).
|
---|
972 |
|
---|
973 | There are a number of special directives which can be included
|
---|
974 | within an action:
|
---|
975 |
|
---|
976 | `ECHO'
|
---|
977 | copies `yytext' to the scanner's output.
|
---|
978 |
|
---|
979 | `BEGIN'
|
---|
980 | followed by the name of a start condition places the scanner in the
|
---|
981 | corresponding start condition (see below).
|
---|
982 |
|
---|
983 | `REJECT'
|
---|
984 | directs the scanner to proceed on to the "second best" rule which
|
---|
985 | matched the input (or a prefix of the input). The rule is chosen
|
---|
986 | as described above in *Note Matching::, and `yytext' and `yyleng'
|
---|
987 | set up appropriately. It may either be one which matched as much
|
---|
988 | text as the originally chosen rule but came later in the `flex'
|
---|
989 | input file, or one which matched less text. For example, the
|
---|
990 | following will both count the words in the input and call the
|
---|
991 | routine `special()' whenever `frob' is seen:
|
---|
992 |
|
---|
993 |
|
---|
994 | int word_count = 0;
|
---|
995 | %%
|
---|
996 |
|
---|
997 | frob special(); REJECT;
|
---|
998 | [^ \t\n]+ ++word_count;
|
---|
999 |
|
---|
1000 | Without the `REJECT', any occurrences of `frob' in the input would
|
---|
1001 | not be counted as words, since the scanner normally executes only
|
---|
1002 | one action per token. Multiple uses of `REJECT' are allowed, each
|
---|
1003 | one finding the next best choice to the currently active rule. For
|
---|
1004 | example, when the following scanner scans the token `abcd', it will
|
---|
1005 | write `abcdabcaba' to the output:
|
---|
1006 |
|
---|
1007 |
|
---|
1008 | %%
|
---|
1009 | a |
|
---|
1010 | ab |
|
---|
1011 | abc |
|
---|
1012 | abcd ECHO; REJECT;
|
---|
1013 | .|\n /* eat up any unmatched character */
|
---|
1014 |
|
---|
1015 | The first three rules share the fourth's action since they use the
|
---|
1016 | special `|' action.
|
---|
1017 |
|
---|
1018 | `REJECT' is a particularly expensive feature in terms of scanner
|
---|
1019 | performance; if it is used in _any_ of the scanner's actions it
|
---|
1020 | will slow down _all_ of the scanner's matching. Furthermore,
|
---|
1021 | `REJECT' cannot be used with the `-Cf' or `-CF' options (*note
|
---|
1022 | Scanner Options::).
|
---|
1023 |
|
---|
1024 | Note also that unlike the other special actions, `REJECT' is a
|
---|
1025 | _branch_. Code immediately following it in the action will _not_
|
---|
1026 | be executed.
|
---|
1027 |
|
---|
1028 | `yymore()'
|
---|
1029 | tells the scanner that the next time it matches a rule, the
|
---|
1030 | corresponding token should be _appended_ onto the current value of
|
---|
1031 | `yytext' rather than replacing it. For example, given the input
|
---|
1032 | `mega-kludge' the following will write `mega-mega-kludge' to the
|
---|
1033 | output:
|
---|
1034 |
|
---|
1035 |
|
---|
1036 | %%
|
---|
1037 | mega- ECHO; yymore();
|
---|
1038 | kludge ECHO;
|
---|
1039 |
|
---|
1040 | First `mega-' is matched and echoed to the output. Then `kludge'
|
---|
1041 | is matched, but the previous `mega-' is still hanging around at the
|
---|
1042 | beginning of `yytext' so the `ECHO' for the `kludge' rule will
|
---|
1043 | actually write `mega-kludge'.
|
---|
1044 |
|
---|
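To see why the `REJECT' scanner shown above writes `abcdabcaba' for the token `abcd', the sequence of fallbacks can be modelled in plain C. This is only a sketch of that one scanner, with its four literal rules hard-coded; the function name `scan_with_reject' is hypothetical, and the code does not reflect how `flex' implements `REJECT' internally.

```c
#include <stddef.h>
#include <string.h>

static const char *const reject_rules[] = { "a", "ab", "abc", "abcd" };
enum { NUM_REJECT_RULES = 4 };

/* Model the scanner on `input', collecting what ECHO would write
   into `out' (always NUL-terminated when outsize > 0). */
void scan_with_reject(const char *input, char *out, size_t outsize)
{
    size_t used = 0;

    if (outsize == 0)
        return;
    out[0] = '\0';
    while (*input != '\0') {
        size_t len;

        /* Visit candidate matches from longest to shortest; each
           REJECT hands control to the next best one. */
        for (len = strlen(input); len > 0; --len) {
            int r;

            for (r = 0; r < NUM_REJECT_RULES; ++r) {
                if (strlen(reject_rules[r]) == len &&
                    strncmp(input, reject_rules[r], len) == 0 &&
                    used + len < outsize) {
                    memcpy(out + used, input, len);   /* ECHO */
                    used += len;
                    out[used] = '\0';
                }
            }
        }
        /* The last candidate is the catch-all rule `.|\n', which
           consumes one character without echoing it. */
        ++input;
    }
}
```

At the first character all four rules match, so `abcd', `abc', `ab', and `a' are echoed in turn before the catch-all rule consumes `a'. At `b', `c', and `d' only the catch-all rule matches, so nothing further is written.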
1045 | Two notes regarding use of `yymore()'. First, `yymore()' depends on
|
---|
1046 | the value of `yyleng' correctly reflecting the size of the current
|
---|
1047 | token, so you must not modify `yyleng' if you are using `yymore()'.
|
---|
1048 | Second, the presence of `yymore()' in the scanner's action entails a
|
---|
1049 | minor performance penalty in the scanner's matching speed.
|
---|
1050 |
|
---|
1051 | `yyless(n)' returns all but the first `n' characters of the current
|
---|
1052 | token back to the input stream, where they will be rescanned when the
|
---|
1053 | scanner looks for the next match. `yytext' and `yyleng' are adjusted
|
---|
1054 | appropriately (e.g., `yyleng' will now be equal to `n'). For example,
|
---|
1055 | on the input `foobar' the following will write out `foobarbar':
|
---|
1056 |
|
---|
1057 |
|
---|
1058 | %%
|
---|
1059 | foobar ECHO; yyless(3);
|
---|
1060 | [a-z]+ ECHO;
|
---|
1061 |
|
---|
1062 | An argument of 0 to `yyless()' will cause the entire current input
|
---|
1063 | string to be scanned again. Unless you've changed how the scanner will
|
---|
1064 | subsequently process its input (using `BEGIN', for example), this will
|
---|
1065 | result in an endless loop.
|
---|
1066 |
|
---|
1067 | Note that `yyless()' is a macro and can only be used in the flex
|
---|
1068 | input file, not from other source files.
|
---|
1069 |
|
---|
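The effect of `yyless(3)' in the `foobar' example above can be traced with a small model written in ordinary C. It is a sketch under simplifying assumptions: the two rules are hard-coded, the "C" locale is assumed so that `islower()' corresponds to `[a-z]', and the name `scan_with_yyless' is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Model of the scanner
       foobar   ECHO; yyless(3);
       [a-z]+   ECHO;
   with the default rule copying any other character.  What would be
   written to the output is collected in `out'. */
void scan_with_yyless(const char *input, char *out, size_t outsize)
{
    size_t n = strlen(input);
    size_t pos = 0, used = 0;

    if (outsize == 0)
        return;
    while (pos < n) {
        size_t len = 0, i;
        int is_foobar;

        /* [a-z]+ matches the longest run of lowercase letters. */
        while (pos + len < n && islower((unsigned char)input[pos + len]))
            ++len;
        /* `foobar' wins only when it ties with the run; it is listed first. */
        is_foobar = (len == 6 && strncmp(input + pos, "foobar", 6) == 0);
        if (len == 0)
            len = 1;                      /* default rule: one character */

        for (i = 0; i < len && used + 1 < outsize; ++i)
            out[used++] = input[pos + i]; /* ECHO */

        /* yyless(3) keeps three characters and returns the rest. */
        pos += is_foobar ? 3 : len;
    }
    out[used] = '\0';
}
```

For the input `foobar' the first rule echoes all six characters and then gives back `bar', which the second rule matches and echoes again, producing `foobarbar'.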
1070 | `unput(c)' puts the character `c' back onto the input stream. It
|
---|
1071 | will be the next character scanned. The following action will take the
|
---|
1072 | current token and cause it to be rescanned enclosed in parentheses.
|
---|
1073 |
|
---|
1074 |
|
---|
1075 | {
|
---|
1076 | int i;
|
---|
1077 | /* Copy yytext because unput() trashes yytext */
|
---|
1078 | char *yycopy = strdup( yytext );
|
---|
1079 | unput( ')' );
|
---|
1080 | for ( i = yyleng - 1; i >= 0; --i )
|
---|
1081 | unput( yycopy[i] );
|
---|
1082 | unput( '(' );
|
---|
1083 | free( yycopy );
|
---|
1084 | }
|
---|
1085 |
|
---|
1086 | Note that since each `unput()' puts the given character back at the
|
---|
1087 | _beginning_ of the input stream, pushing back strings must be done
|
---|
1088 | back-to-front.
|
---|
1089 |
|
---|
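The back-to-front order can be demonstrated with a pushback stack that stands in for the scanner's input stream. This is a minimal sketch: `sketch_unput()' and `sketch_input()' are hypothetical stand-ins written for this illustration, not the `unput()' and `input()' that `flex' provides.

```c
#include <stddef.h>
#include <string.h>

enum { PUSHBACK_MAX = 64 };
static char pushback[PUSHBACK_MAX];
static size_t pushback_len;

/* Stand-in for unput(): the pushed character becomes the next one
   read.  Returns 0 on success, -1 if the stack is full. */
int sketch_unput(int c)
{
    if (pushback_len == PUSHBACK_MAX)
        return -1;
    pushback[pushback_len++] = (char)c;
    return 0;
}

/* Stand-in for input(): returns the most recently pushed character,
   or -1 when nothing is left. */
int sketch_input(void)
{
    if (pushback_len == 0)
        return -1;
    return (unsigned char)pushback[--pushback_len];
}

/* Push `token' back enclosed in parentheses, last character first. */
void unput_parenthesized(const char *token)
{
    size_t i = strlen(token);

    sketch_unput(')');
    while (i > 0)
        sketch_unput(token[--i]);
    sketch_unput('(');
}
```

After `unput_parenthesized("abc")', successive calls to `sketch_input()' return `(', `a', `b', `c', and `)' in that order, because the character pushed last is read first.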
1090 | An important potential problem when using `unput()' is that if you
|
---|
1091 | are using `%pointer' (the default), a call to `unput()' _destroys_ the
|
---|
1092 | contents of `yytext', starting with its rightmost character and
|
---|
1093 | devouring one character to the left with each call. If you need the
|
---|
1094 | value of `yytext' preserved after a call to `unput()' (as in the above
|
---|
1095 | example), you must either first copy it elsewhere, or build your
|
---|
1096 | scanner using `%array' instead (*note Matching::).
|
---|
1097 |
|
---|
1098 | Finally, note that you cannot put back `EOF' to attempt to mark the
|
---|
1099 | input stream with an end-of-file.
|
---|
1100 |
|
---|
1101 | `input()' reads the next character from the input stream. For
|
---|
1102 | example, the following is one way to eat up C comments:
|
---|
1103 |
|
---|
1104 |
|
---|
1105 | %%
|
---|
1106 | "/*" {
|
---|
1107 | register int c;
|
---|
1108 |
|
---|
1109 | for ( ; ; )
|
---|
1110 | {
|
---|
1111 | while ( (c = input()) != '*' &&
|
---|
1112 | c != EOF )
|
---|
1113 | ; /* eat up text of comment */
|
---|
1114 |
|
---|
1115 | if ( c == '*' )
|
---|
1116 | {
|
---|
1117 | while ( (c = input()) == '*' )
|
---|
1118 | ;
|
---|
1119 | if ( c == '/' )
|
---|
1120 | break; /* found the end */
|
---|
1121 | }
|
---|
1122 |
|
---|
1123 | if ( c == EOF )
|
---|
1124 | {
|
---|
1125 | error( "EOF in comment" );
|
---|
1126 | break;
|
---|
1127 | }
|
---|
1128 | }
|
---|
1129 | }
|
---|
1130 |
|
---|
1131 | (Note that if the scanner is compiled using `C++', then `input()' is
|
---|
1132 | instead referred to as `yyinput()', in order to avoid a name clash with
|
---|
1133 | the `C++' stream by the name of `input'.)
|
---|
1134 |
|
---|
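The loop in the comment-eating action above can also be exercised outside a scanner. The sketch below applies the same logic to a NUL-terminated string instead of calling `input()', with the end of the string playing the role of `EOF'. The function name `skip_c_comment' is hypothetical.

```c
#include <stddef.h>

/* `p' points just past the opening slash-star of a comment.  Return
   a pointer to the first character after the closing star-slash, or
   NULL if the string ends before the comment does. */
const char *skip_c_comment(const char *p)
{
    for (;;) {
        while (*p != '*' && *p != '\0')
            ++p;                /* eat up text of comment */
        if (*p == '\0')
            return NULL;        /* end of input inside the comment */
        while (*p == '*')
            ++p;                /* skip a run of stars */
        if (*p == '/')
            return p + 1;       /* found the end */
    }
}
```

As in the action above, a run of stars that is not followed by `/' simply sends the loop back to looking for the next star.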
1135 | `YY_FLUSH_BUFFER()' flushes the scanner's internal buffer so that
|
---|
1136 | the next time the scanner attempts to match a token, it will first
|
---|
1137 | refill the buffer using `YY_INPUT()' (*note Generated Scanner::). This
|
---|
1138 | action is a special case of the more general `yy_flush_buffer()'
|
---|
1139 | function, described below (*note Multiple Input Buffers::).
|
---|
1140 |
|
---|
1141 | `yyterminate()' can be used in lieu of a return statement in an
|
---|
1142 | action. It terminates the scanner and returns a 0 to the scanner's
|
---|
1143 | caller, indicating "all done". By default, `yyterminate()' is also
|
---|
1144 | called when an end-of-file is encountered. It is a macro and may be
|
---|
1145 | redefined.
|
---|
1146 |
|
---|
1147 |
|
---|
1148 | File: flex.info, Node: Generated Scanner, Next: Start Conditions, Prev: Actions, Up: Top
|
---|
1149 |
|
---|
1150 | The Generated Scanner
|
---|
1151 | *********************
|
---|
1152 |
|
---|
1153 | The output of `flex' is the file `lex.yy.c', which contains the
|
---|
1154 | scanning routine `yylex()', a number of tables used by it for matching
|
---|
1155 | tokens, and a number of auxiliary routines and macros. By default,
|
---|
1156 | `yylex()' is declared as follows:
|
---|
1157 |
|
---|
1158 |
|
---|
1159 | int yylex()
|
---|
1160 | {
|
---|
1161 | ... various definitions and the actions in here ...
|
---|
1162 | }
|
---|
1163 |
|
---|
1164 | (If your environment supports function prototypes, then it will be
|
---|
1165 | `int yylex( void )'.) This definition may be changed by defining the
|
---|
1166 | `YY_DECL' macro. For example, you could use:
|
---|
1167 |
|
---|
1168 |
|
---|
1169 | #define YY_DECL float lexscan( a, b ) float a, b;
|
---|
1170 |
|
---|
1171 | to give the scanning routine the name `lexscan', returning a float,
|
---|
1172 | and taking two floats as arguments. Note that if you give arguments to
|
---|
1173 | the scanning routine using a K&R-style/non-prototyped function
|
---|
1174 | declaration, you must terminate the definition with a semi-colon (;).
|
---|
1175 |
|
---|
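For comparison, here is a prototype-style counterpart of the `lexscan' definition above, which needs no trailing semi-colon. It is a sketch that compiles on its own: in a real scanner `flex' supplies the function body after `YY_DECL', and the body shown here is only a placeholder so that the expansion of the macro can be seen.

```c
/* Prototype-style definition; no trailing semi-colon is required. */
#define YY_DECL float lexscan(float a, float b)

/* The macro is used as the header of the scanning routine.  The body
   below is a placeholder, not generated scanner code. */
YY_DECL
{
    return a + b;
}
```

A caller would then invoke `lexscan( 1.5f, 2.0f )' wherever it would otherwise have called `yylex()'.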
1176 | `flex' generates `C99' function definitions by default. However, flex
|
---|
1177 | does have the ability to generate obsolete, er, `traditional', function
|
---|
1178 | definitions. This is to support bootstrapping gcc on old systems.
|
---|
1179 | Unfortunately, traditional definitions prevent us from using any
|
---|
1180 | standard data types smaller than int (such as short, char, or bool) as
|
---|
1181 | function arguments. For this reason, future versions of `flex' may
|
---|
1182 | generate standard C99 code only, leaving K&R-style functions to the
|
---|
1183 | historians. Currently, if you do *not* want `C99' definitions, then
|
---|
1184 | you must use `%option noansi-definitions'.
|
---|
1185 |
|
---|
1186 | Whenever `yylex()' is called, it scans tokens from the global input
|
---|
1187 | file `yyin' (which defaults to stdin). It continues until it either
|
---|
1188 | reaches an end-of-file (at which point it returns the value 0) or one
|
---|
1189 | of its actions executes a `return' statement.
|
---|
1190 |
|
---|
1191 | If the scanner reaches an end-of-file, subsequent calls are undefined
|
---|
1192 | unless either `yyin' is pointed at a new input file (in which case
|
---|
1193 | scanning continues from that file), or `yyrestart()' is called.
|
---|
1194 | `yyrestart()' takes one argument, a `FILE *' pointer (which can be
|
---|
1195 | NULL, if you've set up `YY_INPUT' to scan from a source other than
|
---|
1196 | `yyin'), and initializes `yyin' for scanning from that file.
|
---|
1197 | Essentially there is no difference between just assigning `yyin' to a
|
---|
1198 | new input file or using `yyrestart()' to do so; the latter is available
|
---|
1199 | for compatibility with previous versions of `flex', and because it can
|
---|
1200 | be used to switch input files in the middle of scanning. It can also
|
---|
1201 | be used to throw away the current input buffer, by calling it with an
|
---|
1202 | argument of `yyin'; but it would be better to use `YY_FLUSH_BUFFER'
|
---|
1203 | (*note Actions::). Note that `yyrestart()' does _not_ reset the start
|
---|
1204 | condition to `INITIAL' (*note Start Conditions::).
|
---|
1205 |
|
---|
1206 | If `yylex()' stops scanning due to executing a `return' statement in
|
---|
1207 | one of the actions, the scanner may then be called again and it will
|
---|
1208 | resume scanning where it left off.
|
---|
1209 |
|
---|
1210 | By default (and for purposes of efficiency), the scanner uses
|
---|
1211 | block-reads rather than simple `getc()' calls to read characters from
|
---|
1212 | `yyin'. The nature of how it gets its input can be controlled by
|
---|
1213 | defining the `YY_INPUT' macro. The calling sequence for `YY_INPUT()'
|
---|
1214 | is `YY_INPUT(buf,result,max_size)'. Its action is to place up to
|
---|
1215 | `max_size' characters in the character array `buf' and return in the
|
---|
1216 | integer variable `result' either the number of characters read or the
|
---|
1217 | constant `YY_NULL' (0 on Unix systems) to indicate `EOF'. The default
|
---|
1218 | `YY_INPUT' reads from the global file-pointer `yyin'.
|
---|
1219 |
|
---|
1220 | Here is a sample definition of `YY_INPUT' (in the definitions
|
---|
1221 | section of the input file):
|
---|
1222 |
|
---|
1223 |
|
---|
1224 | %{
|
---|
1225 | #define YY_INPUT(buf,result,max_size) \
|
---|
1226 | { \
|
---|
1227 | int c = getchar(); \
|
---|
1228 | result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
|
---|
1229 | }
|
---|
1230 | %}
|
---|
1231 |
|
---|
1232 | This definition will change the input processing to occur one
|
---|
1233 | character at a time.
|
---|
1234 |
|
---|
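As a further illustration of the `YY_INPUT(buf,result,max_size)' calling sequence, the following sketch fills `buf' from an in-memory string, up to `max_size' characters per call. It is written to compile on its own, so it defines `YY_NULL' itself (in a real scanner `flex' already defines it). The names `src_ptr', `src_left', `set_source', and `drain' are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define YY_NULL 0   /* defined by flex in a real scanner; repeated here */

static const char *src_ptr = "";   /* next unread character */
static size_t src_left = 0;        /* characters still unread */

/* Choose the string that subsequent YY_INPUT calls will read from. */
void set_source(const char *s)
{
    src_ptr = s;
    src_left = strlen(s);
}

/* Copy at most max_size characters into buf; result receives the
   count, or YY_NULL once the source is exhausted. */
#define YY_INPUT(buf, result, max_size)                                  \
    do {                                                                 \
        size_t n_ = src_left < (size_t)(max_size) ? src_left             \
                                                  : (size_t)(max_size);  \
        memcpy((buf), src_ptr, n_);                                      \
        src_ptr += n_;                                                   \
        src_left -= n_;                                                  \
        (result) = n_ > 0 ? (int)n_ : YY_NULL;                           \
    } while (0)

/* Read the whole source through YY_INPUT in pieces of `chunk'
   characters; `dst' must be large enough to hold all of it. */
size_t drain(char *dst, size_t chunk)
{
    size_t total = 0;
    int got;

    for (;;) {
        YY_INPUT(dst + total, got, chunk);
        if (got == YY_NULL)
            break;
        total += (size_t)got;
    }
    return total;
}
```

Here `drain()' plays the part of the scanner, which calls `YY_INPUT' again each time its buffer needs refilling and treats `YY_NULL' as end-of-file.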
1235 | When the scanner receives an end-of-file indication from `YY_INPUT', it
|
---|
1236 | then checks the `yywrap()' function. If `yywrap()' returns false
|
---|
1237 | (zero), then it is assumed that the function has gone ahead and set up
|
---|
1238 | `yyin' to point to another input file, and scanning continues. If it
|
---|
1239 | returns true (non-zero), then the scanner terminates, returning 0 to
|
---|
1240 | its caller. Note that in either case, the start condition remains
|
---|
1241 | unchanged; it does _not_ revert to `INITIAL'.
|
---|
1242 |
|
---|
1243 | If you do not supply your own version of `yywrap()', then you must
|
---|
1244 | either use `%option noyywrap' (in which case the scanner behaves as
|
---|
1245 | though `yywrap()' returned 1), or you must link with `-lfl' to obtain
|
---|
1246 | the default version of the routine, which always returns 1.
|
---|
1247 |
|
---|
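A user-supplied `yywrap()' that moves on to the next of several input files might look like the sketch below. It assumes a hypothetical NULL-terminated list of file names set up by the caller (`pending_files' and `set_pending_files' are not part of `flex'), and it declares `yyin' itself only so that the sketch compiles alone; in a real scanner `yyin' is already defined by the generated code.

```c
#include <stdio.h>

FILE *yyin;   /* provided by flex in a real scanner */

/* Hypothetical list of files still to be scanned, NULL-terminated. */
static const char *const *pending_files;

void set_pending_files(const char *const *list)
{
    pending_files = list;
}

/* Return 0 after pointing yyin at the next file that can be opened,
   or 1 when there is nothing left to scan. */
int yywrap(void)
{
    while (pending_files != NULL && *pending_files != NULL) {
        FILE *f = fopen(*pending_files++, "r");

        if (f != NULL) {
            if (yyin != NULL && yyin != stdin)
                fclose(yyin);
            yyin = f;
            return 0;   /* false: scanning continues from the new file */
        }
    }
    return 1;           /* true: the scanner terminates */
}
```

Files that cannot be opened are skipped silently in this sketch; a real program would probably want to report them.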
1248 | For scanning from in-memory buffers (e.g., scanning strings), see
|
---|
1249 | *Note Scanning Strings::. *Note Multiple Input Buffers::.
|
---|
1250 |
|
---|
1251 | The scanner writes its `ECHO' output to the `yyout' global (default,
|
---|
1252 | `stdout'), which may be redefined by the user simply by assigning it to
|
---|
1253 | some other `FILE' pointer.
|
---|
1254 |
|
---|