1 | \chapter{Python compiler package \label{compiler}}
|
---|
2 |
|
---|
3 | \sectionauthor{Jeremy Hylton}{jeremy@zope.com}
|
---|
4 |
|
---|
5 |
|
---|
6 | The Python compiler package is a tool for analyzing Python source code
|
---|
7 | and generating Python bytecode. The compiler contains libraries to
|
---|
8 | generate an abstract syntax tree from Python source code and to
|
---|
9 | generate Python bytecode from the tree.
|
---|
10 |
|
---|
11 | The \refmodule{compiler} package is a Python source to bytecode
|
---|
12 | translator written in Python. It uses the built-in parser and
|
---|
13 | standard \refmodule{parser} module to generated a concrete syntax
|
---|
14 | tree. This tree is used to generate an abstract syntax tree (AST) and
|
---|
15 | then Python bytecode.
|
---|
16 |
|
---|
17 | The full functionality of the package duplicates the builtin compiler
|
---|
18 | provided with the Python interpreter. It is intended to match its
|
---|
19 | behavior almost exactly. Why implement another compiler that does the
|
---|
20 | same thing? The package is useful for a variety of purposes. It can
|
---|
21 | be modified more easily than the builtin compiler. The AST it
|
---|
22 | generates is useful for analyzing Python source code.
|
---|
23 |
|
---|
24 | This chapter explains how the various components of the
|
---|
25 | \refmodule{compiler} package work. It blends reference material with
|
---|
26 | a tutorial.
|
---|
27 |
|
---|
28 | The following modules are part of the \refmodule{compiler} package:
|
---|
29 |
|
---|
30 | \localmoduletable
|
---|
31 |
|
---|
32 |
|
---|
33 | \section{The basic interface}
|
---|
34 |
|
---|
35 | \declaremodule{}{compiler}
|
---|
36 |
|
---|
37 | The top-level of the package defines four functions. If you import
|
---|
38 | \module{compiler}, you will get these functions and a collection of
|
---|
39 | modules contained in the package.
|
---|
40 |
|
---|
41 | \begin{funcdesc}{parse}{buf}
|
---|
42 | Returns an abstract syntax tree for the Python source code in \var{buf}.
|
---|
43 | The function raises \exception{SyntaxError} if there is an error in the
|
---|
44 | source code. The return value is a \class{compiler.ast.Module} instance
|
---|
45 | that contains the tree.
|
---|
46 | \end{funcdesc}
|
---|
47 |
|
---|
48 | \begin{funcdesc}{parseFile}{path}
|
---|
49 | Return an abstract syntax tree for the Python source code in the file
|
---|
50 | specified by \var{path}. It is equivalent to
|
---|
51 | \code{parse(open(\var{path}).read())}.
|
---|
52 | \end{funcdesc}
|
---|
53 |
|
---|
54 | \begin{funcdesc}{walk}{ast, visitor\optional{, verbose}}
|
---|
55 | Do a pre-order walk over the abstract syntax tree \var{ast}. Call the
|
---|
56 | appropriate method on the \var{visitor} instance for each node
|
---|
57 | encountered.
|
---|
58 | \end{funcdesc}
|
---|
59 |
|
---|
60 | \begin{funcdesc}{compile}{source, filename, mode, flags=None,
|
---|
61 | dont_inherit=None}
|
---|
62 | Compile the string \var{source}, a Python module, statement or
|
---|
63 | expression, into a code object that can be executed by the exec
|
---|
64 | statement or \function{eval()}. This function is a replacement for the
|
---|
65 | built-in \function{compile()} function.
|
---|
66 |
|
---|
67 | The \var{filename} will be used for run-time error messages.
|
---|
68 |
|
---|
69 | The \var{mode} must be 'exec' to compile a module, 'single' to compile a
|
---|
70 | single (interactive) statement, or 'eval' to compile an expression.
|
---|
71 |
|
---|
72 | The \var{flags} and \var{dont_inherit} arguments affect future-related
|
---|
73 | statements, but are not supported yet.
|
---|
74 | \end{funcdesc}
|
---|
75 |
|
---|
76 | \begin{funcdesc}{compileFile}{source}
|
---|
77 | Compiles the file \var{source} and generates a .pyc file.
|
---|
78 | \end{funcdesc}
|
---|
79 |
|
---|
80 | The \module{compiler} package contains the following modules:
|
---|
81 | \refmodule[compiler.ast]{ast}, \module{consts}, \module{future},
|
---|
82 | \module{misc}, \module{pyassem}, \module{pycodegen}, \module{symbols},
|
---|
83 | \module{transformer}, and \refmodule[compiler.visitor]{visitor}.
|
---|
84 |
|
---|
85 | \section{Limitations}
|
---|
86 |
|
---|
87 | There are some problems with the error checking of the compiler
|
---|
88 | package. The interpreter detects syntax errors in two distinct
|
---|
89 | phases. One set of errors is detected by the interpreter's parser,
|
---|
90 | the other set by the compiler. The compiler package relies on the
|
---|
91 | interpreter's parser, so it get the first phases of error checking for
|
---|
92 | free. It implements the second phase itself, and that implementation is
|
---|
93 | incomplete. For example, the compiler package does not raise an error
|
---|
94 | if a name appears more than once in an argument list:
|
---|
95 | \code{def f(x, x): ...}
|
---|
96 |
|
---|
97 | A future version of the compiler should fix these problems.
|
---|
98 |
|
---|
99 | \section{Python Abstract Syntax}
|
---|
100 |
|
---|
101 | The \module{compiler.ast} module defines an abstract syntax for
|
---|
102 | Python. In the abstract syntax tree, each node represents a syntactic
|
---|
103 | construct. The root of the tree is \class{Module} object.
|
---|
104 |
|
---|
105 | The abstract syntax offers a higher level interface to parsed Python
|
---|
106 | source code. The \ulink{\module{parser}}
|
---|
107 | {http://www.python.org/doc/current/lib/module-parser.html}
|
---|
108 | module and the compiler written in C for the Python interpreter use a
|
---|
109 | concrete syntax tree. The concrete syntax is tied closely to the
|
---|
110 | grammar description used for the Python parser. Instead of a single
|
---|
111 | node for a construct, there are often several levels of nested nodes
|
---|
112 | that are introduced by Python's precedence rules.
|
---|
113 |
|
---|
114 | The abstract syntax tree is created by the
|
---|
115 | \module{compiler.transformer} module. The transformer relies on the
|
---|
116 | builtin Python parser to generate a concrete syntax tree. It
|
---|
117 | generates an abstract syntax tree from the concrete tree.
|
---|
118 |
|
---|
119 | The \module{transformer} module was created by Greg
|
---|
120 | Stein\index{Stein, Greg} and Bill Tutt\index{Tutt, Bill} for an
|
---|
121 | experimental Python-to-C compiler. The current version contains a
|
---|
122 | number of modifications and improvements, but the basic form of the
|
---|
123 | abstract syntax and of the transformer are due to Stein and Tutt.
|
---|
124 |
|
---|
125 | \subsection{AST Nodes}
|
---|
126 |
|
---|
127 | \declaremodule{}{compiler.ast}
|
---|
128 |
|
---|
129 | The \module{compiler.ast} module is generated from a text file that
|
---|
130 | describes each node type and its elements. Each node type is
|
---|
131 | represented as a class that inherits from the abstract base class
|
---|
132 | \class{compiler.ast.Node} and defines a set of named attributes for
|
---|
133 | child nodes.
|
---|
134 |
|
---|
135 | \begin{classdesc}{Node}{}
|
---|
136 |
|
---|
137 | The \class{Node} instances are created automatically by the parser
|
---|
138 | generator. The recommended interface for specific \class{Node}
|
---|
139 | instances is to use the public attributes to access child nodes. A
|
---|
140 | public attribute may be bound to a single node or to a sequence of
|
---|
141 | nodes, depending on the \class{Node} type. For example, the
|
---|
142 | \member{bases} attribute of the \class{Class} node, is bound to a
|
---|
143 | list of base class nodes, and the \member{doc} attribute is bound to
|
---|
144 | a single node.
|
---|
145 |
|
---|
146 | Each \class{Node} instance has a \member{lineno} attribute which may
|
---|
147 | be \code{None}. XXX Not sure what the rules are for which nodes
|
---|
148 | will have a useful lineno.
|
---|
149 | \end{classdesc}
|
---|
150 |
|
---|
151 | All \class{Node} objects offer the following methods:
|
---|
152 |
|
---|
153 | \begin{methoddesc}{getChildren}{}
|
---|
154 | Returns a flattened list of the child nodes and objects in the
|
---|
155 | order they occur. Specifically, the order of the nodes is the
|
---|
156 | order in which they appear in the Python grammar. Not all of the
|
---|
157 | children are \class{Node} instances. The names of functions and
|
---|
158 | classes, for example, are plain strings.
|
---|
159 | \end{methoddesc}
|
---|
160 |
|
---|
161 | \begin{methoddesc}{getChildNodes}{}
|
---|
162 | Returns a flattened list of the child nodes in the order they
|
---|
163 | occur. This method is like \method{getChildren()}, except that it
|
---|
164 | only returns those children that are \class{Node} instances.
|
---|
165 | \end{methoddesc}
|
---|
166 |
|
---|
167 | Two examples illustrate the general structure of \class{Node}
|
---|
168 | classes. The \keyword{while} statement is defined by the following
|
---|
169 | grammar production:
|
---|
170 |
|
---|
171 | \begin{verbatim}
|
---|
172 | while_stmt: "while" expression ":" suite
|
---|
173 | ["else" ":" suite]
|
---|
174 | \end{verbatim}
|
---|
175 |
|
---|
176 | The \class{While} node has three attributes: \member{test},
|
---|
177 | \member{body}, and \member{else_}. (If the natural name for an
|
---|
178 | attribute is also a Python reserved word, it can't be used as an
|
---|
179 | attribute name. An underscore is appended to the word to make it a
|
---|
180 | legal identifier, hence \member{else_} instead of \keyword{else}.)
|
---|
181 |
|
---|
182 | The \keyword{if} statement is more complicated because it can include
|
---|
183 | several tests.
|
---|
184 |
|
---|
185 | \begin{verbatim}
|
---|
186 | if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
|
---|
187 | \end{verbatim}
|
---|
188 |
|
---|
189 | The \class{If} node only defines two attributes: \member{tests} and
|
---|
190 | \member{else_}. The \member{tests} attribute is a sequence of test
|
---|
191 | expression, consequent body pairs. There is one pair for each
|
---|
192 | \keyword{if}/\keyword{elif} clause. The first element of the pair is
|
---|
193 | the test expression. The second elements is a \class{Stmt} node that
|
---|
194 | contains the code to execute if the test is true.
|
---|
195 |
|
---|
196 | The \method{getChildren()} method of \class{If} returns a flat list of
|
---|
197 | child nodes. If there are three \keyword{if}/\keyword{elif} clauses
|
---|
198 | and no \keyword{else} clause, then \method{getChildren()} will return
|
---|
199 | a list of six elements: the first test expression, the first
|
---|
200 | \class{Stmt}, the second text expression, etc.
|
---|
201 |
|
---|
202 | The following table lists each of the \class{Node} subclasses defined
|
---|
203 | in \module{compiler.ast} and each of the public attributes available
|
---|
204 | on their instances. The values of most of the attributes are
|
---|
205 | themselves \class{Node} instances or sequences of instances. When the
|
---|
206 | value is something other than an instance, the type is noted in the
|
---|
207 | comment. The attributes are listed in the order in which they are
|
---|
208 | returned by \method{getChildren()} and \method{getChildNodes()}.
|
---|
209 |
|
---|
210 | \input{asttable}
|
---|
211 |
|
---|
212 |
|
---|
213 | \subsection{Assignment nodes}
|
---|
214 |
|
---|
215 | There is a collection of nodes used to represent assignments. Each
|
---|
216 | assignment statement in the source code becomes a single
|
---|
217 | \class{Assign} node in the AST. The \member{nodes} attribute is a
|
---|
218 | list that contains a node for each assignment target. This is
|
---|
219 | necessary because assignment can be chained, e.g. \code{a = b = 2}.
|
---|
220 | Each \class{Node} in the list will be one of the following classes:
|
---|
221 | \class{AssAttr}, \class{AssList}, \class{AssName}, or
|
---|
222 | \class{AssTuple}.
|
---|
223 |
|
---|
224 | Each target assignment node will describe the kind of object being
|
---|
225 | assigned to: \class{AssName} for a simple name, e.g. \code{a = 1}.
|
---|
226 | \class{AssAttr} for an attribute assigned, e.g. \code{a.x = 1}.
|
---|
227 | \class{AssList} and \class{AssTuple} for list and tuple expansion
|
---|
228 | respectively, e.g. \code{a, b, c = a_tuple}.
|
---|
229 |
|
---|
230 | The target assignment nodes also have a \member{flags} attribute that
|
---|
231 | indicates whether the node is being used for assignment or in a delete
|
---|
232 | statement. The \class{AssName} is also used to represent a delete
|
---|
233 | statement, e.g. \class{del x}.
|
---|
234 |
|
---|
235 | When an expression contains several attribute references, an
|
---|
236 | assignment or delete statement will contain only one \class{AssAttr}
|
---|
237 | node -- for the final attribute reference. The other attribute
|
---|
238 | references will be represented as \class{Getattr} nodes in the
|
---|
239 | \member{expr} attribute of the \class{AssAttr} instance.
|
---|
240 |
|
---|
241 | \subsection{Examples}
|
---|
242 |
|
---|
243 | This section shows several simple examples of ASTs for Python source
|
---|
244 | code. The examples demonstrate how to use the \function{parse()}
|
---|
245 | function, what the repr of an AST looks like, and how to access
|
---|
246 | attributes of an AST node.
|
---|
247 |
|
---|
248 | The first module defines a single function. Assume it is stored in
|
---|
249 | \file{/tmp/doublelib.py}.
|
---|
250 |
|
---|
251 | \begin{verbatim}
|
---|
252 | """This is an example module.
|
---|
253 |
|
---|
254 | This is the docstring.
|
---|
255 | """
|
---|
256 |
|
---|
257 | def double(x):
|
---|
258 | "Return twice the argument"
|
---|
259 | return x * 2
|
---|
260 | \end{verbatim}
|
---|
261 |
|
---|
262 | In the interactive interpreter session below, I have reformatted the
|
---|
263 | long AST reprs for readability. The AST reprs use unqualified class
|
---|
264 | names. If you want to create an instance from a repr, you must import
|
---|
265 | the class names from the \module{compiler.ast} module.
|
---|
266 |
|
---|
267 | \begin{verbatim}
|
---|
268 | >>> import compiler
|
---|
269 | >>> mod = compiler.parseFile("/tmp/doublelib.py")
|
---|
270 | >>> mod
|
---|
271 | Module('This is an example module.\n\nThis is the docstring.\n',
|
---|
272 | Stmt([Function(None, 'double', ['x'], [], 0,
|
---|
273 | 'Return twice the argument',
|
---|
274 | Stmt([Return(Mul((Name('x'), Const(2))))]))]))
|
---|
275 | >>> from compiler.ast import *
|
---|
276 | >>> Module('This is an example module.\n\nThis is the docstring.\n',
|
---|
277 | ... Stmt([Function(None, 'double', ['x'], [], 0,
|
---|
278 | ... 'Return twice the argument',
|
---|
279 | ... Stmt([Return(Mul((Name('x'), Const(2))))]))]))
|
---|
280 | Module('This is an example module.\n\nThis is the docstring.\n',
|
---|
281 | Stmt([Function(None, 'double', ['x'], [], 0,
|
---|
282 | 'Return twice the argument',
|
---|
283 | Stmt([Return(Mul((Name('x'), Const(2))))]))]))
|
---|
284 | >>> mod.doc
|
---|
285 | 'This is an example module.\n\nThis is the docstring.\n'
|
---|
286 | >>> for node in mod.node.nodes:
|
---|
287 | ... print node
|
---|
288 | ...
|
---|
289 | Function(None, 'double', ['x'], [], 0, 'Return twice the argument',
|
---|
290 | Stmt([Return(Mul((Name('x'), Const(2))))]))
|
---|
291 | >>> func = mod.node.nodes[0]
|
---|
292 | >>> func.code
|
---|
293 | Stmt([Return(Mul((Name('x'), Const(2))))])
|
---|
294 | \end{verbatim}
|
---|
295 |
|
---|
296 | \section{Using Visitors to Walk ASTs}
|
---|
297 |
|
---|
298 | \declaremodule{}{compiler.visitor}
|
---|
299 |
|
---|
300 | The visitor pattern is ... The \refmodule{compiler} package uses a
|
---|
301 | variant on the visitor pattern that takes advantage of Python's
|
---|
302 | introspection features to eliminate the need for much of the visitor's
|
---|
303 | infrastructure.
|
---|
304 |
|
---|
305 | The classes being visited do not need to be programmed to accept
|
---|
306 | visitors. The visitor need only define visit methods for classes it
|
---|
307 | is specifically interested in; a default visit method can handle the
|
---|
308 | rest.
|
---|
309 |
|
---|
310 | XXX The magic \method{visit()} method for visitors.
|
---|
311 |
|
---|
312 | \begin{funcdesc}{walk}{tree, visitor\optional{, verbose}}
|
---|
313 | \end{funcdesc}
|
---|
314 |
|
---|
315 | \begin{classdesc}{ASTVisitor}{}
|
---|
316 |
|
---|
317 | The \class{ASTVisitor} is responsible for walking over the tree in the
|
---|
318 | correct order. A walk begins with a call to \method{preorder()}. For
|
---|
319 | each node, it checks the \var{visitor} argument to \method{preorder()}
|
---|
320 | for a method named `visitNodeType,' where NodeType is the name of the
|
---|
321 | node's class, e.g. for a \class{While} node a \method{visitWhile()}
|
---|
322 | would be called. If the method exists, it is called with the node as
|
---|
323 | its first argument.
|
---|
324 |
|
---|
325 | The visitor method for a particular node type can control how child
|
---|
326 | nodes are visited during the walk. The \class{ASTVisitor} modifies
|
---|
327 | the visitor argument by adding a visit method to the visitor; this
|
---|
328 | method can be used to visit a particular child node. If no visitor is
|
---|
329 | found for a particular node type, the \method{default()} method is
|
---|
330 | called.
|
---|
331 | \end{classdesc}
|
---|
332 |
|
---|
333 | \class{ASTVisitor} objects have the following methods:
|
---|
334 |
|
---|
335 | XXX describe extra arguments
|
---|
336 |
|
---|
337 | \begin{methoddesc}{default}{node\optional{, \moreargs}}
|
---|
338 | \end{methoddesc}
|
---|
339 |
|
---|
340 | \begin{methoddesc}{dispatch}{node\optional{, \moreargs}}
|
---|
341 | \end{methoddesc}
|
---|
342 |
|
---|
343 | \begin{methoddesc}{preorder}{tree, visitor}
|
---|
344 | \end{methoddesc}
|
---|
345 |
|
---|
346 |
|
---|
347 | \section{Bytecode Generation}
|
---|
348 |
|
---|
349 | The code generator is a visitor that emits bytecodes. Each visit method
|
---|
350 | can call the \method{emit()} method to emit a new bytecode. The basic
|
---|
351 | code generator is specialized for modules, classes, and functions. An
|
---|
352 | assembler converts that emitted instructions to the low-level bytecode
|
---|
353 | format. It handles things like generator of constant lists of code
|
---|
354 | objects and calculation of jump offsets.
|
---|