mirror of https://github.com/python/cpython.git
(libparser.tex): Revised parser module documentation; improved logical
organization.
parent 36f219dff8
commit 4b7d5a49ab
@ -17,20 +17,21 @@
|
|||
The \code{parser} module provides an interface to Python's internal
|
||||
parser and byte-code compiler. The primary purpose for this interface
|
||||
is to allow Python code to edit the parse tree of a Python expression
|
||||
and create executable code from this. This can be better than trying
|
||||
to parse and modify an arbitrary Python code fragment as a string, and
|
||||
ensures that parsing is performed in a manner identical to the code
|
||||
forming the application. It's also faster.
|
||||
and create executable code from this. This is better than trying
|
||||
to parse and modify an arbitrary Python code fragment as a string
|
||||
because parsing is performed in a manner identical to the code
|
||||
forming the application. It is also faster.
|
||||
|
||||
There are a few things to note about this module which are important
|
||||
to making use of the data structures created. This is not a tutorial
|
||||
on editing the parse trees for Python code.
|
||||
on editing the parse trees for Python code, but some examples of using
|
||||
the \code{parser} module are presented.
|
||||
|
||||
Most importantly, a good understanding of the Python grammar processed
|
||||
by the internal parser is required. For full information on the
|
||||
language syntax, refer to the Language Reference. The parser itself
|
||||
is created from a grammar specification defined in the file
|
||||
\code{Grammar/Grammar} in the standard Python distribution. The parse
|
||||
\file{Grammar/Grammar} in the standard Python distribution. The parse
|
||||
trees stored in the ``AST objects'' created by this module are the
|
||||
actual output from the internal parser when created by the
|
||||
\code{expr()} or \code{suite()} functions, described below. The AST
|
||||
|
@ -51,16 +52,16 @@ Each element of the sequences returned by \code{ast2list} or
|
|||
non-terminal elements in the grammar always have a length greater than
|
||||
one. The first element is an integer which identifies a production in
|
||||
the grammar. These integers are given symbolic names in the C header
|
||||
file \code{Include/graminit.h} and the Python module
|
||||
\code{Lib/symbol.py}. Each additional element of the sequence represents
|
||||
file \file{Include/graminit.h} and the Python module
|
||||
\file{Lib/symbol.py}. Each additional element of the sequence represents
|
||||
a component of the production as recognized in the input string: these
|
||||
are always sequences which have the same form as the parent. An
|
||||
important aspect of this structure which should be noted is that
|
||||
keywords used to identify the parent node type, such as the keyword
|
||||
\code{if} in an \emph{if\_stmt}, are included in the node tree without
|
||||
\code{if} in an \code{if_stmt}, are included in the node tree without
|
||||
any special treatment. For example, the \code{if} keyword is
|
||||
represented by the tuple \code{(1, 'if')}, where \code{1} is the
|
||||
numeric value associated with all \code{NAME} elements, including
|
||||
numeric value associated with all \code{NAME} tokens, including
|
||||
variable and function names defined by the user. In an alternate form
|
||||
returned when line number information is requested, the same token
|
||||
might be represented as \code{(1, 'if', 12)}, where the \code{12}
|
||||
|
@ -70,51 +71,115 @@ Terminal elements are represented in much the same way, but without
|
|||
any child elements and the addition of the source text which was
|
||||
identified. The example of the \code{if} keyword above is
|
||||
representative. The various types of terminal symbols are defined in
|
||||
the C header file \code{Include/token.h} and the Python module
|
||||
\code{Lib/token.py}.
|
||||
the C header file \file{Include/token.h} and the Python module
|
||||
\file{Lib/token.py}.
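The two terminal shapes described above can be illustrated with the standard `token` module alone. This is a minimal sketch, not part of the `parser` module: the helper name `describe_terminal` is hypothetical, and the tuples are written by hand rather than produced by a parse.

```python
import token


def describe_terminal(node):
    """Return (type_name, text, line) for a terminal node.

    `line` is None when the node carries no line number.
    """
    if len(node) == 2:
        kind, text = node
        line = None
    elif len(node) == 3:
        kind, text, line = node
    else:
        raise ValueError("terminal nodes have two or three elements")
    # token.tok_name maps the numeric token type back to its symbolic name.
    return token.tok_name[kind], text, line


# Keywords such as `if` are ordinary NAME tokens in the tree.
print(describe_terminal((token.NAME, 'if')))
print(describe_terminal((token.NAME, 'if', 12)))
```

The second call shows the alternate form that carries a line number as a third element.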
|
||||
|
||||
The AST objects are not actually required to support the functionality
|
||||
of this module, but are provided for three purposes: to allow an
|
||||
application to amortize the cost of processing complex parse trees, to
|
||||
provide a parse tree representation which conserves memory space when
|
||||
compared to the Python list or tuple representation, and to ease the
|
||||
creation of additional modules in C which manipulate parse trees. A
|
||||
simple ``wrapper'' module may be created in Python to hide the use of
|
||||
AST objects.
|
||||
The AST objects are not required to support the functionality of this
|
||||
module, but are provided for three purposes: to allow an application
|
||||
to amortize the cost of processing complex parse trees, to provide a
|
||||
parse tree representation which conserves memory space when compared
|
||||
to the Python list or tuple representation, and to ease the creation
|
||||
of additional modules in C which manipulate parse trees. A simple
|
||||
``wrapper'' class may be created in Python to hide the use of AST
|
||||
objects; the \code{AST} library module provides a variety of such
|
||||
classes.
|
||||
|
||||
|
||||
The \code{parser} module defines the following functions:
|
||||
The \code{parser} module defines functions for a few distinct
|
||||
purposes. The most important purposes are to create AST objects and
|
||||
to convert AST objects to other representations such as parse trees
|
||||
and compiled code objects, but there are also functions which serve to
|
||||
query the type of parse tree represented by an AST object.
|
||||
|
||||
\renewcommand{\indexsubitem}{(in module parser)}
|
||||
|
||||
\begin{funcdesc}{ast2list}{ast\optional{\, line\_info\code{ = 0}}}
|
||||
|
||||
\subsection{Creating AST Objects}
|
||||
|
||||
AST objects may be created from source code or from a parse tree.
|
||||
When creating an AST object from source, different functions are used
|
||||
to create the \code{'eval'} and \code{'exec'} forms.
|
||||
|
||||
\begin{funcdesc}{expr}{string}
|
||||
The \code{expr()} function parses the parameter \code{\var{string}}
|
||||
as if it were an input to \code{compile(\var{string}, 'eval')}. If
|
||||
the parse succeeds, an AST object is created to hold the internal
|
||||
parse tree representation, otherwise an appropriate exception is
|
||||
thrown.
|
||||
\end{funcdesc}
|
||||
|
||||
\begin{funcdesc}{suite}{string}
|
||||
The \code{suite()} function parses the parameter \code{\var{string}}
|
||||
as if it were an input to \code{compile(\var{string}, 'exec')}. If
|
||||
the parse succeeds, an AST object is created to hold the internal
|
||||
parse tree representation, otherwise an appropriate exception is
|
||||
thrown.
|
||||
\end{funcdesc}
|
||||
|
||||
\begin{funcdesc}{sequence2ast}{sequence}
|
||||
This function accepts a parse tree represented as a sequence and
|
||||
builds an internal representation if possible. If it can validate
|
||||
that the tree conforms to the Python grammar and all nodes are valid
|
||||
node types in the host version of Python, an AST object is created
|
||||
from the internal representation and returned to the caller. If there
|
||||
is a problem creating the internal representation, or if the tree
|
||||
cannot be validated, a \code{ParserError} exception is thrown. An AST
|
||||
object created this way should not be assumed to compile correctly;
|
||||
normal exceptions thrown by compilation may still be initiated when
|
||||
the AST object is passed to \code{compileast()}. This may indicate
|
||||
problems not related to syntax (such as a \code{MemoryError}
|
||||
exception), but may also be due to constructs such as the result of
|
||||
parsing \code{del f(0)}, which escapes the Python parser but is
|
||||
checked by the bytecode compiler.
|
||||
|
||||
Sequences representing terminal tokens may be represented as either
|
||||
two-element lists of the form \code{(1, 'name')} or as three-element
|
||||
lists of the form \code{(1, 'name', 56)}. If the third element is
|
||||
present, it is assumed to be a valid line number. The line number
|
||||
may be specified for any subset of the terminal symbols in the input
|
||||
tree.
|
||||
\end{funcdesc}
|
||||
|
||||
\begin{funcdesc}{tuple2ast}{sequence}
|
||||
This is the same function as \code{sequence2ast()}. This entry point
|
||||
is maintained for backward compatibility.
|
||||
\end{funcdesc}
|
||||
|
||||
|
||||
\subsection{Converting AST Objects}
|
||||
|
||||
AST objects, regardless of the input used to create them, may be
|
||||
converted to parse trees represented as list- or tuple- trees, or may
|
||||
be compiled into executable code objects. Parse trees may be
|
||||
extracted with or without line numbering information.
|
||||
|
||||
\begin{funcdesc}{ast2list}{ast\optional{\, line_info\code{ = 0}}}
|
||||
This function accepts an AST object from the caller in
|
||||
\code{\var{ast}} and returns a Python list representing the
|
||||
equivalent parse tree. The resulting list representation can be used
|
||||
for inspection or the creation of a new parse tree in list form.
|
||||
This function does not fail so long as memory is available to build
|
||||
the list representation. If a parse tree will only be used for
|
||||
for inspection or the creation of a new parse tree in list form. This
|
||||
function does not fail so long as memory is available to build the
|
||||
list representation. If the parse tree will only be used for
|
||||
inspection, \code{ast2tuple()} should be used instead to reduce memory
|
||||
consumption and fragmentation. When modifications are to be made to
|
||||
the parse tree, this function is significantly faster than retrieving
|
||||
a tuple representation and converting that to nested lists.
|
||||
consumption and fragmentation. When the list representation is
|
||||
required, this function is significantly faster than retrieving a
|
||||
tuple representation and converting that to nested lists.
|
||||
|
||||
If the \code{line\_info} flag is given true value, line number
|
||||
information will be included for all terminal tokens as a third
|
||||
element of the list representing the token. This information is
|
||||
omitted if the flag is false or omitted.
|
||||
If \code{\var{line_info}} is true, line number information will be
|
||||
included for all terminal tokens as a third element of the list
|
||||
representing the token. This information is omitted if the flag is
|
||||
false or omitted.
|
||||
\end{funcdesc}
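For comparison, the slower route mentioned above, converting a tuple tree to nested lists in Python, might look like the following sketch. The function name `tuple_tree_to_list` and the node number `300` are hypothetical and chosen only for illustration; the tree is hand-written rather than obtained from `ast2tuple()`.

```python
def tuple_tree_to_list(tree):
    """Recursively copy a nested-tuple parse tree into nested lists."""
    if isinstance(tree, tuple):
        return [tuple_tree_to_list(child) for child in tree]
    # Integers (node types, line numbers) and strings (token text) are leaves.
    return tree


# A hand-written fragment: a hypothetical non-terminal 300 with two terminals.
sample = (300, (1, 'x'), (4, ''))
print(tuple_tree_to_list(sample))
```

Every node is visited and copied once in Python code, which is the cost that asking for the list form directly avoids.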
|
||||
|
||||
\begin{funcdesc}{ast2tuple}{ast\optional{\, line\_info\code{ = 0}}}
|
||||
\begin{funcdesc}{ast2tuple}{ast\optional{\, line_info\code{ = 0}}}
|
||||
This function accepts an AST object from the caller in
|
||||
\code{\var{ast}} and returns a Python tuple representing the
|
||||
equivalent parse tree. Other than returning a tuple instead of a
|
||||
list, this function is identical to \code{ast2list()}.
|
||||
|
||||
If the \code{line\_info} flag is given true value, line number
|
||||
information will be included for all terminal tokens as a third
|
||||
element of the list representing the token. This information is
|
||||
omitted if the flag is false or omitted.
|
||||
If \code{\var{line_info}} is true, line number information will be
|
||||
included for all terminal tokens as a third element of the list
|
||||
representing the token. This information is omitted if the flag is
|
||||
false or omitted.
|
||||
\end{funcdesc}
|
||||
|
||||
\begin{funcdesc}{compileast}{ast\optional{\, filename\code{ = '<ast>'}}}
|
||||
|
@ -128,7 +193,7 @@ for \code{\var{filename}} indicates that the source was an AST object.
|
|||
|
||||
Compiling an AST object may result in exceptions related to
|
||||
compilation; an example would be a \code{SyntaxError} caused by the
|
||||
parse tree for \code{del f(0)}; this statement is considered legal
|
||||
parse tree for \code{del f(0)}: this statement is considered legal
|
||||
within the formal grammar for Python but is not a legal language
|
||||
construct. The \code{SyntaxError} raised for this condition is
|
||||
actually generated by the Python byte-compiler normally, which is why
|
||||
|
@ -138,14 +203,13 @@ inspection of the parse tree.
|
|||
\end{funcdesc}
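The `del f(0)` case can be observed without the `parser` module by handing the same source to the built-in `compile()`. This is only a sketch of the symptom: which internal stage rejects the statement is an implementation detail and may differ between Python versions.

```python
# The statement fits the general shape of a `del` statement but is not a
# legal language construct, so compilation is expected to fail.
try:
    compile("del f(0)", "<example>", "exec")
except SyntaxError:
    rejected = True
else:
    rejected = False

print(rejected)
```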
|
||||
|
||||
|
||||
\begin{funcdesc}{expr}{string}
|
||||
The \code{expr()} function parses the parameter \code{\var{string}}
|
||||
as if it were an input to \code{compile(\var{string}, 'eval')}. If
|
||||
the parse succeeds, an AST object is created to hold the internal
|
||||
parse tree representation, otherwise an appropriate exception is
|
||||
thrown.
|
||||
\end{funcdesc}
|
||||
\subsection{Queries on AST Objects}
|
||||
|
||||
Two functions are provided which allow an application to determine if
|
||||
an AST was created as an expression or a suite. Neither of these
|
||||
functions can be used to determine if an AST was created from source
|
||||
code via \code{expr()} or \code{suite()} or from a parse tree via
|
||||
\code{sequence2ast()}.
|
||||
|
||||
\begin{funcdesc}{isexpr}{ast}
|
||||
When \code{\var{ast}} represents an \code{'eval'} form, this function
|
||||
|
@ -160,48 +224,10 @@ like this either, and are identical to those created by the built-in
|
|||
|
||||
\begin{funcdesc}{issuite}{ast}
|
||||
This function mirrors \code{isexpr()} in that it reports whether an
|
||||
AST object represents a suite of statements. It is not safe to assume
|
||||
that this function is equivalent to \code{not isexpr(\var{ast})}, as
|
||||
additional syntactic fragments may be supported in the future.
|
||||
\end{funcdesc}
|
||||
|
||||
|
||||
\begin{funcdesc}{suite}{string}
|
||||
The \code{suite()} function parses the parameter \code{\var{string}}
|
||||
as if it were an input to \code{compile(\var{string}, 'exec')}. If
|
||||
the parse succeeds, an AST object is created to hold the internal
|
||||
parse tree representation, otherwise an appropriate exception is
|
||||
thrown.
|
||||
\end{funcdesc}
|
||||
|
||||
|
||||
\begin{funcdesc}{sequence2ast}{sequence}
|
||||
This function accepts a parse tree represented as a sequence and
|
||||
builds an internal representation if possible. If it can validate
|
||||
that the tree conforms to the Python grammar and all nodes are valid
|
||||
node types in the host version of Python, an AST object is created
|
||||
from the internal representation and returned to the caller. If there
|
||||
is a problem creating the internal representation, or if the tree
|
||||
cannot be validated, a \code{ParserError} exception is thrown. An AST
|
||||
object created this way should not be assumed to compile correctly;
|
||||
normal exceptions thrown by compilation may still be initiated when
|
||||
the AST object is passed to \code{compileast()}. This will normally
|
||||
indicate problems not related to syntax (such as a \code{MemoryError}
|
||||
exception), but may also be due to constructs such as the result of
|
||||
parsing \code{del f(0)}, which escapes the Python parser but is
|
||||
checked by the bytecode compiler.
|
||||
|
||||
Sequences representing terminal tokens may be represented as either
|
||||
two-element lists of the form \code{(1, 'name')} or as three-element
|
||||
lists of the form \code{(1, 'name', 56)}. If the third element is
|
||||
present, it is assumed to be a valid line number. The line number
|
||||
may be specified for any subset of the terminal symbols in the input
|
||||
tree.
|
||||
\end{funcdesc}
|
||||
|
||||
\begin{funcdesc}{tuple2ast}{sequence}
|
||||
This is the same function as \code{sequence2ast}. This entry point is
|
||||
maintained for backward compatibility.
|
||||
AST object represents an \code{'exec'} form, commonly known as a
|
||||
``suite.'' It is not safe to assume that this function is equivalent
|
||||
to \code{not isexpr(\var{ast})}, as additional syntactic fragments may
|
||||
be supported in the future.
|
||||
\end{funcdesc}
|
||||
|
||||
|
||||
|
@ -235,10 +261,11 @@ to the descriptions of each function for detailed information.
|
|||
|
||||
\subsection{AST Objects}
|
||||
|
||||
AST objects (returned by \code{expr()}, \code{suite()}, and
|
||||
\code{sequence2ast()}, described above) have no methods of their own.
|
||||
AST objects returned by \code{expr()}, \code{suite()}, and
|
||||
\code{sequence2ast()} have no methods of their own.
|
||||
Some of the functions defined which accept an AST object as their
|
||||
first argument may change to object methods in the future.
|
||||
first argument may change to object methods in the future. The type
|
||||
of these objects is available as \code{ASTType} in the module.
|
||||
|
||||
Ordered and equality comparisons are supported between AST objects.
|
||||
|
||||
|
@ -247,12 +274,12 @@ Ordered and equality comparisons are supported between AST objects.
|
|||
|
||||
The parser module allows operations to be performed on the parse tree
|
||||
of Python source code before the bytecode is generated, and provides
|
||||
for inspection of the parse tree for information gathering purposes as
|
||||
well. Two examples are presented. The simple example demonstrates
|
||||
emulation of the \code{compile()} built-in function and the complex
|
||||
example shows the use of a parse tree for information discovery.
|
||||
for inspection of the parse tree for information gathering purposes.
|
||||
Two examples are presented. The simple example demonstrates emulation
|
||||
of the \code{compile()} built-in function and the complex example
|
||||
shows the use of a parse tree for information discovery.
|
||||
|
||||
\subsubsection{Emulation of {\tt compile()}}
|
||||
\subsubsection{Emulation of \sectcode{compile()}}
|
||||
|
||||
While many useful operations may take place between parsing and
|
||||
bytecode generation, the simplest operation is to do nothing. For
|
||||
|
@ -298,17 +325,16 @@ def load_expression(source_string):
|
|||
|
||||
\subsubsection{Information Discovery}
|
||||
|
||||
Some applications can benefit from access to the parse tree itself, and
|
||||
can take advantage of the intermediate data structure provided by the
|
||||
\code{parser} module. The remainder of this section of examples will
|
||||
demonstrate how the intermediate data structure can provide access to
|
||||
module documentation defined in docstrings without requiring that the
|
||||
code being examined be imported into a running interpreter. This can
|
||||
be very useful for performing analyses of untrusted code.
|
||||
Some applications benefit from direct access to the parse tree. The
|
||||
remainder of this section demonstrates how the parse tree provides
|
||||
access to module documentation defined in docstrings without requiring
|
||||
that the code being examined be loaded into a running interpreter via
|
||||
\code{import}. This can be very useful for performing analyses of
|
||||
untrusted code.
|
||||
|
||||
Generally, the example will demonstrate how the parse tree may be
|
||||
traversed to distill interesting information. Two functions and a set
|
||||
of classes is developed which provide programmatic access to high
|
||||
of classes are developed which provide programmatic access to high
|
||||
level function and class definitions provided by a module. The
|
||||
classes extract information from the parse tree and provide access to
|
||||
the information at a useful semantic level, one function provides a
|
||||
|
@ -316,7 +342,7 @@ simple low-level pattern matching capability, and the other function
|
|||
defines a high-level interface to the classes by handling file
|
||||
operations on behalf of the caller. All source files mentioned here
|
||||
which are not part of the Python installation are located in the
|
||||
\file{Demo/parser} directory of the distribution.
|
||||
\file{Demo/parser/} directory of the distribution.
|
||||
|
||||
The dynamic nature of Python allows the programmer a great deal of
|
||||
flexibility, but most modules need only a limited measure of this when
|
||||
|
@ -324,13 +350,13 @@ defining classes, functions, and methods. In this example, the only
|
|||
definitions that will be considered are those which are defined in the
|
||||
top level of their context, e.g., a function defined by a \code{def}
|
||||
statement at column zero of a module, but not a function defined
|
||||
within a branch of an \code{if} ... \code{else} construct, thought
|
||||
within a branch of an \code{if} ... \code{else} construct, though
|
||||
there are some good reasons for doing so in some situations. Nesting
|
||||
of definitions will be handled by the code developed in the example.
|
||||
|
||||
To construct the upper-level extraction methods, we need to know what
|
||||
the parse tree structure looks like and how much of it we actually
|
||||
need to be concerned about. Python uses a moderately deep parse tree,
|
||||
need to be concerned about. Python uses a moderately deep parse tree
|
||||
so there are a large number of intermediate nodes. It is important to
|
||||
read and understand the formal grammar used by Python. This is
|
||||
specified in the file \file{Grammar/Grammar} in the distribution.
|
||||
|
@ -345,7 +371,7 @@ a module consisting of a docstring and nothing else. (See file
|
|||
|
||||
Using the interpreter to take a look at the parse tree, we find a
|
||||
bewildering mass of numbers and parentheses, with the documentation
|
||||
buried deep in the nested tuples:
|
||||
buried deep in nested tuples.
|
||||
|
||||
\begin{verbatim}
|
||||
>>> import parser
|
||||
|
@ -405,12 +431,12 @@ the docstring subtree within the tree defining the described
|
|||
structure.
|
||||
|
||||
By replacing the actual docstring with something to signify a variable
|
||||
component of the tree, we allow a simple pattern matching approach may
|
||||
be taken to checking any given subtree for equivalence to the general
|
||||
pattern for docstrings. Since the example demonstrates information
|
||||
extraction, we can safely require that the tree be in tuple form
|
||||
rather than list form, allowing a simple variable representation to be
|
||||
\code{['variable\_name']}. A simple recursive function can implement
|
||||
component of the tree, we allow a simple pattern matching approach to
|
||||
check any given subtree for equivalence to the general pattern for
|
||||
docstrings. Since the example demonstrates information extraction, we
|
||||
can safely require that the tree be in tuple form rather than list
|
||||
form, allowing a simple variable representation to be
|
||||
\code{['variable_name']}. A simple recursive function can implement
|
||||
the pattern matching, returning a boolean and a dictionary of variable
|
||||
name to value mappings. (See file \file{example.py}.)
|
||||
|
||||
|
@ -434,7 +460,7 @@ def match(pattern, data, vars=None):
|
|||
return same, vars
|
||||
\end{verbatim}
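The matching approach can be restated in present-day Python as the following self-contained sketch. It is not the code from \file{example.py}: the parameter name `variables`, the node number `EXPR_STMT = 300`, and the hand-written trees are assumptions made for illustration.

```python
import token


def match(pattern, data, variables=None):
    """Match `data` against `pattern`; return (matched, bindings).

    A one-element list such as ['docstring'] in the pattern is a variable
    that binds to whatever appears at that position in the data.
    """
    if variables is None:
        variables = {}
    if isinstance(pattern, list):
        variables[pattern[0]] = data
        return True, variables
    if not isinstance(pattern, tuple):
        return pattern == data, variables
    if not isinstance(data, tuple) or len(data) != len(pattern):
        return False, variables
    for sub_pattern, sub_data in zip(pattern, data):
        same, variables = match(sub_pattern, sub_data, variables)
        if not same:
            return False, variables
    return True, variables


EXPR_STMT = 300  # hypothetical non-terminal number, for illustration only

pattern = (EXPR_STMT, (token.STRING, ['docstring']), (token.NEWLINE, ''))
tree = (EXPR_STMT, (token.STRING, '"""Some documentation."""'), (token.NEWLINE, ''))

found, bound = match(pattern, tree)
print(found, bound['docstring'])
```

Because variables are written as lists, the tree being searched must be in tuple form; a list inside the data would otherwise be indistinguishable from a variable in the pattern.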
|
||||
|
||||
Using this simple recursive pattern matching function and the symbolic
|
||||
Using this simple representation for syntactic variables and the symbolic
|
||||
node types, the pattern for the candidate docstring subtrees becomes
|
||||
fairly readable. (See file \file{example.py}.)
|
||||
|
||||
|
@ -518,17 +544,17 @@ methods \code{get_name()}, \code{get_docstring()},
|
|||
|
||||
Within each of the forms of code block that the public classes
|
||||
represent, most of the required information is in the same form and is
|
||||
access in the same way, with classes having the distinction that
|
||||
accessed in the same way, with classes having the distinction that
|
||||
functions defined at the top level are referred to as ``methods.''
|
||||
Since the difference in nomenclature reflects a real semantic
|
||||
distinction from functions defined outside of a class, our
|
||||
implementation needs to maintain the same measure of distinction.
|
||||
distinction from functions defined outside of a class, the
|
||||
implementation needs to maintain the distinction.
|
||||
Hence, most of the functionality of the public classes can be
|
||||
implemented in a common base class, \code{SuiteInfoBase}, with the
|
||||
accessors for function and method information provided elsewhere.
|
||||
Note that there is only one class which represents function and method
|
||||
information; this mirrors the use of the \code{def} statement to
|
||||
define both types of functions.
|
||||
information; this parallels the use of the \code{def} statement to
|
||||
define both types of elements.
|
||||
|
||||
Most of the accessor functions are declared in \code{SuiteInfoBase}
|
||||
and do not need to be overridden by subclasses. More importantly, the
|
||||
|
@ -602,25 +628,25 @@ When the short form is used, the code block may contain a docstring as
|
|||
the first, and possibly only, \code{small_stmt} element. The
|
||||
extraction of such a docstring is slightly different and requires only
|
||||
a portion of the complete pattern used in the more common case. As
|
||||
given in the code, the docstring will only be found if there is only
|
||||
implemented, the docstring will only be found if there is only
|
||||
one \code{small_stmt} node in the \code{simple_stmt} node. Since most
|
||||
functions and methods which use the short form do not provide
|
||||
functions and methods which use the short form do not provide a
|
||||
docstring, this may be considered sufficient. The extraction of the
|
||||
docstring proceeds using the \code{match()} function as described
|
||||
above, and the value of the docstring is stored as an attribute of the
|
||||
\code{SuiteInfoBase} object.
|
||||
|
||||
After docstring extraction, the operates a simple definition discovery
|
||||
algorithm on the \code{stmt} nodes of the \code{suite} node. The
|
||||
After docstring extraction, a simple definition discovery
|
||||
algorithm operates on the \code{stmt} nodes of the \code{suite} node. The
|
||||
special case of the short form is not tested; since there are no
|
||||
\code{stmt} nodes in the short form, the algorithm will silently skip
|
||||
the single \code{simple_stmt} node and correctly not discover any
|
||||
nested definitions.
|
||||
|
||||
Each statement in the code block being examined is categorized as being
|
||||
a class definition, function definition (including methods), or
|
||||
Each statement in the code block is categorized as
|
||||
a class definition, function or method definition, or
|
||||
something else. For the definition statements, the name of the
|
||||
element being defined is extracted and representation object
|
||||
element defined is extracted and a representation object
|
||||
appropriate to the definition is created with the defining subtree
|
||||
passed as an argument to the constructor. The representation objects
|
||||
are stored in instance variables and may be retrieved by name using
|
||||
|
@ -630,7 +656,7 @@ The public classes provide any accessors required which are more
|
|||
specific than those provided by the \code{SuiteInfoBase} class, but
|
||||
the real extraction algorithm remains common to all forms of code
|
||||
blocks. A high-level function can be used to extract the complete set
|
||||
of information from a source file:
|
||||
of information from a source file. (See file \file{example.py}.)
|
||||
|
||||
\begin{verbatim}
|
||||
def get_docs(fileName):
|
||||
|
|
|
\begin{funcdesc}{suite}{string}
|
||||
The \code{suite()} function parses the parameter \code{\var{string}}
|
||||
as if it were an input to \code{compile(\var{string}, 'exec')}. If
|
||||
the parse succeeds, an AST object is created to hold the internal
|
||||
parse tree representation, otherwise an appropriate exception is
|
||||
thrown.
|
||||
\end{funcdesc}
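The difference between the \code{'eval'} and \code{'exec'} forms can be seen with the built-in \code{compile()} alone. The following is a minimal sketch that does not use the \code{parser} module; the source strings and the \code{'<string>'} file name are arbitrary choices made for the illustration.

```python
# An expression compiles in 'eval' mode and yields a value.
expr_code = compile("a + 5", "<string>", "eval")
result = eval(expr_code, {"a": 2})

# A sequence of statements compiles in 'exec' mode and yields no value.
suite_code = compile("a = 2\nb = a + 5", "<string>", "exec")
namespace = {}
exec(suite_code, namespace)

# A statement is rejected when the 'eval' form is requested.
try:
    compile("a = 2", "<string>", "eval")
    rejected = False
except SyntaxError:
    rejected = True
```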
|
||||
|
||||
\begin{funcdesc}{sequence2ast}{sequence}
|
||||
This function accepts a parse tree represented as a sequence and
|
||||
builds an internal representation if possible. If it can validate
|
||||
that the tree conforms to the Python grammar and all nodes are valid
|
||||
node types in the host version of Python, an AST object is created
|
||||
from the internal representation and returned to the caller. If there
|
||||
is a problem creating the internal representation, or if the tree
|
||||
cannot be validated, a \code{ParserError} exception is thrown. An AST
|
||||
object created this way should not be assumed to compile correctly;
|
||||
normal exceptions thrown by compilation may still be initiated when
|
||||
the AST object is passed to \code{compileast()}. This may indicate
|
||||
problems not related to syntax (such as a \code{MemoryError}
|
||||
exception), but may also be due to constructs such as the result of
|
||||
parsing \code{del f(0)}, which escapes the Python parser but is
|
||||
checked by the bytecode compiler.
|
||||
|
||||
Sequences representing terminal tokens may be represented as either
|
||||
two-element lists of the form \code{(1, 'name')} or as three-element
|
||||
lists of the form \code{(1, 'name', 56)}. If the third element is
|
||||
present, it is assumed to be a valid line number. The line number
|
||||
may be specified for any subset of the terminal symbols in the input
|
||||
tree.
|
||||
\end{funcdesc}
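Because line numbers are optional on each terminal, a tree may mix two-element and three-element terminals. The sketch below normalizes such a tree by dropping the line numbers. The function name \code{strip\_line\_numbers()} and the nonterminal number 300 are hypothetical and serve only to illustrate the shape of the data.

```python
def strip_line_numbers(node):
    """Return a copy of a tuple-form tree without terminal line numbers.

    A terminal is recognized by a string in its second position;
    anything else is treated as a nonterminal with child nodes.
    """
    if len(node) > 1 and isinstance(node[1], str):
        return (node[0], node[1])
    return (node[0],) + tuple(strip_line_numbers(child) for child in node[1:])


# 300 stands in for some nonterminal symbol number (an assumption).
mixed = (300, (1, 'name', 56), (1, 'other'))
plain = strip_line_numbers(mixed)
```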
|
||||
|
||||
\begin{funcdesc}{tuple2ast}{sequence}
|
||||
This is the same function as \code{sequence2ast()}. This entry point
|
||||
is maintained for backward compatibility.
|
||||
\end{funcdesc}
|
||||
|
||||
|
||||
\subsection{Converting AST Objects}
|
||||
|
||||
AST objects, regardless of the input used to create them, may be
|
||||
converted to parse trees represented as list- or tuple- trees, or may
|
||||
be compiled into executable code objects. Parse trees may be
|
||||
extracted with or without line numbering information.
|
||||
|
||||
\begin{funcdesc}{ast2list}{ast\optional{\, line_info\code{ = 0}}}
|
||||
This function accepts an AST object from the caller in
|
||||
\code{\var{ast}} and returns a Python list representing the
|
||||
equivalent parse tree. The resulting list representation can be used
|
||||
for inspection or the creation of a new parse tree in list form.
|
||||
This function does not fail so long as memory is available to build
|
||||
the list representation. If a parse tree will only be used for
|
||||
for inspection or the creation of a new parse tree in list form. This
|
||||
function does not fail so long as memory is available to build the
|
||||
list representation. If the parse tree will only be used for
|
||||
inspection, \code{ast2tuple()} should be used instead to reduce memory
|
||||
consumption and fragmentation. When modifications are to be made to
|
||||
the parse tree, this function is significantly faster than retrieving
|
||||
a tuple representation and converting that to nested lists.
|
||||
consumption and fragmentation. When the list representation is
|
||||
required, this function is significantly faster than retrieving a
|
||||
tuple representation and converting that to nested lists.
|
||||
|
||||
If the \code{line\_info} flag is given true value, line number
|
||||
information will be included for all terminal tokens as a third
|
||||
element of the list representing the token. This information is
|
||||
omitted if the flag is false or omitted.
|
||||
If \code{\var{line_info}} is true, line number information will be
|
||||
included for all terminal tokens as a third element of the list
|
||||
representing the token. This information is omitted if the flag is
|
||||
false or omitted.
|
||||
\end{funcdesc}
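The slower alternative mentioned above, converting a tuple representation to nested lists in Python, might look like the following sketch. The function name and the sample tree (including the nonterminal number 300) are hypothetical.

```python
def tuple_tree_to_lists(node):
    """Recursively convert a nested-tuple tree into nested lists."""
    if isinstance(node, tuple):
        return [tuple_tree_to_lists(item) for item in node]
    # Numbers and strings are leaves and are copied unchanged.
    return node


tuple_tree = (300, (1, 'if'), (1, 'x', 12))
list_tree = tuple_tree_to_lists(tuple_tree)

# Unlike the tuple form, the list form can be edited in place.
list_tree[1][1] = 'while'
```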
|
||||
|
||||
\begin{funcdesc}{ast2tuple}{ast\optional{\, line\_info\code{ = 0}}}
|
||||
\begin{funcdesc}{ast2tuple}{ast\optional{\, line_info\code{ = 0}}}
|
||||
This function accepts an AST object from the caller in
|
||||
\code{\var{ast}} and returns a Python tuple representing the
|
||||
equivalent parse tree. Other than returning a tuple instead of a
|
||||
list, this function is identical to \code{ast2list()}.
|
||||
|
||||
If the \code{line\_info} flag is given true value, line number
|
||||
information will be included for all terminal tokens as a third
|
||||
element of the list representing the token. This information is
|
||||
omitted if the flag is false or omitted.
|
||||
If \code{\var{line_info}} is true, line number information will be
|
||||
included for all terminal tokens as a third element of the list
|
||||
representing the token. This information is omitted if the flag is
|
||||
false or omitted.
|
||||
\end{funcdesc}
|
||||
|
||||
\begin{funcdesc}{compileast}{ast\optional{\, filename\code{ = '<ast>'}}}
|
||||
|
@ -128,7 +193,7 @@ for \code{\var{filename}} indicates that the source was an AST object.
|
|||
|
||||
Compiling an AST object may result in exceptions related to
|
||||
compilation; an example would be a \code{SyntaxError} caused by the
|
||||
parse tree for \code{del f(0)}; this statement is considered legal
|
||||
parse tree for \code{del f(0)}: this statement is considered legal
|
||||
within the formal grammar for Python but is not a legal language
|
||||
construct. The \code{SyntaxError} raised for this condition is
|
||||
actually generated by the Python byte-compiler normally, which is why
|
||||
|
@ -138,14 +203,13 @@ inspection of the parse tree.
|
|||
\end{funcdesc}
|
||||
|
||||
|
||||
\begin{funcdesc}{expr}{string}
|
||||
The \code{expr()} function parses the parameter \code{\var{string}}
|
||||
as if it were an input to \code{compile(\var{string}, 'eval')}. If
|
||||
the parse succeeds, an AST object is created to hold the internal
|
||||
parse tree representation, otherwise an appropriate exception is
|
||||
thrown.
|
||||
\end{funcdesc}
|
||||
\subsection{Queries on AST Objects}
|
||||
|
||||
Two functions are provided which allow an application to determine if
|
||||
an AST was created as an expression or a suite. Neither of these
|
||||
functions can be used to determine if an AST was created from source
|
||||
code via \code{expr()} or \code{suite()} or from a parse tree via
|
||||
\code{sequence2ast()}.
|
||||
|
||||
\begin{funcdesc}{isexpr}{ast}
|
||||
When \code{\var{ast}} represents an \code{'eval'} form, this function
|
||||
|
@ -160,48 +224,10 @@ like this either, and are identical to those created by the built-in
|
|||
|
||||
\begin{funcdesc}{issuite}{ast}
|
||||
This function mirrors \code{isexpr()} in that it reports whether an
|
||||
AST object represents a suite of statements. It is not safe to assume
|
||||
that this function is equivalent to \code{not isexpr(\var{ast})}, as
|
||||
additional syntactic fragments may be supported in the future.
|
||||
\end{funcdesc}
|
||||
|
||||
|
||||
\begin{funcdesc}{suite}{string}
|
||||
The \code{suite()} function parses the parameter \code{\var{string}}
|
||||
as if it were an input to \code{compile(\var{string}, 'exec')}. If
|
||||
the parse succeeds, an AST object is created to hold the internal
|
||||
parse tree representation, otherwise an appropriate exception is
|
||||
thrown.
|
||||
\end{funcdesc}
|
||||
|
||||
|
||||
\begin{funcdesc}{sequence2ast}{sequence}
|
||||
This function accepts a parse tree represented as a sequence and
|
||||
builds an internal representation if possible. If it can validate
|
||||
that the tree conforms to the Python grammar and all nodes are valid
|
||||
node types in the host version of Python, an AST object is created
|
||||
from the internal representation and returned to the caller. If there
|
||||
is a problem creating the internal representation, or if the tree
|
||||
cannot be validated, a \code{ParserError} exception is thrown. An AST
|
||||
object created this way should not be assumed to compile correctly;
|
||||
normal exceptions thrown by compilation may still be initiated when
|
||||
the AST object is passed to \code{compileast()}. This will normally
|
||||
indicate problems not related to syntax (such as a \code{MemoryError}
|
||||
exception), but may also be due to constructs such as the result of
|
||||
parsing \code{del f(0)}, which escapes the Python parser but is
|
||||
checked by the bytecode compiler.
|
||||
|
||||
Sequences representing terminal tokens may be represented as either
|
||||
two-element lists of the form \code{(1, 'name')} or as three-element
|
||||
lists of the form \code{(1, 'name', 56)}. If the third element is
|
||||
present, it is assumed to be a valid line number. The line number
|
||||
may be specified for any subset of the terminal symbols in the input
|
||||
tree.
|
||||
\end{funcdesc}
|
||||
|
||||
\begin{funcdesc}{tuple2ast}{sequence}
|
||||
This is the same function as \code{sequence2ast}. This entry point is
|
||||
maintained for backward compatibility.
|
||||
AST object represents an \code{'exec'} form, commonly known as a
|
||||
``suite.'' It is not safe to assume that this function is equivalent
|
||||
to \code{not isexpr(\var{ast})}, as additional syntactic fragments may
|
||||
be supported in the future.
|
||||
\end{funcdesc}
|
||||
|
||||
|
||||
|
@ -235,10 +261,11 @@ to the descriptions of each function for detailed information.
|
|||
|
||||
\subsection{AST Objects}
|
||||
|
||||
AST objects (returned by \code{expr()}, \code{suite()}, and
|
||||
\code{sequence2ast()}, described above) have no methods of their own.
|
||||
AST objects returned by \code{expr()}, \code{suite()}, and
|
||||
\code{sequence2ast()} have no methods of their own.
|
||||
Some of the functions defined which accept an AST object as their
|
||||
first argument may change to object methods in the future.
|
||||
first argument may change to object methods in the future. The type
|
||||
of these objects is available as \code{ASTType} in the module.
|
||||
|
||||
Ordered and equality comparisons are supported between AST objects.
|
||||
|
||||
|
@ -247,12 +274,12 @@ Ordered and equality comparisons are supported between AST objects.
|
|||
|
||||
The parser module allows operations to be performed on the parse tree
|
||||
of Python source code before the bytecode is generated, and provides
|
||||
for inspection of the parse tree for information gathering purposes as
|
||||
well. Two examples are presented. The simple example demonstrates
|
||||
emulation of the \code{compile()} built-in function and the complex
|
||||
example shows the use of a parse tree for information discovery.
|
||||
for inspection of the parse tree for information gathering purposes.
|
||||
Two examples are presented. The simple example demonstrates emulation
|
||||
of the \code{compile()} built-in function and the complex example
|
||||
shows the use of a parse tree for information discovery.
|
||||
|
||||
\subsubsection{Emulation of {\tt compile()}}
|
||||
\subsubsection{Emulation of \sectcode{compile()}}
|
||||
|
||||
While many useful operations may take place between parsing and
|
||||
bytecode generation, the simplest operation is to do nothing. For
|
||||
|
@ -298,17 +325,16 @@ def load_expression(source_string):
|
|||
|
||||
\subsubsection{Information Discovery}
|
||||
|
||||
Some applications can benfit from access to the parse tree itself, and
|
||||
can take advantage of the intermediate data structure provided by the
|
||||
\code{parser} module. The remainder of this section of examples will
|
||||
demonstrate how the intermediate data structure can provide access to
|
||||
module documentation defined in docstrings without requiring that the
|
||||
code being examined be imported into a running interpreter. This can
|
||||
be very useful for performing analyses of untrusted code.
|
||||
Some applications benefit from direct access to the parse tree. The
|
||||
remainder of this section demonstrates how the parse tree provides
|
||||
access to module documentation defined in docstrings without requiring
|
||||
that the code being examined be loaded into a running interpreter via
|
||||
\code{import}. This can be very useful for performing analyses of
|
||||
untrusted code.
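For comparison only, the same goal (reading docstrings without importing the code) can be sketched with the standard \code{ast} module. This is a different interface from the \code{parser} module described here and is not the approach developed in this section; the sample source string is made up for the illustration.

```python
import ast

# The function body would raise if it ever ran; it is only parsed.
source = (
    '"""Some documentation."""\n'
    '\n'
    'def f():\n'
    '    "Doc for f."\n'
    '    raise RuntimeError("never run")\n'
)

tree = ast.parse(source)

# Docstring of the module itself.
module_doc = ast.get_docstring(tree)

# Docstrings of functions defined at the top level of the module.
function_docs = {
    node.name: ast.get_docstring(node)
    for node in tree.body
    if isinstance(node, ast.FunctionDef)
}
```

Nothing in the examined source is executed, so the \code{raise} statement in the body is never reached.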
|
||||
|
||||
Generally, the example will demonstrate how the parse tree may be
|
||||
traversed to distill interesting information. Two functions and a set
|
||||
of classes is developed which provide programmatic access to high
|
||||
of classes are developed which provide programmatic access to high
|
||||
level function and class definitions provided by a module. The
|
||||
classes extract information from the parse tree and provide access to
|
||||
the information at a useful semantic level, one function provides a
|
||||
|
@ -316,7 +342,7 @@ simple low-level pattern matching capability, and the other function
|
|||
defines a high-level interface to the classes by handling file
|
||||
operations on behalf of the caller. All source files mentioned here
|
||||
which are not part of the Python installation are located in the
|
||||
\file{Demo/parser} directory of the distribution.
|
||||
\file{Demo/parser/} directory of the distribution.
|
||||
|
||||
The dynamic nature of Python allows the programmer a great deal of
|
||||
flexibility, but most modules need only a limited measure of this when
|
||||
|
@ -324,13 +350,13 @@ defining classes, functions, and methods. In this example, the only
|
|||
definitions that will be considered are those which are defined in the
|
||||
top level of their context, e.g., a function defined by a \code{def}
|
||||
statement at column zero of a module, but not a function defined
|
||||
within a branch of an \code{if} ... \code{else} construct, thought
|
||||
within a branch of an \code{if} ... \code{else} construct, though
|
||||
there are some good reasons for doing so in some situations. Nesting
|
||||
of definitions will be handled by the code developed in the example.
|
||||
|
||||
To construct the upper-level extraction methods, we need to know what
|
||||
the parse tree structure looks like and how much of it we actually
|
||||
need to be concerned about. Python uses a moderately deep parse tree,
|
||||
need to be concerned about. Python uses a moderately deep parse tree
|
||||
so there are a large number of intermediate nodes. It is important to
|
||||
read and understand the formal grammar used by Python. This is
|
||||
specified in the file \file{Grammar/Grammar} in the distribution.
|
||||
|
@ -345,7 +371,7 @@ a module consisting of a docstring and nothing else. (See file
|
|||
|
||||
Using the interpreter to take a look at the parse tree, we find a
|
||||
bewildering mass of numbers and parentheses, with the documentation
|
||||
buried deep in the nested tuples:
|
||||
buried deep in nested tuples.
|
||||
|
||||
\begin{verbatim}
|
||||
>>> import parser
|
||||
|
@ -405,12 +431,12 @@ the docstring subtree within the tree defining the described
|
|||
structure.
|
||||
|
||||
By replacing the actual docstring with something to signify a variable
|
||||
component of the tree, we allow a simple pattern matching approach may
|
||||
be taken to checking any given subtree for equivalence to the general
|
||||
pattern for docstrings. Since the example demonstrates information
|
||||
extraction, we can safely require that the tree be in tuple form
|
||||
rather than list form, allowing a simple variable representation to be
|
||||
\code{['variable\_name']}. A simple recursive function can implement
|
||||
component of the tree, we allow a simple pattern matching approach to
|
||||
check any given subtree for equivalence to the general pattern for
|
||||
docstrings. Since the example demonstrates information extraction, we
|
||||
can safely require that the tree be in tuple form rather than list
|
||||
form, allowing a simple variable representation to be
|
||||
\code{['variable_name']}. A simple recursive function can implement
|
||||
the pattern matching, returning a boolean and a dictionary of variable
|
||||
name to value mappings. (See file \file{example.py}.)
|
||||
|
||||
|
@ -434,7 +460,7 @@ def match(pattern, data, vars=None):
|
|||
return same, vars
|
||||
\end{verbatim}
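The same matching idea can be restated for current Python versions as a sketch, together with a small usage check. This is not the code from \file{example.py}: the parameter name \code{bindings}, the nonterminal number 300, and the sample trees are assumptions made for the illustration.

```python
import token


def match(pattern, data, bindings=None):
    """Match a tuple-form tree against a pattern.

    A one-element list such as ['docstring'] in the pattern is a
    variable: it matches anything and records the matched value.
    Returns (matched, bindings).
    """
    if bindings is None:
        bindings = {}
    if isinstance(pattern, list):
        bindings[pattern[0]] = data
        return True, bindings
    if not isinstance(pattern, tuple):
        return pattern == data, bindings
    if not isinstance(data, tuple) or len(data) != len(pattern):
        return False, bindings
    for sub_pattern, sub_data in zip(pattern, data):
        same, bindings = match(sub_pattern, sub_data, bindings)
        if not same:
            return False, bindings
    return True, bindings


# 300 is a stand-in for a nonterminal symbol number.
pattern = (300, (token.STRING, ['docstring']))
found, found_vars = match(pattern, (300, (token.STRING, '"""Some documentation."""')))
missed, _ = match(pattern, (300, (token.NAME, 'x')))
```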
|
||||
|
||||
Using this simple recursive pattern matching function and the symbolic
|
||||
Using this simple representation for syntactic variables and the symbolic
|
||||
node types, the pattern for the candidate docstring subtrees becomes
|
||||
fairly readable. (See file \file{example.py}.)
|
||||
|
||||
|
@ -518,17 +544,17 @@ methods \code{get_name()}, \code{get_docstring()},
|
|||
|
||||
Within each of the forms of code block that the public classes
|
||||
represent, most of the required information is in the same form and is
|
||||
access in the same way, with classes having the distinction that
|
||||
accessed in the same way, with classes having the distinction that
|
||||
functions defined at the top level are referred to as ``methods.''
|
||||
Since the difference in nomenclature reflects a real semantic
|
||||
distinction from functions defined outside of a class, our
|
||||
implementation needs to maintain the same measure of distinction.
|
||||
distinction from functions defined outside of a class, the
|
||||
implementation needs to maintain the distinction.
|
||||
Hence, most of the functionality of the public classes can be
|
||||
implemented in a common base class, \code{SuiteInfoBase}, with the
|
||||
accessors for function and method information provided elsewhere.
|
||||
Note that there is only one class which represents function and method
|
||||
information; this mirrors the use of the \code{def} statement to
|
||||
define both types of functions.
|
||||
information; this parallels the use of the \code{def} statement to
|
||||
define both types of elements.
|
||||
|
||||
Most of the accessor functions are declared in \code{SuiteInfoBase}
|
||||
and do not need to be overridden by subclasses. More importantly, the
|
||||
|
@ -602,25 +628,25 @@ When the short form is used, the code block may contain a docstring as
|
|||
the first, and possibly only, \code{small_stmt} element. The
|
||||
extraction of such a docstring is slightly different and requires only
|
||||
a portion of the complete pattern used in the more common case. As
|
||||
given in the code, the docstring will only be found if there is only
|
||||
implemented, the docstring will only be found if there is only
|
||||
one \code{small_stmt} node in the \code{simple_stmt} node. Since most
|
||||
functions and methods which use the short form do not provide
|
||||
functions and methods which use the short form do not provide a
|
||||
docstring, this may be considered sufficient. The extraction of the
|
||||
docstring proceeds using the \code{match()} function as described
|
||||
above, and the value of the docstring is stored as an attribute of the
|
||||
\code{SuiteInfoBase} object.
|
||||
|
||||
After docstring extraction, the operates a simple definition discovery
|
||||
algorithm on the \code{stmt} nodes of the \code{suite} node. The
|
||||
After docstring extraction, a simple definition discovery
|
||||
algorithm operates on the \code{stmt} nodes of the \code{suite} node. The
|
||||
special case of the short form is not tested; since there are no
|
||||
\code{stmt} nodes in the short form, the algorithm will silently skip
|
||||
the single \code{simple_stmt} node and correctly not discover any
|
||||
nested definitions.
|
||||
|
||||
Each statement in the code block bing examined is categorized as being
|
||||
a class definition, function definition (including methods), or
|
||||
Each statement in the code block is categorized as
|
||||
a class definition, function or method definition, or
|
||||
something else. For the definition statements, the name of the
|
||||
element being defined is extracted and representation object
|
||||
element defined is extracted and a representation object
|
||||
appropriate to the definition is created with the defining subtree
|
||||
passed as an argument to the constructor. The representation objects
|
||||
are stored in instance variables and may be retrieved by name using
|
||||
|
@ -630,7 +656,7 @@ The public classes provide any accessors required which are more
|
|||
specific than those provided by the \code{SuiteInfoBase} class, but
|
||||
the real extraction algorithm remains common to all forms of code
|
||||
blocks. A high-level function can be used to extract the complete set
|
||||
of information from a source file:
|
||||
of information from a source file. (See file \file{example.py}.)
|
||||
|
||||
\begin{verbatim}
|
||||
def get_docs(fileName):
|
||||
|
|