Merge branch 'module-preloading'

Not much point in keeping the docs changes separate.
This commit is contained in:
David Wilson 2017-10-05 19:20:05 +05:30
commit d800b684ef
6 changed files with 278 additions and 100 deletions

View File

@ -32,21 +32,22 @@ bootstrap implementation sent to every new slave context.
Decorator that marks a function or class method to automatically receive a
kwarg named `econtext`, referencing the
:py:class:`econtext.core.ExternalContext` active in the context in which
the function is being invoked in. The decorator is only meaningful when the
function is invoked via :py:data:`econtext.core.CALL_FUNCTION`.
:py:class:`mitogen.core.ExternalContext` active in the context in which the
function is being invoked in. The decorator is only meaningful when the
function is invoked via :py:data:`CALL_FUNCTION
<mitogen.core.CALL_FUNCTION>`.
When the function is invoked directly, `econtext` must still be passed to it
explicitly.
When the function is invoked directly, `econtext` must still be passed to
it explicitly.
.. currentmodule:: mitogen.core
.. decorator:: takes_router
Decorator that marks a function or class method to automatically receive a
kwarg named `router`, referencing the :py:class:`econtext.core.Router`
kwarg named `router`, referencing the :py:class:`mitogen.core.Router`
active in the context in which the function is being invoked in. The
decorator is only meaningful when the function is invoked via
:py:data:`econtext.core.CALL_FUNCTION`.
:py:data:`CALL_FUNCTION <mitogen.core.CALL_FUNCTION>`.
When the function is invoked directly, `router` must still be passed to it
explicitly.
@ -269,8 +270,9 @@ Router Class
.. class:: Router
Route messages between parent and child contexts, and invoke handlers
defined on our parent context. Router.route() straddles the Broker and user
threads, it is safe to call anywhere.
defined on our parent context. :py:meth:`Router.route() <route>` straddles
the :py:class:`Broker <mitogen.core.Broker>` and user threads, it is safe
to call anywhere.
**Note:** This is the somewhat limited core version of the Router class
used by child contexts. The master subclass is documented below this one.
@ -312,8 +314,8 @@ Router Class
:param mitogen.core.Context respondent:
Context that messages to this handle are expected to be sent from.
If specified, arranges for ``_DEAD`` to be delivered to `fn` when
disconncetion of the context is detected.
If specified, arranges for :py:data:`_DEAD` to be delivered to `fn`
when disconnection of the context is detected.
In future `respondent` will likely also be used to prevent other
contexts from sending messages to the handle.
@ -417,8 +419,8 @@ Router Class
:param mitogen.core.Context via:
If not ``None``, arrange for construction to occur via RPCs made to
the context `via`, and for ``ADD_ROUTE`` messages to be generated
as appropriate.
the context `via`, and for :py:data:`ADD_ROUTE
<mitogen.core.ADD_ROUTE>` messages to be generated as appropriate.
.. code-block:: python
@ -635,9 +637,9 @@ Receiver Class
:param mitogen.core.Context respondent:
Reference to the context this receiver is receiving from. If not
``None``, arranges for the receiver to receive
:py:data:`mitogen.core._DEAD` if messages can no longer be routed to
the context, due to disconnection or exit.
``None``, arranges for the receiver to receive :py:data:`_DEAD` if
messages can no longer be routed to the context, due to disconnection
or exit.
.. attribute:: notify = None
@ -718,7 +720,7 @@ Sender Class
.. py:method:: close ()
Send :py:data:`mitogen.core._DEAD` to the remote end, causing
Send :py:data:`_DEAD` to the remote end, causing
:py:meth:`ChannelError` to be raised in any waiting thread.
.. py:method:: put (data)
@ -745,12 +747,15 @@ Channel Class
Broker Class
============
.. currentmodule:: mitogen.master
.. currentmodule:: mitogen.core
.. autoclass:: Broker
:members:
:inherited-members:
.. currentmodule:: mitogen.master
.. autoclass:: Broker
:members:
Utility Functions
=================

View File

@ -177,7 +177,6 @@ Recursion
Let's try something a little more complex:
.. _serialization-rules:
RPC Serialization Rules
@ -186,16 +185,16 @@ RPC Serialization Rules
The following built-in types may be used as parameters or return values in
remote procedure calls:
* bool
* bytearray
* bytes
* dict
* int
* list
* long
* str
* tuple
* unicode
* :class:`bool`
* :class:`bytearray`
* :func:`bytes`
* :class:`dict`
* :class:`int`
* :func:`list`
* :class:`long`
* :class:`str`
* :func:`tuple`
* :func:`unicode`
User-defined types may not be used, except for:

View File

@ -128,8 +128,8 @@ Generating A Synthetic `mitogen` Package
########################################
Since the bootstrap consists of the :py:mod:`mitogen.core` source code, and
this code is loaded by Python by way of its main script (``__main__`` module),
initially the module layout in the child will be incorrect.
this code is loaded by Python by way of its main script (:mod:`__main__`
module), initially the module layout in the child will be incorrect.
The first step taken after bootstrap is to rearrange :py:data:`sys.modules` slightly
so that :py:mod:`mitogen.core` appears in the correct location, and all
@ -139,7 +139,7 @@ such that :py:mod:`cPickle` correctly serializes instance module names.
Once a synthetic :py:mod:`mitogen` package and :py:mod:`mitogen.core` module
have been generated, the bootstrap **deletes** `sys.modules['__main__']`, so
that any attempt to import it (by :py:mod:`cPickle`) will cause the import to
be satisfied by fetching the master's actual ``__main__`` module. This is
be satisfied by fetching the master's actual :mod:`__main__` module. This is
necessary to allow master programs to be written as a self-contained Python
script.
@ -172,8 +172,8 @@ The Module Importer
###################
An instance of :py:class:`mitogen.core.Importer` is installed in
:py:data:`sys.meta_path`, where Python's ``import`` statement will execute it
before attempting to find a module locally.
:py:data:`sys.meta_path`, where Python's :keyword:`import` statement will
execute it before attempting to find a module locally.
Standard IO Redirection
@ -198,6 +198,8 @@ active, so that ``print`` statements and suchlike promptly appear in the logs.
Function Call Dispatch
######################
.. currentmodule:: mitogen.core
After all initialization is complete, the child's main thread sits in a loop
reading from a :py:class:`Channel <mitogen.core.Channel>` connected to the
:py:data:`CALL_FUNCTION <mitogen.core.CALL_FUNCTION>` handle. This handle is
@ -205,17 +207,26 @@ written to by
:py:meth:`call() <mitogen.master.Context.call>`
and :py:meth:`call_async() <mitogen.master.Context.call_async>`.
:py:data:`CALL_FUNCTION <mitogen.core.CALL_FUNCTION>` only accepts requests
from the context IDs listed in :py:data:`mitogen.parent_ids`, forming a chain
of trust between the master and any intermediate context leading to the
recipient of the message. In combination with :ref:`source-verification`, this
is a major contributor to ensuring contexts running on compromised
infrastructure cannot trigger code execution in siblings or any parent.
Shutdown
########
.. currentmodule:: mitogen.core
When a context receives :py:data:`SHUTDOWN <mitogen.core.SHUTDOWN>` from its
immediate parent, it closes its own :py:data:`CALL_FUNCTION
<mitogen.core.CALL_FUNCTION>` :py:class:`Channel <mitogen.core.Channel>` before
sending ``SHUTDOWN`` to any directly connected children. Closing the channel
has the effect of causing :py:meth:`ExternalContext._dispatch_calls()
<mitogen.core.ExternalContext._dispatch_calls>` to exit and begin joining on
the broker thread.
sending :py:data:`SHUTDOWN <mitogen.core.SHUTDOWN>` to any directly connected
children. Closing the channel has the effect of causing
:py:meth:`ExternalContext._dispatch_calls` to exit and begin joining on the
broker thread.
During shutdown, the master waits up to 5 seconds for children to disconnect
gracefully before force disconnecting them, while children will use that time
@ -234,7 +245,7 @@ irritating delays would often be experienced during program termination.
If the main thread (responsible for function call dispatch) fails to shut down
gracefully, because some user function is hanging, it will still be cleaned up
since as the final step in broker shutdown, the broker sends
:py:data:`signal.SIGTERM` to its own process.
:py:mod:`signal.SIGTERM <signal>` to its own process.
.. _stream-protocol:
@ -242,6 +253,8 @@ since as the final step in broker shutdown, the broker sends
Stream Protocol
---------------
.. currentmodule:: mitogen.core
Once connected, a basic framing protocol is used to communicate between
parent and child:
@ -263,29 +276,60 @@ parent and child:
Masters listen on the following handles:
.. data:: mitogen.core.FORWARD_LOG
.. _FORWARD_LOG:
.. currentmodule:: mitogen.core
.. data:: FORWARD_LOG
Receives `(logger_name, level, msg)` 3-tuples and writes them to the
master's ``mitogen.ctx.<context_name>`` logger.
.. data:: mitogen.core.GET_MODULE
.. _GET_MODULE:
.. currentmodule:: mitogen.core
.. data:: GET_MODULE
Receives `(reply_to, fullname)` 2-tuples, looks up the source code for the
module named ``fullname``, and writes the source along with some metadata
back to the handle ``reply_to``. If lookup fails, ``None`` is sent instead.
Receives the name of a module to load `fullname`, locates the source code
for `fullname`, and routes one or more :py:data:`LOAD_MODULE` messages back
towards the sender of the :py:data:`GET_MODULE` request. If lookup fails,
``None`` is sent instead.
.. data:: mitogen.core.ALLOCATE_ID
See :ref:`import-preloading` for a deeper discussion of
:py:data:`GET_MODULE`/:py:data:`LOAD_MODULE`.
.. _ALLOCATE_ID:
.. currentmodule:: mitogen.core
.. data:: ALLOCATE_ID
Replies to any message sent to it with a newly allocated unique context ID,
to allow children to safely start their own contexts. In future this is
likely to be replaced by 32-bit context IDs and pseudorandom allocation,
with an improved ``ADD_ROUTE`` message sent upstream rather than downstream
that generates NACKs if any ancestor detects an ID collision.
with an improved :py:data:`ADD_ROUTE` message sent upstream rather than
downstream that generates NACKs if any ancestor detects an ID collision.
Children listen on the following handles:
.. data:: mitogen.core.CALL_FUNCTION
.. _LOAD_MODULE:
.. currentmodule:: mitogen.core
.. data:: LOAD_MODULE
Receives `(pkg_present, path, compressed, related)` tuples, composed of:
* **pkg_present**: Either ``None`` for a plain ``.py`` module, or a list of
canonical names of submodules existing witin this package. For example, a
:py:data:`LOAD_MODULE` for the :py:mod:`mitogen` package would return a
list like: `["mitogen.core", "mitogen.fakessh", "mitogen.master", ..]`.
This list is used by children to avoid generating useless round-trips due
to Python 2.x's :keyword:`import` statement behavior.
* **path**: Original filesystem where the module was found on the master.
* **compressed**: :py:mod:`zlib`-compressed module source code.
* **related**: list of canonical module names on which this module appears
to depend. Used by children that have ever started any children of their
own to preload those children with :py:data:`LOAD_MODULE` messages in
response to a :py:data:`GET_MODULE` request.
.. _CALL_FUNCTION:
.. currentmodule:: mitogen.core
.. data:: CALL_FUNCTION
Receives `(mod_name, class_name, func_name, args, kwargs)`
5-tuples from
@ -293,52 +337,57 @@ Children listen on the following handles:
imports ``mod_name``, then attempts to execute
`class_name.func_name(\*args, \**kwargs)`.
When this channel is closed (by way of sending ``_DEAD`` to it), the
child's main thread begins graceful shutdown of its own `Broker` and
`Router`.
When this channel is closed (by way of sending :py:data:`_DEAD` to it), the
child's main thread begins graceful shutdown of its own :py:class:`Broker`
and :py:class:`Router`.
.. data:: mitogen.core.SHUTDOWN
.. _SHUTDOWN:
.. currentmodule:: mitogen.core
.. data:: SHUTDOWN
When received from a child's immediate parent, causes the broker thread to
enter graceful shutdown, including writing ``_DEAD`` to the child's main
thread, causing it to join on the exit of the broker thread.
enter graceful shutdown, including writing :py:data:`_DEAD` to the child's
main thread, causing it to join on the exit of the broker thread.
The final step of a child's broker shutdown process sends
:py:data:`signal.SIGTERM` to itself, ensuring the process dies even if the
main thread was hung executing user code.
:py:mod:`signal.SIGTERM <signal>` to itself, ensuring the process dies even
if the main thread was hung executing user code.
Each context is responsible for sending ``SHUTDOWN`` to each of its
directly connected children in response to the master sending ``SHUTDOWN``
to it, and arranging for the connection to its parent to be closed shortly
thereafter.
Each context is responsible for sending :py:data:`SHUTDOWN` to each of its
directly connected children in response to the master sending
:py:data:`SHUTDOWN` to it, and arranging for the connection to its parent
to be closed shortly thereafter.
.. data:: mitogen.core.ADD_ROUTE
.. _ADD_ROUTE:
.. currentmodule:: mitogen.core
.. data:: ADD_ROUTE
Receives `(target_id, via_id)` integer tuples, describing how messages
arriving at this context on any Stream should be forwarded on the stream
associated with the Context `via_id` such that they are eventually
delivered to the target Context.
arriving at this context on any stream should be forwarded on the stream
associated with the context `via_id` such that they are eventually
delivered to the target context.
This message is necessary to inform intermediary contexts of the existence
of a downstream Context, as they do not otherwise parse traffic they are
fowarding to their downstream contexts that may cause new contexts to be
established.
Given a chain `master -> ssh1 -> sudo1`, no `ADD_ROUTE` message is
Given a chain `master -> ssh1 -> sudo1`, no :py:data:`ADD_ROUTE` message is
necessary, since :py:class:`mitogen.core.Router` in the `ssh` context can
arrange to update its routes while setting up the new child during
`proxy_connect()`.
:py:meth:`Router.proxy_connect() <mitogen.master.Router.proxy_connect>`.
However, given a chain like `master -> ssh1 -> sudo1 -> ssh2 -> sudo2`,
`ssh1` requires an `ADD_ROUTE` for `ssh2`, and both `ssh1` and `sudo1`
require an `ADD_ROUTE` for `sudo2`, as neither directly dealt with its
establishment.
`ssh1` requires an :py:data:`ADD_ROUTE` for `ssh2`, and both `ssh1` and
`sudo1` require an :py:data:`ADD_ROUTE` for `sudo2`, as neither directly
dealt with its establishment.
Children that have ever been used to create a descendent child also listen on
the following handles:
.. data:: mitogen.core.GET_MODULE
.. currentmodule:: mitogen.core
.. data:: GET_MODULE
As with master's ``GET_MODULE``, except this implementation
(:py:class:`mitogen.master.ModuleForwarder`) serves responses using
@ -356,13 +405,15 @@ triggered by :py:meth:`call_async() <mitogen.master.Context.call_async>`.
Sentinel Value
##############
.. autodata:: mitogen.core._DEAD
.. _DEAD:
.. currentmodule:: mitogen.core
.. data:: _DEAD
The special value :py:data:`mitogen.core._DEAD` is used to signal
disconnection or closure of the remote end. It is used internally by
:py:class:`Channel <mitogen.core.Channel>` and also passed to any function
still registered with :py:meth:`add_handler()
<mitogen.core.Router.add_handler>` during Broker shutdown.
This special value is used to signal disconnection or closure of the remote
end. It is used internally by :py:class:`Channel <mitogen.core.Channel>`
and also passed to any function still registered with
:py:meth:`add_handler() <mitogen.core.Router.add_handler>` during Broker
shutdown.
Use of Pickle
@ -411,16 +462,18 @@ communicate with.
When :py:class:`mitogen.core.Router` receives a message, it checks the IDs
associated with its directly connected streams for a potential route. If any
stream matches, either because it directly connects to the target ID, or
because the master sent an ``ADD_ROUTE`` message associating it, then the
message will be forwarded down the tree using that stream.
because the master sent an :py:data:`ADD_ROUTE <mitogen.core.ADD_ROUTE>`
message associating it, then the message will be forwarded down the tree using
that stream.
If the message does not match any ``ADD_ROUTE`` message or stream, instead it
is forwarded upwards to the immediate parent, and recursively by each parent in
turn until one is reached that knows how to forward the message down the tree.
If the message does not match any :py:data:`ADD_ROUTE <mitogen.core.ADD_ROUTE>`
message or stream, instead it is forwarded upwards to the immediate parent, and
recursively by each parent in turn until one is reached that knows how to
forward the message down the tree.
When the master establishes a new context via an existing child context, it
sends corresponding ``ADD_ROUTE`` messages to each indirect parent between the
context and the root.
sends corresponding :py:data:`ADD_ROUTE <mitogen.core.ADD_ROUTE>` messages to
each indirect parent between the context and the root.
Example
@ -441,6 +494,24 @@ When ``sudo:node22a:webapp`` wants to send a message to
.. image:: images/route.png
.. _source-verification:
Source Verification
###################
Before forwarding or dispatching a message it has received,
:py:class:`mitogen.core.Router` first looks up the corresponding
:py:class:`mitogen.core.Stream` it would use to send responses towards the
message source, and if the looked up stream does not match the stream on which
the message was received, the message is discarded and a warning is logged.
This creates a trust chain leading up to the root of the tree, preventing
downstream contexts from injecting messages appearing to be from the master or
any more trustworthy parent. In this way, privileged functionality such as
:py:data:`CALL_FUNCTION <mitogen.core.CALL_FUNCTION>` can base trust decisions
on the accuracy of :py:ref:`src_id <stream-protocol>`.
Future
######
@ -465,23 +536,25 @@ The Module Importer
are a variety of approaches to implementing it, and the present implementation
is not pefectly efficient in every case.
It operates by intercepting ``import`` statements via `sys.meta_path`, asking
Python if it can satisfy the import by itself, and if not, indicating to Python
that it is capable of loading the module.
It operates by intercepting :keyword:`import` statements via
:py:data:`sys.meta_path`, asking Python if it can satisfy the import by itself,
and if not, indicating to Python that it is capable of loading the module.
In :py:meth:`load_module() <mitogen.core.Importer.load_module>` an RPC is
started to the parent context, requesting the module source code. Once the
source is fetched, the method builds a new module object using the best
practice documented in PEP-302.
started to the parent context, requesting the module source code by way of a
:py:data:`GET_MODULE <mitogen.core.GET_MODULE>`. If the parent context does not
have the module available, it recursively forwards the request upstream, while
avoiding duplicate requests for the same module from its own threads and any
child contexts.
Neutralizing ``__main__``
#########################
Neutralizing :py:mod:`__main__`
###############################
To avoid accidental execution of the ``__main__`` module's code in a slave
context, when serving the source of the main module, Mitogen removes any code
occurring after the first conditional that looks like a standard ``__main__``
execution guard:
To avoid accidental execution of the :py:mod:`__main__` module's code in a
slave context, when serving the source of the main module, Mitogen removes any
code occurring after the first conditional that looks like a standard
:py:mod:`__main__` execution guard:
.. code-block:: python
@ -506,11 +579,12 @@ requests will be made for modules that do not exist. For example:
import sys
import os
In Python 2.x, Python will first try to load ``mypkg.sys`` and ``mypkg.os``,
which do not exist, before falling back on :py:mod:`sys` and :py:mod:`os`.
In Python 2.x, Python will first try to load :py:mod:`mypkg.sys` and
:py:mod:`mypkg.os`, which do not exist, before falling back on :py:mod:`sys`
and :py:mod:`os`.
These negative imports present a challenge, as they introduce a large number of
pointless network roundtrips. Therefore in addition to the
pointless network round-trips. Therefore in addition to the
:py:mod:`zlib`-compressed source, for packages the master sends along a list of
child modules known to exist.
@ -521,6 +595,77 @@ module does not appear in the enumeration of child modules belonging to the
package that was provided by the master.
.. _import-preloading:
Import Preloading
#################
.. currentmodule:: mitogen.core
To further avoid round-trips, when a module or package is requested by a child,
its bytecode is scanned in the master to find all the module's
:keyword:`import` statements, and of those, which associated modules appear to
have been loaded in the master's :py:data:`sys.modules`.
The :py:data:`sys.modules` check is necessary to handle various kinds of
conditional execution, for example, when a module's code guards an
:keyword:`import` statement based on the active Python runtime version,
operating system, or optional third party dependencies.
Before replying to a child's request for a module with dependencies:
* If the request is for a package, any dependent modules used by the package
that appear within the package itself are known to be missing from the child,
since the child requested the top-level package module, therefore they are
pre-loaded into the child using :py:data:`LOAD_MODULE` messages before
sending the :py:data:`LOAD_MODULE` message for the requested package module
itself. In this way, the child will already have dependent modules cached by
the time it receives the requested module, avoiding one round-trip for each
dependency.
For example, when a child requests the :py:mod:`django` package, and the master
determines the :py:mod:`django` module code in the master has :keyword:`import`
statements for :py:mod:`django.utils`, :py:mod:`django.utils.lru_cache`, and
:py:mod:`django.utils.version`,
and that exceution of the module code on the master caused those modules to
appear in the master's :py:data:`sys.modules`, there is high probability
execution of the :py:mod:`django` module code in the child will cause the
same modules to be loaded. Since all those modules exist within the
:py:mod:`django` package, and we already know the child lacks that package,
it is safe to assume the child will make follow-up requests for those modules
too.
In the example, this replaces 4 round-trips with 1 round-trip.
For any package module ever requested by a child, the parent keeps a note of
the name of the package for one final optimization:
* If the request is for a sub-module of a package, and it is known the child
loaded the package's implementation from the parent, then any dependent
modules of the requested module at any nesting level within the package that
is known to be missing are sent using :py:data:`LOAD_MODULE` messages before
sending the :py:data:`LOAD_MODULE` message for the requested module, avoiding
1 round-trip for each dependency within the same top-level package.
For example, when a child has previously requested the :py:mod:`django`
package module, the parent knows the package was completely absent on the
child. Therefore when the child subsequently requests the
:py:mod:`django.db` package module, it is safe to assume the child will
generate subsequent :py:data:`GET_MODULE` requests for the 2
:py:mod:`django.conf`, 3 :py:mod:`django.core`, 2 :py:mod:`django.db`, 3
:py:mod:`django.dispatch`, and 7 :py:mod:`django.utils` indirect dependencies
for :py:mod:`django.db`.
In the example, this replaces 17 round-trips with 1 round-trip.
The method used to detect import statements is similar to the standard library
:py:mod:`modulefinder` module: rather than analyze module source code,
:ref:`IMPORT_NAME <python:bytecodes>` opcodes are extracted from the module's
bytecode. This is since clean source analysis methods (:py:mod:`ast` and
:py:mod:`compiler`) are an order of magnitude slower, and incompatible across
major Python versions.
Child Module Enumeration
########################

View File

@ -142,6 +142,26 @@ further effort.
.. _py2exe: http://www.py2exe.org/
Common sources of import latency and bandwidth consumption are mitigated:
* Modules need only be uploaded once per directly connected context. Subsequent
requests for modules from children of that context will be served by the
child itself.
* Imports by threads within a context triggering a load are deduplicated and
joined with any identical requests triggered by other threads in the same
context and children in the context's subtree.
* No roundtrip is required for negative responses due to Python 2's import
statement semantics: children have a list of submodules belonging to a
package, and ignore requests for submodules that did not exist on the master.
* Imports are extracted from each module, compared to those found in memory,
and recursively preloaded into children requesting that module, minimizing
round-trips to one per package nesting level. For example,
:py:mod:`django.db.models` only requires 3 round-trips to transfer 456KiB,
representing 1.7MiB of uncompressed source split across 148 modules.
SSH Client Emulation
####################

View File

@ -162,6 +162,13 @@ Other Stream Subclasses
:members:
Importer Class
--------------
.. currentmodule:: mitogen.core
.. autoclass:: Importer
:members:
ExternalContext Class
---------------------
@ -201,6 +208,9 @@ ExternalContext Class
The :py:class:`IoLogger` connected to ``stderr``.
.. method:: _dispatch_calls
Implementation for the main thread in every child context.
mitogen.master
==============

View File

@ -103,7 +103,6 @@ def _unpickle_dead():
return _DEAD
#: Sentinel value used to represent :py:class:`Channel` disconnection.
_DEAD = Dead()