551 Commits

Author SHA1 Message Date
David Wilson 3579b6806b issue #152: reproduction for second problem 2018-03-19 21:58:35 +05:45
David Wilson c183f06dfb issue #152: respect the Ansible-selected interpreter for local connections too. 2018-03-19 21:58:35 +05:45
David Wilson 89b0faae2f Workaround for global state in yum_repository module; closes #154. 2018-03-19 21:58:35 +05:45
David Wilson 305e024819 issue #154: import user's reproduction 2018-03-19 21:58:35 +05:45
David Wilson 071d9fbfb3 docs: tidy ansible docs. 2018-03-19 21:58:35 +05:45
David Wilson 4d8ccab2ca ansible: docstring fixes 2018-03-19 21:58:35 +05:45
David Wilson 2132c311b2 tests: mark some tests as skipped 2018-03-19 21:58:35 +05:45
David Wilson f241eac5ce parent: allow Python to determine its install prefix from argv[0]
Fixes support for virtualenv. Closes #152.
2018-03-19 21:58:35 +05:45
David Wilson 088fd76109 issue #152: import reproduction 2018-03-19 21:58:35 +05:45
David Wilson dec3af375a issue #144: ansible: increase default pool size to 16. 2018-03-19 21:58:35 +05:45
David Wilson 9cf889b846 issue #144: master: public/private Pool attributes. 2018-03-19 21:58:35 +05:45
David Wilson 19632473dc issue #144: ansible: use service.Pool with default size=1. 2018-03-19 21:58:35 +05:45
David Wilson fe900087a2 issue #144: service: working service.Pool object.
It knows how to dispatch messages from multiple receivers (associated
with multiple services) to multiple threads, where the service
implementation is invoked on the message.

It wakes a maximum of one thread per received message.

It knows how to shut down gracefully.

Implication: due to the latch use, there are 2 file descriptors burned
for every thread. We don't need interruptibility here, so in future, it
might be nice to allow swapping a different queueing primitive into
Select (maybe a subclass?) just for this case.
2018-03-19 21:58:35 +05:45
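
The dispatch behaviour described in the commit above (messages for several services handed to several threads, one thread woken per message, graceful shutdown) can be illustrated with a minimal sketch. It deliberately uses the standard library's queue.Queue instead of Mitogen's latch and Select, so it does not show the two-descriptors-per-thread cost mentioned above. The class name SketchPool and its methods are hypothetical and are not the service.Pool API.

```python
import queue
import threading

_STOP = object()  # sentinel asking one worker to exit


class SketchPool:
    """Hypothetical stand-in for a service pool; not Mitogen's service.Pool."""

    def __init__(self, handlers, size=4):
        # handlers maps a service name to the callable invoked on each message.
        self._handlers = handlers
        self._queue = queue.Queue()
        self.results = queue.Queue()
        self._threads = [
            threading.Thread(target=self._worker, daemon=True)
            for _ in range(size)
        ]
        for thread in self._threads:
            thread.start()

    def dispatch(self, service, message):
        # A single put() wakes at most one blocked worker.
        self._queue.put((service, message))

    def _worker(self):
        while True:
            item = self._queue.get()
            if item is _STOP:
                return
            service, message = item
            self.results.put(self._handlers[service](message))

    def stop(self):
        # One sentinel per thread; the queue is FIFO, so messages queued
        # earlier are still processed before the workers exit.
        for _ in self._threads:
            self._queue.put(_STOP)
        for thread in self._threads:
            thread.join()
```

Calling stop() after the last dispatch() drains the queue before returning, which is one simple way to get the graceful shutdown the commit describes.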
David Wilson 4f361be7e7 issue #144: teach Select() to close its latch
Causes all threads sleeping on the select to wake.
2018-03-19 21:58:35 +05:45
David Wilson 8aada2646c core: support throwing LatchError in every sleeping thread
This is to allow Select() to be used as a generic queueing primitive
that supports graceful shutdown.
2018-03-19 21:58:35 +05:45
David Wilson ebfe733914 core: tidy up Stream.on_receive() branches. 2018-03-19 21:58:35 +05:45
David Wilson 7a74bb0a39 docs: update ansible risks/differences. 2018-03-19 21:58:35 +05:45
David Wilson 4541bc76a0 Add Google Cloud client to dev requirements
Will be used more heavily for CI later, but it's already in use by
gcloud-ansible-playbook.py.
2018-03-19 21:58:35 +05:45
David Wilson bcc15987fc docs: extra ansible paragraph. 2018-03-19 21:58:35 +05:45
David Wilson 858b01e78b issue #150: add docstrings. 2018-03-19 21:58:35 +05:45
David Wilson 6940b23013 issue #150: ansible: mark worker/child sock as CLOEXEC. 2018-03-19 21:58:35 +05:45
David Wilson 7a394dc73e ansible: allow establishment of duplicate SSH connections 2018-03-19 21:58:35 +05:45
David Wilson 86ede62241 issue #150: introduce separate connection multiplexer process
This is a work in progress.
2018-03-19 21:58:35 +05:45
David Wilson eee5423dd9 issue #150: tidy up mitogen.debug output for use next time 2018-03-19 21:58:35 +05:45
David Wilson 9adadb5c3a issue #150: import stack.py hack as mitogen.debug
Usage:
  - insert a call to mitogen.debug() in the desired process
  - kill -USR2 that process
  - observe its controlling TTY produces thread stack dumps
2018-03-19 21:58:35 +05:45
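
A rough sketch of the signal-triggered stack dump described in the usage notes above, assuming a Unix platform where SIGUSR2 exists. The function names are hypothetical and this is not the mitogen.debug implementation; it writes to stderr by default, whereas the original is described as writing to the controlling TTY.

```python
import signal
import sys
import threading
import traceback


def format_thread_stacks():
    """Return a text dump of the current stack of every live thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    parts = []
    for ident, frame in sys._current_frames().items():
        parts.append('Thread %s (%s):\n' % (ident, names.get(ident, '?')))
        parts.extend(traceback.format_stack(frame))
    return ''.join(parts)


def install_stack_dump_handler(signum=signal.SIGUSR2, out=None):
    """Write all thread stacks to `out` whenever `signum` is delivered."""
    stream = out if out is not None else sys.stderr

    def handler(_signum, _frame):
        stream.write(format_thread_stacks())
        stream.flush()

    # signal.signal() must be called from the main thread.
    signal.signal(signum, handler)
```

After install_stack_dump_handler() has run in a process, `kill -USR2 <pid>` from a shell should make that process print its thread stacks.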
David Wilson df488237d4 core: fix race in PidfulStreamHandler
Need to re-test with the lock held, else >1 threads can end up waiting
for lock then reopening the log repeatedly.
2018-03-19 21:58:35 +05:45
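
The fix above is an instance of re-checking a condition after acquiring a lock. A minimal sketch of that pattern follows, assuming a log file that must be reopened when the process ID changes; the class ReopeningLog is hypothetical and is not PidfulStreamHandler.

```python
import os
import threading


class ReopeningLog:
    """Hypothetical helper that reopens its file when the PID changes."""

    def __init__(self, path):
        self.path = path
        self._lock = threading.Lock()
        self._pid = None
        self._fp = None
        self.reopen_count = 0

    def _ensure_open(self):
        if self._pid != os.getpid():          # cheap test without the lock
            with self._lock:
                if self._pid != os.getpid():  # re-test with the lock held
                    if self._fp is not None:
                        self._fp.close()
                    self._fp = open(self.path, 'a')
                    self._pid = os.getpid()
                    self.reopen_count += 1

    def write(self, text):
        self._ensure_open()
        with self._lock:
            self._fp.write(text)
            self._fp.flush()
```

Without the second test, every thread that queued for the lock while the first one was reopening the file would reopen it again once it acquired the lock.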
David Wilson a06c92d285 core: enable_debug_logging() should reopen file post-fork. 2018-03-19 21:58:35 +05:45
David Wilson 051fb85d2d issue #150: 100 target docker inventory 2018-03-19 21:58:34 +05:45
David Wilson 4691ce0b95 issue #150: ansible: add basic Docker support. 2018-03-19 21:58:34 +05:45
David Wilson b64e52b1fd issue #150: tweak script for running without external IPs 2018-03-19 21:58:34 +05:45
David Wilson 8607680730 issue #150: quick script to run ansible against gcloud instance group 2018-03-19 21:58:34 +05:45
David Wilson 67ff762ba5 issue #139: docs: remove note about bad buffering 2018-03-19 21:58:34 +05:45
David Wilson eba12e2ee2 issue #139: bump kernel socket buffer size to 128kb
This allows us to write 128kb at a time towards SSH, but it doesn't help
with sudo, where the ancient tty layer is always used.
2018-03-19 21:58:34 +05:45
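
For illustration, requesting larger kernel socket buffers looks roughly like the sketch below. Whether the commit adjusts the send buffer, the receive buffer, or both is an assumption here, and the helper name is hypothetical. The kernel may round or clamp the requested size, so the function reports what was actually applied.

```python
import socket


def bump_socket_buffers(sock, size=128 * 1024):
    """Request `size`-byte kernel buffers; return the (send, recv) sizes applied."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    return (
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
    )
```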
David Wilson 728a0da8a4 issue #139: eliminate quadratic behaviour from transmit path
Implication: the entire message remains buffered until its last byte is
transmitted. Not wasting time on it, as there are pieces of work like
issue #6 that might invalidate these problems on the transmit path
entirely.
2018-03-19 21:58:34 +05:45
David Wilson a3b4b459fa issue #139: eliminate quadratic behaviour on input path
Rather than slowly build up a Python string over time, we just store a
deque of chunks (which, in a later commit, will now be around 128KB
each), and track the total buffer size in a separate integer.

The tricky loop is there to ensure the header does not need to be sliced
off the full message (which may be huge, causing yet another spike and
copy), but rather only off the much smaller first 128kb-sized chunk
received.

There is one more problem with this code: the ''.join() causes RAM usage
to temporarily double, but that was true of the old solution too. Shall
wait for bug reports before fixing this, as it gets very ugly very fast.
2018-03-19 21:58:34 +05:45
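
The chunk-deque idea in the commit above can be sketched as follows. The 4-byte length-prefixed framing and the ChunkBuffer class are assumptions made for the example, not Mitogen's wire format or code. The header is sliced only from the first chunk, and the final join still copies the message body once, as the commit message notes.

```python
import collections
import struct

HEADER = struct.Struct('>L')  # hypothetical 4-byte big-endian length prefix


class ChunkBuffer:
    """Buffer received chunks in a deque and track the total size separately."""

    def __init__(self):
        self._chunks = collections.deque()
        self._size = 0

    def feed(self, data):
        self._chunks.append(data)
        self._size += len(data)

    def next_message(self):
        # Merge leading chunks only until the first one covers the header.
        while len(self._chunks) > 1 and len(self._chunks[0]) < HEADER.size:
            first = self._chunks.popleft()
            self._chunks[0] = first + self._chunks[0]
        if self._size < HEADER.size:
            return None
        (length,) = HEADER.unpack_from(self._chunks[0])
        if self._size < HEADER.size + length:
            return None
        # Slice the header off the small first chunk, not the whole message.
        self._chunks[0] = self._chunks[0][HEADER.size:]
        self._size -= HEADER.size
        parts = []
        remaining = length
        while remaining:
            chunk = self._chunks.popleft()
            if len(chunk) > remaining:
                # Push back the bytes that belong to the next message.
                self._chunks.appendleft(chunk[remaining:])
                chunk = chunk[:remaining]
            parts.append(chunk)
            remaining -= len(chunk)
        self._size -= length
        return b''.join(parts)
```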
David Wilson ba9a06d0f5 issue #139: core: Side.write(): let the OS write as much as possible.
There is no penalty for passing as much data to the OS as possible: it
is not copied, and for a non-blocking socket, the OS will just buffer
as much as it can and tell us how much that was.

Also avoids a rather pointless string slice.
2018-03-19 21:58:34 +05:45
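
A small sketch of the approach described above: hand the whole buffer to os.write() on a non-blocking descriptor and let the return value say how much was accepted. The function name is hypothetical and this is not the Side.write() source.

```python
import os
import socket


def write_some(fd, buf):
    """Offer all of `buf` to the OS; return how many bytes it accepted."""
    try:
        return os.write(fd, buf)
    except BlockingIOError:
        # The kernel buffer is full; nothing was written this time.
        return 0
```

The caller keeps track of the returned count and retries with the unwritten tail once the descriptor becomes writable again, so no slice is taken before the write.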
David Wilson 49db4125d0 issue #139: core: bump CHUNK_SIZE from 16kb to 128Kb
Reduces the number of IO loop iterations required to receive large
messages at a small cost to RAM usage.

Note that when calling read() with a large buffer value like this,
Python must zero-allocate that much RAM. In other words, for even a
single byte received, 128kb of RAM might need to be written.
Consequently CHUNK_SIZE is quite a sensitive value and this might need
further tuning.
2018-03-19 21:58:34 +05:45
David Wilson 8e2b07a54e issue #139: add profiling=True option to mitogen.main(). 2018-03-19 21:58:34 +05:45
David Wilson 017e8105cf issue #131: disable non-blocking IO during UNIX accept()
accept() (per interface) returns a non-blocking socket because the
listener socket is in non-blocking mode, therefore it is pure scheduling
luck that a connecting-in child has a chance to write anything for the
top-level process to read during the subsequent .recv().

A higher forks setting in ansible.cfg was enough to cause our luck to
run out, causing the .recv() to crash with EAGAIN, and the multiplexer
to respond to the handler's crash by calling its disconnect method. This
is why some reports mentioned ECONNREFUSED -- the listener really was
gone, because its Stream class had crashed.

Meanwhile since the window where we're waiting for the remote process to
identify itself is tiny, simply flip off O_NONBLOCK for the duration of
the connection handshake. Stream.accept() (via Side.__init__) will
reenable O_NONBLOCK for the descriptors it duplicates, so we don't even
need to bother turning this back on.

A better solution entails splitting Stream up into a state machine and
doing the handshake with non-blocking IO, but that isn't going to be
available until asynchronous connect is implemented. Meanwhile in
reality this solution is probably 100% fine.
2018-03-19 21:58:34 +05:45
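
The workaround above amounts to switching the accepted socket to blocking mode for the duration of the handshake read. A minimal sketch, assuming the handshake is a fixed number of bytes; the function name and the handshake format are hypothetical, and restoring non-blocking mode afterwards is left to the caller, as the commit describes Stream.accept() doing.

```python
import socket


def accept_with_blocking_handshake(listener, nbytes):
    """Accept a connection and read its handshake with blocking IO."""
    sock, _addr = listener.accept()
    # On some platforms the accepted socket inherits non-blocking mode from
    # the listener, so force blocking mode before the first recv().
    sock.setblocking(True)
    data = sock.recv(nbytes)
    return sock, data
```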
David Wilson 44d36eccba issue #146: don't crash during on_broker_shutdown
There is some insane unidentifiable Mitogen context (the local context?)
that instantly crashes with a higher forks setting. It appears to be
harmless, but meanwhile this naturally shouldn't be happening.
2018-03-19 21:58:34 +05:45
David Wilson cb620500d1 issue #131: log stack and PPID with MITOGEN_ROUTER_DEBUG=1 2018-03-19 21:58:34 +05:45
David Wilson 0f5a31fb52 issue #131: test with forks=50 2018-03-19 21:58:34 +05:45
David Wilson cd455e8c58 ansible: minor tidy up 2018-03-19 21:58:34 +05:45
David Wilson d1888f1908 docs: reorder sections 2018-03-19 21:58:34 +05:45
David Wilson 3e40b9ab8e issue #131: import something clean that might tickle the problem 2018-03-19 21:58:34 +05:45
David Wilson 014247ce66 docs: another crazy Ansible success story 2018-03-19 21:58:34 +05:45
David Wilson 87435bf45d issue #140: nicer filetree construction 2018-03-19 21:58:34 +05:45
David Wilson 3584084be6 issue #140: explicit Broker management, and guard against crap plug-ins.
Implement Connection.__del__, which is almost certainly going to trigger
more bugs down the line, because the state of the Connection instance is
not guaranteed during __del__. Meanwhile, it is temporarily needed for
deployed-today Ansibles that have a buggy synchronize action that does
not call Connection.close().

A better approach to this would be to virtualize the guts of Connection,
and move its management to one central place where we can guarantee
resource destruction happens reliably, but that may entail another
Ansible monkey-patch to give us such a reliable hook.
2018-03-19 21:58:34 +05:45
David Wilson 83c8412474 issue #140: permit mitogen.unix.connect() to accept preconstructed Broker.
Part of an effort to make resource management a little more explicit.
2018-03-19 21:58:34 +05:45
David Wilson 65df36895e issue #140: prevent duplicate watcher thread creation
When a Broker() is running with install_watcher=True, arrange for only
one watcher thread to exist for each target thread, and to reset the
mapping of watchers to targets after process fork.

This is probably the last change I want to make to the watcher feature
before deciding to rip it out; it may be more trouble than it is worth.
2018-03-19 21:58:34 +05:45