proxy.py/examples/web_scraper.py

65 lines
2.1 KiB
Python
Raw Normal View History

# -*- coding: utf-8 -*-
"""
proxy.py
~~~~~~~~
Fast, Lightweight, Pluggable, TLS interception capable proxy server focused on
Network monitoring, controls & Application development, testing, debugging.
:copyright: (c) 2013-present by Abhinav Singh and contributors.
:license: BSD, see LICENSE for more details.
"""
import time
from proxy import Proxy
from proxy.core.work import Work
from proxy.common.types import Readables, Writables, SelectableEvents
from proxy.core.connection import TcpClientConnection
class WebScraper(Work[TcpClientConnection]):
"""Demonstrates how to orchestrate a generic work acceptors and executors
workflow using proxy.py core.
By default, `WebScraper` expects to receive work from a file on disk.
Each line in the file must be a URL to scrape. Received URL is scrapped
by the implementation in this class.
After scrapping, results are published to the eventing core. One or several
result subscriber can then handle the result as necessary. Currently, result
subscribers consume the scrapped response and write discovered URL in the
file on the disk. This creates a feedback loop. Allowing WebScraper to
continue endlessly.
NOTE: No loop detection is performed currently.
NOTE: File descriptor need not point to a file on disk.
Example, file descriptor can be a database connection.
For simplicity, imagine a Redis server connection handling
only PUBSUB protocol.
"""
async def get_events(self) -> SelectableEvents:
"""Return sockets and events (read or write) that we are interested in."""
return {}
Async `get_events`, `handle_event`, `handle_readables`, `handle_writables` (#769) * Asynchronous `handle_event` and `LocalExecutor` thread * Bail out on first task completion * mypy * Add `helper/benchmark.sh` and fix threaded which must now use asyncio (reduced performance of threaded) * Print open file diff from `benchmark.sh` * Add `--local-executor` flag, disabled by default for now until tests are updated * Async `handle_readables` and `handle_writables` for `HttpProtocolHandlerPlugin` interface (doesnt impact proxy/web plugins for now) * Async `get_events` * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Address tests after async changes * mypy and flake8 * spelldoc * `check.py` and trailing comma * Rename to `_assertions.py` * Add missing `pytest-mock` and `pytest-asyncio` deps * Add `pytest-mock` to `pylint` deps * Correct use of `parameterize` and add `PT007` to flake8 ignores * Fix mypy hints broken for `< Python3.9` * Remove usage of `asynccontextmanager` which is not available for all Python versions that `proxy.py` supports * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix for pre-python-3.9 versions * `AsyncTask` apis `set_name` and `get_name` are not available on all supported versions * Install setuptools via `lib-dep` until we recommend editable install * Deprecate support for `Python 3.6` * Use recommendation suggested here https://github.com/abhinavsingh/proxy.py/pull/769\#discussion_r753840929 * Address recommendation here https://github.com/abhinavsingh/proxy.py/pull/769\#discussion_r753841906 * Make `Threadless` agnostic of `multiprocessing.Process` * Acceptors must dispatch to local executor in non-blocking fashion * No daemon for executor processes and fix shutdown logic * Only return fds from `_selected_events` not all events data * Refactor logic * Prefix private methods with `_` * `work_queue` and not `client_queue` * Turn `Threadless` into an abstract executor. Introduce `RemoteExecutor` * Make `LocalExecutor` agnostic of `threading.Thread` * `LocalExecutor` now implements `Threadless` * `get_events` and `get_descriptors` now must return int and not sock. `Threadless` now avoids repeated register/unregister and instead make use of `selectors.modify` * Fix `main` tests * Apply suggestions from code review Co-authored-by: Sviatoslav Sydorenko <wk@sydorenko.org.ua> * Apply code review recommendations manually * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Revert back `Any` and use `addr or None` * Address `flake8` * Update tests to use `fileno` * Fix doc build * Fix doc spell, use tear down and not teardown * Doc updates * Add back support for `Python 3.6` * Acceptors dont need loop initialization * On Python 3.6 `asyncio.new_event_loop()` is necessary * Make doc happy * `--threaded` needs a new event loop for 3.7 too * Always use `asyncio.new_event_loop()` for threaded mode Added e2e integration tests (subprocess & curl) for all modes. * Lint fixes Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Sviatoslav Sydorenko <wk@sydorenko.org.ua>
2021-11-23 09:32:00 +00:00
async def handle_events(
self,
readables: Readables,
writables: Writables,
) -> bool:
"""Handle readable and writable sockets.
Return True to shutdown work."""
return False
if __name__ == '__main__':
with Proxy(
work_klass=WebScraper,
threadless=True,
num_workers=1,
port=12345,
) as pool:
while True:
time.sleep(1)