proxy.py/tutorial/http_parser.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# HttpParser\n",
    "\n",
    "`HttpParser` class is at the heart of everything related to HTTP.  It is used by `Web server` and `Proxy server` core and their plugin eco-system.  As the name suggests, it is capable of parsing both HTTP request and response packets.  It can also parse HTTP look-a-like protocols like ICAP, SIP etc.  Most importantly, remember that `HttpParser` was originally written to handle HTTP packets arriving in the context of a proxy server and till date its default behavior favors the same flavor."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## HTTP Web Requests\n",
    "\n",
    "Let's parse a typical HTTP web request using `HttpParser`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "b'GET / HTTP/1.1\\r\\nHost: jaxl.com\\r\\n\\r\\n'\n"
     ]
    }
   ],
   "source": [
    "from proxy.http.methods import httpMethods\n",
    "from proxy.http.parser import HttpParser, httpParserTypes, httpParserStates\n",
    "from proxy.common.constants import HTTP_1_1\n",
    "\n",
    "get_request = HttpParser(httpParserTypes.REQUEST_PARSER)\n",
    "get_request.parse(memoryview(b'GET / HTTP/1.1\\r\\nHost: jaxl.com\\r\\n\\r\\n'))\n",
    "\n",
    "# Rebuild the raw request\n",
    "print(get_request.build())\n",
    "\n",
    "assert get_request.is_complete\n",
    "assert get_request.method == httpMethods.GET\n",
    "assert get_request.version == HTTP_1_1\n",
    "assert get_request.host == None\n",
    "assert get_request.port == 80\n",
    "assert get_request._url != None\n",
    "assert get_request._url.remainder == b'/'\n",
    "assert get_request.has_header(b'host')\n",
    "assert get_request.header(b'host') == b'jaxl.com'\n",
    "assert len(get_request.headers) == 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> NOTE:\n",
    ">\n",
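 `get_request.host`">
    "> `get_request.host` is `None` because the request line `GET / HTTP/1.1` carries no host.\n",
    ">\n",
    "> To get the expected host value, read the `Host` header instead: `get_request.header(b'host') == b'jaxl.com'`."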
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## HTTP Proxy Requests\n",
    "\n",
    "Next, let's parse a HTTP proxy request using `HttpParser`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "b'GET / HTTP/1.1\\r\\nHost: jaxl.com\\r\\n\\r\\n'\n",
      "b'GET http://jaxl.com:80/ HTTP/1.1\\r\\nHost: jaxl.com\\r\\n\\r\\n'\n"
     ]
    }
   ],
   "source": [
    "proxy_request = HttpParser(httpParserTypes.REQUEST_PARSER)\n",
    "proxy_request.parse(memoryview(b'GET http://jaxl.com/ HTTP/1.1\\r\\nHost: jaxl.com\\r\\n\\r\\n'))\n",
    "\n",
    "print(proxy_request.build())\n",
    "print(proxy_request.build(for_proxy=True))\n",
    "\n",
    "assert proxy_request.is_complete\n",
    "assert proxy_request.method == httpMethods.GET\n",
    "assert proxy_request.version == HTTP_1_1\n",
    "assert proxy_request.host == b'jaxl.com'\n",
    "assert proxy_request.port == 80\n",
    "assert proxy_request._url != None\n",
    "assert proxy_request._url.remainder == b'/'\n",
    "assert proxy_request.has_header(b'host')\n",
    "assert proxy_request.header(b'host') == b'jaxl.com'\n",
    "assert len(proxy_request.headers) == 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> NOTE:\n",
    ">\n",
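 Notice how">
    "> Notice the difference: `proxy_request.build()` emits the request line `GET / HTTP/1.1`, while `proxy_request.build(for_proxy=True)` emits `GET http://jaxl.com:80/ HTTP/1.1`.\n",
    ">\n",
    "> Also, the `proxy_request.host` field is populated here.\n"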
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## HTTPS Proxy Requests\n",
    "\n",
    "To conclude, let's parse a HTTPS proxy request"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "b'CONNECT / HTTP/1.1\\r\\nHost: jaxl.com:443\\r\\n\\r\\n'\n",
      "b'CONNECT jaxl.com:443 HTTP/1.1\\r\\nHost: jaxl.com:443\\r\\n\\r\\n'\n"
     ]
    }
   ],
   "source": [
    "connect_request = HttpParser(httpParserTypes.REQUEST_PARSER)\n",
    "connect_request.parse(memoryview(b'CONNECT jaxl.com:443 HTTP/1.1\\r\\nHost: jaxl.com:443\\r\\n\\r\\n'))\n",
    "\n",
    "print(connect_request.build())\n",
    "print(connect_request.build(for_proxy=True))\n",
    "\n",
    "assert connect_request.is_complete\n",
    "assert connect_request.is_https_tunnel\n",
    "assert connect_request.version == HTTP_1_1\n",
    "assert connect_request.host == b'jaxl.com'\n",
    "assert connect_request.port == 443\n",
    "assert connect_request._url != None\n",
    "assert connect_request._url.remainder == None\n",
    "assert connect_request.has_header(b'host')\n",
    "assert connect_request.header(b'host') == b'jaxl.com:443'\n",
    "assert len(connect_request.headers) == 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Take Away\n",
    "\n",
    "- `host` and `port` attributes represent the `host:port` present in the HTTP packet request line.  For comparison purposes, below are all the three request lines again.  Notice how there is no `host:port` available only for the web server get request\n",
    "  | Request Type      | Request Line     |\n",
    "  | ------------------| ---------------- |\n",
    "  | `get_request`     | `GET / HTTP/1.1` |\n",
    "  | `proxy_request`   | `GET http://jaxl.com HTTP/1.1` |\n",
    "  | `connect_request` | `CONNECT jaxl.com:443 HTTP/1.1` |\n",
    "- `_url` attribute is an instance of `Url` parser and contains parsed information about the URL found in the request line\n",
    "\n",
    "Few of the other handy properties within `HttpParser` are:\n",
    "\n",
    "- `is_complete`\n",
    "- `is_http_1_1_keep_alive`\n",
    "- `is_connection_upgrade`\n",
    "- `is_https_tunnel`\n",
    "- `is_chunked_encoded`\n",
    "- `content_expected`\n",
    "- `body_expected`"
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "da9d6927d62b2b95bde149eedfbd5367cb7f465aad65a736f49c99ee3db39df7"
  },
  "kernelspec": {
   "display_name": "Python 3.10.0 64-bit ('venv310': venv)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}