hydrus/help/downloader_parsers_formulae...

101 lines
11 KiB
HTML

<html>
<head>
<title>downloader - parsers - formulae</title>
<link href="hydrus.ico" rel="shortcut icon" />
<link href="style.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div class="content">
<p><a href="downloader_parsers.html"><---- Back to main parsers page</a></p>
<h3 id="formulae"><a href="#formulae">formulae</a></h3>
<p>Formulae are tools used by higher-level components of the parsing system. They take some data (typically some HTML or JSON) and return 0 to n strings. For our purposes, these strings will usually be tags, URLs, and timestamps. You will usually see them summarised with this panel:</p>
<p><img src="edit_formula_panel.png" /></p>
<p>The different types are currently <a href="#html_formula">html</a>, <a href="#json_formula">json</a>, <a href="#compound_formula">compound</a>, and <a href="#context_variable_formula">context variable</a>.</p>
<h3 id="html_formula"><a href="#html_formula">html</a></h3>
<p>This takes a full HTML document or a sample of HTML--and any regular sort of XML <i>should</i> also work. It starts at the root node and searches for lower nodes using one or more ordered rules based on tag name and attributes, and then returns string data from those final nodes.</p>
<p>For instance, if you have this:</p>
<p><pre>&lt;html&gt;
&lt;body&gt;
&lt;div class="media_taglist"&gt;
&lt;span class="generaltag"&gt;&lt;a href="(search page)"&gt;blonde hair&lt;/a&gt; (3456)&lt;/span&gt;
&lt;span class="generaltag"&gt;&lt;a href="(search page)"&gt;blue eyes&lt;/a&gt; (4567)&lt;/span&gt;
&lt;span class="generaltag"&gt;&lt;a href="(search page)"&gt;bodysuit&lt;/a&gt; (5678)&lt;/span&gt;
&lt;span class="charactertag"&gt;&lt;a href="(search page)"&gt;samus aran&lt;/a&gt; (2345)&lt;/span&gt;
&lt;span class="artisttag"&gt;&lt;a href="(search page)"&gt;splashbrush&lt;/a&gt; (123)&lt;/span&gt;
&lt;/div&gt;
&lt;div class="content"&gt;
&lt;span class="media"&gt;(a whole bunch of content that doesn't have tags in)&lt;/span&gt;
&lt;/div&gt;
&lt;/body&gt;
&lt;/html&gt;</pre></p>
<p><i>(Most boorus have a taglist like this on their file pages.)</i></p>
<p>To find the artist, "splashbrush", here, you could:</p>
<ul>
<li>search beneath the root tag (&lt;html&gt;) for the &lt;div&gt; tag with attribute class="media_taglist"</li>
<li>search beneath that &lt;div&gt; for &lt;span&gt; tags with attribute class="artisttag"</li>
<li>search beneath those &lt;span&gt; tags for &lt;a&gt; tags</li>
<li>and then get the string content of those &lt;a&gt; tags</li>
</ul>
<p>Changing the "artisttag" to "charactertag" or "generaltag" would give you "samus aran" or "blonde hair","blue eyes","bodysuit" respectively.</p>
<p>You might be tempted to just go straight for any &lt;span&gt; with class="artisttag", but many sites use the same class to render a sidebar of favourite/popular tags or some other sponsored content, so it is generally best to try to narrow down to a larger &lt;div&gt; container so you don't get anything you don't mean.</p>
<p><b>the ui</b></p>
<p>Clicking 'edit formula' on an HTML formula gives you this:</p>
<p><img src="edit_html_formula_panel.png" /></p>
<p>You edit on the left and test on the right.</p>
<p><b>finding the right html tags</b></p>
<p>When you add or edit one of the specific tag search rules, you get this:</p>
<p><img src="edit_html_tag_rule_panel.png" /></p>
<p>You can set multiple key/value attribute search conditions, but you'll typically be searching for 'class' or 'id' here, if anything.</p>
<p>Note that you can set it to fetch only the xth instance of a found tag, which can be useful in situations like this:</p>
<p><pre>&lt;span class="generaltag"&gt;
&lt;a href="(add tag)"&gt;+&lt;/a&gt;
&lt;a href="(remove tag)"&gt;-&lt;/a&gt;
&lt;a href="(search page)"&gt;blonde hair&lt;/a&gt; (3456)
&lt;/span&gt;</pre></p>
<p>Without any more attributes, there isn't a great way to distinguish the &lt;a&gt; with "blonde hair" from the other two--so just set 'get the 3rd &lt;a&gt; tag' and you are good.</p>
<p>Most of the time, you'll be searching descendants (i.e. walking down the tree), but sometimes you might have this:</p>
<p><pre>&lt;span&gt;
&lt;a href="(link to post url)"&gt;
&lt;img class="thumb" src="(thumbnail image)" /&gt;
&lt;/a&gt;
&lt;/span&gt;</pre></p>
<p>There isn't a great way to find the &lt;span&gt; or the &lt;a&gt; when looking from above here, as they are lacking a class or id, but you can find the &lt;img&gt; ok, so if you find those and then add a rule where instead of searching descendants, you are 'walking back up ancestors' like this:</p>
<p><img src="edit_html_formula_panel_descendants_ancestors.png" /></p>
<p>You can solve some tricky problems this way!</p>
<p>You can also set a String Match, which is the same panel as you say in with URL Classes. It tests its best guess at the tag's 'string' value, so you can find a tag with 'Original Image' as its text or that with a regex starts with 'Posted on: '. Have a play with it and you'll figure it out.</p>
<p><b>content to fetch</b></p>
<p>Once you have narrowed down the right nodes you want, you can decide what text to fetch. Given a node of:</p>
<p><pre>&lt;a href="(URL A)" class="thumb_title"&gt;Forest Glade&lt;/a&gt;</pre></p>
<p>Returning the 'href' attribute would return the string "(URL A)", returning the string content would give "Forest Glade", and returning the full html would give "&lt;a href="(URL A)" class="thumb"&gt;Forest Glade&lt;/a&gt;". This last choice is useful in complicated situations where you want a second, separated layer of parsing, which we will get to later.</p>
<p><b>string match and conversion</b></p>
<p>You can set a final String Match to filter the parsed results (e.g. "only allow strings that only contain numbers" or "only allow full URLs as based on (complicated regex)") and String Converter to edit it (e.g. "remove the first three characters of whatever you find" or "decode from base64").</p>
<p>You won't use these much, but they can sometimes get you out of a complicated situation.</p>
<p><b>testing</b></p>
<p>The testing panel on the right is important and worth using. Copy the html from the source you want to parse and then hit the paste buttons to set that as the data to test with.</p>
<h3 id="json_formula"><a href="#json_formula">json</a></h3>
<p>This takes some JSON and does a similar style of search:</p>
<p><img src="edit_json_formula_panel.png" /></p>
<p>It is a bit simpler than HTML--if the current node is a list (called an 'Array' in JSON), you can fetch every item or the xth item, and if it is a dictionary (called an 'Object' in JSON), you can fetch a particular entry by name. Since you can't jump down several layers with attribute lookups or tag names like with HTML, you have to go down every layer one at a time. In any case, if you have something like this:</p>
<p><a href="json_thread_example.png"><img src="json_thread_example.png" width="50%" height="50%"/></a></p>
<p><i>Note: It is a great idea to check the html or json you are trying to parse with your browser. Some web browsers have excellent developer tools that let you walk through the nodes of the document you are trying to parse in a prettier way than I would ever have time to put together. This image is one of the views Firefox provides if you simply enter a JSON URL.</i></p>
<p>Searching for "posts"->1st list item->"sub" on this data will give you "Nobody like kino here.".</p>
<p>Searching for "posts"->all list items->"tim" will give you the three SHA256 file hashes (since the third post has no file attached and so no 'tim' entry, the parser skips over it without complaint).</p>
<p>Searching for "posts"->1st list item->"com" will give you the OP's comment, <span class="dealwithit">~AS RAW UNPARSED HTML~</span>.</p>
<p>The default is to fetch the final nodes' 'data content', which means coercing simple variables into strings. If the current node is a list or dict, no string is returned.</p>
<p>But if you like, you can return the json beneath the current node (which, like HTML, includes the current node). This again will come in useful later.</p>
<h3 id="compound_formula"><a href="#compound_formula">compound</a></h3>
<p>If you want to create a string from multiple parsed strings--for instance by appending the 'tim' and the 'ext' in our json example together--you can use a Compound formula. This fetches multiple lists of strings and tries to place them into a single string using '\1' regex substitution syntax:</p>
<p><img src="edit_compound_formula_panel.png" /></p>
<p>This is a complicated example taken from one of my thread parsers. I have to take a modified version of the original thread URL (the first rule, so \1) and then append the filename (\2) and its extension (\3) on the end to get the final file URL of a post. You can mix in more characters in the substitution phrase, like "\1.jpg" or even have multiple instances ("https://\2.muhsite.com/\2/\1"), if that is appropriate.</p>
<p>This is where the magic happens, sometimes, so keep it in mind if you need to do something cleverer than the data you have seems to provide.</p>
<h3 id="context_variable_formula"><a href="#context_variable_formula">context variable</a></h3>
<p>This is a basic hacky answer to a particular problem. It is a simple key:value dictionary that at the moment only stores one variable, 'url', which contains the original URL used to fetch the data being parsed.</p>
<p class="warning">If a different URL Class links to this parser via an API URL, this 'url' variable will always be the API URL (i.e. it literally is the URL used to fetch the data), not any thread/whatever URL the user entered.</p>
<p><img src="edit_context_variable_formula_panel.png" /></p>
<p>Hit the 'edit example parsing context' to change the URL used for testing.</p>
<p>I have used this several times to stitch together file URLs when I am pulling data from APIs, like in the compound formula example above. In this case, the starting URL is "https://a.4cdn.org/tg/thread/57806016.json", from which I extract the board name, "tg", using the string converter, and then add in 4chan's CDN domain to make the appropriate base file URL (https:/i.4cdn.org/tg/) for the given thread. I only have to jump through this hoop in 4chan's case because they explicitly store file URLs by board name. 8chan on the other hand, for instance, has a static "https://media.8ch.net/file_store/" for all files, so it is a little easier (I think I just do a single 'prepend' string transformation somewhere).</p>
<p>If you want to make some parsers, you will have to get familiar with how different sites store and present their data!</p>
</div>
</body>
</html>