Add downloader docs
This commit is contained in:
parent ecfb361892
commit 8007351af9

@ -0,0 +1,17 @@
---
title: Putting it All Together
---

Now that you know what GUGs, URL Classes, and Parsers are, you should have some ideas of how URL Classes could steer what happens when the downloader is faced with an URL to process. Should a URL be imported as a media file, or should it be parsed? If so, how?

You may have noticed in the Edit GUG ui that it lists whether a current URL Class matches the example URL output. If the GUG has no matching URL Class, it won't be listed in the main 'gallery selector' button's list--it'll be relegated to the 'non-functioning' page. Without a URL Class, the client doesn't know what to do with the output of that GUG. But if a URL Class does match, we can then hand the result over to a parser set at _network->downloader definitions->manage url class links_:

![](images/downloader_completion_url_links.png)

Here you simply set which parsers go with which URL Classes. If you have URL Classes that do not have a parser linked (which is the default for new URL Classes), you can use the 'try to fill in gaps...' button to automatically fill the gaps based on guesses using the parsers' example URLs. This is usually the best way to line things up unless you have multiple potential parsers for that URL Class, in which case it'll usually go by the parser name earliest in the alphabet.

If the URL Class has no parser set, or the parser is broken or otherwise invalid, the respective URL's file import object in the downloader or subscription is going to throw some kind of error when it runs. If you make and share some parsers, the first indication that something is wrong is going to be several users saying 'I got this error: (_copy notes_ from file import status window)'. You can then load the parser back up in _manage parsers_, try to figure out what changed, and roll out an update.

_manage url class links_ also shows 'api link review', which summarises which URL Classes api-link to others. In these cases, only the api URL gets a parser entry in the first 'parser links' window, since the original URL will never be fetched for parsing (in the downloader, it will always be converted to the API URL, and _that_ is fetched and parsed).

Once your GUG has a URL Class and your URL Classes have parsers linked, test your downloader! Note that Hydrus's URL drag-and-drop import uses URL Classes, so if you don't have the GUG and gallery stuff done but you have a Post URL set up, you can test that just by dragging a Post URL from your browser to the client, and it should be added to a new URL Downloader and just work. It feels pretty good once it does!

@ -0,0 +1,45 @@
---
title: Gallery URL Generators
---

Gallery URL Generators, or **GUGs**, are simple objects that take a simple string from the user, like:

* blue_eyes
* blue\_eyes blonde\_hair
* InCase
* elsa dandon_fuga
* wlop
* goth* order:id_asc

And convert them into an initialising Gallery URL, such as:

* [http://safebooru.org/index.php?page=post&s=list&tags=blue_eyes&pid=0](http://safebooru.org/index.php?page=post&s=list&tags=blue_eyes&pid=0)
* [https://konachan.com/post?page=1&tags=blonde\_hair+blue\_eyes](https://konachan.com/post?page=1&tags=blonde_hair+blue_eyes)
* [https://www.hentai-foundry.com/pictures/user/InCase/page/1](https://www.hentai-foundry.com/pictures/user/InCase/page/1)
* [http://rule34.paheal.net/post/list/elsa dandon_fuga/1](http://rule34.paheal.net/post/list/elsa dandon_fuga/1)
* [https://www.deviantart.com/wlop/favourites/?offset=0](https://www.deviantart.com/wlop/favourites/?offset=0)
* [https://danbooru.donmai.us/posts?page=1&tags=goth*+order:id_asc](https://danbooru.donmai.us/posts?page=1&tags=goth*+order:id_asc)

These are all the 'first page' of the results if you type or click-through to the same location on those sites. We are essentially emulating their own simple search-url generation inside the hydrus client.

## actually doing it { id="doing_it" }

Although it is usually a fairly simple process of just substituting the input tags into a string template, there are a couple of extra things to think about. Let's look at the ui under _network->downloader definitions->manage gugs_:

![](images/downloader_edit_gug_panel.png)

The client will split whatever the user enters by whitespace, so `blue_eyes blonde_hair` becomes two search terms, `[ 'blue_eyes', 'blonde_hair' ]`, which are then joined back together with the given 'search terms separator', to make `blue_eyes+blonde_hair`. Different sites use different separators, although ' ', '+', and ',' are most common. The new string is substituted into the `%tags%` in the template phrase, and the URL is made.

Note that you will not have to make %20 or %3A percent-encodings for reserved characters here--the network engine handles all that before the request is sent. For the most part, if you need to include, or a user enters, ':' or ' ' or 'おっぱい', you can just pass it along straight into the final URL without worrying.
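
To make the template substitution concrete, here is a minimal sketch of the same idea in plain Python. This is not hydrus's internal code, and the function name and the konachan-style template string are just for illustration:

```python
import urllib.parse

def build_gallery_url(user_input, template, separator):
    # 'blue_eyes blonde_hair' -> ['blue_eyes', 'blonde_hair']
    search_terms = user_input.split()
    # joined back together with the 'search terms separator' -> 'blue_eyes+blonde_hair'
    tags = separator.join(search_terms)
    # and substituted into the %tags% in the template phrase
    return template.replace('%tags%', tags)

print(build_gallery_url('blue_eyes blonde_hair', 'https://konachan.com/post?page=1&tags=%tags%', '+'))
# https://konachan.com/post?page=1&tags=blue_eyes+blonde_hair

# the network engine percent-encodes awkward characters separately; in plain Python,
# urllib.parse.quote does the equivalent job:
print(urllib.parse.quote('おっぱい'))
# %E3%81%8A%E3%81%A3%E3%81%B1%E3%81%84
```
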
This ui should update as you change it, so have a play and look at how the output example url changes to get a feel for things. Look at the other defaults to see different examples. Even if you break something, you can just cancel out.

The name of the GUG is important, as this is what will be listed when the user chooses what 'downloader' they want to use. Make sure it has a clear unambiguous name.

The initial search text is also important. Most downloaders just take some text tags, but if your GUG expects a numerical artist id (like pixiv artist search does), you should specify that explicitly to the user. You can even put in a brief '(two tag maximum)' type of instruction if you like.

Notice that the Deviant Art example above is actually the stream of wlop's _favourites_, not his works, and without an explicit notice of that, a user could easily mistake what they have selected. 'gelbooru' or 'newgrounds' are bad names, 'type here' is a bad initialising text.

## Nested GUGs { id="nested_gugs" }

Nested Gallery URL Generators are GUGs that hold other GUGs. Some searches actually use more than one stream (such as a Hentai Foundry artist lookup, where you might want to get both their regular works and their scraps, which are two separate galleries under the site), so NGUGs allow you to generate multiple initialising URLs per input. You can experiment with this ui if you like--it isn't too complicated--but you might want to hold off doing anything for real until you are comfortable with everything and know how producing multiple initialising URLs is going to work in the actual downloader.

@ -0,0 +1,50 @@
---
title: Making a Downloader
---

# Making a Downloader

!!! caution
    Creating custom downloaders is only for advanced users who understand HTML or JSON. Beware! If you are simply looking for how to add new downloaders, please head over [here](adding_new_downloaders.html).

## this system { id="intro" }

The first versions of hydrus's downloaders were all hardcoded and static--I wrote everything into the program itself and nothing was user-creatable or -fixable. After the maintenance burden of the entire messy system proved too large for me to keep up with and a semi-editable booru system proved successful, I decided to overhaul the entire thing to allow user creation and sharing of every component. It is designed to be very simple to the front-end user--they will typically handle a couple of png files and then select a new downloader from a list--but very flexible (and hence potentially complicated) on the back-end. These help pages describe the different components with the intention of making an HTML- or JSON-fluent user able to create and share a full new downloader on their own.

As always, this is all under active development. Your feedback on the system would be appreciated, and if something is confusing or you discover something in here that is out of date, please [let me know](contact.html).

## what is a downloader? { id="downloader" }

In hydrus, a downloader is one of:

**Gallery Downloader**
: This takes a string like 'blue_eyes' to produce a series of thumbnail gallery page URLs that can be parsed for image page URLs which can ultimately be parsed for file URLs and metadata like tags. Boorus fall into this category.

**URL Downloader**
: This does just the Gallery Downloader's back-end--instead of taking a string query, it takes the gallery or post URLs directly from the user, whether that is one from a drag-and-drop event or hundreds pasted from clipboard. For our purposes here, the URL Downloader is a subset of the Gallery Downloader.

**Watcher**
: This takes a URL that it will check in timed intervals, parsing it for new URLs that it then queues up to be downloaded. It typically stops checking after the 'file velocity' (such as '1 new file per day') drops below a certain level. It is mostly for watching imageboard threads.

**Simple Downloader**
: This takes a URL one-time and parses it for direct file URLs. This is a miscellaneous system for certain simple gallery types and some testing/'I just need the third <img> tag's _src_ on this one page' jobs.

The system currently supports HTML and JSON parsing. XML should be fine under the HTML parser--it isn't strict about checking types and all that.

## what does a downloader do? { id="pipeline" }

The Gallery Downloader is the most complicated downloader and uses all the possible components. In order for hydrus to convert our example 'blue_eyes' query into a bunch of files with tags, it needs to:

* Present some user interface named 'safebooru tag search' to the user that will convert their input of 'blue_eyes' into [https://safebooru.org/index.php?page=post&s=list&tags=blue_eyes&pid=0](https://safebooru.org/index.php?page=post&s=list&tags=blue_eyes&pid=0).
* Recognise [https://safebooru.org/index.php?page=post&s=list&tags=blue_eyes&pid=0](https://safebooru.org/index.php?page=post&s=list&tags=blue_eyes&pid=0) as a Safebooru Gallery URL.
* Convert the HTML of a Safebooru Gallery URL into a list of URLs like [https://safebooru.org/index.php?page=post&s=view&id=2437965](https://safebooru.org/index.php?page=post&s=view&id=2437965) and possibly a 'next page' URL (e.g. [https://safebooru.org/index.php?page=post&s=list&tags=blue_eyes&pid=40](https://safebooru.org/index.php?page=post&s=list&tags=blue_eyes&pid=40)) that points to the next page of thumbnails.
* Recognise the [https://safebooru.org/index.php?page=post&s=view&id=2437965](https://safebooru.org/index.php?page=post&s=view&id=2437965) URLs as Safebooru Post URLs.
* Convert the HTML of a Safebooru Post URL into a file URL like [https://safebooru.org//images/2329/b6e8c263d691d1c39a2eeba5e00709849d8f864d.jpg](https://safebooru.org//images/2329/b6e8c263d691d1c39a2eeba5e00709849d8f864d.jpg) and some tags like: 1girl, bangs, black gloves, blonde hair, blue eyes, braid, closed mouth, day, fingerless gloves, fingernails, gloves, grass, hair ornament, hairclip, hands clasped, creator:hankuri, interlocked fingers, long hair, long sleeves, outdoors, own hands together, parted bangs, pointy ears, character:princess zelda, smile, solo, series:the legend of zelda, underbust.

So we have three components:

* [**Gallery URL Generator (GUG):**](downloader_gugs.html) faces the user and converts text input into initialising Gallery URLs.
* [**URL Class:**](downloader_url_classes.html) identifies URLs and informs the client how to deal with them.
* [**Parser:**](downloader_parsers.html) converts data from URLs into hydrus-understandable metadata.
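
As a rough illustration of how the steps and the components line up, here is a toy sketch in Python. None of this is hydrus code--the parse functions just return this page's example values instead of actually fetching or parsing anything:

```python
def generate_gallery_url(query):                     # the GUG's job
    return f'https://safebooru.org/index.php?page=post&s=list&tags={query}&pid=0'

def parse_gallery_page(gallery_url):                 # a gallery page parser's job
    post_urls = ['https://safebooru.org/index.php?page=post&s=view&id=2437965']
    next_gallery_url = 'https://safebooru.org/index.php?page=post&s=list&tags=blue_eyes&pid=40'
    return post_urls, next_gallery_url

def parse_post_page(post_url):                       # a post page parser's job
    file_url = 'https://safebooru.org//images/2329/b6e8c263d691d1c39a2eeba5e00709849d8f864d.jpg'
    tags = ['blue eyes', 'creator:hankuri', 'character:princess zelda', 'series:the legend of zelda']
    return file_url, tags

gallery_url = generate_gallery_url('blue_eyes')      # URL Classes decide what each URL is along the way
post_urls, next_gallery_url = parse_gallery_page(gallery_url)
for post_url in post_urls:
    file_url, tags = parse_post_page(post_url)       # the file is then downloaded and tagged
    print(file_url, tags)
```
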
URL downloaders and watchers do not need the Gallery URL Generator, as their input _is_ an URL. And simple downloaders also have an explicit 'just download it and parse it with this simple rule' action, so they do not use URL Classes (or even full-fledged Page Parsers) either.

@ -0,0 +1,3 @@
# Login Manager

The system works, but this help was never done! Check the defaults for examples of how it works, sorry!

@ -0,0 +1,30 @@
# Parsers

In hydrus, a parser is an object that takes a single block of HTML or JSON data and returns many kinds of hydrus-level metadata.

Parsers are flexible and potentially quite complicated. You might like to open _network->manage parsers_ and explore the UI as you read these pages. Check out how the default parsers already in the client work, and if you want to write a new one, see if there is something already in there that is similar--it is usually easier to duplicate an existing parser and then alter it than to create a new one from scratch every time.

There are three main components in the parsing system (click to open each component's help page):

* [**Formulae:**](downloader_parsers_formulae.md) Take parsable data, search it in some manner, and return 0 to n strings.
* [**Content Parsers:**](downloader_parsers_content_parsers.md) Take parsable data, apply a formula to it to get some strings, and apply a single metadata 'type' and perhaps some additional modifiers.
* [**Page Parsers:**](downloader_parsers_page_parsers.md) Take parsable data, apply content parsers to it, and return all the metadata in an appropriate structure.

Once you are comfortable with these objects, you might like to check out these walkthroughs, which create full parsers from nothing:

* [e621 HTML gallery page](downloader_parsers_full_example_gallery_page.md)
* [Gelbooru HTML file page](downloader_parsers_full_example_file_page.md)
* [Artstation JSON file page API](downloader_parsers_full_example_api.md)

Once you are comfortable with parsers, and if you are feeling brave, check out how the default imageboard and pixiv parsers work. These are complicated and use more experimental areas of the code to get their job done. If you are trying to get a new imageboard parser going and can't figure out subsidiary page parsers, send me a mail or something and I'll try to help you out!

When you are making a parser, consider this checklist (you might want to copy/have your own version of this somewhere):

* Do you get good URLs with good priority? Do you ever accidentally get favourite/popular/advert results you didn't mean to?
* If you need a next gallery page URL, is it ever not available (and hence needs a URL Class fix)? Does it change for search tags with unicode or http-restricted characters?
* Do you get nice namespaced tags? Are any unwanted single characters like -/+/? getting through?
* Is the file hash available anywhere?
* Is a source/post time available?
* Is a source URL available? Is it good quality, or does it often just point to an artist's base twitter profile? If you pull it from text or a tooltip, is it clipped for longer URLs?

[Taken a break? Now let's put it all together ---->](downloader_completion.md)

@ -0,0 +1,70 @@
# Content Parsers

So, we can now generate some strings from a document. Content Parsers will let us apply a single metadata type to those strings to inform hydrus what they are.

![](images/edit_content_parser_panel_tags.png)

A content parser has a name, a content type, and a formula. This example fetches the character tags from a danbooru post.

The name is just decorative, but it is generally a good idea so you can find things again when you next revisit them.

The current content types are:

## urls { id="intro" }

This should be applied to relative ('/image/smile.jpg') and absolute ('https://mysite.com/content/image/smile.jpg') URLs. If the URL is relative, the client will generate an absolute URL based on the original URL used to fetch the data being parsed (i.e. it should all just work).
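
The relative-to-absolute resolution the client does here is the same thing `urljoin` does in plain Python. The page URL below is a made-up stand-in; the paths are this section's example strings:

```python
from urllib.parse import urljoin

# the URL the data was fetched from, plus the parsed relative URL
print(urljoin('https://mysite.com/post/12345', '/image/smile.jpg'))
# https://mysite.com/image/smile.jpg
```
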
You can set several types of URL:

* **url to download/pursue** means a Post URL or a File URL in our URL Classes system, like a booru post or an actual raw file like a jpg or webm.
* **url to associate** means an URL you want added to the list of 'known urls' for the file, but not one you want the client to actually download and parse. Use this to neatly add booru 'source' urls.
* **next gallery page** means the next Gallery URL on from the current one.

The 'file url quality precedence' allows the client to select the best of several possible URLs. Given multiple content parsers producing URLs at the same 'level' of parsing, it will select the one with the highest value. Consider these two posts:

* [https://danbooru.donmai.us/posts/3016415](https://danbooru.donmai.us/posts/3016415)
* [https://danbooru.donmai.us/posts/3040603](https://danbooru.donmai.us/posts/3040603)

The Garnet image fits into a regular page and so Danbooru embed the whole original file in the main media canvas. One easy way to find the full File URL in this case would be to select the "src" attribute of the "img" tag with id="image".

The Cirno one, however, is much larger and has been scaled down. The src of the main canvas tag points to a resized 'sample' link. The full link can be found at the 'view original' link up top, which is an "a" tag with id="image-resize-link".

The Garnet post does not have the 'view original' link, so to cover both situations we might want two content parsers--one fetching the 'canvas' "src" and the other finding the 'view original' "href". If we set the 'canvas' one with a quality of 40 and the 'view original' 60, then the parsing system would know to select the 60 when it was available but to fall back to the 40 if not.
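
The selection logic is nothing more exotic than 'take the highest value available'. A tiny sketch, with made-up placeholder URLs standing in for the two parsers' results:

```python
# (precedence, parsed url) pairs from the two hypothetical content parsers
candidates = [
    (40, 'https://example.com/data/sample/sample-cirno.jpg'),   # the 'canvas src' parser
    (60, 'https://example.com/data/cirno_original.png'),        # the 'view original' parser
]

# the highest precedence wins; if only the 40 were parsed, it would be used instead
best_precedence, best_url = max(candidates)
print(best_url)
```
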
As it happens, Danbooru (afaik, always) gives a link to the original file under the 'Size:' metadata to the left. This is the same 'best link' for both posts above, but it isn't so easy to identify. It is a quiet "a" tag without an "id" and it isn't always in the same location, but if you could pin it down reliably, it might be nice to circumvent the whole issue.

Sites can change suddenly, so it is nice to have a bit of redundancy here if it is easy.

## tags { id="tags" }

These are simple--they tell the client that the given strings are tags. You set the namespace here as well. I recommend you parse 'splashbrush' and set the namespace 'creator' here rather than trying to mess around with 'append prefix "creator:"' string conversions at the formula level--it is simpler up here and it lets hydrus handle any edge case logic for you.

Leave the namespace field blank for unnamespaced tags.

## file hash { id="file_hash" }

This says 'this is the hash for the file otherwise referenced in this parser'. So, if you have another content parser finding a File or Post URL, this lets the client know early that that destination happens to have a particular MD5, for instance. The client will look for that hash in its own database, and if it finds a match, it can predetermine if it already has the file (or has previously deleted it) without ever having to download it. When this happens, it will still add tags and associate the file with the URL for its 'known urls' just as if it _had_ downloaded it!

If you understand this concept, it is great to include. It saves time and bandwidth for everyone. Many site APIs include a hash for this exact reason--they want you to be able to skip a needless download just as much as you do.

![](images/edit_content_parser_panel_hash.png)

The usual suite of hash types are supported: MD5, SHA1, SHA256, and SHA512. An old version of this required some weird string decoding, but this is no longer true. Select 'hex' or 'base64' from the encoding type dropdown, and then just parse the 'e5af57a687f089894f5ecede50049458' or '5a9XpofwiYlPXs7eUASUWA==' text, and hydrus should handle the rest. It will present the parsed hash in hex.
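
Those two example strings are in fact the same sixteen bytes in the two encodings--the only difference is which decoding you tell hydrus to use. In plain Python terms:

```python
import base64

hex_text = 'e5af57a687f089894f5ecede50049458'
b64_text = '5a9XpofwiYlPXs7eUASUWA=='

# both decode to the same raw bytes
assert bytes.fromhex(hex_text) == base64.b64decode(b64_text)

# and the base64 form, re-presented as hex, is the familiar md5 string again
print(base64.b64decode(b64_text).hex())
# e5af57a687f089894f5ecede50049458
```
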
## timestamp { id="timestamp" }

This lets you say that a given number refers to a particular time for a file. At the moment, I only support 'source time', which represents a 'post' time for the file and is useful for thread and subscription check time calculations. It takes a Unix time integer, like 1520203484, which many APIs will provide.

If you are feeling very clever, you can decode a 'MM/DD/YYYY hh:mm:ss' style string to a Unix time integer using string converters, which use some hacky and semi-reliable python %d-style values as per [here](https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior). Look at the existing defaults for examples of this, and don't worry about being more accurate than 12/24 hours--trying to figure out timezone is a hell not worth attempting, and doesn't really matter in the long-run for subscriptions and thread watchers that might care.
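
If you want to sanity-check a format string outside the client, the same %-codes work in ordinary Python. A small sketch, treating the string as UTC in the spirit of the hand-waving above (the date string here is just this section's example 1520203484 written out):

```python
import calendar
from datetime import datetime

# an 'MM/DD/YYYY hh:mm:ss' style string, decoded with the same %-codes a string converter would use
dt = datetime.strptime('03/04/2018 22:44:44', '%m/%d/%Y %H:%M:%S')

# interpret it as UTC and convert to a Unix time integer
print(calendar.timegm(dt.timetuple()))
# 1520203484
```
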
## watcher page title { id="page_title" }

This lets the watcher know a good name/subject for its entries. The subject of a thread is obviously ideal here, but failing that you can try to fetch the first part of the first post's comment. It has precedence, like for URLs, so you can tell the parser which to prefer if you have multiple options. Just for neatness and ease of testing, you probably want to use a string converter here to cut it down to the first 64 characters or so.

## veto { id="veto" }

This is a special content type--it tells the next highest stage of parsing that this 'post' of parsing is invalid and to cancel and not return any data. For instance, if a thread post's file was deleted, the site might provide a default '404' stock File URL using the same markup structure as it would for normal images. You don't want to give the user the same 404 image ten times over (with fifteen kinds of tag and source time metadata attached), so you can add a little rule here that says "If the image link is 'https://somesite.com/404.png', raise a veto: File 404" or "If the page has 'No results found' in its main content div, raise a veto: No results found" or "If the expected download tag does not have 'download link' as its text, raise a veto: No Download Link found--possibly Ugoira?" and so on.

![](images/edit_content_parser_panel_veto.png)

They will associate their name with the veto being raised, so it is useful to give these a decent descriptive name so you can see what might be going right or wrong during testing. If it is an appropriate and serious enough veto, it may also rise up to the user level and will be useful if they need to report an error to you (like "After five pages of parsing, it gives 'veto: no next page link'").

@ -0,0 +1,160 @@
---
title: Parser Formulae
---

# Parser Formulae { id="formulae" }

Formulae are tools used by higher-level components of the parsing system. They take some data (typically some HTML or JSON) and return 0 to n strings. For our purposes, these strings will usually be tags, URLs, and timestamps. You will usually see them summarised with this panel:

![](images/edit_formula_panel.png)

The different types are currently [html](#html_formula), [json](#json_formula), [compound](#compound_formula), and [context variable](#context_variable_formula).

## html { id="html_formula" }

This takes a full HTML document or a sample of HTML--and any regular sort of XML _should_ also work. It starts at the root node and searches for lower nodes using one or more ordered rules based on tag name and attributes, and then returns string data from those final nodes.

For instance, if you have this:

```html
<html>
  <body>
    <div class="media_taglist">
      <span class="generaltag"><a href="(search page)">blonde hair</a> (3456)</span>
      <span class="generaltag"><a href="(search page)">blue eyes</a> (4567)</span>
      <span class="generaltag"><a href="(search page)">bodysuit</a> (5678)</span>
      <span class="charactertag"><a href="(search page)">samus aran</a> (2345)</span>
      <span class="artisttag"><a href="(search page)">splashbrush</a> (123)</span>
    </div>
    <div class="content">
      <span class="media">(a whole bunch of content that doesn't have tags in)</span>
    </div>
  </body>
</html>
```

_(Most boorus have a taglist like this on their file pages.)_

To find the artist, "splashbrush", here, you could (a rough Python equivalent follows the list):

* search beneath the root tag (`#!html <html>`) for the `#!html <div>` tag with attribute `class="media_taglist"`
* search beneath that `#!html <div>` for `#!html <span>` tags with attribute `class="artisttag"`
* search beneath those `#!html <span>` tags for `#!html <a>` tags
* and then get the string content of those `#!html <a>` tags
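
hydrus's formula rules are their own system, but the selection logic is the same 'narrow down, then fetch' idea you could express with an everyday HTML library. A hedged sketch using BeautifulSoup (which hydrus does not use internally) against a trimmed copy of the snippet above:

```python
from bs4 import BeautifulSoup

html = '''
<div class="media_taglist">
  <span class="generaltag"><a href="(search page)">blonde hair</a> (3456)</span>
  <span class="artisttag"><a href="(search page)">splashbrush</a> (123)</span>
</div>
<div class="content"><span class="media">(no tags in here)</span></div>
'''

soup = BeautifulSoup(html, 'html.parser')

taglist = soup.find('div', class_='media_taglist')            # rule 1: the right <div>
artist_spans = taglist.find_all('span', class_='artisttag')   # rule 2: the artist <span>s beneath it
artist_links = [span.find_all('a') for span in artist_spans]  # rule 3: their <a> tags

# and finally the string content of those <a> tags
print([a.get_text() for links in artist_links for a in links])
# ['splashbrush']
```
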
Changing the `artisttag` to `charactertag` or `generaltag` would give you `samus aran` or `blonde hair`, `blue eyes`, `bodysuit` respectively.

You might be tempted to just go straight for any `#!html <span>` with `class="artisttag"`, but many sites use the same class to render a sidebar of favourite/popular tags or some other sponsored content, so it is generally best to try to narrow down to a larger `#!html <div>` container so you don't get anything you don't mean.

### the ui

Clicking 'edit formula' on an HTML formula gives you this:

![](images/edit_html_formula_panel.png)

You edit on the left and test on the right.

### finding the right html tags

When you add or edit one of the specific tag search rules, you get this:

![](images/edit_html_tag_rule_panel.png)

You can set multiple key/value attribute search conditions, but you'll typically be searching for 'class' or 'id' here, if anything.

Note that you can set it to fetch only the xth instance of a found tag, which can be useful in situations like this:

```html
<span class="generaltag">
  <a href="(add tag)">+</a>
  <a href="(remove tag)">-</a>
  <a href="(search page)">blonde hair</a> (3456)
</span>
```

Without any more attributes, there isn't a great way to distinguish the `#!html <a>` with "blonde hair" from the other two--so just set `get the 3rd <a> tag` and you are good.

Most of the time, you'll be searching descendants (i.e. walking down the tree), but sometimes you might have this:

```html
<span>
  <a href="(link to post url)">
    <img class="thumb" src="(thumbnail image)" />
  </a>
</span>
```

There isn't a great way to find the `#!html <span>` or the `#!html <a>` when looking from above here, as they are lacking a class or id, but you can find the `#!html <img>` ok, so you can find those and then add a rule that, instead of searching descendants, 'walks back up ancestors' like this:

![](images/edit_html_formula_panel_descendants_ancestors.png)

You can solve some tricky problems this way!

You can also set a String Match, which is the same panel as you saw with URL Classes. It tests its best guess at the tag's 'string' value, so you can find a tag with 'Original Image' as its text, or one that, via a regex, starts with 'Posted on: '. Have a play with it and you'll figure it out.

### content to fetch

Once you have narrowed down the right nodes you want, you can decide what text to fetch. Given a node of:

```html
<a href="(URL A)" class="thumb_title">Forest Glade</a>
```

Returning the `href` attribute would return the string "(URL A)", returning the string content would give "Forest Glade", and returning the full html would give `#!html <a href="(URL A)" class="thumb_title">Forest Glade</a>`. This last choice is useful in complicated situations where you want a second, separated layer of parsing, which we will get to later.

### string match and conversion

You can set a final String Match to filter the parsed results (e.g. "only allow strings that only contain numbers" or "only allow full URLs as based on (complicated regex)") and String Converter to edit it (e.g. "remove the first three characters of whatever you find" or "decode from base64").

You won't use these much, but they can sometimes get you out of a complicated situation.

### testing

The testing panel on the right is important and worth using. Copy the html from the source you want to parse and then hit the paste buttons to set that as the data to test with.

## json { id="json_formula" }

This takes some JSON and does a similar style of search:

![](images/edit_json_formula_panel.png)

It is a bit simpler than HTML--if the current node is a list (called an 'Array' in JSON), you can fetch every item or the xth item, and if it is a dictionary (called an 'Object' in JSON), you can fetch a particular entry by name. Since you can't jump down several layers with attribute lookups or tag names like with HTML, you have to go down every layer one at a time. In any case, if you have something like this:

[![](images/json_thread_example.png)](images/json_thread_example.png)

!!! note
    It is a great idea to check the html or json you are trying to parse with your browser. Some web browsers have excellent developer tools that let you walk through the nodes of the document you are trying to parse in a prettier way than I would ever have time to put together. This image is one of the views Firefox provides if you simply enter a JSON URL.

Searching for "posts"->1st list item->"sub" on this data will give you "Nobody like kino here.".

Searching for "posts"->all list items->"tim" will give you the three SHA256 file hashes (since the third post has no file attached and so no 'tim' entry, the parser skips over it without complaint).

Searching for "posts"->1st list item->"com" will give you the OP's comment, <span class="dealwithit">\~AS RAW UNPARSED HTML\~</span>.
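
If the arrow notation is unclear, here is the same sort of lookup written out in plain Python against a cut-down, made-up thread object. Only the 'sub', 'com', and 'tim' key names and the example subject line come from this page; the other values are placeholders:

```python
import json

thread = json.loads('''
{
  "posts": [
    {"sub": "Nobody like kino here.", "com": "(op comment html)", "tim": "(first file hash)"},
    {"com": "(second post html)", "tim": "(second file hash)"},
    {"com": "(a post with no file attached, so no tim entry)"}
  ]
}
''')

# "posts" -> 1st list item -> "sub"
print(thread['posts'][0]['sub'])

# "posts" -> all list items -> "tim"; items without a 'tim' are skipped without complaint
print([post['tim'] for post in thread['posts'] if 'tim' in post])
```
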
The default is to fetch the final nodes' 'data content', which means coercing simple variables into strings. If the current node is a list or dict, no string is returned.

But if you like, you can return the json beneath the current node (which, like HTML, includes the current node). This again will come in useful later.

## compound { id="compound_formula" }

If you want to create a string from multiple parsed strings--for instance by appending the 'tim' and the 'ext' in our json example together--you can use a Compound formula. This fetches multiple lists of strings and tries to place them into a single string using `\1` regex substitution syntax:

![](images/edit_compound_formula_panel.png)

This is a complicated example taken from one of my thread parsers. I have to take a modified version of the original thread URL (the first rule, so `\1`) and then append the filename (`\2`) and its extension (`\3`) on the end to get the final file URL of a post. You can mix in more characters in the substitution phrase, like `\1.jpg` or even have multiple instances (`https://\2.muhsite.com/\2/\1`), if that is appropriate.

This is where the magic happens, sometimes, so keep it in mind if you need to do something cleverer than the data you have seems to provide.

## context variable { id="context_variable_formula" }

This is a basic hacky answer to a particular problem. It is a simple key:value dictionary that at the moment only stores one variable, 'url', which contains the original URL used to fetch the data being parsed.

If a different URL Class links to this parser via an API URL, this 'url' variable will always be the API URL (i.e. it literally is the URL used to fetch the data), not any thread/whatever URL the user entered.

![](images/edit_context_variable_formula_panel.png)

Hit the 'edit example parsing context' to change the URL used for testing.

I have used this several times to stitch together file URLs when I am pulling data from APIs, like in the compound formula example above. In this case, the starting URL is `https://a.4cdn.org/tg/thread/57806016.json`, from which I extract the board name, "tg", using the string converter, and then add in 4chan's CDN domain to make the appropriate base file URL (`https://i.4cdn.org/tg/`) for the given thread. I only have to jump through this hoop in 4chan's case because they explicitly store file URLs by board name. 8chan on the other hand, for instance, has a static `https://media.8ch.net/file_store/` for all files, so it is a little easier (I think I just do a single 'prepend' string transformation somewhere).
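
In plain Python, the context-variable-plus-string-converter dance for this 4chan case boils down to a regex substitution on the fetched URL. This is only a sketch of the same transformation, not the client's code:

```python
import re

# the 'url' context variable: the API URL that was actually fetched
context_url = 'https://a.4cdn.org/tg/thread/57806016.json'

# pull the board name out of the path, as the string converter does
board = re.sub(r'^https://a\.4cdn\.org/(\w+)/.*$', r'\1', context_url)

# and bolt on the CDN domain to get the base file URL for this thread
print(f'https://i.4cdn.org/{board}/')
# https://i.4cdn.org/tg/
```
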
If you want to make some parsers, you will have to get familiar with how different sites store and present their data!

@ -0,0 +1,45 @@
# api example

Some sites offer API calls for their pages. Depending on complexity and quality of content, using these APIs may or may not be a good idea. Artstation has a good one--let's first review our URL Classes:

![](images/downloader_api_example_url_class_1.png) ![](images/downloader_api_example_url_class_2.png)

We convert the original Post URL, [https://www.artstation.com/artwork/mQLe1](https://www.artstation.com/artwork/mQLe1), to [https://www.artstation.com/projects/mQLe1.json](https://www.artstation.com/projects/mQLe1.json). Note that Artstation Post URLs can produce multiple files, and that the API url should not be associated with those final files.

So, when the client encounters an 'artstation file page' URL, it will generate the equivalent 'artstation file page json api' URL and use that for downloading and parsing. If you would like to review your API links, check out _network->downloader definitions->manage url class links->api links_. Using Example URLs, it will figure out which URL Classes link to others and ensure you are mapping parsers only to the final link in the chain--there should be several already in there by default.

Now let's look at the JSON. Loading clean JSON in a browser should present you with a nicer view:

![](images/downloader_api_example_json.png)

I have highlighted the data we want, which is:

* The File URLs.
* Creator, title, medium, and unnamespaced tags.
* Source time.

JSON is a dream to parse, and I will assume you are comfortable with Content Parsers from the previous examples, so I'll simply paste the different formulae one after another:

![](images/downloader_api_example_file_urls.png)

Each image is stored under a separate numbered 'assets' list item. This one has just two, but some Artstation pages have dozens of images. The only unusual part here is I also put a String Match of `^(?!.*assets\/covers).*$`, which filters out 'cover' images (such as on [here](https://www.artstation.com/projects/3KyXA.json)), which make for nice portfolio thumbs on the site but are not interesting to us.

![](images/downloader_api_example_creator.png)

This fetches the 'creator' tag. Artstation's API is great because it includes profile data in content requests. There's the creator's presentation name, username, profile link, avatar URLs, all that inside a regular request about this particular work. When that information is missing (like in yiff.party), it may make the API useless to you.

![](images/downloader_api_example_title.png)

![](images/downloader_api_example_medium.png)

![](images/downloader_api_example_unnamespaced.png)

These are all simple. You can take or leave the title and medium tags--some people like them, some don't. This example has no unnamespaced tags, but [this one](https://www.artstation.com/projects/XRm50.json) does. Creator-entered tags are sometimes not worth parsing (on tumblr, for instance, you often get run-on tags like #imbored #whatisevengoingon that are irrelevant to the work), but Artstation users are all professionals trying to get their work noticed, so the tags are usually pretty good.

![](images/downloader_api_example_source_time.png)

This again uses python's datetime to decode the date, which Artstation presents with millisecond accuracy, ha ha. I use a `(.+:..)\..*->\1` regex (i.e. "get everything before the period") to strip off the timezone and milliseconds and then decode as normal.

## summary { id="summary" }

APIs that are stable and free to access (e.g. do not require OAuth or other complicated login headers) can make parsing fantastic. They save bandwidth and CPU time, and they are typically easier to work with than HTML. Unfortunately, the boorus that do provide APIs often list their tags without namespace information, so I recommend you double-check you can get what you want before you get too deep into it. Some APIs also offer incomplete data, such as relative URLs (relative to the original URL!), which can be a pain to figure out in our system.

@ -0,0 +1,99 @@
# file page example

Let's look at this page: [https://gelbooru.com/index.php?page=post&s=view&id=3837615](https://gelbooru.com/index.php?page=post&s=view&id=3837615).

What sorts of data are we interested in here?

* The image URL.
* The different tags and their namespaces.
* The secret md5 hash buried in the HTML.
* The post time.
* The Deviant Art source URL.

## the file url { id="the_file_url" }

A tempting strategy for pulling the file URL is to just fetch the src of the embedded `#!html <img>` tag, but:

* If the booru also supports videos or flash, you'll have to write separate and likely more complicated rules for `#!html <video>` and `#!html <embed>` tags.
* If the booru shows 'sample' sizes for large images--as this one does!--pulling the src of the image you see won't get the full-size original for large images.

If you have an account with the site you are parsing and have clicked the appropriate 'Always view original' setting, you may not see these sorts of sample-size banners! I recommend you log out of/go incognito for sites you are inspecting for hydrus parsing (unless a log-in is required to see content, in which case the hydrus user will have to set up hydrus-side login to actually use the parser), or you can easily miss NSFW gates and other logged-out hurdles.

When trying to pin down the right link, if there are no good alternatives, you often have to write several File URL rules with different precedence, saying 'get the "Click Here to See Full Size" link at 75' and 'get the embed's "src" at 25' and so on to make sure you cover different situations, but as it happens Gelbooru always posts the actual File URL at:

* `#!html <meta property="og:image" content="https://gelbooru.com/images/38/6e/386e12e33726425dbd637e134c4c09b5.jpeg" />` under the `#!html <head>`
* `#!html <a href="https://simg3.gelbooru.com//images/38/6e/386e12e33726425dbd637e134c4c09b5.jpeg" target="_blank" style="font-weight: bold;">Original image</a>` which can be found by putting a String Match in the html formula.

`#!html <meta>` with `property="og:image"` is easy to search for (and they use the same tag for video links as well!). For the Original Image, you can use a String Match like so:

[![](images/downloader_post_example_clean.png)](images/downloader_post_example_clean.png)

Gelbooru uses "Original Image" even when they link to webm, which is helpful, but like "og:image", it could be changed to 'video' in future.

I think I wrote my gelbooru parser before I added String Matches to individual HTML formulae tag rules, so I went with this, which is a bit more cheeky:

![](images/downloader_post_example_cheeky.png)

But it works. Sometimes, just regexing for links that fit the site's CDN is a good bet for finding difficult stuff.

## tags { id="tags" }

Most boorus have a taglist on the left that has a nice id or class you can pull, and then each namespace gets its own class for CSS-colouring:

![](images/downloader_post_example_meta_tag.png)

Make sure you browse around the booru for a bit, so you can find all the different classes they use. character/artist/copyright are common, but some sneak in the odd meta/species/rating.

Skipping ?/-/+ characters can be a pain if you are lacking a nice tag-text class, in which case you can add a regex String Match to the HTML formula (as I do here, since Gelb offers '?' links for tag definitions) like \[^\\?\\-+\\s\], which means "the text includes something other than just '?' or '-' or '+' or whitespace".
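
That filter is easy to test outside the client--here is the same character class in ordinary Python, checked against a few strings you might see in a taglist:

```python
import re

# "the text includes something other than just '?' or '-' or '+' or whitespace"
has_real_text = re.compile(r'[^?\-+\s]')

for text in ['?', '-', '+', 'blonde hair']:
    print(text, bool(has_real_text.search(text)))
# only 'blonde hair' passes
```
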
## md5 hash { id="md5_hash" }

If you look at the Gelbooru File URL, [**https://gelbooru.com/images/38/6e/386e12e33726425dbd637e134c4c09b5.jpeg**](https://gelbooru.com/images/38/6e/386e12e33726425dbd637e134c4c09b5.jpeg), you may notice the filename is all hexadecimal. It looks like they store their files under a two-deep folder structure, using the first four characters--386e here--as the key. It sure looks like '386e12e33726425dbd637e134c4c09b5' is not random ephemeral garbage!

In fact, Gelbooru use the MD5 of the file as the filename. Many storage systems do something like this (hydrus uses SHA256!), so if they don't offer a `#!html <meta>` tag that explicitly states the md5 or sha1 or whatever, you can sometimes infer it from one of the file links. This screenshot is from the more recent version of hydrus, which has the more powerful 'string processing' system for string transformations. It has an intimidating number of nested dialogs, but we can stay simple for now, with only the one regex substitution step inside a string 'converter':

![](images/downloader_post_example_md5.png)

Here we are using the same property="og:image" rule to fetch the File URL, and then we are regexing the hex hash with `.*(\[0-9a-f\]{32}).*` (MD5s are 32 hex characters). We select 'hex' as the encoding type. Hashes require a tiny bit more data handling behind the scenes, but in the Content Parser test page it presents the hash again neatly in English--"md5 hash: 386e12e33726425dbd637e134c4c09b5"--meaning everything parsed correctly. It presents the hash in hex even if you select the encoding type as base64.

If you think you have found a hash string, you should obviously test your theory! The site might not be using the actual MD5 of file bytes, as hydrus does, but instead some proprietary scheme. Download the file and run it through a program like HxD (or hydrus!) to figure out its hashes, and then search the View Source for those hex strings--you might be surprised!
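
Testing the theory only takes a couple of lines of Python--pull the candidate hash out of the File URL, then compare it against the md5 of the file bytes you downloaded yourself:

```python
import re

file_url = 'https://gelbooru.com/images/38/6e/386e12e33726425dbd637e134c4c09b5.jpeg'
candidate = re.search(r'[0-9a-f]{32}', file_url).group(0)
print(candidate)   # 386e12e33726425dbd637e134c4c09b5

# assuming you have saved the file locally as 'test.jpeg' (a hypothetical filename):
# import hashlib
# with open('test.jpeg', 'rb') as f:
#     print(hashlib.md5(f.read()).hexdigest() == candidate)
```
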
Finding the hash is hugely beneficial for a parser--it can let hydrus skip downloading a file it already has (or has previously deleted) without ever having to fetch it!

## source time { id="source_time" }

Post/source time lets subscriptions and watchers make more accurate guesses at current file velocity. It is neat to have if you can find it, but:

<b class="dealwithit">FUCK ALL TIMEZONES FOREVER</b>

Gelbooru offers--

```html
<li>Posted: 2017-08-18 19:59:44<br /> by <a href="index.php?page=account&s=profile&uname=jayage5ds">jayage5ds</a></li>
```

--so let's see how we can turn that into a Unix timestamp:

![](images/downloader_post_example_source_time.png)

I find the `#!html <li>` that starts "Posted: " and then decode the date according to the hackery-dackery-doo format from [here](https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior). `%c` and `%z` are unreliable, and attempting timezone adjustments is overall a supervoid that will kill your time for no real benefit--subs and watchers work fine with 12-hour imprecision, so if you have a +0300 or EST in your string, just cut those characters off with another String Transformation. As long as you are getting about the right day, you are fine.

## source url { id="source_url" }

Source URLs are nice to have if they are high quality. Some boorus only ever offer artist profiles, like `https://twitter.com/artistname`, whereas we want singular Post URLs that point to other places that host this work. For Gelbooru, you could fetch the Source URL as we did source time, searching for "Source: ", but they also offer it more conveniently in an edit form:

```html
<input type="text" name="source" size="40" id="source" value="https://www.deviantart.com/art/Lara-Croft-Artifact-Dive-699335378" />
```

This is a bit of a fragile location to parse from--Gelb could change or remove this form at any time, whereas the "Posted: " `#!html <li>` is probably firmer, but I expect I wrote it before I had String Matches in. It works for now, which in this game is often Good Enough™.

Also--be careful pulling from text or tooltips rather than an href-like attribute, as whatever is presented to the user may be clipped for longer URLs. Make sure you try your rules on a couple of different pages to make sure you aren't pulling "https://www.deviantart.com/art/Lara..." by accident anywhere!

## summary { id="summary" }

Phew--all that for a bit of Lara Croft! Thankfully, most sites use similar schemes. Once you are familiar with the basic idea, the only real work is to duplicate an existing parser and edit for differences. Our final parser looks like this:

![](images/downloader_post_example_final.png)

This is overall a decent parser. Some parts of it may fail when Gelbooru update to their next version, but that can be true of even very good parsers with multiple redundancy. For now, hydrus can use this to quickly and efficiently pull content from anything running Gelbooru 0.2.5, and the effort spent now can save millions of combined _right-click->save as_ and manual tag copies in future. If you make something like this and share it about, you'll be doing a good service for those who could never figure it out.

@ -0,0 +1,42 @@
# gallery page example

!!! caution
    These guides should _roughly_ follow what comes with the client by default! You might like to have the actual UI open in front of you so you can play around with the rules and try different test parses yourself.

Let's look at this page: [https://e621.net/post/index/1/rating:safe pokemon](https://e621.net/post/index/1/rating:safe pokemon)

We've got 75 thumbnails and a bunch of page URLs at the bottom.

## first, the main page { id="main_page" }

This is easy. It gets a good name and some example URLs. e621 has some different ways of writing out their queries (and as they use some tags with '/', like 'male/female', this can cause character encoding issues depending on whether the tag is in the path or query!), but we'll put that off for now--we just want to parse some stuff.

![](images/downloader_gallery_example_main.png)

## thumbnail links { id="thumbnail_urls" }

Most browsers have some good developer tools to let you Inspect Element and get a better view of the HTML DOM. Be warned that this information isn't always the same as View Source (which is what hydrus will get when it downloads the initial HTML document), as some sites load results dynamically with javascript and maybe an internal JSON API call (when sites move to systems that load more thumbs as you scroll down, it makes our job more difficult--in these cases, you'll need to chase down the embedded JSON or figure out what API calls their JS is making--the browser's developer tools can help you here again). Thankfully, e621 is (and most boorus are) fairly static and simple:

![](images/downloader_gallery_example_thumb_html.png)

Every thumb on e621 is a `#!html <span>` with class="thumb" wrapping an `#!html <a>` and an `#!html <img>`. This is a common pattern, and easy to parse:

![](images/downloader_gallery_example_thumb_parsing.png)

There's no tricky String Matches or String Converters needed--we are just fetching hrefs. Note that the links get relative-matched to example.com for now--I'll probably fix this to apply to one of the example URLs, but rest assured that IRL the parser will 'join' its url up with the appropriate Gallery URL used to fetch the data. Sometimes, you might want to add a rule for `search descendants for the first <div> tag with id=content` to make sure you are only grabbing thumbs from the main box, whether that is a `#!html <div>` or a `#!html <span>`, and whether it has `id="content"` or `class="mainBox"`, but unless you know that booru likes to embed "popular" or "favourite" 'thumbs' up top that will be accidentally caught by a `#!html <span>` with `class="thumb"`, I recommend you not make your rules overly specific--all it takes is for their dev to change the name of their content box, and your whole parser breaks. I've ditched the `#!html <span>` requirement in the rule here for exactly that reason--`class="thumb"` is necessary and sufficient.

Remember that the parsing system allows you to go up ancestors as well as down descendants. If your thumb-box has multiple links--like to see the artist's profile or 'set as favourite'--you can try searching for the `#!html <span>`s, then down to the `#!html <img>`, and then _up_ to the nearest `#!html <a>`. In English, this is saying, "Find me all the image link URLs in the thumb boxes."

## next gallery page link { id="next_gallery_url" }

Most boorus have 'next' or '>>' at the bottom, which can be simple enough, but many have a neat `#!html <link href="/post/index/2/rating:safe%20pokemon" rel="next" />` in the `#!html <head>`. The `#!html <head>` solution is easier, if available, but my default e621 parser happens to pursue the 'paginator':

![](images/downloader_gallery_example_paginator_parsing.png)

As it happens, e621 also apply the `rel="next"` attribute to their "Next >>" links, which makes it all that easier for us to find. Sometimes there is no "next" id or class, and you'll want to add a String Match to your html formula to test for a string value of '>>' or whatever it is. A good trick is to View Source and then search for the critical `/post/index/2/` phrase you are looking for--you might find what you want in a `#!html <link>` tag you didn't expect or even buried in a hidden 'share to tumblr' button. `#!html <form>`s for reporting or commenting on content are another good place to find content ids.

Note that this finds two URLs. e621 apply the `rel="next"` to both the "2" link and the "Next >>" one. The download engine merges the parser's dupes, so don't worry if you end up parsing both the 'top' and 'bottom' next page links, or if you use multiple rules to parse the same data in different ways.

## summary { id="summary" }

With those two rules, we are done. Gallery parsers are nice and simple.

@ -0,0 +1,59 @@
# Page Parsers

We can now produce individual rows of rich metadata. To arrange them all into a useful structure, we will use Page Parsers.

The Page Parser is the top level parsing object. It takes a single document and produces a list--or a list of lists--of metadata. Here's the main UI:

![](images/edit_page_parser_panel_e621_main.png)

Notice that the edit panel has three sub-pages.

## main { id="main" }

* **Name**: Like for content parsers, I recommend you add good names for your parsers.
* **Pre-parsing conversion**: If your API source encodes or wraps the data you want to parse, you can do some string transformations here. You won't need to use this very often, but if your source gives the JSON wrapped in javascript (like the old tumblr API), it can be invaluable.
* **Example URLs**: Here you should add a list of example URLs the parser works for. This lets the client automatically link this parser up with URL classes for you and any users you share the parser with.

## content parsers { id="content_parsers" }

This page is just a simple list:

![](images/edit_page_parser_panel_e621_content_parsers.png)

Each content parser here will be applied to the document and returned in this page parser's results list. Like most boorus, e621's File Pages only ever present one file, and they have simple markup, so the solution here was simple. The full contents of that test window are:

```
*** 1 RESULTS BEGIN ***

tag: character:krystal
tag: creator:s mino930
file url: https://static1.e621.net/data/fc/b6/fcb673ed89241a7b8d87a5dcb3a08af7.jpg
tag: anthro
tag: black nose
tag: blue fur
tag: blue hair
tag: clothing
tag: female
tag: fur
tag: green eyes
tag: hair
tag: hair ornament
tag: jewelry
tag: short hair
tag: solo
tag: video games
tag: white fur
tag: series:nintendo
tag: series:star fox
tag: species:canine
tag: species:fox
tag: species:mammal

*** RESULTS END ***
```

When the client sees this in a downloader context, it will know where to download the file and which tags to associate with it based on what the user has chosen in their 'tag import options'.

## subsidiary page parsers { id="subsidiary_page_parsers" }

Here be dragons. This was an attempt to make parsing more helpful in certain API situations, but it ended up ugly. I do not recommend you use it, as I will likely scratch the whole thing and replace it with something better one day. It basically splits the page up into pieces that can then be parsed by nested page parsers as separate objects, but the UI and workflow is hell. Afaik, the imageboard API parsers use it, but little/nothing else. If you are really interested, check out how those work and maybe duplicate to figure out your own imageboard parser and/or send me your thoughts on how to separate File URL/timestamp combos better.

@ -0,0 +1,19 @@
---
title: Sharing Downloaders
---

# Sharing Downloaders

If you are working with users who also understand the downloader system, you can swap your GUGs, URL Classes, and Parsers separately using the import/export buttons on the relevant dialogs, which work in pngs and clipboard text.

But if you want to share conveniently, and with users who are not familiar with the different downloader objects, you can package everything into a single easy-import png as per [here](adding_new_downloaders.md).

The dialog to use is _network->downloader definitions->export downloaders_:

![](images/downloader_export_panel.png)

It isn't difficult. Essentially, you want to bundle enough objects to make one or more 'working' GUGs at the end. I recommend you start by just hitting 'add gug', which--using Example URLs--will attempt to figure out everything you need by itself.

This all works on Example URLs and some domain guesswork, so make sure your url classes are good and the parsers have correct Example URLs as well. If they don't, they won't all link up neatly for the end user. If part of your downloader is on a different domain to the GUGs and Gallery URLs, then you'll have to add them manually. Just start with 'add gug' and see if it looks like enough.

Once you have the necessary and sufficient objects added, you can export to png. You'll get a similar 'does this look right?' summary as what the end-user will see, just to check you have everything in order and the domains all correct. If that is good, then make sure to give the png a sensible filename and embellish the title and description if you need to. You can then send/post that png wherever, and any regular user will be able to use your work.
@ -0,0 +1,203 @@
|
|||
# URL Classes
|
||||
|
||||
The fundamental connective tissue of the downloader system is the 'URL Class'. This object identifies and normalises URLs and links them to other components. Whenever the client handles a URL, it tries to match it to a URL Class to figure out what to do.
|
||||
|
||||
## the types of url { id="url_types" }
|
||||
|
||||
For hydrus, an URL is useful if it is one of:
|
||||
|
||||
File URL
|
||||
:
|
||||
This returns the full, raw media file with no HTML wrapper. They typically end in a filename like [http://safebooru.org//images/2333/cab1516a7eecf13c462615120ecf781116265f17.jpg](http://safebooru.org//images/2333/cab1516a7eecf13c462615120ecf781116265f17.jpg), but sometimes they have a more complicated fetch command ending like 'file.php?id=123456' or '/post/content/123456'.
|
||||
|
||||
These URLs are remembered for the file in the 'known urls' list, so if the client happens to encounter the same URL in future, it can determine whether it can skip the download because the file is already in the database or has previously been deleted.
|
||||
|
||||
It is not important that File URLs be matched by a URL Class. File URL is considered the 'default', so if the client finds no match, it will assume the URL is a file and try to download and import the result. You might still want to specify them explicitly if you want them presented in the media viewer, or if you discover File URLs are being confused for Post URLs or something.
|
||||
|
||||
Post URL
|
||||
|
||||
:
|
||||
These typically return some HTML that contains a File URL and metadata such as tags and post time. They sometimes present multiple sizes (like 'sample' vs 'full size') of the file or even different formats (like 'ugoira' vs 'webm'). The Post URL for the file above, [http://safebooru.org/index.php?page=post&s=view&id=2429668](http://safebooru.org/index.php?page=post&s=view&id=2429668), has this 'sample' presentation. Finding the best File URL in these cases can be tricky!
|
||||
|
||||
This URL is also saved to 'known urls' and will usually be similarly skipped if it has previously been downloaded. It will also appear in the media viewer as a clickable link.
|
||||
|
||||
Gallery URL
|
||||
:
|
||||
This presents a list of Post URLs or File URLs. They often also present a 'next page' URL. It could be a page like [http://safebooru.org/index.php?page=post&s=list&tags=yorha\_no.\_2\_type\_b&pid=0](http://safebooru.org/index.php?page=post&s=list&tags=yorha_no._2_type_b&pid=0) or an API URL like [http://safebooru.org/index.php?page=dapi&s=post&tags=yorha\_no.\_2\_type\_b&q=index&pid=0](http://safebooru.org/index.php?page=dapi&s=post&tags=yorha_no._2_type_b&q=index&pid=0).
|
||||
|
||||
Watchable URL
|
||||
:
|
||||
This is the same as a Gallery URL but represents an ephemeral page that receives new files much faster than a gallery but will soon 'die' and be deleted. For our purposes, this typically means imageboard threads.
|
||||
|
||||
## the components of a url { id="url_components" }
|
||||
|
||||
As far as we are concerned, a URL string has four parts:
|
||||
|
||||
* **Scheme:** `http` or `https`
|
||||
* **Location/Domain:** `safebooru.org` or `i.4cdn.org` or `cdn002.somebooru.net`
|
||||
* **Path Components:** `index.php` or `tesla/res/7518.json` or `pictures/user/daruak/page/2` or `art/Commission-animation-Elsa-and-Anna-541820782`
|
||||
* **Query Parameters:** `page=post&s=list&tags=yorha_no._2_type_b&pid=40` or `page=post&s=view&id=2429668`
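If it helps to see those pieces pulled apart programmatically, here is a quick Python sketch using the standard library (hydrus does its own handling internally; this is just to show where each part lives in the Gallery URL above):

```
from urllib.parse import urlparse, parse_qs

url = 'http://safebooru.org/index.php?page=post&s=list&tags=yorha_no._2_type_b&pid=40'
parts = urlparse(url)

print(parts.scheme)           # http
print(parts.netloc)           # safebooru.org
print(parts.path)             # /index.php
print(parse_qs(parts.query))  # {'page': ['post'], 's': ['list'], 'tags': ['yorha_no._2_type_b'], 'pid': ['40']}
```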
|
||||
|
||||
So, let's look at the 'edit url class' panel, which is found under _network->manage url classes_:
|
||||
|
||||
![](images/downloader_edit_url_class_panel.png)
|
||||
|
||||
A TBIB File Page like [https://tbib.org/index.php?page=post&s=view&id=6391256](https://tbib.org/index.php?page=post&s=view&id=6391256) is a Post URL. Let's look at the metadata first:
|
||||
|
||||
Name and type
|
||||
:
|
||||
Like with GUGs, we should set a good unambiguous name so the client can clearly summarise this url to the user. 'tbib file page' is good.
|
||||
|
||||
This is a Post URL, so we set the 'post url' type.
|
||||
|
||||
Association logic
|
||||
:
|
||||
All boorus and most sites only present one file per page, but some sites present multiple files on one page, usually several pages in a series/comic, as with pixiv. Danbooru-style thumbnail links to 'this file has a post parent' do not count here--I mean that a single URL embeds multiple full-size images, either with shared or separate tags. It is **very important** to the hydrus client's downloader logic (which decides whether it has previously visited a URL, and hence whether to skip checking it again) that 'can produce multiple files' is checked for any site that can present multiple files on a single page.
|
||||
|
||||
Related is the idea of whether a 'known url' should be associated. Typically, this should be checked for Post and File URLs, which are fixed, and unchecked for Gallery and Watchable URLs, which are ephemeral and give different results from day to day. There are some unusual exceptions, so give it a brief thought--but if you have no special reason, leave this as the default for the url type.
|
||||
|
||||
|
||||
And now, for matching the string itself, let's revisit our four components:
|
||||
|
||||
Scheme
|
||||
:
|
||||
TBIB supports http and https, so I have set the 'preferred' scheme to https. Any 'http' TBIB URL a user inputs will be automatically converted to https.
|
||||
|
||||
Location/Domain
|
||||
:
|
||||
For Post URLs, the domain is always "tbib.org".
|
||||
|
||||
The 'allow' and 'keep' subdomains checkboxes let you determine if a URL with "artistname.artsite.com" will match a URL Class with an "artsite.com" domain and if that subdomain should be remembered going forward. Most sites do not host content on subdomains, so you can usually leave 'allow' unchecked. The 'keep' option (which is only available if 'allow' is checked) is more subtle, only useful for rare cases, and unless you have a special reason, you should leave it checked. (The exception: in cases where a site farms out File URLs to CDN servers on subdomains--like randomly serving a mirror of "https://muhbooru.org/file/123456" on "https://srv2.muhbooru.org/file/123456"--and removing the subdomain still gives a valid URL, you may not wish to keep the subdomain.) Since TBIB does not use subdomains, these options do not matter--we can leave both unchecked.
|
||||
|
||||
'www' and 'www2' and similar subdomains are automatically matched. Don't worry about them.
|
||||
|
||||
Path Components
|
||||
:
|
||||
TBIB just uses a single "index.php" on the root directory, so the path is not complicated. Were it longer (like "gallery/cgi/index.php"), we would add more components ("gallery" and "cgi"), and since the path of a URL has a strict order, we would need to arrange the items in the listbox there so they were sorted correctly.
|
||||
|
||||
Query Parameters
|
||||
:
|
||||
TBIB's index.php takes many query parameters to render different page types. Note that the Post URL uses "s=view", while TBIB Gallery URLs use "s=list". In any case, for a Post URL, "id", "page", and "s" are necessary and sufficient.
|
||||
|
||||
|
||||
## string matches { id="string_matches" }
|
||||
|
||||
As you edit these components, you will be presented with the Edit String Match Panel:
|
||||
|
||||
![](images/edit_string_match_panel.png)
|
||||
|
||||
This lets you set the type of string that will be valid for that component. If a given path or query component does not match the rules given here, the URL will not match the URL Class. Most of the time you will probably want to set 'fixed characters' of something like "post" or "index.php", but if the component you are editing is more complicated and could have a range of different valid values, you can specify just numbers or letters or even a regex pattern. If you try to do something complicated, experiment with the 'example string' entry to make sure you have it set how you think.
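As a rough illustration of the kinds of rule you might pick here--fixed characters, 'just numbers', or a regex--this little Python sketch (the `matches` helper is invented, not part of hydrus) checks a component value against each style:

```
import re

def matches(value, rule_type, rule=None):
    # fixed characters: the component must be exactly this string
    if rule_type == 'fixed':
        return value == rule
    # numbers: any string of digits is fine
    if rule_type == 'numbers':
        return value.isdigit()
    # regex: the whole component must satisfy the pattern
    if rule_type == 'regex':
        return re.fullmatch(rule, value) is not None
    raise ValueError('unknown rule type')

print(matches('index.php', 'fixed', 'index.php'))  # True
print(matches('2429668', 'numbers'))                # True
print(matches('view', 'regex', r'view|list'))       # True
```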
|
||||
|
||||
Don't go overboard with this stuff, though--most sites do not have super-fine distinctions between their different URL types, and hydrus users will not be dropping user account or logout pages or whatever on the client, so you can be fairly liberal with the rules.
|
||||
|
||||
## how do they match, exactly? { id="match_details" }
|
||||
|
||||
This URL Class will be assigned to any URL that matches the location, path, and query. Missing path components or query parameters in the URL will invalidate the match, but additional ones will not!
|
||||
|
||||
For instance, given:
|
||||
|
||||
* URL A: https://8ch.net/tv/res/1002432.html
|
||||
* URL B: https://8ch.net/tv/res
|
||||
* URL C: https://8ch.net/tv/res/1002432
|
||||
* URL D: https://8ch.net/tv/res/1002432.json
|
||||
* URL Class that looks for "(characters)/res/(numbers).html" for the path
|
||||
|
||||
Only URL A will match
|
||||
|
||||
And:
|
||||
|
||||
* URL A: https://boards.4chan.org/m/thread/16086187
|
||||
* URL B: https://boards.4chan.org/m/thread/16086187/ssg-super-sentai-general-651
|
||||
* URL Class that looks for "(characters)/thread/(numbers)" for the path
|
||||
|
||||
Both URL A and B will match
|
||||
|
||||
And:
|
||||
|
||||
* URL A: https://www.pixiv.net/member\_illust.php?mode=medium&illust\_id=66476204
|
||||
* URL B: https://www.pixiv.net/member\_illust.php?mode=medium&illust\_id=66476204&lang=jp
|
||||
* URL C: https://www.pixiv.net/member_illust.php?mode=medium
|
||||
* URL Class that looks for "illust_id=(numbers)" in the query
|
||||
|
||||
Both URL A and B will match, URL C will not
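Here is a rough Python sketch of that 'missing invalidates, surplus is ignored' rule for query parameters, using the pixiv example just above (the `required_params` dict stands in for the URL Class's query rules and is not hydrus's real data structure):

```
import re
from urllib.parse import urlparse, parse_qs

required_params = {'illust_id': r'\d+'}  # what the URL Class insists on

def query_matches(url):
    params = parse_qs(urlparse(url).query)
    for name, pattern in required_params.items():
        values = params.get(name)
        if not values or re.fullmatch(pattern, values[0]) is None:
            return False  # a required parameter is missing or malformed
    return True  # surplus parameters like 'lang=jp' are simply ignored

print(query_matches('https://www.pixiv.net/member_illust.php?mode=medium&illust_id=66476204'))          # True
print(query_matches('https://www.pixiv.net/member_illust.php?mode=medium&illust_id=66476204&lang=jp'))  # True
print(query_matches('https://www.pixiv.net/member_illust.php?mode=medium'))                             # False
```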
|
||||
|
||||
If multiple URL Classes match a URL, the client will try to assign the most 'complicated' one, with the most path components and then query parameters.
|
||||
|
||||
Given two example URLs and URL Classes:
|
||||
|
||||
* URL A: https://somebooru.com/post/123456
|
||||
* URL B: https://somebooru.com/post/123456/manga_subpage/2
|
||||
* URL Class A that looks for "post/(number)" for the path
|
||||
* URL Class B that looks for "post/(number)/manga_subpage/(number)" for the path
|
||||
|
||||
URL A will match URL Class A but not URL Class B and so will receive A.
|
||||
|
||||
URL B will match both and receive URL Class B as it is more complicated.
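A toy sketch of that tie-break, assuming we simply count path components and then query parameters (the real client's bookkeeping is more involved):

```
candidate_classes = [
    {'name': 'somebooru file page', 'path_components': 2, 'query_params': 0},
    {'name': 'somebooru manga subpage', 'path_components': 4, 'query_params': 0},
]

# prefer the most 'complicated' match: most path components, then most query parameters
best = max(candidate_classes, key=lambda c: (c['path_components'], c['query_params']))
print(best['name'])  # somebooru manga subpage
```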
|
||||
|
||||
This situation is not common, but when it does pop up, it can be a pain. It is usually a good idea to match exactly what you need--no more, no less.
|
||||
|
||||
## normalising urls { id="url_normalisation" }
|
||||
|
||||
Different URLs can give the same content. The http and https versions of a URL are typically the same, and:
|
||||
|
||||
* [https://gelbooru.com/index.php?page=post&s=view&id=3767497](https://gelbooru.com/index.php?page=post&s=view&id=3767497)
|
||||
* gives the same as:
|
||||
* [https://gelbooru.com/index.php?id=3767497&page=post&s=view](https://gelbooru.com/index.php?id=3767497&page=post&s=view)
|
||||
|
||||
And:
|
||||
|
||||
* [https://e621.net/post/show/1421754/abstract\_background-animal\_humanoid-blush-brown_ey](https://e621.net/post/show/1421754/abstract_background-animal_humanoid-blush-brown_ey)
|
||||
* is the same as:
|
||||
* [https://e621.net/post/show/1421754](https://e621.net/post/show/1421754)
|
||||
* _is the same as:_
|
||||
* [https://e621.net/post/show/1421754/help\_computer-made\_up_tags-REEEEEEEE](https://e621.net/post/show/1421754/help_computer-made_up_tags-REEEEEEEE)
|
||||
|
||||
Since we are in the business of storing and comparing URLs, we want to 'normalise' them to a single comparable beautiful value. You see a preview of this normalisation on the edit panel. Normalisation happens to all URLs that enter the program.
|
||||
|
||||
Note that in e621's case (and for many other sites!), that text after the id is purely decoration. It can change when the file's tags change, so if we want to compare today's URLs with those we saw a month ago, we'd rather just be without it.
|
||||
|
||||
On normalisation, all URLs will get the preferred http/https switch, and their query parameters will be alphabetised. File and Post URLs will also cull out any surplus path or query components. This wouldn't affect our TBIB example above, but it will clip the e621 example down to that 'bare' id URL, and it will take any surplus 'lang=en' or 'browser=netscape_24.11' garbage off the query text as well. URLs that are not associated and saved and compared (i.e. normal Gallery and Watchable URLs) are not culled of unmatched path components or query parameters, which can sometimes be useful if you want to match (and keep intact) gallery URLs that might or might not include an important 'sort=desc' type of parameter.
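For a feel of what that means in practice, here is a simplified Python sketch of normalising a TBIB-style Post URL: force the preferred scheme, keep only the recognised query parameters, and alphabetise them (the `kept_params` set is an assumption standing in for the URL Class's rules):

```
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

kept_params = {'page', 's', 'id'}  # what this Post URL Class recognises

def normalise(url):
    parts = urlparse(url)
    # drop surplus parameters like 'lang=en', then alphabetise the rest
    query = sorted((k, v) for k, v in parse_qsl(parts.query) if k in kept_params)
    return urlunparse(('https', parts.netloc, parts.path, '', urlencode(query), ''))

print(normalise('http://tbib.org/index.php?s=view&id=6391256&page=post&lang=en'))
# https://tbib.org/index.php?id=6391256&page=post&s=view
```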
|
||||
|
||||
Since File and Post URLs will do this culling, be careful that you do not leave out anything important in your rules. Make sure what you have is both necessary (nothing can be removed and still keep it valid) and sufficient (no more needs to be added to make it valid). It is a good idea to try pasting the 'normalised' version of the example URL into your browser, just to check it still works.
|
||||
|
||||
## 'default' values { id="default_values" }
|
||||
|
||||
Some sites present the first page of a search like this:
|
||||
|
||||
[https://danbooru.donmai.us/posts?tags=skirt](https://danbooru.donmai.us/posts?tags=skirt)
|
||||
|
||||
But the second page is:
|
||||
|
||||
[https://danbooru.donmai.us/posts?tags=skirt&page=2](https://danbooru.donmai.us/posts?tags=skirt&page=2)
|
||||
|
||||
Another example is:
|
||||
|
||||
[https://www.hentai-foundry.com/pictures/user/Mister69M](https://www.hentai-foundry.com/pictures/user/Mister69M)
|
||||
|
||||
[https://www.hentai-foundry.com/pictures/user/Mister69M/page/2](https://www.hentai-foundry.com/pictures/user/Mister69M/page/2)
|
||||
|
||||
What happened to 'page=1' and '/page/1'? Adding those '1' values in works fine! Many sites, when an index is absent, will secretly imply an appropriate 0 or 1. This looks pretty to users looking at a browser address bar, but it can be a pain for us, who want to match both styles to one URL Class. It would be nice if we could recognise the 'bare' initial URL and fill in the '1' values to coerce it to the explicit, automation-friendly format. Defaults to the rescue:
|
||||
|
||||
![](images/downloader_edit_url_class_panel_default.png)
|
||||
|
||||
After you set a path component or query parameter String Match, you will be asked for an optional 'default' value. You won't want to set one most of the time, but for Gallery URLs, it can be hugely useful--see how the normalisation process automatically fills in the missing path component with the default! There are plenty of examples of this among the default Gallery URLs, so check them out. Most sites use page indices starting at '1', but Gelbooru-style imageboards use a 'pid=0' file index (and often move forward in steps of 42, so the next pages will be 'pid=42', 'pid=84', and so on, although others use deltas of 20 or 40).
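As a small sketch of the same idea in Python (a toy `fill_default_page` helper, not the client's code), filling in a missing 'page' query parameter so both forms of the danbooru URL normalise to the same thing:

```
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def fill_default_page(url, param='page', default='1'):
    parts = urlparse(url)
    params = {k: v[0] for k, v in parse_qs(parts.query).items()}
    params.setdefault(param, default)  # only fill in if the parameter is absent
    return urlunparse(parts._replace(query=urlencode(sorted(params.items()))))

print(fill_default_page('https://danbooru.donmai.us/posts?tags=skirt'))
# https://danbooru.donmai.us/posts?page=1&tags=skirt
print(fill_default_page('https://danbooru.donmai.us/posts?tags=skirt&page=2'))
# https://danbooru.donmai.us/posts?page=2&tags=skirt
```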
|
||||
|
||||
## can we predict the next gallery page? { id="next_gallery_page_prediction" }
|
||||
|
||||
Now that we can harmonise gallery urls to a single format, we can predict the next gallery page! If, say, the third path component or the 'page' query parameter is always a number referring to the page index, you can select it under the 'next gallery page' section and set the delta to change it by. The 'next gallery page url' section will then be automatically filled in. This value will be consulted if the parser cannot find a 'next gallery page url' in the page content.
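In spirit, the rule amounts to something like this Python sketch (a toy `next_gallery_page` helper; the Gelbooru-style 'pid' and its delta of 42 come from the example above):

```
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def next_gallery_page(url, param='pid', delta=42):
    parts = urlparse(url)
    params = {k: v[0] for k, v in parse_qs(parts.query).items()}
    params[param] = str(int(params.get(param, '0')) + delta)  # bump the chosen index
    return urlunparse(parts._replace(query=urlencode(sorted(params.items()))))

print(next_gallery_page('https://gelbooru.com/index.php?page=post&s=list&tags=skirt&pid=0'))
# https://gelbooru.com/index.php?page=post&pid=42&s=list&tags=skirt
```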
|
||||
|
||||
It is neat to set this up, but I only recommend it if you actually cannot reliably parse a next gallery page url from the HTML later in the process. It is neater to have searches stop naturally because the parser said 'no more gallery pages' than to have hydrus always go one page beyond and end every single search on an uglier 'No results found' or 404 result.
|
||||
|
||||
Unfortunately, some sites will either not produce an easily parsable next page link or randomly just not include it due to some issue on their end (Gelbooru is a funny example of this). Also, APIs will often have a kind of 'start=200&num=50', 'start=250&num=50' progression but not include that state in the XML or JSON they return. These cases require the automatic next gallery page rules (check out Artstation and tumblr api gallery page URL Classes in the defaults for examples of this).
|
||||
|
||||
## how do we link to APIs? { id="api_links" }
|
||||
|
||||
If you know that a URL has an API backend, you can tell the client to use that API URL when it fetches data. The API URL needs its own URL Class.
|
||||
|
||||
To define the relationship, click the "String Converter" button, which gives you this:
|
||||
|
||||
![](images/edit_string_converter_panel.png)
|
||||
|
||||
You may have seen this panel elsewhere. It lets you convert a string to another over a number of transformation steps. The steps can be as simple as adding or removing some characters or applying a full regex substitution. For API URLs, you are mostly looking to isolate some unique identifying data ("m/thread/16086187" in this case) and then substituting that into the new API path. It is worth testing this with several different examples!
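For instance, converting a 4chan thread URL to its JSON API equivalent boils down to a transformation like this Python sketch (regex shown for illustration; in the dialog you would express it as String Converter steps, and the a.4cdn.org format is just the usual public API shape):

```
import re

def thread_url_to_api(url):
    # isolate the 'board/thread/id' part of the thread URL
    match = re.search(r'boards\.4chan\.org/(\w+)/thread/(\d+)', url)
    if match is None:
        raise ValueError('not a recognised thread URL')
    board, thread_id = match.groups()
    # substitute it into the API URL format
    return f'https://a.4cdn.org/{board}/thread/{thread_id}.json'

print(thread_url_to_api('https://boards.4chan.org/m/thread/16086187'))
# https://a.4cdn.org/m/thread/16086187.json
```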
|
||||
|
||||
When the client links regular URLs to API URLs like this, it will still associate the human-pretty regular URL when it needs to display to the user and record 'known urls' and so on. The API is just a quick lookup when it actually fetches and parses the respective data.
|
17
mkdocs.yml
|
@ -33,6 +33,23 @@ nav:
|
|||
- docker.md
|
||||
- wine.md
|
||||
- running_from_source.md
|
||||
- Downloader Creation:
|
||||
- Introduction: downloader_intro.md
|
||||
- downloader_gugs.md
|
||||
- downloader_url_classes.md
|
||||
- Parsers:
|
||||
- Overview: downloader_parsers.md
|
||||
- Components:
|
||||
- Formulae: downloader_parsers_formulae.md
|
||||
- downloader_parsers_content_parsers.md
|
||||
- downloader_parsers_page_parsers.md
|
||||
- Walkthroughs:
|
||||
- downloader_parsers_full_example_gallery_page.md
|
||||
- downloader_parsers_full_example_file_page.md
|
||||
- downloader_parsers_full_example_api.md
|
||||
- downloader_completion.md
|
||||
- Sharing: downloader_sharing.md
|
||||
- downloader_login.md
|
||||
- API: client_api.md
|
||||
- Misc:
|
||||
- faq.md
|
||||
|
|