diff --git a/docs/downloader_completion.md b/docs/downloader_completion.md
new file mode 100644
index 00000000..2b2c6104
--- /dev/null
+++ b/docs/downloader_completion.md
@@ -0,0 +1,17 @@
+---
+title: Putting it All Together
+---
+
+Now that you know what GUGs, URL Classes, and Parsers are, you should have some idea of how URL Classes can steer what happens when the downloader is faced with an URL to process. Should a URL be imported as a media file, or should it be parsed? If so, how?
+
+You may have noticed that the edit GUG UI lists whether a current URL Class matches the example URL output. If the GUG has no matching URL Class, it won't be listed in the main 'gallery selector' button's list--it'll be relegated to the 'non-functioning' page. Without a URL Class, the client doesn't know what to do with the output of that GUG. But if a URL Class does match, we can then hand the result over to a parser set at _network->downloader definitions->manage url class links_:
+
+![](images/downloader_completion_url_links.png)
+
+Here you simply set which parsers go with which URL Classes. If you have URL Classes that do not have a parser linked (which is the default for new URL Classes), you can use the 'try to fill in gaps...' button to automatically fill the gaps based on guesses using the parsers' example URLs. This is usually the best way to line things up, unless you have multiple potential parsers for that URL Class, in which case it will usually pick the parser whose name comes earliest in the alphabet. A conceptual sketch of this URL-to-parser dispatch is included at the end of this page.
+
+If the URL Class has no parser set, or the parser is broken or otherwise invalid, the respective URL's file import object in the downloader or subscription is going to throw some kind of error when it runs. If you make and share some parsers, the first indication that something is wrong is going to be several users saying 'I got this error: (_copy notes_ from the file import status window)'. You can then load the parser back up in _manage parsers_, try to figure out what changed, and roll out an update.
+
+_manage url class links_ also shows 'api link review', which summarises which URL Classes api-link to others. In these cases, only the API URL gets a parser entry in the first 'parser links' window, since the original URL will never be fetched for parsing (in the downloader, it will always be converted to the API URL, and _that_ is fetched and parsed).
+
+Once your GUG has a URL Class and your URL Classes have parsers linked, test your downloader! Note that hydrus's URL drag-and-drop import uses URL Classes, so even if you don't have the GUG and gallery stuff done, as long as you have a Post URL Class set up, you can test it just by dragging a Post URL from your browser to the client--it should be added to a new URL Downloader and just work. It feels pretty good once it does!
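+
+To make the dispatch explicit, here is a conceptual sketch in Python. This is not hydrus's actual code--the names and structures are invented purely to illustrate the 'the URL Class chooses the parser' idea described above:
+
+```python
+# Hypothetical sketch: URL -> matching URL Class -> (optional API conversion) -> linked parser.
+# None of these names exist in hydrus; they only illustrate the flow.
+def handle_url(url, url_classes, parser_links):
+    # find the URL Class that matches this URL
+    url_class = next((uc for uc in url_classes if uc['matches'](url)), None)
+    if url_class is None:
+        # no URL Class: the client does not know what to do with this URL
+        raise ValueError('no matching URL Class for ' + url)
+    if url_class.get('to_api_url') is not None:
+        # api-linked: convert and re-dispatch--only the API URL is ever
+        # fetched and parsed, so only the API URL Class needs a parser link
+        return handle_url(url_class['to_api_url'](url), url_classes, parser_links)
+    parser = parser_links.get(url_class['name'])
+    if parser is None:
+        # no linked parser: the file import object will throw an error
+        raise ValueError('no parser linked for ' + url_class['name'])
+    return parser(url)
+```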
diff --git a/docs/downloader_gugs.md b/docs/downloader_gugs.md
new file mode 100644
index 00000000..cd90d535
--- /dev/null
+++ b/docs/downloader_gugs.md
@@ -0,0 +1,45 @@
+---
+title: Gallery URL Generators
+---
+
+Gallery URL Generators, or **GUGs**, are simple objects that take a simple string from the user, like:
+
+* blue_eyes
+* blue\_eyes blonde\_hair
+* InCase
+* elsa dandon_fuga
+* wlop
+* goth* order:id_asc
+
+And convert them into an initialising Gallery URL, such as:
+
+* [http://safebooru.org/index.php?page=post&s=list&tags=blue_eyes&pid=0](http://safebooru.org/index.php?page=post&s=list&tags=blue_eyes&pid=0)
+* [https://konachan.com/post?page=1&tags=blonde\_hair+blue\_eyes](https://konachan.com/post?page=1&tags=blonde_hair+blue_eyes)
+* [https://www.hentai-foundry.com/pictures/user/InCase/page/1](https://www.hentai-foundry.com/pictures/user/InCase/page/1)
+* [http://rule34.paheal.net/post/list/elsa dandon_fuga/1](http://rule34.paheal.net/post/list/elsa dandon_fuga/1)
+* [https://www.deviantart.com/wlop/favourites/?offset=0](https://www.deviantart.com/wlop/favourites/?offset=0)
+* [https://danbooru.donmai.us/posts?page=1&tags=goth*+order:id_asc](https://danbooru.donmai.us/posts?page=1&tags=goth*+order:id_asc)
+
+These are all the 'first page' of the results you would get if you typed or clicked through to the same location on those sites. We are essentially emulating their own simple search-url generation inside the hydrus client.
+
+## actually doing it { id="doing_it" }
+
+Although it is usually a fairly simple process of just substituting the inputted tags into a string template, there are a couple of extra things to think about. Let's look at the UI under _network->downloader definitions->manage gugs_:
+
+![](images/downloader_edit_gug_panel.png)
+
+The client will split whatever the user enters by whitespace, so `blue_eyes blonde_hair` becomes two search terms, `[ 'blue_eyes', 'blonde_hair' ]`, which are then joined back together with the given 'search terms separator' to make `blue_eyes+blonde_hair`. Different sites use different separators, although ' ', '+', and ',' are the most common. The new string is substituted into the `%tags%` in the template phrase, and the URL is made. A short code sketch of this process is included at the end of this section.
+
+Note that you will not have to make %20 or %3A percent-encodings for reserved characters here--the network engine handles all that before the request is sent. For the most part, if you need to include, or a user puts in, ':' or ' ' or 'おっぱい', you can just pass it along straight into the final URL without worrying.
+
+This UI should update as you change it, so have a play and look at how the output example URL changes to get a feel for things. Look at the other defaults to see different examples. Even if you break something, you can just cancel out.
+
+The name of the GUG is important, as this is what will be listed when the user chooses which 'downloader' they want to use. Make sure it has a clear, unambiguous name.
+
+The initial search text is also important. Most downloaders just take some text tags, but if your GUG expects a numerical artist id (like pixiv artist search does), you should specify that explicitly to the user. You can even put in a brief '(two tag maximum)' type of instruction if you like.
+
+Notice that the Deviant Art example above is actually the stream of wlop's _favourites_, not his works, and without an explicit notice of that, a user could easily mistake what they have selected. 'gelbooru' or 'newgrounds' are bad names; 'type here' is bad initialising text.
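+
+If it helps to see the 'split, join, substitute' process as code, here is a minimal sketch of the logic described above. It is illustrative only--the function name and signature are not part of hydrus:
+
+```python
+# A rough sketch of GUG URL generation; not hydrus's actual code.
+def generate_gallery_url(user_input: str, separator: str, url_template: str) -> str:
+    # split the user's input on whitespace into individual search terms
+    search_terms = user_input.split()
+    # join them back together with the site's 'search terms separator'
+    tags_string = separator.join(search_terms)
+    # substitute into the %tags% slot of the template phrase; any
+    # percent-encoding of reserved characters is handled later by the
+    # network engine, just before the request is sent
+    return url_template.replace('%tags%', tags_string)
+
+print(generate_gallery_url(
+    'blue_eyes blonde_hair',
+    '+',
+    'https://konachan.com/post?page=1&tags=%tags%'))
+# https://konachan.com/post?page=1&tags=blue_eyes+blonde_hair
+```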
+
+## Nested GUGs { id="nested_gugs" }
+
+Nested Gallery URL Generators are GUGs that hold other GUGs. Some searches actually use more than one stream (such as a Hentai Foundry artist lookup, where you might want to get both their regular works and their scraps, which are two separate galleries under the site), so NGUGs allow you to generate multiple initialising URLs per input. You can experiment with this UI if you like--it isn't too complicated--but you might want to hold off doing anything for real until you are comfortable with everything and know how producing multiple initialising URLs is going to work in the actual downloader.
diff --git a/docs/downloader_intro.md b/docs/downloader_intro.md
new file mode 100644
index 00000000..b893720f
--- /dev/null
+++ b/docs/downloader_intro.md
@@ -0,0 +1,50 @@
+---
+title: Making a Downloader
+---
+
+# Making a Downloader
+
+!!! caution
+    Creating custom downloaders is only for advanced users who understand HTML or JSON. Beware! If you are simply looking for how to add new downloaders, please head over [here](adding_new_downloaders.html).
+
+## this system { id="intro" }
+
+The first versions of hydrus's downloaders were all hardcoded and static--I wrote everything into the program itself and nothing was user-creatable or -fixable. After the maintenance burden of the entire messy system proved too large for me to keep up with and a semi-editable booru system proved successful, I decided to overhaul the entire thing to allow user creation and sharing of every component. It is designed to be very simple to the front-end user--they will typically handle a couple of png files and then select a new downloader from a list--but very flexible (and hence potentially complicated) on the back-end. These help pages describe the different components with the intention of making an HTML- or JSON-fluent user able to create and share a full new downloader on their own.
+
+As always, this is all under active development. Your feedback on the system would be appreciated, and if something is confusing or you discover something in here that is out of date, please [let me know](contact.html).
+
+## what is a downloader? { id="downloader" }
+
+In hydrus, a downloader is one of:
+
+**Gallery Downloader**
+: This takes a string like 'blue_eyes' to produce a series of thumbnail gallery page URLs that can be parsed for image page URLs, which can ultimately be parsed for file URLs and metadata like tags. Boorus fall into this category.
+
+**URL Downloader**
+: This does just the Gallery Downloader's back-end--instead of taking a string query, it takes the gallery or post URLs directly from the user, whether that is one from a drag-and-drop event or hundreds pasted from the clipboard. For our purposes here, the URL Downloader is a subset of the Gallery Downloader.
+
+**Watcher**
+: This takes a URL that it will check at timed intervals, parsing it for new URLs that it then queues up to be downloaded. It typically stops checking after the 'file velocity' (such as '1 new file per day') drops below a certain level. It is mostly for watching imageboard threads.
+
+**Simple Downloader**
+: This takes a URL one time and parses it for direct file URLs. This is a miscellaneous system for certain simple gallery types and some testing/'I just need the third tag's _src_ on this one page' jobs.
+
+The system currently supports HTML and JSON parsing. XML should be fine under the HTML parser--it isn't strict about checking types and all that.
+
+## what does a downloader do? { id="pipeline" }
+
+The Gallery Downloader is the most complicated downloader and uses all the possible components. In order for hydrus to convert our example 'blue_eyes' query into a bunch of files with tags, it needs to:
+
+* Present some user interface named 'safebooru tag search' to the user that will convert their input of 'blue_eyes' into [https://safebooru.org/index.php?page=post&s=list&tags=blue_eyes&pid=0](https://safebooru.org/index.php?page=post&s=list&tags=blue_eyes&pid=0).
+* Recognise [https://safebooru.org/index.php?page=post&s=list&tags=blue_eyes&pid=0](https://safebooru.org/index.php?page=post&s=list&tags=blue_eyes&pid=0) as a Safebooru Gallery URL.
+* Convert the HTML of a Safebooru Gallery URL into a list of URLs like [https://safebooru.org/index.php?page=post&s=view&id=2437965](https://safebooru.org/index.php?page=post&s=view&id=2437965) and possibly a 'next page' URL (e.g. [https://safebooru.org/index.php?page=post&s=list&tags=blue_eyes&pid=40](https://safebooru.org/index.php?page=post&s=list&tags=blue_eyes&pid=40)) that points to the next page of thumbnails.
+* Recognise the [https://safebooru.org/index.php?page=post&s=view&id=2437965](https://safebooru.org/index.php?page=post&s=view&id=2437965) URLs as Safebooru Post URLs.
+* Convert the HTML of a Safebooru Post URL into a file URL like [https://safebooru.org//images/2329/b6e8c263d691d1c39a2eeba5e00709849d8f864d.jpg](https://safebooru.org//images/2329/b6e8c263d691d1c39a2eeba5e00709849d8f864d.jpg) and some tags like: 1girl, bangs, black gloves, blonde hair, blue eyes, braid, closed mouth, day, fingerless gloves, fingernails, gloves, grass, hair ornament, hairclip, hands clasped, creator:hankuri, interlocked fingers, long hair, long sleeves, outdoors, own hands together, parted bangs, pointy ears, character:princess zelda, smile, solo, series:the legend of zelda, underbust.
+
+So we have three components:
+
+* [**Gallery URL Generator (GUG):**](downloader_gugs.html) faces the user and converts text input into initialising Gallery URLs.
+* [**URL Class:**](downloader_url_classes.html) identifies URLs and informs the client how to deal with them.
+* [**Parser:**](downloader_parsers.html) converts data from URLs into hydrus-understandable metadata.
+
+URL downloaders and watchers do not need the Gallery URL Generator, as their input _is_ an URL. And simple downloaders also have an explicit 'just download it and parse it with this simple rule' action, so they do not use URL Classes (or even full-fledged Page Parsers) either.
\ No newline at end of file
diff --git a/docs/downloader_login.md b/docs/downloader_login.md
new file mode 100644
index 00000000..69497a8e
--- /dev/null
+++ b/docs/downloader_login.md
@@ -0,0 +1,3 @@
+# Login Manager
+
+The system works, but this help was never done! Check the defaults for examples of how it works, sorry!
\ No newline at end of file
diff --git a/docs/downloader_parsers.md b/docs/downloader_parsers.md
new file mode 100644
index 00000000..f3a986db
--- /dev/null
+++ b/docs/downloader_parsers.md
@@ -0,0 +1,30 @@
+# Parsers
+
+In hydrus, a parser is an object that takes a single block of HTML or JSON data and returns many kinds of hydrus-level metadata.
+
+Parsers are flexible and potentially quite complicated. You might like to open _network->manage parsers_ and explore the UI as you read these pages. Check out how the default parsers already in the client work, and if you want to write a new one, see if there is something already in there that is similar--it is usually easier to duplicate an existing parser and then alter it than to create a new one from scratch every time.
+
+There are three main components in the parsing system (click to open each component's help page):
+
+* [**Formulae:**](downloader_parsers_formulae.md) Take parsable data, search it in some manner, and return 0 to n strings.
+* [**Content Parsers:**](downloader_parsers_content_parsers.md) Take parsable data, apply a formula to it to get some strings, and apply a single metadata 'type' and perhaps some additional modifiers.
+* [**Page Parsers:**](downloader_parsers_page_parsers.md) Take parsable data, apply content parsers to it, and return all the metadata in an appropriate structure.
+
+Once you are comfortable with these objects, you might like to check out these walkthroughs, which create full parsers from nothing:
+
+* [e621 HTML gallery page](downloader_parsers_full_example_gallery_page.md)
+* [Gelbooru HTML file page](downloader_parsers_full_example_file_page.md)
+* [Artstation JSON file page API](downloader_parsers_full_example_api.md)
+
+Once you are comfortable with parsers, and if you are feeling brave, check out how the default imageboard and pixiv parsers work. These are complicated and use more experimental areas of the code to get their job done. If you are trying to get a new imageboard parser going and can't figure out subsidiary page parsers, send me a mail or something and I'll try to help you out!
+
+When you are making a parser, consider this checklist (you might want to copy/have your own version of this somewhere):
+
+* Do you get good URLs with good priority? Do you ever accidentally get favourite/popular/advert results you didn't mean to?
+* If you need a next gallery page URL, is it ever not available (and hence needs a URL Class fix)? Does it change for search tags with unicode or http-restricted characters?
+* Do you get nice namespaced tags? Are any unwanted single characters like -/+/? getting through?
+* Is the file hash available anywhere?
+* Is a source/post time available?
+* Is a source URL available? Is it good quality, or does it often just point to an artist's base twitter profile? If you pull it from text or a tooltip, is it clipped for longer URLs?
+
+[Taken a break? Now let's put it all together ---->](downloader_completion.md)
\ No newline at end of file
diff --git a/docs/downloader_parsers_content_parsers.md b/docs/downloader_parsers_content_parsers.md
new file mode 100644
index 00000000..1a4de4f5
--- /dev/null
+++ b/docs/downloader_parsers_content_parsers.md
@@ -0,0 +1,70 @@
+# Content Parsers
+
+So, we can now generate some strings from a document. Content Parsers will let us apply a single metadata type to those strings to inform hydrus what they are.
+
+![](images/edit_content_parser_panel_tags.png)
+
+A content parser has a name, a content type, and a formula. This example fetches the character tags from a danbooru post.
+
+The name is just decorative, but a descriptive one is generally a good idea so you can find things again when you next revisit them.
+
+The current content types are:
+
+## urls { id="intro" }
+
+This should be applied to relative ('/image/smile.jpg') and absolute ('https://mysite.com/content/image/smile.jpg') URLs. If the URL is relative, the client will generate an absolute URL based on the original URL used to fetch the data being parsed (i.e. it should all just work).
+
+You can set several types of URL:
+
+* **url to download/pursue** means a Post URL or a File URL in our URL Classes system, like a booru post or an actual raw file like a jpg or webm.
+* **url to associate** means an URL you want added to the list of 'known urls' for the file, but not one you want the client to actually download and parse. Use this to neatly add booru 'source' urls.
+* **next gallery page** means the next Gallery URL on from the current one.
+
+The 'file url quality precedence' allows the client to select the best of several possible URLs. Given multiple content parsers producing URLs at the same 'level' of parsing, it will select the one with the highest value. Consider these two posts:
+
+* [https://danbooru.donmai.us/posts/3016415](https://danbooru.donmai.us/posts/3016415)
+* [https://danbooru.donmai.us/posts/3040603](https://danbooru.donmai.us/posts/3040603)
+
+The Garnet image fits into a regular page, and so Danbooru embeds the whole original file in the main media canvas. One easy way to find the full File URL in this case would be to select the "src" attribute of the "img" tag with id="image".
+
+The Cirno one, however, is much larger and has been scaled down. The src of the main canvas tag points to a resized 'sample' link. The full link can be found at the 'view original' link up top, which is an "a" tag with id="image-resize-link".
+
+The Garnet post does not have the 'view original' link, so to cover both situations we might want two content parsers--one fetching the 'canvas' "src" and the other finding the 'view original' "href". If we set the 'canvas' one with a quality of 40 and the 'view original' one with 60, then the parsing system would know to select the 60 when it was available but to fall back to the 40 if not.
+
+As it happens, Danbooru (afaik, always) gives a link to the original file under the 'Size:' metadata to the left. This is the same 'best link' for both posts above, but it isn't so easy to identify. It is a quiet "a" tag without an "id" and it isn't always in the same location, but if you could pin it down reliably, it might be nice to circumvent the whole issue.
+
+Sites can change suddenly, so it is nice to have a bit of redundancy here if it is easy.
+
+## tags { id="tags" }
+
+These are simple--they tell the client that the given strings are tags. You set the namespace here as well. I recommend you parse 'splashbrush' and set the namespace 'creator' here rather than trying to mess around with 'append prefix "creator:"' string conversions at the formula level--it is simpler up here, and it lets hydrus handle any edge-case logic for you.
+
+Leave the namespace field blank for unnamespaced tags.
+
+## file hash { id="file_hash" }
+
+This says 'this is the hash for the file otherwise referenced in this parser'. So, if you have another content parser finding a File or Post URL, this lets the client know early that that destination happens to have a particular MD5, for instance. The client will look for that hash in its own database, and if it finds a match, it can determine ahead of time whether it already has the file (or has previously deleted it) without ever having to download it. When this happens, it will still add tags and associate the file with the URL for its 'known urls' just as if it _had_ downloaded it!
+
+If you understand this concept, it is great to include. It saves time and bandwidth for everyone. Many site APIs include a hash for this exact reason--they want you to be able to skip a needless download just as much as you do.
+
+![](images/edit_content_parser_panel_hash.png)
+
+The usual suite of hash types is supported: MD5, SHA1, SHA256, and SHA512. An old version of this required some weird string decoding, but this is no longer true. Select 'hex' or 'base64' from the encoding type dropdown, and then just parse the 'e5af57a687f089894f5ecede50049458' or '5a9XpofwiYlPXs7eUASUWA==' text, and hydrus should handle the rest. It will present the parsed hash in hex.
+
+## timestamp { id="timestamp" }
+
+This lets you say that a given number refers to a particular time for a file. At the moment, I only support 'source time', which represents a 'post' time for the file and is useful for thread and subscription check time calculations. It takes a Unix time integer, like 1520203484, which many APIs will provide.
+
+If you are feeling very clever, you can decode a 'MM/DD/YYYY hh:mm:ss' style string to a Unix time integer using string converters, which use some hacky and semi-reliable python %d-style values as per [here](https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior). Look at the existing defaults for examples of this, and don't worry about being more accurate than 12/24 hours--trying to figure out the timezone is a hell not worth attempting, and it doesn't really matter in the long run for the subscriptions and thread watchers that might care. A small illustration of this decode is included at the end of this page.
+
+## watcher page title { id="page_title" }
+
+This lets the watcher know a good name/subject for its entries. The subject of a thread is obviously ideal here, but failing that you can try to fetch the first part of the first post's comment. It has precedence, like URLs do, so you can tell the parser which to prefer if you have multiple options. Just for neatness and ease of testing, you probably want to use a string converter here to cut it down to the first 64 characters or so.
+
+## veto { id="veto" }
+
+This is a special content type--it tells the next highest stage of parsing that this 'post' of parsing is invalid and to cancel and not return any data. For instance, if a thread post's file was deleted, the site might provide a default '404' stock File URL using the same markup structure as it would for normal images. You don't want to give the user the same 404 image ten times over (with fifteen kinds of tag and source time metadata attached), so you can add a little rule here that says "If the image link is 'https://somesite.com/404.png', raise a veto: File 404" or "If the page has 'No results found' in its main content div, raise a veto: No results found" or "If the expected download tag does not have 'download link' as its text, raise a veto: No Download Link found--possibly Ugoira?" and so on.
+
+![](images/edit_content_parser_panel_veto.png)
+
+Content parsers will associate their name with any veto they raise, so it is useful to give these a decent descriptive name so you can see what might be going right or wrong during testing. If it is an appropriate and serious enough veto, it may also rise up to the user level, where it will be useful if they need to report an error to you (like "After five pages of parsing, it gives 'veto: no next page link'").
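+
+As a footnote to the timestamp section above: here is a rough illustration, in plain Python, of the decode such a string converter performs. The example date and the UTC assumption are mine; this shows the underlying idea, not hydrus's actual converter code:
+
+```python
+# Decode a 'MM/DD/YYYY hh:mm:ss' style string into a Unix time integer.
+# Illustrative only; hydrus's string converter does this for you in the UI.
+from datetime import datetime, timezone
+
+parsed = datetime.strptime('03/04/2018 22:44:44', '%m/%d/%Y %H:%M:%S')
+
+# assume UTC if the site does not say--being a few hours off is usually fine
+# for the subscription and watcher check timing that uses this
+unix_time = int(parsed.replace(tzinfo=timezone.utc).timestamp())
+
+print(unix_time)  # 1520203484
+```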
\ No newline at end of file
diff --git a/docs/downloader_parsers_formulae.md b/docs/downloader_parsers_formulae.md
new file mode 100644
index 00000000..eb8cf9eb
--- /dev/null
+++ b/docs/downloader_parsers_formulae.md
@@ -0,0 +1,160 @@
+---
+title: Parser Formulae
+---
+
+# Parser Formulae { id="formulae" }
+
+Formulae are tools used by higher-level components of the parsing system. They take some data (typically some HTML or JSON) and return 0 to n strings. For our purposes, these strings will usually be tags, URLs, and timestamps. You will usually see them summarised with this panel:
+
+![](images/edit_formula_panel.png)
+
+The different types are currently [html](#html_formula), [json](#json_formula), [compound](#compound_formula), and [context variable](#context_variable_formula).
+
+## html { id="html_formula" }
+
+This takes a full HTML document or a sample of HTML--and any regular sort of XML _should_ also work. It starts at the root node and searches for lower nodes using one or more ordered rules based on tag name and attributes, and then returns string data from those final nodes.
+
+For instance, if you have this:
+
+```html
+
+ +