changelog
version 297
- finished a prototype 'file notes' system. thumbnails and media viewer canvas now support 'manage->file notes' in their right-click menus. this launches a simple text box which will save its contents to db
- added 'manage_file_notes' shortcut to the 'media' shortcut set
- tag summary generators now have a simple show/hide checkbox and (for thumbnails) custom colours for background and text including alpha channel!
- fixed a variety of timing and display logic issues related to subscription query DEAD status vs. next check time calculation
- all currently dead subscription queries will be revived on update, just in case they were formerly set dead by accident
- the 'fetch tags even if url known and file already in db' option is moved from the download/subscription panel's cog icon to tag import options
- cleaned up tag import options layout, controls, internal workflow, and help button
- added 'select all/none' buttons to tag import options panels with multiple namespaces
- if a subscription is blocked by bandwidth, the manage subscriptions dialog will display that in its 'recent error/delay' column
- the edit subscription dialog will show similar bandwidth blocking info on a per-query basis, under a new 'recent delays' column
- the review bandwidth usage panel will no longer show some unusual results by default--you can still see them by hitting 'show all'
- the review bandwidth usage panel will show the usage at the current search distance in a new column
- the review bandwidth usage panel will show the number of requests after data usage. this might be info-overload, so I might alter the syntax or roll it back entirely
- fixed an issue with the hentai foundry parser pulling images placed in the image description area instead of the main image. this particularly affected the artist 'teku'
- tags for deviant art, tumblr, and thread watchers, which were formerly stored in volatile session memory--meaning half-completed import queues were losing their tags through a program restart--are now saved to the new import object directly
- removed all the old volatile session memory patch code
- added the new import object through a larger part of the parsing pipeline
- deleted the old remains of the giphy parser--if it comes back, it'll all be rewritten in the new system
- harmonised some other import pipeline code to the new system
- added a new 'management and preview panels' submenu to the 'pages' menu
- added an option to control 'save sash positions on close' to this menu
- added an entry to force-save the current sash positions to this menu
- added an entry to 'restore' the currently saved sash positions to all pages to this menu (this is useful if your window resizes really small and all your pages get crushed up)
- rejiggered how URL Classes are matched with URLs to make sure some Post URLs are not lost (this was affecting Hentai Foundry Post URLs, which were sometimes not displaying in the media viewer despite matching)
- fixed an issue where the duplicate filter page's jobs would not trigger an update after a job finished
- fixed an outside chance of a crash after running a duplicate filter page job
- improved how strings are coerced to unicode--now the preferred system encoding will be tried before utf-16, which should improve support for é-type characters in various non-unicode sources (like neighbouring .txt files)
- fixed an issue with the client's local booru and flash files (and made some other file fetching and mime reporting a bit faster and neater overall)
- the client should now be more reliable about redrawing all thumbnail banner summaries when the options dialog is ok'd
- the options->media->media zooms option will now remove any <=0.0 values when it saves
- fixed up some old test code
- improved how some thread-to-gui update reporting code works
- deleted some old network object code
- converted manage subscriptions panel to an edit panel--a decoupling refactor I will likely ultimately make across the program
- wrote a help page for content parsers
- did the first half of a help page for page parsers
- misc refactoring
- misc cleanup
version 296
- the 'allow decompression bombs' option is now moved to 'file import options'. it defaults to False
- file import options now allow max size and max resolution rules. they default to None
- file import options now allow a max gif size rule to deal with THE SFM COMMUNITY. it defaults to 32MB
- file imports will give better quality errors if they fail due to file import option exclusion rules

diff --git a/help/downloader_parsers.html b/help/downloader_parsers.html
index b7418ff5..955731f3 100644
--- a/help/downloader_parsers.html
+++ b/help/downloader_parsers.html
@@ -15,7 +15,7 @@
- Formulae: Take parsable data, search it in some manner, and return 0 to n strings.
- Content Parsers: Take parsable data, apply a formula to it to get some strings, and apply a single metadata 'type' and perhaps some additional modifiers.
- Page Parsers: Take parsable data, apply content parsers to it, and return all the metadata in an appropriate structure.
urls
This should be applied to relative ('/image/smile.jpg') and absolute ('https://mysite.com/content/image/smile.jpg') URLs. If the URL is relative, the client will attempt to generate an absolute URL based on the original URL used to fetch the current data being parsed.
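For a rough sense of what that resolution does (this is an illustration in Python, not hydrus's actual code), the standard urljoin behaves the same way:

    from urllib.parse import urljoin

    # the URL the parsed document was originally fetched from
    source_url = 'https://mysite.com/post/123456'

    # a relative URL is resolved against it
    urljoin( source_url, '/image/smile.jpg' )
    # -> 'https://mysite.com/image/smile.jpg'

    # an absolute URL passes through unchanged
    urljoin( source_url, 'https://mysite.com/content/image/smile.jpg' )
    # -> 'https://mysite.com/content/image/smile.jpg'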
You can set several types of URL:
- actual file means a File URL in our URL Classes system. An actual raw file like a jpg or webm. The client will typically be downloading and attempting to import these URLs, so make sure you are not accidentally linking to an html wrapper page.
- post page means a Post URL. You will typically find these URLs as linked from thumbnails on a gallery page.
- next gallery page means the next Gallery URL on from the current one. This will aid the downloader engine in finding a next page if that is otherwise difficult to guess (some sites have a nice page=1, page=2, page=3 system that we can predict elsewhere in the system, but others are not so simple).
The 'quality precedence' allows the client to select the best of several possible URLs. Given multiple content parsers producing URLs at the same 'level' of parsing, it will select the one with the highest value. Consider these two posts:
The Garnet image fits into a regular page and so Danbooru embed the whole original file in the main media canvas. One easy way to find the full File URL in this case would be to select the "src" attribute of the "img" tag with id="image".
The Cirno one, however, is much larger and has been scaled down. The src of the main canvas tag points to a resized 'sample' link. The full link can be found at the 'view original' link up top, which is an "a" tag with id="image-resize-link".
The Garnet post does not have the 'view original' link, so to cover both situations we might want two content parsers--one fetching the 'canvas' "src" and the other finding the 'view original' "href". If we set the canvas one with a quality of 40 and the view original 60, then the parsing system would know to select the 60 when it was available but to fall back to the 40 if not.
As it happens, Danbooru (afaik, always) gives a link to the original file under the 'Size:' metadata to the left. This is the same 'best link' for both posts above, but it isn't so easy to identify. It is a quiet "a" tag without an "id" and it isn't always in the same location, but if you could pin it down reliably, it might be nice to circumvent the whole issue.
Sites can change suddenly, so it is nice to have a bit of redundancy here if it is easy.
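A minimal sketch of the selection rule, assuming each content parser's result is paired with its precedence value (the URLs and numbers here are made up for illustration):

    # hypothetical (url, precedence) pairs produced by two content parsers
    candidates = [
        ( 'https://danbooru.donmai.us/data/sample/sample-cirno.jpg', 40 ),  # canvas "src"
        ( 'https://danbooru.donmai.us/data/cirno.png', 60 ),                # 'view original' "href"
    ]

    # the parser keeps the candidate with the highest precedence
    best_url = max( candidates, key = lambda pair: pair[1] )[0]
    # -> 'https://danbooru.donmai.us/data/cirno.png'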
tags
These are simple--they tell the client that the given strings are tags. You set the namespace here as well. I recommend you parse 'splashbrush' and set the namespace 'creator' here rather than trying to mess around with 'append prefix "creator:"' string conversions at the formula level--it is simpler up here and it lets hydrus handle any edge case logic for you.
Leave the namespace field blank for unnamespaced tags.
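In other words, the namespace set here is simply glued onto each parsed subtag--roughly like this (a hypothetical helper, not hydrus's API):

    def combine_tag( namespace, subtag ):
        
        # a blank namespace gives a plain, unnamespaced tag
        if namespace == '':
            
            return subtag
            
        return namespace + ':' + subtag

    combine_tag( 'creator', 'splashbrush' )  # -> 'creator:splashbrush'
    combine_tag( '', 'splashbrush' )         # -> 'splashbrush'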
file hash
This says 'this is the hash for the file otherwise referenced in this parser'. So, if you have another content parser finding a File or Post URL, this lets the client know early that that destination happens to have a particular MD5, for instance. The client will look for that hash in its own database, and if it finds a match, it can predetermine if it already has the file (or has previously deleted it) without ever having to download it. Furthermore, if it does find the file for this URL but has never seen the URL before, it will still associate it with that file's 'known urls' as if it had downloaded it!
If you understand this concept, it is great to include. It saves time and bandwidth for everyone. Many site APIs include a hash for this exact reason--they want you to be able to skip a needless download just as much as you do.
The usual suite of hash types is supported: MD5, SHA1, SHA256, and SHA512. This expects the hash as raw bytes, so if your source provides it as hex or base64 (as above), make sure to decode it! In the area for test results, it will present the hash in hex for your convenience.
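If you want to sanity-check the decoding step yourself, here is a small Python sketch (the hash value is made up for the example):

    import base64
    import binascii

    # the same (made-up) MD5, as a site might present it
    hex_md5 = 'fa81d1e15c0ac05b0d74577de6682d5e'
    b64_md5 = '+oHR4VwKwFsNdFd95mgtXg=='

    raw_from_hex = binascii.unhexlify( hex_md5 )  # 16 raw bytes
    raw_from_b64 = base64.b64decode( b64_md5 )

    raw_from_hex == raw_from_b64  # -> True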
timestamp
This lets you say that a given number refers to a particular time for a file. At the moment, I only support 'source time', which represents a 'post' time for the file and is useful for thread and subscription check time calculations. It takes a Unix time integer, like 1520203484, which many APIs will provide. If you are feeling very clever, you can decode a 'MM/DD/YYYY hh:mm:ss' style string to a Unix time integer using string converters, but I may need to put more time into that UI to make it more user friendly!
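For reference, here is the sort of conversion that date-string decoding has to do, sketched in Python and assuming the source gives UTC:

    import calendar
    import time

    posted = '03/04/2018 22:44:44'  # an example 'MM/DD/YYYY hh:mm:ss' string

    # parse it and convert to a Unix time integer, treating it as UTC
    source_time = calendar.timegm( time.strptime( posted, '%m/%d/%Y %H:%M:%S' ) )
    # -> 1520203484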
thread watcher page title
This lets the thread watcher know a good name for its page tab. The subject of a thread is obviously ideal here, but failing that you can try to fetch the first part of the first post's comment. It has precedence, like for URLs, so you can tell the parser which to prefer if you have multiple options. Just for neatness and ease of testing, you probably want to use a string converter here to cut it down to the first 64 characters or so.
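The truncation itself is nothing fancy--in Python terms, it is just something like:

    subject = 'a long thread subject that would make for an unwieldy page tab name, going on and on ...'

    # keep only the first 64 characters for the page tab
    page_name = subject[:64]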
veto
This is a special content type--it tells the next highest stage of parsing that this 'post' of parsing is invalid and to cancel and not return any data. For instance, if a thread post's file was deleted, the site might provide a default '404' stock File URL using the same markup structure as it would for normal images. You don't want to give the user the same 404 image ten times over (with fifteen kinds of tag and source time metadata attached), so you can add a little rule here that says "If the image link is 'https://somesite.com/404.png', raise a veto: File 404" or "If the page has 'No results found' in its main content div, raise a veto: No results found" or "If the expected download tag does not have 'download link' as its text, raise a veto: No Download Link found--possibly Ugoira?" and so on.
A veto is reported under its content parser's name, so it is useful to give these a decent descriptive name so you can see what might be going right or wrong during testing. If it is an appropriate and serious enough veto, it may also rise up to the user level and will be useful if they need to report an error to you (like "After five pages of parsing, it gives 'veto: no next page link'").
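To make the idea concrete, a veto rule boils down to something like this sketch (the exception type and function here are hypothetical, not hydrus's internals):

    class ParseVeto( Exception ):
        
        # hypothetical exception type carrying the veto's name up to the user
        pass

    def check_veto( veto_name, parsed_string, bad_string ):
        
        # if the parsed data matches the rule, abandon this whole 'post'
        if parsed_string == bad_string:
            
            raise ParseVeto( 'veto: ' + veto_name )

    check_veto( 'File 404', 'https://somesite.com/404.png', 'https://somesite.com/404.png' )
    # raises ParseVeto: veto: File 404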
Once you are comfortable with these objects, you might like to check out these walkthroughs, which create full parsers from nothing:
diff --git a/help/downloader_parsers_content_parsers.html b/help/downloader_parsers_content_parsers.html
index 610e3024..c36abc82 100644
--- a/help/downloader_parsers_content_parsers.html
+++ b/help/downloader_parsers_content_parsers.html
@@ -8,9 +8,59 @@
content parsers
So, we can now generate some strings from a document. Content Parsers will let us apply a single metadata type to those strings to inform hydrus what they are.
A content parser has a name, a content type, and a formula. This example fetches the character tags from a danbooru post.
The name is just decorative, but it is generally a good idea to set one so you can find things again when you next revisit them.
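To picture the three parts together, you can think of a content parser as a little bundle like this (a simplified, hypothetical representation, not hydrus's actual objects):

    # a hypothetical, simplified view of what one content parser carries
    content_parser = {
        'name' : 'danbooru character tags',                # decorative, for your own reference
        'content_type' : 'tags',                           # what the parsed strings mean
        'additional_info' : { 'namespace' : 'character' }, # type-specific modifiers
        'formula' : '<html formula that finds the tag strings>'
    }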
The current content types are: