hydrus/help/downloader_parsers_full_exa...

<html>
	<head>
		<title>downloader - parsers - gallery page example</title>
		<link href="hydrus.ico" rel="shortcut icon" />
		<link href="style.css" rel="stylesheet" type="text/css" />
	</head>
	<body>
		<div class="content">
			<p><a href="downloader_parsers.html"><---- Back to main parsers page</a></p>
			<p class="warning">These guides should <i>roughly</i> follow what comes with the client by default! You might like to have the actual UI open in front of you so you can play around with the rules and try different test parses yourself.</p>
			<h3 id="intro"><a href="#intro">gallery pages</a></h3>
			<p>Let's look at this page: <a href="https://e621.net/post/index/1/rating:safe pokemon">https://e621.net/post/index/1/rating:safe pokemon</a></p>
			<p>We've got 75 thumbnails and a bunch of page URLs at the bottom.</p>
			<h3 id="main_page"><a href="#main_page">first, the main page</a></h3>
			<p>This is easy. It gets a good name and some example URLs. e621 has some different ways of writing out their queries (and as they use some tags with '/', like 'male/female', this can cause character encoding issues depending on whether the tag is in the path or query!), but we'll put that off for now--we just want to parse some stuff.</p>
			<p><img src="downloader_gallery_example_main.png" /></p>
			<h3 id="thumbnail_urls"><a href="#thumbnail_urls">thumbnail links</a></h3>
			<p>Most browsers have some good developer tools to let you Inspect Element and get a better view of the HTML DOM. Be warned that this information isn't always the same as View Source (which is what hydrus will get when it downloads the initial HTML document), as some sites load results dynamically with javascript and maybe an internal JSON API call (when sites move to systems that load more thumbs as you scroll down, it makes our job more difficult--in these cases, you'll need to chase down the embedded JSON or figure out what API calls their JS is making--the browser's developer tools can help you here again). Thankfully, e621 is (and most boorus are) fairly static and simple:</p>
			<p><img src="downloader_gallery_example_thumb_html.png" /></p>
			<p>Every thumb on e621 is a &lt;span&gt; with class="thumb" wrapping an &lt;a&gt; and an &lt;img&gt;. This is a common pattern, and easy to parse:</p>
			<p><img src="downloader_gallery_example_thumb_parsing.png" /></p>
			<p>There's no tricky String Matches or String Converters needed--we are just fetching hrefs. Note that the links get relative-matched to example.com for now--I'll probably fix this to apply to one of the example URLs, but rest assured that IRL the parser will 'join' its url up with the appropriate Gallery URL used to fetch the data. Sometimes, you might want to add a rule for 'search descendents for the first &lt;div&gt; tag with id=content' to make sure you are only grabbing thumbs from the main box, whether that is a &lt;div&gt; or a &lt;span&gt;, and whether it has id="content" or class="mainBox", but unless you know that booru likes to embed "popular" or "favourite" 'thumbs' up top that will be accidentally caught by a &lt;span&gt;'s with class="thumb", I recommend you not make your rules overly specific--all it takes is for their dev to change the name of their content box, and your whole parser breaks. I've ditched the &lt;span&gt; requirement in the rule here for exactly that reason--class="thumb" is necessary and sufficient.</p>
			<p>Remember that the parsing system allows you to go up ancestors as well as down descendants. If your thumb-box has multiple links--like to see the artist's profile or 'set as favourite'--you can try searching for the &lt;span&gt;s, then down to the &lt;img&gt;, and then <i>up</i> to the nearest &lt;a&gt;. In English, this is saying, "Find me all the image link URLs in the thumb boxes."</p>
			<h3 id="next_gallery_url"><a href="#next_gallery_url">next gallery page link</a></h3>
			<p>Most boorus have 'next' or '>>' at the bottom, which can be simple enough, but many have a neat &lt;link href="/post/index/2/rating:safe%20pokemon" rel="next" /&gt; in the &lt;head&gt;. The &lt;head&gt; solution is easier, if available, but my default e621 parser happens to pursue the 'paginator':</p>
			<p><img src="downloader_gallery_example_paginator_parsing.png" /></p>
			<p>As it happens, e621 also apply the rel="next" attribute to their "Next >>" links, which makes it all that easier for us to find. Sometimes there is no "next" id or class, and you'll want to add a String Match to your html formula to test for a string value of '>>' or whatever it is. A good trick is to View Source and then search for the critical "/post/index/2/" phrase you are looking for--you might find what you want in a &lt;link&gt; tag you didn't expect or even buried in a hidden 'share to tumblr' button. &lt;form&gt;s for reporting or commenting on content are another good place to find content ids.</p>
			<p>Note that this finds two URLs. e621 apply the rel="next" to both the "2" link and the "Next >>" one. The download engine merges the parser's dupes, so don't worry if you end up parsing both the 'top' and 'bottom' next page links, or if you use multiple rules to parse the same data in different ways.</p>
			<h3 id="summary"><a href="#summary">summary</a></h3>
			<p>With those two rules, we are done. Gallery parsers are nice and simple.</p>
		</div>
	</body>
</html>
Version 296 2018-02-28 22:30:36 +00:00			`<html>`
			`<head>`
Version 322 2018-09-12 21:36:26 +00:00			`<title>downloader - parsers - gallery page example</title>`
Version 296 2018-02-28 22:30:36 +00:00			`<link href="hydrus.ico" rel="shortcut icon" />`
			`<link href="style.css" rel="stylesheet" type="text/css" />`
			`</head>`
			`<body>`
			`<div class="content">`
			`<p><a href="downloader_parsers.html"><---- Back to main parsers page</a></p>`
Version 322 2018-09-12 21:36:26 +00:00			`<p class="warning">These guides should <i>roughly</i> follow what comes with the client by default! You might like to have the actual UI open in front of you so you can play around with the rules and try different test parses yourself.</p>`
Version 427 closes #788 2021-01-27 22:14:03 +00:00			`<h3 id="intro"><a href="#intro">gallery pages</a></h3>`
Version 322 2018-09-12 21:36:26 +00:00			`<p>Let's look at this page: <a href="https://e621.net/post/index/1/rating:safe pokemon">https://e621.net/post/index/1/rating:safe pokemon</a></p>`
			`<p>We've got 75 thumbnails and a bunch of page URLs at the bottom.</p>`
Version 427 closes #788 2021-01-27 22:14:03 +00:00			`<h3 id="main_page"><a href="#main_page">first, the main page</a></h3>`
Version 322 2018-09-12 21:36:26 +00:00			`<p>This is easy. It gets a good name and some example URLs. e621 has some different ways of writing out their queries (and as they use some tags with '/', like 'male/female', this can cause character encoding issues depending on whether the tag is in the path or query!), but we'll put that off for now--we just want to parse some stuff.</p>`
			`<p><img src="downloader_gallery_example_main.png" /></p>`
Version 427 closes #788 2021-01-27 22:14:03 +00:00			`<h3 id="thumbnail_urls"><a href="#thumbnail_urls">thumbnail links</a></h3>`
Version 322 2018-09-12 21:36:26 +00:00			<p>Most browsers have some good developer tools to let you Inspect Element and get a better view of the HTML DOM. Be warned that this information isn't always the same as View Source (which is what hydrus will get when it downloads the initial HTML document), as some sites load results dynamically with javascript and maybe an internal JSON API call (when sites move to systems that load more thumbs as you scroll down, it makes our job more difficult--in these cases, you'll need to chase down the embedded JSON or figure out what API calls their JS is making--the browser's developer tools can help you here again). Thankfully, e621 is (and most boorus are) fairly static and simple:</p>
			`<p><img src="downloader_gallery_example_thumb_html.png" /></p>`
			`<p>Every thumb on e621 is a <span> with class="thumb" wrapping an <a> and an <img>. This is a common pattern, and easy to parse:</p>`
			`<p><img src="downloader_gallery_example_thumb_parsing.png" /></p>`
			<p>There's no tricky String Matches or String Converters needed--we are just fetching hrefs. Note that the links get relative-matched to example.com for now--I'll probably fix this to apply to one of the example URLs, but rest assured that IRL the parser will 'join' its url up with the appropriate Gallery URL used to fetch the data. Sometimes, you might want to add a rule for 'search descendents for the first <div> tag with id=content' to make sure you are only grabbing thumbs from the main box, whether that is a <div> or a <span>, and whether it has id="content" or class="mainBox", but unless you know that booru likes to embed "popular" or "favourite" 'thumbs' up top that will be accidentally caught by a <span>'s with class="thumb", I recommend you not make your rules overly specific--all it takes is for their dev to change the name of their content box, and your whole parser breaks. I've ditched the <span> requirement in the rule here for exactly that reason--class="thumb" is necessary and sufficient.</p>
			`<p>Remember that the parsing system allows you to go up ancestors as well as down descendants. If your thumb-box has multiple links--like to see the artist's profile or 'set as favourite'--you can try searching for the <span>s, then down to the <img>, and then <i>up</i> to the nearest <a>. In English, this is saying, "Find me all the image link URLs in the thumb boxes."</p>`
Version 427 closes #788 2021-01-27 22:14:03 +00:00			`<h3 id="next_gallery_url"><a href="#next_gallery_url">next gallery page link</a></h3>`
Version 322 2018-09-12 21:36:26 +00:00			`<p>Most boorus have 'next' or '>>' at the bottom, which can be simple enough, but many have a neat <link href="/post/index/2/rating:safe%20pokemon" rel="next" /> in the <head>. The <head> solution is easier, if available, but my default e621 parser happens to pursue the 'paginator':</p>`
			`<p><img src="downloader_gallery_example_paginator_parsing.png" /></p>`
			<p>As it happens, e621 also apply the rel="next" attribute to their "Next >>" links, which makes it all that easier for us to find. Sometimes there is no "next" id or class, and you'll want to add a String Match to your html formula to test for a string value of '>>' or whatever it is. A good trick is to View Source and then search for the critical "/post/index/2/" phrase you are looking for--you might find what you want in a <link> tag you didn't expect or even buried in a hidden 'share to tumblr' button. <form>s for reporting or commenting on content are another good place to find content ids.</p>
			`<p>Note that this finds two URLs. e621 apply the rel="next" to both the "2" link and the "Next >>" one. The download engine merges the parser's dupes, so don't worry if you end up parsing both the 'top' and 'bottom' next page links, or if you use multiple rules to parse the same data in different ways.</p>`
Version 427 closes #788 2021-01-27 22:14:03 +00:00			`<h3 id="summary"><a href="#summary">summary</a></h3>`
Version 322 2018-09-12 21:36:26 +00:00			`<p>With those two rules, we are done. Gallery parsers are nice and simple.</p>`
Version 296 2018-02-28 22:30:36 +00:00			`</div>`
			`</body>`
Version 427 closes #788 2021-01-27 22:14:03 +00:00			`</html>`