From ceb888958eb902df387b7cc53484826fd4273cc5 Mon Sep 17 00:00:00 2001 From: peolic <66393006+peolic@users.noreply.github.com> Date: Tue, 10 Nov 2020 04:03:44 +0200 Subject: [PATCH] Update scraping docs (#929) * update editorconfig to ignore trailing whitespaces in markdown * fix incorrect code example * add missing genders * add gallery to scraping docs * reorder Scraping.md --- ui/v2.5/.editorconfig | 3 + ui/v2.5/src/docs/en/Scraping.md | 122 +++++++++++++++++++------------- 2 files changed, 74 insertions(+), 51 deletions(-) diff --git a/ui/v2.5/.editorconfig b/ui/v2.5/.editorconfig index 86a63dc0f..8c52ff937 100644 --- a/ui/v2.5/.editorconfig +++ b/ui/v2.5/.editorconfig @@ -7,3 +7,6 @@ indent_style = space indent_size = 2 insert_final_newline = true trim_trailing_whitespace = true + +[*.md] +trim_trailing_whitespace = false diff --git a/ui/v2.5/src/docs/en/Scraping.md b/ui/v2.5/src/docs/en/Scraping.md index dbff97f3e..10207f8a6 100644 --- a/ui/v2.5/src/docs/en/Scraping.md +++ b/ui/v2.5/src/docs/en/Scraping.md @@ -46,6 +46,10 @@ sceneByURL: movieByURL: +galleryByFragment: + +galleryByURL: + ``` @@ -62,6 +66,8 @@ The scraping types and their required fields are outlined in the following table | Scraper in `Scrape...` dropdown button in Scene Edit page | Valid `sceneByFragment` configuration. | | Scrape scene from URL | Valid `sceneByURL` configuration with matching URL. | | Scrape movie from URL | Valid `movieByURL` configuration with matching URL. | +| Scraper in `Scrape...` dropdown button in Gallery Edit page | Valid `galleryByFragment` configuration. | +| Scrape gallery from URL | Valid `galleryByURL` configuration with matching URL. | URL-based scraping accepts multiple scrape configurations, and each configuration requires a `url` field. stash iterates through these configurations, attempting to match the entered URL against the `url` fields in the configuration. 
It executes the first scraping configuration where the entered URL contains the value of the `url` field. @@ -93,6 +99,8 @@ The script is sent input and expects output based on the scraping type, as detai | `sceneByFragment` | JSON-encoded scene fragment | JSON-encoded scene fragment | | `sceneByURL` | `{"url": ""}` | JSON-encoded scene fragment | | `movieByURL` | `{"url": ""}` | JSON-encoded movie fragment | +| `galleryByFragment` | JSON-encoded gallery fragment | JSON-encoded gallery fragment | +| `galleryByURL` | `{"url": ""}` | JSON-encoded gallery fragment | For `performerByName`, only `name` is required in the returned performer fragments. One entire object is sent back to `performerByFragment` to scrape a specific performer, so the other fields may be included to assist in scraping a performer. For example, the `url` field may be filled in for the specific performer page, then `performerByFragment` can extract by using its value. @@ -162,7 +170,7 @@ elif sys.argv[1] == "scrapeURL": ### scrapeXPath -This action scrapes a web page using an xpath configuration to parse. This action is valid for `performerByName`, `performerByURL` and `sceneByURL` only. +This action scrapes a web page using an xpath configuration to parse. This action is **not valid** for `performerByFragment`. This action requires that the top-level `xPathScrapers` configuration is populated. The `scraper` field is required and must match the name of a scraper name configured in `xPathScrapers`. For example: @@ -180,7 +188,7 @@ XPath scraping configurations specify the mapping between object fields and an x ### scrapeJson -This action works in the same way as `scrapeXPath`, but uses a mapped json configuration to parse. It uses the top-level `jsonScrapers` configuration. This action is valid for `performerByName`, `performerByURL`, `sceneByFragment` and `sceneByURL`. +This action works in the same way as `scrapeXPath`, but uses a mapped json configuration to parse. 
It uses the top-level `jsonScrapers` configuration. This action is **not valid** for `performerByFragment`. JSON scraping configurations specify the mapping between object fields and a GJSON selector. The JSON scraper scrapes the applicable URL and uses [GJSON](https://github.com/tidwall/gjson/blob/master/SYNTAX.md) to parse the returned JSON object and populate the object fields. @@ -233,7 +241,27 @@ sceneByFragment: The above configuration would scrape from the value of `queryURL`, replacing `{filename}` with the base filename of the scene, after it has been manipulated by the regex replacements. -### Xpath and JSON scrapers configuration +### Stash + +A different stash server can be configured as a scraping source. This action applies only to `performerByName`, `performerByFragment`, and `sceneByFragment` types. This action requires that the top-level `stashServer` field is configured. + +`stashServer` contains a single `url` field for the remote stash server. The username and password can be embedded in this string using `username:password@host`. + +An example stash scrape configuration is below: + +```yaml +name: stash +performerByName: + action: stash +performerByFragment: + action: stash +sceneByFragment: + action: stash +stashServer: + url: http://stashserver.com:9999 +``` + +## Xpath and JSON scrapers configuration The top-level `xPathScrapers` field contains xpath scraping configurations, freely named. These are referenced in the `scraper` field for `scrapeXPath` scrapers. @@ -241,9 +269,9 @@ Likewise, the top-level `jsonScrapers` field contains json scraping configuratio Collectively, these configurations are known as mapped scraping configurations. -A mapped scraping configuration may contain a `common` field, and must contain `performer` or `scene` depending on the scraping type it is configured for. 
+A mapped scraping configuration may contain a `common` field, and must contain `performer`, `scene`, `movie` or `gallery` depending on the scraping type it is configured for. -Within the `performer`/`scene` field are key/value pairs corresponding to the golang fields (see below) on the performer/scene object. These fields are case-sensitive. +Within the `performer`/`scene`/`movie`/`gallery` field are key/value pairs corresponding to the golang fields (see below) on the performer/scene object. These fields are case-sensitive. The values of these may be either a simple selector value, which tells the system where to get the value of the field from, or a more advanced configuration (see below). For example, for an xpath configuration: @@ -271,7 +299,7 @@ performer: # post-processing config values ``` -#### Fixed attribute values +### Fixed attribute values Alternatively, an attribute value may be set to a fixed value, rather than scraping it from the webpage. This can be done by replacing `selector` with `fixed`. For example: @@ -281,7 +309,7 @@ performer: fixed: Female ``` -##### Common fragments +### Common fragments The `common` field is used to configure selector fragments that can be referenced in the selector strings. These are key-value pairs where the key is the string to reference the fragment, and the value is the string that the fragment will be replaced with. For example: @@ -294,7 +322,7 @@ performer: The `Measurements` xpath string will replace `$infoPiece` with `//div[@class="infoPiece"]/span`, resulting in: `//div[@class="infoPiece"]/span[text() = 'Measurements:']/../span[@class="smallInfo"]`. -##### Post-processing options +### Post-processing options Post-processing operations are contained in the `postProcess` key. Post-processing operations are performed in the order they are specified. The following post-processing operations are available: * `feetToCm`: converts a string containing feet and inches numbers into centimetres. 
Looks for up to two separate integers and interprets the first as the number of feet, and the second as the number of inches. The numbers can be separated by any non-numeric character including the `.` character. It does not handle decimal numbers. For example `6.3` and `6ft3.3` would both be interpreted as 6 feet, 3 inches before converting into centimetres. @@ -337,7 +365,23 @@ For backwards compatibility, `replace`, `subscraper` and `parseDate` are also al Post-processing on attribute post-process is done in the following order: `concat`, `replace`, `subscraper`, `parseDate` and then `split`. -##### CDP support +### XPath resources: + +- Test XPaths in Firefox: https://addons.mozilla.org/en-US/firefox/addon/try-xpath/ +- XPath cheatsheet: https://devhints.io/xpath + +### GJSON resources: + +- GJSON Path Syntax: https://github.com/tidwall/gjson/blob/master/SYNTAX.md + +### Debugging support +To print the received html/json from a scraper request to the log file, add the following to your scraper yml file: +```yaml +debug: + printHTML: true +``` + +### CDP support Some websites deliver content that cannot be scraped using the raw html file alone. These websites use javascript to dynamically load the content. As such, direct xpath scraping will not work on these websites. There is an option to use Chrome DevTools Protocol to load the webpage using an instance of Chrome, then scrape the result. @@ -353,7 +397,7 @@ When `useCDP` is set to true, stash will execute or connect to an instance of Ch `Chrome CDP path` can be set to a path to the chrome executable, or an http(s) address to remote chrome instance (for example: `http://localhost:9222/json/version`). -##### XPath scraper example +### XPath scraper example A performer and scene xpath scraper is shown as an example below: @@ -415,7 +459,7 @@ xPathScrapers: See also [#333](https://github.com/stashapp/stash/pull/333) for more examples. 
-##### JSON scraper example +### JSON scraper example A performer and scene scraper for ThePornDB is shown below: @@ -498,17 +542,8 @@ jsonScrapers: Name: $data.tags.#.tag ``` -#### XPath resources: - -- Test XPaths in Firefox: https://addons.mozilla.org/en-US/firefox/addon/try-xpath/ -- XPath cheatsheet: https://devhints.io/xpath - -#### GJSON resources: - -- GJSON Path Syntax: https://github.com/tidwall/gjson/blob/master/SYNTAX.md - -#### Object fields -##### Performer +## Object fields +### Performer ``` Name @@ -530,9 +565,9 @@ Aliases Image ``` -*Note:* - `Gender` must be one of `male`, `female`, `transgender_male`, `transgender_female` (case insensitive). +*Note:* - `Gender` must be one of `male`, `female`, `transgender_male`, `transgender_female`, `intersex`, `non_binary` (case insensitive). -##### Scene +### Scene ``` Title Details @@ -544,18 +579,18 @@ Movies (see Movie Fields) Tags (see Tag fields) Performers (list of Performer fields) ``` -##### Studio +### Studio ``` Name URL ``` -##### Tag +### Tag ``` Name ``` -##### Movie +### Movie ``` Name Aliases @@ -570,29 +605,14 @@ FrontImage BackImage ``` -### Stash - -A different stash server can be configured as a scraping source. This action applies only to `performerByName`, `performerByFragment`, and `sceneByFragment` types. This action requires that the top-level `stashServer` field is configured. - -`stashServer` contains a single `url` field for the remote stash server. The username and password can be embedded in this string using `username:password@host`. 
- -An example stash scrape configuration is below: - -```yaml -name: stash -performerByName: - action: stash -performerByFragment: - action: stash -sceneByFragment: - - action: stash -stashServer: - url: http://stashserver.com:9999 +### Gallery ``` - -### Debugging support -To print the received html/json from a scraper request to the log file, add the following to your scraper yml file: -```yaml -debug: - printHTML: true +Title +Details +URL +Date +Rating +Studio (see Studio Fields) +Tags (see Tag fields) +Performers (list of Performer fields) ```
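The script input/output table in this patch states that a `galleryByURL` script receives `{"url": ""}` and returns a JSON-encoded gallery fragment. A minimal sketch of that exchange follows. The function names are hypothetical, the lowercase output keys (`title`, `url`, `date`, `details`) are an assumption modelled on the Gallery field list rather than something this patch confirms, and the values are placeholders where a real scraper would fetch and parse the page.

```python
import json


def build_gallery_fragment(url):
    """Build a gallery fragment for `url` (hypothetical helper).

    A real scraper would download and parse the page at `url`; this
    sketch returns hard-coded placeholder values instead.
    """
    return {
        # Key names are assumed, not taken from the patch.
        "title": "Example gallery",
        "url": url,
        "date": "2020-11-10",
        "details": "Placeholder details",
    }


def handle_gallery_by_url(raw_input):
    """Convert the JSON text sent for galleryByURL into the JSON reply."""
    query = json.loads(raw_input)           # expected shape: {"url": "..."}
    fragment = build_gallery_fragment(query["url"])
    return json.dumps(fragment)             # JSON-encoded gallery fragment
```

In an actual script, the raw input would presumably be read from standard input and the result written to standard output, for example `print(handle_gallery_by_url(sys.stdin.read()))`.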