Commit Graph

16 Commits

Author SHA1 Message Date
WithoutPants 7b5bd80515 Separate graphql API from rest of the system (#2503)
* Move graphql generated files to api
* Refactor identify options
* Remove models.StashBoxes
* Move ScraperSource to scraper package
* Rename field strategy enums
* Rename identify.TaskOptions to Options
2022-09-06 07:03:40 +00:00
SmallCoccinelle cf4ab843f6
Fix setting images (#2068)
When postprocessing, pass the images by reference rather than value,
so we get the Image fields populated correctly in the output.
2021-11-29 14:54:01 +11:00
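To illustrate the fix described in #2068, the sketch below contrasts passing a struct by value with passing it by reference. The `scrapedItem` type and both helper functions are hypothetical and are not taken from the stash code base; they only show why a by-value call leaves the image field unset for the caller.

```go
package main

// scrapedItem is a hypothetical stand-in for a scraped result.
type scrapedItem struct {
	Image *string
}

// setImageByValue receives a copy of the item, so the assignment is lost
// as soon as the function returns.
func setImageByValue(item scrapedItem, img string) {
	item.Image = &img
}

// setImageByRef receives a pointer, so the caller sees the populated field.
func setImageByRef(item *scrapedItem, img string) {
	item.Image = &img
}
```

Calling `setImageByValue(item, "data")` leaves `item.Image` nil, while `setImageByRef(&item, "data")` populates it.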
SmallCoccinelle 4089fcf1e2
Scraper refactor middle (#2043)
* Push scrapeByURL into scrapers

Replace ScrapePerfomerByURL, ScrapeMovie..., ... with ScrapeByURL in
the scraperActionImpl interface. This allows us to delete a lot of
repeated code in the scrapers and replace the central part with a
switch on the scraper type.

* Fold name scraping into one call

Follow up on scraper refactoring. Name scrapers use the same code path.
This allows us to restructure some code and kill some functions, adding
variance to the name scraping code. It allows us to remove some code
repetition as well.

* Do not export loop refs.

* Simplify fragment scraping

Generalize fragment scrapers into ScrapeByFragment. This reduces the
fragment code flow to a single path, which should be easier to handle
in the future.

* Eliminate more context.TODO()

In a number of cases, we have a context now. Use the context rather than
TODO() for those cases in order to make those operations cancellable.

* Pass the context for the stashbox scraper

This removes all context.TODO() in the path of the stashbox scraper,
and replaces it with the context that's present on each of the paths.

* Pass the context into subscrapers

Mostly a mechanical update, where we pass in the context for
subscraping. This removes the final context.TODO() in the scraper
code.

* Warn on unknown fields from scripts

A common mistake for new script writers is that they return fields
not known to stash. For instance the name "description" is used rather
than "details".

Decode disallowing unknown fields. If this fails, use a tee-reader to
fall back to the old behavior, but print a warning for the user in this
case. Thus, we retain the old behavior, but print warnings for scripts
which fail the more strict unknown-fields detection.

* Nil-check before running the postprocessing chain

Fixes panics when scraping returns nil values.

* Lift nil-ness in post-postprocessing

If the struct we are trying to post-process is nil, we shouldn't
enter the postprocessing flow at all. Pass the struct as a value
rather than a pointer, eliminating nil-checks as we go. Use the
top-level postProcess call to make the nil-check and then abort there
if the object we are looking at is nil.

* Allow conversion routines to handle values

If we have a non-pointer type in the interface, we should also convert
those into ScrapedContent. Otherwise we get errors on deprecated
functions.
2021-11-26 11:20:06 +11:00
SmallCoccinelle e513b6ffa5
Cache and reuse the scraper HTTP client (#1855)
* Add Cookies directly to the request

Rather than maintaining a cookie jar on a one-shot HTTP client, maintain
the jar ourselves: make a new jar, then use it to select the right
cookies.

The cookies are set on the request rather than on the client. This will
retain the current behavior as we are always throwing the client away
after each use.

This patch enables the lifting of the http client as well over time.

* Introduce a cached scraper HTTP client

The scraper cache is augmented with an *http.Client. These are safe for
concurrent use, so the pointer can safely be passed around. Push this
into scraper configurations where applicable, next to the txnManagers.

When we issue a loadUrl request, do so on the cached *http.Client,
which will reuse existing idle connections in the client if any are
present.

* Set MaxIdleConnsPerHost. Closes #1850

We allow for up to 8 idle connections to a single host. This should
make concurrent operation toward the same host reuse connections, even
for sizeable concurrency.

The number isn't bumped excessively high. We should probably limit
concurrency toward a single site anyway, since we'll be able to overrun
a site with queries quite easily if we have many concurrent goroutines
issuing requests at the same time.

* Reinstate driverOptions / useCDP check

Use DeMorgan's laws to invert the logic and exit early. Fixes tests
breaking.

* Documentation fixup.

* Use the scraper http.Client when fetching images

Fold image fetchers onto the cached scraper http.Client as well. This
makes the scraper have a single http.Client cache for all its
operations.

Thread the client upwards to the relevant attachment points: either the
cache, or a stash_box instance, which is extended to include a pointer
to the client.

Style roughly follows that of txnManagers.

* Use the same http Client as the GraphQL client uses

Rather than using http.DefaultClient, use the same client as the
GraphQL client uses in the stash_box subsystem. This localizes the
client used in the subsystem into the constructing New.. call.

* Hoist HTTP client construction

Create a function for initializing the HTTP Client we use. While here
hoist magic numbers into constants. Introduce a proper static redirect
error and use it in the client code as well.

* Reinstate printCookies

This is a debugging function, and it might still come in handy in the
future at some point.

* Nitpick comment.

* Minor tidy

Co-authored-by: WithoutPants <53250216+WithoutPants@users.noreply.github.com>
2021-10-20 16:12:24 +11:00
SmallCoccinelle 655d3ae969
Toward better context handling (#1835)
* Use the request context

The code uses context.Background() in a flow where there is an
http.Request. Use the request's context instead.

* Use a true context in the plugin example

Let AddTag/RemoveTag take a context and use that context throughout
the example.

* Avoid the use of context.Background

Prefer context.TODO over context.Background deep in the call chain.

This marks the site as something which we need to context-handle
later, and also makes it clear to the reader that the context is
sort-of temporary in the code base.

While here, be consistent in handling the `act` variable in each
branch of the if .. { .. } .. check.

* Prefer context.TODO over context.Background

For the different scraping operations here, there is a context
higher up the call chain, which we ought to use. Mark the call-sites
as TODO for now, so we can come back later on a sweep of which parts
can be context-lifted.

* Thread context upwards

Initialization requires context for transactions. Thread the context
upward the call chain.

At the initialization call, add a context.TODO since we can't break this
yet. The singleton assumption prevents us from pulling it up into main for
now.

* make tasks context-aware

Change the task interface to understand contexts.

Pass the context down in some of the branches where it is needed.

* Make QueryStashBoxScene context-aware

This call naturally sits inside the request-context. Use it.

* Introduce a context in the JS plugin code

This allows us to use a context for HTTP calls inside the system.

Mark the context with a TODO at top level for now.

* Nitpick error formatting

Use %v rather than %s for error interfaces.
Do not begin an error string with a capital letter.

* Avoid the use of http.Get in FFMPEG download chain

Since http.Get has no context, it isn't possible to break out or have
policy induced. The call will block until the GET completes. Rewrite
to use a http Request and provide a context.

Thread the context through the call chain for now. Provide
context.TODO() at the top level of the initialization chain.

* Make getRemoteCDPWSAddress aware of contexts

Eliminate a call to http.Get and replace it with a context-aware
variant.

Push the context upwards in the call chain, but plug it before the
scraper interface so we don't have to rewrite said interface yet.

Plugged with context.TODO()

* Scraper: make the getImage function context-aware

Use a context, and pass it upwards. Plug it with context.TODO()
up the chain before the rewrite gets too much out of hand for now.

Minor tweaks along the way, remove a call to context.Background()
deep in the call chain.

* Make NOTIFY request context-aware

The call sits inside a Request-handler. So it's natural to use the
request's context as the context for the outgoing HTTP request.

* Use a context in the url scraper code

We are sitting in code which has a context, so utilize it for the
request as well.

* Use a context when checking versions

When we check the version of stash on Github, use a context. Thread
the context up to the initialization routine of the HTTP/GraphQL
server and plug it with a context.TODO() for now.

This paves the way for providing a context to the HTTP server code in a
future patch.

* Make utils func ReadImage context-aware

In almost all of the cases, there is a context in the call chain which
is a natural use. This is true for all the GraphQL mutations.

The exception is in task_stash_box_tag, so plug that task with
context.TODO() for now.

* Make stash-box get context-aware

Thread a context through the call chain until we hit the Client API.
Plug it with context.TODO() there for now.

* Enable the noctx linter

The code is now free of any uncontexted HTTP request. This means we
pass the noctx linter, and we can enable it in the code base.
2021-10-14 15:32:41 +11:00
Eng Zer Jun 62af723017
refactor: move from io/ioutil to io and os package (#1772)
The io/ioutil package has been deprecated as of Go 1.16, see
https://golang.org/doc/go1.16#ioutil. This commit replaces the existing
io/ioutil functions with their new definitions in io and os packages.

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
2021-09-27 10:55:23 +10:00
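For reference, the kind of replacement made in #1772 can be sketched as below, assuming Go 1.16 or later. The function `writeAndReadBack` is hypothetical; the trailing comments name the `io/ioutil` function that each call supersedes.

```go
package main

import (
	"io"
	"os"
)

// writeAndReadBack writes data to a temporary file and reads it back, using
// only the io and os replacements for the deprecated io/ioutil helpers.
func writeAndReadBack(data []byte) ([]byte, error) {
	f, err := os.CreateTemp("", "example-*") // was ioutil.TempFile
	if err != nil {
		return nil, err
	}
	name := f.Name()
	defer os.Remove(name)
	if err := f.Close(); err != nil {
		return nil, err
	}

	if err := os.WriteFile(name, data, 0o600); err != nil { // was ioutil.WriteFile
		return nil, err
	}

	in, err := os.Open(name)
	if err != nil {
		return nil, err
	}
	defer in.Close()
	return io.ReadAll(in) // was ioutil.ReadAll
}
```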
WithoutPants 4625e1f955
Unify scrape refactor (#1630)
* Unify scraped types
* Make name fields optional
* Unify single scrape queries
* Change UI to use new interfaces
* Add multi scrape interfaces
* Use images instead of image
2021-09-07 11:54:22 +10:00
WithoutPants f6ffda7504
Setup and migration UI refactor (#1190)
* Make config instance-based
* Remove config dependency in paths
* Refactor config init
* Allow startup without database
* Get system status at UI initialise
* Add setup wizard
* Cache and Metadata optional. Database mandatory
* Handle metadata not set during full import/export
* Add links
* Remove config check middleware
* Stash not mandatory
* Panic on missing mandatory config fields
* Redirect setup to main page if setup not required
* Add migration UI
* Remove unused stuff
* Move UI initialisation into App
* Don't create metadata paths on RefreshConfig
* Add folder selector for generated in setup
* Env variable to set and create config file.
Make docker images use a fixed config file.
* Set config file during setup
2021-04-12 09:31:33 +10:00
bnkai 68d4a4fe42
Add User Agent to image download reqs (#1222) 2021-03-24 08:12:11 +11:00
bnkai 144cd6e4f2
Skip insecure certificates check when scraping (#1120)
* Ignore insecure certificates when scraping
* add ScraperCertCheck to scraper config options
2021-03-01 11:47:39 +11:00
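A configuration switch like the `ScraperCertCheck` option from #1120 could be wired into an HTTP transport as sketched below. This is an assumption about the general approach, not the code from the commit; `newTransport` and its `certCheck` parameter are hypothetical. Disabling verification removes protection against man-in-the-middle attacks, so the sketch keeps it strictly opt-in.

```go
package main

import (
	"crypto/tls"
	"net/http"
)

// newTransport returns a transport that verifies server certificates when
// certCheck is true and skips verification when it is false.
func newTransport(certCheck bool) *http.Transport {
	return &http.Transport{
		TLSClientConfig: &tls.Config{
			// InsecureSkipVerify disables certificate chain and host name
			// checks; only set it when the user has turned checking off.
			InsecureSkipVerify: !certCheck,
		},
	}
}
```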
bnkai aecbd236bc
Tune image referrer path (#968) 2020-11-30 10:50:43 +11:00
woodgen 4045ddf3e9
Implement scraping movies by URL (#709)
* api/urlbuilders/movie: Auto format.

* graphql+pkg+ui: Implement scraping movies by URL.

This patch implements the missing required boilerplate for scraping
movies by URL, using performers and scenes as a reference.

Although this patch contains a big chunk of groundwork for enabling
scraping movies by fragment, the feature would require additional
changes to be completely implemented and was not tested.

* graphql+pkg+ui: Scrape movie studio.

Extends and corrects the movie model for the ability to store and
dereference studio IDs with received studio string from the scraper.
This was done with Scenes as a reference. For simplicity the duplication
of having `ScrapedMovieStudio` and `ScrapedSceneStudio` was kept, which
should probably be refactored to be the same type in the model in the
future.

* ui/movies: Add movie scrape dialog.

Adds possibility to update existing movie entries with the URL scraper.

For this the MovieScrapeDialog.tsx was implemented with Performers and
Scenes as a reference. In addition DurationUtils needs to be called one
time for converting seconds from the model to the string that is
displayed in the component. This seemed the least intrusive to me as it
kept a ScrapeResult<string> type compatible with ScrapedInputGroupRow.
2020-08-10 15:34:15 +10:00
WithoutPants 2b9215702e
Refactor xpath scraper code. Add fixed and map (#616)
* Refactor xpath scraper code
* Make post-process a list
* Add map post-process action
* Add fixed xpath values
* Refactor scrapers into cache
* Refactor into mapped config
* Trim test html
2020-07-21 14:06:25 +10:00
bnkai 56210cf456
Use referer on xpath getImage, apply printHTML to subscraper also (#661) 2020-07-10 08:42:06 +10:00
WithoutPants abf2b49803
Configurable scraper user agent string (#409)
* Add debug scrape option.

Co-authored-by: HiddenPants255 <>
2020-03-21 08:55:15 +11:00
WithoutPants 34d829338d
Add image scraping support (#370)
* Add sub-scraper functionality
* Add scraping of performer image
* Add scene cover image scraping
* Port UI changes to v2.5
* Fix v2.5 dialog suggest color
* Don't convert eol of UI to support pretty
2020-03-11 11:41:55 +11:00