stash/pkg/scraper/action.go

package scraper
import (
	"context"
	"net/http"

	"github.com/stashapp/stash/pkg/models"
)

// scraperAction identifies the kind of action a scraper is configured with.
type scraperAction string

const (
	scraperActionScript scraperAction = "script"
	scraperActionStash  scraperAction = "stash"
	scraperActionXPath  scraperAction = "scrapeXPath"
	scraperActionJson   scraperAction = "scrapeJson"
)

// IsValid reports whether e is one of the known scraper actions.
func (e scraperAction) IsValid() bool {
	switch e {
	case scraperActionScript, scraperActionStash, scraperActionXPath, scraperActionJson:
		return true
	}
	return false
}

// scraperActionImpl is the set of scrape methods a scraper action provides.
type scraperActionImpl interface {
	scrapeByURL(ctx context.Context, url string, ty models.ScrapeContentType) (models.ScrapedContent, error)
	scrapeByName(ctx context.Context, name string, ty models.ScrapeContentType) ([]models.ScrapedContent, error)
	scrapeByFragment(ctx context.Context, input Input) (models.ScrapedContent, error)
scrapeSceneByScene(ctx context.Context, scene *models.Scene) (*models.ScrapedScene, error)
scrapeGalleryByGallery(ctx context.Context, gallery *models.Gallery) (*models.ScrapedGallery, error)
}
Cache and reuse the scraper HTTP client (#1855) * Add Cookies directly to the request Rather than maintaining a cookie jar on a one-shot HTTP client, maintain the jar ourselves: make a new jar, then use it to select the right cookies. The cookies are set on the request rather than on the client. This will retain the current behavior as we are always throwing the client away after each use. This patch enables the lifting of the http client as well over time. * Introduce a cached scraper HTTP client The scraper cache is augmented with an *http.Client. These are safe for concurrent use, so the pointer can safely be passed around. Push this into scraper configurations where applicable, next to the txnManagers. When we issue a loadUrl request, do so on the cached *http.Client, which will reuse existing idle connections in the client if any are present. * Set MaxIdleConnsPerHost. Closes #1850 We allow for up to 8 idle connections to a single host. This should make concurrent operation toward the same host reuse connections, even for sizeable concurrency. The number isn't bumped excessively high. We should probably limit concurrency toward a single site anyway, since we'll be able to overrun a site with queries quite easily if we have many concurrent goroutines issuing requests at the same time. * Reinstate driverOptions / useCDP check Use DeMorgan's laws to invert the logic and exit early. Fixes tests breaking. * Documentation fixup. * Use the scraper http.Client when fetching images Fold image fetchers onto the cached scraper http.Client as well. This makes the scraper have a single http.Client cache for all its operations. Thread the client upwards to the relevant attachment points: either the cache, or a stash_box instance, which is extended to include a pointer to the client. Style roughly follows that of txnManagers. 
* Use the same http Client as the GraphQL client uses Rather than using http.DefaultClient, use the same client as the GraphQL client uses in the stash_box subsystem. This localizes the client used in the subsystem into the constructing New.. call. * Hoist HTTP client construction Create a function for initializing the HTTP Client we use. While here, hoist magic numbers into constants. Introduce a proper static redirect error and use it in the client code as well. * Reinstate printCookies This is a debugging function, and it might still come in handy in the future at some point. * Nitpick comment. * Minor tidy Co-authored-by: WithoutPants <53250216+WithoutPants@users.noreply.github.com>
2021-10-20 05:12:24 +00:00
func (c config) getScraper(scraper scraperTypeConfig, client *http.Client, txnManager models.TransactionManager, globalConfig GlobalConfig) scraperActionImpl {
switch scraper.Action {
case scraperActionScript:
return newScriptScraper(scraper, c, globalConfig)
case scraperActionStash:
return newStashScraper(scraper, client, txnManager, c, globalConfig)
case scraperActionXPath:
return newXpathScraper(scraper, client, txnManager, c, globalConfig)
case scraperActionJson:
return newJsonScraper(scraper, client, txnManager, c, globalConfig)
}
panic("unknown scraper action: " + scraper.Action)
}