package scraper

import (
	"context"
	"crypto/tls"
	"errors"
	"fmt"
	"net/http"
	"os"
	"path/filepath"
	"regexp"
	"strings"
	"time"

	"github.com/stashapp/stash/pkg/logger"
	stash_config "github.com/stashapp/stash/pkg/manager/config"
	"github.com/stashapp/stash/pkg/match"
	"github.com/stashapp/stash/pkg/models"
	"github.com/stashapp/stash/pkg/utils"
)

var ErrMaxRedirects = errors.New("maximum number of HTTP redirects reached")

const (
	// scrapeGetTimeout is the timeout for scraper HTTP requests. Includes transfer time.
	// We may want to bump this at some point and use local context-timeouts if more granularity
	// is needed.
	scrapeGetTimeout = time.Second * 60

	// maxIdleConnsPerHost is the maximum number of idle connections the HTTP client will
	// keep on a per-host basis.
	maxIdleConnsPerHost = 8

	// maxRedirects defines the maximum number of redirects the HTTP client will follow.
	maxRedirects = 20
)

// GlobalConfig contains the global scraper options.
type GlobalConfig interface {
	GetScraperUserAgent() string
	GetScrapersPath() string
	GetScraperCDPPath() string
	GetScraperCertCheck() bool
}

func isCDPPathHTTP(c GlobalConfig) bool {
	return strings.HasPrefix(c.GetScraperCDPPath(), "http://") || strings.HasPrefix(c.GetScraperCDPPath(), "https://")
}

func isCDPPathWS(c GlobalConfig) bool {
	return strings.HasPrefix(c.GetScraperCDPPath(), "ws://")
}

// Cache stores scraper details.
type Cache struct {
	client       *http.Client
	scrapers     []scraper
	globalConfig GlobalConfig
	txnManager   models.TransactionManager
}

// newClient creates a scraper-local http client we use throughout the scraper subsystem.
func newClient(gc GlobalConfig) *http.Client {
	client := &http.Client{
		Transport: &http.Transport{ // ignore insecure certificates
			TLSClientConfig:     &tls.Config{InsecureSkipVerify: !gc.GetScraperCertCheck()},
			MaxIdleConnsPerHost: maxIdleConnsPerHost,
		},
		Timeout: scrapeGetTimeout,
		// defaultCheckRedirect code with max changed from 10 to maxRedirects
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			if len(via) >= maxRedirects {
				return fmt.Errorf("after %d redirects: %w", maxRedirects, ErrMaxRedirects)
			}
			return nil
		},
	}

	return client
}
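
// The redirect cap above surfaces to callers as a wrapped ErrMaxRedirects inside
// the *url.Error returned by the client. isMaxRedirectsErr is an illustrative
// sketch (not part of the original file) of how a caller could detect that case;
// errors.Is unwraps both the url.Error and the fmt.Errorf wrapping.
func isMaxRedirectsErr(err error) bool {
	return errors.Is(err, ErrMaxRedirects)
}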

// NewCache returns a new Cache loading scraper configurations from the
// scraper path provided in the global config object. It returns a new
// instance and an error if the scraper directory could not be loaded.
//
// Scraper configurations are loaded from yml files in the provided scrapers
// directory and any subdirectories.
func NewCache(globalConfig GlobalConfig, txnManager models.TransactionManager) (*Cache, error) {
	// HTTP Client setup
	client := newClient(globalConfig)

	scrapers, err := loadScrapers(globalConfig, client, txnManager)
	if err != nil {
		return nil, err
	}

	return &Cache{
		client:       client,
		globalConfig: globalConfig,
		scrapers:     scrapers,
		txnManager:   txnManager,
	}, nil
}
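
// newCacheExample is an illustrative sketch (not part of the original file) of
// how the cache is typically wired up at startup: construct it once and share
// it, since the embedded *http.Client is safe for concurrent use. The cfg and
// txnManager values are assumed to come from the application's configuration
// and database layers.
func newCacheExample(cfg GlobalConfig, txnManager models.TransactionManager) (*Cache, error) {
	cache, err := NewCache(cfg, txnManager)
	if err != nil {
		return nil, fmt.Errorf("initializing scraper cache: %w", err)
	}
	logger.Debugf("scraper cache ready with %d scrapers", len(cache.scrapers))
	return cache, nil
}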

func loadScrapers(globalConfig GlobalConfig, client *http.Client, txnManager models.TransactionManager) ([]scraper, error) {
	path := globalConfig.GetScrapersPath()
	scrapers := make([]scraper, 0)

	logger.Debugf("Reading scraper configs from %s", path)

	scraperFiles := []string{}
	err := utils.SymWalk(path, func(fp string, f os.FileInfo, err error) error {
		if filepath.Ext(fp) == ".yml" {
			scraperFiles = append(scraperFiles, fp)
		}
		return nil
	})

	if err != nil {
		logger.Errorf("Error reading scraper configs: %s", err.Error())
		return nil, err
	}

	// add built-in freeones and auto-tag scrapers
	scrapers = append(scrapers, getFreeonesScraper(client, txnManager, globalConfig), getAutoTagScraper(txnManager, globalConfig))

	for _, file := range scraperFiles {
		c, err := loadConfigFromYAMLFile(file)
		if err != nil {
			logger.Errorf("Error loading scraper %s: %s", file, err.Error())
		} else {
			scraper := createScraperFromConfig(*c, client, txnManager, globalConfig)
			scrapers = append(scrapers, scraper)
		}
	}

	return scrapers, nil
}

// ReloadScrapers clears the scraper cache and reloads from the scraper path.
// In the event of an error during loading, the cache will be left empty.
func (c *Cache) ReloadScrapers() error {
	c.scrapers = nil
	scrapers, err := loadScrapers(c.globalConfig, c.client, c.txnManager)
	if err != nil {
		return err
	}

	c.scrapers = scrapers
	return nil
}

// TODO - don't think this is needed
// UpdateConfig updates the global config for the cache. If the scraper path
// has changed, ReloadScrapers will need to be called separately.
func (c *Cache) UpdateConfig(globalConfig GlobalConfig) {
	c.globalConfig = globalConfig
}

// ListPerformerScrapers returns a list of scrapers that are capable of
// scraping performers.
func (c Cache) ListPerformerScrapers() []*models.Scraper {
	var ret []*models.Scraper
	for _, s := range c.scrapers {
		// filter on type
		if s.Performer != nil {
			ret = append(ret, s.Spec)
		}
	}

	return ret
}

// ListSceneScrapers returns a list of scrapers that are capable of
// scraping scenes.
func (c Cache) ListSceneScrapers() []*models.Scraper {
	var ret []*models.Scraper
	for _, s := range c.scrapers {
		// filter on type
		if s.Scene != nil {
			ret = append(ret, s.Spec)
		}
	}

	return ret
}

// ListGalleryScrapers returns a list of scrapers that are capable of
// scraping galleries.
func (c Cache) ListGalleryScrapers() []*models.Scraper {
	var ret []*models.Scraper
	for _, s := range c.scrapers {
		// filter on type
		if s.Gallery != nil {
			ret = append(ret, s.Spec)
		}
	}

	return ret
}

// ListMovieScrapers returns a list of scrapers that are capable of
// scraping movies.
func (c Cache) ListMovieScrapers() []*models.Scraper {
	var ret []*models.Scraper
	for _, s := range c.scrapers {
		// filter on type
		if s.Movie != nil {
			ret = append(ret, s.Spec)
		}
	}

	return ret
}

// GetScraper returns the scraper matching the provided id.
func (c Cache) GetScraper(scraperID string) *models.Scraper {
	ret := c.findScraper(scraperID)
	if ret != nil {
		return ret.Spec
	}

	return nil
}

func (c Cache) findScraper(scraperID string) *scraper {
	for _, s := range c.scrapers {
		if s.ID == scraperID {
			return &s
		}
	}

	return nil
}

// ScrapePerformerList uses the scraper with the provided ID to query for
// performers using the provided query string. It returns a list of
// scraped performer data.
func (c Cache) ScrapePerformerList(scraperID string, query string) ([]*models.ScrapedPerformer, error) {
	// find scraper with the provided id
	s := c.findScraper(scraperID)
	if s != nil && s.Performer != nil {
		return s.Performer.scrapeByName(query)
	}

	return nil, errors.New("Scraper with ID " + scraperID + " not found")
}

// ScrapePerformer uses the scraper with the provided ID to scrape a
// performer using the provided performer fragment.
func (c Cache) ScrapePerformer(scraperID string, scrapedPerformer models.ScrapedPerformerInput) (*models.ScrapedPerformer, error) {
	// find scraper with the provided id
	s := c.findScraper(scraperID)
	if s != nil && s.Performer != nil {
		ret, err := s.Performer.scrapeByFragment(scrapedPerformer)
		if err != nil {
			return nil, err
		}

		if ret != nil {
			err = c.postScrapePerformer(context.TODO(), ret)
			if err != nil {
				return nil, err
			}
		}

		return ret, nil
	}

	return nil, errors.New("Scraper with ID " + scraperID + " not found")
}

// ScrapePerformerURL uses the first scraper it finds that matches the URL
// provided to scrape a performer. If no scrapers are found that match
// the URL, then nil is returned.
func (c Cache) ScrapePerformerURL(url string) (*models.ScrapedPerformer, error) {
	for _, s := range c.scrapers {
		if matchesURL(s.Performer, url) {
			ret, err := s.Performer.scrapeByURL(url)
			if err != nil {
				return nil, err
			}

			if ret != nil {
				err = c.postScrapePerformer(context.TODO(), ret)
				if err != nil {
					return nil, err
				}
			}

			return ret, nil
		}
	}

	return nil, nil
}
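
// scrapePerformerURLExample is an illustrative sketch (not part of the original
// file) of the URL dispatch above: the first scraper whose performer config
// matches the URL is used, and a nil result with a nil error means no scraper
// matched. The logging here is only for demonstration.
func scrapePerformerURLExample(c Cache, url string) {
	performer, err := c.ScrapePerformerURL(url)
	switch {
	case err != nil:
		logger.Errorf("scraping performer from %s: %v", url, err)
	case performer == nil:
		logger.Debugf("no performer scraper matched %s", url)
	default:
		logger.Debugf("scraped performer from %s", url)
	}
}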
|
2019-12-16 01:35:34 +00:00
|
|
|
|
Toward better context handling (#1835)
* Use the request context
The code uses context.Background() in a flow where there is a
http.Request. Use the requests context instead.
* Use a true context in the plugin example
Let AddTag/RemoveTag take a context and use that context throughout
the example.
* Avoid the use of context.Background
Prefer context.TODO over context.Background deep in the call chain.
This marks the site as something which we need to context-handle
later, and also makes it clear to the reader that the context is
sort-of temporary in the code base.
While here, be consistent in handling the `act` variable in each
branch of the if .. { .. } .. check.
* Prefer context.TODO over context.Background
For the different scraping operations here, there is a context
higher up the call chain, which we ought to use. Mark the call-sites
as TODO for now, so we can come back later on a sweep of which parts
can be context-lifted.
* Thread context upwards
Initialization requires context for transactions. Thread the context
upward the call chain.
At the intialization call, add a context.TODO since we can't break this
yet. The singleton assumption prevents us from pulling it up into main for
now.
* make tasks context-aware
Change the task interface to understand contexts.
Pass the context down in some of the branches where it is needed.
* Make QueryStashBoxScene context-aware
This call naturally sits inside the request-context. Use it.
* Introduce a context in the JS plugin code
This allows us to use a context for HTTP calls inside the system.
Mark the context with a TODO at top level for now.
* Nitpick error formatting
Use %v rather than %s for error interfaces.
Do not begin an error strong with a capital letter.
* Avoid the use of http.Get in FFMPEG download chain
Since http.Get has no context, it isn't possible to break out or have
policy induced. The call will block until the GET completes. Rewrite
to use a http Request and provide a context.
Thread the context through the call chain for now. provide
context.TODO() at the top level of the initialization chain.
* Make getRemoteCDPWSAddress aware of contexts
Eliminate a call to http.Get and replace it with a context-aware
variant.
Push the context upwards in the call chain, but plug it before the
scraper interface so we don't have to rewrite said interface yet.
Plugged with context.TODO()
* Scraper: make the getImage function context-aware
Use a context, and pass it upwards. Plug it with context.TODO()
up the chain before the rewrite gets too much out of hand for now.
Minor tweaks along the way, remove a call to context.Background()
deep in the call chain.
* Make NOTIFY request context-aware
The call sits inside a Request-handler. So it's natural to use the
requests context as the context for the outgoing HTTP request.
* Use a context in the url scraper code
We are sitting in code which has a context, so utilize it for the
request as well.
* Use a context when checking versions
When we check the version of stash on Github, use a context. Thread
the context up to the initialization routine of the HTTP/GraphQL
server and plug it with a context.TODO() for now.
This paves the way for providing a context to the HTTP server code in a
future patch.
* Make utils func ReadImage context-aware
In almost all of the cases, there is a context in the call chain which
is a natural use. This is true for all the GraphQL mutations.
The exception is in task_stash_box_tag, so plug that task with
context.TODO() for now.
* Make stash-box get context-aware
Thread a context through the call chain until we hit the Client API.
Plug it with context.TODO() there for now.
* Enable the noctx linter
The code is now free of any uncontexted HTTP request. This means we
pass the noctx linter, and we can enable it in the code base.
2021-10-14 04:32:41 +00:00
|
|
|
func (c Cache) postScrapePerformer(ctx context.Context, ret *models.ScrapedPerformer) error {
|
|
|
|
if err := c.txnManager.WithReadTxn(ctx, func(r models.ReaderRepository) error {
|
2021-03-10 01:25:51 +00:00
|
|
|
tqb := r.Tag()
|
|
|
|
|
2021-08-10 04:07:01 +00:00
|
|
|
tags, err := postProcessTags(tqb, ret.Tags)
|
|
|
|
if err != nil {
|
|
|
|
return err
|
2021-03-10 01:25:51 +00:00
|
|
|
}
|
2021-08-10 04:07:01 +00:00
|
|
|
ret.Tags = tags
|
2021-03-10 01:25:51 +00:00
|
|
|
|
|
|
|
return nil
|
|
|
|
}); err != nil {
|
|
|
|
return err
|
|
|
|
}
|
|
|
|
|
|
|
|
// post-process - set the image if applicable
|
	if err := setPerformerImage(ctx, c.client, ret, c.globalConfig); err != nil {
		logger.Warnf("Could not set image using URL %s: %s", *ret.Image, err.Error())
	}

	return nil
}
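
// postScrapeScenePerformer is the per-performer variant used while iterating a
// scraped scene; it has not been context-lifted yet, so the read transaction is
// still plugged with context.TODO().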
func (c Cache) postScrapeScenePerformer(ret *models.ScrapedPerformer) error {
	if err := c.txnManager.WithReadTxn(context.TODO(), func(r models.ReaderRepository) error {
		tqb := r.Tag()

		tags, err := postProcessTags(tqb, ret.Tags)
		if err != nil {
			return err
		}
		ret.Tags = tags

		return nil
	}); err != nil {
		return err
	}

	return nil
}
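
// postScrapeScene matches the scraped scene's performers, movies, tags and
// studio against the database, then fetches the scene image through the cached
// scraper client.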
func (c Cache) postScrapeScene(ctx context.Context, ret *models.ScrapedScene) error {
	if err := c.txnManager.WithReadTxn(ctx, func(r models.ReaderRepository) error {
		pqb := r.Performer()
		mqb := r.Movie()
		tqb := r.Tag()
		sqb := r.Studio()

		for _, p := range ret.Performers {
			if err := c.postScrapeScenePerformer(p); err != nil {
				return err
			}

			if err := match.ScrapedPerformer(pqb, p); err != nil {
				return err
			}
		}

		for _, p := range ret.Movies {
			err := match.ScrapedMovie(mqb, p)
			if err != nil {
				return err
			}
		}

		tags, err := postProcessTags(tqb, ret.Tags)
		if err != nil {
			return err
		}
		ret.Tags = tags

		if ret.Studio != nil {
			err := match.ScrapedStudio(sqb, ret.Studio)
			if err != nil {
				return err
			}
		}

		return nil
	}); err != nil {
		return err
	}

	// post-process - set the image if applicable
	if err := setSceneImage(ctx, c.client, ret, c.globalConfig); err != nil {
logger.Warnf("Could not set image using URL %s: %v", *ret.Image, err)
|
2020-03-11 00:41:55 +00:00
|
|
|
}
|
|
|
|
|
2019-12-16 01:35:34 +00:00
|
|
|
return nil
|
|
|
|
}
|
|
|
|
|
2020-10-20 22:24:32 +00:00
|
|
|
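
// postScrapeGallery does the same matching for galleries (performers, tags and
// studio); there is no gallery image to fetch afterwards.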
func (c Cache) postScrapeGallery(ret *models.ScrapedGallery) error {
	if err := c.txnManager.WithReadTxn(context.TODO(), func(r models.ReaderRepository) error {
		pqb := r.Performer()
		tqb := r.Tag()
		sqb := r.Studio()

		for _, p := range ret.Performers {
			err := match.ScrapedPerformer(pqb, p)
			if err != nil {
				return err
			}
		}

		tags, err := postProcessTags(tqb, ret.Tags)
		if err != nil {
			return err
		}
		ret.Tags = tags

		if ret.Studio != nil {
			err := match.ScrapedStudio(sqb, ret.Studio)
			if err != nil {
				return err
			}
		}

		return nil
	}); err != nil {
		return err
	}

	return nil
}
// ScrapeScene uses the scraper with the provided ID to scrape a scene using existing data.
func (c Cache) ScrapeScene(scraperID string, sceneID int) (*models.ScrapedScene, error) {
	// find scraper with the provided id
	s := c.findScraper(scraperID)
	if s != nil && s.Scene != nil {
		// get scene from id
		scene, err := getScene(sceneID, c.txnManager)
		if err != nil {
			return nil, err
		}

		ret, err := s.Scene.scrapeByScene(scene)
		if err != nil {
			return nil, err
		}

		if ret != nil {
			err = c.postScrapeScene(context.TODO(), ret)
			if err != nil {
				return nil, err
			}
		}

		return ret, nil
	}

	return nil, errors.New("Scraper with ID " + scraperID + " not found")
}
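
// Illustrative sketch, not part of the original file: a caller that already
// holds a request-scoped context could run the same post-processing without
// the context.TODO() plug used by ScrapeScene above. The wrapper name is an
// assumption made for the example.
func examplePostScrapeWithRequestContext(ctx context.Context, c Cache, ret *models.ScrapedScene) error {
	// Cancelling ctx should cancel the read transaction and the image fetch.
	return c.postScrapeScene(ctx, ret)
}
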
// ScrapeSceneQuery uses the scraper with the provided ID to query for
// scenes using the provided query string. It returns a list of
// scraped scene data.
func (c Cache) ScrapeSceneQuery(scraperID string, query string) ([]*models.ScrapedScene, error) {
	// find scraper with the provided id
	s := c.findScraper(scraperID)
	if s != nil && s.Scene != nil {
		return s.Scene.scrapeByName(query)
	}

	return nil, errors.New("Scraper with ID " + scraperID + " not found")
}
// ScrapeSceneFragment uses the scraper with the provided ID to scrape a scene.
func (c Cache) ScrapeSceneFragment(scraperID string, scene models.ScrapedSceneInput) (*models.ScrapedScene, error) {
	// find scraper with the provided id
	s := c.findScraper(scraperID)
	if s != nil && s.Scene != nil {
		ret, err := s.Scene.scrapeByFragment(scene)
		if err != nil {
			return nil, err
		}

		if ret != nil {
			err = c.postScrapeScene(context.TODO(), ret)
			if err != nil {
				return nil, err
			}
		}

		return ret, nil
	}

	return nil, errors.New("Scraper with ID " + scraperID + " not found")
}
// ScrapeSceneURL uses the first scraper it finds that matches the URL
// provided to scrape a scene. If no scrapers are found that match
// the URL, then nil is returned.
func (c Cache) ScrapeSceneURL(url string) (*models.ScrapedScene, error) {
	for _, s := range c.scrapers {
		if matchesURL(s.Scene, url) {
			ret, err := s.Scene.scrapeByURL(url)
			if err != nil {
				return nil, err
			}

			err = c.postScrapeScene(context.TODO(), ret)
			if err != nil {
				return nil, err
			}

			return ret, nil
		}
	}

	return nil, nil
}
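
// Illustrative sketch, not part of the original file: ScrapeSceneURL returns
// (nil, nil) when no configured scraper matches the URL, so callers need to
// distinguish "no match" from a real error. The wrapper name and error text
// are assumptions made for the example.
func exampleScrapeSceneByURL(c Cache, url string) (*models.ScrapedScene, error) {
	scene, err := c.ScrapeSceneURL(url)
	if err != nil {
		return nil, err
	}
	if scene == nil {
		return nil, errors.New("no scene scraper matched URL: " + url)
	}
	return scene, nil
}
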
// ScrapeGallery uses the scraper with the provided ID to scrape a gallery using existing data.
func (c Cache) ScrapeGallery(scraperID string, galleryID int) (*models.ScrapedGallery, error) {
	s := c.findScraper(scraperID)
	if s != nil && s.Gallery != nil {
		// get gallery from id
		gallery, err := getGallery(galleryID, c.txnManager)
		if err != nil {
			return nil, err
		}

		ret, err := s.Gallery.scrapeByGallery(gallery)
		if err != nil {
			return nil, err
		}

		if ret != nil {
			err = c.postScrapeGallery(ret)
			if err != nil {
				return nil, err
			}
		}

		return ret, nil
	}

	return nil, errors.New("Scraper with ID " + scraperID + " not found")
}
// ScrapeGalleryFragment uses the scraper with the provided ID to scrape a gallery.
func (c Cache) ScrapeGalleryFragment(scraperID string, gallery models.ScrapedGalleryInput) (*models.ScrapedGallery, error) {
	s := c.findScraper(scraperID)
	if s != nil && s.Gallery != nil {
		ret, err := s.Gallery.scrapeByFragment(gallery)
		if err != nil {
			return nil, err
		}

		if ret != nil {
			err = c.postScrapeGallery(ret)
			if err != nil {
				return nil, err
			}
		}

		return ret, nil
	}

	return nil, errors.New("Scraper with ID " + scraperID + " not found")
}
// ScrapeGalleryURL uses the first scraper it finds that matches the URL
// provided to scrape a gallery. If no scrapers are found that match
// the URL, then nil is returned.
func (c Cache) ScrapeGalleryURL(url string) (*models.ScrapedGallery, error) {
	for _, s := range c.scrapers {
		if matchesURL(s.Gallery, url) {
			ret, err := s.Gallery.scrapeByURL(url)
			if err != nil {
				return nil, err
			}

			err = c.postScrapeGallery(ret)
			if err != nil {
				return nil, err
			}

			return ret, nil
		}
	}

	return nil, nil
}
// ScrapeMovieURL uses the first scraper it finds that matches the URL
// provided to scrape a movie. If no scrapers are found that match
// the URL, then nil is returned.
func (c Cache) ScrapeMovieURL(url string) (*models.ScrapedMovie, error) {
	for _, s := range c.scrapers {
		if s.Movie != nil && matchesURL(s.Movie, url) {
			ret, err := s.Movie.scrapeByURL(url)
			if err != nil {
				return nil, err
			}

			if ret.Studio != nil {
				if err := c.txnManager.WithReadTxn(context.TODO(), func(r models.ReaderRepository) error {
					return match.ScrapedStudio(r.Studio(), ret.Studio)
				}); err != nil {
					return nil, err
				}
			}

			// post-process - set the image if applicable
			if err := setMovieFrontImage(context.TODO(), c.client, ret, c.globalConfig); err != nil {
				logger.Warnf("Could not set front image using URL %s: %s", *ret.FrontImage, err.Error())
			}
			if err := setMovieBackImage(context.TODO(), c.client, ret, c.globalConfig); err != nil {
				logger.Warnf("Could not set back image using URL %s: %s", *ret.BackImage, err.Error())
			}

			return ret, nil
		}
	}

	return nil, nil
}
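
// postProcessTags drops scraped tags that match the configured exclusion
// patterns and matches the remaining tags against existing tags in the
// database.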
func postProcessTags(tqb models.TagReader, scrapedTags []*models.ScrapedTag) ([]*models.ScrapedTag, error) {
	var ret []*models.ScrapedTag

	excludePatterns := stash_config.GetInstance().GetScraperExcludeTagPatterns()
	var excludeRegexps []*regexp.Regexp

	for _, excludePattern := range excludePatterns {
		reg, err := regexp.Compile(strings.ToLower(excludePattern))
		if err != nil {
			logger.Errorf("Invalid tag exclusion pattern: %v", err)
		} else {
			excludeRegexps = append(excludeRegexps, reg)
		}
	}

	var ignoredTags []string

ScrapeTag:
	for _, t := range scrapedTags {
		for _, reg := range excludeRegexps {
			if reg.MatchString(strings.ToLower(t.Name)) {
				ignoredTags = append(ignoredTags, t.Name)
				continue ScrapeTag
			}
		}

		err := match.ScrapedTag(tqb, t)
		if err != nil {
			return nil, err
		}
		ret = append(ret, t)
	}

	if len(ignoredTags) > 0 {
		logger.Infof("Scraping ignored tags: %s", strings.Join(ignoredTags, ", "))
	}

	return ret, nil
}
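
// Illustrative sketch, not part of the original file: the case-insensitive
// exclusion check that postProcessTags applies to each scraped tag name,
// extracted into a standalone helper. The helper name is an assumption made
// for the example.
func exampleTagIsExcluded(excludePatterns []string, tagName string) bool {
	for _, p := range excludePatterns {
		reg, err := regexp.Compile(strings.ToLower(p))
		if err != nil {
			// invalid patterns are logged and skipped in postProcessTags
			continue
		}
		if reg.MatchString(strings.ToLower(tagName)) {
			return true
		}
	}
	return false
}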