Refactor scraper top half (#1893)
* Simplify scraper listing
Introduce an enum, scraper.Kind, which explains what we are looking
for. Make it possible to match this from a scraper struct.
Use the enum to rewrite all the listing code to use the same code path.
* Use a map, nitpick ScrapePerformerList
Let the cache store a map from ID of a scraper to the scraper. This
improves lookups when there are many scrapers, making it practically
O(1) rather than O(n). If many scrapers are stored, this is faster.
Since range expressions work unchanged, we don't have to change much,
and things will still work.
make Kind a Stringer
Rename ScraperPerformerList -> ScraperPerformerQuery since that name
is used in the other scrapers, and we value consistency.
Tune ScraperPerformerQuery:
* Return static errors
* Use the new functionality
* When loading scrapers, do so directly
Rather than first walking the directory structure to obtain file paths,
fold the load directly into the filepath walk. This makes the code
far more direct.
* Use static ErrNotFound
If a scraper isn't found, return one static error. This paves the way
for eventually doing our own error-presenter in gqlgen.
* Store the cache in the Resolver state
Putting the scraperCache directly in the resolver avoids the need to
call manager.GetInstance() all over the place to get access to the
scraper cache. The cache is stored by pointer, so it should be safe,
since the cache will just update its internal state rather than being
overwritten.
We can now utilize the resolver state to grab the cache where needed.
While here, pass context.Context from the resolver down into a function,
which removes a context.TODO()
* Introduce ScrapedContent
Create a union in the GraphQL schema for all scraped content. This
simplifies the internal implementation because we get variance on
the output content type.
Introduce a new type ScrapedContentType which signifies the scraped
content you want as a caller.
Use these to generalize the List interface and the URL scraping
interface.
* Simplify the scraper API
Introduce a new interface for scraping. This interface is then
used in the upper half of the scraper code, to make the code use one
code flow rather than multiple code flows. Variance is currently at
the old scraper structure.
Add extending interfaces for the different ways of invoking scrapes.
Use interface conversions to convert a scraper from the cache to a
scraper supporting the extra methods.
The return path returns models.ScrapedContent.
Write a general postProcess function in the scraper, handling all
ScrapedContent via type switching. This consolidates all postprocessing
code flows.
Introduce marshallers in the resolver code for converting ScrapedContent
into the underlying concrete types. Use this to plug the existing
fields in the Query resolver, so everything still works.
* ScrapedContent: add more marshalling functions
Handle all marshalling of ScrapedContent through marshalling functions.
Remove some hand-rolled early variants of it, and replace them with
a canonical code flow.
* Support loadByName via scraper_s
In order to temporarily plug a hole in the current implementation, we
use the older implementation as a hook to get the newer implementation
to run.
Later on, this can serve as a guide for how to implement the lower level
bits inside the scrapers themselves. For now, it just enables support.
* Plug the remaining scraper functions for now
Since we would like to have a scraper which works in between refactors,
plug the lower level parts of the scraper for now. It avoids us having
to tackle this part just yet.
* Move postprocessing to its own file
There's enough postprocessing to clutter the main scrapers.go file.
Move all of this into a new file, postprocessing, to make the API
simpler. It now lives in scrapers.go.
* Scraper: Invoke API consistency
scraper.Cache.ScrapeByName -> ScrapeName
* Fix scraping scenes by URL
Simple typo. While here, also make a single marshaller nil-aware.
* Introduce scraper groups, consolidate loadByURL
Rename `scraper_s` into `group`. A group is a group of scrapers with
the same identity. This corresponds to a single YAML file for a scraper
configuration. It defines a group which supports different types of
scraping contexts.
Move config into the group, and lift txnManager and globalConfig to
the group.
Because we now return models.ScrapedContent we can use interfaces to
get variance from the different underlying scrapers. Use a type
switch for the URL matcher candidates. And then again for the scrapers.
This consolidates all URL scraping paths into one.
While here, remove the urlMatcher interface which isn't needed. Also
clean up the remaining interfaces for url scraping and delete code
which has no purpose anymore.
* Consolidate fragment scraping in one code path
While here, abide by the linter's checks.
* Refactor loadByFragment
Give it the same treatment as loadByURL:
Step 1: find a scraperActionImpl which works for the data.
Step 2: use that to scrape
Most of this is simple analysis on the data at hand. It can be pushed
down further in a later commit, but for now we leave it here.
* Remove configScraper, autotag is a scraper
Remove the remains of the configScraper struct. It now lives on in the
group struct. Kill the remaining interfaces from the old implementation
while here.
Remove group.specification since it can now be handled by a simple
func call to spec().
Work through the autotag scraper. It now implements the scraper
interface, so it can be used as a scraper. This also simplifies the
autotag scraper quite a bit since it doesn't have to implement a number
of unsupported func calls.
* Simplify the fragment scraper flow
* Pass the context
Eliminate a round of context.TODO() in the scraper code by passing
the calling context down into the subsystem. This will gracefully
allow for termination of remote calls if the client goes away for some
reason in GraphQL requests.
* Improve listScrapers in the schema
Support lists of types we accept.
* Be graceful on nil values in conversion
Supporting nil values makes the API more robust in the
case of partial results in a multi-scrape situation.
* Improve listScrapers: output at-most-once
Use the ID of a scraper to reduce the output set. If a scraper has
been included, don't include it again.
* Consolidate all API level errors into resolver.go
* Reorder files and functions:
scrapers.go -> cache.go:
It almost contains nothing but the cache code.
Move errors into scraper.go from here because
it is a better place to have them living right now.
group.go:
All of the group structure. This can now go from
scraper.go, making it more lean. Move group create
from config_scraper to here.
config.go:
Move the `(c config) spec()` call to here.
config_scraper.go:
Empty file by now
* Name-update the scraper interfaces
Use 'via' rather than 'loadBy'.
The scrape happens via a given scrape method, so I think this is a nice
name for it.
* Rename scrapers for consistency.
While here, improve the error formatting, so different errors come
back differently.
* Nuke the freeones field from the GraphQL schema
* Fix autotag interfacing, refactor
The autotag scraper uses a pointer receiver, but the rest of the code
we use for scraping doesn't expect a pointer-receiver. Hence, to fix
the autotag scraper, we change it to be a value receiver, like the
rest of the code.
Fix: viaScene, and viaGallery.
While here, remove a couple of pointer-receiver methods which can be
trivially rewritten into plain functions.
* Protect against pointer interfaces
The underlying code can be a bit inconsistent in what it returns.
Introduce pointer-types in the postprocessing layer and handle them
accordingly for now. Once the lower levels
are better understood, we can lift this.
* Move ErrConversion into the models package.
The conversion error pertains to the logic of converting models.
Because of this, it should move there, so it is centralized.
* Be consistent in scraper resolver error handling
If we have a static error
Err = errors.New(..)
Then use it wrapped at the start:
fmt.Errorf("%w: ...context...", Err)
This reads better.
While here, avoid using the underlying Atoi errors: they are verbose,
and like 99% of the time, the user knows what is wrong from the input
string, so just give that back.
Also, remove the scraper id from the error contexts: it is implicit,
and the error wouldn't change if we used a different scraper, which
the error message would imply.
* Mark the list*Scrapers() API as deprecated
The same functionality is now present in listScrapers.
* Improve error formatting
Think about how each error is going to be used and tweak them to be
nicer.
* Return a sorted list of scrapers
This helps testing, it's closer to what we had, caches like stable data,
and it is easier for humans. It also makes the output stable, because
map iteration is randomized.
* Fix listScrapers calls to return in ID-order
Since we need the ordering to be by ID in all situations, it is easier
to just generalize the cache listScrapers call to support multiple
scraper types.
This avoids a de-dupe map up the chain, since every scraper is only
considered once. Sorting now happens in the cache listScrapers call.
Use this generalized function in all resolvers, which are now simple
passthroughs.
* Remove UpdateConfig from the scraper cache.
This isn't needed, so get rid of it.
* Pull a context into identify
Scraping scenes in the identify tasks now uses a context from up the
call chain.
* Do not store the scraper cache in the resolver.
Scraper caches are updated through
manager.singleton.RefreshScraperCache, so we can't keep a pointer to
it in the resolver. Instead, solve this by adding a fetcher method to
the resolver type. This keeps it local to the resolver, while handling
the problem of updating caches in the configuration.
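The "Return a sorted list of scrapers" item can be illustrated with a minimal sketch. Go randomizes map iteration order, so a cache keyed by scraper ID has to sort before it returns a listing. The names scraperInfo and listSorted are hypothetical and are not taken from the stash code base.

```go
package main

import (
	"fmt"
	"sort"
)

// scraperInfo is a hypothetical stand-in for whatever the cache
// stores per scraper; only the ID matters for ordering.
type scraperInfo struct {
	ID   string
	Name string
}

// listSorted copies the map values into a slice and orders them by ID.
// Each map entry is visited exactly once, so no de-dupe step is needed,
// and the result no longer depends on map iteration order.
func listSorted(cache map[string]scraperInfo) []scraperInfo {
	out := make([]scraperInfo, 0, len(cache))
	for _, s := range cache {
		out = append(out, s)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out
}

func main() {
	cache := map[string]scraperInfo{
		"b": {ID: "b", Name: "Second"},
		"a": {ID: "a", Name: "First"},
	}
	for _, s := range listSorted(cache) {
		fmt.Println(s.ID, s.Name)
	}
}
```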
2021-11-18 23:55:34 +00:00
package scraper
import (
"context"
"crypto/tls"
"fmt"
"net/http"
"os"
"path/filepath"
"sort"
"strings"
"time"
2022-03-17 00:33:59 +00:00
"github.com/stashapp/stash/pkg/fsutil"
"github.com/stashapp/stash/pkg/logger"
2022-05-19 07:49:32 +00:00
"github.com/stashapp/stash/pkg/match"
"github.com/stashapp/stash/pkg/models"
2022-05-19 07:49:32 +00:00
"github.com/stashapp/stash/pkg/txn"
|
)

const (
	// scrapeGetTimeout is the timeout for scraper HTTP requests. Includes transfer time.
	// We may want to bump this at some point and use local context-timeouts if more granularity
	// is needed.
	scrapeGetTimeout = time.Second * 60

	// maxIdleConnsPerHost is the maximum number of idle connections the HTTP client will
	// keep on a per-host basis.
	maxIdleConnsPerHost = 8

	// maxRedirects defines the maximum number of redirects the HTTP client will follow
	maxRedirects = 20
)

// GlobalConfig contains the global scraper options.
type GlobalConfig interface {
	GetScraperUserAgent() string
	GetScrapersPath() string
	GetScraperCDPPath() string
	GetScraperCertCheck() bool
	GetPythonPath() string
	GetProxy() string
}

func isCDPPathHTTP(c GlobalConfig) bool {
	return strings.HasPrefix(c.GetScraperCDPPath(), "http://") || strings.HasPrefix(c.GetScraperCDPPath(), "https://")
}

func isCDPPathWS(c GlobalConfig) bool {
	return strings.HasPrefix(c.GetScraperCDPPath(), "ws://")
}
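
A small, self-contained sketch of how the two prefix checks classify a configured CDP path. The reduced interface `cdpPathGetter`, the stub `staticConfig`, and the helper `describe` are hypothetical names introduced for this example; the real `GlobalConfig` has more methods.

```go
package main

import (
	"fmt"
	"strings"
)

// cdpPathGetter is a hypothetical, reduced stand-in for GlobalConfig,
// keeping only the method the two checks need.
type cdpPathGetter interface {
	GetScraperCDPPath() string
}

// staticConfig is a hypothetical stub that returns a fixed path.
type staticConfig string

func (s staticConfig) GetScraperCDPPath() string { return string(s) }

func isCDPPathHTTP(c cdpPathGetter) bool {
	p := c.GetScraperCDPPath()
	return strings.HasPrefix(p, "http://") || strings.HasPrefix(p, "https://")
}

func isCDPPathWS(c cdpPathGetter) bool {
	return strings.HasPrefix(c.GetScraperCDPPath(), "ws://")
}

// describe is a hypothetical helper that names the resulting category.
func describe(c cdpPathGetter) string {
	switch {
	case isCDPPathHTTP(c):
		return "remote (http)"
	case isCDPPathWS(c):
		return "remote (websocket)"
	default:
		return "local path"
	}
}

func main() {
	paths := []string{
		"http://localhost:9222",
		"ws://localhost:9222/devtools/browser",
		"/usr/bin/chromium",
	}
	for _, p := range paths {
		fmt.Printf("%s -> %s\n", p, describe(staticConfig(p)))
	}
}
```

Note that the checks are plain prefix matches: as written, a `wss://` address matches neither function and would fall through to the default branch in this sketch.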

type SceneFinder interface {
	models.SceneGetter
	models.URLLoader
	models.VideoFileLoader
}

type PerformerFinder interface {
	models.PerformerAutoTagQueryer
	match.PerformerFinder
}

type StudioFinder interface {
	models.StudioAutoTagQueryer
	FindByStashID(ctx context.Context, stashID models.StashID) ([]*models.Studio, error)
}

type TagFinder interface {
	models.TagGetter
	models.TagAutoTagQueryer
}

type GalleryFinder interface {
	models.GalleryGetter
	models.FileLoader
	models.URLLoader
}

type Repository struct {
	TxnManager models.TxnManager

	SceneFinder     SceneFinder
	GalleryFinder   GalleryFinder
	TagFinder       TagFinder
	PerformerFinder PerformerFinder
	GroupFinder     match.GroupNamesFinder
	StudioFinder    StudioFinder
}

func NewRepository(repo models.Repository) Repository {
	return Repository{
		TxnManager:      repo.TxnManager,
		SceneFinder:     repo.Scene,
		GalleryFinder:   repo.Gallery,
		TagFinder:       repo.Tag,
		PerformerFinder: repo.Performer,
		GroupFinder:     repo.Group,
		StudioFinder:    repo.Studio,
	}
}

func (r *Repository) WithReadTxn(ctx context.Context, fn txn.TxnFunc) error {
	return txn.WithReadTxn(ctx, r.TxnManager, fn)
}

// Cache stores the database of scrapers
type Cache struct {
	client       *http.Client
	scrapers     map[string]scraper // Scraper ID -> Scraper
	globalConfig GlobalConfig

	repository Repository
scraper supporting the extra methods.
The return path returns models.ScrapedContent.
Write a general postProcess function in the scraper, handling all
ScrapedContent via type switching. This consolidates all postprocessing
code flows.
Introduce marhsallers in the resolver code for converting ScrapedContent
into the underlying concrete types. Use this to plug the existing
fields in the Query resolver, so everything still works.
* ScrapedContent: add more marshalling functions
Handle all marshalling of ScrapedContent through marhsalling functions.
Removes some hand-rolled early variants of it, and replaces it with
a canonical code flow.
* Support loadByName via scraper_s
In order to temporarily plug a hole in the current implementation, we
use the older implementation as a hook to get the newer implementation
to run.
Later on, this can serve as a guide for how to implement the lower level
bits inside the scrapers themselves. For now, it just enables support.
* Plug the remaining scraper functions for now
Since we would like to have a scraper which works in between refactors,
plug the lower level parts of the scraper for now. It avoids us having
to tackle this part just yet.
* Move postprocessing to its own file
There's enough postprocessing to clutter the main scrapers.go file.
Move all of this into a new file, postprocessing to make the API
simpler. It now lives in scrapers.go.
* Scraper: Invoke API consistency
scraper.Cache.ScrapeByName -> ScrapeName
* Fix scraping scenes by URL
Simple typo. While here, also make a single marshaller nil-aware.
* Introduce scraper groups, consolidate loadByURL
Rename `scraper_s` into `group`. A group is a group of scrapers with
the same identity. This corresponds to a single YAML file for a scraper
configuration. It defines a group which supports different types of
scraping contexts.
Move config into the group, and lift txnManager and globalConfig to
the group.
Because we now return models.ScrapedContent we can use interfaces to
get variance from the different underlying scrapers. Use a type
switch for the URL matcher candidates. And then again for the scrapers.
This consolidates all URL scraping paths into one.
While here, remove the urlMatcher interface which isn't needed. Also
clean up the remaining interfaces for url scraping and delete code
which has no purpose anymore.
* Consolidate fragment scraping in one code path
While here, abide the linters checks.
* Refactor loadByFragment
Give it the same treatment as loadByURL:
Step 1: find a scraperActionImpl which works for the data.
Step 2: use that to scrape
Most of this is simple analysis on the data at hand. It can be pushed
down further in a later commit, but for now we leave it here.
* Remove configScraper, autotag is a scraper
Remove the remains of the configScraper struct. It now lives on in the
group struct. Kill the remaining interfaces from the old implementation
while here.
Remove group.specification since it can now be handled by a simple
func call to spec().
Work through the autotag scraper. It now implements the scraper
interface, so it can be used as a scraper. This also simplifies the
autotag scraper quite a bit since it doens't have to implement a number
of unsupported func calls.
* Simplify the fragment scraper flow
* Pass the context
Eliminate a round of context.TODO() in the scraper code by passing
the calling context down into the subsystem. This will gracefully
allow for termination of remote calls if the client goes away for some
reason in GraphQL requests.
* Improve listScrapers in the schema
Support lists of types we accept.
* Be graceful on nil values in conversion
Supporting nil-values make the API more robust in the
case of partial results in a multi-scrape situation.
* Improve listScrapers: output at-most-once
Use the ID of a scraper to reduce the output set. If a scraper has
been included, don't include it again.
* Consolidate all API level errors into resolver.go
* Reorder files and functions:
scrapers.go -> cache.go:
It almost contains nothing but the cache code.
Move errors into scraper.go from here because
It is a better place to have them living right now
group.go:
All of the group structure. This can now go from
scraper.go, making it more lean. Move group create
from config_scraper to here.
config.go:
Move the `(c config) spec()` call to here.
config_scraper.go:
Empty file by now
* Name-update the scraper interfaces
Use 'via' rather than 'loadBy'.
The scrape happens via a given scrape method, so I think this is a nice
name for it.
* Rename scrapers for consistency.
While here, improve the error formatting, so different errors come
back differently.
* Nuke the freeones field from the GraphQL schema
* Fix autotag interfacing, refactor
The autotag scraper uses a pointer receiver, but the rest of the code
we use for scraping doesn't expect a pointer-receiver. Hence, to fix
the autotag scraper, we change it to be a value receiver, like the
rest of the code.
Fix: viaScene, and viaGallery.
While here, remove a couple of pointer-receiver methods which can be
trivially rewritten into plain functions.
* Protect against pointer interfaces
The underlying code can be a bit inconsistent in what it returns.
Introduce pointer-types in the postprocessing layer and handle them
accordingly for now. Once a better understanding of the lower levels
are understood, we can lift this.
* Move ErrConversion into the models package.
The conversion error pertains to the logic of converting models.
Because of this, it should move there, so it is centralized.
* Be consistent in scraper resolver error handling
If we have a static error
Err = errors.New(..)
Then use it wrapped at the start:
fmt.Errorf("%w: ...context...", Err)
This reads better.
While here, avoid using the underlying Atoi errors: they are verbose,
and like 99% of the time, the user know what is wrong from the input
string, so just give that back.
Also, remove the scraper id from the error contexts: it is implicit,
and the error wouldn't change if we used a different scraper, which
the error message would imply.
* Mark the list*Scrapers() API as deprecated
The same functionality is now present in listScrapers.
* Improve error formatting
Think about how each error is going to be used and tweak them to be
nicer.
* Return a sorted list of scrapers
This helps testing, it's closer to what we had, caches like stable data,
and it is easier for humans. It also makes the output stable, because
map iteration is randomized.
* Fix listScrapers calls to return in ID-order
Since we need the ordering to be by ID in all situations, it is easier
to just generalize the cache listScrapers call to support multiple
scraper types.
This avoids a de-dupe map up the chain, since every scraper is only
considered once. Sorting now happens in the cache listScrapers call.
Use this generalized function in all resolvers, which are now simple
passthroughs.
* Remove UpdateConfig from the scraper cache.
This isn't needed, so get rid of it.
* Pull a context into identify
Scraping scenes in the identify tasks now use a context from up the
call chain.
* Do not store the scraper cache in the resolver.
Scraper caches are updated through
manager.singleton•RefreshScraperCache, so we can't keep a pointer to
it in the resolver. Instead, solve this by adding a fetcher method to
the resolver type. This keeps it local to the resolver, while handling
the problem of updating caches in the configuration.
2021-11-18 23:55:34 +00:00
|
|
|
}
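The `scrapers` field maps a scraper ID to its scraper, and Go randomizes map iteration order, so any listing built from the map has to be sorted to be stable. The following is a minimal sketch of that idea, not the cache's actual code: the `entry` type and the `sortedIDs` helper are hypothetical names introduced only for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is a hypothetical stand-in for the scraper type stored in the cache.
type entry struct {
	ID   string
	Name string
}

// sortedIDs returns the keys of m in ascending order, so callers see a
// stable listing even though Go randomizes map iteration order.
func sortedIDs(m map[string]entry) []string {
	ids := make([]string, 0, len(m))
	for id := range m {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	scrapers := map[string]entry{
		"b": {ID: "b", Name: "second"},
		"a": {ID: "a", Name: "first"},
	}

	// Lookup by ID is a single map access.
	if s, ok := scrapers["a"]; ok {
		fmt.Println(s.Name) // prints "first"
	}

	// Listing goes through the sorted keys.
	fmt.Println(sortedIDs(scrapers)) // prints "[a b]"
}
```

Sorting once at the point where the list is produced keeps the map as the single source of truth while still giving tests and callers a deterministic order.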
// newClient creates a scraper-local http client we use throughout the scraper subsystem.
func newClient(gc GlobalConfig) *http.Client {
	client := &http.Client{
		Transport: &http.Transport{ // certificate verification is skipped only when the scraper cert check is disabled
			TLSClientConfig:     &tls.Config{InsecureSkipVerify: !gc.GetScraperCertCheck()},
			MaxIdleConnsPerHost: maxIdleConnsPerHost,
			Proxy:               http.ProxyFromEnvironment,
		},
		Timeout: scrapeGetTimeout,
		// defaultCheckRedirect code with max changed from 10 to maxRedirects
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			if len(via) >= maxRedirects {
				return fmt.Errorf("%w: gave up after %d redirects", ErrMaxRedirects, maxRedirects)
			}
			return nil
		},
	}

	return client
}
// NewCache returns a new Cache.
//
// Scraper configurations are loaded from yml files in the scrapers
// directory in the config and any subdirectories.
//
// Does not load scrapers. Scrapers will need to be
// loaded explicitly using ReloadScrapers.
func NewCache(globalConfig GlobalConfig, repo Repository) *Cache {
	// HTTP Client setup
	client := newClient(globalConfig)

	return &Cache{
		client:       client,
		globalConfig: globalConfig,
		repository:   repo,
	}
}

// ReloadScrapers clears the scraper cache and reloads from the scraper path.
// If a scraper cannot be loaded, an error is logged and the scraper is skipped.
func (c *Cache) ReloadScrapers() {
	path := c.globalConfig.GetScrapersPath()
	scrapers := make(map[string]scraper)

	// Add built-in scrapers
	freeOnes := getFreeonesScraper(c.globalConfig)
	autoTag := getAutoTagScraper(c.repository, c.globalConfig)
|
Refactor scraper top half (#1893)
* Simplify scraper listing
Introduce an enum, scraper.Kind, which explains what we are looking
for. Make it possible to match this from a scraper struct.
Use the enum to rewrite all the listing code to use the same code path.
* Use a map, nitpick ScrapePerformerList
Let the cache store a map from ID of a scraper to the scraper. This
improves lookups when there are many scrapers, making it practically
O(1) rather than O(n). If many scrapers are stored, this is faster.
Since range expressions work unchanged, we don't have to change much,
and things will still work.
make Kind a Stringer
Rename ScraperPerformerList -> ScraperPerformerQuery since that name
is used in the other scrapers, and we value consistency.
Tune ScraperPerformerQuery:
* Return static errors
* Use the new functionality
* When loading scrapers, do so directly
Rather than first walking the directory structure to obtain file paths,
fold the load directly in the the filepath walk. This makes the code
for more direct.
* Use static ErrNotFound
If a scraper isn't found, return one static error. This paves the way
for eventually doing our own error-presenter in gqlgen.
* Store the cache in the Resolver state
Putting the scraperCache directly in the resolver avoids the need to
call manager.GetInstance() all over the place to get access to the
scraper cache. The cache is stored by pointer, so it should be safe,
since the cache will just update its internal state rather than being
overwritten.
We can now utilize the resolver state to grab the cache where needed.
While here, pass context.Context from the resolver down into a function,
which removes a context.TODO()
* Introduce ScrapedContent
Create a union in the GraphQL schema for all scraped content. This
simplifies the internal implementation because we get variance on
the output content type.
Introduce a new type ScrapedContentType which signifies the scraped
content you want as a caller.
Use these to generalize the List interface and the URL scraping
interface.
* Simplify the scraper API
Introduce a new interface for scraping. This interface is then
used in the upper half of the scraper code, to make the code use one
code flow rather than multiple code flows. Variance is currently at
the old scraper structure.
Add extending interfaces for the different ways of invoking scrapes.
Use interface conversions to convert a scraper from the cache to a
scraper supporting the extra methods.
The return path returns models.ScrapedContent.
Write a general postProcess function in the scraper, handling all
ScrapedContent via type switching. This consolidates all postprocessing
code flows.
Introduce marhsallers in the resolver code for converting ScrapedContent
into the underlying concrete types. Use this to plug the existing
fields in the Query resolver, so everything still works.
* ScrapedContent: add more marshalling functions
Handle all marshalling of ScrapedContent through marshalling functions.
Removes some hand-rolled early variants of it, and replaces it with
a canonical code flow.
* Support loadByName via scraper_s
In order to temporarily plug a hole in the current implementation, we
use the older implementation as a hook to get the newer implementation
to run.
Later on, this can serve as a guide for how to implement the lower level
bits inside the scrapers themselves. For now, it just enables support.
* Plug the remaining scraper functions for now
Since we would like to have a scraper which works in between refactors,
plug the lower level parts of the scraper for now. It avoids us having
to tackle this part just yet.
* Move postprocessing to its own file
There's enough postprocessing to clutter the main scrapers.go file.
Move all of this into a new file, postprocessing, to make the API
simpler. It now lives in scrapers.go.
* Scraper: Invoke API consistency
scraper.Cache.ScrapeByName -> ScrapeName
* Fix scraping scenes by URL
Simple typo. While here, also make a single marshaller nil-aware.
* Introduce scraper groups, consolidate loadByURL
Rename `scraper_s` into `group`. A group is a group of scrapers with
the same identity. This corresponds to a single YAML file for a scraper
configuration. It defines a group which supports different types of
scraping contexts.
Move config into the group, and lift txnManager and globalConfig to
the group.
Because we now return models.ScrapedContent we can use interfaces to
get variance from the different underlying scrapers. Use a type
switch for the URL matcher candidates. And then again for the scrapers.
This consolidates all URL scraping paths into one.
While here, remove the urlMatcher interface which isn't needed. Also
clean up the remaining interfaces for url scraping and delete code
which has no purpose anymore.
* Consolidate fragment scraping in one code path
While here, abide by the linter's checks.
* Refactor loadByFragment
Give it the same treatment as loadByURL:
Step 1: find a scraperActionImpl which works for the data.
Step 2: use that to scrape
Most of this is simple analysis on the data at hand. It can be pushed
down further in a later commit, but for now we leave it here.
* Remove configScraper, autotag is a scraper
Remove the remains of the configScraper struct. It now lives on in the
group struct. Kill the remaining interfaces from the old implementation
while here.
Remove group.specification since it can now be handled by a simple
func call to spec().
Work through the autotag scraper. It now implements the scraper
interface, so it can be used as a scraper. This also simplifies the
autotag scraper quite a bit since it doesn't have to implement a number
of unsupported func calls.
* Simplify the fragment scraper flow
* Pass the context
Eliminate a round of context.TODO() in the scraper code by passing
the calling context down into the subsystem. This will gracefully
allow for termination of remote calls if the client goes away for some
reason in GraphQL requests.
* Improve listScrapers in the schema
Support lists of types we accept.
* Be graceful on nil values in conversion
Supporting nil-values makes the API more robust in the
case of partial results in a multi-scrape situation.
* Improve listScrapers: output at-most-once
Use the ID of a scraper to reduce the output set. If a scraper has
been included, don't include it again.
* Consolidate all API level errors into resolver.go
* Reorder files and functions:
scrapers.go -> cache.go:
It contains almost nothing but the cache code.
Move errors into scraper.go from here because
it is a better place to have them living right now.
group.go:
All of the group structure. This can now go from
scraper.go, making it more lean. Move group create
from config_scraper to here.
config.go:
Move the `(c config) spec()` call to here.
config_scraper.go:
Empty file by now
* Name-update the scraper interfaces
Use 'via' rather than 'loadBy'.
The scrape happens via a given scrape method, so I think this is a nice
name for it.
* Rename scrapers for consistency.
While here, improve the error formatting, so different errors come
back differently.
* Nuke the freeones field from the GraphQL schema
* Fix autotag interfacing, refactor
The autotag scraper uses a pointer receiver, but the rest of the code
we use for scraping doesn't expect a pointer-receiver. Hence, to fix
the autotag scraper, we change it to be a value receiver, like the
rest of the code.
Fix: viaScene, and viaGallery.
While here, remove a couple of pointer-receiver methods which can be
trivially rewritten into plain functions.
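The receiver mismatch comes down to Go's method sets: a value stored in an interface only carries its value-receiver methods, so an interface conversion fails when the method has a pointer receiver. The sketch below uses made-up types (`sceneQuerier`, `valueScraper`, `pointerScraper`) purely to show that rule.

```go
package main

import "fmt"

// sceneQuerier is a hypothetical capability interface.
type sceneQuerier interface {
	viaScene(title string) string
}

type valueScraper struct{}

// Value receiver: in the method set of both valueScraper and *valueScraper.
func (valueScraper) viaScene(title string) string { return "value:" + title }

type pointerScraper struct{}

// Pointer receiver: only in the method set of *pointerScraper.
func (*pointerScraper) viaScene(title string) string { return "pointer:" + title }

// supportsScene reports whether the dynamic value satisfies sceneQuerier.
func supportsScene(s interface{}) bool {
	_, ok := s.(sceneQuerier)
	return ok
}

func main() {
	fmt.Println(supportsScene(valueScraper{}))    // true
	fmt.Println(supportsScene(&valueScraper{}))   // true
	fmt.Println(supportsScene(pointerScraper{}))  // false
	fmt.Println(supportsScene(&pointerScraper{})) // true
}
```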
* Protect against pointer interfaces
The underlying code can be a bit inconsistent in what it returns.
Introduce pointer-types in the postprocessing layer and handle them
accordingly for now. Once the lower levels are better understood,
we can lift this.
* Move ErrConversion into the models package.
The conversion error pertains to the logic of converting models.
Because of this, it should move there, so it is centralized.
* Be consistent in scraper resolver error handling
If we have a static error
Err = errors.New(..)
Then use it wrapped at the start:
fmt.Errorf("%w: ...context...", Err)
This reads better.
While here, avoid using the underlying Atoi errors: they are verbose,
and like 99% of the time, the user knows what is wrong from the input
string, so just give that back.
Also, remove the scraper id from the error contexts: it is implicit,
and the error wouldn't change if we used a different scraper, which
the error message would imply.
* Mark the list*Scrapers() API as deprecated
The same functionality is now present in listScrapers.
* Improve error formatting
Think about how each error is going to be used and tweak them to be
nicer.
* Return a sorted list of scrapers
This helps testing, it's closer to what we had, caches like stable data,
and it is easier for humans. It also makes the output stable, because
map iteration is randomized.
* Fix listScrapers calls to return in ID-order
Since we need the ordering to be by ID in all situations, it is easier
to just generalize the cache listScrapers call to support multiple
scraper types.
This avoids a de-dupe map up the chain, since every scraper is only
considered once. Sorting now happens in the cache listScrapers call.
Use this generalized function in all resolvers, which are now simple
passthroughs.
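One way such a generalized listing could look is sketched below. The types and the `listScrapers` signature are assumptions for illustration: each cached scraper is visited once, included if it supports any requested type, and the result is sorted by ID because map iteration order is randomized.

```go
package main

import (
	"fmt"
	"sort"
)

// contentType is a hypothetical stand-in for the scraped content kinds.
type contentType int

const (
	performerContent contentType = iota
	sceneContent
	galleryContent
)

type scraperInfo struct {
	ID       string
	Supports []contentType
}

func (s scraperInfo) supports(t contentType) bool {
	for _, have := range s.Supports {
		if have == t {
			return true
		}
	}
	return false
}

// listScrapers returns every scraper supporting at least one of types,
// at most once each, ordered by ID.
func listScrapers(cache map[string]scraperInfo, types []contentType) []scraperInfo {
	var out []scraperInfo
	for _, s := range cache {
		for _, t := range types {
			if s.supports(t) {
				out = append(out, s)
				break // considered once, so no de-dupe map is needed
			}
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out
}

func main() {
	cache := map[string]scraperInfo{
		"b": {ID: "b", Supports: []contentType{sceneContent, performerContent}},
		"a": {ID: "a", Supports: []contentType{sceneContent}},
		"c": {ID: "c", Supports: []contentType{galleryContent}},
	}
	for _, s := range listScrapers(cache, []contentType{sceneContent, performerContent}) {
		fmt.Println(s.ID)
	}
}
```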
* Remove UpdateConfig from the scraper cache.
This isn't needed, so get rid of it.
* Pull a context into identify
Scraping scenes in the identify tasks now uses a context from up the
call chain.
* Do not store the scraper cache in the resolver.
Scraper caches are updated through
manager.singleton.RefreshScraperCache, so we can't keep a pointer to
it in the resolver. Instead, solve this by adding a fetcher method to
the resolver type. This keeps it local to the resolver, while handling
the problem of updating caches in the configuration.
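The difference between holding a pointer and fetching on demand can be sketched as below. The `manager`, `Cache`, and `Resolver` types here are simplified, hypothetical stand-ins, and the mutex is an assumption added so the example is safe under concurrent refreshes.

```go
package main

import (
	"fmt"
	"sync"
)

// Cache is a hypothetical stand-in for the scraper cache.
type Cache struct{ generation int }

// manager replaces its cache wholesale on refresh, so old pointers go stale.
type manager struct {
	mu           sync.RWMutex
	scraperCache *Cache
}

func (m *manager) RefreshScraperCache() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.scraperCache = &Cache{generation: m.scraperCache.generation + 1}
}

func (m *manager) ScraperCache() *Cache {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.scraperCache
}

// Resolver keeps no cache pointer of its own.
type Resolver struct{ mgr *manager }

// scraperCache is the fetcher method: it asks the manager on every call.
func (r *Resolver) scraperCache() *Cache { return r.mgr.ScraperCache() }

func main() {
	r := &Resolver{mgr: &manager{scraperCache: &Cache{}}}
	stale := r.scraperCache() // what a stored pointer would keep seeing
	r.mgr.RefreshScraperCache()
	fmt.Println(stale.generation, r.scraperCache().generation) // 0 1
}
```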
2021-11-18 23:55:34 +00:00
scrapers[freeOnes.spec().ID] = freeOnes
scrapers[autoTag.spec().ID] = autoTag
logger.Debugf("Reading scraper configs from %s", path)
2022-03-17 00:33:59 +00:00
err := fsutil.SymWalk(path, func(fp string, f os.FileInfo, err error) error {
2021-11-18 23:55:34 +00:00
if filepath.Ext(fp) == ".yml" {
2022-05-19 07:49:32 +00:00
conf, err := loadConfigFromYAMLFile(fp)
2021-11-18 23:55:34 +00:00
if err != nil {
logger.Errorf("Error loading scraper %s: %v", fp, err)
} else {
2022-05-19 07:49:32 +00:00
scraper := newGroupScraper(*conf, c.globalConfig)
|
Refactor scraper top half (#1893)
* Simplify scraper listing
Introduce an enum, scraper.Kind, which explains what we are looking
for. Make it possible to match this from a scraper struct.
Use the enum to rewrite all the listing code to use the same code path.
* Use a map, nitpick ScrapePerformerList
Let the cache store a map from ID of a scraper to the scraper. This
improves lookups when there are many scrapers, making it practically
O(1) rather than O(n). If many scrapers are stored, this is faster.
Since range expressions work unchanged, we don't have to change much,
and things will still work.
make Kind a Stringer
Rename ScraperPerformerList -> ScraperPerformerQuery since that name
is used in the other scrapers, and we value consistency.
Tune ScraperPerformerQuery:
* Return static errors
* Use the new functionality
* When loading scrapers, do so directly
Rather than first walking the directory structure to obtain file paths,
fold the load directly in the the filepath walk. This makes the code
for more direct.
* Use static ErrNotFound
If a scraper isn't found, return one static error. This paves the way
for eventually doing our own error-presenter in gqlgen.
* Store the cache in the Resolver state
Putting the scraperCache directly in the resolver avoids the need to
call manager.GetInstance() all over the place to get access to the
scraper cache. The cache is stored by pointer, so it should be safe,
since the cache will just update its internal state rather than being
overwritten.
We can now utilize the resolver state to grab the cache where needed.
While here, pass context.Context from the resolver down into a function,
which removes a context.TODO()
* Introduce ScrapedContent
Create a union in the GraphQL schema for all scraped content. This
simplifies the internal implementation because we get variance on
the output content type.
Introduce a new type ScrapedContentType which signifies the scraped
content you want as a caller.
Use these to generalize the List interface and the URL scraping
interface.
* Simplify the scraper API
Introduce a new interface for scraping. This interface is then
used in the upper half of the scraper code, to make the code use one
code flow rather than multiple code flows. Variance is currently at
the old scraper structure.
Add extending interfaces for the different ways of invoking scrapes.
Use interface conversions to convert a scraper from the cache to a
scraper supporting the extra methods.
The return path returns models.ScrapedContent.
Write a general postProcess function in the scraper, handling all
ScrapedContent via type switching. This consolidates all postprocessing
code flows.
Introduce marhsallers in the resolver code for converting ScrapedContent
into the underlying concrete types. Use this to plug the existing
fields in the Query resolver, so everything still works.
* ScrapedContent: add more marshalling functions
Handle all marshalling of ScrapedContent through marhsalling functions.
Removes some hand-rolled early variants of it, and replaces it with
a canonical code flow.
* Support loadByName via scraper_s
In order to temporarily plug a hole in the current implementation, we
use the older implementation as a hook to get the newer implementation
to run.
Later on, this can serve as a guide for how to implement the lower level
bits inside the scrapers themselves. For now, it just enables support.
* Plug the remaining scraper functions for now
Since we would like to have a scraper which works in between refactors,
plug the lower level parts of the scraper for now. It avoids us having
to tackle this part just yet.
* Move postprocessing to its own file
There's enough postprocessing to clutter the main scrapers.go file.
Move all of this into a new file, postprocessing to make the API
simpler. It now lives in scrapers.go.
* Scraper: Invoke API consistency
scraper.Cache.ScrapeByName -> ScrapeName
* Fix scraping scenes by URL
Simple typo. While here, also make a single marshaller nil-aware.
* Introduce scraper groups, consolidate loadByURL
Rename `scraper_s` into `group`. A group is a group of scrapers with
the same identity. This corresponds to a single YAML file for a scraper
configuration. It defines a group which supports different types of
scraping contexts.
Move config into the group, and lift txnManager and globalConfig to
the group.
Because we now return models.ScrapedContent we can use interfaces to
get variance from the different underlying scrapers. Use a type
switch for the URL matcher candidates. And then again for the scrapers.
This consolidates all URL scraping paths into one.
While here, remove the urlMatcher interface which isn't needed. Also
clean up the remaining interfaces for url scraping and delete code
which has no purpose anymore.
* Consolidate fragment scraping in one code path
While here, abide by the linter's checks.
* Refactor loadByFragment
Give it the same treatment as loadByURL:
Step 1: find a scraperActionImpl which works for the data.
Step 2: use that to scrape
Most of this is simple analysis on the data at hand. It can be pushed
down further in a later commit, but for now we leave it here.
* Remove configScraper, autotag is a scraper
Remove the remains of the configScraper struct. It now lives on in the
group struct. Kill the remaining interfaces from the old implementation
while here.
Remove group.specification since it can now be handled by a simple
func call to spec().
Work through the autotag scraper. It now implements the scraper
interface, so it can be used as a scraper. This also simplifies the
autotag scraper quite a bit since it doesn't have to implement a number
of unsupported func calls.
* Simplify the fragment scraper flow
* Pass the context
Eliminate a round of context.TODO() in the scraper code by passing
the calling context down into the subsystem. This will gracefully
allow for termination of remote calls if the client goes away for some
reason in GraphQL requests.
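
A minimal sketch of why passing the caller's context matters, assuming a
hypothetical slow remote call (scrapeRemote) behind a hypothetical resolver
function (resolve). When the context is cancelled, the call returns early
instead of waiting for the remote side.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// scrapeRemote is a hypothetical stand-in for a slow remote call.
// It returns early when the caller's context is cancelled.
func scrapeRemote(ctx context.Context, delay time.Duration) (string, error) {
	select {
	case <-time.After(delay):
		return "scraped", nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// resolve plays the role of a resolver that passes its own context down
// rather than creating a context.TODO().
func resolve(ctx context.Context, delay time.Duration) (string, error) {
	return scrapeRemote(ctx, delay)
}

// runCancelled simulates a client that has gone away before the scrape ends.
func runCancelled() error {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	_, err := resolve(ctx, time.Minute)
	return err
}

// runLive simulates a client that stays connected.
func runLive() (string, error) {
	return resolve(context.Background(), time.Millisecond)
}

func main() {
	fmt.Println(errors.Is(runCancelled(), context.Canceled))
	fmt.Println(runLive())
}
```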
* Improve listScrapers in the schema
Support lists of types we accept.
* Be graceful on nil values in conversion
Supporting nil-values makes the API more robust in the
case of partial results in a multi-scrape situation.
* Improve listScrapers: output at-most-once
Use the ID of a scraper to reduce the output set. If a scraper has
been included, don't include it again.
* Consolidate all API level errors into resolver.go
* Reorder files and functions:
scrapers.go -> cache.go:
It almost contains nothing but the cache code.
Move errors into scraper.go from here because
it is a better place to have them living right now
group.go:
All of the group structure. This can now go from
scraper.go, making it more lean. Move group create
from config_scraper to here.
config.go:
Move the `(c config) spec()` call to here.
config_scraper.go:
Empty file by now
* Name-update the scraper interfaces
Use 'via' rather than 'loadBy'.
The scrape happens via a given scrape method, so I think this is a nice
name for it.
* Rename scrapers for consistency.
While here, improve the error formatting, so different errors come
back differently.
* Nuke the freeones field from the GraphQL schema
* Fix autotag interfacing, refactor
The autotag scraper uses a pointer receiver, but the rest of the code
we use for scraping doesn't expect a pointer-receiver. Hence, to fix
the autotag scraper, we change it to be a value receiver, like the
rest of the code.
Fix: viaScene, and viaGallery.
While here, remove a couple of pointer-receiver methods which can be
trivially rewritten into plain functions.
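
A minimal sketch of the receiver mismatch in general Go terms, not the
actual autotag code: a type whose methods use a pointer receiver only
satisfies an interface through a pointer, while a value receiver works for
both. The scraper interface and both struct types below are hypothetical.

```go
package main

import "fmt"

// scraper is a hypothetical, cut-down scraper interface.
type scraper interface {
	name() string
}

// valueScraper uses a value receiver, so both valueScraper and
// *valueScraper satisfy the interface.
type valueScraper struct{ id string }

func (s valueScraper) name() string { return s.id }

// pointerScraper uses a pointer receiver, so only *pointerScraper
// satisfies the interface.
type pointerScraper struct{ id string }

func (s *pointerScraper) name() string { return s.id }

// isScraper reports whether the dynamic value satisfies scraper.
func isScraper(v interface{}) bool {
	_, ok := v.(scraper)
	return ok
}

func main() {
	fmt.Println(isScraper(valueScraper{id: "a"}))
	fmt.Println(isScraper(&valueScraper{id: "a"}))
	fmt.Println(isScraper(pointerScraper{id: "b"}))
	fmt.Println(isScraper(&pointerScraper{id: "b"}))
}
```

Code that stores and type-switches on plain values will therefore never
match a pointer-receiver type stored by value.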
* Protect against pointer interfaces
The underlying code can be a bit inconsistent in what it returns.
Introduce pointer-types in the postprocessing layer and handle them
accordingly for now. Once the lower levels are better understood,
we can lift this.
* Move ErrConversion into the models package.
The conversion error pertains to the logic of converting models.
Because of this, it should move there, so it is centralized.
* Be consistent in scraper resolver error handling
If we have a static error
Err = errors.New(..)
Then use it wrapped at the start:
fmt.Errorf("%w: ...context...", Err)
This reads better.
While here, avoid using the underlying Atoi errors: they are verbose,
and like 99% of the time, the user knows what is wrong from the input
string, so just give that back.
Also, remove the scraper id from the error contexts: it is implicit,
and the error wouldn't change if we used a different scraper, which
the error message would imply.
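
A minimal sketch of the wrapping convention, assuming a hypothetical static
error ErrInput and a hypothetical helper parseSceneID. The static error
comes first via %w, the Atoi error is dropped, and the input string is
given back as context.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// ErrInput is a hypothetical static error.
var ErrInput = errors.New("input error")

// parseSceneID is a hypothetical helper that parses an ID from user input.
func parseSceneID(s string) (int, error) {
	id, err := strconv.Atoi(s)
	if err != nil {
		// Drop the verbose Atoi error; the offending input is context enough.
		return 0, fmt.Errorf("%w: scene id is not an integer: %q", ErrInput, s)
	}
	return id, nil
}

// isInputError shows that callers can still match the static error.
func isInputError(err error) bool {
	return errors.Is(err, ErrInput)
}

func main() {
	_, err := parseSceneID("abc")
	fmt.Println(err)
	fmt.Println(isInputError(err))
}
```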
* Mark the list*Scrapers() API as deprecated
The same functionality is now present in listScrapers.
* Improve error formatting
Think about how each error is going to be used and tweak them to be
nicer.
* Return a sorted list of scrapers
This helps testing, it's closer to what we had, caches like stable data,
and it is easier for humans. It also makes the output stable, because
map iteration is randomized.
* Fix listScrapers calls to return in ID-order
Since we need the ordering to be by ID in all situations, it is easier
to just generalize the cache listScrapers call to support multiple
scraper types.
This avoids a de-dupe map up the chain, since every scraper is only
considered once. Sorting now happens in the cache listScrapers call.
Use this generalized function in all resolvers, which are now simple
passthroughs.
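
A minimal sketch of a generalized listing call over a map-backed cache,
under simplifying assumptions: the scraper struct, the contentType enum and
demoCache are hypothetical, and the result is a list of IDs rather than
scraper specs. Each scraper is visited once, so no de-dupe map is needed,
and the sort makes the output independent of map iteration order.

```go
package main

import (
	"fmt"
	"sort"
)

// contentType is a hypothetical enum of scrapeable content.
type contentType int

const (
	performer contentType = iota
	scene
	gallery
)

// scraper is a hypothetical cache entry: an ID plus its supported types.
type scraper struct {
	ID       string
	supports map[contentType]bool
}

// listScrapers returns the IDs of all scrapers that support at least one
// of the given types, sorted by ID.
func listScrapers(cache map[string]scraper, tys []contentType) []string {
	var ret []string
	for id, s := range cache {
		for _, t := range tys {
			if s.supports[t] {
				ret = append(ret, id)
				break // each scraper is included at most once
			}
		}
	}
	sort.Strings(ret) // map iteration order is randomized
	return ret
}

// demoCache builds a small hypothetical cache for the example.
func demoCache() map[string]scraper {
	return map[string]scraper{
		"zeta":  {ID: "zeta", supports: map[contentType]bool{performer: true, scene: true}},
		"alpha": {ID: "alpha", supports: map[contentType]bool{scene: true}},
		"mid":   {ID: "mid", supports: map[contentType]bool{gallery: true}},
	}
}

func main() {
	fmt.Println(listScrapers(demoCache(), []contentType{performer, scene}))
}
```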
* Remove UpdateConfig from the scraper cache.
This isn't needed, so get rid of it.
* Pull a context into identify
Scraping scenes in the identify tasks now uses a context from up the
call chain.
* Do not store the scraper cache in the resolver.
Scraper caches are updated through
manager.singleton.RefreshScraperCache, so we can't keep a pointer to
it in the resolver. Instead, solve this by adding a fetcher method to
the resolver type. This keeps it local to the resolver, while handling
the problem of updating caches in the configuration.
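
A minimal sketch of the difference between a saved pointer and a fetcher
method, assuming the refresh replaces the cache rather than mutating it.
The cache, manager and resolver types below are hypothetical stand-ins.

```go
package main

import "fmt"

// cache is a hypothetical stand-in for the scraper cache.
type cache struct{ generation int }

// manager is a hypothetical singleton that swaps in a new cache on refresh.
type manager struct{ scraperCache *cache }

func (m *manager) refreshScraperCache() {
	m.scraperCache = &cache{generation: m.scraperCache.generation + 1}
}

// resolver holds no cache pointer of its own; it asks the manager each time.
type resolver struct{ mgr *manager }

func (r resolver) scraperCache() *cache { return r.mgr.scraperCache }

// demo returns the generation seen through a pointer saved before the
// refresh, and the generation seen through the fetcher after it.
func demo() (stale, fresh int) {
	m := &manager{scraperCache: &cache{generation: 1}}
	saved := m.scraperCache
	r := resolver{mgr: m}
	m.refreshScraperCache()
	return saved.generation, r.scraperCache().generation
}

func main() {
	stale, fresh := demo()
	fmt.Println(stale, fresh)
}
```

The saved pointer keeps referring to the old cache, while the fetcher
always sees the current one.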
2021-11-18 23:55:34 +00:00
				scrapers[scraper.spec().ID] = scraper
			}
		}
		return nil
	})

	if err != nil {
		logger.Errorf("Error reading scraper configs: %v", err)
	}

	c.scrapers = scrapers
}

// ListScrapers lists scrapers matching one of the given types.
2023-05-03 03:34:57 +00:00
// Returns a list of scrapers, sorted by their name.
2022-04-25 05:55:05 +00:00
func (c Cache) ListScrapers(tys []ScrapeContentType) []*Scraper {
	var ret []*Scraper
	for _, s := range c.scrapers {
		for _, t := range tys {
			if s.supports(t) {
				spec := s.spec()
				ret = append(ret, &spec)
				break
			}
		}
	}

	sort.Slice(ret, func(i, j int) bool {
2023-05-03 03:34:57 +00:00
		return strings.ToLower(ret[i].Name) < strings.ToLower(ret[j].Name)
	})

	return ret
}

// GetScraper returns the scraper matching the provided id.
2022-04-25 05:55:05 +00:00
func (c Cache) GetScraper(scraperID string) *Scraper {
* Move ErrConversion into the models package.
The conversion error pertains to the logic of converting models.
Because of this, it should move there, so it is centralized.
* Be consistent in scraper resolver error handling
If we have a static error
Err = errors.New(..)
Then use it wrapped at the start:
fmt.Errorf("%w: ...context...", Err)
This reads better.
While here, avoid using the underlying Atoi errors: they are verbose,
and like 99% of the time, the user know what is wrong from the input
string, so just give that back.
Also, remove the scraper id from the error contexts: it is implicit,
and the error wouldn't change if we used a different scraper, which
the error message would imply.
* Mark the list*Scrapers() API as deprecated
The same functionality is now present in listScrapers.
* Improve error formatting
Think about how each error is going to be used and tweak them to be
nicer.
* Return a sorted list of scrapers
This helps testing, it's closer to what we had, caches like stable data,
and it is easier for humans. It also makes the output stable, because
map iteration is randomized.
* Fix listScrapers calls to return in ID-order
Since we need the ordering to be by ID in all situations, it is easier
to just generalize the cache listScrapers call to support multiple
scraper types.
This avoids a de-dupe map up the chain, since every scraper is only
considered once. Sorting now happens in the cache listScrapers call.
Use this generalized function in all resolvers, which are now simple
passthroughs.
* Remove UpdateConfig from the scraper cache.
This isn't needed, so get rid of it.
* Pull a context into identify
Scraping scenes in the identify tasks now use a context from up the
call chain.
* Do not store the scraper cache in the resolver.
Scraper caches are updated through
manager.singleton•RefreshScraperCache, so we can't keep a pointer to
it in the resolver. Instead, solve this by adding a fetcher method to
the resolver type. This keeps it local to the resolver, while handling
the problem of updating caches in the configuration.
2021-11-18 23:55:34 +00:00
|
|
|
	s := c.findScraper(scraperID)
	if s != nil {
		spec := s.spec()
		return &spec
	}

	return nil
}

func (c Cache) findScraper(scraperID string) scraper {
	s, ok := c.scrapers[scraperID]
	if ok {
		return s
	}

	return nil
}

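The lookup above relies on Go's comma-ok map read and hands back a nil interface when the ID is unknown. A minimal, self-contained sketch of that pattern follows. The `scraper` interface, the `stub` type, and the `cache` struct are simplified stand-ins invented for illustration, not the real types from this package.

```go
package main

import "fmt"

// scraper is a hypothetical, cut-down stand-in for the package's scraper interface.
type scraper interface {
	name() string
}

// stub is an illustrative implementation used only in this sketch.
type stub struct{ id string }

func (s stub) name() string { return s.id }

// cache mirrors the shape of the lookup: a map from scraper ID to scraper.
type cache struct {
	scrapers map[string]scraper
}

// find returns the scraper registered under id, or a nil interface if there is none.
func (c cache) find(id string) scraper {
	if s, ok := c.scrapers[id]; ok {
		return s
	}
	return nil
}

func main() {
	c := cache{scrapers: map[string]scraper{"a": stub{id: "a"}}}
	fmt.Println(c.find("a") != nil)       // true
	fmt.Println(c.find("missing") == nil) // true
}
```

Returning a nil interface keeps the caller's check to a plain `== nil` comparison, which is how the calling code in this file tests the result.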
func (c Cache) ScrapeName(ctx context.Context, id, query string, ty ScrapeContentType) ([]ScrapedContent, error) {
	// find scraper with the provided id
	s := c.findScraper(id)
	if s == nil {
		return nil, fmt.Errorf("%w: id %s", ErrNotFound, id)
	}
	if !s.supports(ty) {
		return nil, fmt.Errorf("%w: cannot use scraper %s as a %v scraper", ErrNotSupported, id, ty)
	}

	ns, ok := s.(nameScraper)
	if !ok {
		return nil, fmt.Errorf("%w: cannot use scraper %s to scrape by name", ErrNotSupported, id)
	}

	content, err := ns.viaName(ctx, c.client, query, ty)
	if err != nil {
		return nil, fmt.Errorf("error while name scraping with scraper %s: %w", id, err)
	}

	for i, cc := range content {
		content[i], err = c.postScrape(ctx, cc)
		if err != nil {
			return nil, fmt.Errorf("error while post-scraping with scraper %s: %w", id, err)
		}
	}

	return content, nil
}

// ScrapeFragment uses the given fragment input to scrape
func (c Cache) ScrapeFragment(ctx context.Context, id string, input Input) (ScrapedContent, error) {
	// set the deprecated URL field if it's not set
	input.populateURL()
string, so just give that back.
Also, remove the scraper id from the error contexts: it is implicit,
and the error wouldn't change if we used a different scraper, which
the error message would imply.
* Mark the list*Scrapers() API as deprecated
The same functionality is now present in listScrapers.
* Improve error formatting
Think about how each error is going to be used and tweak them to be
nicer.
* Return a sorted list of scrapers
This helps testing, it's closer to what we had, caches like stable data,
and it is easier for humans. It also makes the output stable, because
map iteration is randomized.
* Fix listScrapers calls to return in ID-order
Since we need the ordering to be by ID in all situations, it is easier
to just generalize the cache listScrapers call to support multiple
scraper types.
This avoids a de-dupe map up the chain, since every scraper is only
considered once. Sorting now happens in the cache listScrapers call.
Use this generalized function in all resolvers, which are now simple
passthroughs.
* Remove UpdateConfig from the scraper cache.
This isn't needed, so get rid of it.
* Pull a context into identify
Scraping scenes in the identify tasks now use a context from up the
call chain.
* Do not store the scraper cache in the resolver.
Scraper caches are updated through
manager.singleton•RefreshScraperCache, so we can't keep a pointer to
it in the resolver. Instead, solve this by adding a fetcher method to
the resolver type. This keeps it local to the resolver, while handling
the problem of updating caches in the configuration.
2021-11-18 23:55:34 +00:00
	s := c.findScraper(id)
	if s == nil {
		return nil, fmt.Errorf("%w: id %s", ErrNotFound, id)
	}

	fs, ok := s.(fragmentScraper)
	if !ok {
		return nil, fmt.Errorf("%w: cannot use scraper %s as a fragment scraper", ErrNotSupported, id)
	}

	content, err := fs.viaFragment(ctx, c.client, input)
	if err != nil {
		return nil, fmt.Errorf("error while fragment scraping with scraper %s: %w", id, err)
	}

	return c.postScrape(ctx, content)
}

// ScrapeURL scrapes a given url for the given content. Searches the scraper cache
// and picks the first scraper capable of scraping the given url into the desired
// content. Returns the scraped content or an error if the scrape fails.
func (c Cache) ScrapeURL(ctx context.Context, url string, ty ScrapeContentType) (ScrapedContent, error) {
	for _, s := range c.scrapers {
		if s.supportsURL(url, ty) {
			ul, ok := s.(urlScraper)
			if !ok {
				return nil, fmt.Errorf("%w: cannot use scraper %s as an url scraper", ErrNotSupported, s.spec().ID)
			}
			ret, err := ul.viaURL(ctx, c.client, url, ty)
			if err != nil {
				return nil, err
			}

			if ret == nil {
				return ret, nil
			}

			return c.postScrape(ctx, ret)
		}
	}

	return nil, nil
}

func (c Cache) ScrapeID(ctx context.Context, scraperID string, id int, ty ScrapeContentType) (ScrapedContent, error) {
	s := c.findScraper(scraperID)
	if s == nil {
		return nil, fmt.Errorf("%w: id %s", ErrNotFound, scraperID)
	}

	if !s.supports(ty) {
		return nil, fmt.Errorf("%w: cannot use scraper %s to scrape %v content", ErrNotSupported, scraperID, ty)
	}

	var ret ScrapedContent
Refactor scraper top half (#1893)
* Simplify scraper listing
Introduce an enum, scraper.Kind, which explains what we are looking
for. Make it possible to match this from a scraper struct.
Use the enum to rewrite all the listing code to use the same code path.
* Use a map, nitpick ScrapePerformerList
Let the cache store a map from ID of a scraper to the scraper. This
improves lookups when there are many scrapers, making it practically
O(1) rather than O(n). If many scrapers are stored, this is faster.
Since range expressions work unchanged, we don't have to change much,
and things will still work.
make Kind a Stringer
Rename ScraperPerformerList -> ScraperPerformerQuery since that name
is used in the other scrapers, and we value consistency.
Tune ScraperPerformerQuery:
* Return static errors
* Use the new functionality
* When loading scrapers, do so directly
Rather than first walking the directory structure to obtain file paths,
fold the load directly in the the filepath walk. This makes the code
for more direct.
* Use static ErrNotFound
If a scraper isn't found, return one static error. This paves the way
for eventually doing our own error-presenter in gqlgen.
* Store the cache in the Resolver state
Putting the scraperCache directly in the resolver avoids the need to
call manager.GetInstance() all over the place to get access to the
scraper cache. The cache is stored by pointer, so it should be safe,
since the cache will just update its internal state rather than being
overwritten.
We can now utilize the resolver state to grab the cache where needed.
While here, pass context.Context from the resolver down into a function,
which removes a context.TODO()
* Introduce ScrapedContent
Create a union in the GraphQL schema for all scraped content. This
simplifies the internal implementation because we get variance on
the output content type.
Introduce a new type ScrapedContentType which signifies the scraped
content you want as a caller.
Use these to generalize the List interface and the URL scraping
interface.
* Simplify the scraper API
Introduce a new interface for scraping. This interface is then
used in the upper half of the scraper code, to make the code use one
code flow rather than multiple code flows. Variance is currently at
the old scraper structure.
Add extending interfaces for the different ways of invoking scrapes.
Use interface conversions to convert a scraper from the cache to a
scraper supporting the extra methods.
The return path returns models.ScrapedContent.
Write a general postProcess function in the scraper, handling all
ScrapedContent via type switching. This consolidates all postprocessing
code flows.
Introduce marhsallers in the resolver code for converting ScrapedContent
into the underlying concrete types. Use this to plug the existing
fields in the Query resolver, so everything still works.
* ScrapedContent: add more marshalling functions
Handle all marshalling of ScrapedContent through marhsalling functions.
Removes some hand-rolled early variants of it, and replaces it with
a canonical code flow.
* Support loadByName via scraper_s
In order to temporarily plug a hole in the current implementation, we
use the older implementation as a hook to get the newer implementation
to run.
Later on, this can serve as a guide for how to implement the lower level
bits inside the scrapers themselves. For now, it just enables support.
* Plug the remaining scraper functions for now
Since we would like to have a scraper which works in between refactors,
plug the lower level parts of the scraper for now. It avoids us having
to tackle this part just yet.
* Move postprocessing to its own file
There's enough postprocessing to clutter the main scrapers.go file.
Move all of this into a new file, postprocessing to make the API
simpler. It now lives in scrapers.go.
* Scraper: Invoke API consistency
scraper.Cache.ScrapeByName -> ScrapeName
* Fix scraping scenes by URL
Simple typo. While here, also make a single marshaller nil-aware.
* Introduce scraper groups, consolidate loadByURL
Rename `scraper_s` into `group`. A group is a group of scrapers with
the same identity. This corresponds to a single YAML file for a scraper
configuration. It defines a group which supports different types of
scraping contexts.
Move config into the group, and lift txnManager and globalConfig to
the group.
Because we now return models.ScrapedContent we can use interfaces to
get variance from the different underlying scrapers. Use a type
switch for the URL matcher candidates. And then again for the scrapers.
This consolidates all URL scraping paths into one.
While here, remove the urlMatcher interface which isn't needed. Also
clean up the remaining interfaces for url scraping and delete code
which has no purpose anymore.
* Consolidate fragment scraping in one code path
While here, abide the linters checks.
* Refactor loadByFragment
Give it the same treatment as loadByURL:
Step 1: find a scraperActionImpl which works for the data.
Step 2: use that to scrape
Most of this is simple analysis on the data at hand. It can be pushed
down further in a later commit, but for now we leave it here.
* Remove configScraper, autotag is a scraper
Remove the remains of the configScraper struct. It now lives on in the
group struct. Kill the remaining interfaces from the old implementation
while here.
Remove group.specification since it can now be handled by a simple
func call to spec().
Work through the autotag scraper. It now implements the scraper
interface, so it can be used as a scraper. This also simplifies the
autotag scraper quite a bit since it doens't have to implement a number
of unsupported func calls.
* Simplify the fragment scraper flow
* Pass the context
Eliminate a round of context.TODO() in the scraper code by passing
the calling context down into the subsystem. This will gracefully
allow for termination of remote calls if the client goes away for some
reason in GraphQL requests.
* Improve listScrapers in the schema
Support lists of types we accept.
* Be graceful on nil values in conversion
Supporting nil-values make the API more robust in the
case of partial results in a multi-scrape situation.
* Improve listScrapers: output at-most-once
Use the ID of a scraper to reduce the output set. If a scraper has
been included, don't include it again.
* Consolidate all API level errors into resolver.go
* Reorder files and functions:
scrapers.go -> cache.go:
It almost contains nothing but the cache code.
Move errors into scraper.go from here because
It is a better place to have them living right now
group.go:
All of the group structure. This can now go from
scraper.go, making it more lean. Move group create
from config_scraper to here.
config.go:
Move the `(c config) spec()` call to here.
config_scraper.go:
Empty file by now
* Name-update the scraper interfaces
Use 'via' rather than 'loadBy'.
The scrape happens via a given scrape method, so I think this is a nice
name for it.
* Rename scrapers for consistency.
While here, improve the error formatting, so different errors come
back differently.
* Nuke the freeones field from the GraphQL schema
* Fix autotag interfacing, refactor
The autotag scraper uses a pointer receiver, but the rest of the code
we use for scraping doesn't expect a pointer-receiver. Hence, to fix
the autotag scraper, we change it to be a value receiver, like the
rest of the code.
Fix: viaScene, and viaGallery.
While here, remove a couple of pointer-receiver methods which can be
trivially rewritten into plain functions.
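
The receiver mismatch can be sketched as follows. The `sceneScraper` interface here is simplified for illustration, and `ptrScraper` and `valScraper` are hypothetical types; the point is only that a type assertion on a value fails when the method is declared on the pointer receiver.

```go
package main

import "fmt"

// sceneScraper is a simplified stand-in for an optional scraper capability.
type sceneScraper interface {
	viaScene(id int) string
}

// ptrScraper declares viaScene on the pointer receiver.
type ptrScraper struct{}

func (p *ptrScraper) viaScene(id int) string { return fmt.Sprintf("ptr:%d", id) }

// valScraper declares viaScene on the value receiver.
type valScraper struct{}

func (v valScraper) viaScene(id int) string { return fmt.Sprintf("val:%d", id) }

// supportsScene reports whether s can be used as a sceneScraper.
func supportsScene(s interface{}) bool {
	_, ok := s.(sceneScraper)
	return ok
}

func main() {
	fmt.Println(supportsScene(ptrScraper{}))  // value lacks the pointer-receiver method
	fmt.Println(supportsScene(&ptrScraper{})) // pointer has it
	fmt.Println(supportsScene(valScraper{}))  // value receiver works when stored by value
}
```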
* Protect against pointer interfaces
The underlying code can be a bit inconsistent in what it returns.
Introduce pointer-types in the postprocessing layer and handle them
accordingly for now. Once the lower levels are better understood,
we can lift this.
* Move ErrConversion into the models package.
The conversion error pertains to the logic of converting models.
Because of this, it should move there, so it is centralized.
* Be consistent in scraper resolver error handling
If we have a static error
Err = errors.New(..)
Then use it wrapped at the start:
fmt.Errorf("%w: ...context...", Err)
This reads better.
While here, avoid using the underlying Atoi errors: they are verbose,
and 99% of the time the user knows what is wrong from the input
string, so just give that back.
Also, remove the scraper id from the error contexts: it is implicit,
and the error wouldn't change if we used a different scraper, which
the error message would imply.
* Mark the list*Scrapers() API as deprecated
The same functionality is now present in listScrapers.
* Improve error formatting
Think about how each error is going to be used and tweak them to be
nicer.
* Return a sorted list of scrapers
This helps testing, it's closer to what we had, caches like stable data,
and it is easier for humans. It also makes the output stable, because
map iteration is randomized.
* Fix listScrapers calls to return in ID-order
Since we need the ordering to be by ID in all situations, it is easier
to just generalize the cache listScrapers call to support multiple
scraper types.
This avoids a de-dupe map up the chain, since every scraper is only
considered once. Sorting now happens in the cache listScrapers call.
Use this generalized function in all resolvers, which are now simple
passthroughs.
* Remove UpdateConfig from the scraper cache.
This isn't needed, so get rid of it.
* Pull a context into identify
Scraping scenes in the identify tasks now uses a context from up the
call chain.
* Do not store the scraper cache in the resolver.
Scraper caches are updated through
manager.singleton.RefreshScraperCache, so we can't keep a pointer to
it in the resolver. Instead, solve this by adding a fetcher method to
the resolver type. This keeps it local to the resolver, while handling
the problem of updating caches in the configuration.
2021-11-18 23:55:34 +00:00
switch ty {
2022-04-25 05:55:05 +00:00
case ScrapeContentTypeScene:
2021-11-18 23:55:34 +00:00
	ss, ok := s.(sceneScraper)
	if !ok {
		return nil, fmt.Errorf("%w: cannot use scraper %s as a scene scraper", ErrNotSupported, scraperID)
	}
2022-05-19 07:49:32 +00:00
	scene, err := c.getScene(ctx, id)
2021-11-18 23:55:34 +00:00
	if err != nil {
		return nil, fmt.Errorf("scraper %s: unable to load scene id %v: %w", scraperID, id, err)
	}
2022-03-14 22:42:22 +00:00
	// don't assign nil concrete pointer to ret interface, otherwise nil
	// detection is harder
	scraped, err := ss.viaScene(ctx, c.client, scene)
2021-11-18 23:55:34 +00:00
		if err != nil {
			return nil, fmt.Errorf("scraper %s: %w", scraperID, err)
		}
2022-03-14 22:42:22 +00:00

		if scraped != nil {
			ret = scraped
		}
2022-04-25 05:55:05 +00:00
	case ScrapeContentTypeGallery:
2021-11-18 23:55:34 +00:00
		gs, ok := s.(galleryScraper)
		if !ok {
			return nil, fmt.Errorf("%w: cannot use scraper %s as a gallery scraper", ErrNotSupported, scraperID)
		}

2022-05-19 07:49:32 +00:00
		gallery, err := c.getGallery(ctx, id)
2021-11-18 23:55:34 +00:00
		if err != nil {
			return nil, fmt.Errorf("scraper %s: unable to load gallery id %v: %w", scraperID, id, err)
		}

2022-03-14 22:42:22 +00:00
		// don't assign nil concrete pointer to ret interface, otherwise nil
		// detection is harder
		scraped, err := gs.viaGallery(ctx, c.client, gallery)
Refactor scraper top half (#1893)
* Simplify scraper listing
Introduce an enum, scraper.Kind, which explains what we are looking
for. Make it possible to match this from a scraper struct.
Use the enum to rewrite all the listing code to use the same code path.
* Use a map, nitpick ScrapePerformerList
Let the cache store a map from ID of a scraper to the scraper. This
improves lookups when there are many scrapers, making it practically
O(1) rather than O(n). If many scrapers are stored, this is faster.
Since range expressions work unchanged, we don't have to change much,
and things will still work.
make Kind a Stringer
Rename ScraperPerformerList -> ScraperPerformerQuery since that name
is used in the other scrapers, and we value consistency.
Tune ScraperPerformerQuery:
* Return static errors
* Use the new functionality
* When loading scrapers, do so directly
Rather than first walking the directory structure to obtain file paths,
fold the load directly in the the filepath walk. This makes the code
for more direct.
* Use static ErrNotFound
If a scraper isn't found, return one static error. This paves the way
for eventually doing our own error-presenter in gqlgen.
* Store the cache in the Resolver state
Putting the scraperCache directly in the resolver avoids the need to
call manager.GetInstance() all over the place to get access to the
scraper cache. The cache is stored by pointer, so it should be safe,
since the cache will just update its internal state rather than being
overwritten.
We can now utilize the resolver state to grab the cache where needed.
While here, pass context.Context from the resolver down into a function,
which removes a context.TODO()
* Introduce ScrapedContent
Create a union in the GraphQL schema for all scraped content. This
simplifies the internal implementation because we get variance on
the output content type.
Introduce a new type ScrapedContentType which signifies the scraped
content you want as a caller.
Use these to generalize the List interface and the URL scraping
interface.
* Simplify the scraper API
Introduce a new interface for scraping. This interface is then
used in the upper half of the scraper code, to make the code use one
code flow rather than multiple code flows. Variance is currently at
the old scraper structure.
Add extending interfaces for the different ways of invoking scrapes.
Use interface conversions to convert a scraper from the cache to a
scraper supporting the extra methods.
The return path returns models.ScrapedContent.
Write a general postProcess function in the scraper, handling all
ScrapedContent via type switching. This consolidates all postprocessing
code flows.
Introduce marhsallers in the resolver code for converting ScrapedContent
into the underlying concrete types. Use this to plug the existing
fields in the Query resolver, so everything still works.
* ScrapedContent: add more marshalling functions
Handle all marshalling of ScrapedContent through marhsalling functions.
Removes some hand-rolled early variants of it, and replaces it with
a canonical code flow.
* Support loadByName via scraper_s
In order to temporarily plug a hole in the current implementation, we
use the older implementation as a hook to get the newer implementation
to run.
Later on, this can serve as a guide for how to implement the lower level
bits inside the scrapers themselves. For now, it just enables support.
* Plug the remaining scraper functions for now
Since we would like to have a scraper which works in between refactors,
plug the lower level parts of the scraper for now. It avoids us having
to tackle this part just yet.
* Move postprocessing to its own file
There's enough postprocessing to clutter the main scrapers.go file.
Move all of this into a new file, postprocessing to make the API
simpler. It now lives in scrapers.go.
* Scraper: Invoke API consistency
scraper.Cache.ScrapeByName -> ScrapeName
* Fix scraping scenes by URL
Simple typo. While here, also make a single marshaller nil-aware.
* Introduce scraper groups, consolidate loadByURL
Rename `scraper_s` into `group`. A group is a group of scrapers with
the same identity. This corresponds to a single YAML file for a scraper
configuration. It defines a group which supports different types of
scraping contexts.
Move config into the group, and lift txnManager and globalConfig to
the group.
Because we now return models.ScrapedContent we can use interfaces to
get variance from the different underlying scrapers. Use a type
switch for the URL matcher candidates. And then again for the scrapers.
This consolidates all URL scraping paths into one.
While here, remove the urlMatcher interface which isn't needed. Also
clean up the remaining interfaces for url scraping and delete code
which has no purpose anymore.
* Consolidate fragment scraping in one code path
While here, abide the linters checks.
* Refactor loadByFragment
Give it the same treatment as loadByURL:
Step 1: find a scraperActionImpl which works for the data.
Step 2: use that to scrape
Most of this is simple analysis on the data at hand. It can be pushed
down further in a later commit, but for now we leave it here.
* Remove configScraper, autotag is a scraper
Remove the remains of the configScraper struct. It now lives on in the
group struct. Kill the remaining interfaces from the old implementation
while here.
Remove group.specification since it can now be handled by a simple
func call to spec().
Work through the autotag scraper. It now implements the scraper
interface, so it can be used as a scraper. This also simplifies the
autotag scraper quite a bit since it doens't have to implement a number
of unsupported func calls.
* Simplify the fragment scraper flow
* Pass the context
Eliminate a round of context.TODO() in the scraper code by passing
the calling context down into the subsystem. This will gracefully
allow for termination of remote calls if the client goes away for some
reason in GraphQL requests.
* Improve listScrapers in the schema
Support lists of types we accept.
* Be graceful on nil values in conversion
Supporting nil-values make the API more robust in the
case of partial results in a multi-scrape situation.
* Improve listScrapers: output at-most-once
Use the ID of a scraper to reduce the output set. If a scraper has
been included, don't include it again.
* Consolidate all API level errors into resolver.go
* Reorder files and functions:
scrapers.go -> cache.go:
It almost contains nothing but the cache code.
Move errors into scraper.go from here because
It is a better place to have them living right now
group.go:
All of the group structure. This can now go from
scraper.go, making it more lean. Move group create
from config_scraper to here.
config.go:
Move the `(c config) spec()` call to here.
config_scraper.go:
Empty file by now
* Name-update the scraper interfaces
Use 'via' rather than 'loadBy'.
The scrape happens via a given scrape method, so I think this is a nice
name for it.
* Rename scrapers for consistency.
While here, improve the error formatting, so different errors come
back differently.
* Nuke the freeones field from the GraphQL schema
* Fix autotag interfacing, refactor
The autotag scraper uses a pointer receiver, but the rest of the code
we use for scraping doesn't expect a pointer-receiver. Hence, to fix
the autotag scraper, we change it to be a value receiver, like the
rest of the code.
Fix: viaScene, and viaGallery.
While here, remove a couple of pointer-receiver methods which can be
trivially rewritten into plain functions.
* Protect against pointer interfaces
The underlying code can be a bit inconsistent in what it returns.
Introduce pointer-types in the postprocessing layer and handle them
accordingly for now. Once a better understanding of the lower levels
are understood, we can lift this.
* Move ErrConversion into the models package.
The conversion error pertains to the logic of converting models.
Because of this, it should move there, so it is centralized.
* Be consistent in scraper resolver error handling
If we have a static error
Err = errors.New(..)
Then use it wrapped at the start:
fmt.Errorf("%w: ...context...", Err)
This reads better.
While here, avoid using the underlying Atoi errors: they are verbose,
and like 99% of the time, the user know what is wrong from the input
string, so just give that back.
Also, remove the scraper id from the error contexts: it is implicit,
and the error wouldn't change if we used a different scraper, which
the error message would imply.
* Mark the list*Scrapers() API as deprecated
The same functionality is now present in listScrapers.
* Improve error formatting
Think about how each error is going to be used and tweak them to be
nicer.
* Return a sorted list of scrapers
This helps testing, it's closer to what we had, caches like stable data,
and it is easier for humans. It also makes the output stable, because
map iteration is randomized.
* Fix listScrapers calls to return in ID-order
Since we need the ordering to be by ID in all situations, it is easier
to just generalize the cache listScrapers call to support multiple
scraper types.
This avoids a de-dupe map up the chain, since every scraper is only
considered once. Sorting now happens in the cache listScrapers call.
Use this generalized function in all resolvers, which are now simple
passthroughs.
* Remove UpdateConfig from the scraper cache.
This isn't needed, so get rid of it.
* Pull a context into identify
Scraping scenes in the identify tasks now use a context from up the
call chain.
* Do not store the scraper cache in the resolver.
Scraper caches are updated through
manager.singleton•RefreshScraperCache, so we can't keep a pointer to
it in the resolver. Instead, solve this by adding a fetcher method to
the resolver type. This keeps it local to the resolver, while handling
the problem of updating caches in the configuration.
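The fetcher approach from the last item can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `Manager`, `Cache`, `Resolver`, `ScraperName`, and the map-of-names cache are all hypothetical stand-ins. The only idea taken from the message is that a refresh replaces the cache pointer, so the resolver should ask for the current cache on every call instead of holding on to one.

```go
package main

import (
	"fmt"
	"sync"
)

// Cache is a hypothetical stand-in for the scraper cache:
// a map from scraper ID to a display name.
type Cache struct {
	scrapers map[string]string
}

// Manager is a hypothetical owner of the cache. A refresh swaps in a
// new *Cache, so any pointer stored elsewhere would go stale.
type Manager struct {
	mu    sync.RWMutex
	cache *Cache
}

// RefreshScraperCache replaces the current cache with a fresh one.
func (m *Manager) RefreshScraperCache(scrapers map[string]string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.cache = &Cache{scrapers: scrapers}
}

// ScraperCache returns whatever cache is current at call time.
func (m *Manager) ScraperCache() *Cache {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.cache
}

// Resolver holds the manager, not the cache itself.
type Resolver struct {
	manager *Manager
}

// scraperCache is the fetcher method: it looks the cache up on each call,
// so the resolver never keeps a pointer across refreshes.
func (r *Resolver) scraperCache() *Cache {
	return r.manager.ScraperCache()
}

// ScraperName looks up a scraper by ID in the current cache.
func (r *Resolver) ScraperName(id string) (string, bool) {
	name, ok := r.scraperCache().scrapers[id]
	return name, ok
}

func main() {
	m := &Manager{}
	m.RefreshScraperCache(map[string]string{"a": "Scraper A"})
	r := &Resolver{manager: m}

	_, ok := r.ScraperName("a")
	fmt.Println("before refresh, a found:", ok)

	// The resolver sees the replaced cache without being rebuilt.
	m.RefreshScraperCache(map[string]string{"b": "Scraper B"})
	_, ok = r.ScraperName("a")
	fmt.Println("after refresh, a found:", ok)
}
```

Had the resolver stored the `*Cache` directly, the second lookup would still read the old map after the refresh.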
2021-11-18 23:55:34 +00:00
		if err != nil {
			return nil, fmt.Errorf("scraper %s: %w", scraperID, err)
		}
2022-03-14 22:42:22 +00:00
		if scraped != nil {
			ret = scraped
		}
2021-11-18 23:55:34 +00:00
	}

	return c.postScrape(ctx, ret)
}
2022-05-19 07:49:32 +00:00
func (c Cache) getScene(ctx context.Context, sceneID int) (*models.Scene, error) {
	var ret *models.Scene
2023-10-16 03:26:34 +00:00
	r := c.repository
	if err := r.WithReadTxn(ctx, func(ctx context.Context) error {
		qb := r.SceneFinder
2022-05-19 07:49:32 +00:00
		var err error
2023-10-16 03:26:34 +00:00
		ret, err = qb.Find(ctx, sceneID)
2023-06-15 02:46:09 +00:00
		if err != nil {
			return err
		}

		if ret == nil {
			return fmt.Errorf("scene with id %d not found", sceneID)
		}
2024-06-11 03:12:45 +00:00
		if err := ret.LoadURLs(ctx, qb); err != nil {
			return err
		}

		if err := ret.LoadFiles(ctx, qb); err != nil {
			return err
		}

		return nil
2022-05-19 07:49:32 +00:00
	}); err != nil {
		return nil, err
	}
	return ret, nil
}

func (c Cache) getGallery(ctx context.Context, galleryID int) (*models.Gallery, error) {
	var ret *models.Gallery
2023-10-16 03:26:34 +00:00
	r := c.repository
	if err := r.WithReadTxn(ctx, func(ctx context.Context) error {
		qb := r.GalleryFinder
2022-05-19 07:49:32 +00:00
		var err error
2023-10-16 03:26:34 +00:00
		ret, err = qb.Find(ctx, galleryID)
2023-06-15 02:46:09 +00:00
		if err != nil {
			return err
		}
2022-09-01 07:54:34 +00:00
2023-06-15 02:46:09 +00:00
		if ret == nil {
			return fmt.Errorf("gallery with id %d not found", galleryID)
2022-09-01 07:54:34 +00:00
		}
2024-06-11 03:12:45 +00:00
		if err := ret.LoadURLs(ctx, qb); err != nil {
			return err
		}

		if err := ret.LoadFiles(ctx, qb); err != nil {
2023-09-30 00:43:57 +00:00
			return err
		}
2024-06-11 03:12:45 +00:00
		return nil
2022-05-19 07:49:32 +00:00
	}); err != nil {
		return nil, err
	}
	return ret, nil
}