stash/pkg/database/custom_migrations.go

Add indexes for path and checksum to images (#1740) — 2021-09-21

The scenes table has unique indexes/constraints on its path and checksum columns; the images table did not, even though scanning uses these columns extensively, which warrants an index, and both should be unique as well. Adding these indexes therefore greatly improves scanning performance: on a database containing 4700 images, a rescan of those 4700 files (which should effectively be a no-op) took 1.2 seconds without the indexes and 0.4 seconds with them. The same test on a generated database of 4M images plus the actual 4700 took 26 minutes for a rescan without the index, and again only 0.4 seconds with it.

The images.checksum unique constraint is added in code, with a fallback, to work around databases where duplicate image checksums already exist. As discussed in #1740, the unique index is created on startup, and if creation fails the duplicate checksums are logged so affected users can correct the database (by searching for the logged checksums and removing the duplicates); after a restart the unique index/constraint is then created. When creating the unique index fails, a normal (non-unique) index is created as a surrogate, so the user still gets the performance benefit (for example during scanning) without being forced to remove the duplicates and restart first. The surrogate is automatically cleaned up once the unique index is successfully created.
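The duplicate detection described above amounts to grouping image rows by checksum and reporting any checksum that occurs more than once (the migration does this in SQL with `GROUP BY checksum HAVING COUNT(1) > 1`). A minimal in-memory sketch of that grouping, using a hypothetical `findDuplicateChecksums` helper that is not part of this file:

```go
package main

import (
	"fmt"
	"sort"
)

// findDuplicateChecksums mirrors the migration's SQL
//   SELECT checksum FROM images GROUP BY checksum HAVING COUNT(1) > 1
// over an in-memory slice: every checksum seen more than once is returned.
func findDuplicateChecksums(checksums []string) []string {
	counts := make(map[string]int)
	for _, c := range checksums {
		counts[c]++
	}

	var dupes []string
	for c, n := range counts {
		if n > 1 {
			dupes = append(dupes, c)
		}
	}
	sort.Strings(dupes) // deterministic order for logging
	return dupes
}

func main() {
	fmt.Println(findDuplicateChecksums([]string{"a1", "b2", "a1", "c3"}))
	// prints: [a1]
}
```

Rows whose checksum appears exactly once are dropped, matching the HAVING clause; only the offending checksums are reported so the user can locate and remove the duplicates.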
package database

import (
	"database/sql"
	"fmt"
	"strings"

	"github.com/jmoiron/sqlx"
	"github.com/stashapp/stash/pkg/logger"
)

func runCustomMigrations() error {
	if err := createImagesChecksumIndex(); err != nil {
		return err
	}

	return nil
}

func createImagesChecksumIndex() error {
	return WithTxn(func(tx *sqlx.Tx) error {
		row := tx.QueryRow("SELECT 1 AS found FROM sqlite_master WHERE type = 'index' AND name = 'images_checksum_unique'")
		err := row.Err()
		if err != nil && err != sql.ErrNoRows {
			return err
		}

		if err == nil {
			var found bool
			if err := row.Scan(&found); err != nil && err != sql.ErrNoRows {
				return fmt.Errorf("error while scanning for index: %w", err)
			}
			if found {
				return nil
			}
		}

		_, err = tx.Exec("CREATE UNIQUE INDEX images_checksum_unique ON images (checksum)")
		if err == nil {
			// The unique index now exists, so the surrogate non-unique
			// index (if any) is no longer needed.
			_, err = tx.Exec("DROP INDEX IF EXISTS index_images_checksum")
			if err != nil {
				logger.Errorf("Failed to remove surrogate images.checksum index: %s", err)
			}
			logger.Info("Created unique constraint on images table")
			return nil
		}

		// Creating the unique index failed, most likely because duplicate
		// checksums exist. Create a non-unique surrogate index instead so
		// scanning still benefits, then log the duplicates for the user.
		_, err = tx.Exec("CREATE INDEX IF NOT EXISTS index_images_checksum ON images (checksum)")
		if err != nil {
			logger.Errorf("Unable to create index on images.checksum: %s", err)
		}

		var result []struct {
			Checksum string `db:"checksum"`
		}
		err = tx.Select(&result, "SELECT checksum FROM images GROUP BY checksum HAVING COUNT(1) > 1")
		if err != nil && err != sql.ErrNoRows {
			logger.Errorf("Unable to determine non-unique image checksums: %s", err)
			return nil
		}

		checksums := make([]string, len(result))
		for i, res := range result {
			checksums[i] = res.Checksum
		}

		logger.Warnf("The following duplicate image checksums have been found. Please remove the duplicates and restart. %s", strings.Join(checksums, ", "))
		return nil
	})
}