Improve 'swh db' commands for easier usage in test environments and better consistency
Warning: I still have revisions to push in this MR to fit the description below. should be ok now
The main idea is to make it easy to manage the whole lifecycle of swh.core.db based backends getting configuration from the config file.
Using this configuration file:
$ cat conf.yml
storage:
cls: pipeline
steps:
- cls: masking
masking_db: postgresql:///?service=swh-masking-proxy
- cls: buffer
- cls: postgresql
db: postgresql://user:passwd@pghost:5433/swh-storage
objstorage:
cls: memory
scheduler:
cls: postgresql
db: postgresql:///?service=swh-scheduler
For each swh db
command, the main argument (previously only the backend 'module') can be:
- a (swh) package name without the
--all
option: this is the bw compat mode, for which the config entry is looked for in the config file using the current logic (esp. use the last entry of a pipeline, if any) - a (swh) package name with the
--all
option: run the command for all the backends found in the config file under the package section, - the actual backend (using the syntax ':', e.g.
storage:masking
), - a "path" to target which config entry (from the config file) should be used, like
storage.steps.0
(see the config example above), - when the
--dbname
is used, the config file is not used at all: it's an explicit db connection (libpq) string.
Examples:
$ export SWH_CONFIG_FILENAME=conf.yml
$ swh db init -p storage.steps.0
$ swh db init --all storage
$ swh db init --dbname postgresql:///?service=swh-scrubber scrubber # this has no entry in the config file
$ swh db init --dbname postgresql:///?service=masking swh:masking # this does not look into the config file
Or
$ swh db version -a storage
module: storage:masking
current code version: 194
version: 194
module: storage:postgresql
flavor: default
current code version: 193
version: 193
This should help deployment in integration testing environments (like docker), especially with cases like the example above where the storage consists in several postgresql-backend layers.
It de facto deprecates the usage of the --module-config-key
option.
This MR also change the way the dbmodule
table is managed: we used to only store the package name (e.g. 'storage' for the postgresql
backend of the swh.storage
) with some implicit business logic, or the module name (e.g. 'storage.proxies.masking'; without the swh.
prefix). We now store the backend as <package>:<cls>
where <package>
is the swh.<package>
in which the backend is defined, and <cls>
is the value of the cls
config entry for said backend. It generalizes the idea of declaring these cls
in swh.<package>.classes
entry points.
The swh db upgrade
should take care of updating the dbmodule
table accordingly.
Merge request reports
Activity
Jenkins job DCORE/gitlab-builds #236 succeeded in 1 min 11 sec.
See Console Output, Blue Ocean and Coverage Report for more details.Jenkins job DCORE/gitlab-builds #237 succeeded in 55 sec.
See Console Output, Blue Ocean and Coverage Report for more details.- Resolved by David Douard
For what I understand of the diff, the default behavior doesn't seem to be impacted, but if it's the case in the future, please notify the sysadmin team.
This piece of script is used to detect and/or migrate the databases when a pod is started (ping @ardumont). For example: https://gitlab.softwareheritage.org/swh/infra/ci-cd/swh-charts/-/blob/2c0faa911e3a6736c92e1e67452d0b43dd98fd3d/swh/templates/utils/backend-utils.yaml#L311
Do you think it would be possible to implement an equivalent to list all the versions of all the modules for the
version
command?What to you think too about standardizing the database configuration entry in
db:
. It will allow to identify the module by the class to have something like:$ cat conf.yml storage: cls: pipeline steps: - cls: masking db: postgresql:///?service=swh-masking-proxy <--- Changed here - cls: buffer - cls: postgresql db: postgresql://user:passwd@pghost:5433/swh-storage objstorage: cls: memory $ swh db -C conf.yml init masking <---- Changed here
It only works if all the steps always have different classes.
mentioned in merge request swh-scrubber!74 (merged)
After this discussion this MR probably needs more work.
FTR the comment was:
Then I think you need two changes to the entrypoint:
- rename it, so it's in the swh.core namespace
- actually point to get_datastore, so swh.core does not introspect it someone from the class
- it's not just a matter of renaming the entrypoint group, because (as it is today) a "swh.xxx.classes" entry point capture both the "it's a backend for the xxx package" (the side used by the
get_xxx()
functions in some swh packages) and is used for the fact that some of these backend are indeedswh.db
based, and can thus profit from common lifecycle management features and tools. This MR does not really change this "design". - use a better backend declaration API (using new entrypoints rather than existing ones) to depend less on introspection and base the logic on a more declarative approach would be nice indeed, but it's a much more involved work, not sure I really want to start it now (I played a bit with the idea, but I'm afraid it's a too deep rabbit hole).
So I suggest we stick to this far-from-ideal proposal which should nonetheless greatly help and come back to a better solution if needed in the future.
Edited by David Douardadded 6 commits
- 793ef062 - Make 'swh db init(-admin)' support reading the db cnx string from anywhere in the config file
- 4d122d09 - Make 'swh db version' and 'updgrade' also read cnx from config file
- d61cc496 - cli: replace '--all' option in 'db version' with '--history'
- cb5ea3c2 - cli: also look for cnx string in the config file in 'db create'
- 8be91f87 - cli: use the same '--all' flag for all db commands acting on all known backends
- 69d77ff4 - db: lower a warning statement to an info
Toggle commit listJenkins job DCORE/gitlab-builds #238 succeeded in 56 sec.
See Console Output, Blue Ocean and Coverage Report for more details.Jenkins job DCORE/gitlab-builds #239 succeeded in 54 sec.
See Console Output, Blue Ocean and Coverage Report for more details.- Resolved by David Douard
So now:
- all the db commands (but a couple of very niche) can be used with a '--all' options to operate on all the db backends found in the config file under the given swh package (module).
- the module name registered in the dbmodule table is now the fully qualified module name implementing the backend (i.e. 'swh.storage.postgresql.storage' instead of just 'storage')
- for the time being, it should be completely bw compatible with existing versions of swh packages using swh.core.db
- it's not clear yet if / how to execute the "db migration" needed to update the content of the dbmodule table.
But now that I'm looking at this MR I feel like it would make more sense to use entries like "swh.storage:postgresql" as entries in this table (i.e. the backend as declared in entry points and used when reading config files rather than the module full path). It would make it easier to detect an entry in this dbmodule table needs be updated at the very least.
Edited by David Douard
- Resolved by David Douard
I have not analyzed the code in details but this looks better that what we have now. If I got the intent right (but the MR summary should probably be edited), at least the database initialization phase in the swh/devel/docker> will be easier to understand without affecting the rest.