Is this supposed to be persistent (and keep the full history of all messages), or transient (and used for "real-time" clients)? IOW, what are the storage requirements for this?
Where is the list of topics that need to be created?
I think we should definitely use a different prefix than swh.storage's, as the ACLs for third parties should be separate.
It's unclear what the prefix should be. swh.storage uses swh.journal.objects; we can either use that one too, or a new one, e.g. swh.journal.indexed.
So, heads up: the topic prefix swh.journal.indexed has been chosen and declared in the current staging diff infra/puppet/puppet-swh-site!267.
In #2780 (closed), @olasd wrote:
Is this supposed to be persistent (and keep the full history of all messages), or transient (and used for "real-time" clients)? IOW, what are the storage requirements for this?
I'd say transient, as we can always recompute it. But this means backfilling the journal every time we add a new client that needs to get all the messages, so I don't know.
Where is the list of topics that need to be created?
I propose meeting in the middle and having the following policies:
content topics: transient, bound by volume
revision / origin topics: persistent
I expect the content topics to be the most "volatile" and heavy, and the revision / origin topics to be the most useful to keep in the long term for third party clients.
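As a minimal sketch of what those per-topic policies could look like with the Kafka CLI (the topic names and the size bound are placeholders, and the actual values would rather be managed through puppet):
# content topics: plain deletion, bounded by volume (placeholder value)
/opt/kafka/bin/kafka-configs.sh --bootstrap-server $SERVER --alter \
  --entity-type topics --entity-name swh.journal.indexed.content_mimetype \
  --add-config cleanup.policy=delete,retention.bytes=107374182400
# revision / origin topics: log compaction, kept in the long term
/opt/kafka/bin/kafka-configs.sh --bootstrap-server $SERVER --alter \
  --entity-type topics --entity-name swh.journal.indexed.origin_intrinsic_metadata \
  --add-config cleanup.policy=compact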
Back up the offsets as well, just in case (migrated/migration$941). Puppet will then restart swh-search-journal-client@objects...
(so if something goes wrong down the line, we can reset those alongside the backup index we'd restore).
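A minimal sketch of that offsets backup, assuming the journal client's consumer group is named swh.search.journal_client (an assumption; the actual group id should be checked in the service configuration):
# dump the current offsets (and lag) of the consumer group to a file, just in case
/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER \
  --describe --group swh.search.journal_client > offsets-backup-$(date +%Y%m%d).txt
If a reset is needed later, kafka-consumer-groups.sh --reset-offsets (e.g. with --to-earliest or --to-offset) can bring the group back to a known position.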
Now, after landing and deploying the diff above, apply it and check that everything runs fine:
Snapshot of the current indices' status:
ardumont@search0:~% curl http://search-esnode0.internal.staging.swh.network:9200/_cat/indices\?v
health status index                       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin                      xBl67YKsQbWAt7V78UeDLA  80   0     496619         5145    348.7mb        348.7mb
green  open   origin-backup-20210209-1736 P1CKjXW0QiWM5zlzX46-fg  80   0     496619            0    156.6mb        156.6mb
After deployment, everything is running fine.
BUT the index is growing quite large and fast...
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin xBl67YKsQbWAt7V78UeDLA  80   0     622296        54024        1gb            1gb
Note: the consumer group's lag is subsiding (as expected).
Note: regarding the partitioning (only 1 partition here), we'll need to create the consumer group first-hand to get a better partition configuration for production.
And the index size stabilized at 1GB (up from an initial 156.6MB).
health status index                       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin                      xBl67YKsQbWAt7V78UeDLA  80   0     786803        85285        1gb            1gb
green  open   origin-backup-20210209-1736 P1CKjXW0QiWM5zlzX46-fg  80   0     496619            0    156.6mb        156.6mb
Note that the "docs.count" grew though (from 496619 to 786803) and the reason are
unclear.
The same index is used to store the metadata out of the indexer with the same origin url
as key [1] and we are computing index metadata on origins already seen (thus already present
in the index afaiui). So I would have expect the docs.count stay roughly (or even
exactly?) the same as before?
[1] well the sha1 of the origin computed by search but still
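One possible explanation (not confirmed): _cat/indices reports Lucene-level document counts, which also include nested documents, so if some of the intrinsic-metadata fields are mapped as nested, updating existing origin documents would still inflate docs.count. The number of top-level documents can be cross-checked with the _count API, e.g.:
curl http://search-esnode0.internal.staging.swh.network:9200/origin/_count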
Note that the "docs.count" grew though (from 496619 to 786803) and the reason are
unclear.
The same index is used to store the metadata out of the indexer with the same origin url
as key [1] and we are computing index metadata on origins already seen (thus already present
in the index afaiui). So I would have expect the docs.count stay roughly (or even
exactly?) the same as before?
root@kafka1:~# for topic in content_mimetype content_language content_ctags content_fossology_license content_metadata revision_intrinsic_metadata origin_intrinsic_metadata; do
>   /opt/kafka/bin/kafka-topics.sh --bootstrap-server $SERVER --create --config cleanup.policy=compact --partitions 256 --replication-factor 2 --topic "swh.journal.indexed.$topic"
> done
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic swh.journal.indexed.content_mimetype.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic swh.journal.indexed.content_language.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic swh.journal.indexed.content_ctags.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic swh.journal.indexed.content_fossology_license.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic swh.journal.indexed.content_metadata.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic swh.journal.indexed.revision_intrinsic_metadata.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic swh.journal.indexed.origin_intrinsic_metadata.
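To double-check the resulting configuration (partition count, replication factor, cleanup policy), the newly created topics can be described, e.g.:
/opt/kafka/bin/kafka-topics.sh --bootstrap-server $SERVER --describe \
  --topic swh.journal.indexed.origin_intrinsic_metadata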
Only noticing now that we have only one indexer currently running in staging
(so only one topic is currently being written to there).
So some more indexers were deployed there to check that the journal is holding up OK (it is [1]).