reset the offset of the swh-search client on the swh.journal.objects.origin_visit_status topic
This is correct, but done this way, the search will not be completely accurate until the backfill of the origin_visit_status topic is finished.
This is not important for now, as swh-search is not really used in production, but it is an opportunity to test a way to do such a reindexing smoothly.
Regarding the index rebuilding process, a naive approach using an alias that points to both the old and the new index[1] returns duplicate results when a search is performed.
Using an alias that points only to the old index, building the new index, and then switching the alias to the new index[2] can be a first approach, with the drawback that the old index will not be updated until the alias is switched to the new index.
It also requires that the swh-search code be able to use different names for read and write operations.
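As a rough illustration of the second approach, the sketch below builds the body of an Elasticsearch `POST /_aliases` request that removes the alias from the old index and adds it to the new one in a single atomic call. The alias and index names (`origin-read`, `origin-v1`, `origin-v2`) are hypothetical and are not the names used by swh-search; the request is only sent when `ES_SERVER` is set.

```shell
#!/bin/sh
# Sketch only: origin-read, origin-v1 and origin-v2 are hypothetical names.

# Build the JSON body for POST /_aliases.
# Both actions in one request are applied atomically by Elasticsearch.
alias_switch_payload() {
  alias_name=$1; old_index=$2; new_index=$3
  printf '{"actions":[{"remove":{"index":"%s","alias":"%s"}},{"add":{"index":"%s","alias":"%s"}}]}\n' \
    "$old_index" "$alias_name" "$new_index" "$alias_name"
}

# Send the switch if ES_SERVER is set; otherwise just print the payload (dry run).
switch_alias() {
  payload=$(alias_switch_payload "$@")
  if [ -n "${ES_SERVER:-}" ]; then
    curl -s -XPOST "http://$ES_SERVER/_aliases" \
      -H 'Content-Type: application/json' -d "$payload"
  else
    printf '%s\n' "$payload"
  fi
}

# Usage (hypothetical names):
#   switch_alias origin-read origin-v1 origin-v2
```

With this layout, reads would go through the alias while the journal client writes to the new index by name, which is why separate read and write names are needed.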
export SERVER=journal0.internal.staging.swh.network:9092
./kafka-consumer-groups.sh --bootstrap-server $SERVER --delete --group swh.search.journal_client
Deletion of requested consumer groups ('swh.search.journal_client') was successful.
The filter on visited origins is working correctly on staging. The has_visit flag looks good.
For example for the https://www.npmjs.com/package/@ehmicky/dev-tasks origin
% export ES_SERVER=search-esnode1.internal.softwareheritage.org:9200
% curl -s http://$ES_SERVER/_cat/indices\?v
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin Mq8dnlpuRXO4yYoC6CTuQw  90   1  151716299     38861934    260.8gb          131gb
% curl -XDELETE http://$ES_SERVER/origin
{"acknowledged":true}
% swh search --config-file /etc/softwareheritage/search/server.yml initialize
INFO:elasticsearch:PUT http://search-esnode1.internal.softwareheritage.org:9200/origin [status:200 request:2.216s]
INFO:elasticsearch:PUT http://search-esnode3.internal.softwareheritage.org:9200/origin/_mapping [status:200 request:0.151s]
Done.
% curl -s http://$ES_SERVER/_cat/indices\?v
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin yFaqPPCnRFCnc5AA6Ah8lw  90   1          0            0     36.5kb         18.2kb
journal client's consumer group deleted:
% export SERVER=kafka1.internal.softwareheritage.org:9092
% ./kafka-consumer-groups.sh --bootstrap-server ${SERVER} --delete --group swh.search.journal_client
Deletion of requested consumer groups ('swh.search.journal_client') was successful.
journal client restarted
puppet enabled
The journal client ingestion is in progress:
% curl -s http://$ES_SERVER/_cat/indices\?v
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin yFaqPPCnRFCnc5AA6Ah8lw  90   1      42184            0     45.6mb         42.7mb
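To follow the progress without reading the whole table, the document count can be extracted from the `_cat/indices?v` output. The helper below is a minimal sketch; `docs_count` is a hypothetical name, and it assumes the header row is present (the `v` flag) so that columns can be located by name.

```shell
#!/bin/sh
# Sketch only: docs_count is a hypothetical helper, not part of swh-search.

# Read `_cat/indices?v` output on stdin and print docs.count for index $1.
docs_count() {
  awk -v idx="$1" '
    # Header row: remember where the "index" and "docs.count" columns are.
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "index") ic = i
        if ($i == "docs.count") dc = i
      }
      next
    }
    # Data rows: print the count for the requested index.
    ic && dc && $ic == idx { print $dc }
  '
}

# Usage:
#   curl -s "http://$ES_SERVER/_cat/indices?v" | docs_count origin
```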
The journal_client has almost finished ingesting the topics[1] it listens to. It took somewhat longer because a backfill of origin_visit_status was launched for #2993 (closed).
It should be done by the end of the day.
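One way to estimate how much remains is to sum the consumer group lag reported by `kafka-consumer-groups.sh --describe`. The sketch below assumes the output has a header row containing a `LAG` column; `sum_lag` is a hypothetical helper name, and rows whose lag is not numeric (for example `-`) are ignored.

```shell
#!/bin/sh
# Sketch only: sum_lag is a hypothetical helper.

# Read `kafka-consumer-groups.sh --describe` output on stdin and print total lag.
sum_lag() {
  awk '
    # Until the header is seen, look for the LAG column and skip the line.
    lag_col == 0 {
      for (i = 1; i <= NF; i++) if ($i == "LAG") lag_col = i
      next
    }
    # After the header, add up numeric lag values only.
    $lag_col ~ /^[0-9]+$/ { total += $lag_col }
    END { print total + 0 }
  '
}

# Usage:
#   ./kafka-consumer-groups.sh --bootstrap-server "$SERVER" --describe \
#     --group swh.search.journal_client | sum_lag
```

A total close to zero would suggest the journal client has caught up with the topics.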