Project 'infra/sysadm-environment' was moved to 'swh/infra/sysadm-environment'. Please update any links and bookmarks that may still have the old path.
Migrate origin_visit_status records to add the type value
Once the new scheduler pattern is deployed.
We'll need to align the origin_visit_status data records to actually have their type filled in.
That means:
modify the schema to add the visit_type column
sql file to migrate the data from origin_visit to origin_visit_status
Check impacts on storage clients (i.e. loaders) (already dealt with indeed)
Data to migrate 1239705800 rows out of 1259171054:
Production status (mirror):#+begin_src sqlsoftwareheritage=> select now(), count(*) from origin_visit_status; now | count-------------------------------+------------ 2021-02-02 09:47:15.904791+00 | 1259171054(1 row)softwareheritage=> select now(), count(*) from origin_visit_status where type is null; now | count-------------------------------+------------ 2021-02-02 09:53:08.903614+00 | 1239705800(1 row)#+end_src
Deliverables migrate both:
staging db (as it will be used to exercise the migration script and it must be up-to-date as well)
explain update origin_visit_status as ovsset type=ov.typefrom origin_visit ovwhere ov.visit=ovs.visit and ov.origin=ovs.originand ovs.type is null;
But that will take too much time in one big transaction so, we need to split this and run it in an incremental fashion:
So either looping over the ranges on origin (which might be more or less fast, depending
on the number of visits per origin and that can vary a bit):
explain update origin_visit_status as ovsset type=ov.typefrom origin_visit ovwhere ov.visit=ovs.visit and ov.origin=ovs.originand ovs.type is null;and 0 <= ov.origin and ov.origin < 10000;
or, using the following command which will only work as long as there is stuff to do:
explain with selected_origin as ( select ov.origin, ov.visit, ov.type from origin_visit_status ovs inner join origin_visit ov using (origin, visit) where ovs.type is null limit 100000)update origin_visit_status as ovsset type=s.typefrom selected_origin swhere ovs.origin=s.origin and ovs.visit=s.visit;
Out of staging, status:
swh=> select ov.type, count(*) from origin_visit_status ovs inner join origin_visit ov using (origin, visit) where ovs.type is null group by ov.type; type | count---------+---------- deb | 2901943 deposit | 58 git | 118004 hg | 2 nixguix | 3 npm | 165363 pypi | 12829210 svn | 1(8 rows)
And most importantly, after some runs on staging, everything seems to stay consistent:
swh=> select ov.origin, ov.visit, ovs.type, ov.type from origin_visit_status ovs inner join origin_visit ov using (origin, visit) where ovs.type is not null and ovs.type != ov.type limit 10; origin | visit | type | type--------+-------+------+------(0 rows)
Projecting this, if no blocking transactions gets in the way, migration should take roughly 6 hours:
#!/usr/bin/env bash# set -xset -eSQL_FILE=/tmp/t2968-migrate-ovs.sqlLOG_FILE=/tmp/t2968-migrate-ovs.logLIMIT=100000cat > $SQL_FILE <<EOF \timing with selected_origin as ( select ov.origin, ov.visit, ov.type from origin_visit_status ovs inner join origin_visit ov using (origin, visit) where ovs.type is null limit ${LIMIT} ) update origin_visit_status as ovs set type=s.type from selected_origin s where ovs.origin=s.origin and ovs.visit=s.visit;EOFwhile true; do psql service=admin-staging-swh -f $SQL_FILE | tee $LOG_FILE grep -q "UPDATE 0" $LOG_FILE && breakdone
The select subquery to determine data to migrate is taking more time at each loop
it runs ([1] to see the ever increasing reading time).
So the initial 5 min projection was without taking this into account.
Nonetheless, everything got migrated in staging now without issues [2]
We may have yet some improvments on that query so it passes more smoothly in production.
But that should not block starting migrating this way first.
[1] migrated/migration$932
[2]
swh=> select ov.type, count(*) from origin_visit_status ovs inner join origin_visit ov using (origin, visit) where ovs.type is null group by ov.type; type | count------+-------(0 rows)swh=> select ov.origin, ov.visit, ovs.type, ov.type from origin_visit_status ovs inner join origin_visit ov using (origin, visit) where ovs.type is not null and ovs.type != ov.type limit 10; origin | visit | type | type--------+-------+------+------(0 rows)
Improved version (moving the join part within the update part of the query):
#!/usr/bin/env bash# set -xset -eSQL_FILE=/tmp/t2968-migrate-ovs.sqlLOG_FILE=/tmp/t2968-migrate-ovs.logcat > $SQL_FILE <<EOF \timing with selected_origin as ( select distinct origin, visit from origin_visit_status where type is null limit 100000 ) update origin_visit_status as ovs set type=ov.type from selected_origin s inner join origin_visit ov on s.origin=ov.origin and s.visit=ov.visit where ovs.origin=s.origin and ovs.visit=s.visit;EOFwhile true; do psql service=admin-swh -f $SQL_FILE | tee $LOG_FILE grep -q "UPDATE 0" $LOG_FILE && breakdone
So the "improved query keeps on taking more time from 15s initially to cranking up
around 2 min now (after running from yesterday up to now).
Timing is on.UPDATE 102309Time: 15301.318 ms (00:15.301)Timing is on.UPDATE 103198Time: 15691.569 ms (00:15.692)Timing is on.UPDATE 108097Time: 15471.194 ms (00:15.471)Timing is on.UPDATE 108052Time: 15596.098 ms (00:15.596)...... night ......Timing is on.UPDATE 107813Time: 146027.771 ms (02:26.028)Timing is on.UPDATE 107355Time: 144460.339 ms (02:24.460)Timing is on.UPDATE 108683Time: 138898.824 ms (02:18.899)Timing is on.UPDATE 107206Time: 143704.396 ms (02:23.704)Timing is on.UPDATE 108893Time: 139246.971 ms (02:19.247)
We got ~200M rows migrated now.
So looking into the first approach (early on this ticket) and testing:
update origin_visit_status as ovsset type=ov.typefrom origin_visit ovwhere ov.visit=ovs.visit and ov.origin=ovs.originand ovs.type is null-- and 190000 <= ov.origin and ov.origin < 200000; -- stagingand 100000000 <= ov.origin and ov.origin < 100010000; -- prod
In a transaction, this was almost instantaneous
... UPDATE 10002 Time: 184.027 ms
So on to some adaptations:
#!/usr/bin/env bash# select min(origin) from origin_visit_status where type is null;first=23000000SQL_FILE=/tmp/t2968-migrate-ovs-v2.sqlLOG_FILE=/tmp/t2968-migrate-ovs-v2.loginc=100000# select max(origin) from origin_visit_status where type is null;FULL_STOP=151900000start=$firstwhile true; do end=$(( $start + $inc )) echo "### range: [$start, $end]" cat > $SQL_FILE <<EOF\timingupdate origin_visit_status as ovsset type=ov.typefrom origin_visit ovwhere ov.visit=ovs.visit and ov.origin=ovs.originand ovs.type is nulland $start <= ov.origin and ov.origin < $end;EOF start=$(( $start + $inc )) psql service=admin-swh -f $SQL_FILE | tee $LOG_FILE # only $FULL_STOP origins so break if [ $start -gt $FULL_STOP ]; then break fidone
It's currently running and it's migrating faster now:
...### range: [29700000, 29800000]Timing is on.UPDATE 842734Time: 12607.146 ms (00:12.607)### range: [29800000, 29900000]Timing is on.UPDATE 704757Time: 11942.186 ms (00:11.942)### range: [29900000, 30000000]Timing is on.UPDATE 787165Time: 48050.921 ms (00:48.051)### range: [30000000, 30100000]Timing is on.
### range: [151800000, 151900000]Timing is on.UPDATE 0Time: 3085.767 ms (00:03.086)### range: [151900000, 152000000]Timing is on.UPDATE 0Time: 6.366 ms
And the replication kept up and we have no more visit status without type:
softwareheritage=> \conninfoYou are connected to database "softwareheritage" as user "guest" on host "somerset.internal.softwareheritage.org" (address "192.168.100.103") at port "5432".SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)softwareheritage=> select now(), count(*) from origin_visit_status where type is null; now | count-------------------------------+--------- 2021-02-04 07:11:23.214145+00 | 7702062(1 row)softwareheritage=> select now(), count(*) from origin_visit_status where type is null; now | count-------------------------------+------- 2021-02-04 07:51:41.390543+00 | 0(1 row)
Another final check is running to ensure we have no discrepancy between origin-visit and
origin-visit-status (run executed already once in a while with no discrepancy so far).
softwareheritage=> select now(), count(*) from origin_visit_status ovs inner join origin_visit ov using (origin, visit) where ovs.type != ov.type; now | count-------------------------------+------- 2021-02-04 08:50:19.833201+00 | 0(1 row)