Ingest more Mbed mercurial repositories
The approach used in #5363 (closed) discovered 10k repos on os.mbed.com, but the ArchiveBot job 7bjndtsczcrvgksnc6f9r3dwb discovered 60k repos. So please compare the list from #5363 (closed) with the attached list generated by the commands below, then archive the repos that are missing.
wget https://archive.org/download/archiveteam_archivebot_go_20241115062543_eb298f37/os.mbed.com-inf-20240711-052514-7bjnd-meta.warc.gz
zgrep -oE 'https?://os\.mbed\.com/(teams|users)/[^/’ ]+/code/[^/’ ]+/' os.mbed.com-inf-20240711-052514-7bjnd-meta.warc.gz | sed 's@^http://@https://@' | awk '!seen[$0]++' > os.mbed.com-mercurial-repos-from-archivebot-job-202407110525147bjnd-7bjndtsczcrvgksnc6f9r3dwb.txt
sort -u os.mbed.com-mercurial-repos-from-archivebot-job-202407110525147bjnd-7bjndtsczcrvgksnc6f9r3dwb.txt | sponge os.mbed.com-mercurial-repos-from-archivebot-job-202407110525147bjnd-7bjndtsczcrvgksnc6f9r3dwb.txt
zstd -0 os.mbed.com-mercurial-repos-from-archivebot-job-202407110525147bjnd-7bjndtsczcrvgksnc6f9r3dwb.txt
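The comparison requested above could be done with `comm` on two sorted lists. This is only a minimal sketch: the function name `missing_repos` and the name of the file holding the #5363 list are hypothetical, and both files are assumed to contain one normalized URL per line.

```shell
# Print URLs that are in the new list but not in the old one.
# Both arguments are paths to files with one URL per line (hypothetical names).
missing_repos() {
    old_list=$1
    new_list=$2
    tmpdir=$(mktemp -d) || return 1
    # comm requires sorted input; LC_ALL=C keeps sort and comm ordering consistent.
    LC_ALL=C sort -u "$old_list" > "$tmpdir/old"
    LC_ALL=C sort -u "$new_list" > "$tmpdir/new"
    # -13 hides lines unique to the old list and lines common to both.
    LC_ALL=C comm -13 "$tmpdir/old" "$tmpdir/new"
    rm -rf "$tmpdir"
}
```

For example, `missing_repos repos-from-5363.txt os.mbed.com-mercurial-repos-from-archivebot-job-202407110525147bjnd-7bjndtsczcrvgksnc6f9r3dwb.txt > missing.txt`, where `repos-from-5363.txt` stands in for wherever the earlier list is stored.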
You can also see the recent activity on the site. It is fairly inactive, but I'll submit Save Code Now (SCN) requests if there are any further changes there.
PS: some of the repos will be inaccessible, since ArchiveBot discovered some private repos that had public forks.
Activity
- Paul Wise changed the description
- Vincent Sellier assigned to @vsellier
- Owner
No need to bother with a comparison. Let's launch a full reimport; the unchanged repositories will be quickly skipped.
It's also an opportunity to launch it with the new bulk ingest API.
Tested in staging on a subset of 10 repositories:
cat os.mbed.com-mercurial-repos-from-archivebot-job-202407110525147bjnd-7bjndtsczcrvgksnc6f9r3dwb.txt | awk '{print "\""$1"\",\"hg\""}' > repos.csv
head -n 10 repos.csv > staging.csv
curl -H "Content-type: text/csv" -H "Authorization: Bearer ${TOKEN}" --data-binary @./staging.csv https://webapp.staging.swh.network/api/1/origin/save/bulk/
The result:
curl -s -H "Authorization: Bearer ${TOKEN}" https://webapp.staging.swh.network/api/1/origin/save/bulk/request/5f16da3b-0818-41c3-aa94-7896ae14c124/ | jq -r '.[]|"\(.origin_url) \(.status) \(.last_visit_date) \(.last_visit_status)"'
https://os.mbed.com/teams/00011001/code/Lab2_2/ accepted 2024-11-19T15:16:45.864608+00:00 successful
https://os.mbed.com/teams/00011001/code/Project1/ accepted 2024-11-19T15:16:49.090563+00:00 successful
https://os.mbed.com/teams/12_han_meiji/code/10_2_ifelse_lighter/ accepted 2024-11-19T15:16:50.851204+00:00 successful
https://os.mbed.com/teams/12_han_meiji/code/10_if_else_control/ accepted 2024-11-19T15:16:52.651444+00:00 successful
https://os.mbed.com/teams/12_han_meiji/code/3_1/ accepted 2024-11-19T15:16:54.511381+00:00 successful
https://os.mbed.com/teams/12_han_meiji/code/5_3/ accepted 2024-11-19T15:16:56.349540+00:00 successful
https://os.mbed.com/teams/12_han_meiji/code/7_1_traveling_ver3/ accepted 2024-11-19T15:16:58.167917+00:00 successful
https://os.mbed.com/teams/12_han_meiji/code/7_1_traveling_ver3_2/ accepted 2024-11-19T15:17:00.027956+00:00 successful
https://os.mbed.com/teams/12_han_meiji/code/7_1_traveling_ver3_3_copy/ accepted 2024-11-19T15:17:01.737845+00:00 successful
https://os.mbed.com/teams/12_han_meiji/code/7_1_traveling_ver3_3_ver1114/ accepted 2024-11-19T15:17:03.583537+00:00 successful
Let's launch the full list in production:
wc -l repos.csv
60531 repos.csv
- Vincent Sellier mentioned in issue swh/devel/swh-web#4818 (closed)
- Owner
Currently, the process is blocked due to swh/devel/swh-web#4818 (closed).
Let's wait for a fix.
curl -H "Content-type: text/csv" -H "Authorization: Bearer ${TOKEN}" --data-binary @./repos.csv https://archive.softwareheritage.org/api/1/origin/save/bulk/
{"exception":"DataError","reason":"value too long for type character varying(200)\n"}
Those really long lines look like they are double-URL-encoded. The following set of commands confirms that the correctly encoded form of each one is also present in the list, and that the correctly encoded form is shorter than the limit. So I think you can drop the long lines from the list.
$ zgrep -F '%25' os.mbed.com-mercurial-repos-from-archivebot-job-202407110525147bjnd-7bjndtsczcrvgksnc6f9r3dwb.txt.zst | sed s/%25/%/g | while read -r line ; do
    zgrep -q -F "$line" os.mbed.com-mercurial-repos-from-archivebot-job-202407110525147bjnd-7bjndtsczcrvgksnc6f9r3dwb.txt.zst || echo "$line missing"
  done
$ zgrep -P '.{200}' os.mbed.com-mercurial-repos-from-archivebot-job-202407110525147bjnd-7bjndtsczcrvgksnc6f9r3dwb.txt.zst
https://os.mbed.com/teams/YKNCT/code/%25E3%2582%25A4%25E3%2583%25B3%25E3%2582%25AF%25E3%2583%25AA%25E3%2583%25A1%25E3%2583%25B3%25E3%2582%25BF%25E3%2583%25AB%25E3%2582%25A8%25E3%2583%25B3%25E3%2582%25B3%25E3%2583%25BC%25E3%2582%25BF%25E3%2583%25BC/
https://os.mbed.com/users/suriyon/code/%25E3%2582%25BF%25E3%2583%25B3%25E3%2582%25B9%25E3%2583%25AA%25E3%2583%25A4%25E3%2583%259B%25E3%2583%25B3-%25E3%2582%25B9%25E3%2583%25AA%25E3%2583%25A8%25E3%2583%25B3/
https://os.mbed.com/users/taisyou/code/%25E3%2582%25A4%25E3%2583%25B3%25E3%2582%25AF%25E3%2583%25AA%25E3%2583%25A1%25E3%2583%25B3%25E3%2582%25BF%25E3%2583%25AB%25E3%2582%25A8%25E3%2583%25B3%25E3%2582%25B3%25E3%2583%25BC%25E3%2583%25BF%25E3%2583%25BC/
$ zgrep -F '%25' os.mbed.com-mercurial-repos-from-archivebot-job-202407110525147bjnd-7bjndtsczcrvgksnc6f9r3dwb.txt.zst | sed s/%25/%/g | grep -P '.{200}'
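Dropping the overlong lines before building the CSV could look like the sketch below. The function name `drop_overlong_urls` is hypothetical, and the 200-character cutoff is an assumption taken from the `character varying(200)` error above; the URLs are assumed to be plain ASCII, so byte and character lengths agree.

```shell
# Read URLs on stdin, one per line, and keep only those that fit in 200 characters.
# The limit is assumed from the "character varying(200)" error message.
drop_overlong_urls() {
    awk 'length($0) <= 200'
}
```

It would slot in before the CSV conversion, for example `drop_overlong_urls < list.txt | awk '{print "\""$1"\",\"hg\""}' > repos.csv`, with `list.txt` standing in for the uncompressed repository list.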
- Owner
Thanks. The problematic URLs had already been identified, but the goal is to use the bulk-save feature as an ordinary user to battle-test it and identify edge cases, which is working ;)
- Owner
The fix is deployed in production.
The URL checking is in progress:
...
https://os.mbed.com/teams/%25E9%25BB%2583%25E5%25BA%25AD/code/2019-1-6_mqtt_receiving3/ rejected null null
https://os.mbed.com/teams/%25E9%25BB%2583%25E5%25BA%25AD/code/2019-12-31mqtt_thingsboard1/ rejected null null
https://os.mbed.com/teams/%25E9%25BB%2583%25E5%25BA%25AD/code/2020_09_10mqtt_thingsboard1/ rejected null null
https://os.mbed.com/teams/%25E9%25BB%2583%25E5%25BA%25AD/code/3/ rejected null null
https://os.mbed.com/teams/%25E9%25BB%2583%25E5%25BA%25AD/code/4/ rejected null null
https://os.mbed.com/teams/%25E9%25BB%2583%25E5%25BA%25AD/code/test/ rejected null null
https://os.mbed.com/teams/%E3%83%AD%E3%83%9B%E3%82%B9%E3%83%868%E6%9C%9F/code/denjiben/ accepted null null
https://os.mbed.com/teams/%E3%83%AD%E3%83%9B%E3%82%B9%E3%83%868%E6%9C%9F/code/teratermtest/ accepted null null
https://os.mbed.com/teams/%E4%BD%8D%E7%BD%AE%E8%AA%8D%E8%AD%98%E8%A3%85%E7%BD%AE2/code/test1/ accepted null null
https://os.mbed.com/teams/%E4%BD%8D%E7%BD%AE%E8%AA%8D%E8%AD%98%E8%A3%85%E7%BD%AE21/code/receiver_Original_10TimesSaved_position_/ accepted null null
https://os.mbed.com/teams/%E4%BD%8D%E7%BD%AE%E8%AA%8D%E8%AD%98%E8%A3%85%E7%BD%AE212/code/receiver_Original_10TimesSaved_position_/ accepted null null
https://os.mbed.com/teams/%E5%AE%9F%E9%A8%93%E7%94%A8/code/bunaitaikou_2/ accepted null null
https://os.mbed.com/teams/%E5%AE%9F%E9%A8%93%E7%94%A8/code/bunaitaikou_2019/ accepted null null
https://os.mbed.com/teams/%E9%9B%BB%E6%B0%97/code/%E9%9B%BB%E6%B0%97/ accepted null null
https://os.mbed.com/teams/%E9%9B%BB%E6%B0%97/code/l/ accepted null null
...
Edited by Vincent Sellier
- Vincent Sellier marked this issue as related to #5499 (closed)
- Vincent Sellier mentioned in commit swh/infra/ci-cd/swh-charts@43284e7f
- Vincent Sellier mentioned in commit swh/infra/ci-cd/swh-charts@7646a9ad
- Owner
The listing took ~12 hours. The loading is in progress:
7 accepted not_found
104 accepted null
6407 accepted successful
51709 pending null
2304 rejected null
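A breakdown like this one could be derived from the per-origin report lines printed by the `jq` command shown earlier. The sketch below assumes each input line has exactly four whitespace-separated fields (URL, status, last visit date, last visit status); the function name `tally_statuses` is hypothetical.

```shell
# Read "url status last_visit_date last_visit_status" lines on stdin and
# print one "count status last_visit_status" line per distinct combination.
tally_statuses() {
    awk '{ count[$2 " " $4]++ }
         END { for (key in count) print count[key], key }' |
        LC_ALL=C sort -k2   # order by status names rather than by count
}
```

Piping the `jq` output of a bulk request into `tally_statuses` would give one line per (status, visit status) pair.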
- Vincent Sellier marked this issue as related to swh/devel/swh-web#4819 (closed)
- Owner
Bulk ingest done:
60531 listed
4 accepted failed
34 accepted not_found
58188 accepted successful
1 pending null
2304 rejected null
The last pending one, https://os.mbed.com/users/mbed_official/code/mbed/, was triggered via an SCN request.
- Vincent Sellier closed
- Owner
Sure, we talked about them last week with @anlambert. It seems most of the errors are temporary connectivity issues.
I triggered a new loading from the last statuses (unfortunately I don't have access to the initial statuses, just the list of repositories):
158 listed
5 accepted failed
11 accepted not_found
140 accepted successful
2 pending null
I'm checking the remaining ones in error.
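Selecting the origins to resubmit from a status report can be sketched as follows. This is an illustration only, not the procedure actually used here: it assumes the four-field report format shown earlier, treats `failed` and `not_found` visits as retry candidates, and emits the same `"url","hg"` CSV rows as the original submission. The function name `retry_csv` is hypothetical.

```shell
# Read "url status last_visit_date last_visit_status" lines on stdin and
# print a bulk-save CSV row for each accepted origin whose last visit failed.
retry_csv() {
    awk '$2 == "accepted" && ($4 == "failed" || $4 == "not_found") {
             print "\"" $1 "\",\"hg\""
         }'
}
```

The resulting file could then be posted to the bulk save endpoint with the same `curl` command as before.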
- Owner
Another batch was relaunched and the remaining repositories were manually loaded.
The issues were indeed mbed availability issues.
Edited by Vincent Sellier
@vsellier here is an example of a failure; it isn't present in unsuccessful.csv:
$ hg clone https://os.mbed.com/users/e2137/code/SBDBT_lib___muratani/
http authorization required for https://os.mbed.com/users/e2137/code/SBDBT_lib___muratani/
realm: mbed.org
user:
password:
abort: authorization failed
Here is the full list of HTTP 403 URLs. All of them should be failures because they ask for a user/password, and SWH will not have an account to use.
wget -q https://archive.org/download/archiveteam_archivebot_go_20241115062543_eb298f37/os.mbed.com-inf-20240711-052514-7bjnd-meta.warc.gz
zcat os.mbed.com-inf-20240711-052514-7bjnd-meta.warc.gz | sed -nE 's@.*‘(https?://os\.mbed\.com/[^’]+)’: 403 .*@\1@p' > os.mbed.com-403-urls.txt
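To see which submitted repositories appear in that 403 list, the two files could be intersected with `grep`. A minimal sketch, assuming both files hold one URL per line with the same scheme and trailing slash; the function name `forbidden_repos` is hypothetical, and only exact line matches are reported.

```shell
# Print the lines of the repository list that also appear verbatim in the 403 list.
# $1: file of 403 URLs, $2: file of repository URLs (one URL per line each).
forbidden_repos() {
    urls_403=$1
    repo_list=$2
    # -F: fixed strings, -x: whole-line match, -f: read patterns from a file.
    grep -Fx -f "$urls_403" "$repo_list"
}
```

For example, `forbidden_repos os.mbed.com-403-urls.txt repos.txt`, where `repos.txt` stands in for the uncompressed repository list.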
- Maintainer
I extracted the full list of Mbed origins that got rejected by the bulk save operation: rejected_mbed_origins.csv.
2304 origins were rejected due to:
- 302 - Found (infinite loop of redirects): 1 origin
- 403 - Forbidden: 1671 origins
- 404 - Not Found: 223 origins
- 500 - Internal Server Error: 17 origins
- not a mercurial repository (project is on GitHub): 392 origins
You can find the loading status of each submitted Mbed origin in this JSON file: mbed_origins_save_report.json
We rescheduled with SCN those with status accepted but visit status not_found, as some transient network errors happened when attempting to clone them.
- Owner
The repositories returning a 403 were in the initial rejected list, so I didn't try to load them manually.
All the others, except https://os.mbed.com/users/mbed_official/code/mbed-dev/ which looks corrupted, seem to be loaded correctly.
For the record, all the accepted repositories are added to the list of origins that are regularly scheduled to be loaded, so if a load failed at one time, the loader retried a couple of days later.
Here is the full result of the first ingestion: result.lst.gz
The format is: url "listing status" "last visit date" "last visit status"
It's only accurate for the "accepted/successful" origins, but it can give you a better view.
Edited by Vincent Sellier
- Owner
arf cross posting ;)
- Vincent Sellier reopened
- Vincent Sellier closed