repo: asf via http connection (as mentioned in description)
machine: worker01 with a local remote storage plugged to softwareheritage-test-svn db.
1st tryout: using svn connection over http.
worker did ~40k revisions.
A bad http connection occurred, preventing the job from finishing.
Error message was: Unexpected HTTP status 400 'Bad Request' on '/repos/asf'\n
As a consequence, the code was adapted with a retry policy (basic retry of 3) on the sensitive part that broke.
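To illustrate what a "basic retry of 3" can look like, here is a minimal sketch. It is not the code that went into swh-loader-svn: the retry decorator, its parameters and the fetch_revision function are hypothetical names made up for this example, and the failing call is simulated.

```python
import time


def retry(attempts=3, delay=0.0, exceptions=(Exception,)):
    """Retry the wrapped callable up to `attempts` times (hypothetical helper)."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of attempts: let the error propagate
                    time.sleep(delay)
        return wrapper
    return decorator


calls = []


@retry(attempts=3, exceptions=(ConnectionError,))
def fetch_revision(rev):
    # Stand-in for the svn-over-http call: fails twice, then succeeds.
    calls.append(rev)
    if len(calls) < 3:
        raise ConnectionError("Unexpected HTTP status 400 'Bad Request'")
    return rev
```

Note that ConnectionResetError is a subclass of ConnectionError in Python, so a policy like this one would also cover a "Connection reset by peer" failure, as long as it does not happen 3 times in a row.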
This was packaged and redeployed on worker01.
The same job has been triggered.
Since it did not finish the first time (no new occurrence was created), the worker restarted from scratch (except for the origin, which is the same).
What I mean by restart is:
will start from revision 1 and hash each svn revision tree up to the svn repository's HEAD revision
send missing data for storage
In effect, the first 40k revisions (and their contents/directories) won't be sent again since they are already stored.
Still, we'll lose the 40k hashing time.
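The restart behaviour described above can be sketched as follows. This is only a toy model under stated assumptions: load_from_scratch is a hypothetical name, the storage is a plain dict, and a sha1 of some bytes stands in for hashing a whole svn revision tree.

```python
import hashlib


def load_from_scratch(revisions, storage):
    """Hash every revision, but send only what the storage lacks.

    `revisions` is an iterable of (svn_rev, payload_bytes) and `storage`
    is a dict standing in for the remote storage (both hypothetical).
    Returns (number of revisions hashed, number of revisions sent).
    """
    hashed = sent = 0
    for svn_rev, payload in revisions:
        rev_id = hashlib.sha1(payload).hexdigest()  # hashing cost is always paid
        hashed += 1
        if rev_id not in storage:  # only missing data is sent
            storage[rev_id] = payload
            sent += 1
    return hashed, sent
```

Running it twice over the same history shows the point: the second run sends nothing for the revisions already stored, but it still hashes all of them.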
2nd tryout
This time, the worker got further but still failed at around 94k revisions.
Error message: ConnectionResetError: [Errno 104] Error running context: Connection reset by peer
At the moment, an svn update is only possible once a full pass has finished.
That is, when done, it creates an occurrence which targets a revision.
When triggering a known svn repository, we retrieve that targeted revision which holds the associated svn revision.
And we start from that one (if the history is not altered that is).
So, for very large repositories, this tactic won't be sufficient: if the first pass never finishes, we still need to hash everything again even though the data is already in storage.
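A minimal sketch of that lookup, to make the limitation concrete. The names are hypothetical (start_revision, the occurrences mapping, the svn_revision key) and do not reflect the actual swh data model; the point is only that without an occurrence there is nothing to resume from.

```python
def start_revision(occurrences, origin, history_altered=False):
    """Return the svn revision to start from for `origin`.

    `occurrences` maps an origin to the revision its occurrence targets,
    modelled here as a dict holding the associated svn revision.
    All names are hypothetical.
    """
    target = occurrences.get(origin)
    if target is None or history_altered:
        # No finished pass (or rewritten history): back to revision 1.
        return 1
    return target["svn_revision"]
```

An occurrence only exists once a pass has completed, so a load that failed at ~40k or ~94k revisions falls into the first branch and starts over at revision 1.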
Possible improvements:
Add a cache of already seen revisions for that origin (independent of occurrences)
Improve the loader so that it reschedules the task with the last seen revision for that origin
Antoine R. Dumont changed title from Test - Ingest huge repository to Test - Ingest XXL svn repository
This is the 3rd solution implementation, which is tested and deployed on the current worker01.
The repositories impacted are:
swh-scheduler (in a branch, tagged, debian packaged, uploaded on pergamon)
swh-loader-svn (in a branch, tagged, debian packaged, uploaded on pergamon)
The same debian packaging as usual has been used. The only difference is that the git tags are on the respective branches.
Infra detail:
swh-scheduler and swh-loader-svn deployed
db: softwareheritage-test-svn for the storage
db: ardumont-swh-scheduler for the scheduling part.
As explained in the email, a producer produces the url of the asf svn repository to load:
this will result in a task to load that repository
If a failure occurs along the way, the task reschedules a one-shot task with the necessary information (last known swh-revision, svn revision). The failing task is then stopped and considered done.
(for now) a cron job regularly runs a script in charge of loading one-shot tasks from the scheduler.
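The rescheduling flow above can be sketched roughly like this. It is an illustration under assumptions, not the swh-scheduler API: run_load_task, the task dict keys and the list used as a queue are all hypothetical, and the loader is modelled as a generator yielding (swh_revision, svn_revision) pairs as it progresses.

```python
def run_load_task(task, load_revisions, scheduler_queue):
    """Run one load; on failure, queue a one-shot task that resumes it.

    `load_revisions(task)` yields (swh_revision, svn_revision) pairs and
    `scheduler_queue` is a list standing in for the scheduler db.
    All names here are hypothetical.
    """
    last = (task.get("swh_revision"), task.get("svn_revision"))
    try:
        for swh_rev, svn_rev in load_revisions(task):
            last = (swh_rev, svn_rev)  # remember the last revision loaded
    except ConnectionError:
        # Hand the remaining work over to a new one-shot task.
        scheduler_queue.append({
            "type": "oneshot",
            "svn_url": task["svn_url"],
            "swh_revision": last[0],
            "svn_revision": last[1],
        })
    # Either way, the current task is considered done.
    return "done"
```

With this shape, the script run by cron only has to pop one-shot tasks from the queue and call run_load_task again; each run continues from the last known revision instead of revision 1.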
Even svn's own tools break on such cases (svnsync must be iteratively called to continue).
A recent email discussion with Greg Stein, former member of the googlecode team, revealed that some
defenses exist on the server side of the asf repositories.
This explains why I had issues when trying to mirror the svn repositories for test purposes.
Well, trying to mount the repository on the side seems to be a task on its own:
$ svnadmin create asf-mirror
$ 7z x -so svn-asf-public-r0:1164363.7z | svnadmin load ./asf-mirror
...
------- Committed revision 923 >>>
<<< Started new transaction, based on original revision 924
     * editing path : incubator/directory/ldap/trunk/sandbox0/.cvsignore ... done.
------- Committed revision 924 >>>
<<< Started new transaction, based on original revision 925
     * editing path : incubator/directory/ldap/trunk/sandbox0/newbackend/src/java/ldapd/server/jndi/InterceptorPipeline.java ... done.
svnadmin: E125005: Invalid property value found in dumpstream; consider repairing the source or using --bypass-prop-validation while loading.
svnadmin: E125005: Cannot accept non-LF line endings in 'svn:log' property
Related to #611 (the asf repo is a monorepo and it contains svn:externals properties)
Related to #3839 (closed) (we are dealing with large repositories here)
@anlambert ^ with your current awesome work, that may actually converge at some point ;)