The loader got killed after it started consuming a lot of memory:
Warning: Permanently added 'anoncvs.netbsd.org,199.233.217.198' (RSA) to the list of known hosts.
DEBUG:swh.loader.cvs.loader.CvsLoader:Fetching CVS rlog from anoncvs.netbsd.org:/cvsroot/src
Killed
swh@loader-cvs-manual:~$
INFO:swh.loader.cvs.loader.CvsLoader:Load origin 'ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src' with type 'cvs'
DEBUG:swh.loader.cvs.loader.CvsLoader:lister_not provided, skipping extrinsic origin metadata
DEBUG:swh.loader.cvs.loader.CvsLoader:prepare; origin_url=ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src scheme=ssh path=/cvsroot/src
Warning: Permanently added 'anoncvs.netbsd.org,199.233.217.198' (RSA) to the list of known hosts.
DEBUG:swh.loader.cvs.loader.CvsLoader:Fetching CVS rlog from anoncvs.netbsd.org:/cvsroot/src
ERROR:swh.loader.cvs.loader.CvsLoader:Loading failure, updating to `failed` status
Traceback (most recent call last):
  File "/opt/swh/.local/lib/python3.10/site-packages/swh/loader/core/loader.py", line 391, in load
    self.prepare()
  File "/opt/swh/.local/lib/python3.10/site-packages/swh/loader/cvs/loader.py", line 502, in prepare
    self.rlog.parse_rlog(main_rlog_file)
  File "/opt/swh/.local/lib/python3.10/site-packages/swh/loader/cvs/rlog.py", line 220, in parse_rlog
    raise ValueError("No filename found in rlog header")
ValueError: No filename found in rlog header
DEBUG:urllib3.connectionpool:Resetting dropped connection: storage1.internal.staging.swh.network
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): sentry.softwareheritage.org:443
DEBUG:urllib3.connectionpool:https://sentry.softwareheritage.org:443 "POST /api/21/store/ HTTP/1.1" 200 41
DEBUG:urllib3.connectionpool:https://sentry.softwareheritage.org:443 "POST /api/21/envelope/ HTTP/1.1" 200 2
DEBUG:urllib3.connectionpool:http://storage1.internal.staging.swh.network:5002 "POST /origin/visit_status/add HTTP/1.1" 200 26
DEBUG:urllib3.connectionpool:http://storage1.internal.staging.swh.network:5002 "POST /flush HTTP/1.1" 200 1
DEBUG:urllib3.connectionpool:http://storage1.internal.staging.swh.network:5002 "POST /clear/buffer HTTP/1.1" 200 1
DEBUG:swh.loader.cvs.loader.CvsLoader:cleanup
{'status': 'failed'} for origin 'ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src'

real    350m49.023s
user    134m4.740s
sys     4m38.862s
Regarding that memory issue, D8682 should help avoid it.
I ran into the same kind of issue with large repositories while working on the Subversion loader.
After applying the same kind of patch there, memory consumption became much more reasonable
and overall loader performance improved considerably.
@vsellier, I found other memory consumption issues in the CVS loader implementation while testing the parsing of NetBSD's huge rlog output (the script got OOM-killed). I fixed those in D8683.
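For anyone following along, here is a rough illustration of the general idea of keeping memory bounded while parsing a very large rlog output. It is not the code from D8682 or D8683, just a minimal sketch: the function name `iter_rlog_entries` is made up, and it assumes the usual rlog text layout (an "RCS file: " header per file, a line of 28 dashes before each revision, and a line of '=' characters closing each file entry). It reads the stream line by line and yields one small summary per file instead of holding the whole output in memory.

```python
import io
from typing import Iterator, TextIO, Tuple

# Assumed rlog layout: each revision block is preceded by a line of 28 dashes.
REV_SEP = "-" * 28


def iter_rlog_entries(stream: TextIO) -> Iterator[Tuple[str, int]]:
    """Yield (rcs_file, revision_count) for each file entry in an rlog stream.

    Hypothetical helper for illustration: only the current entry's counters
    are kept in memory, never the whole rlog output.
    """
    rcs_file = None
    revisions = 0
    prev = ""
    for raw in stream:
        line = raw.rstrip("\n")
        if rcs_file is None:
            # Outside an entry: wait for the next file header.
            if line.startswith("RCS file: "):
                rcs_file = line[len("RCS file: "):]
                revisions = 0
        elif prev == REV_SEP and line.startswith("revision "):
            # Only count "revision" lines that directly follow a separator,
            # so log messages mentioning "revision" are not miscounted.
            revisions += 1
        elif len(line) >= 70 and set(line) == {"="}:
            # End-of-entry marker (assumed to be a long line of '=').
            yield rcs_file, revisions
            rcs_file = None
        prev = line


if __name__ == "__main__":
    sample = (
        "RCS file: /cvsroot/src/a.c,v\n"
        + REV_SEP + "\nrevision 1.2\nfix\n"
        + REV_SEP + "\nrevision 1.1\ninitial\n"
        + "=" * 77 + "\n"
    )
    for name, count in iter_rlog_entries(io.StringIO(sample)):
        print(name, count)
```

In practice the stream would be the rlog temporary file opened for reading rather than a `StringIO`. A log message that itself contains a separator line followed by "revision " would still confuse this sketch.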
A new attempt was launched with version 0.5.1 and its improvements. Memory consumption looks better, but the process failed with a new error:
DEBUG:swh.loader.cvs.loader.CvsLoader:checkout to b'/tmp/swh.loader.cvs.io8fcdjp-429/src/usr.bin/rcs/doc/rcs.ms'
ERROR:swh.loader.cvs.loader.CvsLoader:Exception in fetch_data:
Traceback (most recent call last):
  File "/opt/swh/.local/lib/python3.10/site-packages/swh/loader/cvs/loader.py", line 616, in fetch_data
    data = next(self.swh_revision_gen)
  File "/opt/swh/.local/lib/python3.10/site-packages/swh/loader/cvs/loader.py", line 312, in process_cvs_changesets
    self.checkout_file_with_cvsclient(k, f, self.cvsclient)
  File "/opt/swh/.local/lib/python3.10/site-packages/swh/loader/cvs/loader.py", line 264, in checkout_file_with_cvsclient
    fp = cvsclient.checkout(path, f.rev, dirname, expand_keywords=True)
  File "/opt/swh/.local/lib/python3.10/site-packages/swh/loader/cvs/cvsclient.py", line 435, in checkout
    raise CVSProtocolError("Error from CVS server: %s" % response)
swh.loader.cvs.cvsclient.CVSProtocolError: Error from CVS server: b"E cvs checkout: Skipping `$Log$' keyword due to excessive comment leader.\n"
All repository files are dumped to disk first, and loading is then performed without any network requests,
so it should be faster and we should not hit the issue above.