- May 07, 2021
-
-
Antoine Lambert authored
This ensures the mocked sleep will work with all tenacity versions. Related to T3310
-
Antoine Lambert authored
This fixes the Debian package build of swh-lister on buster.
-
- May 06, 2021
-
-
Raphaël Gomès authored
SourceForge's sitemaps (1 main one + many sharded) give us a "last modified" date for every subsitemap and project, allowing us to perform an incremental listing. We store the subsitemaps' "last modified" dates in the lister state, as well as those of the empty projects (projects which don't have any VCS registered), and the rest comes from the already visited origins from the database. The tests try to cover the possible cases: a subsitemap that has changed, one that hasn't, a project that has changed, one that hasn't, and the same for an empty project.
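The incremental check described above can be sketched as follows. This is only an illustration, not the lister's actual code: the function name `changed_entries` and the plain `dict` used as state are assumptions made for the example.

```python
from datetime import date
from typing import Dict, Iterator, Tuple


def changed_entries(
    entries: Dict[str, date], last_seen: Dict[str, date]
) -> Iterator[Tuple[str, date]]:
    """Yield (url, last_modified) for entries that are new or were modified
    after the date recorded in the stored state (hypothetical helper)."""
    for url, last_modified in entries.items():
        previous = last_seen.get(url)
        # Unknown entries and entries with a newer date need a new visit.
        if previous is None or last_modified > previous:
            yield url, last_modified
```

Entries whose "last modified" date matches the stored one are skipped, which is what makes a subsequent listing cheaper than a full one.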
-
- Apr 28, 2021
-
-
Antoine Lambert authored
Make it possible to check that the package documentation can be built without producing sphinx warnings. The sphinx environment is designed to be used in continuous integration in order to prevent breaking the documentation build when committing changes. The sphinx-dev environment is designed to be used inside a full swh development environment. Related to T3258
-
- Apr 27, 2021
-
-
vlorentz authored
Bitbucket's API kind of supports REST workflows, but they clearly use it like an RPC API (the hardcoded schema in `PROJECT_API_URL_FORMAT` makes it particularly clear).
-
- Apr 13, 2021
- Apr 04, 2021
-
-
Hezekiah Maina authored
-
Hezekiah Maina authored
-
- Mar 23, 2021
-
-
Raphaël Gomès authored
Following zack's work on T735, this change introduces an actual SWH lister for SourceForge. SourceForge provides a main sitemap that lists sharded sitemaps, which themselves list pages. Each page belongs to a project (or sub-project, though those are rare), information about which can be found by querying a REST API, which gives us the list of any and all VCS used for said project. Both sitemaps and pages have a "last modified" timestamp that will be used in a future patch to implement incremental listing. More precise information can be found as inline comments or docstrings.
-
- Mar 19, 2021
-
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
These errors happen, sometimes, when requesting large pages of results.
-
Nicolas Dandrimont authored
This makes the logic easier to test.
-
Nicolas Dandrimont authored
These happen, sometimes, when the connection to the GitHub server resets, e.g. because of congestion on a slow link.
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
This will help us break out the retry logic for the listing requests themselves into a separate function too.
-
vlorentz authored
-
- Feb 26, 2021
-
-
This adds a new tutorial which details how to write the new listers (both incremental and stateless). It also proposes a Python template file to start a new lister. Finally, this renames the previous tutorial to tutorial-2017. Related to T3073
-
- Feb 08, 2021
-
-
Antoine Lambert authored
Some distributions (e.g. debian-security) have a slightly different URL for retrieving source package metadata, so add a new URL template to try when downloading such data. Related to T3032#58239
-
- Feb 05, 2021
-
-
Antoine Lambert authored
Remove outdated part about listers database and use swh CLI in README for executing a lister instead of raw Python code.
-
Antoine Lambert authored
A CRAN package can appear twice in the JSON list returned by the list_all_packages.R script, with the most recent version of the package appearing first. So handle that edge case to avoid an error when sending origins to the scheduler.
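One way to handle such duplicates is to keep only the first occurrence of each package name. The sketch below is illustrative only: the helper name `first_occurrences` and the `"Package"` key are assumptions, not taken from the lister code.

```python
from typing import Dict, Iterable, Iterator


def first_occurrences(packages: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield each package once, keeping the first entry seen for a name.

    The "Package" key is an assumed field name for this example.
    """
    seen = set()
    for package in packages:
        name = package["Package"]
        if name in seen:
            # A later duplicate is an older version: skip it.
            continue
        seen.add(name)
        yield package
```

Because the most recent version comes first in the list, keeping the first occurrence also keeps the most recent version.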
-
Antoine Lambert authored
-
Antoine Lambert authored
xmltodict now raises an error while trying to parse the HTML content of the https://pypi.org/simple/ page. So use the BeautifulSoup HTML parser instead, as it is already a requirement of swh-lister and it does not fail when parsing the PyPI HTML page. Also drop the no longer used xmltodict from the requirements.
-
- Feb 02, 2021
-
-
Antoine Lambert authored
-
Antoine Lambert authored
Legacy Lister classes from the swh.lister.core module are no longer used in the swh-lister codebase, so it is time to remove them. Also remove the lister CLI options related to the legacy Lister API. As a consequence, the following requirements are no longer needed: arrow, SQLAlchemy, sqlalchemy-stubs and testing.postgresql. Closes T2442
-
Antoine Lambert authored
The previous implementation was generating tasks for a non-implemented Packagist loader. The new implementation extracts the source repository URL, VCS type and last update date for each package referenced by Packagist and sends that information to the scheduler. Package metadata is retrieved using Packagist API endpoints whose responses are served from static files, which are guaranteed to be efficient on the Packagist side (no dynamic queries). Furthermore, subsequent listings will send the "If-Modified-Since" HTTP header to retrieve only the package metadata updated since the previous listing operation, in order to save bandwidth and return only origins which might have newly released versions. Closes T2991
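Building the "If-Modified-Since" header can be sketched with the standard library alone. The helper name `conditional_headers` is hypothetical and this is not the lister's actual code; a server that honours the header answers 304 Not Modified when nothing has changed since the given date.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Optional


def conditional_headers(last_listing: Optional[datetime]) -> Dict[str, str]:
    """Return HTTP headers for a conditional request (hypothetical helper).

    No header is sent on the first listing, when no previous date is known.
    """
    if last_listing is None:
        return {}
    # HTTP dates are expressed in GMT; convert before formatting.
    as_utc = last_listing.astimezone(timezone.utc)
    return {"If-Modified-Since": format_datetime(as_utc, usegmt=True)}
```

The resulting dict can be passed as the `headers` argument of an HTTP client call.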
-
Antoine Lambert authored
The UTC timezone can be obtained from datetime.timezone in the Python standard library, so remove the dependency on the external pytz module.
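A minimal sketch of UTC handling with the standard library only; the helper names `utc_now` and `as_utc` are made up for this example.

```python
from datetime import datetime, timezone


def utc_now() -> datetime:
    """Return the current time as a timezone-aware datetime in UTC."""
    return datetime.now(tz=timezone.utc)


def as_utc(naive: datetime) -> datetime:
    """Attach the UTC timezone to a naive datetime known to be in UTC."""
    return naive.replace(tzinfo=timezone.utc)
```

`timezone.utc` covers the common case of a fixed UTC offset; no third-party package is needed for it.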
-
- Feb 01, 2021
-
-
Vincent Sellier authored
Ensure the behavior is the same whether or not a base URL is provided. Related to T3013#57810
-
- Jan 29, 2021
-
-
Antoine R. Dumont authored
Listers like github and bitbucket should not be impacted as they already list 1000 records per page.
-
Antoine R. Dumont authored
This adds a second behavior to the cgit lister, which computes origin URLs instead of parsing them out of another HTTP request to the git repository's detail page. This new behavior is expected to be the default. The old behavior is kept for now and is expected to be used as a fallback if too many false negatives are returned. Related to T2999
-
Antoine Lambert authored
Port the stateless GNU lister to the new swh.lister.pattern.Lister API, with identical functionality. Closes T2990
-
- Jan 28, 2021
-
-
Antoine Lambert authored
This generates an error due to the datetime type field, so manually build the dict instead. Related to T3003#57551
-
Antoine Lambert authored
launchpadlib can list the last modified repository twice, so make sure to yield a single ListedOrigin model for that special case. Related to T3003#57551
-
Antoine R. Dumont authored
In effect, it just allows adding credentials to the cgit, cran and pypi listers. This fixes instances of error [1]. [1] https://sentry.softwareheritage.org/share/issue/a5fb50f8e43e4b328c4917771576c6b0/ Related to T2998
-
Antoine Lambert authored
An exception is raised when registering task types in scheduler database otherwise.
-
Antoine R. Dumont authored
As origins is a generator, the previous behavior would try to consume the whole generator before sending the records. This groups origins into batches of 100 and sends each batch to the scheduler for writing. Related to T3003
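Batching a generator without consuming it entirely can be sketched as below. The helper name `grouper` is chosen for this example and the snippet is not a copy of the lister's implementation.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def grouper(items: Iterable[T], size: int = 100) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from `items`."""
    iterator = iter(items)
    while True:
        # Pull only the next `size` items; the rest stays unconsumed.
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each batch can then be sent to the scheduler in turn, so at most `size` origins are held in memory at once.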
-
Antoine Lambert authored
Port the launchpad lister to the swh.lister.pattern.Lister API. The last update date of each listed git repository is now sent to the scheduler. The lister can work in incremental mode; in that case, only repositories modified since the last listing operation will be returned. Closes T2992
-
Antoine R. Dumont authored
In effect, it just allows adding credentials to the cgit, cran and pypi listers. This fixes instances of error [1]. [1] https://sentry.softwareheritage.org/share/issue/2c35a9f129cf4982a2dd003a232d507a/ Related to T2998
-
Antoine R. Dumont authored
-