- Feb 10, 2025
Antoine Lambert authored
The latest beautifulsoup4 release (4.13) seems to have fixed issues related to unexpected encodings in XML files, so a test that previously passed is now failing. Update that test to check that the origin URL and visit type can be successfully extracted from a POM file with an unexpected encoding.
-
- Jan 22, 2025
Antoine Lambert authored
The Bitbucket Web API to list repositories has buggy pages that need to be skipped to continue the listing. Previously, the request to get the next page when a buggy page was detected was missing the after query parameter, so the request always returned the second page of the repositories listing endpoint. Also refine buggy page detection by considering all HTTP status codes >= 500.
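A minimal sketch of the two ideas described above, assuming hypothetical helper names (`is_buggy_page`, `next_page_params`) and a hypothetical `pagelen` parameter that are not taken from the lister code: any status code of 500 or more marks a page as buggy, and the follow-up request always carries the `after` query parameter.

```python
from typing import Dict


def is_buggy_page(status_code: int) -> bool:
    """Treat every server-side error (HTTP status >= 500) as a buggy page."""
    return status_code >= 500


def next_page_params(base_params: Dict[str, str], after: str) -> Dict[str, str]:
    """Build the query parameters of the next request.

    The `after` cursor is always included; omitting it would make the
    request restart from the beginning of the listing.
    """
    params = dict(base_params)  # copy so the fixed parameters stay untouched
    params["after"] = after
    return params


# Illustrative values only
print(is_buggy_page(502))
print(next_page_params({"pagelen": "100"}, "2011-01-01T00:00:00Z"))
```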
-
- Dec 11, 2024
Antoine Lambert authored
The scheduler temporary backend has been removed in favor of a more efficient memory backend.
-
- Nov 07, 2024
Antoine Lambert authored
-
- Oct 29, 2024
Antoine Lambert authored
Check multiple origins in parallel using the concurrent.futures module to greatly speed up the whole listing process. Related to #4709.
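As an illustration of the approach, here is a small sketch of checking origins in parallel with a thread pool. The `check_origins` function, the `Origin` alias and the `check` callable are hypothetical names chosen for this example, not the lister's actual code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Tuple

Origin = Tuple[str, str]  # (origin URL, visit type)


def check_origins(
    origins: Iterable[Origin],
    check: Callable[[str, str], bool],
    max_workers: int = 8,
) -> Dict[Origin, bool]:
    """Run `check(url, visit_type)` for every origin using a thread pool."""
    origin_list = list(origins)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as the inputs
        results = executor.map(lambda origin: check(*origin), origin_list)
        return dict(zip(origin_list, results))
```

Threads fit this kind of workload because each check mostly waits on network I/O.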
-
- Oct 28, 2024
Recent changes in the base Lister class implementation make the call to self.scheduler.update_lister mandatory to update the last termination date for a lister. This has side effects on the GitHub lister, as there is one incremental instance plus multiple range ones, executed in parallel, that relist previously discovered repos. Range GitHub listers should not override the shared incremental lister state, as StaleData exceptions might be raised otherwise, so override the set_state_in_scheduler Lister method to ensure that.
-
David Douard authored
The former has been deprecated for ages...
-
- Oct 24, 2024
Antoine Lambert authored
Previously it could be set by any call to the `set_state_in_scheduler` method. This caused side effects on the save bulk lister while updating the scheduler state when encountering an invalid or not found origin, and thus the listing failed. Fixes #4712.
-
- Oct 14, 2024
Antoine Lambert authored
Related to swh-scheduler#4687.
-
Antoine Lambert authored
It makes it possible to declare a lister whose first visits of listed origins must be scheduled with high priority. Related to swh-scheduler#4687.
-
Antoine Lambert authored
It makes it possible to track the last lister execution date and will be used to schedule first visits with high priority for listed origins. Related to swh-scheduler#4687.
-
- Sep 05, 2024
Antoine Lambert authored
The SourceForge lister sends various HTTP requests to get info about a project, for instance to get the branch name of a Bazaar project. If HTTP errors occurred during these steps, they were discarded so that the listing could continue, but connection errors were not, and as a consequence the listing failed when encountering such an error. Currently, the legacy Bazaar project hosting on SourceForge seems to be down and connection errors are raised when attempting to fetch branch names, so the lister crashes mid-flight and does not process all projects.
-
- Sep 04, 2024
Antoine Lambert authored
This new, special lister verifies a list of origins to archive provided by users (for instance through the Web API). Its purpose is to avoid polluting the scheduler database with origins that cannot be loaded into the archive. Each origin is identified by a URL and a visit type. For a given visit type, the lister checks whether the origin URL can be found and whether the visit type is valid. The supported visit types are those for VCS (bzr, cvs, hg, git and svn) plus the one for loading a tarball content into the archive. Accepted origins are inserted or upserted in the scheduler database. Rejected origins are stored in the lister state. Related to #4709.
-
- Sep 02, 2024
Antoine Lambert authored
-
- Aug 27, 2024
David Douard authored
-
David Douard authored
-
Antoine Lambert authored
That extrinsic metadata can be fetched directly by the loader through the crates Web API, which also provides more metadata fields.
-
Antoine Lambert authored
Instead of having a single crate and its versions info per page, prefer to have up to 1000 crates per page to significantly speed up the listing process.
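The batching idea can be sketched as a generic generator that groups items into pages of bounded size. The `paginate` name and its signature are assumptions made for this example; only the page size of 1000 comes from the message above.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def paginate(items: Iterable[T], page_size: int = 1000) -> Iterator[List[T]]:
    """Yield successive pages holding at most `page_size` items each."""
    iterator = iter(items)
    while True:
        page = list(islice(iterator, page_size))
        if not page:
            return
        yield page
```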
-
Antoine Lambert authored
Previously, the lister state was recorded even when errors occurred while listing crates, because the finalize method is called whether or not an exception is raised during listing. As a consequence, some crates could be missed, since the incremental listing restarts from the dump date of the last processed crates database. So ensure all crates have been processed by the lister before recording its state.
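A simplified sketch of that guard follows. `IncrementalListing` is a hypothetical stand-in written for this example, not the crates lister class: a flag is set only once every item has been processed, and the finalize step records the new state only when the flag is set.

```python
from typing import Callable, Iterable, Optional


class IncrementalListing:
    """Hypothetical stand-in for an incremental lister (illustration only)."""

    def __init__(self) -> None:
        self.state: Optional[str] = None  # dump date recorded by the last full run
        self.all_processed = False

    def run(
        self,
        crates: Iterable[str],
        process: Callable[[str], None],
        dump_date: str,
    ) -> None:
        try:
            for crate in crates:
                process(crate)
            # Only reached when no exception interrupted the loop
            self.all_processed = True
        finally:
            # finalize is called whether or not an exception was raised
            self.finalize(dump_date)

    def finalize(self, dump_date: str) -> None:
        # Record the state only after a complete listing
        if self.all_processed:
            self.state = dump_date
```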
-
Antoine Lambert authored
packaging.version.parse is dedicated to parsing Python package version numbers, but crate versions do not necessarily respect Python version number conventions and thus some crate versions cannot be parsed. Prefer to use looseversion.LooseVersion2 instead, which is a drop-in replacement for the deprecated distutils.version.LooseVersion and can parse all kinds of version numbers.
-
Antoine Lambert authored
A size limit of 1000000 was not enough to properly process all CSV crates data so bump to a higher value.
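For illustration, and assuming the limit in question is the one set through the standard library's `csv.field_size_limit` (the message does not say so explicitly), the sketch below shows how a field longer than the limit makes the reader fail and how a higher limit avoids that. The `read_rows` helper and the sample values are made up for this example.

```python
import csv
import io
from typing import List


def read_rows(text: str, limit: int) -> List[List[str]]:
    """Parse CSV text with a temporary field size limit."""
    previous = csv.field_size_limit(limit)  # returns the previous limit
    try:
        return list(csv.reader(io.StringIO(text)))
    finally:
        csv.field_size_limit(previous)  # restore the previous limit


# A row whose second field holds 2,000,000 characters
big_row = "name," + "x" * 2_000_000 + "\n"

try:
    read_rows(big_row, 1_000_000)
except csv.Error as exc:
    print("limit too low:", exc)

rows = read_rows(big_row, 4_000_000)
print(len(rows[0][1]))
```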
-
- Jul 18, 2024
Nicolas Dandrimont authored
For now this information is not used downstream, but it can be useful for specific analysis or one-shot scheduling.
-
- Jun 28, 2024
Antoine Lambert authored
The latest tenacity release adds some internal changes that broke the mocking of sleep calls in tests. Fix it by directly mocking time.sleep (which was not working previously).
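The general technique of mocking `time.sleep` directly can be sketched with the standard library alone. The `retry` function below is a hypothetical hand-written retry loop standing in for a retrying library; it does not use tenacity and is not the lister's test code.

```python
import time
from unittest import mock


def retry(func, attempts: int = 3, delay: float = 10.0):
    """Call `func`, sleeping `delay` seconds between failed attempts."""
    for attempt in range(attempts):
        try:
            return func()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)  # looked up on the time module at call time


calls = {"count": 0}


def flaky() -> str:
    """Fail twice with a connection error, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# Patching time.sleep makes the retries instantaneous in tests
with mock.patch("time.sleep") as sleep_mock:
    outcome = retry(flaky)
```

Because `retry` resolves `time.sleep` through the module on each call, patching the attribute on the `time` module is enough; no knowledge of the retry helper's internals is needed.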
-
- Jun 05, 2024
Antoine Lambert authored
The Gitea API returns the next pagination link with all the query parameters provided to an API request. As we were also passing a dict of fixed query parameters to the page_request method, some query parameters ended up having multiple instances in the URL used to fetch a new page of repositories data. Each time a new page was requested, new instances of these parameters were appended to the URL, which could result in a really long URL if the number of pages to retrieve was high, and make the request fail. Also remove a debug log already present in the http_request method.
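The duplication problem can be reproduced with `urllib.parse`. In the sketch below, `naive_next_url` mimics re-appending the fixed parameters to a next link that already contains them, and `merged_next_url` shows one possible way to avoid duplicates. Both function names, the example host and the parameter values are assumptions for this illustration; the actual fix in the lister may differ.

```python
from typing import Dict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def naive_next_url(next_link: str, fixed_params: Dict[str, str]) -> str:
    """Append the fixed parameters again, duplicating those already present."""
    separator = "&" if urlsplit(next_link).query else "?"
    return next_link + separator + urlencode(fixed_params)


def merged_next_url(next_link: str, fixed_params: Dict[str, str]) -> str:
    """Add a fixed parameter only when the next link does not already set it."""
    parts = urlsplit(next_link)
    params = dict(parse_qsl(parts.query))
    for key, value in fixed_params.items():
        params.setdefault(key, value)
    return urlunsplit(parts._replace(query=urlencode(params)))


fixed = {"sort": "id", "order": "asc", "limit": "50"}
link = "https://gitea.example.org/api/v1/repos/search?sort=id&order=asc&limit=50&page=2"
print(naive_next_url(link, fixed))
print(merged_next_url(link, fixed))
```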
-
- May 22, 2024
Antoine Lambert authored
The oldest part of the scheduler API was updated to use model classes (based on attr package) instead of dictionaries in order to improve typing.
-
- Apr 24, 2024
Antoine Lambert authored
Redirection URLs can be long and quite obscure in some cases (GitHub CDN for instance), so ensure the redirected URL is used as the origin URL. Related to swh/meta#5090.
-
- Apr 16, 2024
Antoine Lambert authored
Since the types-beautifulsoup4 package gets installed in the swh virtualenv as a swh-scanner test dependency, some mypy errors related to beautifulsoup4 typing were reported. The find method of bs4 returns the union Tag | NavigableString | None, so isinstance calls must be used to ensure proper typing, which is not great. Prefer to use the select_one method instead: it returns Optional[Tag], so a simple None check is enough to ensure typing is correct. In a similar manner, replace use of the find_all method with the select method. This also has the advantage of simplifying the code.
-
- Mar 29, 2024
David Douard authored
-
- Mar 14, 2024
Antoine Lambert authored
Some Guix packages correspond to subset exports of a subversion source tree at a given revision, typically the TeX Live ones. In that case, we must pass an extra parameter to the svn-export loader to specify the sub-paths to export, and also use a unique origin URL for each package to archive, as otherwise the same one would be used and only a single package would be archived. Related to swh/infra/sysadm-environment#5263.
-
- Mar 13, 2024
Antoine Lambert authored
Remove use of the --import-mode=importlib pytest option and use the new consider_namespace_packages option to fix test execution with the latest pytest release.
-
Antoine Lambert authored
It fixes installation of dependencies required by swh-scheduler pytest plugin.
-
- Feb 05, 2024
Antoine Lambert authored
Related to swh/meta#5075.
-
- Jan 18, 2024
Antoine Lambert authored
In addition to query parameters, also check whether any part of the URL path contains a tarball filename. This fixes the detection of some tarball URLs provided in the Guix manifest. Related to swh/meta#3781.
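A rough sketch of such a check is given below, assuming a hypothetical extension list and hypothetical function names; the lister's real detection logic is likely more thorough. Every path segment and every query parameter value is tested for a tarball-like filename.

```python
from urllib.parse import parse_qsl, urlsplit

# Illustrative subset of archive extensions (assumption for this sketch)
TARBALL_EXTENSIONS = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip")


def looks_like_tarball(name: str) -> bool:
    """Return True when the name ends with a known archive extension."""
    return name.lower().endswith(TARBALL_EXTENSIONS)


def url_contains_tarball_filename(url: str) -> bool:
    """Check query parameter values and every path segment of the URL."""
    parts = urlsplit(url)
    path_segments = [segment for segment in parts.path.split("/") if segment]
    query_values = [value for _, value in parse_qsl(parts.query)]
    return any(looks_like_tarball(item) for item in path_segments + query_values)
```

Checking every path segment, not only the last one, covers URLs where the filename is followed by extra path components.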
-
- Jan 17, 2024
David Douard authored
Link to the user documentation instead. Also add a section on required binary tools.
-
- Jan 10, 2024
Jérémy Bobbio (Lunar) authored
Commit c2402f40 renamed the entry points from `lister.*` without updating the rest of the framework. Revert the changes (and sort the list alphabetically).
-
- Jan 09, 2024
Franck Bret authored
Use another API endpoint that helps the lister to be stateful. The API endpoint used needs a ``since`` value that represents a sequential index in the history. The ``all_packages_count`` state stores a count which will be used as the ``since`` argument on the next run.
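The stateful mechanism can be sketched as follows. The `ListerState` dataclass, the `list_new_packages` function and the `fetch_since` callable are hypothetical names for this example, and the fake history stands in for the remote endpoint; only the ``all_packages_count`` and ``since`` names come from the message above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ListerState:
    """Hypothetical lister state holding the number of packages seen so far."""

    all_packages_count: int = 0


def list_new_packages(
    state: ListerState,
    fetch_since: Callable[[int], List[str]],
) -> List[str]:
    """Fetch packages added after the stored count, then advance the count."""
    new_packages = fetch_since(state.all_packages_count)
    state.all_packages_count += len(new_packages)
    return new_packages


# Fake registry history used in place of the remote endpoint
history = ["author/one", "author/two", "author/three"]
state = ListerState()
first_run = list_new_packages(state, lambda since: history[since:])

history.append("author/four")
second_run = list_new_packages(state, lambda since: history[since:])
```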
-
Franck Bret authored
'url' and 'instance' are mandatory. Add elm lister entry to pyproject.toml.
-
Franck Bret authored
The Elm lister lists Elm package origins from the Elm lang registry. It uses an HTTP API endpoint to list package origins. Origins are GitHub repositories; releases take advantage of the GitHub release API.
-
- Jan 08, 2024
Antoine Lambert authored
Guix now provides a "submodule" info in the sources.json file it produces, so exploit it to set the new "submodules" parameter of the git-checkout loader in order to retrieve submodules only when required. Related to swh-loader-git#4751.
-