
Storage: Enable paginating, filtering and counting snapshot content

First draft enabling pagination and filtering when querying the content of a snapshot.

This diff adds new optional parameters to the Storage methods returning the content of a snapshot (snapshot_get, snapshot_get_by_origin_visit, snapshot_get_latest):

  • branches_offset: skip a given number of branches before returning the snapshot content

  • branches_limit: restrict the number of branches returned in the snapshot content

  • branches_targets: filter the branches returned in the snapshot content by target type

I have also added a new method, snapshot_branches_count, which returns only the number of branches according to their target type.
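
A minimal in-memory sketch of the intended semantics of these parameters, not the actual Storage code. The SNAPSHOT dict, the filter_branches helper, ordering branches by name, applying the target filter before the offset and limit, and returning counts as a dict keyed by target type are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical in-memory snapshot: branch name -> target type.
SNAPSHOT = {
    "HEAD": "alias",
    "refs/heads/main": "revision",
    "refs/heads/dev": "revision",
    "refs/tags/v1.0": "release",
}


def filter_branches(branches, branches_offset=0, branches_limit=None,
                    branches_targets=None):
    """Sort branches by name, filter by target type, then apply offset/limit."""
    names = sorted(
        name for name, target in branches.items()
        if branches_targets is None or target in branches_targets
    )
    end = None if branches_limit is None else branches_offset + branches_limit
    return {name: branches[name] for name in names[branches_offset:end]}


def snapshot_branches_count(branches):
    """Count branches per target type (assumed return shape)."""
    return dict(Counter(branches.values()))
```

For example, filter_branches(SNAPSHOT, branches_offset=1, branches_limit=2) skips the first branch by name and returns the next two.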

Plugging these new features into swh-web effectively removes the slowdowns observed when browsing an origin with a large number of branches, for instance https://github.com/v8/v8

Related swh-web#1207 (closed)

Test Plan

Tests showing how to use these new parameters have been added.


Migrated from D487 (view on Phabricator)


Activity

  • Author Maintainer

    Rebase to master and fix a couple of typos

  • Thanks!

    I have made a few comments inline.

    I also have a more general doubt about the interface: I don't like multiplying the arguments to all functions that access snapshots. I'd rather we limit the results to a sensible value by default (e.g. the first 100 branches) and give callers the information that more data is available.

    The current functions would be unchanged: they would return a snapshot with its id and a list of branches. We would just add one field to the return value, next_branch, defaulting to None, that the caller would have to check to see whether the list of branches is complete.

    The full scrolling list of branches would then be available through a single (new) endpoint referencing the snapshot id and the first branch to fetch:

    def snapshot_get_branches(id, first_branch=None, count=None, target_types=None):
       ...

    This keeps the API more regular and avoids us doing more joins than necessary on the backend when a client wants the full list of branches for a given snapshot.
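
To make the proposal concrete, here is a rough sketch of how a client could scroll through all branches by following next_branch. The _SNAPSHOTS dict, the toy body of snapshot_get_branches and the iter_all_branches helper are hypothetical; only the function signature and the next_branch field come from the suggestion above, and the exact return shape is an assumption.

```python
# Toy stand-in for the backend: snapshot id -> {branch name: target type}.
_SNAPSHOTS = {
    "snp1": {f"refs/heads/b{i:02d}": "revision" for i in range(5)},
}


def snapshot_get_branches(id, first_branch=None, count=None, target_types=None):
    """Toy version of the proposed endpoint, with branches ordered by name."""
    names = sorted(
        name for name, target in _SNAPSHOTS[id].items()
        if (first_branch is None or name >= first_branch)
        and (target_types is None or target in target_types)
    )
    page = names if count is None else names[:count]
    # Name of the first branch left out of this page, or None if complete.
    next_branch = names[len(page)] if len(page) < len(names) else None
    return {
        "id": id,
        "branches": {name: _SNAPSHOTS[id][name] for name in page},
        "next_branch": next_branch,
    }


def iter_all_branches(id, page_size=2):
    """Client-side loop: keep fetching pages until next_branch is None."""
    first_branch = None
    while True:
        page = snapshot_get_branches(id, first_branch=first_branch, count=page_size)
        yield from page["branches"].items()
        first_branch = page["next_branch"]
        if first_branch is None:
            break
```

A caller that only needs the first page never pays for the rest, and one that wants everything just iterates over iter_all_branches("snp1").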

  • Merge request was returned for changes

  • Author Maintainer

    First diff update addressing @olasd's comments:

    • remove the use of offset in the SQL query used to return snapshot branches

    • do not modify the parameter lists of the methods snapshot_get, snapshot_get_by_origin_visit, snapshot_get_latest

    • the snapshot content returned by these methods now contains only the first 1000 branches (this seems a good default value imho, as I did not notice any performance issues and it should return the whole set of branches for a majority of snapshots); a new field called next_branch is now present in the returned dict, containing the name of the first branch not returned, if any.

    • add method snapshot_get_branches with the following optional parameters:

      • branches_from: skip branches whose name is less than this value

      • branches_count: maximum number of branches to return

      • target_types: list of target types to return while filtering out the others

    • rename method snapshot_branches_count to snapshot_count_branches
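
The offset-free query can be sketched as follows, using an in-memory SQLite table for illustration. The table layout, the string branch names and the returned dict shape are assumptions and do not reflect the real swh-storage schema; the point is only that filtering on name >= branches_from and fetching one extra row replaces OFFSET and yields next_branch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema, for illustration only.
conn.execute(
    "CREATE TABLE snapshot_branch (snapshot_id TEXT, name TEXT, target_type TEXT)"
)
conn.executemany(
    "INSERT INTO snapshot_branch VALUES (?, ?, ?)",
    [
        ("snp1", "HEAD", "alias"),
        ("snp1", "refs/heads/dev", "revision"),
        ("snp1", "refs/heads/main", "revision"),
        ("snp1", "refs/tags/v1.0", "release"),
    ],
)


def snapshot_get_branches(snapshot_id, branches_from="", branches_count=1000,
                          target_types=None):
    """Keyset pagination: start at name >= branches_from instead of OFFSET."""
    query = ("SELECT name, target_type FROM snapshot_branch "
             "WHERE snapshot_id = ? AND name >= ?")
    params = [snapshot_id, branches_from]
    if target_types:
        placeholders = ", ".join("?" for _ in target_types)
        query += f" AND target_type IN ({placeholders})"
        params.extend(target_types)
    # Fetch one extra row to learn the name of the first branch not returned.
    query += " ORDER BY name LIMIT ?"
    params.append(branches_count + 1)
    rows = conn.execute(query, params).fetchall()
    page, extra = rows[:branches_count], rows[branches_count:]
    return {
        "id": snapshot_id,
        "branches": dict(page),
        "next_branch": extra[0][0] if extra else None,
    }


def snapshot_count_branches(snapshot_id):
    """Return the number of branches per target type."""
    rows = conn.execute(
        "SELECT target_type, COUNT(*) FROM snapshot_branch "
        "WHERE snapshot_id = ? GROUP BY target_type",
        (snapshot_id,),
    ).fetchall()
    return dict(rows)
```

Passing the next_branch of one page as branches_from of the next call resumes exactly where the previous page stopped, without the database having to scan and discard skipped rows.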

  • Merge request was accepted

  • Nicolas Dandrimont approved this merge request


  • Author Maintainer

    Merge request was merged
