Storage: Enable to paginate, filter and count snapshot content
First draft enabling pagination and filtering when querying the content of a snapshot.
This diff adds new optional parameters to the Storage methods returning the content of a snapshot (snapshot_get, snapshot_get_by_origin_visit, snapshot_get_latest):
- branches_offset: skip a given number of branches before returning the snapshot content
- branches_limit: restrict the number of branches returned in the snapshot content
- branches_targets: filter the returned snapshot branches by target type
I have also added a new method, snapshot_branches_count, which only returns the number of branches per target type.
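To make the intent of these options concrete, here is a minimal in-memory sketch. It is not the code from the diff: the SNAPSHOT_BRANCHES data and the filter_branches helper are hypothetical, and applying the target filter before the offset and limit is an assumption.

```python
from collections import Counter

# Hypothetical in-memory snapshot: branch name -> target type.
SNAPSHOT_BRANCHES = {
    b"HEAD": "alias",
    b"refs/heads/dev": "revision",
    b"refs/heads/master": "revision",
    b"refs/tags/v1.0": "release",
    b"refs/tags/v2.0": "release",
}


def filter_branches(branches, branches_offset=0, branches_limit=None,
                    branches_targets=None):
    """Return the sorted (name, target_type) pairs selected by the options."""
    items = sorted(branches.items())
    if branches_targets is not None:
        # Assumption: the target filter applies before offset and limit.
        items = [(name, ttype) for name, ttype in items
                 if ttype in branches_targets]
    end = None if branches_limit is None else branches_offset + branches_limit
    return items[branches_offset:end]


def snapshot_branches_count(branches):
    """Count the branches of a snapshot per target type."""
    return dict(Counter(branches.values()))
```

For example, filter_branches(SNAPSHOT_BRANCHES, branches_offset=1, branches_limit=2, branches_targets=["revision", "release"]) skips the first matching branch and returns the next two.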
Plugging these new features into swh-web effectively removes the slowdowns observed when browsing an origin with a large number of branches, for instance https://github.com/v8/v8
Related swh-web#1207 (closed)
Test Plan
Tests showing how to use these new parameters have been added.
Migrated from D487 (view on Phabricator)
Activity
Thanks!
I have made a few comments inline.
I also have a more general concern about the interface: I don't like multiplying the arguments of every function that accesses snapshots. I'd rather we limit the results to a sensible value by default (e.g. the first 100 branches) and give callers the information that more data is available.
The current functions would be unchanged: they would return a snapshot with its id and a list of branches. We would just add one field to the return value, next_branch, defaulting to None, that the caller would have to check to see whether the list of branches was complete.
The full scrolling list of branches would then be available through a single (new) endpoint referencing the snapshot id and the first branch to fetch:
def snapshot_get_branches(id, first_branch=None, count=None, target_types=None): ...
This keeps the API more regular and avoids us doing more joins than necessary on the backend when a client wants the full list of branches for a given snapshot.
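A rough sketch of how a caller might scroll through branches with such an endpoint follows. Only the signature comes from the comment above; the in-memory SNAPSHOTS store, the shape of the returned dict and the iter_all_branches helper are assumptions made for illustration.

```python
# Hypothetical in-memory store: snapshot id -> {branch name: target type}.
SNAPSHOTS = {
    "snp1": {
        b"HEAD": "alias",
        b"refs/heads/dev": "revision",
        b"refs/heads/master": "revision",
        b"refs/tags/v1.0": "release",
    }
}


def snapshot_get_branches(id, first_branch=None, count=None, target_types=None):
    """Return up to `count` branches starting at `first_branch`.

    `next_branch` is the name of the first branch left out, or None when
    the returned list is complete.
    """
    items = sorted(SNAPSHOTS[id].items())
    if target_types is not None:
        items = [(n, t) for n, t in items if t in target_types]
    if first_branch is not None:
        items = [(n, t) for n, t in items if n >= first_branch]
    if count is not None and len(items) > count:
        page, next_branch = items[:count], items[count][0]
    else:
        page, next_branch = items, None
    return {"id": id, "branches": dict(page), "next_branch": next_branch}


def iter_all_branches(id, count=2):
    """Follow next_branch until the whole branch list has been seen."""
    first_branch = None
    while True:
        result = snapshot_get_branches(id, first_branch=first_branch, count=count)
        yield from result["branches"].items()
        first_branch = result["next_branch"]
        if first_branch is None:
            break
```

The caller never computes an offset: it only passes back the next_branch value it received, so each call can restart from a known branch name.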
Some references in the commit message have been migrated:
- T1207 is now swh-web#1207 (closed)
First diff update addressing @olasd's comments:
- remove the use of offset in the SQL query used to return snapshot branches
- do not modify the parameter lists of the methods snapshot_get, snapshot_get_by_origin_visit, snapshot_get_latest
- the snapshot content returned by these methods only contains the first 1000 branches (this seems a good default value imho, as I did not notice any performance issues and it should return the whole set of branches for a majority of snapshots); a new field called next_branch is now present in the returned dict, possibly containing the name of the first branch not returned
- add method snapshot_get_branches with the following optional parameters:
  - branches_from: skip branches whose name is less than this value
  - branches_count: maximum number of branches to return
  - target_types: list of target types to return while filtering out the others
- rename method snapshot_branches_count to snapshot_count_branches
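The switch from an offset to a starting branch name can be illustrated with a small, self-contained sketch. It uses sqlite3 and a made-up single-table schema purely for demonstration; the actual queries and schema in the diff are not shown here and may differ. Fetching one extra row is one possible way to derive next_branch.

```python
import sqlite3

DEFAULT_BRANCHES_COUNT = 1000  # default page size mentioned in the update


def make_db(branches):
    """Build a throwaway in-memory table (hypothetical schema) for one snapshot."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE branch (name BLOB PRIMARY KEY, target_type TEXT)")
    conn.executemany("INSERT INTO branch VALUES (?, ?)", branches)
    return conn


def snapshot_get_branches(conn, branches_from=b"",
                          branches_count=DEFAULT_BRANCHES_COUNT,
                          target_types=None):
    """Return one page of branches without using OFFSET in the query."""
    query = "SELECT name, target_type FROM branch WHERE name >= ?"
    params = [branches_from]
    if target_types:
        placeholders = ", ".join("?" for _ in target_types)
        query += f" AND target_type IN ({placeholders})"
        params.extend(target_types)
    # Ask for one extra row: if it exists, its name becomes next_branch.
    query += " ORDER BY name LIMIT ?"
    params.append(branches_count + 1)
    rows = conn.execute(query, params).fetchall()
    next_branch = rows[branches_count][0] if len(rows) > branches_count else None
    return {"branches": dict(rows[:branches_count]), "next_branch": next_branch}


def snapshot_count_branches(conn):
    """Return the number of branches per target type."""
    rows = conn.execute(
        "SELECT target_type, COUNT(*) FROM branch GROUP BY target_type"
    )
    return dict(rows.fetchall())
```

With an index on the branch name, the `name >= ?` condition lets the database start reading at the requested branch, whereas an offset would require stepping over all the skipped rows first.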