Implement a consistent mechanism for pagination in SWH codebase
Goals
Pagination using cursors
Possibility to move backwards and forwards from a cursor (with a limit)
Possibility to start from any arbitrary item
Possibility to get deterministic results with the same filters and order-bys (modulo any deleted items or a trailing extra item)
Possibility to get the total number of available items with the applied filters
Possibility to get has_previous_page and has_next_page
High level generic validations
Generic way to filter items
Generic way to order (sort) items
Possibility to cache items in SWH-storage
Possibility to encode/decode to/from uniform string cursors
Possibility for a pagination client to abstract details and generate opaque universal cursors
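The last two goals (uniform string cursors, opaque universal cursors) can be met by serializing the internal cursor tuple and base64-encoding it. A minimal sketch, assuming JSON serialization and hex-encoding for bytes fields; the function names and the `types` argument are illustrative, not existing SWH APIs:

```python
import base64
import json


def encode_cursor(cursor: tuple) -> str:
    """Serialize an internal cursor tuple into an opaque string token."""
    # bytes fields (e.g. a sha1) are hex-encoded so the tuple is JSON-serializable
    parts = [v.hex() if isinstance(v, bytes) else v for v in cursor]
    return base64.urlsafe_b64encode(json.dumps(parts).encode()).decode()


def decode_cursor(token: str, types: tuple) -> tuple:
    """Reverse of encode_cursor; `types` records which fields were bytes."""
    parts = json.loads(base64.urlsafe_b64decode(token.encode()))
    return tuple(
        bytes.fromhex(v) if t is bytes else v for v, t in zip(parts, types)
    )


# round trip: (internal tuple) -> opaque str -> (internal tuple)
assert decode_cursor(encode_cursor((434333, b"\x01\x02")), (int, bytes)) == (
    434333,
    b"\x01\x02",
)
```

Because the token is opaque, the encoding can later be swapped (e.g. for msgpack, or with an added signature) without changing any client.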
Current state of storage pagination
Using cursors
Limited in functionality; moving backwards is hard
Hard to start a page from an arbitrary item
Limited information about the source data
Consistent in origin_visits, visit_statuses and snapshot_branches
Partial or inconsistent in all the other places, e.g. origin_list
No pagination in places like directory_entry
Not very friendly for caching
Solution: steps
Create a generic response format (Improve the current PagedResult)
Implement a paginator class and base classes (can be used for any dataset with a possible ordering)
Paginate in storage (including filters and sorts)
Fix all the clients
Pagination response object
In SWH.core
class PageInfo:
    # end cursor is the computed last item cursor
    end_cursor: tuple | None  # internal cursor, not opaque
    # start cursor is the computed first item cursor
    start_cursor: tuple | None  # internal cursor, not opaque
    has_next_page: bool
    has_previous_page: bool

class PagedResult:
    results: [<SWH obj>, ..]
    cursor_key: tuple | None  # field(s) to be used as the item cursor (metadata)
    page_info: PageInfo
    total_count: int | None
data = {
    "results": [<origin obj1>, <origin obj2>, ...],
    "cursor_key": ("created_at", "sha1"),  # created_at is not available in origins
    "total_count": 43434343,
    "page_info": {
        "end_cursor": (3232232, b'dsdsds'),
        "start_cursor": (434333, b'ffsfsd'),
        "has_next_page": True,
        "has_previous_page": True,
    },
}
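The response format above can be sketched as runnable dataclasses. The field names come from the sample; the generic typing and default for total_count are assumptions, not the final design:

```python
from dataclasses import dataclass
from typing import Any, Generic, List, Optional, Tuple, TypeVar

T = TypeVar("T")


@dataclass
class PageInfo:
    end_cursor: Optional[Tuple[Any, ...]]    # internal cursor, not opaque
    start_cursor: Optional[Tuple[Any, ...]]  # internal cursor, not opaque
    has_next_page: bool
    has_previous_page: bool


@dataclass
class PagedResult(Generic[T]):
    results: List[T]
    cursor_key: Optional[Tuple[str, ...]]  # field(s) used as the item cursor
    page_info: PageInfo
    total_count: Optional[int] = None      # may be skipped when counting is costly
```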
Table data requirements
A single key, or combination of keys, that can be used to consistently order items.
e.g. visit in origin_visits
e.g. there is no such field in origin_list (sha1Git alone does not give a consistent order)
Adding a createdAt field would make (createdAt, sha1Git) a possible combination
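Why the combination works: a timestamp alone can collide, but appending a unique field breaks ties deterministically, giving a total order. A small in-memory illustration (not the storage API; field names follow the example above):

```python
rows = [
    {"created_at": 100, "sha1_git": b"bb"},
    {"created_at": 100, "sha1_git": b"aa"},  # same timestamp, different hash
    {"created_at": 99, "sha1_git": b"cc"},
]

# ordering on the composite key is total: ties on created_at are broken by sha1_git,
# so repeated queries always return the rows in the same order
ordered = sorted(rows, key=lambda r: (r["created_at"], r["sha1_git"]))

assert [r["sha1_git"] for r in ordered] == [b"cc", b"aa", b"bb"]
```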
Possible implementations
Hacking the existing code (minimal changes to the storage interface)
Using a wrapper (using slices and a cache, eg in GraphQL, relatively easy to do with minimal changes to the interface)
Using a pagination manager class and individual implementations (with base classes) (will change the interface)
Sample implementation using a paginator manager
Keep this stateless
def paginator(paginator_type: str, next: int, previous: int, cursor: str, *args, **kw):
    # generic validations, e.g.:
    if next and previous:
        raise Error  # cannot page in both directions at once
    if not is_a_valid_cursor(cursor):  # check the format
        raise Error
    if next > max_limit or previous > max_limit:
        # common limits, can be different for individual paginators
        raise Error
    # other validations

    # returns the right instance for the backend
    instance = get_instance(paginator_type, direction, filters, order, ...)
    if not instance:
        raise Error

    # cache lookup with cursor, filters and orderby

    try:
        query = instance.get_query()
    except Error:
        # log
        raise Error

    try:
        response = instance.execute_query(query)
    except Error:
        # log
        raise Error

    # cache update

    # build paged data from the std response
    paged_results = PagedResult(
        results=response["results"],
        cursor_key=instance._pagination_key,
        page_info=<generate page info from response>,
    )
    return paged_results
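The stateless flow above can be exercised end to end with the storage backend replaced by a sorted in-memory list. This is a toy sketch under those assumptions (validation, caching and the instance registry are elided; names are illustrative):

```python
MAX_LIMIT = 1000


def paginate(items, cursor, first=None, last=None):
    """Stateless cursor pagination over a totally ordered in-memory dataset."""
    if first is not None and last is not None:
        raise ValueError("cannot page forwards and backwards at once")
    limit = first if first is not None else last
    if limit is None or limit > MAX_LIMIT:
        raise ValueError("invalid limit")
    items = sorted(items)  # the backend must guarantee a total order
    if first is not None:
        # forwards: items strictly after the cursor
        page = [x for x in items if cursor is None or x > cursor][:first]
    else:
        # backwards: items strictly before the cursor
        page = [x for x in items if cursor is None or x < cursor][-last:]
    return {
        "results": page,
        "page_info": {
            "start_cursor": page[0] if page else None,
            "end_cursor": page[-1] if page else None,
            "has_next_page": bool(page) and page[-1] < items[-1],
            "has_previous_page": bool(page) and page[0] > items[0],
        },
    }
```

Because the function is stateless, resuming from `end_cursor` of one call as the `cursor` of the next reproduces the same page sequence regardless of which process serves each request.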
Example (visit_origin)
Pseudo code for Postgres
class OriginVisitPaginator:
    _pagination_key = ("visit",)  # will be the clustering key for Cassandra
    _table_alias = "v"  # to handle joins using v.visit

    def __init__(self, cursor, direction, limit, filters.., order..):
        # validations for cursor, filters etc
        ...

    def get_pagination_clause(self):
        # < or > depends on the direction of pagination
        # this function will go to a base class
        s = ""
        for each in cursor:
            s += """WHERE _pagination_key[i] > each"""
        return s

    def get_query(self):
        cl = self.get_pagination_clause()
        return """SELECT visit, ... FROM ... WHERE ... ORDER BY <cursor fields>"""

    def execute(self, query):
        return {"results": [<visit obj>], "cursor": (visit,)}
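One caveat for the base-class clause builder: for a composite cursor the per-field `>` comparisons cannot simply be concatenated; the correct keyset condition is lexicographic. Postgres supports the row-value form `(a, b) > (%s, %s)` directly, and a backend-agnostic expansion could look like this sketch (parameter placeholder style and function name are assumptions):

```python
def keyset_clause(fields, direction="forward", alias=None):
    """Build a lexicographic (keyset) WHERE condition for a composite cursor.

    For fields (a, b) and a forward scan this yields
        (a > %s) OR (a = %s AND b > %s)
    which is equivalent to the row-value comparison (a, b) > (%s, %s).
    """
    op = ">" if direction == "forward" else "<"
    cols = [f"{alias}.{f}" if alias else f for f in fields]
    alternatives = []
    for i, col in enumerate(cols):
        # all earlier cursor fields must be equal, the current one strictly beyond
        equalities = [f"{c} = %s" for c in cols[:i]]
        alternatives.append("(" + " AND ".join(equalities + [f"{col} {op} %s"]) + ")")
    return " OR ".join(alternatives)
```

The `ORDER BY` must list the same fields in the same direction for the clause to select exactly the next (or previous) page.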
Yet to find the right way to run an aggregate query for count(*) outside the pagination clause
Yet to find the right way to find has_next and has_previous
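One commonly used answer to the has_next / has_previous question, offered here as a suggestion rather than a decided design: over-fetch one row beyond the limit to detect a next page, and treat the presence of a supplied cursor as evidence of a previous page. (For the count question, Postgres can return `count(*) OVER ()` as a window function alongside the rows, at the cost of scanning the filtered set.) A toy sketch of the over-fetch trick:

```python
def fetch_page(rows, cursor, limit):
    """Detect has_next_page by over-fetching one row beyond the limit."""
    after = [r for r in rows if cursor is None or r > cursor]
    window = after[: limit + 1]      # ask the backend for limit + 1 rows
    has_next = len(window) > limit   # an extra row means a next page exists
    # a supplied cursor implies rows were skipped, so a previous page exists;
    # this is approximate if a client starts mid-stream with a fabricated cursor
    has_previous = cursor is not None
    return window[:limit], has_next, has_previous
```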
Pagination client
Client to abstract the complexities
In SWH.core, storage or in SWH-commons? (maybe)
To be used by a client (eg Web or GraphQL) for pagination
Talks using opaque tokens and hides the complexity of internal tokens
Same cursor everywhere and can be shared
Code mostly written in the GraphQL layer
class PaginatorClient:
    def __init__(self, paginator_type, cursor, filters, sort):
        # cursor will be None to start
        self.paginator_type = paginator_type
        ...
        self.data = []

    # def __iter__(self):
    #     return self

    def encode_cursor(self) -> str:
        # create uniform cursors (same length, type etc)
        # std algo to convert a tuple to a base64-encoded str
        # the algo can be as simple as: serialize the tuple and encode
        # the same algo will also be available as plain functions
        ...

    def decode_cursor(self) -> tuple:
        ...

    def next(self, limit):
        # make the call
        # update the cursor and metadata
        # raise NoMorePages error
        ...
        self.data = [<obj1>, <obj2>]

    def previous(self, limit):
        ...
        self.data = [<obj1>]

    def get_data(self) -> list:
        return self.data

    def get_data_with_cursors(self) -> list[(<obj>, str)]:
        # loop over the items to generate cursors
        return [(x, self.get_item_cursor(i)) for (i, x) in enumerate(self.data)]

    def get_item_cursor(self, i) -> str:
        # returns an opaque cursor
        # read the i-th item and generate its cursor
        return cursor

    def get_end_cursor(self) -> str:
        # returns an opaque cursor
        return cursor of the last item

    def get_first_cursor(self) -> str:
        # returns an opaque cursor
        return cursor of the first item

    def get_total_count(self):
        # yet to find the right way to implement this
        # run the query if this is the first call
        ...
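A usage sketch for the client interface above, with the backing storage call faked by an in-memory list. `NoMorePages` comes from the pseudocode; everything else here (the constructor signature, the internal cursor handling) is an illustrative stand-in, not the real implementation:

```python
class NoMorePages(Exception):
    pass


class PaginatorClient:
    """Toy stand-in for the client above, backed by an in-memory list."""

    def __init__(self, items, cursor=None):
        self._items = sorted(items)
        self._cursor = cursor  # None to start from the beginning
        self.data = []

    def next(self, limit):
        after = [x for x in self._items if self._cursor is None or x > self._cursor]
        if not after:
            raise NoMorePages
        self.data = after[:limit]
        self._cursor = self.data[-1]  # advance the internal cursor
        return self.data


# a caller simply loops pages until the client signals exhaustion
client = PaginatorClient([5, 3, 1, 4, 2])
pages = []
while True:
    try:
        pages.append(client.next(2))
    except NoMorePages:
        break
# pages == [[1, 2], [3, 4], [5]]
```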