Implement a consistent mechanism for pagination in SWH codebase
Goals
Pagination using cursors
Possibility to move backwards and forwards from a cursor (with a limit)
Possibility to start from any arbitrary item
Possibility to get deterministic results with the same filters and order-bys (modulo any deleted items or a trailing extra item)
Possibility to get the total number of available items with the applied filters
Possibility to get has_previous_page and has_next_page
High level generic validations
Generic way to filter items
Generic way to order (sort) items
Possibility to cache items in SWH-storage
Possibility to encode/decode to/from uniform string cursors
Possibility for a pagination client to abstract details and generate opaque universal cursors
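The last two goals (uniform string cursors, opaque universal cursors) can be met by serializing the internal cursor tuple and base64-encoding it. A minimal sketch, assuming JSON serialization and hex-encoding for bytes fields; the function names and the `types` argument are illustrative, not existing SWH APIs:

```python
import base64
import json


def encode_cursor(cursor: tuple) -> str:
    """Serialize an internal cursor tuple into an opaque string token."""
    # bytes fields (e.g. a sha1) are hex-encoded so the tuple is JSON-serializable
    parts = [v.hex() if isinstance(v, bytes) else v for v in cursor]
    return base64.urlsafe_b64encode(json.dumps(parts).encode()).decode()


def decode_cursor(token: str, types: tuple) -> tuple:
    """Reverse of encode_cursor; `types` records which fields were bytes."""
    parts = json.loads(base64.urlsafe_b64decode(token.encode()))
    return tuple(
        bytes.fromhex(v) if t is bytes else v for v, t in zip(parts, types)
    )


# round trip: (internal tuple) -> opaque str -> (internal tuple)
assert decode_cursor(encode_cursor((434333, b"\x01\x02")), (int, bytes)) == (
    434333,
    b"\x01\x02",
)
```

Because the token is opaque, the encoding can later be swapped (e.g. for msgpack, or with an added signature) without changing any client.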
Current state of storage pagination
Using cursors
Limited in functionality; moving backwards is hard
Hard to start a page from an arbitrary item
Limited information about the source data
Consistent in origin_visits, visit_statuses and snapshot_branches
Partial or inconsistent in all the other places, e.g. origin_list
No pagination in places like directory_entry
Not very friendly for caching
Solution: steps
Create a generic response format (Improve the current PagedResult)
Implement a paginator class and base classes (can be used for any dataset with a possible ordering)
Paginate in storage (including filters and sorts)
Fix all the clients
Pagination response object
In SWH.core
class PageInfo:
    # end cursor is the computed last item cursor
    end_cursor: tuple | None  # internal cursor, not opaque
    # start cursor is the computed first item cursor
    start_cursor: tuple | None  # internal cursor, not opaque
    has_next_page: bool
    has_previous_page: bool

class PagedResult:
    results: [<SWH obj>, ..]
    cursor_key: tuple | None  # field(s) to be used as the item cursor (metadata)
    page_info: PageInfo
    total_count: int | None
data = {
    "results": [<origin obj1>, <origin obj2>, ...],
    "cursor_key": ("created_at", "sha1"),  # created_at is not available in origins
    "total_count": 43434343,
    "page_info": {
        "end_cursor": (3232232, b'dsdsds'),
        "start_cursor": (434333, b'ffsfsd'),
        "has_next_page": True,
        "has_previous_page": True,
    },
}
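The response format above can be sketched as runnable dataclasses. The field names come from the sample; the generic typing and default for total_count are assumptions, not the final design:

```python
from dataclasses import dataclass
from typing import Any, Generic, List, Optional, Tuple, TypeVar

T = TypeVar("T")


@dataclass
class PageInfo:
    end_cursor: Optional[Tuple[Any, ...]]    # internal cursor, not opaque
    start_cursor: Optional[Tuple[Any, ...]]  # internal cursor, not opaque
    has_next_page: bool
    has_previous_page: bool


@dataclass
class PagedResult(Generic[T]):
    results: List[T]
    cursor_key: Optional[Tuple[str, ...]]  # field(s) used as the item cursor
    page_info: PageInfo
    total_count: Optional[int] = None      # may be skipped when counting is costly
```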
Table data requirements
A single key, or combination of keys, that can be used to consistently order items.
e.g. visit in origin_visits
e.g. there is no such field in origin_list (sha1Git alone does not give a consistent order)
Adding a createdAt field would make (createdAt, sha1Git) a possible combination
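Why the combination works: a timestamp alone can collide, but appending a unique field breaks ties deterministically, giving a total order. A small in-memory illustration (not the storage API; field names follow the example above):

```python
rows = [
    {"created_at": 100, "sha1_git": b"bb"},
    {"created_at": 100, "sha1_git": b"aa"},  # same timestamp, different hash
    {"created_at": 99, "sha1_git": b"cc"},
]

# ordering on the composite key is total: ties on created_at are broken by sha1_git,
# so repeated queries always return the rows in the same order
ordered = sorted(rows, key=lambda r: (r["created_at"], r["sha1_git"]))

assert [r["sha1_git"] for r in ordered] == [b"cc", b"aa", b"bb"]
```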
Possible implementations
Hacking the existing code (minimal changes to the storage interface)
Using a wrapper (using slices and a cache, eg in GraphQL, relatively easy to do with minimal changes to the interface)
Using a pagination manager class and individual implementations (with base classes) (will change the interface)
Sample implementation using a paginator manager
Keep this stateless
def paginator(paginator_type: str, next: int, previous: int, cursor: str, *args, **kw):
    # generic validations, e.g.:
    if next and previous:
        raise Error  # cannot page in both directions at once
    if not is_a_valid_cursor(cursor):  # check the format
        raise Error
    if next > max_limit or previous > max_limit:
        # common limits, can be different for individual paginators
        raise Error
    # other validations

    # returns the right instance for the backend
    instance = get_instance(paginator_type, direction, filters, order, ...)
    if not instance:
        raise Error

    # cache lookup with cursor, filters and orderby

    try:
        query = instance.get_query()
    except Error:
        # log
        raise Error

    try:
        response = instance.execute_query(query)
    except Error:
        # log
        raise Error

    # cache update

    # build paged data from the std response
    paged_results = PagedResult(
        results=response["results"],
        cursor_key=instance._pagination_key,
        page_info=<generate page info from response>,
    )
    return paged_results
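The stateless flow above can be exercised end to end with the storage backend replaced by a sorted in-memory list. This is a toy sketch under those assumptions (validation, caching and the instance registry are elided; names are illustrative):

```python
MAX_LIMIT = 1000


def paginate(items, cursor, first=None, last=None):
    """Stateless cursor pagination over a totally ordered in-memory dataset."""
    if first is not None and last is not None:
        raise ValueError("cannot page forwards and backwards at once")
    limit = first if first is not None else last
    if limit is None or limit > MAX_LIMIT:
        raise ValueError("invalid limit")
    items = sorted(items)  # the backend must guarantee a total order
    if first is not None:
        # forwards: items strictly after the cursor
        page = [x for x in items if cursor is None or x > cursor][:first]
    else:
        # backwards: items strictly before the cursor
        page = [x for x in items if cursor is None or x < cursor][-last:]
    return {
        "results": page,
        "page_info": {
            "start_cursor": page[0] if page else None,
            "end_cursor": page[-1] if page else None,
            "has_next_page": bool(page) and page[-1] < items[-1],
            "has_previous_page": bool(page) and page[0] > items[0],
        },
    }
```

Because the function is stateless, resuming from `end_cursor` of one call as the `cursor` of the next reproduces the same page sequence regardless of which process serves each request.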
Example (visit_origin)
Pseudo code for Postgres
class OriginVisitPaginator:
    _pagination_key = ("visit",)  # will be the clustering key for Cassandra
    _table_alias = "v"  # to handle joins using v.visit

    def __init__(self, cursor, direction, limit, filters.., order..):
        # validations for cursor, filters etc
        ...

    def get_pagination_clause(self):
        # < or > depends on the direction of pagination
        # this function will go to a base class
        s = ""
        for each in cursor:
            s += """WHERE _pagination_key[i] > each"""
        return s

    def get_query(self):
        cl = self.get_pagination_clause()
        return """SELECT visit, ... FROM ... WHERE ... ORDER BY <cursor fields>"""

    def execute(self, query):
        return {"results": [<visit obj>], "cursor": (visit,)}
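One caveat for the base-class clause builder: for a composite cursor the per-field `>` comparisons cannot simply be concatenated; the correct keyset condition is lexicographic. Postgres supports the row-value form `(a, b) > (%s, %s)` directly, and a backend-agnostic expansion could look like this sketch (parameter placeholder style and function name are assumptions):

```python
def keyset_clause(fields, direction="forward", alias=None):
    """Build a lexicographic (keyset) WHERE condition for a composite cursor.

    For fields (a, b) and a forward scan this yields
        (a > %s) OR (a = %s AND b > %s)
    which is equivalent to the row-value comparison (a, b) > (%s, %s).
    """
    op = ">" if direction == "forward" else "<"
    cols = [f"{alias}.{f}" if alias else f for f in fields]
    alternatives = []
    for i, col in enumerate(cols):
        # all earlier cursor fields must be equal, the current one strictly beyond
        equalities = [f"{c} = %s" for c in cols[:i]]
        alternatives.append("(" + " AND ".join(equalities + [f"{col} {op} %s"]) + ")")
    return " OR ".join(alternatives)
```

The `ORDER BY` must list the same fields in the same direction for the clause to select exactly the next (or previous) page.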
Yet to find the right way to run an aggregate query for count(*) outside the pagination clause
Yet to find the right way to find has_next and has_previous
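One commonly used answer to the has_next / has_previous question, offered here as a suggestion rather than a decided design: over-fetch one row beyond the limit to detect a next page, and treat the presence of a supplied cursor as evidence of a previous page. (For the count question, Postgres can return `count(*) OVER ()` as a window function alongside the rows, at the cost of scanning the filtered set.) A toy sketch of the over-fetch trick:

```python
def fetch_page(rows, cursor, limit):
    """Detect has_next_page by over-fetching one row beyond the limit."""
    after = [r for r in rows if cursor is None or r > cursor]
    window = after[: limit + 1]      # ask the backend for limit + 1 rows
    has_next = len(window) > limit   # an extra row means a next page exists
    # a supplied cursor implies rows were skipped, so a previous page exists;
    # this is approximate if a client starts mid-stream with a fabricated cursor
    has_previous = cursor is not None
    return window[:limit], has_next, has_previous
```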
Pagination client
Client to abstract the complexities
In SWH.core, storage or in SWH-commons? (maybe)
To be used by a client (eg Web or GraphQL) for pagination
Talks using opaque tokens and hides the complexity of internal tokens
Same cursor everywhere and can be shared
Code mostly written in the GraphQL layer
class PaginatorClient:
    def __init__(self, paginator_type, cursor, filters, sort):
        # cursor will be None to start
        self.paginator_type = paginator_type
        ...
        self.data = []

    # def __iter__(self):
    #     return self

    def encode_cursor(self) -> str:
        # create uniform cursors (same length, type etc)
        # std algo to convert a tuple to a base64-encoded str
        # the algo can be as simple as: serialize the tuple and encode
        # the same algo will also be available as plain functions
        ...

    def decode_cursor(self) -> tuple:
        ...

    def next(self, limit):
        # make the call
        # update the cursor and metadata
        # raise NoMorePages error
        ...
        self.data = [<obj1>, <obj2>]

    def previous(self, limit):
        ...
        self.data = [<obj1>]

    def get_data(self) -> list:
        return self.data

    def get_data_with_cursors(self) -> list[(<obj>, str)]:
        # loop over the items to generate cursors
        return [(x, self.get_item_cursor(i)) for (i, x) in enumerate(self.data)]

    def get_item_cursor(self, i) -> str:
        # returns an opaque cursor
        # read the i-th item and generate its cursor
        return cursor

    def get_end_cursor(self) -> str:
        # returns an opaque cursor
        return cursor of the last item

    def get_first_cursor(self) -> str:
        # returns an opaque cursor
        return cursor of the first item

    def get_total_count(self):
        # yet to find the right way to implement this
        # run the query if this is the first call
        ...
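A usage sketch for the client interface above, with the backing storage call faked by an in-memory list. `NoMorePages` comes from the pseudocode; everything else here (the constructor signature, the internal cursor handling) is an illustrative stand-in, not the real implementation:

```python
class NoMorePages(Exception):
    pass


class PaginatorClient:
    """Toy stand-in for the client above, backed by an in-memory list."""

    def __init__(self, items, cursor=None):
        self._items = sorted(items)
        self._cursor = cursor  # None to start from the beginning
        self.data = []

    def next(self, limit):
        after = [x for x in self._items if self._cursor is None or x > self._cursor]
        if not after:
            raise NoMorePages
        self.data = after[:limit]
        self._cursor = self.data[-1]  # advance the internal cursor
        return self.data


# a caller simply loops pages until the client signals exhaustion
client = PaginatorClient([5, 3, 1, 4, 2])
pages = []
while True:
    try:
        pages.append(client.next(2))
    except NoMorePages:
        break
# pages == [[1, 2], [3, 4], [5]]
```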