Skip to content
Snippets Groups Projects

collections: Add index to ImmutableDict to speedup look up by key

Merged Antoine Lambert requested to merge anlambert/swh-model:immutable-dict-add-index into master
1 unresolved thread

Previously when looking up data by key in an ImmutableDict, the inner tuple storing keys and values was iterated until finding the requested key.

This is not really efficient when the ImmutableDict contains a lot of entries, typically for an origin snapshot containing a lot of branches.

So add an index to speedup look up by key operations and improve loader performances.

Before these changes, this is the timing we obtain when performing a new visit of the v8 repository (for the record I am using a homemade storage proxy that reads from production storage and writes in a memory storage):

14:04 $ time swh loader -C ~/.config/swh/loader.yml run git https://github.com/v8/v8
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/v8/v8' with type 'git'
Enumerating objects: 106, done.
Counting objects: 100% (106/106), done.
Compressing objects: 100% (96/96), done.
Total 106 (delta 48), reused 56 (delta 8), pack-reused 0
INFO:swh.loader.git.loader:Listed 30888 refs for repo https://github.com/v8/v8
INFO:swh.loader.git.loader.GitLoader:Fetched 107 objects; 107 are new
{'status': 'eventful'} for origin 'https://github.com/v8/v8'

real    0m59,440s
user    0m33,682s
sys     0m0,847s

And this is the timing we obtain after applying these changes:

14:07 $ time swh loader -C ~/.config/swh/loader.yml run git https://github.com/v8/v8
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/v8/v8' with type 'git'
Enumerating objects: 106, done.
Counting objects: 100% (106/106), done.
Compressing objects: 100% (96/96), done.
Total 106 (delta 48), reused 56 (delta 8), pack-reused 0
INFO:swh.loader.git.loader:Listed 30888 refs for repo https://github.com/v8/v8
INFO:swh.loader.git.loader.GitLoader:Fetched 107 objects; 107 are new
{'status': 'eventful'} for origin 'https://github.com/v8/v8'

real    0m28,338s
user    0m3,613s
sys     0m0,945s

So a x2 speedup !

Merge request reports

Pipeline #607 passed

Pipeline passed for d6d17dad on anlambert:immutable-dict-add-index

Merged by Antoine LambertAntoine Lambert 2 years ago (Feb 14, 2023 12:59pm UTC)

Loading

Pipeline #633 passed

Pipeline passed for d6d17dad on master

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
32 32 self.data = data.data
33 33 else:
34 34 self.data = tuple(data)
35 self.idx = {k: i for i, k in enumerate(k for k, _ in self.data)}
35 36
36 37 def __repr__(self):
37 38 return f"ImmutableDict({dict(self.data)!r})"
38 39
39 40 def __getitem__(self, key):
40 for (k, v) in self.data:
41 if k == key:
42 return v
41 idx = self.idx.get(key)
42 if idx is not None:
43 return self.data[idx][1]
43 44 raise KeyError(key)
  • Nicolas Dandrimont approved this merge request

    approved this merge request

  • Ah, yeah, of course, this was created for "write only" key/value storage initially, so we never thought to improve this...

    Thanks!

  • added 1 commit

    • 9565f7bf - collections: Add index to ImmutableDict to speedup look up by key

    Compare with previous version

  • Author Maintainer

    ^ Apply @olasd comment.

  • Jenkins job DMOD/gitlab-builds #3 succeeded .
    See Console Output and Coverage Report for more details.

  • As discussed IRL, there is no need to have both the data tuple and idx dict. The reason for the tuple was to guarantee being read-only, but we clearly can't do that anymore.

    Instead, let's make data a read-only property that returns a tuplification of the underlying dict

  • I mean, it's not more or less immutable than before (one could always have changed the value of the data attribute). The idx attribute could be made a read-only property as well.

  • Author Maintainer

    What is the proper fix then ? Using @vlorentz approach, I came up with the code below and tests are still happy:

    class ImmutableDict(Mapping, Generic[KT, VT]):
        """A frozen dictionary.
    
        This class behaves like a dictionary, but internally stores objects in a tuple,
        so it is both immutable and hashable."""
    
        _data: Dict[KT, VT]
    
        def __init__(
            self,
            data: Union[
                Iterable[Tuple[KT, VT]], "ImmutableDict[KT, VT]", Dict[KT, VT]
            ] = {},
        ):
            if isinstance(data, dict):
                self._data = data
            elif isinstance(data, ImmutableDict):
                self._data = data._data
            else:
                self._data = {k: v for k, v in data}
    
        @property
        def data(self):
            return tuple(self._data.items())
    
        def __repr__(self):
            return f"ImmutableDict({dict(self.data)!r})"
    
        def __getitem__(self, key):
            return self._data[key]
    
        def __iter__(self):
            for (k, v) in self.data:
                yield k
    
        def __len__(self):
            return len(self._data)
    
        def items(self):
            yield from self.data
    
        def __hash__(self):
            return hash(tuple(sorted(self.data)))
    
        def copy_pop(self, popped_key) -> Tuple[Optional[VT], "ImmutableDict[KT, VT]"]:
            """Returns a copy of this ImmutableDict without the given key,
            as well as the value associated to the key."""
            new_items = copy.deepcopy(self._data)
            popped_value = new_items.pop(popped_key, None)
            return (popped_value, ImmutableDict(new_items))
    • Resolved by Nicolas Dandrimont

      Yeah, sure, that's probably fine. I just meant to say that immutability in Python has always been more of a convention than something you can reliably enforce.

      How/when/why did we introduce __hash__ without __eq__? Seems like these __hash__ semantics are incomplete: they hash ImmutableDicts with different key orders the same way, when I'm not sure they'd really be __eq__ual to one another.

  • added 1 commit

    • d6d17dad - collections: Improve ImmutableDict look up by key performance

    Compare with previous version

  • Author Maintainer

    ^ Apply @vlorentz comment

    Instead, let's make data a read-only property that returns a tuplification of the underlying dict

  • Jenkins job DMOD/gitlab-builds #4 succeeded .
    See Console Output and Coverage Report for more details.

  • Please register or sign in to reply
    Loading