collections: Add index to ImmutableDict to speedup look up by key
Previously when looking up data by key in an ImmutableDict, the inner tuple storing keys and values was iterated until finding the requested key.
This is not really efficient when the ImmutableDict contains a lot of entries, typically for an origin snapshot containing a lot of branches.
So add an index to speedup look up by key operations and improve loader performances.
Before these changes, this is the timing we obtain when performing a new visit of the v8 repository (for the record I am using a homemade storage proxy that reads from production storage and writes in a memory storage):
14:04 $ time swh loader -C ~/.config/swh/loader.yml run git https://github.com/v8/v8
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/v8/v8' with type 'git'
Enumerating objects: 106, done.
Counting objects: 100% (106/106), done.
Compressing objects: 100% (96/96), done.
Total 106 (delta 48), reused 56 (delta 8), pack-reused 0
INFO:swh.loader.git.loader:Listed 30888 refs for repo https://github.com/v8/v8
INFO:swh.loader.git.loader.GitLoader:Fetched 107 objects; 107 are new
{'status': 'eventful'} for origin 'https://github.com/v8/v8'
real 0m59,440s
user 0m33,682s
sys 0m0,847s
And this is the timing we obtain after applying these changes:
14:07 $ time swh loader -C ~/.config/swh/loader.yml run git https://github.com/v8/v8
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/v8/v8' with type 'git'
Enumerating objects: 106, done.
Counting objects: 100% (106/106), done.
Compressing objects: 100% (96/96), done.
Total 106 (delta 48), reused 56 (delta 8), pack-reused 0
INFO:swh.loader.git.loader:Listed 30888 refs for repo https://github.com/v8/v8
INFO:swh.loader.git.loader.GitLoader:Fetched 107 objects; 107 are new
{'status': 'eventful'} for origin 'https://github.com/v8/v8'
real 0m28,338s
user 0m3,613s
sys 0m0,945s
So a x2 speedup !
Merge request reports
Activity
Jenkins job DMOD/gitlab-builds #2 succeeded .
See Console Output and Coverage Report for more details.32 32 self.data = data.data 33 33 else: 34 34 self.data = tuple(data) 35 self.idx = {k: i for i, k in enumerate(k for k, _ in self.data)} 35 36 36 37 def __repr__(self): 37 38 return f"ImmutableDict({dict(self.data)!r})" 38 39 39 40 def __getitem__(self, key): 40 for (k, v) in self.data: 41 if k == key: 42 return v 41 idx = self.idx.get(key) 42 if idx is not None: 43 return self.data[idx][1] 43 44 raise KeyError(key) - Comment on lines 39 to 43Edited by Nicolas Dandrimont
changed this line in version 2 of the diff
added 1 commit
- 9565f7bf - collections: Add index to ImmutableDict to speedup look up by key
^ Apply @olasd comment.
Jenkins job DMOD/gitlab-builds #3 succeeded .
See Console Output and Coverage Report for more details.What is the proper fix then ? Using @vlorentz approach, I came up with the code below and tests are still happy:
class ImmutableDict(Mapping, Generic[KT, VT]): """A frozen dictionary. This class behaves like a dictionary, but internally stores objects in a tuple, so it is both immutable and hashable.""" _data: Dict[KT, VT] def __init__( self, data: Union[ Iterable[Tuple[KT, VT]], "ImmutableDict[KT, VT]", Dict[KT, VT] ] = {}, ): if isinstance(data, dict): self._data = data elif isinstance(data, ImmutableDict): self._data = data._data else: self._data = {k: v for k, v in data} @property def data(self): return tuple(self._data.items()) def __repr__(self): return f"ImmutableDict({dict(self.data)!r})" def __getitem__(self, key): return self._data[key] def __iter__(self): for (k, v) in self.data: yield k def __len__(self): return len(self._data) def items(self): yield from self.data def __hash__(self): return hash(tuple(sorted(self.data))) def copy_pop(self, popped_key) -> Tuple[Optional[VT], "ImmutableDict[KT, VT]"]: """Returns a copy of this ImmutableDict without the given key, as well as the value associated to the key.""" new_items = copy.deepcopy(self._data) popped_value = new_items.pop(popped_key, None) return (popped_value, ImmutableDict(new_items))
- Resolved by Nicolas Dandrimont
Yeah, sure, that's probably fine. I just meant to say that immutability in Python has always been more of a convention than something you can reliably enforce.
How/when/why did we introduce
__hash__
without__eq__
? Seems like these__hash__
semantics are incomplete: they hash ImmutableDicts with different key orders the same way, when I'm not sure they'd really be__eq__
ual to one another.
added 1 commit
- d6d17dad - collections: Improve ImmutableDict look up by key performance
^ Apply @vlorentz comment
Instead, let's make
data
a read-only property that returns a tuplification of the underlying dictJenkins job DMOD/gitlab-builds #4 succeeded .
See Console Output and Coverage Report for more details.