Skip to content
Snippets Groups Projects
  1. Aug 10, 2021
  2. Jul 27, 2021
  3. Jul 09, 2021
  4. Jun 19, 2021
  5. Jun 17, 2021
  6. Jun 15, 2021
  7. Jun 09, 2021
  8. Jun 02, 2021
  9. May 12, 2021
  10. May 10, 2021
  11. May 08, 2021
  12. May 07, 2021
  13. May 05, 2021
  14. May 04, 2021
  15. Apr 27, 2021
  16. Apr 23, 2021
  17. Apr 16, 2021
    • Antoine Lambert's avatar
      cli: Fix sphinx warning · 04a1656c
      Antoine Lambert authored
      This fixes the following warning:
      
      <generated>:1: WARNING: Inline emphasis start-string without end-string.
      
      Related to T2265
      04a1656c
  18. Apr 15, 2021
  19. Apr 09, 2021
    • Antoine Pietri's avatar
    • Antoine Pietri's avatar
      NodeIdMap: use the MPH + mmapped .order to translate SWHID -> node ID · 53bbd5c6
      Antoine Pietri authored
      Right now we are generating two different binary mappings on disk for
      the translation between SWHID <-> webgraph node ID:
      
      1. The node2swh.bin reverse map, which contains a list of binary SWHID.
      It allows O(1) access to the SWHID of a given node n by seek()ing to the
      position (n * record size).
      
      2. The swhid2node.bin map, which contains a list of <SWHID, node ID>
         ordered by SWHID. The node ID of a given SWHID can be found in
         O(log N) by doing a binary search on this list.
      
      Because the swhid -> node binary search requires multiple seek() on the
      disk, it is very slow and cannot be used for performance sensitive code.
      
      The alternative route is to compute the node from the minimal perfect
      hash function and then lookup the equivalent node in the permuted graph
      using the .order file containing the permutation, which can be done in
      O(1) and is extremely fast. However, this has two caveats:
      
      - The .order file is ~150 GB, which would be too big to load in memory.
      
      - MPH cannot check whether their input is valid. They could do so
        probabilistically if we signed them, but when replying to a query in
        the graph service, we want to be absolutely certain that a node is or
        is not present in the graph.
      
      This code mitigates these problems in two ways. First, it memory-maps
      the permutation file to avoid loading it in main memory. Second, it uses
      a roundtrip check to detect invalid SWHIDs: we hash + permute the
      SWHID, then use the reverse map to check that the obtained node ID is
      associated to the original SWHID.
      
      This is a big performance gain (basic benchmarks show a ~x3 speedup).
      To go even faster, we offer a boolean option to skip the roundtrip
      check, to use when we know that the input is always valid.
      
      This will also allow us in the future to remove the swhid2node map
      completely, however it is currently still in use by the Python frontend
      to encode the SWHIDs. This will be done directly in the Java side in the
      future.
      53bbd5c6
Loading