
Documentation overhaul

Large rework of the entire documentation of swh-graph.

  • New tutorial on how to use the Java API.
  • New page on disk/memory tradeoffs
  • Update the compression pipeline documentation with extensive details
  • Fix the compression steps diagram
  • Merge use-cases "blueprint" documentation in the HTTP API page as usage examples

Migrated from D7839 (view on Phabricator)

Activity
docs/memory.rst 0 → 100644
virtual address space. The Linux kernel will then be free to arbitrarily cache
the file, either partially or in its entirety, depending on the available
memory space.

In our experiments, memory-mapping a small graph from an SSD only incurs a
relatively small slowdown (about 15-20%). However, when the graph is too big to
fit in RAM, the kernel has to constantly invalidate pages to cache newly
accessed sections, which incurs a very large performance penalty. A full
traversal of a large graph that usually takes about 20 hours when loaded in
main memory could take more than a year when mapped from a hard drive!

When deciding what to direct-load and what to memory-map, here are a few rules
of thumb:

- If you don't need random access to the graph edges, you can consider using
  the "offline" loading mode. The offsets won't be loaded which will save
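The direct-load vs memory-map tradeoff described above can be sketched with Python's standard ``mmap`` module. This is an illustrative toy, not swh-graph code: a zero-filled temporary file stands in for a real ``graph.graph`` file.

```python
import mmap
import os
import tempfile

# Stand-in for a compressed graph file (1 MiB of zeros).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * 1024 * 1024)
    path = f.name

# Direct loading: the whole file is read into process memory up front.
with open(path, "rb") as f:
    data = f.read()

# Memory mapping: the file is mapped into virtual address space; the kernel
# pages it in lazily and may evict pages again under memory pressure, which
# is what causes the slowdown discussed above when the graph exceeds RAM.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first_byte = mm[0]  # touching a byte faults its page in
    mm.close()

os.remove(path)
print(len(data), first_byte)  # → 1048576 0
```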
  • vlorentz @vlorentz started a thread on the diff
  • docs/memory.rst 0 → 100644

    we load the graph using the memory-mapped loading mode, which makes it use the
    shared memory stored in the tmpfs under the hood.

    Here is a systemd service that can be used to perform this task automatically:

    .. code-block:: ini

       [Unit]
       Description=swh-graph memory sharing in tmpfs

       [Service]
       Type=oneshot
       RemainAfterExit=yes
       ExecStart=mkdir -p /dev/shm/swh-graph/default
       ExecStart=sh -c "ln -s /.../compressed/* /dev/shm/swh-graph/default"
       ExecStart=cp --remove-destination /.../compressed/graph.graph /dev/shm/swh-graph/default
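The three ExecStart steps above (make the target directory, symlink the dataset, replace the ``graph.graph`` symlink with a real copy so the big file lives in shared memory) can be mirrored in plain Python. This is a hypothetical sketch run against temporary directories standing in for ``/dev/shm/swh-graph/default`` and the real ``compressed/`` dataset:

```python
import os
import shutil
import tempfile

compressed = tempfile.mkdtemp()   # stand-in for .../compressed
shm = tempfile.mkdtemp()          # stand-in for /dev/shm/swh-graph
target = os.path.join(shm, "default")

# Fake dataset files for the demo.
for name in ("graph.graph", "graph.properties", "graph.offsets"):
    with open(os.path.join(compressed, name), "w") as f:
        f.write(name)

os.makedirs(target, exist_ok=True)            # mkdir -p
for name in os.listdir(compressed):           # ln -s .../compressed/*
    os.symlink(os.path.join(compressed, name), os.path.join(target, name))

# cp --remove-destination: swap the graph.graph symlink for a real copy,
# so the largest file actually resides in the tmpfs.
dest = os.path.join(target, "graph.graph")
os.remove(dest)
shutil.copy(os.path.join(compressed, "graph.graph"), dest)

print(os.path.islink(dest), os.path.islink(os.path.join(target, "graph.properties")))
# → False True
```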
  • vlorentz @vlorentz started a thread on the diff
  •    .. code:: console

    +  (venv) $ pip install awscli
    +  [...]
    +  (venv) $ mkdir -p 2021-03-23-popular-3k-python/compressed
    +  (venv) $ cd 2021-03-23-popular-3k-python/
    +  (venv) $ aws s3 cp --recursive s3://softwareheritage/graph/2021-03-23-popular-3k-python/compressed/ compressed

    -  (swhenv) ~/t/swh-graph-tests$ swh graph compress --graph swh/graph/tests/dataset/example --outdir output/
    -  [...]

    +  You can also retrieve larger graphs, but note that these graphs are generally
    +  intended to be loaded fully in RAM, and do not fit on ordinary desktop
    +  machines. The server we use in production to run the graph service has more
    +  than 700 GiB of RAM. These memory considerations are discussed in more detail
    +  in :ref:`swh-graph-memory`.
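The dataset layout used by the ``aws s3 cp`` invocation above follows a regular pattern. A hypothetical helper (the function name ``dataset_paths`` is invented for illustration) that builds the S3 source URI and local target directory for any dataset name could look like this:

```python
def dataset_paths(name: str) -> tuple[str, str]:
    """Build the S3 source URI and local target dir for a graph dataset,
    following the s3://softwareheritage/graph/<name>/compressed/ layout
    shown in the quickstart."""
    s3_uri = f"s3://softwareheritage/graph/{name}/compressed/"
    local_dir = f"{name}/compressed"
    return s3_uri, local_dir

s3_uri, local_dir = dataset_paths("2021-03-23-popular-3k-python")
print(s3_uri)    # → s3://softwareheritage/graph/2021-03-23-popular-3k-python/compressed/
print(local_dir) # → 2021-03-23-popular-3k-python/compressed
```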
  • very nice!

  • Review fixes

  • Build is green

    Patch application report for D7839 (id=28333)

    Rebasing onto 579f5a9e...

    Current branch diff-target is up to date.
    Changes applied before test
    commit 2141188fb34414bb58d42231b5cc5f0d16cb9d6a
    Author: Antoine Pietri <antoine.pietri1@gmail.com>
    Date:   Tue May 17 01:49:41 2022 +0200
    
        Documentation overhaul
        
        Large rework of the entire documentation of swh-graph.
        
        - New tutorial on how to use the Java API.
        - New page on disk/memory tradeoffs
        - Update the compression pipeline documentation with extensive details
        - Fix the compression steps diagram
        - Merge use-cases "blueprint" documentation in the HTTP API page as
          usage examples

    See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/189/ for more details.

  • Build is green

    Patch application report for D7839 (id=28334)

    Rebasing onto 579f5a9e...

    Current branch diff-target is up to date.
    Changes applied before test
    commit 7a1901476f0dc662629ff05878f39430efc6837a
    Author: Antoine Pietri <antoine.pietri1@gmail.com>
    Date:   Tue May 17 01:49:41 2022 +0200
    
        Documentation overhaul

    See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/190/ for more details.

  • Build is green

    Patch application report for D7839 (id=28335)

    Rebasing onto 579f5a9e...

    Current branch diff-target is up to date.
    Changes applied before test
    commit a0c974f5215e273ab04a29554456548f4b9b8316
    Author: Antoine Pietri <antoine.pietri1@gmail.com>
    Date:   Tue May 17 01:49:41 2022 +0200
    
        Documentation overhaul

    See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/191/ for more details.

  • Jared @JaredR26 started a thread on the diff
  • -  to install it if you want to hack the code or install it from this git
    -  repository. To compress a graph, you will need zstd_ compression tools.
    -
    -  It is highly recommended to install this package in a virtualenv.
    -
    -  On a Debian stable (buster) system:
    +  JRE. On a Debian system:

    -  .. code:: bash
    +  .. code:: console

    -  $ sudo apt install python3-virtualenv default-jre zstd
    +  $ sudo apt install python3 python3-venv default-jre

    -  .. _zstd: https://facebook.github.io/zstd/
  • Jared @JaredR26 started a thread on the diff
  • 29
    30 Install
    31 -------
    19 Installing swh.graph
    20 --------------------
    32 21
    33 22 Create a virtualenv and activate it:
    34 23
    35 .. code:: bash
    24 .. code:: console
    36 25
    37 ~/tmp$ mkdir swh-graph-tests
    38 ~/tmp$ cd swh-graph-tests
    39 ~/t/swh-graph-tests$ virtualenv swhenv
    40 ~/t/swh-graph-tests$ . swhenv/bin/activate
    26 $ python3 -m venv .venv
    • This command didn't work for me: "/usr/bin/python3: Relative module names not supported".

      python3 -m venv workingDir did work. The next line then needed to be 'source workingDir/bin/activate'.

      In addition, venv did something weird on my first try (created an extra venv dir?), but on my second try, after running a full apt update; apt upgrade, it did not. I'd highlight that package upgrades need to be run to ensure the commands work as written.
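The fixed commands in the comment above can also be reproduced from Python's standard library ``venv`` module, which is what ``python3 -m venv`` invokes under the hood. A minimal sketch, assuming a POSIX system with the Debian/Ubuntu ``python3-venv`` package installed:

```python
import os
import tempfile
import venv

# Equivalent of: python3 -m venv workingDir
# (with_pip=False skips ensurepip, keeping the demo fast and offline)
workdir = tempfile.mkdtemp()
env_dir = os.path.join(workdir, "workingDir")
venv.create(env_dir, with_pip=False)

# The script that `source workingDir/bin/activate` would load:
activate = os.path.join(env_dir, "bin", "activate")
print(os.path.exists(activate))  # → True
```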

  • Jared @JaredR26 started a thread on the diff
  • +  .. code:: console

    -  ~/tmp$ mkdir swh-graph-tests
    -  ~/tmp$ cd swh-graph-tests
    -  ~/t/swh-graph-tests$ virtualenv swhenv
    -  ~/t/swh-graph-tests$ . swhenv/bin/activate
    +  $ python3 -m venv .venv
    +  $ source .venv/bin/activate

       Install the ``swh.graph`` python package:

    -  .. code:: bash
    +  .. code:: console

    -  (swhenv) ~/t/swh-graph-tests$ pip install swh.graph
    +  (venv) $ pip install swh.graph
  • Jared @JaredR26 started a thread on the diff
  • -  .. code:: bash
    +  In our example:
    +
    +  .. code:: console

    -  (swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g output/example
    -  Loading graph output/example ...
    +  (venv) $ swh graph rpc-serve -g compressed/graph
    +  Loading graph compressed/graph ...
       Graph loaded.
       ======== Running on http://0.0.0.0:5009 ========
       (Press CTRL+C to quit)

       From there you can use this endpoint to query the compressed graph, for example
    -  with httpie_ (``sudo apt install``) from another terminal:
    +  with httpie_ (``sudo apt install httpie``):
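The httpie command in the diff above boils down to a plain GET against the local rpc-serve instance. A small sketch that builds the same request URL from Python (the helper name ``leaves_url`` is invented for illustration; actually issuing the request of course requires the server to be running):

```python
from urllib.parse import quote

def leaves_url(swhid: str, host: str = "localhost", port: int = 5009) -> str:
    """URL of the /graph/leaves endpoint for a given SWHID, matching
    what `http :5009/graph/leaves/<swhid>` sends."""
    # safe=':' keeps the colons of the SWHID unescaped in the path.
    return f"http://{host}:{port}/graph/leaves/{quote(swhid, safe=':')}"

url = leaves_url("swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323")
print(url)
# → http://localhost:5009/graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323
```

From there, `urllib.request.urlopen(url)` (or `requests.get(url)`) would fetch the leaves, in a second process or terminal while rpc-serve keeps running.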
    • Two things: 1. I think the "from another terminal" addition is helpful. I would EXPECT people to understand that, but they might not notice the prompt below and think they can pass these commands directly into the graph terminal after they run rpc-serve.

      And 2. Following these instructions was giving me 400 errors. I'm not sure why it isn't able to read the SWHID correctly?

      http :5009/graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323
      HTTP/1.1 400 Bad Request
      Content-Length: 18
      Content-Type: text/plain; charset=utf-8
      Date: Tue, 17 May 2022 18:03:55 GMT
      Server: Python/3.8 aiohttp/3.8.1

      Unknown SWHID: swh

      This is on a brand new ubuntu t3a.small instance on AWS, after doing apt update;upgrade.

    • FYI, this is what the other terminal showed as received. It all looks right; I'm not sure why it didn't work.

      ~/2021-03-23-popular-3k-python$ swh graph rpc-serve -g compressed/graph
      INFO:root:using swh-graph JAR: /home/ubuntu/workingDir/share/swh-graph/swh-graph-0.5.2.jar
      Loading graph compressed/graph ...
      Graph loaded.
      ======== Running on http://0.0.0.0:5009 ========
      (Press CTRL+C to quit)
      INFO:aiohttp.access:127.0.0.1 [17/May/2022:18:02:10 +0000] "GET /graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 HTTP/1.1" 400 178 "-" "HTTPie/1.0.3"
      INFO:aiohttp.access:127.0.0.1 [17/May/2022:18:02:27 +0000] "GET /graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 HTTP/1.1" 400 178 "-" "HTTPie/1.0.3"
      INFO:aiohttp.access:127.0.0.1 [17/May/2022:18:02:38 +0000] "GET /graph/leaves/dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 HTTP/1.1" 400 180 "-" "HTTPie/1.0.3"
      INFO:aiohttp.access:127.0.0.1 [17/May/2022:18:03:11 +0000] "GET /graph/visit/nodes/swh:1:rel:0000000000000000000000000000000000000010 HTTP/1.1" 400 178 "-" "HTTPie/1.0.3"
      INFO:aiohttp.access:127.0.0.1 [17/May/2022:18:03:55 +0000] "GET /graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 HTTP/1.1" 400 178 "-" "HTTPie/1.0.3"
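A quick client-side sanity check can at least rule out malformed identifiers before blaming the server. The pattern below follows the ``swh:1:<type>:<40 hex digits>`` shape seen in the requests above; including ``ori`` is an assumption based on this thread (it is a swh-graph node type rather than a core SWHID type). Note that most of the logged requests already match this shape, while the ``dir:432d...`` one (missing the ``swh:1:`` prefix) does not:

```python
import re

# swh:1:<type>:<40 lowercase hex digits>; "ori" included per the swh-graph
# discussion above, hedged: it is not one of the core SWHID object types.
SWHID_RE = re.compile(r"^swh:1:(cnt|dir|rev|rel|snp|ori):[0-9a-f]{40}$")

ok = bool(SWHID_RE.match("swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323"))
bad = bool(SWHID_RE.match("dir:432d1b21c1256f7408a07c577b6974bbdbcc1323"))
print(ok, bad)  # → True False
```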
      
      
  • Jared @JaredR26 started a thread on the diff
  •    queried by the ``swh-graph`` library.

    -  Own datasets
    -  ^^^^^^^^^^^^
    +  All the publicly available datasets are documented on this page:
    +  https://docs.softwareheritage.org/devel/swh-dataset/graph/dataset.html

    -  A graph is described as both its adjacency list and the set of nodes
    -  identifiers in plain text format. Such graph example can be found in the
    -  ``swh/graph/tests/dataset/`` folder.
    +  A good way of retrieving these datasets is to use the `AWS S3 CLI
    +  <https://docs.aws.amazon.com/cli/latest/reference/s3/>`_.

    -  You can compress the example graph on the command line like this:
    +  Here is an example with the dataset ``2021-03-23-popular-3k-python``, which has
    +  a relatively reasonable size (~15 GiB including property data, with
    • May be worth noting: this graph will NOT load on an AWS nano instance (lack of RAM). I'm not sure whether a micro can run it or not, but I succeeded with a t3a.small instance. It might save some frustration for someone trying this on the AWS free tier if they knew the requirements in advance.

  • Still reviewing the rest but went through the quickstart. I wanted to submit these comments before it got too late over there.

  • Jared @JaredR26 started a thread on the diff
  • 405 417 "avg": 0.6107127825377487
    406 418 }
    407 419 }
    420
    421
    422 Use-case examples
    423 -----------------
    424
    425 This section showcases how to leverage the endpoints of the HTTP API described
    • For at least one of the below, I'd give a full example command with full output, like the httpie example in the quickstart. It only needs to be shown once, but it can help prompt someone who is speeding through.

      Also, something somewhere needs to mention the weirdness/problem of the compressed graph wanting the swh:1:ori:HASH identifier. Or does the rpc-api convert URIs given into proper IDs? What does it spit back out if you traverse to an origin?

      I can test it on my system, but I don't know why it is not liking SWHIDs given in the HTTP request.

    • FYI, the ORC graph dataset now has an "id" column in the origin table, specifically to convert back from these sha1s to the URLs. It's now very similar to the other nodes, and it's already documented in the documentation of the dataset (which is the correct place to put this, imo)

    • Regarding the output, the above page has a ton of examples already. I just put this here to remove the outdated use-cases page, but it still feels a bit clumsy. Not sure what a better way to present this would be.

    • Please register or sign in to reply
  • Jared @JaredR26 started a thread on the diff
  • docs/java-api.rst 0 → 100644

    `ImmutableGraph
    <https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/ImmutableGraph.html>`_,
    the abstract class providing the core API to manipulate and iterate on graphs.
    Under the hood, compressed graphs are stored as the `BVGraph
    <https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/BVGraph.html>`_
    class, which contains the actual codec used to compress and decompress
    adjacency lists.

    Graph **nodes** are mapped to a contiguous set of integers :math:`[0, n - 1]`
    where *n* is the total number of nodes in the graph.
    Each node has an associated *adjacency list*, i.e., a list of the nodes it
    points to. These lists represent the **edges** (or **arcs**) of the graph.

    **Note**: edges are always directed. Undirected graphs are internally stored as
    a pair of directed edges (src → dst, dst → src), and are called "symmetric"
    • Does this ever apply to SWH, anywhere? It is interesting to know but might confuse people who haven't yet read that everything in SWH is unidirectional.

    • For any algorithm where you don't care about the direction, yes. For instance if you want to compute connected components, for LLP, or even for a BFS if you want.
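The point in the reply above can be made concrete with a toy sketch (plain Python dictionaries, not the WebGraph API): to run a direction-agnostic algorithm such as connected components on a directed graph, add the reverse of every edge first, i.e. "symmetrize" the graph, then traverse normally.

```python
from collections import defaultdict

def connected_components(n, edges):
    """Connected components of a directed graph, computed by symmetrizing
    it (storing both src→dst and dst→src) and running a plain traversal."""
    adj = defaultdict(list)
    for src, dst in edges:
        adj[src].append(dst)
        adj[dst].append(src)  # reverse edge: ignore direction
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            comp.append(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(sorted(comp))
    return components

# 0→1, 2→1, 3→4: ignoring direction, two components remain.
print(connected_components(5, [(0, 1), (2, 1), (3, 4)]))
# → [[0, 1, 2], [3, 4]]
```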

  • Jared @JaredR26 started a thread on the diff
  • docs/java-api.rst 0 → 100644

    **Note**: edges are always directed. Undirected graphs are internally stored as
    a pair of directed edges (src → dst, dst → src), and are called "symmetric"
    graphs.

    On disk, a simple BVGraph with the basename ``graph`` would be represented as
    the following set of files:

    - ``graph.graph``: contains the compressed adjacency lists of each node, which
      can be decompressed by the BVGraph codec.
    - ``graph.properties``: contains metadata on the graph, such as the number of
      nodes and arcs, as well as additional loading information needed by the
      BVGraph codec.
    - ``graph.offsets``: a list of offsets of where the adjacency list of each node
      is stored in the main graph file.
    - ``graph.obl``: optionally, an "offset big-list file" which can be used to
      load graphs faster.
    • I think this section needs to be lower down, or maybe in an advanced-details page. It's an implementation detail that is useful to know when someone is down to the nuts and bolts, but not information they need to know or understand when doing more basic or early work. And when someone gets that far down, they probably want to know about more of the files than just these.

    • I disagree, it's useful to know in advance because it tells you which files you need to download for your particular use case.
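Following that point, a hypothetical helper (the function ``missing_files`` is invented for illustration) can check which of the files listed in the diff above are present for a given basename, treating ``graph.obl`` as optional since the docs describe it as such:

```python
import os
import tempfile

REQUIRED_EXTS = ("graph", "properties", "offsets")  # graph.obl is optional

def missing_files(basename: str):
    """Return the required BVGraph file extensions missing for `basename`."""
    return [ext for ext in REQUIRED_EXTS
            if not os.path.exists(f"{basename}.{ext}")]

# Demo against a temp dir holding only two of the three required files.
d = tempfile.mkdtemp()
base = os.path.join(d, "graph")
for ext in ("graph", "properties"):
    open(f"{base}.{ext}", "w").close()

print(missing_files(base))  # → ['offsets']
```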
