Documentation overhaul
Large rework of the entire documentation of swh-graph.
- New tutorial on how to use the Java API.
- New page on disk/memory tradeoffs
- Update the compression pipeline documentation with extensive details
- Fix the compression steps diagram
- Merge use-cases "blueprint" documentation in the HTTP API page as usage examples
Migrated from D7839 (view on Phabricator)
Activity
Build is green
Patch application report for D7839 (id=28316)
Rebasing onto 579f5a9e...
Current branch diff-target is up to date.
Changes applied before test
commit 7ebc5416d3660c73fd07a1bb07e2caad8964c857
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Tue May 17 01:49:41 2022 +0200

    Documentation overhaul

    Large rework of the entire documentation of swh-graph.

    - New tutorial on how to use the Java API.
    - New page on disk/memory tradeoffs
    - Update the compression pipeline documentation with extensive details
    - Fix the compression steps diagram
    - Merge use-cases "blueprint" documentation in the HTTP API page as usage examples
See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/188/ for more details.
- docs/memory.rst 0 → 100644
docs/memory.rst (excerpt):

    virtual address space. The Linux kernel will then be free to arbitrarily cache
    the file, either partially or in its entirety, depending on the available
    memory space.

    In our experiments, memory-mapping a small graph from a SSD only incurs a
    relatively small slowdown (about 15-20%). However, when the graph is too big to
    fit in RAM, the kernel has to constantly invalidate pages to cache newly
    accessed sections, which incurs a very large performance penalty. A full
    traversal of a large graph that usually takes about 20 hours when loaded in
    main memory could take more than a year when mapped from a hard drive!

    When deciding what to direct-load and what to memory-map, here are a few rules
    of thumb:

    - If you don't need random access to the graph edges, you can consider using
      the "offline" loading mode. The offsets won't be loaded which will save

- docs/memory.rst 0 → 100644
docs/memory.rst (excerpt):

    we load the graph using the memory-mapped loading mode, which makes it use the
    shared memory stored in the tmpfs under the hood.

    Here is a systemd service that can be used to perform this task automatically:

    .. code-block:: ini

        [Unit]
        Description=swh-graph memory sharing in tmpfs

        [Service]
        Type=oneshot
        RemainAfterExit=yes
        ExecStart=mkdir -p /dev/shm/swh-graph/default
        ExecStart=sh -c "ln -s /.../compressed/* /dev/shm/swh-graph/default"
        ExecStart=cp --remove-destination /.../compressed/graph.graph /dev/shm/swh-graph/default

Quickstart diff (old lines marked with -, new with +):

    + .. code:: console
    +
    +     (venv) $ pip install awscli
    +     [...]
    +     (venv) $ mkdir -p 2021-03-23-popular-3k-python/compressed
    +     (venv) $ cd 2021-03-23-popular-3k-python/
    +     (venv) $ aws s3 cp --recursive s3://softwareheritage/graph/2021-03-23-popular-3k-python/compressed/ compressed
    - (swhenv) ~/t/swh-graph-tests$ swh graph compress --graph swh/graph/tests/dataset/example --outdir output/
    - [...]
    + You can also retrieve larger graphs, but note that these graphs are generally
    + intended to be loaded fully in RAM, and do not fit on ordinary desktop
    + machines. The server we use in production to run the graph service has more
    + than 700 GiB of RAM. These memory considerations are discussed in more details
    + in :ref:`swh-graph-memory`.

Build is green
Patch application report for D7839 (id=28333)
Rebasing onto 579f5a9e...
Current branch diff-target is up to date.
Changes applied before test
commit 2141188fb34414bb58d42231b5cc5f0d16cb9d6a
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Tue May 17 01:49:41 2022 +0200

    Documentation overhaul

    Large rework of the entire documentation of swh-graph.

    - New tutorial on how to use the Java API.
    - New page on disk/memory tradeoffs
    - Update the compression pipeline documentation with extensive details
    - Fix the compression steps diagram
    - Merge use-cases "blueprint" documentation in the HTTP API page as usage examples
See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/189/ for more details.
Build is green
Patch application report for D7839 (id=28334)
Rebasing onto 579f5a9e...
Current branch diff-target is up to date.
Changes applied before test
commit 7a1901476f0dc662629ff05878f39430efc6837a
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Tue May 17 01:49:41 2022 +0200

    Documentation overhaul

    Large rework of the entire documentation of swh-graph.

    - New tutorial on how to use the Java API.
    - New page on disk/memory tradeoffs
    - Update the compression pipeline documentation with extensive details
    - Fix the compression steps diagram
    - Merge use-cases "blueprint" documentation in the HTTP API page as usage examples
See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/190/ for more details.
Build is green
Patch application report for D7839 (id=28335)
Rebasing onto 579f5a9e...
Current branch diff-target is up to date.
Changes applied before test
commit a0c974f5215e273ab04a29554456548f4b9b8316
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Tue May 17 01:49:41 2022 +0200

    Documentation overhaul

    Large rework of the entire documentation of swh-graph.

    - New tutorial on how to use the Java API.
    - New page on disk/memory tradeoffs
    - Update the compression pipeline documentation with extensive details
    - Fix the compression steps diagram
    - Merge use-cases "blueprint" documentation in the HTTP API page as usage examples
See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/191/ for more details.
Quickstart diff (old lines marked with -, new with +):

    - to install it if you want to hack the code or install it from this git
    - repository. To compress a graph, you will need zstd_ compression tools.
    -
    - It is highly recommended to install this package in a virtualenv.
    -
    - On a Debian stable (buster) system:
    -
    - .. code:: bash
    -
    -     $ sudo apt install python3-virtualenv default-jre zstd
    -
    - .. _zstd: https://facebook.github.io/zstd/
    + JRE. On a Debian system:
    +
    + .. code:: console
    +
    +     $ sudo apt install python3 python3-venv default-jre

    - Install
    - -------
    + Installing swh.graph
    + --------------------

      Create a virtualenv and activate it:

    - .. code:: bash
    -
    - ~/tmp$ mkdir swh-graph-tests
    - ~/tmp$ cd swh-graph-tests
    - ~/t/swh-graph-tests$ virtualenv swhenv
    - ~/t/swh-graph-tests$ . swhenv/bin/activate
    + $ python3 -m venv .venv

This command didn't work for me: "/usr/bin/python3: Relative module names not supported".
`python3 -m venv workingDir` did work. The next line then needed to be `source workingDir/bin/activate`.
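For reference, this is the exact sequence that worked for me (the ``workingDir`` name is arbitrary; any directory name should behave the same):

```shell
# Create the virtualenv under an arbitrary directory name, then activate it
python3 -m venv workingDir
source workingDir/bin/activate
```

After activation the prompt gains a `(workingDir)` prefix, which is the cue the docs' `(venv) $` prompts refer to.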
In addition, venv did something weird on my first try (it created an extra venv dir?), but on my second try, after running a full `apt update; apt upgrade`, it did not. I'd highlight that package upgrades need to be run to ensure the commands work as written.
Quickstart diff (old lines marked with -, new with +):

    + .. code:: console
    +
    - ~/tmp$ mkdir swh-graph-tests
    - ~/tmp$ cd swh-graph-tests
    - ~/t/swh-graph-tests$ virtualenv swhenv
    - ~/t/swh-graph-tests$ . swhenv/bin/activate
    + $ python3 -m venv .venv
    + $ source .venv/bin/activate

      Install the ``swh.graph`` python package:

    - .. code:: bash
    + .. code:: console

    - (swhenv) ~/t/swh-graph-tests$ pip install swh.graph
    + (venv) $ pip install swh.graph

    [...]

    - .. code:: bash
    + In our example:
    +
    + .. code:: console

    - (swhenv) ~/t/swh-graph-tests$ swh graph rpc-serve -g output/example
    - Loading graph output/example ...
    + (venv) $ swh graph rpc-serve -g compressed/graph
    + Loading graph compressed/graph ...
      Graph loaded.
      ======== Running on http://0.0.0.0:5009 ========
      (Press CTRL+C to quit)

    From there you can use this endpoint to query the compressed graph, for example
    - with httpie_ (``sudo apt install``) from another terminal:
    + with httpie_ (``sudo apt install httpie``):

Two things:

1. I think the "from another terminal" addition is helpful. I would EXPECT people to understand that, but they might not notice the prompt below and think they can pass these commands directly into the graph terminal after they run rpc-serve.
2. Following these instructions gave me 400 errors. I'm not sure why it isn't able to read the SWHID correctly:
    $ http :5009/graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323
    HTTP/1.1 400 Bad Request
    Content-Length: 18
    Content-Type: text/plain; charset=utf-8
    Date: Tue, 17 May 2022 18:03:55 GMT
    Server: Python/3.8 aiohttp/3.8.1

    Unknown SWHID: swh
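The identifier in the failing request does look syntactically valid. A quick local check against the SWHID v1 grammar agrees (a sketch: the regex only covers the core `swh:1:<type>:<40 hex>` form, no qualifiers), which suggests the problem is in how the server parses the URL path rather than in the identifier itself:

```shell
# Minimal SWHID v1 syntax check (sketch: core form only, no qualifiers)
check_swhid() {
  echo "$1" | grep -Eq '^swh:1:(cnt|dir|rev|rel|snp|ori):[0-9a-f]{40}$'
}

check_swhid "swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323" && echo "valid"
check_swhid "swh" || echo "invalid"
```

Note that "Unknown SWHID: swh" implies the server saw only the text before the first colon, which smells like a path-splitting issue on ``:``.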
This is on a brand new Ubuntu t3a.small instance on AWS, after doing `apt update; apt upgrade`.
FYI, this is what the other terminal showed as received. It all looks right; I'm not sure why it didn't work.
    ~/2021-03-23-popular-3k-python$ swh graph rpc-serve -g compressed/graph
    INFO:root:using swh-graph JAR: /home/ubuntu/workingDir/share/swh-graph/swh-graph-0.5.2.jar
    Loading graph compressed/graph ...
    Graph loaded.
    ======== Running on http://0.0.0.0:5009 ========
    (Press CTRL+C to quit)
    INFO:aiohttp.access:127.0.0.1 [17/May/2022:18:02:10 +0000] "GET /graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 HTTP/1.1" 400 178 "-" "HTTPie/1.0.3"
    INFO:aiohttp.access:127.0.0.1 [17/May/2022:18:02:27 +0000] "GET /graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 HTTP/1.1" 400 178 "-" "HTTPie/1.0.3"
    INFO:aiohttp.access:127.0.0.1 [17/May/2022:18:02:38 +0000] "GET /graph/leaves/dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 HTTP/1.1" 400 180 "-" "HTTPie/1.0.3"
    INFO:aiohttp.access:127.0.0.1 [17/May/2022:18:03:11 +0000] "GET /graph/visit/nodes/swh:1:rel:0000000000000000000000000000000000000010 HTTP/1.1" 400 178 "-" "HTTPie/1.0.3"
    INFO:aiohttp.access:127.0.0.1 [17/May/2022:18:03:55 +0000] "GET /graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323 HTTP/1.1" 400 178 "-" "HTTPie/1.0.3"
Quickstart diff (old lines marked with -, new with +):

      queried by the ``swh-graph`` library.

    - Own datasets
    - ^^^^^^^^^^^^
    + All the publicly available datasets are documented on this page:
    + https://docs.softwareheritage.org/devel/swh-dataset/graph/dataset.html

    - A graph is described as both its adjacency list and the set of nodes
    - identifiers in plain text format. Such graph example can be found in the
    - ``swh/graph/tests/dataset/`` folder.
    + A good way of retrieving these datasets is to use the `AWS S3 CLI
    + <https://docs.aws.amazon.com/cli/latest/reference/s3/>`_.

    - You can compress the example graph on the command line like this:
    + Here is an example with the dataset ``2021-03-23-popular-3k-python``, which has
    + a relatively reasonable size (~15 GiB including property data, with [...]

HTTP API page diff (new section):

          "avg": 0.6107127825377487
        }
      }

    + Use-case examples
    + -----------------
    +
    + This section showcases how to leverage the endpoints of the HTTP API described [...]

For at least one of the examples below, I'd give a full command with full output, like the httpie example in the quickstart. It only needs to be shown once, but it can help prompt someone who is speeding through.
Also, something somewhere needs to mention the weirdness/problem of the compressed graph wanting the swh:1:ori:HASH identifier. Or does the RPC API convert URLs into proper IDs? What does it spit back out if you traverse to an origin?
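To make the swh:1:ori:HASH weirdness concrete: as far as I understand, the hash in an origin identifier is the SHA1 of the origin URL string, so it can be derived by hand. A sketch (the URL is an arbitrary example, and the SHA1-of-URL rule is my assumption, worth confirming in the docs):

```shell
# Assumption: the "ori" hash is the SHA1 of the raw origin URL string
url="https://github.com/rdicosmo/parmap"
printf 'swh:1:ori:%s\n' "$(printf '%s' "$url" | sha1sum | cut -d' ' -f1)"
```

If that's right, documenting this one-liner would let users go from a URL to the identifier the compressed graph expects.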
I can test it on my system, but I don't know why it is not liking the SWHIDs given in the HTTP request.
- docs/java-api.rst 0 → 100644
docs/java-api.rst (excerpt):

    `ImmutableGraph
    <https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/ImmutableGraph.html>`_,
    the abstract class providing the core API to manipulate and iterate on graphs.
    Under the hood, compressed graphs are stored as the `BVGraph
    <https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/BVGraph.html>`_
    class, which contains the actual codec used to compress and decompress
    adjacency lists.

    Graphs **nodes** are mapped to a contiguous set of integers :math:`[0, n - 1]`
    where *n* is the total number of nodes in the graph.
    Each node has an associated *adjacency list*, i.e., a list of nodes going from
    that source node to a destination node. This list represents the **edges** (or
    **arcs**) of the graph.

    **Note**: edges are always directed. Undirected graphs are internally stored as
    a pair of directed edges (src → dst, dst → src), and are called "symmetric"

- docs/java-api.rst 0 → 100644
docs/java-api.rst (excerpt):

    **Note**: edges are always directed. Undirected graphs are internally stored as
    a pair of directed edges (src → dst, dst → src), and are called "symmetric"
    graphs.

    On disk, a simple BVGraph with the basename ``graph`` would be represented as
    the following set of files:

    - ``graph.graph``: contains the compressed adjacency lists of each node, which
      can be decompressed by the BVGraph codec.
    - ``graph.properties``: contains metadata on the graph, such as the number of
      nodes and arcs, as well as additional loading information needed by the
      BVGraph codec.
    - ``graph.offsets``: a list of offsets of where the adjacency list of each node
      is stored in the main graph file.
    - ``graph.obl``: optionally, an "offset big-list file" which can be used to
      load graphs faster.

I think this section needs to be lower down, or maybe in an advanced details page. It's an implementation detail that is useful to know when someone is down to the nuts and bolts, but not information they need to know or understand when doing more basic or early work. And by the time someone gets that far down, they probably want to know about more of the files than just these.
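One practical use of this file list, even in an advanced page, is a pre-load sanity check. A sketch (the ``compressed/graph`` basename is an assumption borrowed from the quickstart; ``graph.obl`` is left out since the excerpt says it is optional):

```shell
# Check that the BVGraph files described above are present before loading
basename="compressed/graph"
for ext in graph properties offsets; do
  if [ ! -f "$basename.$ext" ]; then
    echo "missing: $basename.$ext"
  fi
done
```

Running it against an incomplete download prints one "missing:" line per absent file, which is quicker to diagnose than a Java stack trace from the loader.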