v3.0.0

Breaking changes:

* Use CRLF in output CSV instead of LF
* FindEarliestRevision: switch from TSV to CSV and rename columns
* origin_contributors: Add final task checking integrity of the dataset
* origin_contributors: Change table format/layout to be more compact,
  improve performance, add contribution years
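
Downstream consumers that split the CSV output on LF by hand may now see a stray carriage return at the end of each field. A minimal sketch of a reader that tolerates both line endings, using only Python's standard `csv` module; the column names and sample rows are hypothetical and do not reflect the actual output schema:

```python
import csv
import io

# Hypothetical sample imitating CSV output with CRLF line endings.
# The column names ("id", "value") are made up for illustration.
SAMPLE = "id,value\r\nnode-a,1\r\nnode-b,2\r\n"


def read_rows(stream):
    """Parse CSV rows; the csv module accepts both LF and CRLF terminators."""
    return list(csv.DictReader(stream))


# Naive splitting on "\n" leaves a trailing "\r" on each line.
naive_first_line = SAMPLE.split("\n")[0]

# The csv module strips the line terminator, whichever one is used.
rows = read_rows(io.StringIO(SAMPLE))

# When reading from a real file, open it with newline="" as the csv
# documentation recommends, e.g. open(path, newline="").
```

Scripts that relied on `line.split("\n")` or `rstrip("\n")` are the ones most likely to need an update.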

Minor changes:

* Match swh-dataset's rename of 'object_type' to 'object_types'
* luigi: Make grpc API globally configurable, and remove the default value
* Bump requirements on protobuf

New derived datasets:

* Import the "blobs datasets" (license, citation) generation script as a
  luigi workflow
* Add scripts to find the most popular name(s)/path of content nodes
* Add script to count the total number of paths to any node
* Add ListEarliestRevisions, which computes the earliest revision of all
  dir/cnt objects at once

New features:

* Export naive_graph_client and remote_graph_client fixtures in pytest plugin
* Add INITIAL_ORIGIN and FORKED_ORIGIN to example dataset
* Make example dataset available for other modules
* Add export_{started,ended}_at to Stats response
* Make Luigi tasks declare their RAM usage (and auto-tune when possible)
* Add support for compressing the graph with only some node types
* Add step stamp to each step's list of output files
* Add support for making the graph dataset name differ from the export name
* webgraph.py: Display log path in error messages
* Add a script to count paths leading to each node
* TopoSort: Default to DFS instead of BFS
* TopoSort: Add support for running forward
* luigi: Add an option to define the maximum RAM used by graph compression

Documentation:

* Move the doc for the example dataset to its own page
* Include the representation of the example dataset in the documentation
* Add some more style to the example dataset graph
* Remove the figure from the example dataset documentation
* docs/compression: Fix inaccuracies in the dependency graph
* DownloadGraphFromS3: Fix incorrect docstring

Bug fixes:

* getMessage: Fix crash on origins with no URL property
* luigi/misc_datasets: Fix _clean_s3_directory() when directory is empty
* Add a flyweight copy() to SwhGraphProperties to make it threadsafe
* FindEarliestRevision: Fix crash on revisions with no committer timestamp
* compressed_graph: Fix data race to .obl files in Transpose command
* Check in constructor instead of size64()
* NodeIdMap: Fix incorrect implementation of size64()
* TopoSort: Fix discard of the last node while looking for leaves

Performance improvements:

* FindEarliestRevision: Run traversals in parallel
* FindEarliestRevision, TopoSort: Use Apache Commons CSV
* luigi/compressed_graph: Tune -Xmx per task
* TopoSort: Various optimizations

Misc:

* assembly: Remove some transitive dependencies from the final uber jar
* luigi: Rewrite compression pipeline as small Luigi tasks