v3.4.0 (8791e8e2)

- Add MPHF to the Python extension
- provenance:
- Add support for Hive partitioning on sha1_git column
- Remove remaining references to topological_order_dir
- Replace .csv.zst output with .parquet
- Compress paths with zstd
- Order rows in final files by the column they will be queried on
- Move find_frontiers_from_root_directory to frontier-directories-in-revisions
- Make dependency on 'arrow' optional, even when 'dataset-writer' is used
- docs: Fix reference to SWHIDs
- webgraph: Set RUST_MIN_STACK to avoid stack overflows
- naive client: add max_matching_nodes for neighbors method
- Add timestamp and is_full_visit bit to ori->snp edges' label
- Update webgraph
- Fix task dependencies
- Switch LLP compression step to use the Rust implementation
- Fix sort_batch_size and input_batch_size values being swapped in 'permute' and 'transpose' commands
- transform: Log computed configuration
- Rewrite PopularContentPaths in Rust
- Misc. fixes and code improvements

v3.3.1 (5867279b)

* permute-and-symmetrize: Sort arc lists in parallel
* rust and java: Enforce max_edges *before* traversing edges

v3.3.0 (75acaa6b)

* Rewrite most dataset generation scripts from Java to Rust
* Make most dataset generation scripts produce sharded files instead of a single .csv.zst that cannot be processed in parallel
* Improve ergonomics of Rust library
* Switch some early compression steps to Rust (BV, BFS, {PERMUTE,TRANSPOSE,SIMPLIFY}_BFS)
* Replace Athena with datafusion
* Finish Rust rewrite of the gRPC server (but Java remains the default)

v3.1.0 (159f5343)

Misc:
* Prevent timestamps in node properties from being shifted according to the timezone WriteNodeProperties is being run in.
* Add scripts to generate an index for swh-provenance

HTTP API:
* Raise an error when RemoteGraphClient URL is wrong
* Fix DeprecationWarnings in http_client
* http_rpc_server: Remove duplicate request
* Properly handle empty results in HTTP client

CLI:
* compress: Hide traceback on subprocess exception
* Print the end of the log

Rust:
* java utils: add Mph2Cmph to convert from java to rust version of MPH files
* initial skeleton to ship a rust crate
* Implement the first Rust data structures
* codespell: move excluded word list from toml to precommit conf
* Add example BFS implementation
* added Node2Type, its tests, and a bin to convert the .node2swhid.bin file to the new .node2type.bin

Luigi:
* Avoid silencing exceptions thrown by worker threads
* Document how to run Luigi tasks
* luigi: Add missing output files to ExtractNodes.
* luigi, docs: MAPS step does not depend on LLP
* luigi: Increase MPH_LABELS memory allowance
* Tune MPH maximum memory to avoid OOMs
* Honor JAVA_HOME when locating the default Java binary

PopularContentPaths:
* Avoid concurrent updates to ProgressLogger
* Fix missing 'sha1' column for contents with no path
* Skip loading the forward graph
* Allow zstdcat more memory
* Print corrupt records

Internal:
* Prevent flake8 from finding issues in build/ directory
* Migrate to copier-based swh-py-template
* Move the jar under swh/graph so we can get rid of deprecated "date_files" in setup
* Fix documentation build

v3.0.0 (20a76683)

Breaking changes:
* Use CRLF in output CSV instead of LF
* FindEarliestRevision: switch from TSV to CSV and rename columns
* origin_contributors: Add final task checking integrity of the dataset
* origin_contributors: Change table format/layout to be more compact, improve performance, add contribution years

Minor changes:
* Match swh-dataset's rename of 'object_type' to 'object_types'
* luigi: Make grpc API globally configurable, and remove the default value
* Bump requirements on protobuf

New derived datasets:
* Import the "blobs datasets" (license, citation) generation script as a luigi workflow
* Add scripts to find the most popular name(s)/path of content nodes
* Add script to count the total number of paths to any node
* Add ListEarliestRevisions, which computes the earliest revision of all dir/cnt objects at once

New features:
* Export naive_graph_client and remote_graph_client fixtures in pytest plugin
* Add INITIAL_ORIGIN and FORKED_ORIGIN to example dataset
* Make example dataset available for other modules
* Add export_{started,ended}_at to Stats response
* Make Luigi tasks declare their RAM usage (and auto-tune when possible)
* Add support for compressing the graph with only some node types
* Add step stamp to each step's list of output files
* Add support for making the graph dataset name differ from the export name
* webgraph.py: Display log path in error messages
* Add a script to count paths leading to each node
* TopoSort: Default to DFS instead of BFS
* TopoSort: Add support for running forward
* luigi: Add an option to define the maximum RAM used by graph compression

Documentation:
* Move the doc for the example dataset to its own page
* Include the representation of the example dataset in the documentation
* Add some more style to the example dataset graph
* Remove the figure from the example dataset documentation
* docs/compression: Fix inaccuracies in the dependency graph
* DownloadGraphFromS3: Fix incorrect docstring

Bug fixes:
* getMessage: Fix crash on origins with no URL property
* luigi/misc_datasets: Fix _clean_s3_directory() when directory is empty
* Add a flyweight copy() to SwhGraphProperties to make it threadsafe
* FindEarliestRevision: Fix crash on revisions with no committer timestamp
* compressed_graph: Fix data race to .obl files in Transpose command
* Check in constructor instead of size64()
* NodeIdMap: Fix incorrect implementation of size64()
* TopoSort: Fix discard of the last node while looking for leaves

Performance improvements:
* FindEarliestRevision: Run traversals in parallel
* FindEarliestRevision, TopoSort: Use Apache Commons CSV
* luigi/compressed_graph: Tune -Xmx per task
* TopoSort: Various optimizations

Misc:
* assembly: Remove some transitive dependencies from the final uber jar
* luigi: Rewrite compression pipeline as small Luigi tasks