-
v6.2.0 (3216ab1b)

Possibly breaking changes:
* Make Stats::avg_locality optional (gRPC protocol)
* Remove support for Fmph (library)

New features (Rust library):
* Relax bound on load_properties closure

Fixes (gRPC server):
* Make Stats::avg_locality optional
* Generate graph.stats as part of graph compression

New features (gRPC server):
* Add StatsD metric traversal_returned_nodes_total

Fixes (misc):
* origin-contributors: Fix crash on origins with unknown URL
* Update scancode
* compressed_graph: Clean up new temporary files before upload

New features (misc):
* Add swh-graph-convert to serialize small graphs to JSON
* provenance: Make executables writing Parquet files reusable as a library
* Add aggregated content dataset
* Add 'generations' index as a compact and faster alternative to the topological order
* Automate subdataset creation

Improvements:
* Update DeanonymizeOriginContributors to the file layout used by >= 2024-08 graphs

Documentation:
* Add examples to GraphBuilder documentation
* Add missing descriptions of top-level modules
* Document other services implemented by the gRPC server
-
v6.1.0 (14cf465a)

New features (gRPC server):
* Add Sentry integration
* Add support for the gRPC Health Checking Protocol
* Export StatsD metrics
* Log time spent streaming/traversing

Other new features:
* blobs_dataset: Add a node-filter that uses a known list of SWHIDs

CLI fixes:
* `swh graph download`:
  * Fix download of export.json
  * Stream zstd files to zstdmt instead of writing a temporary file
* `swh graph grpc-serve`: Allow graph path to be optional

Other fixes:
* grpc-server: Log requests resulting in gRPC errors

Tweaks:
* provenance: Increase row group size
* DownloadBlobs: Default to S3 instead of archive.softwareheritage.org
* Update Tonic

Documentation:
* Document that the swh-graph crate's executables are needed to reindex
-
v6.0.0 (94fbe8f8)

Breaking changes:
* Move gRPC server to its own crate (swh-graph-grpc-server)
* Move dataset-writer to its own crate (dataset-writer)
* Switch from stderrlog to env_logger (`-vv` on executables is now the default, and log level is tuned with `RUST_LOG=debug`)
* Switch MPH algorithms for new compressed graphs from GOV to PTHash
* Move the FCL from java_compat/fcl.rs to front_coded_list/read.rs
* Removed all Java code

New features:
* Add support for reading graphs with PTHash instead of GOV
* Switch from stderrlog to env_logger
* Add load_full() shorthand for loading a bidirectional labeled graph with all properties
* Replace load_{uni,bi}directional with Swh{Uni,Bi}directionalGraph::new
* Add SwhFullGraph trait "alias" to simplify types
* Prioritize target/ over global installs when running Rust executables
* stdlib: add find_head_revision()
* Log gRPC requests

Soundness fixes (Rust):
* NodeBuilder: Remove empty data struct when another node type's data is requested
* Replace Write::write() with Write::write_all()

Soundness fixes (Python):
* Add missing "node." prefix to FieldMasks passed to FindPath{To,Between}

Performance improvements:
* Parallelize merging in par_sort_arcs
* EDGE_LABELS: Remove unnecessary dependencies
* PERMUTE_LLP: The base graph does not need to be loaded in memory
* Replace GNU sort with custom sorters
* Switch swh-graph-extract to jemalloc
* Don't materialize vec of values while building MPHs
* Update default gammas

Documentation:
* Update compression documentation
* Improve error when .labeloffsets file is missing
* Make quickstart and gRPC documentation easier to understand
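
Note on the `Write::write()` to `Write::write_all()` soundness fix: in Rust's standard library, `write()` may accept only part of the buffer and reports how many bytes it took, while `write_all()` loops until the whole buffer is written. The sketch below illustrates the difference with a hypothetical `ShortWriter` that accepts at most 4 bytes per call; it is not code from swh-graph, and the payload is an arbitrary placeholder.

```rust
use std::io::{self, Write};

/// Hypothetical writer that accepts at most `chunk` bytes per `write()` call,
/// mimicking a pipe or socket that performs short writes.
struct ShortWriter {
    chunk: usize,
    data: Vec<u8>,
}

impl Write for ShortWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = buf.len().min(self.chunk);
        self.data.extend_from_slice(&buf[..n]);
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Calls `write()` once and ignores the returned byte count (the bug pattern).
fn stored_with_write(payload: &[u8]) -> Vec<u8> {
    let mut w = ShortWriter { chunk: 4, data: Vec::new() };
    let _written = w.write(payload).unwrap();
    w.data
}

/// Calls `write_all()`, which retries until every byte has been accepted.
fn stored_with_write_all(payload: &[u8]) -> Vec<u8> {
    let mut w = ShortWriter { chunk: 4, data: Vec::new() };
    w.write_all(payload).unwrap();
    w.data
}

fn main() {
    let payload = b"swh:1:rev:0000";
    println!("write():     {} of {} bytes stored", stored_with_write(payload).len(), payload.len());
    println!("write_all(): {} of {} bytes stored", stored_with_write_all(payload).len(), payload.len());
}
```

With a single `write()` call the tail of the payload is silently dropped, which is why ignoring its return value can corrupt output files.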
-
v5.1.0 (4102a5ea)

No changes to the Python code in this release, only Rust and Java.

Additions (Rust):
* stdlib: add fs_ls_tree, implementing a recursive ls of a FS tree
* Switch NODE_PROPERTIES to Rust implementation
* Switch EXTRACT_PERSONS to Rust implementation
* Add 'compare-graphs' tool

Soundness fixes (Rust):
* Check CSV inputs are not a single non-header line, when expecting a header
* Switch default port of the Rust gRPC server to 50091

Compilation fixes (Rust):
* Lower MSRV to 1.79

Documentation:
* rust doc: Move Crash Course to its own page + add Tutorial
* Update gRPC documentation to run the Rust implementation
* Fix documentation of DirEntry::permission
* Fix docstring to mention inputs are CSV
* Add diagnostic hint for when properties are not loaded

Internal changes:
* stdlib: port find_latest_snp() to typed labeled successor iterator
* Remove unused Java utils and SWH-specific Java compression code
* Make DefaultUnderlyingGraph and SwhLabeling newtypes
* Fix warnings
* Use ar_row's own decimal -> timestamp decoding
* pytest.ini: Ignore Rust's target/
-
v5.0.0 (73120521)

No changes to the Python code in this release, only Rust. This is the first release of swh-graph on crates.io.

Breaking changes (Rust):
* Rename "labelled" -> "labeled", for consistency with webgraph
* Rename SWHType to NodeType
* Moved find_root_dir to the "stdlib"
* Make labeled_{predecessor,successors} return typed labels
* Rename all Rust "::new_with" constructors to just "::with"
* Rust: implement FromStr (rather than TryInto<&str>) for NodeType

Additions (Rust):
* Added a "stdlib" of algorithms (find_latest_snp, path resolution functions, generic node visit operator)
* GraphBuilder: Add support for Visit (and Branch) labels
* Add support for multi-arcs in VecGraph and GraphBuilder
* Add `.flatten_labels()` on labeled arc iterator
* Rewrite EDGE_LABELS compression step in Rust
* Subgraph: add builder based on NodeConstraint
* Add tool to dump edges from a compressed graph

Improvements (Rust):
* Switch to released versions of webgraph and ar_row
* Various performance improvements in properties compression
* GraphBuilder: add BuiltGraph type alias for done() return type
* Rewrite CountPaths to produce sharded Parquet files directly
* Small improvement of the rust executable dir handling in check_config()

Soundness fixes (Rust):
* Fixed all bugs in NODE_PROPERTIES compression (or at least, have a strict subset of the Java implementation's bugs). In particular:
* properties: Update arrays atomically and remove LongArrayBitVector
* Remove redundant 'datasets' path component
* Rust: use create_new() to create on-disk maps
* Subgraph: fix has_arc(), which was transposing the passed arc

Crash/compilation fixes (Rust):
* Fix detection of missing node2type.bin/content.is_skipped.bits file
* Fix loading node2type.bin larger than 2^31 * 8 bytes
* Fix debian requirements
* blobs_dataset: Make Datafusion write to a single file
* root_directory.rs: do not require (unneeded) SwhBackwardGraph trait

Documentation:
* Subgraph: document that num_nodes/arcs() return non-filtered values
* Add a 'minimal build for tests' section in rust/README.md
* review *.rs file headers: add missing Copyright decl
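
Note on "use create_new() to create on-disk maps": in Rust's standard library, `OpenOptions::create_new(true)` atomically creates a file and fails with `ErrorKind::AlreadyExists` if it is already there, so an existing file is never silently truncated or reused. The sketch below only illustrates that standard-library behavior; the helper names, the temporary file name, and the reading that this was the motivation for the fix are assumptions, not taken from the swh-graph code.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, ErrorKind, Write};
use std::path::Path;

/// Hypothetical helper: creates `path` and writes `contents`,
/// failing if the file already exists.
fn write_new_file(path: &Path, contents: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .write(true)
        .create_new(true) // create only if absent, checked atomically
        .open(path)?;
    file.write_all(contents)
}

/// Writes the same path twice; returns whether the second attempt was
/// refused, and the contents left on disk.
fn demo() -> io::Result<(bool, Vec<u8>)> {
    // Placeholder file name in the system temporary directory.
    let path = std::env::temp_dir()
        .join(format!("create_new_demo_{}.bin", std::process::id()));
    let _ = fs::remove_file(&path); // start from a clean state

    write_new_file(&path, b"first")?;
    let second = write_new_file(&path, b"second");
    let refused = matches!(&second, Err(e) if e.kind() == ErrorKind::AlreadyExists);

    let contents = fs::read(&path)?;
    fs::remove_file(&path)?;
    Ok((refused, contents))
}

fn main() -> io::Result<()> {
    let (refused, contents) = demo()?;
    println!("second create refused: {refused}");
    println!("file contents: {}", String::from_utf8_lossy(&contents));
    Ok(())
}
```

The second call fails instead of overwriting, so the data from the first writer stays intact.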
-
v4.0.0 (5c51b6a5)

Breaking:
* Switch default gRPC server from Java to Rust
* Remove sorted list of nodes from graph output

Bug fixes:
* Fix pyo3 extension build on recent Maturin versions
* compute-directory-frontier: Use correct timestamp to decide if node is frontier

Ergonomics:
* Add CLI to regenerate graph files for the current version
* Add CLI to download the graph (and decompress .zst files)

Performance optimizations:
* Start BFS from (sorted) origins instead of random nodes
* http_rpc_server.VisitEdgesView: Do not fetch edge labels
* provenance: Switch from glibc malloc to jemalloc
* contents-in-directories: Do not traverse nodes not reachable from a frontier directory
* Make dependency on swh-storage optional

Rust rewrite:
* Look for rust swh-graph-grpc-serve executable in user's PATH and 'rust_executable_dir' config
* EXTRACT_NODES, MAPS, COMPOSE_ORDERS, TRANSPOSE: Switch to Rust implementation
* Rewrite ListOriginContributors in Rust
* Rewrite ListFilesByName in Rust
* Rewrite MPHTranslate in Rust
* Add support for computing node ids from SWHIDs
* Update list of temporary files to clean after compression is done
* java: Add support for reading is_skipped.bits and node2type.bin (produced by the Rust compression pipeline)

Other improvements:
* CountPaths: Add support for 2024-05-16 graph
* provenance: Add support for '--node-filter all'
* Document how to get the URL of an origin node
* Document dependency on protoc
* grpc-server: Add --masked-nodes option
* pytest_plugin: Move server config to its own fixture
* model: adapt to the renaming of model.TargetType to model.SnapshotTargetType
-
v3.4.0 (8791e8e2)

- Add MPHF to the Python extension
- provenance:
- Add support for Hive partitioning on sha1_git column
- Remove remaining references to topological_order_dir
- Replace .csv.zst output with .parquet
- Compress paths with zstd
- Order rows in final files by the column they will be queried on
- Move find_frontiers_from_root_directory to frontier-directories-in-revisions
- Make dependency on 'arrow' optional, even when 'dataset-writer' is used
- docs: Fix reference to SWHIDs
- webgraph: Set RUST_MIN_STACK to avoid stack overflows
- naive client: add max_matching_nodes for neighbors method
- Add timestamp and is_full_visit bit to ori->snp edges' label
- Update webgraph
- Fix task dependencies
- Switch LLP compression step to use the Rust implementation
- Fix sort_batch_size and input_batch_size values being swapped in 'permute' and 'transpose' commands
- transform: Log computed configuration
- Rewrite PopularContentPaths in Rust
- Misc. fixes and code improvements
-
v3.3.0 (75acaa6b)

* Rewrite most dataset generation scripts from Java to Rust
* Make most dataset generation scripts produce sharded files instead of a single .csv.zst that cannot be processed in parallel
* Improve ergonomics of Rust library
* Switch some early compression steps to Rust (BV, BFS, {PERMUTE,TRANSPOSE,SIMPLIFY}_BFS)
* Replace Athena with datafusion
* Finish Rust rewrite of the gRPC server (but Java remains the default)
-
v3.1.0 (159f5343)

Misc:
* Prevent timestamps in node properties from being shifted according to the timezone WriteNodeProperties is being run in.
* Add scripts to generate an index for swh-provenance

HTTP API:
* Raise an error when RemoteGraphClient URL is wrong
* Fix DeprecationWarnings in http_client
* http_rpc_server: Remove duplicate request
* Properly handle empty results in HTTP client

CLI:
* compress: Hide traceback on subprocess exception
* Print the end of the log

Rust:
* java utils: add Mph2Cmph to convert from java to rust version of MPH files
* initial skeleton to ship a rust crate
* Implement the first Rust data structures
* codespell: move excluded word list from toml to precommit conf
* Add example BFS implementation
* added Node2Type, its tests, and a bin to convert the .node2swhid.bin file to the new .node2type.bin

Luigi:
* Avoid silencing exceptions thrown by worker threads
* Document how to run Luigi tasks
* luigi: Add missing output files to ExtractNodes.
* luigi, docs: MAPS step does not depend on LLP
* luigi: Increase MPH_LABELS memory allowance
* Tune MPH maximum memory to avoid OOMs
* Honor JAVA_HOME when locating the default Java binary

PopularContentPaths:
* Avoid concurrent updates to ProgressLogger
* Fix missing 'sha1' column for contents with no path
* Skip loading the forward graph
* Allow zstdcat more memory
* Print corrupt records

Internal:
* Prevent flake8 from finding issues in build/ directory
* Migrate to copier-based swh-py-template
* Move the jar under swh/graph so we can get rid of deprecated "date_files" in setup
* Fix documentation build