SelectBlobs: Replace Athena with pyspark.sql

Merged vlorentz requested to merge SelectBlobs-pyspark into master

As the dataset grows, the query now hits Athena's 30-minute timeout. Running it with pyspark takes longer (3 h), but it avoids the slow download from S3 and has no timeout.

This adds a dependency on pyspark, a 300 MB package full of JARs, but we might use it for other workloads in the future.
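
For readers unfamiliar with pyspark.sql, here is a minimal sketch of what running a SQL selection locally with Spark instead of Athena can look like. It is not the code from this merge request: the function names, the `content` view, the `sha1` and `length` columns, and the paths are all hypothetical, chosen only for illustration.

```python
def build_select_query(table: str, max_length: int) -> str:
    """Build the SQL text for the selection.

    The column names (sha1, length) are hypothetical placeholders.
    """
    if not table.isidentifier():
        raise ValueError(f"invalid table name: {table!r}")
    return f"SELECT sha1, length FROM {table} WHERE length <= {int(max_length)}"


def select_blobs(parquet_path: str, output_path: str, max_length: int) -> None:
    """Run the selection with Spark and write the result as CSV.

    Both paths are hypothetical; pyspark is imported lazily so that
    importing this module does not require the 300 MB dependency.
    """
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("select-blobs").getOrCreate()
    try:
        # Expose the Parquet dataset to SQL under a temporary view name.
        spark.read.parquet(parquet_path).createOrReplaceTempView("content")
        result = spark.sql(build_select_query("content", max_length))
        # Spark writes a directory of part files, not a single CSV file.
        result.write.mode("overwrite").csv(output_path, header=True)
    finally:
        spark.stop()
```

Unlike an Athena query, a job like this has no built-in timeout, so its running time is bounded only by the resources of the machine or cluster it runs on.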

Edited by vlorentz
