We still have a number of repositories failing to load because their packfile is too big.
I've extracted the raw list [1] (the extraction has been running since this morning). I'm going
to massage that list using the tool crafted by Roberto [2], then push the resulting list into
the large repositories queue.
```bash
#!/usr/bin/env bash

set -ex

[ -z "${GH_TOKEN}" ] \
    && echo "Missing env variable GH_TOKEN set with a gh bearer token." \
    && exit 1

[ -z "${SWH_TOKEN}" ] \
    && echo "Missing env variable SWH_TOKEN set with a swh bearer token." \
    && exit 1

tmpdir="swh-check-repositories-$$"
tmppath="/tmp/${tmpdir}"
mkdir -p $tmppath
# once debug is done, uncomment the following to remove the temporary working directory
# trap 'rm -rf "$tmppath"' EXIT

INPUT=${1-fulldata.txt}
OUTPUT=${2-priority.list.github}

FULLDATA_LOG=$tmppath/fulldata.log
FORKED_LOG=$tmppath/forked.log
FORKED_DATA=$tmppath/forked.data
LIST_WORKING_DATA=$tmppath/fulldata.data
LIST_NON_FORKS=$tmppath/nonfork.list
LIST_FORKS=$tmppath/forked.list
LIST_FORKS_TO_ARCHIVE=$tmppath/forked.toarchive
LIST_PRIORITY=$tmppath/priority.list

# Query GitHub and the archive for every repository in the input list
python3 get-repos-info.py -t "$GH_TOKEN" -a "$SWH_TOKEN" \
    $INPUT > $LIST_WORKING_DATA 2> $FULLDATA_LOG

# Extract the non-fork repositories that are still in GitHub, and sort them by number of stars
grep -v ISFORK $LIST_WORKING_DATA | sed 's/;.*;/;/' | sort -t \; -k 2 -n -r \
    | grep -v NOTINGITHUB > $LIST_NON_FORKS

# Extract the original (parent) repository from fork projects
grep ISFORK $LIST_WORKING_DATA | sed 's/.*ISFORK;//' | sed 's/[^;]*;//' \
    | sed 's/;.*//' | sort -u > $LIST_FORKS

# Process the forked list, extract repos still to be archived, sort them by number of
# stars
python3 get-repos-info.py -t "$GH_TOKEN" -a "$SWH_TOKEN" \
    $LIST_FORKS > $FORKED_DATA 2> $FORKED_LOG
# Keep only the fork parents still to be archived and capture them for the merge below
egrep -v "NOTING|UPTODATE|NOWPRIVATE|TOUPDATE" $FORKED_DATA | sed 's/;.*;/;/' | \
    sort -t \; -k 2 -n -r > $LIST_FORKS_TO_ARCHIVE

# Merge with the nonfork.list
cat $LIST_FORKS_TO_ARCHIVE $LIST_NON_FORKS | sort -u > $LIST_PRIORITY

# Remove number of stars
sed 's/;.*//' $LIST_PRIORITY > $OUTPUT
```
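For reference, a minimal invocation sketch; the filename check-repositories.sh is an assumption
(the script is not named above), and get-repos-info.py is expected in the working directory:

```bash
# Hypothetical usage; check-repositories.sh is an assumed name for the script above.
# Both bearer tokens must be set in the environment before running it.
export GH_TOKEN="<github bearer token>"
export SWH_TOKEN="<swh bearer token>"
# Arguments default to fulldata.txt and priority.list.github when omitted.
./check-repositories.sh fulldata.txt priority.list.github
```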
@rdicosmo just so you know ^ is it worth a PR to have the script in your upstream repository?