Parallelize merging in par_sort_arcs
This reduces the EDGE_LABELS run time from 2.5 days to 1 day, and probably reduces TRANSPOSE_EDGE_LABELS from 3.5 days to 1-1.5 days.
This replaces !566 (closed), which took a different approach: it parallelized consuming, merging, and writing, whereas this MR only parallelizes merging, which is the real bottleneck.
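
For readers unfamiliar with the idea, here is a rough sketch of what parallelizing only the merge step can look like. It is not the code in this MR: the `ArcPair` type, the `merge_sorted` and `par_merge` helpers, and the choice to partition by source-node range are all assumptions made for illustration. Each thread does a k-way merge of the part of every sorted run that falls in its own node range, so the per-partition outputs are already globally ordered when concatenated.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::thread;

/// Hypothetical stand-in for an arc: (source node, target node).
type ArcPair = (u64, u64);

/// Sequential k-way merge of already-sorted slices, using a min-heap.
fn merge_sorted(runs: &[&[ArcPair]]) -> Vec<ArcPair> {
    let mut heap = BinaryHeap::new();
    // Seed the heap with the first arc of each non-empty run.
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&arc) = run.first() {
            heap.push(Reverse((arc, run_idx, 0usize)));
        }
    }
    let mut merged = Vec::with_capacity(runs.iter().map(|r| r.len()).sum());
    // Pop the smallest arc, then push its successor from the same run.
    while let Some(Reverse((arc, run_idx, pos))) = heap.pop() {
        merged.push(arc);
        if let Some(&next) = runs[run_idx].get(pos + 1) {
            heap.push(Reverse((next, run_idx, pos + 1)));
        }
    }
    merged
}

/// Merges sorted runs in parallel, one thread per contiguous range of
/// source nodes. Assumes every run is sorted and every source is below
/// `num_nodes`. Returns one merged vector per partition, in node order.
fn par_merge(runs: &[Vec<ArcPair>], num_nodes: u64, num_threads: u64) -> Vec<Vec<ArcPair>> {
    assert!(num_threads > 0, "need at least one thread");
    // Ceiling division, so that the partitions cover 0..num_nodes.
    let step = ((num_nodes + num_threads - 1) / num_threads).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = (0..num_threads)
            .map(|i| {
                let (lo, hi) = (i * step, (i + 1) * step);
                s.spawn(move || {
                    // Binary-search each run for the sub-slice in [lo, hi).
                    let slices: Vec<&[ArcPair]> = runs
                        .iter()
                        .map(|run| {
                            let start = run.partition_point(|&(src, _)| src < lo);
                            let end = run.partition_point(|&(src, _)| src < hi);
                            &run[start..end]
                        })
                        .collect();
                    merge_sorted(&slices)
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .collect::<Vec<_>>()
    })
}
```

In this sketch, reading the runs and writing the output stay outside the parallel section. Only the merge is spread across threads, which matches the observation above that merging is the bottleneck.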