The list of all old Gitorious repositories, as well as their actual content, is now available as the Gitorious valhalla, maintained by ArchiveTeam. We should inject all those repositories into Software Heritage.
It's not a lot of content (~120K Git repositories).
Here is what they say to people interested in mirroring:
Please don't try to mirror the contents of this web server. It's 5 terabytes (after deduplication!) and the storage is slow at the moment. If you'd like to copy the data out, please email first and we can arrange something better for everyone.
The contact email address is: gitorious-%@xrtc.net
Here is the complete list of URLs that can be used to "git clone" (via HTTPS) all the repositories available from the Gitorious valhalla: gitorious-list.txt.gz.
(FWIW I'm not suggesting to start using them, we should first try to contact them and see if there are better options. But this remains a viable plan B.)
We are now all set to start (after having automated it properly…) the transfer of Gitorious stuff to SWH.
Below is the last exchange with the Gitorious valhalla people, including the needed technical details.
On Fri, Mar 04, 2016 at 11:25:10AM +0100, Stefano Zacchiroli wrote:
[snip] how to go forward with the Gitorious transfer to Software Heritage. Here is a summary of the open issues to discuss before proceeding:
is sending a physical drive back and forth (paid by us) an option?
It's far too much hassle, IP transit is much simpler.
failing that, is paying your bandwidth an option? any idea of how much will it be?
The weird thing is I'm not sure ... the datacenter says that they bill us 95th percentile for it but so far as I know we've never gotten a bill for bandwidth. So don't crank it up too much and it should not be a problem?
failing that, we'll do the month-long transfer. In which case we still need to discuss the following:
a) who will do the traffic shaping? we can do it locally on your machine using something as simple as pv. Would that be OK with you?
That sounds good to me.
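For reference, pv's -L flag caps throughput on any pipe. A minimal sketch (the file names and the 10 MB/s figure are assumptions, not agreed values):

```shell
# Rate-limit a copy with pv: -L caps throughput, -q hides the meter.
# Input/output names and the 10 MB/s cap are hypothetical.
pv -q -L 10M /path/to/input > /path/to/output
```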
b) to avoid interruptions that would force restarting from scratch we propose to split the device in 1GB blocks (with "dd seek=...") and transfer each of them separately (e.g., with nc over an SSH tunnel)
Sure, if you'd like. That sounds like a thing which is best automated on your end. :) If you are able to share, I'm interested in seeing what you come up with.
When I transferred this fs to valhalla, I had to restart the transfer only once. I told dd (sending and receiving both) to seek back to the most recent 10 GB boundary, and it worked just fine.
c) for compression we propose to use lz4. Can you install liblz4-tool on your machine?
Done. Please run it under 'nice'.
d) even with the above precaution, transferring a mounted FS with dd is pretty scary (as there are changes that might happen even with read-only mounted FS). Do you have the option of creating some block-level snapshot, e.g., with LVM?
The filesystem is very much not going to change. In case you're worried still:
It is mounted read-only.
It is served by a network block device server that has been configured to not accept writes.
The fs itself is an ext4 image inside another filesystem, image is chmod 0444, and the outer filesystem mounted read-only.
And the LVM logical volume that contains this all is set to read-only.
As for durability, the volume is replicated with RAID1 (mirror) by LVM. No other data is presently stored on that volume group.
If you're ready to start copying, go ahead. You should have permission to read /dev/nbd0 already.
Here is all the information I have about the on-disk Gitorious layout (credit: astrid):
Can you tell me more about the file layout/organization?
(disclaimer: I've never looked into how Gitorious, the software
platform, stores Git repositories) Are the hardlinks just the result
of asynchronous deduplication (e.g., with tools like fdupes) run on
a bunch of bare Git repositories, or is it more complex than that
(e.g., a huge, global Git loose object store)?
It's in fact rather simple, but with some wrinkles.
Each repository is stored as a bare git repository (as is created by
'git clone --bare'), so it can be worked with directly. It is my
understanding that gitorious used to run 'git gc' on a rolling
schedule, but I'm not sure how recently that has been done. I
certainly haven't.
When a user clicks 'clone', a full clone is made with git-clone; they
are on the same filesystem so git automatically uses hardlinks to
avoid copying objects unnecessarily. If the original repository is
named e.g. '/gitorious/mainline.git', and the user who clicks "clone"
is named 'zopa', then the cloned repository is named
'/gitorious/zopas-mainline.git'.
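This hardlinking is git's default behaviour for clones on the same filesystem and can be observed directly. A small sketch with made-up repository names:

```shell
# git hardlinks objects when cloning a repository on the same filesystem.
set -e
cd "$(mktemp -d)"
git init -q mainline
git -C mainline -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m 'initial commit'
git clone -q --bare mainline mainline.git            # the "original" bare repo
git clone -q --bare mainline.git zopas-mainline.git  # a user's clone
# Objects shared between the two repositories show a link count > 1:
find zopas-mainline.git/objects -type f -links +1
```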
Each user has a wiki, which is named as
'/username/username-gitorious-wiki.git'. It seems that wikis were
created for all users regardless of whether they ever used them, so
there are many empty wiki repositories.
Originally, every repository was named with hashed names, such as
'/93f/8ba/205e4107d3822f26332a5c42cbd55f39ce.git'
When they were preparing to send me the data, the gitorious folks
started to rename them to the canonical names as I explained above.
However, because they created one directory for each user, they ran
into the maximum hardlinks that you can make in ext4. So about half
of them got renamed and half of them are still in hashed form. They
gave me a list of all the hashed-name mappings, in
'/home/astrid/mapping.txt.gz'.
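The hashed layout splits the 40-character name 3/3/34. A hypothetical helper (the function name and the exact format of mapping.txt.gz are assumptions) for turning such a name back into its on-disk path:

```shell
# Map a 40-character hashed repository name to its on-disk path,
# using the 3/3/34 split seen in the example above. Helper name is made up.
hashed_path() {
    h=$1
    printf '%s/%s/%s.git\n' \
        "$(printf %s "$h" | cut -c1-3)" \
        "$(printf %s "$h" | cut -c4-6)" \
        "$(printf %s "$h" | cut -c7-40)"
}
hashed_path 93f8ba205e4107d3822f26332a5c42cbd55f39ce
# → 93f/8ba/205e4107d3822f26332a5c42cbd55f39ce.git
```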
Because this was a complete mess, I created a directory of symlinks
outside the image with the canonical names, pointing into it:
I have the image mounted on '/mnt/gitorious'. So to return the data
for 'zzn/zzn.git', the webserver transforms the '/' into a ':' and
serves the request with '/srv/gitorious' as the http root directory,
following symlinks.
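The webserver's name transformation is a plain character substitution. A hypothetical helper replicating it (the function name is made up):

```shell
# Canonical repository name -> symlink name as served under /srv/gitorious:
# every '/' becomes a ':'.
link_name() { printf '%s\n' "$1" | tr / :; }
link_name zzn/zzn.git
# → zzn:zzn.git
```

Each such symlink then points back into the mounted image, e.g. /srv/gitorious/zzn:zzn.git → /mnt/gitorious/zzn/zzn.git.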