The list of all old Gitorious repositories, as well as their actual content, is now available as the Gitorious valhalla, maintained by ArchiveTeam. We should inject all those repositories into Software Heritage.
It's not a lot of content (~120K Git repositories).
Here is what they say to people interested in mirroring:
Please don't try to mirror the contents of this web server. It's 5 terabytes (after deduplication!) and the storage is slow at the moment. If you'd like to copy the data out, please email first and we can arrange something better for everyone.
The contact email address is: gitorious-%@xrtc.net
Here is the complete list of URLs that can be used to "git clone" (via HTTPS) all the repositories available from the Gitorious valhalla: gitorious-list.txt.gz.
(FWIW I'm not suggesting to start using them, we should first try to contact them and see if there are better options. But this remains a viable plan B.)
We are now all set to start (after having automated it properly…) the transfer of Gitorious stuff to SWH.
Below is the last exchange with the Gitorious valhalla people, including the needed technical details.
On Fri, Mar 04, 2016 at 11:25:10AM +0100, Stefano Zacchiroli wrote:
[snip] how to go forward with the Gitorious transfer to Software Heritage. Here is a summary of the open issues to discuss before proceeding:
is sending a physical drive back and forth (paid by us) an option?
It's far too much hassle, IP transit is much simpler.
failing that, is paying your bandwidth an option? any idea of how much will it be?
The weird thing is I'm not sure ... the datacenter says that they bill us 95th percentile for it but so far as I know we've never gotten a bill for bandwidth. So don't crank it up too much and it should not be a problem?
failing that, we'll do the month-long transfer. In which case we still need to discuss the following:
a) who will do the traffic shaping? we can do it locally on your machine using something as simple as pv. Would that be OK with you?
That sounds good to me.
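For reference, pv's -L flag caps throughput on any pipe. A minimal sketch (the file names and the 10 MB/s figure are assumptions, not agreed values):

```shell
# Rate-limit a copy with pv: -L caps throughput, -q hides the meter.
# Input/output names and the 10 MB/s cap are hypothetical.
pv -q -L 10M /path/to/input > /path/to/output
```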
b) to avoid interruptions that would force restarting from scratch we propose to split the device in 1GB blocks (with "dd seek=...") and transfer each of them separately (e.g., with nc over an SSH tunnel)
Sure, if you'd like. That sounds like a thing which is best automated on your end. :) If you are able to share, I'm interested in seeing what you come up with.
When I transferred this fs to valhalla, I had to restart the transfer only once. I told dd (sending and receiving both) to seek back to the most recent 10 GB boundary, and it worked just fine.
c) for compression we propose to use lz4. Can you install liblz4-tool on your machine?
Done. Please run it under 'nice'.
d) even with the above precaution, transferring a mounted FS with dd is pretty scary (as there are changes that might happen even with read-only mounted FS). Do you have the option of creating some block-level snapshot, e.g., with LVM?
The filesystem is very much not going to change. In case you're worried still:
It is mounted read-only.
It is served by a network block device server that has been configured to not accept writes.
The fs itself is an ext4 image inside another filesystem, image is chmod 0444, and the outer filesystem mounted read-only.
And the LVM logical volume that contains this all is set to read-only.
As for durability, the volume is replicated with RAID1 (mirror) by LVM. No other data is presently stored on that volume group.
If you're ready to start copying, go ahead. You should have permission to read /dev/nbd0 already.
Here is all the information I have about the on-disk Gitorious layout (credit: astrid):
Can you tell me more about the file layout/organization?
(disclaimer: I've never looked into how Gitorious, the software
platform, stores Git repositories) Are the hardlinks just the result
of asynchronous deduplication (e.g., with tools like fdupes) run on
a bunch of bare Git repositories, or is it more complex than that
(e.g., a huge, global Git loose object store)?
It's in fact rather simple, but with some wrinkles.
Each repository is stored as a bare git repository (as is created by
'git clone --bare'), so it can be worked with directly. It is my
understanding that gitorious used to run 'git gc' on a rolling
schedule, but I'm not sure how recently that has been done. I
certainly haven't.
When a user clicks 'clone', a full clone is made with git-clone; they
are on the same filesystem so git automatically uses hardlinks to
avoid copying objects unnecessarily. If the original repository is
named e.g. '/gitorious/mainline.git', and the user who clicks "clone"
is named 'zopa', then the cloned repository is named
'/gitorious/zopas-mainline.git'.
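This hardlinking is git's default behaviour for clones on the same filesystem and can be observed directly. A small sketch with made-up repository names:

```shell
# git hardlinks objects when cloning a repository on the same filesystem.
set -e
cd "$(mktemp -d)"
git init -q mainline
git -C mainline -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m 'initial commit'
git clone -q --bare mainline mainline.git            # the "original" bare repo
git clone -q --bare mainline.git zopas-mainline.git  # a user's clone
# Objects shared between the two repositories show a link count > 1:
find zopas-mainline.git/objects -type f -links +1
```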
Each user has a wiki, which is named as
'/username/username-gitorious-wiki.git'. It seems that wikis were
created for all users regardless of whether they ever used them, so
there are many empty wiki repositories.
Originally, every repository was named with hashed names, such as
'/93f/8ba/205e4107d3822f26332a5c42cbd55f39ce.git'
When they were preparing to send me the data, the gitorious folks
started to rename them to the canonical names as I explained above.
However, because they created one directory for each user, they ran
into the maximum hardlinks that you can make in ext4. So about half
of them got renamed and half of them are still in hashed form. They
gave me a list of all the hashed-name mappings, in
'/home/astrid/mapping.txt.gz'.
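The hashed layout splits the 40-character name 3/3/34. A hypothetical helper (the function name and the exact format of mapping.txt.gz are assumptions) for turning such a name back into its on-disk path:

```shell
# Map a 40-character hashed repository name to its on-disk path,
# using the 3/3/34 split seen in the example above. Helper name is made up.
hashed_path() {
    h=$1
    printf '%s/%s/%s.git\n' \
        "$(printf %s "$h" | cut -c1-3)" \
        "$(printf %s "$h" | cut -c4-6)" \
        "$(printf %s "$h" | cut -c7-40)"
}
hashed_path 93f8ba205e4107d3822f26332a5c42cbd55f39ce
# → 93f/8ba/205e4107d3822f26332a5c42cbd55f39ce.git
```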
Because this was a complete mess, I created a directory of symlinks
outside the image with the canonical names, pointing into it:
I have the image mounted on '/mnt/gitorious'. So to return the data
for 'zzn/zzn.git', the webserver transforms the '/' into a ':' and
serves the request with '/srv/gitorious' as the http root directory,
following symlinks.
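The webserver's name transformation is a plain character substitution. A hypothetical helper replicating it (the function name is made up):

```shell
# Canonical repository name -> symlink name as served under /srv/gitorious:
# every '/' becomes a ':'.
link_name() { printf '%s\n' "$1" | tr / :; }
link_name zzn/zzn.git
# → zzn:zzn.git
```

Each such symlink then points back into the mounted image, e.g. /srv/gitorious/zzn:zzn.git → /mnt/gitorious/zzn/zzn.git.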