In order to request that the vault cook a bundle, or to retrieve an already cooked one, the vault has to be accessible through an API, which means we need an internal client/server to plug in.
Currently the cookers store their bundles in an objstorage. The current design of the objstorage requires having the whole object in RAM, and it would take significant changes to be able to "stream" big objects to the objstorage. This is a big problem for cooking requests on big repositories.
I see two solutions for that:
Option 1: objstorage with streaming
This will require a lot of changes (I'm not even sure I've seen everything there is to change, but I think it's a pretty big overhaul, especially for the remote storage).
Advantages:
We can serve the bundle more quickly and efficiently
We don't have to recompute the bundle every time it is requested. (Will that happen a lot? The master branch of most repositories will change a lot anyway, so people can just request a fresh cook.)
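To make Option 1 concrete, here is a minimal sketch of what a chunk-based, streaming-capable objstorage could look like. The method names (`add_stream`, `get_stream`) and the chunk-file layout are assumptions for illustration, not the current swh.objstorage API:

```python
import os
from typing import Iterator

class ChunkedObjStorage:
    """Hypothetical sketch: store each object as a series of chunk
    files, so that neither writing nor reading ever needs the whole
    object in RAM."""

    def __init__(self, root: str):
        self.root = root

    def _chunk_path(self, obj_id: str, i: int) -> str:
        return os.path.join(self.root, "%s.%06d" % (obj_id, i))

    def add_stream(self, obj_id: str, chunks: Iterator[bytes]) -> None:
        """Store an object from an iterator of chunks."""
        for i, chunk in enumerate(chunks):
            with open(self._chunk_path(obj_id, i), "wb") as f:
                f.write(chunk)

    def get_stream(self, obj_id: str) -> Iterator[bytes]:
        """Yield the object back chunk by chunk."""
        i = 0
        while os.path.exists(self._chunk_path(obj_id, i)):
            with open(self._chunk_path(obj_id, i), "rb") as f:
                yield f.read()
            i += 1
```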
Option 2: stream the response directly to the client while it's cooking
I really like this option. I see a ton of advantages:
We don't need to store anything
We don't need any notification system
It's completely transparent for the user, who can see the objects as they are computed
If the user suddenly decides they don't want the result, they drop the TCP connection, the task is stopped, and we don't waste valuable DB accesses computing something that would never be retrieved in the end
Maybe I missed some reasons for wanting retrieval to be efficient for the user? I really like having the user "share the responsibility" for the computing time by keeping a connection open.
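For illustration, here is a rough sketch of Option 2 using Flask's streamed responses; `cook_bundle_chunks` is a hypothetical cooker entry point that yields bundle bytes as they are produced:

```python
from flask import Flask, Response, stream_with_context

app = Flask(__name__)

def cook_bundle_chunks(obj_id):
    """Hypothetical cooker: yield the bundle's bytes as they are cooked."""
    yield b"..."  # a real cooker would yield tarball chunks here

@app.route("/vault/<obj_id>")
def fetch_bundle(obj_id):
    # The response is sent chunk by chunk while the cooker runs; if the
    # client drops the TCP connection, the WSGI server closes the
    # generator and the cooking stops, so no work is wasted on an
    # abandoned request.
    return Response(stream_with_context(cook_bundle_chunks(obj_id)),
                    mimetype="application/octet-stream")
```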
16:46:39 seirl ╡ olasd: i don't remember if you had an opinion on that too
16:49:32 olasd ╡ my opinion is that we should try very hard to avoid doing long-running stuff without checkpoints
16:50:59 ╡ I don't think it's reasonable to expect a connection to stay open for hours
16:53:37 ╡ this disqualifies any client on an unreliable connection, which is maybe half the world?
16:54:14 seirl ╡ okay, i'm not excluding trying to find a way to "resume" the download
16:55:06 ╡ that way we can just store the state of the cookers, which is pretty small
16:55:29 ╡ also, people on unstable connections tend to not want to download 52GB files
16:55:38 olasd ╡ except when they do
16:55:45 nicolas17 ╡ o/
16:56:03 ╡ I don't mind downloading 52GB
16:56:32 olasd ╡ swh doesn't intend to serve {people on stable, fast connections}, it intends to serve people
16:56:41 nicolas17 ╡ but if it's bigger than 100MB and you don't support resuming then I hate you
16:56:44 seirl ╡ okay there's a misunderstanding by what I meant by that
16:56:53 ╡ assuming we DO implement checkpoints
16:57:04 ╡ (and resuming)
16:57:15 ╡ people with unstable connections are usually people with slow download speeds
16:57:47 ╡ so they won't be impacted a lot by the fact that streaming the response while it's being cooked has a lower throughput
16:57:55 olasd ╡ I still don't think streaming is a reasonable default
16:58:09 seirl ╡ okay
16:59:13 olasd ╡ however, I think making the objstorage support chunking is a reasonable goal
16:59:34 ╡ even if it's restricted to the local api for now
16:59:55 seirl ╡ oh, i hadn't thought of chunking the bundles
17:00:02 nicolas17 ╡ if I start downloading and you stream the response, and the connection drops, what happens? will it keep processing and storing the result in the server, or will it abort?
17:00:29 seirl ╡ nicolas17: i was thinking about storing the state of the processing (which is small) somewhere
17:00:34 ╡ in maybe an LRU cache
17:00:48 ╡ if the user reconnects, the state is restored and the processing can continue
17:01:15 nicolas17 ╡ would this be a plain HTTP download from the user's viewpoint?
17:01:21 seirl ╡ yeah
17:01:27 nicolas17 ╡ would the state be restored such that the file being produced is bitwise identical?
17:01:33 seirl ╡ that's the idea
17:01:45 ╡ we can deduce which state to retrieve from the Range: header
17:02:06 nicolas17 ╡ great then
17:03:04 olasd ╡ nicolas17: mind if I paste this conversation to the forge ?
17:03:15 nicolas17 ╡ go ahead
17:03:17 * ╡ olasd is lazy
17:03:23 seirl ╡ that said i perfectly understand that wanting the retrieval to be fast and simple for the users is an important goal, if we're not concerned about the storage and we can easily do chunking that might be a good way to go
17:03:42 nicolas17 ╡ the bitwise-identical thing is important or HTTP-level resuming would cause a corrupted mess :P
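To sketch the checkpoint/resume idea from this conversation: keep small cooker states in an LRU cache keyed by byte offset, and map the client's Range: header back to the closest checkpoint. All the names below are made up for illustration, and this only works if cooking is deterministic, so that the resumed stream is bitwise identical to the interrupted one:

```python
import re
from collections import OrderedDict

class CheckpointCache:
    """Tiny LRU cache of cooker states, keyed by (obj_id, byte_offset).
    Hypothetical sketch, not an existing swh API."""

    def __init__(self, maxsize=128):
        self.entries = OrderedDict()
        self.maxsize = maxsize

    def save(self, obj_id, offset, state):
        self.entries[(obj_id, offset)] = state
        self.entries.move_to_end((obj_id, offset))
        if len(self.entries) > self.maxsize:
            self.entries.popitem(last=False)  # evict least recently used

    def latest_before(self, obj_id, target):
        """Return (offset, state) of the newest checkpoint <= target,
        or (0, None) if cooking has to restart from scratch."""
        candidates = [(off, state)
                      for (oid, off), state in self.entries.items()
                      if oid == obj_id and off <= target]
        return max(candidates, key=lambda c: c[0], default=(0, None))

def parse_range_start(range_header):
    """Extract the first byte offset of a 'Range: bytes=N-' header."""
    m = re.match(r"bytes=(\d+)-", range_header or "")
    return int(m.group(1)) if m else 0
```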
We should also consider that the API server where the public requests bundles and the workers that actually cook them are very likely to be isolated from one another, which would make streaming to clients tricky to implement.
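If that isolation holds, streaming would mean the public API server relaying the worker's output hop by hop. A minimal sketch of such a relay, where the internal cooker URL and endpoint are invented for illustration:

```python
import requests
from flask import Flask, Response

app = Flask(__name__)
COOKER_URL = "http://internal-cooker:5000"  # assumed internal-only address

@app.route("/vault/<obj_id>")
def proxy_bundle(obj_id):
    # The public API server forwards the worker's chunks as they arrive;
    # every in-flight download now ties up a connection on both hops,
    # which is part of what makes this setup tricky.
    upstream = requests.get("%s/cook/%s" % (COOKER_URL, obj_id), stream=True)
    return Response(upstream.iter_content(chunk_size=64 * 1024),
                    mimetype="application/octet-stream")
```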
As discussed on IRC, having the object storage fully streaming is a goal per se, no matter what the Vault needs. //If// the Vault needs it, its priority just gets higher; but the goal remains nonetheless. (Please file this as a separate task, so that we can collect knowledge and TODO items about it in a dedicated space.)
I might be wrong, but it seems to me that an underlying assumption of Option 2 above is that we will not cache cooked objects. That's wrong. The Vault is, conceptually, a cache and should remain so. The reason is that we expect Vault usage to be really "spike-y". Most of the content we archive will //never// be requested, because it will remain available at its original hosting place most of the time. But when something disappears from there, especially if it's some "famous" content, we will have people looking for it in Software Heritage; possibly //many// people at the same time. To cater to those use cases we need to be sure we can do the cooking only once, and serve the result multiple times afterwards at essentially zero cost. Then, of course, the cache policy and how aggressively we delete are totally up for discussion and will need some data points (which we don't have yet) for tuning.
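To illustrate the cook-once, serve-many behaviour under a request spike, here is a toy sketch; the locking scheme and the names are mine, not an existing swh.vault API:

```python
import threading

class VaultCache:
    """Toy model of the Vault as a cache: under a spike of requests for
    the same object, only the first one triggers the expensive cooking;
    the others wait and are then served at essentially zero cost."""

    def __init__(self, cook):
        self.cook = cook            # expensive cooking function
        self.bundles = {}           # obj_id -> cooked bundle
        self.locks = {}             # obj_id -> per-object lock
        self.global_lock = threading.Lock()

    def get(self, obj_id):
        with self.global_lock:
            lock = self.locks.setdefault(obj_id, threading.Lock())
        with lock:                  # concurrent requesters queue up here
            if obj_id not in self.bundles:
                self.bundles[obj_id] = self.cook(obj_id)
            return self.bundles[obj_id]
```

An eviction policy would sit on top of this, tuned with the usage data mentioned above.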