|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| Possible solution               | Pros                                  | Cons                            | Question                           |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| 1. Capture OOM event -> force a | - separation of concern (code)        |                                 | - is it possible?                  |
| cleanup (somehow)               |                                       |                                 | -> check syslog for killed process |
|                                 |                                       |                                 | -> listen to dedicated event       |
|                                 |                                       |                                 | (per worker)                       |
|                                 |                                       |                                 | -> mem_notify                      |
|                                 |                                       |                                 | https://lwn.net/Articles/267013    |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| 2. Loader starts -> post a      | - separation of concern (code)        | - edge case can go bad,         | how to determine:                  |
| cleanup message to a queue      |                                       | generating quite some messages  | - which worker to clean?           |
| (as soon as it can)             |                                       | in queue until the current      | -> worker-name as task message     |
|                                 |                                       | cleanup takes place             | parameter                          |
|                                 |                                       |                                 | - when to clean?                   |
|                                 |                                       |                                 | -> pid alive check                 |
|                                 |                                       |                                 | (python3-psutil)                   |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| 3. Loader starts -> check for   | - short-term: pragmatic (current      | - Separation of concern not     |                                    |
| dangling files in a specific    | state)                                | respected (doing much more than |                                    |
| location                        | - can be plugged in loader-core       | loading)                        |                                    |
|                                 | (all workers benefit)                 | - must implement this           |                                    |
|                                 | - simpler to implement                | potentially on other layers     |                                    |
|                                 |                                       | (lister, etc.)                  |                                    |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| 4. Worker (node?) supervisor    | - separation of concern               |                                 | 1. How to detect faulty behavior?  |
|                                 | - long-term solution                  |                                 | -> checks per worker type:         |
|                                 |                                       |                                 | - disk status                      |
|                                 |                                       |                                 | - ram usage                        |
|                                 |                                       |                                 | - ~task consuming rate             |
|                                 |                                       |                                 | 2. existing techno or ad-hoc?      |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| 5. Separate nodes for           | - More control over the resource      | - The main issue could still    |                                    |
| specialized workers             | allocation (e.g. more disk for svn    | happen (but probably less       |                                    |
|                                 | and hg workers, less for git ones...) | often though)                   |                                    |
|                                 | - separation of concern (system       | - pushes the problem to the     |                                    |
|                                 | level now)                            | system/provision/deployment     |                                    |
|                                 | - long-term solution                  | level                           |                                    |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
Still assessing the possibilities (and reading documentation):
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| Possible solution               | Pros                                  | Cons                            | Question                           |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| 6. Check before scheduling task | - separation of concern               |                                 | How to check node's state?         |
| (same way scheduling is sent to |                                       |                                 |                                    |
| queue or not depending on size) |                                       |                                 |                                    |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| 7. Chain temporary folder       | - separation of concern (code)        | Same as 2.                      | pre-requisite:                     |
| creation task + loading task    | - can be applied to all loaders, ...  |                                 | force chaining to execute on       |
| (using the temporary folder)    |                                       |                                 | the same worker (chains [2])       |
| + cleanup task (independent     |                                       |                                 |                                    |
| from the loading task result)   |                                       |                                 |                                    |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
- [2] http://docs.celeryproject.org/en/latest/userguide/canvas.html#canvas-chain
Well, after much digging through documentation and some tryouts, I finally gave in to solution 3.
So now the loaders can implement a pre_cleanup method (defined in loader-core, doing nothing by default) to try and clean up after dead siblings of the same type (svn, mercurial).
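Schematically (class names and the load entry point are illustrative, not the actual loader-core API; only the pre_cleanup hook comes from the description above):

```python
class BaseLoader:
    """loader-core side: pre_cleanup is a hook that does nothing."""

    def pre_cleanup(self):
        """Clean up leftovers of dead sibling loaders; default no-op."""

    def load(self):
        # illustrative entry point: the hook runs before actual loading
        self.pre_cleanup()
        # ... actual loading steps would follow here ...


class SvnLoader(BaseLoader):
    """Typed loader overriding the hook (sketch)."""

    def pre_cleanup(self):
        # would scan the temporary root for dangling
        # swh.loader.svn-* folders left behind by dead workers
        ...
```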
The loaders already used temporary folders for their computations (now sandboxed at the systemd level, so no collisions should happen between loaders).
They still use temporary folders, but those now follow a naming pattern: swh.loader.{type}-{unique-noise}-{pid}.
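Such names can be produced with tempfile, which supplies the unique noise. A sketch, assuming the pattern above; the helper loader_tmpdir is hypothetical:

```python
import os
import tempfile

def loader_tmpdir(loader_type: str, root: str) -> str:
    """Create a working directory named
    swh.loader.{type}-{unique-noise}-{pid} under root."""
    return tempfile.mkdtemp(
        prefix=f"swh.loader.{loader_type}-",  # loader type in the prefix
        suffix=f"-{os.getpid()}",             # owning pid in the suffix
        dir=root,
    )
```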
Logic:
On waking up, a task checks for dangling folders in the root temporary location (taken from the configuration).
It matches each folder name against the pattern (according to loader type) and checks whether the embedded pid is still alive:
- If there is no folder, nothing matches, or a name matches but its pid is alive: do nothing and continue with loading.
- If a name matches and its pid no longer exists: clean up that folder, then continue with loading.
This has been implemented for svn and mercurial using a common method installed in the loader-core.