|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| Possible solution               | Pros                                  | Cons                            | Question                           |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| 1. Capture OOM event -> force a | - separation of concern (code)        |                                 | - is it possible?                  |
| cleanup (somehow)               |                                       |                                 | -> check syslog for killed process |
|                                 |                                       |                                 | -> listen to dedicated event       |
|                                 |                                       |                                 | (per worker)                       |
|                                 |                                       |                                 | -> mem_notify                      |
|                                 |                                       |                                 | https://lwn.net/Articles/267013    |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| 2. Loader starts -> post a      | - separation of concern (code)        | - edge case can go bad,         | how to determine:                  |
| cleanup message to a queue      |                                       | generating quite some messages  | - which worker to clean?           |
| (as soon as it can)             |                                       | in queue until the current      | -> worker-name as task message     |
|                                 |                                       | cleanup takes place             | parameter                          |
|                                 |                                       |                                 | - when to clean?                   |
|                                 |                                       |                                 | -> pid alive check                 |
|                                 |                                       |                                 | (python3-psutil)                   |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| 3. Loader starts -> check for   | - short-term: pragmatic (current      | - Separation of concern not     |                                    |
| dangling files in a specific    | state)                                | respected (doing much more than |                                    |
| location                        | - can be plugged in loader-core       | loading)                        |                                    |
|                                 | (all workers benefit)                 | - must implement this           |                                    |
|                                 | - simpler to implement                | potentially on other layers     |                                    |
|                                 |                                       | (lister, etc.)                  |                                    |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| 4. Worker (node?) supervisor    | - separation of concern               |                                 | 1. How to detect faulty behavior?  |
|                                 | - long-term solution                  |                                 | -> checks per worker type:         |
|                                 |                                       |                                 | - disk status                      |
|                                 |                                       |                                 | - ram usage                        |
|                                 |                                       |                                 | - ~task consuming rate             |
|                                 |                                       |                                 | 2. existing techno or ad-hoc?      |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| 5. Separate nodes for           | - More control over the resource      | - The main issue could still    |                                    |
| specialized workers             | allocation (e.g. more disk for svn    | happen (but probably less       |                                    |
|                                 | and hg workers, less for git ones...) | often though)                   |                                    |
|                                 | - separation of concern (system       | - pushes the problem to the     |                                    |
|                                 | level now)                            | system/provision/deployment     |                                    |
|                                 | - long-term solution                  | level                           |                                    |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
Still assessing the possibilities (and reading documentation):
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| Possible solution               | Pros                                  | Cons                            | Question                           |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| 6. Check before scheduling task | - separation of concern               |                                 | How to check node's state?         |
| (same way scheduling is sent to |                                       |                                 |                                    |
| queue or not depending on size) |                                       |                                 |                                    |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
| 7. Chain temporary folder       | - separation of concern (code)        | Same as 2.                      | pre-requisite:                     |
| creation task + loading task    | - can be applied to all loaders, ...  |                                 | force chaining to execute on       |
| (using the temporary folder)    |                                       |                                 | the same worker (chains [2])       |
| + cleanup task (independent     |                                       |                                 |                                    |
| from the loading task result)   |                                       |                                 |                                    |
|---------------------------------+---------------------------------------+---------------------------------+------------------------------------|
- [2] http://docs.celeryproject.org/en/latest/userguide/canvas.html#canvas-chain
Well, after much digging through documentation and some tryouts, I finally gave in to solution 3.
So now the loaders can implement a pre_cleanup method (defined in loader-core, doing nothing by default) to try and clean up after dead siblings of the same type (svn, mercurial).
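Schematically (class names and the load entry point are illustrative, not the actual loader-core API; only the pre_cleanup hook comes from the description above):

```python
class BaseLoader:
    """loader-core side: pre_cleanup is a hook that does nothing."""

    def pre_cleanup(self):
        """Clean up leftovers of dead sibling loaders; default no-op."""

    def load(self):
        # illustrative entry point: the hook runs before actual loading
        self.pre_cleanup()
        # ... actual loading steps would follow here ...


class SvnLoader(BaseLoader):
    """Typed loader overriding the hook (sketch)."""

    def pre_cleanup(self):
        # would scan the temporary root for dangling
        # swh.loader.svn-* folders left behind by dead workers
        ...
```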
The loaders already used temporary folders for their computations (now sandboxed at the systemd level, so no collisions should happen between loaders).
They still use temporary folders, but those now follow a naming pattern: swh.loader.{type}-{unique-noise}-{pid}.
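Such names can be produced with tempfile, which supplies the unique noise. A sketch, assuming the pattern above; the helper loader_tmpdir is hypothetical:

```python
import os
import tempfile

def loader_tmpdir(loader_type: str, root: str) -> str:
    """Create a working directory named
    swh.loader.{type}-{unique-noise}-{pid} under root."""
    return tempfile.mkdtemp(
        prefix=f"swh.loader.{loader_type}-",  # loader type in the prefix
        suffix=f"-{os.getpid()}",             # owning pid in the suffix
        dir=root,
    )
```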
Logic:
On waking up, a task checks for dangling folders in the root temporary location (taken from the configuration).
It matches each folder name against the pattern (according to loader type) and checks whether the embedded pid is still alive:
- If there is no folder, nothing matches, or a name matches but its pid is alive: do nothing and continue with loading.
- If a name matches and its pid no longer exists: clean up that folder, then continue with loading.
This has been implemented for svn and mercurial using a common method installed in the loader-core.