pattern: Improve handling of max_origins_per_page parameter

Instead of fully consuming the get_origins_from_page generator into a list and then truncating it, consume the generator one origin at a time and abort once the maximum number of origins per page is reached.

Indeed, some non-trivial listers, such as the cgit one, can perform costly processing (an HTTP request, for instance) for each origin in a page. It is therefore better not to consume the full generator at once, so that this cost is not paid for origins that would be discarded anyway.
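The difference between the two approaches can be sketched as follows. This is a minimal, self-contained illustration and not the actual swh.lister code: the `fetched` list, the `eager_truncate` and `lazy_limit` helpers, and the simplified `get_origins_from_page` signature are all assumptions made for the example, with an append to `fetched` standing in for the per-origin HTTP request.

```python
from typing import Iterator, List, Optional

# Hypothetical bookkeeping: records each origin for which the (simulated)
# costly per-origin request was actually made.
fetched: List[str] = []


def get_origins_from_page(page: List[str]) -> Iterator[str]:
    """Simplified stand-in for a lister method doing costly work per origin."""
    for url in page:
        fetched.append(url)  # simulates one HTTP request for this origin
        yield url


def eager_truncate(page: List[str], max_origins_per_page: Optional[int]) -> List[str]:
    # Previous approach: exhaust the generator, then truncate the list.
    # Every origin in the page pays the per-origin cost.
    return list(get_origins_from_page(page))[:max_origins_per_page]


def lazy_limit(page: List[str], max_origins_per_page: Optional[int]) -> List[str]:
    # New approach: pull one origin at a time and stop at the limit,
    # so the generator never processes the remaining origins.
    origins: List[str] = []
    for origin in get_origins_from_page(page):
        origins.append(origin)
        if max_origins_per_page and len(origins) == max_origins_per_page:
            break
    return origins
```

With a page of 20 origins and a limit of 5, both helpers return the same 5 origins, but `eager_truncate` triggers 20 simulated requests whereas `lazy_limit` triggers only 5. `itertools.islice` from the standard library would be another way to express the same lazy cut-off, though an explicit loop makes it easy to log a message when the limit is reached.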

Before that change:

(swh) ✔ ~/swh/swh-environment/docker [master|…1⚑ 8] 
16:53 $ doco exec swh-lister swh -l DEBUG lister run -l cgit url=https://source.codeaurora.org max_pages=1 max_origins_per_page=5
DEBUG:swh.core.config:Loading config file /lister.yml
DEBUG:urllib3.util.retry:Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): swh-scheduler:5008
DEBUG:urllib3.connectionpool:http://swh-scheduler:5008 "POST /lister/get_or_create HTTP/1.1" 200 202
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org with params None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): source.codeaurora.org:443
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET / HTTP/1.1" 200 None
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/32xsg_ino/eswitch with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/32xsg_ino/eswitch HTTP/1.1" 200 None
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/adl-tools/adl with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/adl-tools/adl HTTP/1.1" 200 None
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/adl-tools/plasma with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/adl-tools/plasma HTTP/1.1" 200 None
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/adl-tools/rnumber with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/adl-tools/rnumber HTTP/1.1" 200 None
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/afd4400/linux with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/afd4400/linux HTTP/1.1" 200 None
DEBUG:swh.lister.cgit.lister:https://source.codeaurora.org/external/afd4400/linux : Active tab is not the summary, trying to load the summary page
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/afd4400/linux/?h=git.kernel.org/master with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/afd4400/linux/?h=git.kernel.org/master HTTP/1.1" 200 None
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/afd4400/meta-fsl-arm with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/afd4400/meta-fsl-arm HTTP/1.1" 200 None
DEBUG:swh.lister.cgit.lister:https://source.codeaurora.org/external/afd4400/meta-fsl-arm : Active tab is not the summary, trying to load the summary page
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/afd4400/meta-fsl-arm/?h=github.com/1.4_M3 with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/afd4400/meta-fsl-arm/?h=github.com/1.4_M3 HTTP/1.1" 200 None
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/afd4400/u-boot with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/afd4400/u-boot HTTP/1.1" 200 None
DEBUG:swh.lister.cgit.lister:https://source.codeaurora.org/external/afd4400/u-boot : Active tab is not the summary, trying to load the summary page
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/afd4400/u-boot/?h=git.denx.de/WIP/31Mar2022-next with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/afd4400/u-boot/?h=git.denx.de/WIP/31Mar2022-next HTTP/1.1" 200 None
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/aic/kernel with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/aic/kernel HTTP/1.1" 200 None
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/armboard/wxWidgets with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/armboard/wxWidgets HTTP/1.1" 200 None
DEBUG:swh.lister.cgit.lister:https://source.codeaurora.org/external/armboard/wxWidgets : Active tab is not the summary, trying to load the summary page
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/armboard/wxWidgets/?h=github.com/AUTOCONF_ARCHIVE with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/armboard/wxWidgets/?h=github.com/AUTOCONF_ARCHIVE HTTP/1.1" 200 None
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/autobsps32/FreeRTOS/FreeRTOS with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/autobsps32/FreeRTOS/FreeRTOS HTTP/1.1" 200 None
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/autobsps32/FreeRTOS/FreeRTOS-Kernel with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/autobsps32/FreeRTOS/FreeRTOS-Kernel HTTP/1.1" 200 None
DEBUG:swh.lister.cgit.lister:https://source.codeaurora.org/external/autobsps32/FreeRTOS/FreeRTOS-Kernel : Active tab is not the summary, trying to load the summary page
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/autobsps32/FreeRTOS/FreeRTOS-Kernel/?h=freertos-10.4.1-pruned-nxp with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/autobsps32/FreeRTOS/FreeRTOS-Kernel/?h=freertos-10.4.1-pruned-nxp HTTP/1.1" 200 None
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/autobsps32/alb-demos with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/autobsps32/alb-demos HTTP/1.1" 200 None
DEBUG:swh.lister.cgit.lister:https://source.codeaurora.org/external/autobsps32/alb-demos : Active tab is not the summary, trying to load the summary page
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/autobsps32/alb-demos/?h=alb/master with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/autobsps32/alb-demos/?h=alb/master HTTP/1.1" 200 None
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/autobsps32/alb-fb-apps with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/autobsps32/alb-fb-apps HTTP/1.1" 200 None
DEBUG:swh.lister.cgit.lister:https://source.codeaurora.org/external/autobsps32/alb-fb-apps : Active tab is not the summary, trying to load the summary page
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/autobsps32/alb-fb-apps/?h=alb/master with params None
...
...

After that change:

(swh) ✔ ~/swh/swh-environment/docker [master|…1⚑ 8] 
16:57 $ doco exec swh-lister swh -l DEBUG lister run -l cgit url=https://source.codeaurora.org max_pages=1 max_origins_per_page=5
DEBUG:swh.core.config:Loading config file /lister.yml
DEBUG:urllib3.util.retry:Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): swh-scheduler:5008
DEBUG:urllib3.connectionpool:http://swh-scheduler:5008 "POST /lister/get_or_create HTTP/1.1" 200 202
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org with params None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): source.codeaurora.org:443
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET / HTTP/1.1" 200 None
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/32xsg_ino/eswitch with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/32xsg_ino/eswitch HTTP/1.1" 200 None
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/adl-tools/adl with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/adl-tools/adl HTTP/1.1" 200 None
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/adl-tools/plasma with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/adl-tools/plasma HTTP/1.1" 200 None
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/adl-tools/rnumber with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/adl-tools/rnumber HTTP/1.1" 200 None
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/afd4400/linux with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/afd4400/linux HTTP/1.1" 200 None
DEBUG:swh.lister.cgit.lister:https://source.codeaurora.org/external/afd4400/linux : Active tab is not the summary, trying to load the summary page
DEBUG:swh.lister.pattern:Fetching URL https://source.codeaurora.org/external/afd4400/linux/?h=git.kernel.org/master with params None
DEBUG:urllib3.connectionpool:https://source.codeaurora.org:443 "GET /external/afd4400/linux/?h=git.kernel.org/master HTTP/1.1" 200 None
INFO:swh.lister.pattern:Max origins per page set to 5 and reached, aborting page processing
DEBUG:urllib3.connectionpool:Resetting dropped connection: swh-scheduler
DEBUG:urllib3.connectionpool:http://swh-scheduler:5008 "POST /origins/record HTTP/1.1" 200 1422
INFO:swh.lister.pattern:Reached page limit of 1, terminating