Implement query to get origin visit types dynamically
The archive content grows continually and new visit types are added over time. For instance, the cvs and opam visit types will be added in production in the near future.
Those visit types are used in the swh-web origin search form, but they are currently hardcoded, which means the list must be updated manually each time a new visit type is introduced.
We should have a way to get that list of visit types dynamically, for ease of use. It turns out we can extract it efficiently using the following Elasticsearch query:
anlambert@carnavalet:~/tmp$ cat es_query.sh
#!/bin/bash
curl -X POST http://localhost:9200/origin-production/_search?pretty -H 'Content-Type: application/json' -d '
{
"aggs" : {
"vist_types" : {
"terms" : { "field" : "visit_types", "size":10000 }
}
},
"size" : 0
}'
anlambert@carnavalet:~/tmp$ time ./es_query.sh
{
"took" : 429,
"timed_out" : false,
"_shards" : {
"total" : 90,
"successful" : 90,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"vist_types" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "git",
"doc_count" : 151794432
},
{
"key" : "npm",
"doc_count" : 1660597
},
{
"key" : "svn",
"doc_count" : 678797
},
{
"key" : "hg",
"doc_count" : 381103
},
{
"key" : "pypi",
"doc_count" : 326793
},
{
"key" : "deb",
"doc_count" : 72303
},
{
"key" : "cran",
"doc_count" : 18019
},
{
"key" : "ftp",
"doc_count" : 1205
},
{
"key" : "deposit",
"doc_count" : 1079
},
{
"key" : "tar",
"doc_count" : 389
},
{
"key" : "nixguix",
"doc_count" : 2
}
]
}
}
}
real 0m0,553s
user 0m0,013s
sys 0m0,014s
We could then add a new method named origin_visit_types to the swh-search interface, wrapping that request and returning a dict mapping each visit type to its count.
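A minimal sketch of what such a method could look like, not the actual swh-search implementation. It assumes the search backend is reachable through a callable `es_search` (hypothetical) that accepts `index` and `body` keyword arguments and returns the decoded JSON response. The helper names `visit_types_query` and `parse_visit_types` are also illustrative. The aggregation is named `visit_types` here; it only needs to match between the query and the parsing code.

```python
from typing import Any, Callable, Dict


def visit_types_query(max_types: int = 10000) -> Dict[str, Any]:
    """Build a terms aggregation over the visit_types field.

    "size": 0 at the top level skips returning documents, so only
    the aggregation buckets come back.
    """
    return {
        "aggs": {
            "visit_types": {
                "terms": {"field": "visit_types", "size": max_types}
            }
        },
        "size": 0,
    }


def parse_visit_types(response: Dict[str, Any]) -> Dict[str, int]:
    """Turn the aggregation buckets into {visit_type: doc_count}."""
    buckets = response["aggregations"]["visit_types"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


def origin_visit_types(
    es_search: Callable[..., Dict[str, Any]],
    index: str = "origin-production",
) -> Dict[str, int]:
    """Hypothetical wrapper: run the query and return the counts.

    es_search is an assumption: any callable taking index= and body=
    and returning the decoded search response.
    """
    response = es_search(index=index, body=visit_types_query())
    return parse_visit_types(response)
```

Passing the search callable in as a parameter keeps the parsing logic testable without a running Elasticsearch instance. swh-web could then build the search form choices from the keys of the returned dict.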
Migrated from T3441 (view on Phabricator)