origin_search: Filters and sorting for date_{created,modified,published}
intrinsic_metadata often contains date_{created,modified,published} which can be used as sorting options as well as filters.
Migrated from D5964 (view on Phabricator)
Build is green
Patch application report for D5964 (id=21453)
Rebasing onto 2e1fb863...
Current branch diff-target is up to date.
Changes applied before test
commit 85dd6f27068962df6330a87239989e1c82f25d18
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Fri Jul 2 15:15:04 2021 +0000

    origin_search: Filters and sorting for date_{created,modified,published}
See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/186/ for more details.
     # don't bother indexing tokens in these URIs, as the
     # are used as namespaces
     "type": "keyword",
-}
+},
+"http://schema": {
+    "properties": {
+        "org/dateCreated": {

AFAIK, it's important to define the date type for date{Published,Modified,Created} for the filters/sorting options to work, since we've set "date_detection": False (so automatic date detection won't work). But if I uncomment these lines of code, most of the test_elasticsearch.py tests start to fail. The error thrown in that case looks like: https://forge.softwareheritage.org/swh/meta$1082.
Actually, all the fields in the nested document need to be strings. It seems it's possible to configure the mapping as you want and to add the lenient[1] parameter to the search query.
Some tests will need to be adapted (test_origin_intrinsic_metadata_string_mapping for example) as they are trying to set random text on the dateCreated field and it will fail with the new mapping.
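As a rough illustration of the suggestion above, the following sketch builds an explicit date mapping for the three fields and a search body that sets lenient. It only constructs plain dictionaries; the variable names, the use of a query_string query, and the overall shape of the mapping are assumptions made for illustration, not the code from this diff.

```python
# Hypothetical names throughout; only "type": "date" and "lenient" are
# Elasticsearch options, the surrounding structure is an assumption.
DATE_FIELDS = ["dateCreated", "dateModified", "datePublished"]

# Explicit "date" type for each field, mirroring how the diff splits
# "http://schema.org/..." on the dot into nested "properties".
intrinsic_metadata_mapping = {
    "type": "nested",
    "properties": {
        "http://schema": {
            "properties": {
                f"org/{name}": {"type": "date"} for name in DATE_FIELDS
            }
        }
    },
}


def lenient_query(text):
    """Build a search body whose query ignores format-based errors,
    such as free text matched against a field mapped as a date."""
    return {"query": {"query_string": {"query": text, "lenient": True}}}
```

With a mapping like this, documents carrying random text in dateCreated would be rejected at indexing time, which is why the tests mentioned above would need adapting.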
Build has FAILED
Patch application report for D5964 (id=21499)
Rebasing onto f378a989...
Current branch diff-target is up to date.
Changes applied before test
commit 31b0f67cc99e49d0c398960ed93ac4d9c5134931
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Mon Jul 5 16:17:23 2021 +0000

    origin_search: Filters and sorting for date_{created,modified,published}
Link to build: https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/190/ See console output for more information: https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/190/console
! In !57 (closed), @vlorentz wrote: Can you either add tests, or deduplicate this code so we don't need to test every field?
By "deduplicate this code so we don't need to test every field", do you mean common code and tests for all the date fields (last_visit, last_release, ... datePublished, dateModified, ...)?
Build is green
Patch application report for D5964 (id=21505)
Rebasing onto f378a989...
Current branch diff-target is up to date.
Changes applied before test
commit 976c7229d2221cbdc82517ab7a24d121ad0ced62
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Tue Jul 6 09:44:27 2021 +0000

    origin_search: Validate intrinsic_metadata date field format before storing

commit 31b0f67cc99e49d0c398960ed93ac4d9c5134931
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Mon Jul 5 16:17:23 2021 +0000

    origin_search: Filters and sorting for date_{created,modified,published}
See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/191/ for more details.
         # * {"author": [{"@value": "Jane Doe"}]}
         # and JSON-LD expansion will convert them all to the last one.
         if "intrinsic_metadata" in res:
-            res["intrinsic_metadata"] = codemeta.expand(res["intrinsic_metadata"])
+            intrinsic_metadata = res["intrinsic_metadata"]
+            for date_field in ["dateCreated", "dateModified", "datePublished"]:
+                if date_field in intrinsic_metadata:
+                    date = intrinsic_metadata[date_field]
+
+                    # If date{Created,Modified,Published} value isn't parsable
+                    # It gets rejected and isn't stored (unlike other fields)
+                    if not is_date_parsable(date):
+                        intrinsic_metadata.pop(date_field)
+
+            res["intrinsic_metadata"] = codemeta.expand(intrinsic_metadata)

Build is green
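The rejection step in this hunk can be tried in isolation. Below is a minimal, self-contained sketch; drop_unparsable_dates is a hypothetical helper name, and is_date_parsable is written here as a strict %Y-%m-%d check, which is an assumption about the validator at this stage of the review.

```python
from datetime import datetime

DATE_FIELDS = ("dateCreated", "dateModified", "datePublished")


def is_date_parsable(date_str):
    """Strict %Y-%m-%d check (assumed form of the validator)."""
    try:
        datetime.strptime(date_str, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False


def drop_unparsable_dates(intrinsic_metadata):
    """Return a copy of the metadata without date fields that fail
    validation; all other fields are kept untouched."""
    cleaned = dict(intrinsic_metadata)
    for date_field in DATE_FIELDS:
        if date_field in cleaned and not is_date_parsable(cleaned[date_field]):
            cleaned.pop(date_field)
    return cleaned
```

Unlike the hunk above, this sketch works on a copy, so the caller's dictionary is not mutated; that is a design choice of the example rather than something the diff does.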
Patch application report for D5964 (id=21531)
Rebasing onto f378a989...
Current branch diff-target is up to date.
Changes applied before test
commit dc52c981fb953add9be87f464a0921ea6601bc02
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jul 7 11:27:35 2021 +0000

    origin_update: Document rejection of metadata date fields if not parsable

commit 976c7229d2221cbdc82517ab7a24d121ad0ced62
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Tue Jul 6 09:44:27 2021 +0000

    origin_search: Validate intrinsic_metadata date field format before storing

commit 31b0f67cc99e49d0c398960ed93ac4d9c5134931
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Mon Jul 5 16:17:23 2021 +0000

    origin_search: Filters and sorting for date_{created,modified,published}
See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/192/ for more details.
- swh/search/utils.py 0 → 100644
+    if sep:
+        return sep.join(METADATA_FIELDS[field])
+
+    return METADATA_FIELDS[field]
+
+
+def is_date_parsable(date_str):
+    """
+    Return True if date_str is in the format
+    %Y-%m-%d or the standard ISO format.
+    Otherwise return False.
+    """
+    try:
+        datetime.strptime(date_str, "%Y-%m-%d")
+        return True
+    except Exception:

This does not validate that it is in "the standard ISO format": https://docs.python.org/3/library/datetime.html#datetime.datetime.fromisoformat
You should use the iso8601 library instead.
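To make the gap concrete: strptime with "%Y-%m-%d" accepts only the date-only form, so a full timestamp is rejected. The sketch below uses the standard library's datetime.fromisoformat purely as a stand-in so that it runs without extra dependencies; before Python 3.11 that function handles only a subset of ISO 8601, which is presumably why the iso8601 library is suggested. Swapping in iso8601.parse_date would be the reviewer's route; the function name is_iso_date_parsable is hypothetical.

```python
from datetime import datetime


def is_iso_date_parsable(date_str):
    """Return True if date_str is accepted by datetime.fromisoformat.

    Stand-in for an iso8601-based check: both the date-only form and
    a timestamp with a UTC offset are accepted.
    """
    try:
        datetime.fromisoformat(date_str)
        return True
    except (TypeError, ValueError):
        # TypeError covers non-string input such as None.
        return False
```

For example, is_iso_date_parsable("2021-07-02T15:15:04+00:00") is accepted here, whereas the strptime-based check in the diff would reject it.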
- swh/search/utils.py 0 → 100644
+
     if field == "score":
         if reversed:
             return -origin.get(field, 0)
         else:
             return origin.get(field, 0)

-    datetime_max = datetime.max.replace(tzinfo=timezone.utc)
+    if field in ["date_created", "date_modified", "date_published"]:
+        date = datetime.strptime(
+            _nested_get(origin, get_expansion(field), DATE_MIN)[0], "%Y-%m-%d"
+        )
+        if reversed:
+            return DATE_OBJ_MAX - date
+        else:
+            return date

Build is green
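The reversed branch in the hunk above relies on subtracting the date from a maximum datetime so that an ascending sort yields descending dates. A minimal sketch of that idea follows; the flat origin dictionaries, the values chosen for DATE_MIN and DATE_OBJ_MAX, and the function name date_sort_key are assumptions for illustration (the diff looks the value up through _nested_get and get_expansion instead).

```python
from datetime import datetime

DATE_MIN = "0001-01-01"      # assumed fallback for origins without a date
DATE_OBJ_MAX = datetime.max  # assumed upper bound used to invert the order


def date_sort_key(origin, field, reverse=False):
    """Sort key for a %Y-%m-%d date stored under `field`."""
    date = datetime.strptime(origin.get(field, DATE_MIN), "%Y-%m-%d")
    # Subtracting from the maximum gives a timedelta that shrinks as the
    # date grows, so an ascending sort on it lists newest dates first.
    return DATE_OBJ_MAX - date if reverse else date
```

Usage would look like sorted(origins, key=lambda o: date_sort_key(o, "date_created", reverse=True)). Origins without the field fall back to DATE_MIN and therefore sort first in ascending order and last in reversed order.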
Patch application report for D5964 (id=21612)
Rebasing onto f378a989...
Current branch diff-target is up to date.
Changes applied before test
commit fe7640f71024084554ab4d36209f6da5d1c76267
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Tue Jul 13 14:59:53 2021 +0530

    origin_search: Filters and sorting for date_{created,modified,published}

    intrinsic_metadata often contains date_{created,modified,published} which can be used as sorting options as well as filters.
See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/193/ for more details.