
codemeta: Fix malformed dates that used to be allowed by the deposit

1 unresolved thread

Closes T4654.

Depends on !476 (closed).


Migrated from D8779 (view on Phabricator)

Activity

  • Build is green

    Patch application report for D8779 (id=31645)

    Could not rebase; Attempt merge onto a51cbf39...

    Updating a51cbf3..3bad414
    Fast-forward
     mypy.ini                                           |  3 ++
     requirements.txt                                   |  1 +
     swh/indexer/metadata_dictionary/base.py            | 25 +++++++------
     swh/indexer/metadata_dictionary/cff.py             |  7 +++-
     swh/indexer/metadata_dictionary/codemeta.py        | 32 +++++++++++------
     swh/indexer/metadata_dictionary/github.py          | 13 ++++---
     swh/indexer/metadata_dictionary/maven.py           | 11 +++---
     swh/indexer/metadata_dictionary/npm.py             | 16 ++-------
     swh/indexer/metadata_dictionary/nuget.py           |  4 +--
     swh/indexer/metadata_dictionary/utils.py           | 42 +++++++++++++++++++++-
     .../tests/metadata_dictionary/test_codemeta.py     | 33 +++++++++++++++--
     swh/indexer/tests/metadata_dictionary/test_npm.py  | 11 ++++++
     12 files changed, 144 insertions(+), 54 deletions(-)
    Changes applied before test
    commit 3bad41489c4b5412fbf250d7dd53c3b188956f65
    Author: Valentin Lorentz <vlorentz@softwareheritage.org>
    Date:   Wed Oct 26 14:19:26 2022 +0200
    
        codemeta: Fix malformed dates that used to be allowed by the deposit
    
    commit c0052f8e48fa4cf2c0034c48d2e66355558af62a
    Author: Valentin Lorentz <vlorentz@softwareheritage.org>
    Date:   Wed Oct 26 14:08:33 2022 +0200
    
        codemeta: Fix incorrect output namespace for dates and URLs
        
        Codemeta reexports schema:url, schema:dateCreated, ... with
        `"@type": "@id"` and `"@type": "schema:Date"` so that
        
        ```
        {
            "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
            "url": "http://example.org",
            "dateCreated": "2022-10-26"
        }
        ```
        
        expands to:
        
        ```
        {
            "http://schema.org/url": {
                "@type": "@id",
                "@value": "http://example.org"
            },
            "http://schema.org/dateCreated": {
                "@type": "http://schema.org/Date",
                "@value": "2022-10-26"
            }
        }
        ```
        
        However, our translation tried to translate directly to a partially expanded
        form, like this:
        
        ```
        {
            "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
            "url": {
                "@value": "http://example.org"
            },
            "dateCreated": {
                "@value": "2022-10-26"
            }
        }
        ```
        
        which prevents the compaction and expansion algorithms from adding a
        type themselves, causing the document to be compacted to:
        
        ```
        {
            "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
            "schema:url": "http://example.org",
            "schema:dateCreated": "2022-10-26"
        }
        ```
        
        or expanded to:
        
        ```
        {
            "http://schema.org/url": {
                "@value": "http://example.org"
            },
            "http://schema.org/dateCreated": {
                "@value": "2022-10-26"
            }
        }
        ```
        
        which are not what we want.
        
        This commit replaces the hack for `@type` with the right solution that
        works for all properties.
    
    commit a66d5b240ab77e6d8d1b9accf43d571489a3f7f0
    Author: Valentin Lorentz <vlorentz@softwareheritage.org>
    Date:   Tue Oct 25 16:02:16 2022 +0200
    
        metadata_dictionary: Systematically check input URLs before adding to graph
        
        This is hopefully the definitive workaround for the PyLD issue.

    See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/521/ for more details.
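
The namespace commit message above turns on one rule: type coercion from the context is applied to plain string values, but not to values that are already wrapped in a `{"@value": ...}` object. The following is a toy model of that rule only, not the JSON-LD expansion algorithm and not the code from this merge request. `TOY_CONTEXT` and `toy_expand` are hypothetical names, only the date term is modelled, and real expansion also wraps values in arrays, which is omitted here.

```python
from typing import Any, Dict

# Toy stand-in for one term of the Codemeta context (assumption: the real
# context is fetched from its URL and contains many more terms).
TOY_CONTEXT: Dict[str, Dict[str, str]] = {
    "dateCreated": {
        "@id": "http://schema.org/dateCreated",
        "@type": "http://schema.org/Date",
    },
}


def toy_expand(document: Dict[str, Any]) -> Dict[str, Any]:
    """Expand term names and coerce plain strings to typed value objects."""
    expanded: Dict[str, Any] = {}
    for term, value in document.items():
        definition = TOY_CONTEXT[term]
        if isinstance(value, str):
            # Plain string: the type declared in the context is applied.
            value = {"@type": definition["@type"], "@value": value}
        # A value that is already an object is passed through untouched,
        # so no type is ever added to it.
        expanded[definition["@id"]] = value
    return expanded


plain = toy_expand({"dateCreated": "2022-10-26"})
wrapped = toy_expand({"dateCreated": {"@value": "2022-10-26"}})
print(plain)
print(wrapped)
```

In this sketch only `plain` ends up with the `http://schema.org/Date` type, which mirrors why emitting partially expanded values lost the type information.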

  # expansion will convert it to a full URI based on
  # "@context": CODEMETA_CONTEXT_URL
  jsonld_child = self.xml_to_jsonld(child)
+ if (
+     localname
+     in (
+         "dateCreated",
+         "dateModified",
+         "datePublished",
+     )
+     and isinstance(jsonld_child, str)
+     and _DATE_RE.match(jsonld_child)
  • Maybe add an extra condition on string length to avoid useless reformatting?

    and len(jsonld_child) < 10
  • Author Maintainer

    I don't think it matters, reformatting is fast:

    In [4]: %timeit iso8601.parse_date("2022-10-26").date().isoformat()
    4.56 µs ± 47.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  • Merge request was accepted

  • Antoine Lambert approved this merge request


  • Author Maintainer

    Merge request was merged

  • closed

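
For readers following the review thread on date reformatting, here is a minimal sketch of the kind of normalization under discussion. It is an illustration under assumptions, not the merged code: the actual `_DATE_RE` pattern is not shown in this page, so `_LOOSE_DATE_RE` below is a hypothetical pattern that accepts unpadded months and days, and the sketch uses the standard library `datetime` module instead of the `iso8601` package mentioned in the reply.

```python
import datetime
import re

# Hypothetical pattern (assumption): year, then month and day that may lack
# zero padding. The real _DATE_RE from the merge request may differ.
_LOOSE_DATE_RE = re.compile(r"^(\d{4})-(\d{1,2})-(\d{1,2})$")


def normalize_date(value: str) -> str:
    """Return value as a zero-padded YYYY-MM-DD date when possible.

    Strings that do not look like a date, or that name an impossible
    calendar date, are returned unchanged.
    """
    match = _LOOSE_DATE_RE.match(value)
    if match is None:
        return value
    year, month, day = (int(group) for group in match.groups())
    try:
        return datetime.date(year, month, day).isoformat()
    except ValueError:
        # For example month 13 or day 40: leave the input untouched.
        return value


print(normalize_date("2022-1-5"))
print(normalize_date("2022-10-26"))
```

The length check suggested in the thread would only skip inputs that are already ten characters long. Since reformatting a well-formed date returns the same string, leaving the check out changes timing, not results.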