Deal with IRIs
Currently, origins in the archive are identified by their IRI, such as this one: https://archive.softwareheritage.org/browse/origin/https://gitorious.org/systemy-zdarzeniowe%C4%85%C5%9B%C4%87/systemy-zdarzeniowe-gitorious-wiki.git/directory/
However, database schemas (origin-related tables and skipped_content), variables (grep -Ri url swh-environment/*/swh), APIs (origin-related tables and skipped_content for swh-storage, all origin arguments of swh-web), specifications, and documentation refer to them as URLs.
IMO, the easiest solution would be to keep the current format and names, and just update the specifications/documentation to mention that they are actually IRIs.
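For context on why the browse link above contains `%C4%85%C5%9B%C4%87`: an IRI can be mapped to a URI by percent-encoding the UTF-8 bytes of its non-ASCII characters. Here is a minimal sketch using only the standard library. The helper name `iri_to_uri` is hypothetical (it is not an existing swh function). It assumes an ASCII host name and skips the normalization steps of the full RFC 3987 mapping.

```python
from urllib.parse import quote

# RFC 3986 reserved characters, plus "%" so that existing escapes are not
# encoded a second time. Unreserved ASCII characters (letters, digits,
# "-", ".", "_", "~") are never escaped by quote().
_URI_SAFE = ":/?#[]@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: percent-encode everything quote() considers unsafe.

    This covers all non-ASCII characters (encoded as UTF-8 bytes); it does not
    handle internationalized host names, which is an assumption of this sketch.
    """
    return quote(iri, safe=_URI_SAFE, encoding="utf-8")


iri = "https://gitorious.org/systemy-zdarzenioweąść/systemy-zdarzeniowe-gitorious-wiki.git"
print(iri_to_uri(iri))
# https://gitorious.org/systemy-zdarzeniowe%C4%85%C5%9B%C4%87/systemy-zdarzeniowe-gitorious-wiki.git
```

If names and formats stay as they are, a mapping like this would only matter at boundaries that require plain ASCII URIs.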
Migrated from T2262 (view on Phabricator)
Activity
- vlorentz mentioned in issue #2379 (closed)
- vlorentz added Data Model Storage manager priority:Normal labels
- Maintainer
I'm fine with switching to IRIs in the doc, just please expand what it means on first use (with a mention like "they are like URIs but"), as I don't think the acronym is that well-known yet, especially in the US.
- Stefano Zacchiroli changed title from Dealing with IRIs to SWHID: deal with IRIs
- vlorentz changed title from SWHID: deal with IRIs to Deal with IRIs
- vlorentz changed the description
- Phabricator Migration user marked this issue as related to #2379 (closed)
- Maintainer
I wrote this little script to count the origin IRIs and URIs in the archive:
```python
from pprint import pprint

from rfc3987 import parse
from swh.web.common import service

batch_size = 100000
nb_iris = 0
nb_uris = 0
nb_origins = 0
iris = []
no_uri_iris = []


def process_origins(origins):
    global nb_origins, nb_iris, nb_uris
    nb_origins += len(origins)
    for origin in origins:
        try:
            # Valid URIs are counted first; every URI is also a valid IRI.
            parse(origin['url'], rule='URI')
            nb_uris += 1
        except ValueError:
            try:
                parse(origin['url'], rule='IRI')
                nb_iris += 1
                iris.append(origin['url'])
            except ValueError:
                # Neither a valid URI nor a valid IRI.
                no_uri_iris.append(origin['url'])
    print(f'nb_origins = {nb_origins}, nb_iris = {nb_iris}, nb_uris = {nb_uris}')


# Walk through all origins in batches, resuming after the last seen id.
origins = list(service.lookup_origins(origin_count=batch_size))
while len(origins) == batch_size:
    process_origins(origins)
    origins = list(service.lookup_origins(origin_from=origins[-1]["id"] + 1,
                                          origin_count=batch_size))
process_origins(origins)

pprint(iris)
pprint(no_uri_iris)
```
There are exactly two origins with IRIs:
nb_origins = 115660314, nb_iris = 2, nb_uris = 115660251
['https://gitorious.org/systemy-zdarzenioweąść/systemy-zdarzeniowe-gitorious-wiki.git',
 'https://gitorious.org/systemy-zdarzenioweąść/systemy-zdarzeniowe.git']
Also, 61 origins have a URL that is neither a valid URI nor a valid IRI:
['http://code.google.com/eclipselabs/m/mobile-web-development-with-phonegap/sv\\',
 'http://code.qt.io/{non-gerrit}/qt-labs/coroutine.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qml1-shadersplugin.git',
 'http://code.qt.io/{graveyard}/qt-historical.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qtmodularization.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/webclient.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/simplegl.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/webscraps.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qtscript-browser-env.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qtjambi-awtbridge.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/remotecontrolwidget.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/itemviews-ng.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qtuitest.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/modelviewer.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/systemtests.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/devnet-examples.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qtspotify.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/segmentedbutton.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/graphics-dojo.git',
 'http://code.qt.io/{graveyard}/qt-creator-historical.git',
 'http://code.qt.io/{graveyard}/quick3d.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qt-compositor.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/bm2.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qmlcanvas.git',
 'http://code.qt.io/{graveyard}/qlogger.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qtestlib-tools.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qtscript-remote-debugging.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/mobile-demos.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qtcollator.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/doxygen2qthelp.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/simulator.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/kineticscroller.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/devdays-graphicssystem-plugin.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/maemo5-homescreen.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/scxml.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qml-gesturearea.git',
 'http://code.qt.io/{graveyard}/qtjsbackend.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qmlogre.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/symbian-overlay.git',
 'http://code.qt.io/{graveyard}/qtbinaryjson.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qt5-launch-demo.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qml-object-model.git',
 'http://code.qt.io/{graveyard}/qtmultimediakit.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/scene-graph.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/wolfenqt.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qt-autotester.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/nacl.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/devdays-windowsystem-server.git',
 'http://code.qt.io/{non-gerrit}/qt/qt3support.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/doctools.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qml-toucharea.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qml-gestures-examples.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/widgets-ng.git',
 'http://code.qt.io/{graveyard}/qtx11support.git',
 'http://code.qt.io/{graveyard}/qtphonon.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/opencl.git',
 'http://code.qt.io/{graveyard}/qtjsondb.git',
 'http://code.qt.io/{graveyard}/v4vm.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/bm.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/scene-graph-demo.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/meespot.git']
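All of the rejected URLs contain either a curly brace or a backslash. Both RFC 3986 and RFC 3987 exclude these ASCII characters unless they are percent-encoded. The sketch below is a rough character-level check that needs no third-party dependency. It is not a replacement for `rfc3987.parse`, since it does not validate the grammar or the well-formedness of percent escapes, and the function name `rough_kind` is made up for illustration.

```python
# ASCII characters that neither the URI nor the IRI grammar allows unescaped.
_FORBIDDEN = set(' "<>\\^`{|}')


def rough_kind(url: str) -> str:
    """Hypothetical helper: classify a string as 'uri', 'iri' or 'invalid'.

    Character-level approximation only; a string labelled 'uri' or 'iri'
    here may still be rejected by a real RFC 3986/3987 parser.
    """
    for ch in url:
        code = ord(ch)
        # Forbidden ASCII punctuation, C0 controls, DEL and C1 controls.
        if ch in _FORBIDDEN or code < 0x20 or 0x7F <= code <= 0x9F:
            return "invalid"
    # Only non-ASCII characters distinguish an IRI from a URI at this level.
    return "uri" if url.isascii() else "iri"


print(rough_kind('http://code.qt.io/{graveyard}/qlogger.git'))
# invalid
print(rough_kind('https://gitorious.org/systemy-zdarzenioweąść/systemy-zdarzeniowe.git'))
# iri
```

A check like this could serve as a cheap pre-filter, but the counts above come from the full `rfc3987` parser.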