We are currently following quite closely our principle of "storing only facts" in the archive, so we never report bare statements like "the licence is GPL", but qualified statements like "according to tool X, the licence is GPL".
The JSON format in the above example, though, is not general enough. In the future, we will store several qualified statements for the same property, like "according to tool X, licence is GPL, and according to tool Y, licence is MPL".
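For instance (a purely illustrative sketch; the field names here are hypothetical, not the actual schema), a property carrying several qualified statements could look like:

```json
{
  "license": [
    {"value": "GPL", "tool": "X"},
    {"value": "MPL", "tool": "Y"}
  ]
}
```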
Stefano Zacchiroli changed title from "Preparer JSON output for storing multiple value/tool entries" to "Web API: make endpoints that expose extracted metadata return lists of factual information"
Created ~/.config/swh/storage.yml and ~/.config/swh/webapp/webapp.yml
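For reference, a minimal sketch of what `storage.yml` might contain; the exact keys below are an assumption based on swh-storage's configuration format of the era, so double-check them against the module's README before relying on them:

```yaml
# ~/.config/swh/storage.yml -- hypothetical sketch, verify keys against swh-storage docs
storage:
  cls: local
  args:
    db: dbname=softwareheritage-dev
    objstorage:
      cls: pathslicing
      args:
        root: /tmp/swh/objects
        slicing: 0:2/2:4/4:6
```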
And then ran:
```shell
# terminal 0
## install external python deps
cd /path/to/swh-environment
virtualenv --python=python3 .venv
source .venv/bin/activate
for i in `bin/ls-py-modules`; do
    cd $i; pip3 install -r requirements.txt; cd ..
done

## dump test data into postgres
make rebuild-testdata

# terminal 0
## start storage server
cd /path/to/swh-environment
source .venv/bin/activate
source pythonpath.sh
python3 -m swh.storage.api.server ~/.config/swh/storage.yml

# terminal 1
## start webapp
cd /path/to/swh-environment
source .venv/bin/activate
source pythonpath.sh
python3 -m swh.web.manage runserver
```
Now, if I go to http://127.0.0.1:5004/api/1/content/sha1:1fc6129a692e7a87b5450e2ba56e7669d0c5775d/license/, it returns a 503:
```json
{
  "exception": "StorageAPIError",
  "reason": "An unexpected error occurred in the api backend: HTTPConnectionPool(host='127.0.0.1', port=5007): Max retries exceeded with url: /content/fossology_license (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 111] Connection refused',))"
}
```
How do I make swh-indexer available at 127.0.0.1:5007?
Is there anything else that I've to setup in the development environment?
Hi s!, and thanks for your interest in helping us out!
First of all, I should apologize for the lack of "getting started" documentation for developers. It's on my plate, and the stuff you see at https://docs.softwareheritage.org/devel/ is the beginning of it, but it is "not there yet", so your questions are entirely legitimate. We'll try to walk you through what's needed here, hoping that in the future the doc will directly address needs like yours.
Not totally so: licenses and tool must be side by side to be factual.
Nomossa actually detected the 4 licenses (we have only one tool for license detection).
Would updating the db_to_* functions in swh.indexer.storage.converters resolve this task?
I'd say along those lines, yes, probably one step before.
Starting from the get endpoint: the storage will return you the content's associated licenses (including the tool), and you then aggregate those according to my previous comment...
Still, following the sample I proposed, that would then give something like:
```json
[
  { // <- that's still a list because there can be other different ids
    "id": "2d8280fbabf9a1eabbcbc562b9763cb07952118b",
    "facts": [
      {
        "licenses": ["Dual-license", "GPL", "MIT", "MIT-style"],
        "tool": {
          "configuration": {"command_line": "nomossa <filepath>"},
          "id": 1,
          "name": "nomos",
          "version": "3.1.0rc2-31-ga2cbb8c"
        }
      },
      {
        "licenses": ["MIT", "MIT-style"],
        "tool": {
          "configuration": {"command_line": "scancode <filepath>"},
          "id": 2,
          "name": "scancode",
          "version": "9.2.3"
        }
      }
    ]
  }
]
```
Note that this could also be a dict, with the current "facts" as keys and values (that would just diverge from the current pattern we have: list in, list out in the storage api).
That could be converted in the web layer (if that makes sense).
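The aggregation step described above can be sketched in plain Python; the flat row shape below is hypothetical, just mimicking one (content, tool) pair per row as the storage might return them:

```python
from collections import defaultdict

def group_license_rows(rows):
    """Group flat per-(content, tool) rows into one entry per content id."""
    by_id = defaultdict(list)
    for row in rows:
        # each fact keeps the licenses side by side with the tool that found them
        by_id[row['id']].append({'licenses': row['licenses'], 'tool': row['tool']})
    # list in, list out: one entry per distinct content id
    return [{'id': id_, 'facts': facts} for id_, facts in by_id.items()]

# hypothetical flat rows, as if returned by the storage
rows = [
    {'id': 'abc123', 'licenses': ['GPL', 'MIT'], 'tool': {'name': 'nomos'}},
    {'id': 'abc123', 'licenses': ['MIT'], 'tool': {'name': 'scancode'}},
]
grouped = group_license_rows(rows)  # one entry, with two facts
```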
I was able to start the swh-indexer at 127.0.0.1:5007.
I modified the storage module in swh-indexer to make IndexerStorage.content_fossology_license_get return a list of dicts as suggested in #782 (closed).
Here's the diff:
```diff
diff --git a/swh/indexer/storage/__init__.py b/swh/indexer/storage/__init__.py
index e71523d..1c6f0c9 100644
--- a/swh/indexer/storage/__init__.py
+++ b/swh/indexer/storage/__init__.py
@@ -298,9 +298,16 @@ class IndexerStorage():
         db = self.db
         db.store_tmp_bytea(ids, cur)
+        d = {}
         for c in db.content_fossology_license_get_from_temp():
             license = dict(zip(db.content_fossology_license_cols, c))
-            yield converters.db_to_fossology_license(license)
+
+            id_ = license['id']
+            if id_ not in d:
+                d[id_] = []
+            d[id_].append(converters.db_to_fossology_license(license))
+
+        yield self._generate(d)

     @db_transaction
     def content_fossology_license_add(self, licenses,
@@ -548,3 +555,10 @@ class IndexerStorage():
         if not idx:
             return None
         return dict(zip(self.db.indexer_configuration_cols, idx))
+
+    def _generate(self, d):
+        for id_, facts in d.items():
+            yield {
+                'id': id_,
+                'facts': facts
+            }
diff --git a/swh/indexer/storage/converters.py b/swh/indexer/storage/converters.py
index db7a295..3cf5da1 100644
--- a/swh/indexer/storage/converters.py
+++ b/swh/indexer/storage/converters.py
@@ -129,7 +129,6 @@ def db_to_metadata(metadata):

 def db_to_fossology_license(license):
     return {
-        'id': license['id'],
         'licenses': license['licenses'],
         'tool': {
             'id': license['tool_id'],
```
I would like to know if I'm going in the right direction.
Does make rebuild-storage-testdata add objects (hashes) with license information connected to them? If yes, how do I find these objects? -- I would like to test the /api/1/content/sha1:OBJ_HASH/license endpoint with these objects.
> I was able to start the swh-indexer at 127.0.0.1:5007.
Awesome!
> I modified the storage module in swh-indexer to make IndexerStorage.content_fossology_license_get return a list of dicts as suggested in #782 (closed).
> I would like to know if I'm going in the right direction.
Yes, sounds like it :)
> Does make rebuild-storage-testdata add objects (hashes) with license information connected to them? If yes, how do I find these objects? -- I would like to test the /api/1/content/sha1:OBJ_HASH/license endpoint with these objects.
No; that routine drops all dbs and rebuilds them to the latest schema (well, as of the current HEAD in your swh-environment).
To have data, you need to run our different modules (loader, then indexer).
I started writing something, but being thorough takes more time, so I stopped midway to reply to you on the first points already.
More to come soon on how you could have data locally ;)
Note that there is a more complete explanation on actually loading and indexing the data yourself, but it's kinda long (way longer than this one, that is ;)
For what it's worth, that was my initial response $236.
```diff
diff --git a/swh/indexer/storage/__init__.py b/swh/indexer/storage/__init__.py
index e71523d..1f06056 100644
--- a/swh/indexer/storage/__init__.py
+++ b/swh/indexer/storage/__init__.py
@@ -292,15 +292,24 @@ class IndexerStorage():
             list: dictionaries with the following keys:

                 - id (bytes)
-                - licenses ([str]): associated licenses for that content
+                - facts ([str]): associated licenses for that content

         """
         db = self.db
         db.store_tmp_bytea(ids, cur)
+        d = {}
         for c in db.content_fossology_license_get_from_temp():
             license = dict(zip(db.content_fossology_license_cols, c))
-            yield converters.db_to_fossology_license(license)
+
+            id_ = license['id']
+            if id_ not in d:
+                d[id_] = []
+            d[id_].append(converters.db_to_fossology_license(license))
+
+        for id_, facts in d.items():
+            yield {'id': id_, 'facts': facts}
+

     @db_transaction
     def content_fossology_license_add(self, licenses,
diff --git a/swh/indexer/storage/converters.py b/swh/indexer/storage/converters.py
index db7a295..3cf5da1 100644
--- a/swh/indexer/storage/converters.py
+++ b/swh/indexer/storage/converters.py
@@ -129,7 +129,6 @@ def db_to_metadata(metadata):

 def db_to_fossology_license(license):
     return {
-        'id': license['id'],
         'licenses': license['licenses'],
         'tool': {
             'id': license['tool_id'],
```
If the changes look good, I'll update the failing tests and submit a patch for code review.
In #782 (closed), @ardumont wrote:
> Note that there is a more complete explanation on actually loading and indexing the data yourself, but it's kinda long (way longer than this one, that is ;)
> For what it's worth, that was my initial response $236.
Thanks for writing the quick howto! I'll check it out.
That would help in reviewing and possibly suggesting code adaptation in context (as explained below).
And as you are quite advanced in the contributions now, that would make sense ;)
Thanks in advance.
Changes
I currently see a few possible changes, which your really cool demo illustrates.
The value returned should be directly the `d` dictionary's content:

- so either `yield from d` should be ok (possibly, not checked),
- or your `return` must change to plain `d` (you would possibly need to change the decorator from `@db_transaction_generator` to `@db_transaction`).
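As a generic Python illustration of why this matters (nothing swh-specific here): `yield gen` hands the caller a single generator object, while `yield from gen` yields the items themselves:

```python
def items():
    yield {'id': 1}
    yield {'id': 2}

def yields_generator():
    yield items()       # caller receives one generator object, not the dicts

def yields_items():
    yield from items()  # caller receives each dict in turn

assert [type(x).__name__ for x in yields_generator()] == ['generator']
assert list(yields_items()) == [{'id': 1}, {'id': 2}]
```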
Note: It's buggy because in the current state prior to your adaptation, we could already have multiple results (1 per tool) even if we have only one sha1 in the endpoint's input (#782 (closed)).
It's not that much of a deal now, because we currently have only one tool running. But long term, it's not correct; thus this issue in the first place, btw ;)
Implementation-wise, you could use `from collections import defaultdict` and initialize `d` as a `defaultdict(list)`; that way you can remove the `if id_ not in d` test and directly use `d[id_].append`. You can check some use-case samples in the python3 documentation.
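A minimal illustration of that suggestion (generic Python, hypothetical data):

```python
from collections import defaultdict

d = defaultdict(list)  # missing keys are created as empty lists on first access
rows = [('id1', 'GPL'), ('id1', 'MIT'), ('id2', 'MIT')]
for id_, license in rows:
    d[id_].append(license)  # no `if id_ not in d` guard needed

assert d == {'id1': ['GPL', 'MIT'], 'id2': ['MIT']}
```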