We are currently following quite closely our principle of "storing only facts" in the archive, so we never report bare statements like "the licence is GPL", but qualified statements like "according to tool X, the licence is GPL".
The JSON format in the above example, though, is not general enough. In the future, we will store several qualified statements for the same property, like "according to tool X, licence is GPL, and according to tool Y, licence is MPL".
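For instance (a purely illustrative sketch; the field names here are hypothetical, not the actual schema), a property carrying several qualified statements could look like:

```json
{
  "license": [
    {"value": "GPL", "tool": "X"},
    {"value": "MPL", "tool": "Y"}
  ]
}
```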
Stefano Zacchiroli changed title from "Preparer JSON output for storing multiple value/tool entries" to "Web API: make endpoints that expose extracted metadata return lists of factual information"
Created ~/.config/swh/storage.yml and ~/.config/swh/webapp/webapp.yml
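For reference, a minimal sketch of what `storage.yml` might contain; the exact keys below are an assumption based on swh-storage's configuration format of the era, so double-check them against the module's README before relying on them:

```yaml
# ~/.config/swh/storage.yml -- hypothetical sketch, verify keys against swh-storage docs
storage:
  cls: local
  args:
    db: dbname=softwareheritage-dev
    objstorage:
      cls: pathslicing
      args:
        root: /tmp/swh/objects
        slicing: 0:2/2:4/4:6
```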
And then ran:
```shell
# terminal 0
## install external python deps
cd /path/to/swh-environment
virtualenv --python=python3 .venv
source .venv/bin/activate
for i in `bin/ls-py-modules`; do
    cd $i; pip3 install -r requirements.txt; cd ..
done

## dump test data into postgres
make rebuild-testdata

# terminal 0
## start storage server
cd /path/to/swh-environment
source .venv/bin/activate
source pythonpath.sh
python3 -m swh.storage.api.server ~/.config/swh/storage.yml

# terminal 1
## start webapp
cd /path/to/swh-environment
source .venv/bin/activate
source pythonpath.sh
python3 -m swh.web.manage runserver
```
Now, if I go to http://127.0.0.1:5004/api/1/content/sha1:1fc6129a692e7a87b5450e2ba56e7669d0c5775d/license/, it returns a 503:
```json
{
  "exception": "StorageAPIError",
  "reason": "An unexpected error occurred in the api backend: HTTPConnectionPool(host='127.0.0.1', port=5007): Max retries exceeded with url: /content/fossology_license (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 111] Connection refused',))"
}
```
How do I make swh-indexer available at 127.0.0.1:5007?
Is there anything else that I've to setup in the development environment?
Hi s!, and thanks for your interest in helping us out!
First of all, I should apologize for the lack of "getting started" documentation for developers. It's on my plate, and the stuff you see at https://docs.softwareheritage.org/devel/ is the beginning of it, but it is "not there yet", so your questions are entirely legitimate. We'll try to walk you through what's needed here, hoping that in the future the doc will directly address needs like yours.
Not totally so: licenses and tool must be side by side to be factual.
Nomossa actually detected the 4 licenses (we have only one tool for license detection).
Would updating the db_to_* functions in swh.indexer.storage.converters resolve this task?
I'd say along those lines, yes, probably one step before.
Starting from the get endpoint: the storage will return you the content's associated licenses (including the tool), and you then aggregate those according to my previous comment...
Still, following the sample I proposed, that would then give something like:
```json
[
  { // <- that's still a list because there can be other different ids
    "id": "2d8280fbabf9a1eabbcbc562b9763cb07952118b",
    "facts": [
      {
        "licenses": ["Dual-license", "GPL", "MIT", "MIT-style"],
        "tool": {
          "configuration": {"command_line": "nomossa <filepath>"},
          "id": 1,
          "name": "nomos",
          "version": "3.1.0rc2-31-ga2cbb8c"
        }
      },
      {
        "licenses": ["MIT", "MIT-style"],
        "tool": {
          "configuration": {"command_line": "scancode <filepath>"},
          "id": 2,
          "name": "scancode",
          "version": "9.2.3"
        }
      }
    ]
  }
]
```
Note that this could also be a dict, with the current "facts" as keys and values (that would just diverge from the current pattern we have: list in, list out in the storage api).
That could be converted in the web layer (if that makes sense).
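The aggregation step described above can be sketched in plain Python; the flat row shape below is hypothetical, just mimicking one (content, tool) pair per row as the storage might return them:

```python
from collections import defaultdict

def group_license_rows(rows):
    """Group flat per-(content, tool) rows into one entry per content id."""
    by_id = defaultdict(list)
    for row in rows:
        # each fact keeps the licenses side by side with the tool that found them
        by_id[row['id']].append({'licenses': row['licenses'], 'tool': row['tool']})
    # list in, list out: one entry per distinct content id
    return [{'id': id_, 'facts': facts} for id_, facts in by_id.items()]

# hypothetical flat rows, as if returned by the storage
rows = [
    {'id': 'abc123', 'licenses': ['GPL', 'MIT'], 'tool': {'name': 'nomos'}},
    {'id': 'abc123', 'licenses': ['MIT'], 'tool': {'name': 'scancode'}},
]
grouped = group_license_rows(rows)  # one entry, with two facts
```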
I was able to start the swh-indexer at 127.0.0.1:5007.
I modified the storage module in swh-indexer to make IndexerStorage.content_fossology_license_get return a list of dicts as suggested in #782 (closed).
Here's the diff:
```diff
diff --git a/swh/indexer/storage/__init__.py b/swh/indexer/storage/__init__.py
index e71523d..1c6f0c9 100644
--- a/swh/indexer/storage/__init__.py
+++ b/swh/indexer/storage/__init__.py
@@ -298,9 +298,16 @@ class IndexerStorage():
         db = self.db
         db.store_tmp_bytea(ids, cur)
+        d = {}
         for c in db.content_fossology_license_get_from_temp():
             license = dict(zip(db.content_fossology_license_cols, c))
-            yield converters.db_to_fossology_license(license)
+
+            id_ = license['id']
+            if id_ not in d:
+                d[id_] = []
+            d[id_].append(converters.db_to_fossology_license(license))
+
+        yield self._generate(d)

     @db_transaction
     def content_fossology_license_add(self, licenses,
@@ -548,3 +555,10 @@ class IndexerStorage():
         if not idx:
             return None
         return dict(zip(self.db.indexer_configuration_cols, idx))
+
+    def _generate(self, d):
+        for id_, facts in d.items():
+            yield {
+                'id': id_,
+                'facts': facts
+            }
diff --git a/swh/indexer/storage/converters.py b/swh/indexer/storage/converters.py
index db7a295..3cf5da1 100644
--- a/swh/indexer/storage/converters.py
+++ b/swh/indexer/storage/converters.py
@@ -129,7 +129,6 @@ def db_to_metadata(metadata):

 def db_to_fossology_license(license):
     return {
-        'id': license['id'],
         'licenses': license['licenses'],
         'tool': {
             'id': license['tool_id'],
```
I would like to know if I'm going in the right direction.
Does make rebuild-storage-testdata add objects (hashes) with license information connected to them? If yes, how do I find these objects? -- I would like to test the /api/1/content/sha1:OBJ_HASH/license endpoint with these objects.
> I was able to start the swh-indexer at 127.0.0.1:5007.
Awesome!
> I modified the storage module in swh-indexer to make IndexerStorage.content_fossology_license_get return a list of dicts as suggested in #782 (closed).
> I would like to know if I'm going in the right direction.
Yes, sounds like it :)
> Does make rebuild-storage-testdata add objects (hashes) with license information connected to them? If yes, how do I find these objects? -- I would like to test the /api/1/content/sha1:OBJ_HASH/license endpoint with these objects.
No; that routine drops all dbs and rebuilds them to the latest schema (well, as of the current HEAD in your swh-environment).
To have data, you need to run our different modules (loader, then indexer).
I started writing something, but being thorough takes more time, so I stopped midway to reply to you on the first points already.
More to come soon on how you could have data locally ;)
Note that there is a more complete explanation on actually loading and indexing the data yourself, but it's kinda long (way longer than this one, that is ;)
For what it's worth, that was my initial response $236.
```diff
diff --git a/swh/indexer/storage/__init__.py b/swh/indexer/storage/__init__.py
index e71523d..1f06056 100644
--- a/swh/indexer/storage/__init__.py
+++ b/swh/indexer/storage/__init__.py
@@ -292,15 +292,24 @@ class IndexerStorage():
             list: dictionaries with the following keys:

                 - id (bytes)
-                - licenses ([str]): associated licenses for that content
+                - facts ([str]): associated licenses for that content

         """
         db = self.db
         db.store_tmp_bytea(ids, cur)
+        d = {}
         for c in db.content_fossology_license_get_from_temp():
             license = dict(zip(db.content_fossology_license_cols, c))
-            yield converters.db_to_fossology_license(license)
+
+            id_ = license['id']
+            if id_ not in d:
+                d[id_] = []
+            d[id_].append(converters.db_to_fossology_license(license))
+
+        for id_, facts in d.items():
+            yield {'id': id_, 'facts': facts}
+

     @db_transaction
     def content_fossology_license_add(self, licenses,
diff --git a/swh/indexer/storage/converters.py b/swh/indexer/storage/converters.py
index db7a295..3cf5da1 100644
--- a/swh/indexer/storage/converters.py
+++ b/swh/indexer/storage/converters.py
@@ -129,7 +129,6 @@ def db_to_metadata(metadata):

 def db_to_fossology_license(license):
     return {
-        'id': license['id'],
         'licenses': license['licenses'],
         'tool': {
             'id': license['tool_id'],
```
If the changes look good, I'll update the failing tests and submit a patch for code review.
In #782 (closed), @ardumont wrote:
> Note that there is a more complete explanation on actually loading and indexing the data yourself, but it's kinda long (way longer than this one, that is ;)
> For what it's worth, that was my initial response $236.
Thanks for writing the quick howto! I'll check it out.
That would help in reviewing and possibly suggesting code adaptation in context (as explained below).
And as you are quite advanced in the contributions now, that would make sense ;)
Thanks in advance.
Changes
I currently see a few possible changes, which your really cool demo illustrates.
The value returned should be directly the `d` dictionary's content:

- so either `yield from d` should be ok (possibly, not checked),
- or your `return` must change to plain `d` (you would possibly need to change the decorator from `@db_transaction_generator` to `@db_transaction`).
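As a generic Python illustration of why this matters (nothing swh-specific here): `yield gen` hands the caller a single generator object, while `yield from gen` yields the items themselves:

```python
def items():
    yield {'id': 1}
    yield {'id': 2}

def yields_generator():
    yield items()       # caller receives one generator object, not the dicts

def yields_items():
    yield from items()  # caller receives each dict in turn

assert [type(x).__name__ for x in yields_generator()] == ['generator']
assert list(yields_items()) == [{'id': 1}, {'id': 2}]
```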
Note: It's buggy because in the current state prior to your adaptation, we could already have multiple results (1 per tool) even if we have only one sha1 in the endpoint's input (#782 (closed)).
It's not that much of a deal now, because we currently have only one tool running. But long term, it's not correct; thus this issue in the first place, btw ;)
Implementation-wise, you could use `from collections import defaultdict` and initialize `d` as a `defaultdict(list)`; that way you can remove the `if id_ not in d` test and directly use `d[id_].append`. You can check some use-case samples in the python3 documentation.
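A minimal illustration of that suggestion (generic Python, hypothetical data):

```python
from collections import defaultdict

d = defaultdict(list)  # missing keys are created as empty lists on first access
rows = [('id1', 'GPL'), ('id1', 'MIT'), ('id2', 'MIT')]
for id_, license in rows:
    d[id_].append(license)  # no `if id_ not in d` guard needed

assert d == {'id1': ['GPL', 'MIT'], 'id2': ['MIT']}
```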