Improve UTF-8 UnicodeDecodeError handling in the JSON conversion layer and update API documentation
When converting the raw bytes data of swh objects to a JSON-serializable representation, swh-web catches the UnicodeDecodeError exception raised when trying to decode some strings expected to be UTF-8 encoded:
- revision authors and committers: when a person's name/fullname cannot be decoded, a new key named decoding_failures is added to the person dictionary, indicating which fields could not be decoded, and the non-UTF-8 strings are then decoded with the backslash escape mode (see example)
- revision messages: when a revision message cannot be decoded, a new key named message_decoding_failed is added to the revision dictionary and the message is set to None (see example)
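The two behaviours described above can be sketched roughly as follows. This is only an illustration, not the actual swh-web code: the helper names `decode_person` and `decode_revision_message` are hypothetical, and the exact shape of the values stored under `decoding_failures` (a list of field names) and `message_decoding_failed` (a boolean) is an assumption.

```python
from typing import Any, Dict


def decode_person(person: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: decode the bytes fields of a person dictionary.

    Fields that are not valid UTF-8 are decoded with the standard
    "backslashreplace" error handler and listed under "decoding_failures".
    """
    decoded: Dict[str, Any] = {}
    failures = []
    for key, value in person.items():
        if isinstance(value, bytes):
            try:
                decoded[key] = value.decode("utf-8")
            except UnicodeDecodeError:
                failures.append(key)
                # Invalid bytes become escape sequences such as \xe9.
                decoded[key] = value.decode("utf-8", "backslashreplace")
        else:
            decoded[key] = value
    if failures:
        # Assumed shape: a list of the field names that failed to decode.
        decoded["decoding_failures"] = failures
    return decoded


def decode_revision_message(revision: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: decode a revision message, dropping it on failure."""
    decoded = dict(revision)
    message = revision.get("message")
    if isinstance(message, bytes):
        try:
            decoded["message"] = message.decode("utf-8")
        except UnicodeDecodeError:
            # Assumed shape: a boolean flag; the message itself is lost.
            decoded["message_decoding_failed"] = True
            decoded["message"] = None
    return decoded
```

Note the asymmetry: the person handler keeps an escaped form of the undecodable data, while the message handler discards it.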
This UTF-8 decoding error handling is inconsistent and should be replaced with something more generic. Applying the error handler implemented for revision authors globally seems the right way to do it.
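One possible shape for such a generic handler is sketched below, assuming the converter walks nested dictionaries and lists. The function names `decode_with_escape` and `to_json_serializable` are hypothetical, and reusing a `decoding_failures` list on every dictionary is an assumption about how the author-style handling would be generalized, not a description of the final design.

```python
import json
from typing import Any, Tuple


def decode_with_escape(raw: bytes) -> Tuple[str, bool]:
    """Decode bytes as UTF-8; on failure, fall back to backslash escapes.

    Returns the decoded string and a flag telling whether decoding failed.
    """
    try:
        return raw.decode("utf-8"), False
    except UnicodeDecodeError:
        return raw.decode("utf-8", "backslashreplace"), True


def to_json_serializable(obj: Any) -> Any:
    """Hypothetical generic converter applying the same handler everywhere."""
    if isinstance(obj, dict):
        out = {}
        failures = []
        for key, value in obj.items():
            if isinstance(value, bytes):
                out[key], failed = decode_with_escape(value)
                if failed:
                    failures.append(key)
            else:
                out[key] = to_json_serializable(value)
        if failures:
            # Assumed convention: list the undecodable fields of this dict.
            out["decoding_failures"] = failures
        return out
    if isinstance(obj, (list, tuple)):
        return [to_json_serializable(item) for item in obj]
    if isinstance(obj, bytes):
        # Bare bytes outside a dict: escape them, with nowhere to record a flag.
        return decode_with_escape(obj)[0]
    return obj


revision = {
    "message": b"fix \xff bug",
    "author": {"name": b"Andr\xe9 Dupont", "email": b"a@example.org"},
}
converted = to_json_serializable(revision)
serialized = json.dumps(converted)  # no bytes remain, so this succeeds
```

With this approach the revision message would be kept in escaped form rather than being set to None, so the same key would report failures for every object type.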
Once this is done, a new section should be added to the top-level Web API documentation describing the fields related to UTF-8 decoding errors that may be found in JSON responses.
Migrated from T2617 (view on Phabricator)