gitorious import: UnicodeEncodeError when reading references
When reading refs, dulwich expects ref names to be valid UTF-8, which is evidently not always the case. When a ref name is not valid UTF-8, the load fails ungracefully.
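The failure mode can be sketched in isolation, without dulwich. The ref name below is hypothetical (the actual offending name in the repository is not shown in this report); the only assumption is that it contains a byte that is not valid UTF-8, which Python's filesystem functions decode to a lone surrogate via the `surrogateescape` error handler.

```python
# Hypothetical ref name containing a byte (0xCD) that is not valid UTF-8.
raw_ref = b"refs/heads/\xcd-branch"

# Python decodes file names with the "surrogateescape" error handler,
# which maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
decoded = raw_ref.decode("utf-8", "surrogateescape")

# A strict re-encode to UTF-8 rejects the lone surrogate.
try:
    decoded.encode("utf-8")
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(repr(decoded))
print(error)
```

The strict encode raises because lone surrogates are not encodable in UTF-8 unless the same error handler is passed again.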
Steps to reproduce with latest swh-loader-git:
repo = 'test-project2009.git'
origin_url = 'http://foo/bar/git/%s' % repo
import logging
logging.basicConfig(level=logging.DEBUG)
from swh.loader.git.tasks import LoadDiskGitRepository
t = LoadDiskGitRepository()
t.run(origin_url=origin_url, directory=repo, date='2016-05-03T15:16:32+00:00')
source: uffizi:/srv/storage/space/mirrors/gitorious.org/mnt/repositories/test-project2009/test-project2009.git
Full stack trace:
python3
Python 3.5.3 (default, Jan 19 2017, 14:11:04)
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> repo = 'test-project2009.git'
>>> origin_url = 'http://foo/bar/git/%s' % repo
>>>
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>>
>>> from swh.loader.git.tasks import LoadDiskGitRepository
>>>
>>> t = LoadDiskGitRepository()
>>> t.run(origin_url=origin_url, directory=repo, date='2016-05-03T15:16:32+00:00')
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Creating git origin for http://foo/bar/git/test-project2009.git
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done creating git origin for http://foo/bar/git/test-project2009.git
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Creating origin_visit for origin 2 at time 2016-05-03 15:16:32+00:00
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done Creating origin_visit for origin 2 at time 2016-05-03 15:16:32+00:00
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Sending 5 contents
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done sending 5 contents
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Sending 5 directories
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done sending 5 directories
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Sending 5 revisions
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done sending 5 revisions
ERROR:swh.scheduler.task.LoadDiskGitRepository:Loading failure, updating to `partial` status
Traceback (most recent call last):
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 896, in load
    self.store_data()
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 1005, in store_data
    self.send_batch_occurrences(self.get_occurrences())
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 693, in send_batch_occurrences
    send_in_packets(occurrences, self.send_occurrences, packet_size)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 35, in send_in_packets
    for obj in objects:
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-git/swh/loader/git/loader.py", line 218, in get_occurrences
    for refs, target in self.repo.refs.as_dict().items()
  File "/usr/lib/python3/dist-packages/dulwich/refs.py", line 164, in as_dict
    keys = self.keys(base)
  File "/usr/lib/python3/dist-packages/dulwich/refs.py", line 143, in keys
    return self.allkeys()
  File "/usr/lib/python3/dist-packages/dulwich/refs.py", line 470, in allkeys
    sys.getfilesystemencoding())
UnicodeEncodeError: 'utf-8' codec can't encode character '\udccd' in position 11: surrogates not allowed
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Updating origin_visit for origin 2 with status partial
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done updating origin_visit for origin 2 with status partial
DEBUG:amqp:Start from server, version: 0.9, properties: {'platform': 'Erlang/OTP', 'copyright': 'Copyright (C) 2007-2016 Pivotal Software, Inc.', 'version': '3.6.6', 'product': 'RabbitMQ', 'cluster_name': 'rabbit@corellia.lan', 'capabilities': {'connection.blocked': True, 'per_consumer_qos': True, 'direct_reply_to': True, 'exchange_exchange_bindings': True, 'publisher_confirms': True, 'consumer_cancel_notify': True, 'basic.nack': True, 'consumer_priorities': True, 'authentication_failure_close': True}, 'information': 'Licensed under the MPL. See http://www.rabbitmq.com/'}, mechanisms: ['AMQPLAIN', 'PLAIN'], locales: ['en_US']
DEBUG:amqp:Open OK!
DEBUG:amqp:using channel_id: 1
DEBUG:amqp:Channel open
{'status': 'failed'}
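One possible way to handle such names gracefully, shown only as a sketch: recover the original bytes by encoding with the same `surrogateescape` handler. The helper name `ref_name_to_bytes` is hypothetical and is not part of dulwich or swh-loader-git; whether the actual fix belongs in the loader or in dulwich is not settled here.

```python
def ref_name_to_bytes(name):
    """Hypothetical helper: turn a ref name decoded from the filesystem
    back into its original bytes.

    Lone surrogates produced by "surrogateescape" decoding are mapped
    back to the raw bytes they stand for instead of raising.
    """
    return name.encode("utf-8", "surrogateescape")


# A well-formed name is encoded as usual.
plain = ref_name_to_bytes("refs/heads/master")

# A name carrying an escaped non-UTF-8 byte round-trips to that byte.
odd = ref_name_to_bytes("refs/heads/\udccd")

print(plain, odd)
```

Working with ref names as bytes avoids assuming any encoding for them, at the cost of callers having to accept names that cannot be displayed as text without escaping.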
Migrated from T911 (view on Phabricator)