
pypi: Use BeautifulSoup for parsing HTML instead of xmltodict

Another issue found while retesting the listers locally.

xmltodict now raises an error when trying to parse the HTML content of the https://pypi.org/simple/ page, see below:

Traceback (most recent call last):
  File "/home/anlambert/.virtualenvs/swh/bin/swh", line 8, in <module>
    sys.exit(main())
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/swh/core/cli/__init__.py", line 135, in main
    return swh(auto_envvar_prefix="SWH")
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/anlambert/swh/swh-environment/swh-lister/swh/lister/cli.py", line 65, in run
    get_lister(lister, **config).run()
  File "/home/anlambert/swh/swh-environment/swh-lister/swh/lister/pattern.py", line 121, in run
    for page in self.get_pages():
  File "/home/anlambert/swh/swh-environment/swh-lister/swh/lister/pypi/lister.py", line 57, in get_pages
    page_xmldict = xmltodict.parse(response.text)
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/xmltodict.py", line 327, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: mismatched tag: line 6, column 4
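
A `mismatched tag` error like the one above is what a strict XML parser reports when given HTML that is not well-formed XML. The sketch below reproduces that class of failure with the standard library expat parser, which is the parser the traceback shows xmltodict calling into. The sample markup and the `try_parse_as_xml` helper are hypothetical and only for illustration; the sample is not the actual content of the PyPI page.

```python
import xml.parsers.expat

# Hypothetical sample, NOT the real PyPI page: the <meta> element has no
# closing tag, which is valid HTML but not well-formed XML.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Simple index</title>
  </head>
  <body>
    <a href="/simple/requests/">requests</a>
  </body>
</html>
"""


def try_parse_as_xml(text):
    """Return the expat error message, or None if the text parses as XML."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError as exc:
        return str(exc)
    return None


error = try_parse_as_xml(SAMPLE_HTML)
print(error)
```

Here expat is still waiting for a closing `</meta>` when it reaches `</head>`, so it rejects the document.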

So use the BeautifulSoup HTML parser instead, as it is already a requirement of swh-lister and it parses the PyPI HTML page without failing.
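
As a rough illustration of the approach, here is a minimal sketch of extracting package names from a simple-index style page with BeautifulSoup. It assumes the page lists one `<a>` element per package. The sample markup and the `list_package_names` helper are hypothetical; this is not the code of the actual lister.

```python
from bs4 import BeautifulSoup

# Hypothetical sample, NOT the real PyPI page: the unclosed <meta> element
# makes it invalid XML, but it is ordinary HTML.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Simple index</title>
  </head>
  <body>
    <a href="/simple/requests/">requests</a>
    <a href="/simple/numpy/">numpy</a>
  </body>
</html>
"""


def list_package_names(html):
    """Return the text of every <a> element found in the HTML document."""
    # "html.parser" is the parser from the Python standard library; it
    # tolerates markup that a strict XML parser rejects.
    soup = BeautifulSoup(html, "html.parser")
    return [link.get_text(strip=True) for link in soup.find_all("a")]


print(list_package_names(SAMPLE_HTML))
```

The same function could be applied to `response.text` from an HTTP request to the index page, in place of the `xmltodict.parse` call seen in the traceback.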

Also drop the no longer used xmltodict from the requirements.


Migrated from D5027 (view on Phabricator)
