[Bug]: Parser for H-tags also includes "h"eader tag when trying to parse
healplease opened this issue
Alex commented
Prerequisites
- I have searched the issues and believe that it has not already been reported
- I have made sure this bug reproduces on the latest version
- I agree to follow the Code of Conduct
Bug description
A ValueError (invalid literal for int() with base 10: 'eader') is raised when converting a soup element to Markdown with unmarkd.unmark(soup), where soup is a BeautifulSoup object.
Stacktrace:
File "C:\Users\healplease\Downloads\test\service.py", line 180, in get_text
text = unmarkd.unmark(soup)
File "C:\Users\healplease\Downloads\test\venv\lib\site-packages\unmarkd\__init__.py", line 12, in unmark
return unmarkers.BasicUnmarker().unmark(html)
File "C:\Users\healplease\Downloads\test\venv\lib\site-packages\unmarkd\unmarkers.py", line 287, in unmark
return self.__parse(html).strip().replace("\u0000", "\uFFFD")
File "C:\Users\healplease\Downloads\test\venv\lib\site-packages\unmarkd\unmarkers.py", line 179, in __parse
output += self.resolve_handler_func(name)(child)
File "C:\Users\healplease\Downloads\test\venv\lib\site-packages\unmarkd\unmarkers.py", line 194, in tag_div
return self.__parse(child)
File "C:\Users\healplease\Downloads\test\venv\lib\site-packages\unmarkd\unmarkers.py", line 182, in __parse
output += "#" * int(name[1:]) + " " + self.__parse(child) + "\n"
ValueError: invalid literal for int() with base 10: 'eader'
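The last frame hints at where the 'eader' in the message comes from: dropping the first character of the tag name header leaves 'eader', which int() cannot parse. Below is a minimal sketch of just that expression, run outside unmarkd. The function name heading_prefix is illustrative and is not part of the library.

```python
# Illustrative only: mirrors the expression `"#" * int(name[1:])` from the
# last frame of the traceback, without importing unmarkd.
def heading_prefix(name: str) -> str:
    # Treats everything after the first character as the heading level.
    return "#" * int(name[1:])


print(heading_prefix("h2"))  # works for a real heading tag

try:
    heading_prefix("header")  # "header"[1:] == "eader"
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

This suggests that any tag whose name starts with "h" but is not followed by a number could reach the same line and fail the same way.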
Reproduction steps
- Import unmarkd and bs4.
- Get HTML with a header tag in it and parse it with BS4 using the html.parser parser (example HTML: https://www.octoparse.com/blog/9-best-free-web-crawlers-for-beginners).
- Call unmarkd.unmark on the soup object.
Expected result: only h1-h6 tags should be converted to # headings in Markdown.
Observed result: the header tag is also treated as a heading during conversion, resulting in the exception.
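One way to get the expected behavior would be to match the tag name against h1 through h6 exactly before reading the level. The following is a sketch under that assumption, not the library's actual fix. The helper name heading_level is hypothetical.

```python
import re
from typing import Optional

# Matches exactly "h1" .. "h6"; used with fullmatch so "header", "hr",
# and "h7" are all rejected.
_HEADING_RE = re.compile(r"h([1-6])")


def heading_level(name: str) -> Optional[int]:
    """Return 1-6 for h1-h6 tag names, otherwise None (hypothetical helper)."""
    match = _HEADING_RE.fullmatch(name)
    return int(match.group(1)) if match else None


for tag in ("h1", "h6", "header", "hr", "h7"):
    print(tag, heading_level(tag))
```

With a check like this, a parser could emit "#" * level only when a level is returned and fall back to its generic tag handling otherwise.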
Other information
OS - Windows
Python - 3.11.2
Reproduces how often
I can reproduce this bug 100% of the time