[Bug]: Parser for H-tags also includes "h"eader tag when trying to parse

healplease opened this issue 5 months ago

Prerequisites

I have searched the issues and believe that it has not already been reported
I have made sure this bug reproduces on the latest version
I agree to follow the Code of Conduct

Bug description

Converting a BeautifulSoup object to Markdown with unmarkd.unmark(soup) raises ValueError: invalid literal for int() with base 10: 'eader' when the HTML contains a header tag.

Stacktrace:

  File "C:\Users\healplease\Downloads\test\service.py", line 180, in get_text
    text = unmarkd.unmark(soup)
  File "C:\Users\healplease\Downloads\test\venv\lib\site-packages\unmarkd\__init__.py", line 12, in unmark
    return unmarkers.BasicUnmarker().unmark(html)
  File "C:\Users\healplease\Downloads\test\venv\lib\site-packages\unmarkd\unmarkers.py", line 287, in unmark
    return self.__parse(html).strip().replace("\u0000", "\uFFFD")
  File "C:\Users\healplease\Downloads\test\venv\lib\site-packages\unmarkd\unmarkers.py", line 179, in __parse
    output += self.resolve_handler_func(name)(child)
  File "C:\Users\healplease\Downloads\test\venv\lib\site-packages\unmarkd\unmarkers.py", line 194, in tag_div
    return self.__parse(child)
  File "C:\Users\healplease\Downloads\test\venv\lib\site-packages\unmarkd\unmarkers.py", line 182, in __parse
    output += "#" * int(name[1:]) + " " + self.__parse(child) + "\n"
ValueError: invalid literal for int() with base 10: 'eader'
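
The 'eader' in the message appears to come from the slice in the last frame: for a tag named header, name[1:] is 'eader', which int() cannot parse. Below is a minimal standalone sketch of that expression. The helper heading_prefix is hypothetical and written only for illustration; it is not part of unmarkd.

```python
def heading_prefix(name: str) -> str:
    # Mirrors the expression from the last traceback frame:
    #   "#" * int(name[1:])
    return "#" * int(name[1:])


# A real heading tag works as intended.
print(heading_prefix("h2"))

# "header"[1:] is "eader", so int() raises ValueError.
try:
    heading_prefix("header")
except ValueError as exc:
    print(exc)
```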

Reproduction steps

Import unmarkd and bs4.
Get HTML that contains a header tag and parse it with BS4 using the html.parser parser (example HTML: https://www.octoparse.com/blog/9-best-free-web-crawlers-for-beginners).
Call unmarkd.unmark on the soup object.

Expected result: only h1-h6 tags are converted to # headings in Markdown.
Observed result: the header tag is also treated as a heading, which raises the exception.
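
One possible way to get the expected behaviour is to match heading tag names exactly instead of slicing any name that begins with "h". The check that unmarkers.py performs before line 182 is not visible in the traceback, so this is only a sketch under that assumption; heading_level and HEADING_RE are hypothetical names, not unmarkd's API.

```python
import re
from typing import Optional

# Matches exactly h1..h6 and nothing else (not "header", "hr", "html", "h7").
HEADING_RE = re.compile(r"h([1-6])")


def heading_level(name: str) -> Optional[int]:
    """Return the heading level for h1-h6, or None for any other tag name."""
    match = HEADING_RE.fullmatch(name)
    return int(match.group(1)) if match else None


for tag in ("h1", "h6", "header", "hr", "h7"):
    print(tag, heading_level(tag))
```

With a guard like this, a tag such as header would fall through to whatever generic handling the parser already applies to other tags rather than reaching the int() conversion.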

Other information

OS - Windows
Python - 3.11.2

Reproduces how often

I can reproduce this bug 100% of the time