ThatXliner / unmarkd

An extremely configurable markdown reverser for Python3.

Home Page: https://pypi.org/project/unmarkd/

[Bug]: Parser for H-tags also includes "h"eader tag when trying to parse

healplease opened this issue

Prerequisites

  • I have searched the issues and believe that it has not already been reported
  • I have made sure this bug reproduces on the latest version
  • I agree to follow the Code of Conduct

Bug description

Calling unmarkd.unmark(soup) on a BeautifulSoup object raises ValueError: invalid literal for int() with base 10: 'eader'. The heading handler appears to treat the header tag as one of the h1-h6 tags and then tries to parse "eader" as the heading level.

Stacktrace:

  File "C:\Users\healplease\Downloads\test\service.py", line 180, in get_text
    text = unmarkd.unmark(soup)
  File "C:\Users\healplease\Downloads\test\venv\lib\site-packages\unmarkd\__init__.py", line 12, in unmark
    return unmarkers.BasicUnmarker().unmark(html)
  File "C:\Users\healplease\Downloads\test\venv\lib\site-packages\unmarkd\unmarkers.py", line 287, in unmark
    return self.__parse(html).strip().replace("\u0000", "\uFFFD")
  File "C:\Users\healplease\Downloads\test\venv\lib\site-packages\unmarkd\unmarkers.py", line 179, in __parse
    output += self.resolve_handler_func(name)(child)
  File "C:\Users\healplease\Downloads\test\venv\lib\site-packages\unmarkd\unmarkers.py", line 194, in tag_div
    return self.__parse(child)
  File "C:\Users\healplease\Downloads\test\venv\lib\site-packages\unmarkd\unmarkers.py", line 182, in __parse
    output += "#" * int(name[1:]) + " " + self.__parse(child) + "\n"
ValueError: invalid literal for int() with base 10: 'eader'
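
The last frame of the trace computes the heading prefix with int(name[1:]). A minimal sketch of why that fails for a header tag, assuming the branch is reached for any tag name that starts with "h" (the actual condition in unmarkers.py is not shown in the trace, and heading_prefix is a hypothetical helper, not part of unmarkd):

```python
def heading_prefix(name: str) -> str:
    # Mirrors the expression from the last traceback frame:
    #     "#" * int(name[1:])
    # Everything after the first character is parsed as the heading level.
    return "#" * int(name[1:])


# Works for a real heading tag name.
print(heading_prefix("h2"))

# Fails for "header": name[1:] is "eader", which is not an integer.
try:
    heading_prefix("header")
except ValueError as exc:
    print(exc)
```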

Reproduction steps

  1. Import unmarkd and bs4.
  2. Get HTML that contains a header tag and parse it with BS4 using html.parser (example HTML: https://www.octoparse.com/blog/9-best-free-web-crawlers-for-beginners).
  3. Call unmarkd.unmark on the soup object.

Expected result: only h1-h6 tags are converted to # headings in Markdown.
Observed result: the header tag is also treated as a heading, which raises the exception.
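
One possible way to get the expected behavior is to match the tag name strictly before converting it to a heading level. This is only a sketch: heading_level is a hypothetical helper, not part of unmarkd's API, and how it would be wired into the parser is left open.

```python
import re

# Hypothetical helper, not part of unmarkd's API.
# Matches exactly "h1" through "h6" and captures the level digit.
_HEADING_RE = re.compile(r"h([1-6])")


def heading_level(name: str):
    """Return 1-6 for the tag names h1-h6, or None for any other tag name."""
    match = _HEADING_RE.fullmatch(name)
    return int(match.group(1)) if match else None
```

With a check like this, tag names such as header, hr, html, and head return None and can fall through to their own handlers instead of reaching the int() call.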

Other information

OS - Windows
Python - 3.11.2

Reproduces how often

I can reproduce this bug 100% of the time.