nielstron / quantulum3

Library for unit extraction - fork of quantulum for python3

use quantulum to remove quantities from string

AxelStbl opened this issue · comments

commented

Is your feature request related to a problem? Please describe.

I would like to use quantulum to remove quantities from a string, not only extract them.

Describe the solution you'd like

Basically, I want to get the list of surfaces found so I can remove them from the text, as in the modification I made here.

###############################################################################
def parse(text, lang="en_US", verbose=False):
   """
   Extract all quantities from unstructured text.

   Modified to also return the list of surfaces (matched substrings).
   """

   log_format = "%(asctime)s --- %(message)s"
   logging.basicConfig(format=log_format)

   if verbose:  # pragma: no cover
       prev_level = logging.root.getEffectiveLevel()
       logging.root.setLevel(logging.DEBUG)
       #_LOGGER.debug("Verbose mode")

   orig_text = text
   #_LOGGER.debug('Original text: "%s"', orig_text)

   text = clean_text(text, lang)
   values = extract_spellout_values(text, lang)
   text, shifts = substitute_values(text, values)

   quantities = []
   surfaces = []
   for item in reg.units_regex(lang).finditer(text):

       groups = dict([i for i in item.groupdict().items() if i[1] and i[1].strip()])
       #_LOGGER.debug(u"Quantity found: %s", groups)

       try:
           uncert, values = get_values(item, lang)

           unit, unit_shortening = get_unit(item, text)
           surface, span = get_surface(shifts, orig_text, item, text, unit_shortening)
           surfaces.append(surface)
           objs = build_quantity(
               orig_text, text, item, values, unit, surface, span, uncert, lang
           )
           if objs is not None:
               quantities += objs
       except ValueError as err:
           _LOGGER.debug("Could not parse quantity: %s", err)

   if verbose:  # pragma: no cover
       logging.root.setLevel(prev_level)

   return quantities, surfaces

AFAIK each returned quantity has an attribute "surface" that contains the matched text from the passed string. Maybe you can use that instead? You may even be able to use the exact start and end of the surface (two indices).
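
The suggestion above can be sketched roughly as follows, without patching parse(). This is only an illustration: remove_spans and remove_quantities are hypothetical helper names, not part of quantulum3, and the sketch assumes each returned quantity exposes its two indices as a "span" attribute holding (start, end), in line with the span that the code above passes to build_quantity.

```python
import re


def remove_spans(text, spans):
    """Return text with every (start, end) character range cut out."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor:  # overlapping span: skip the part already removed
            start = cursor
        if end <= start:
            continue
        pieces.append(text[cursor:start])
        cursor = end
    pieces.append(text[cursor:])
    # Collapse the runs of whitespace left behind by the removed surfaces.
    return re.sub(r"\s{2,}", " ", "".join(pieces)).strip()


def remove_quantities(text):
    """Strip every quantity that quantulum3 finds in text (hypothetical helper)."""
    from quantulum3 import parser  # assumes quantulum3 is installed

    # Assumption: each quantity carries a `span` attribute holding (start, end).
    return remove_spans(text, [q.span for q in parser.parse(text)])
```

Cutting by index rather than calling str.replace on each surface avoids removing an identical substring that was not itself recognised as a quantity.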

commented

Thanks, indeed it works like this!