brutuscat / medusa

- THIS IS AN OLD FORK - Check out the Medusa Crawler gem instead: "medusa-crawler"

Home Page: https://github.com/brutuscat/medusa-crawler

some URL fragments are being retained & encoded

MothOnMars opened this issue

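Medusa's regex to identify anchor fragments is fairly strict: /#[a-zA-Z0-9_-]*$/

As a result, fragments containing any character outside that set (a space, "&", and so on) are not stripped. The "#" is percent-encoded as %23 and the fragment is retained as part of the link's path. Example: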

Medusa.crawl('https://www.usbr.gov/library/glossary/', depth_limit: 0, discard_page_bodies: true) do |medusa|
  medusa.on_every_page do |page|
    puts page.links.map(&:to_s).select{ |link| /%23/ === link }
  end
end

Result:
https://www.usbr.gov/library/glossary/%23crest%20elevation
https://www.usbr.gov/library/glossary/%23prestressed%20dam
https://www.usbr.gov/library/glossary/%23modifiedhomogeneousearthfilldam%3Emodified%20homogeneous%0D%0Aearthfill%20dam%3C/a%3E,%20%3Ca%20href=
https://www.usbr.gov/library/glossary/%23o&m
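
The mismatch can be reproduced without crawling. The following is a minimal sketch, assuming the fragment is removed by a plain substitution with the regex quoted above. The helper strip_fragment and both constant names are hypothetical and are not Medusa's API, and the looser pattern is only one possible alternative, not necessarily what the PR will use.

```ruby
# Hypothetical helper and constants for illustration; not Medusa's actual code.

# The regex quoted in this issue: only letters, digits, "_" and "-" may follow "#".
STRICT_FRAGMENT = /#[a-zA-Z0-9_-]*$/

# Assumed alternative: drop everything from the first "#" to the end of the
# string, including spaces, "&" and embedded newlines (hence \z and the m flag).
LOOSE_FRAGMENT = /#.*\z/m

# Remove the first match of +pattern+ from +href+; returns a new string.
def strip_fragment(href, pattern)
  href.sub(pattern, '')
end

hrefs = ['#crest', '#crest elevation', '#o&m']

hrefs.each do |href|
  strict = strip_fragment(href, STRICT_FRAGMENT)
  loose  = strip_fragment(href, LOOSE_FRAGMENT)
  puts "#{href.inspect}: strict=#{strict.inspect} loose=#{loose.inspect}"
end
```

With the strict pattern only '#crest' is reduced to an empty string. The other two hrefs come back unchanged, which is consistent with the %23 links in the result above.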

PR to come.