some URL fragments are being retained & encoded
MothOnMars opened this issue · comments
Martha Thompson commented
Medusa's regex to identify anchor fragments is fairly strict: /#[a-zA-Z0-9_-]*$/
As a result, fragments with non-alphanumeric characters are being retained and encoded. Example:
Medusa.crawl('https://www.usbr.gov/library/glossary/', depth_limit: 0, discard_page_bodies: true) do |medusa|
  medusa.on_every_page do |page|
    puts page.links.map(&:to_s).select { |link| /%23/ === link }
  end
end
Result:
https://www.usbr.gov/library/glossary/%23crest%20elevation
https://www.usbr.gov/library/glossary/%23prestressed%20dam
https://www.usbr.gov/library/glossary/%23modifiedhomogeneousearthfilldam%3Emodified%20homogeneous%0D%0Aearthfill%20dam%3C/a%3E,%20%3Ca%20href=
https://www.usbr.gov/library/glossary/%23o&m
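For illustration, here is a minimal standalone sketch of why fragments like these slip past the pattern. It does not call Medusa at all: `strip_fragment` is a hypothetical helper, and `LOOSE_FRAGMENT` is only an assumed alternative pattern for comparison, not necessarily what the fix will use.

```ruby
# Pattern quoted above from Medusa.
STRICT_FRAGMENT = /#[a-zA-Z0-9_-]*$/

# Hypothetical looser pattern (an assumption for comparison only):
# everything from the first '#' to the end of the string.
LOOSE_FRAGMENT = /#.*\z/m

# Hypothetical helper: remove whatever the given pattern matches.
def strip_fragment(href, pattern)
  href.sub(pattern, '')
end

['glossary/#crest', 'glossary/#crest elevation', 'glossary/#o&m'].each do |href|
  strict = strip_fragment(href, STRICT_FRAGMENT)
  loose  = strip_fragment(href, LOOSE_FRAGMENT)
  puts "#{href.inspect}: strict=#{strict.inspect} loose=#{loose.inspect}"
end
```

With the strict pattern, only the first href loses its fragment. A space or `&` after the `#` means the character class cannot reach the end of the line, so the match fails and the `#...` part stays in the link, where it presumably gets percent-encoded later.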
PR to come.