Bad links to email@domain.com cause duplicate
Deleted user commented
Do you want to request a feature or report a bug?
Bug
What is the current behavior?
When indexing a website that contains a link to an email address on the same domain, the crawler treats the link as though it were a new page. E.g. indexing google.com where the following HTML appears:
<a href="contact@google.com">link</a>
The crawler will then index pages at:
- https://contact@google.com/page1
- https://contact@google.com/page2
- https://contact@google.com/page3
- https://contact@google.com/page4
These are valid URLs, but they should be discounted as duplicates of the same pages on google.com.
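To illustrate why such a URL is technically valid: the text before the @ is parsed as the userinfo part of the authority, so the host is still google.com. A quick check in Python (used here only for illustration, the crawler itself may parse URLs differently):

```python
from urllib.parse import urlsplit

# One of the URLs from the list above.
parts = urlsplit("https://contact@google.com/page1")

print(parts.username)  # contact
print(parts.hostname)  # google.com
print(parts.path)      # /page1
```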
If the current behavior is a bug, please provide the steps to reproduce.
As above
What is the expected behavior?
The section of the URL prior to the @ symbol (the userinfo part) should be discounted when deciding whether a page has already been indexed.
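A minimal sketch of that behaviour, assuming URLs are normalised before being compared. The helper names `strip_userinfo` and `dedupe` are hypothetical and are not taken from the project or from the pull request:

```python
from urllib.parse import urlsplit, urlunsplit


def strip_userinfo(url: str) -> str:
    """Return url with any 'user[:password]@' prefix removed from the host part."""
    parts = urlsplit(url)
    # netloc looks like 'userinfo@host:port'; keep only what follows the last '@'.
    host = parts.netloc.rpartition("@")[2]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


def dedupe(urls):
    """Yield each URL once, comparing URLs after the userinfo has been stripped."""
    seen = set()
    for url in urls:
        key = strip_userinfo(url)
        if key not in seen:
            seen.add(key)
            yield key
```

With this, `https://contact@google.com/page1` and `https://google.com/page1` collapse to the same entry, so the page is only indexed once.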
Deleted user commented
Pull request created here:
#58
Lars Graubner commented
Merged and released.