xitanggg / open-resume

OpenResume is a powerful open-source resume builder and resume parser. https://open-resume.com/

Parser parsed about 1/2 right from a LinkedIn-generated resume

xitanggg opened this issue

Documenting a feedback issue

Interesting. I put my LinkedIn-generated resume into the parser, and it only parsed it about 1/2 right.
Which makes me wonder: who is doing the wrong thing here? Is LinkedIn generating a bad resume, or is the ATS parser unable to parse what I assume should be a pretty standard format? I imagine a lot of people use that feature of LinkedIn.

This was my feedback (thanks for making it a ticket). I'm attaching the PDF I used for reference.

Resume-Jeremy-Edberg.pdf

Thanks again @jedberg for reporting this issue. I wasn't aware that LinkedIn has a resume builder and missed testing that against the parser.

I tested my LinkedIn resume on Greenhouse ATS, and it correctly parsed my profile (name, email) and work experiences (company, title, date). So the LinkedIn resume works fine there, and the problem lies in the OpenResume parser algorithm.

I took a look at the parser algorithm and just pushed 2 small fixes for the 2 main issues that occur when parsing a LinkedIn resume.

Issue 1. The summary section isn't identified correctly, causing the name in the profile section to be extracted incorrectly.
The fix is small: add "summary" as a section header keyword so the parser identifies the summary section correctly. Once the summary section is separated from the profile section, the name can be identified correctly.
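
To illustrate the idea behind this fix, here is a minimal sketch of keyword-based section detection. It is not the actual OpenResume code: the keyword list, the function names, and the exact-match rule are all assumptions made for the example.

```typescript
// Hypothetical keyword list for illustration; the real parser's list may differ.
// Without "summary" here, summary lines would stay in the profile section.
const SECTION_KEYWORDS: string[] = ["summary", "experience", "education", "skills"];

// A line counts as a section header if it matches a keyword exactly (case-insensitive).
function isSectionHeader(line: string): boolean {
  return SECTION_KEYWORDS.includes(line.trim().toLowerCase());
}

// Lines before the first header are assumed to belong to the profile section.
function groupLinesIntoSections(lines: string[]): Map<string, string[]> {
  const sections = new Map<string, string[]>();
  let current = "profile";
  sections.set(current, []);
  for (const line of lines) {
    if (isSectionHeader(line)) {
      current = line.trim().toLowerCase();
      sections.set(current, []);
    } else {
      sections.get(current)!.push(line);
    }
  }
  return sections;
}
```

With "summary" in the list, the summary paragraph lands in its own section, so a name extractor that only looks at the profile section no longer sees summary text.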

Issue 2. The date in the work experience section isn't extracted correctly.
For each work experience, the parser algorithm tries to split the lines into a non-descriptions part and a descriptions part. It checks whether there is any bullet point and uses it as the divider where the descriptions part starts. If no bullet point is found, it falls back to treating the first 2 lines as the non-descriptions part, since in a typical resume the first line is the company and the second line is the title and date. On LinkedIn, people don't commonly use bullet points, so the fallback is triggered. But a LinkedIn resume has 3 lines (one each for company name, title, and date) instead of the typical 2, so the date, being last, falls into the descriptions part and isn't extracted correctly.

This is also a small fix: a better fallback heuristic that identifies a description line by checking whether it has at least 8 words. It seems to work well.
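
A rough sketch of this divider logic, for illustration only. The function names, the bullet character list, and the final "first 2 lines" fallback are assumptions for the example and not the actual parser code; only the 8-word threshold comes from the fix described above.

```typescript
// Hypothetical bullet characters; the real parser may recognize a different set.
const BULLET_POINTS: string[] = ["•", "‣", "◦", "*"];
const MIN_DESCRIPTION_WORDS = 8;

function startsWithBullet(line: string): boolean {
  const text = line.trim();
  return BULLET_POINTS.some((bullet) => text.startsWith(bullet));
}

function countWords(line: string): number {
  return line.trim().split(/\s+/).filter((word) => word.length > 0).length;
}

// Returns the index of the first line of the descriptions part.
function findDescriptionStart(lines: string[]): number {
  // Preferred divider: the first line that starts with a bullet point.
  const bulletIdx = lines.findIndex((line) => startsWithBullet(line));
  if (bulletIdx !== -1) return bulletIdx;
  // Fallback heuristic: the first line with at least 8 words is a description.
  const longIdx = lines.findIndex((line) => countWords(line) >= MIN_DESCRIPTION_WORDS);
  if (longIdx !== -1) return longIdx;
  // Assumed last resort: treat the first 2 lines as the non-descriptions part.
  return Math.min(2, lines.length);
}

// LinkedIn-style entry: company, title, and date on separate lines, no bullets.
const linkedInEntry = [
  "Acme Corp",
  "Senior Engineer",
  "Jan 2015 - Mar 2020",
  "Led the migration of the billing platform to a new cloud provider.",
];
console.log(findDescriptionStart(linkedInEntry));
```

In this example the date line has only 5 words, so it stays in the non-descriptions part, where a date extractor can find it. A longer date line could still cross the threshold, which is the kind of edge case a word-count heuristic has to accept.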

With these 2 fixes, the parser algorithm should parse a LinkedIn resume at least 80% correctly. I do notice another issue: it incorrectly subdivides a work experience when it detects a new line in a description. This could be a small fix, but I will leave it for now, since I intended the parser algorithm mainly for a standard resume PDF, where people are more likely to use bullet points than to write paragraphs in descriptions.

As a side note, I expect the parser algorithm to work well, but not as well as (or even close to) an industry ATS, since these companies have far more data and resources to enhance their algorithms, e.g. they probably index keywords for all universities, companies, top positions, top last names, etc. So I would say that if this parser algorithm can parse a resume 80%+ correctly, it should give great confidence that the resume is friendly to an industry ATS.

Closing this issue for now. Very impressive resume btw