nelsonic / github-scraper

🕷 🕸 crawl GitHub web pages for insights we can't GET from the API ... 💡


Scraper does not account for thousands in Star/Watcher/Fork count

gianlucascoccia opened this issue · comments

I have found that, for popular repositories with thousands of stars and watchers, the scraper does not return the correct counts, probably because it disregards the 'k' suffix appended to the number.

Take, for instance, the Atom repository:
[Screenshot 2020-06-25 at 08 39 26]

This is the output of my simple script, which just prints the data received from the scraper:
[Screenshot 2020-06-25 at 08 41 24]

@gianlucascoccia thanks for opening this issue to inform us that the k values are no longer working.
As you can see from the screenshot you have kindly shared, the commits value is NaN too ... 😕

GitHub have very recently updated their UI and changed a bunch of classes
so our scraper/parser is no longer getting the correct data. #113
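
For context, here is a minimal sketch of why a changed class name shows up as NaN rather than as an error. It assumes the old selector now matches nothing, so the extracted text is an empty string. The `digits_only` helper is hypothetical (not part of the repo) and only mirrors the last two steps of the parser: strip non-digits, then `parseInt` with radix 10.

```javascript
// Hypothetical helper, for illustration only:
// remove every non-digit character, then parse what is left as base 10.
function digits_only (str) {
  return parseInt(str.trim().replace(/[^0-9]/g, ''), 10);
}

// Assumption: a selector that matches no element yields an empty string.
const from_missing_element = digits_only('');
// Text from an element that is still found parses as expected.
const from_present_element = digits_only(' 1,234 commits ');

console.log(Number.isNaN(from_missing_element)); // true
console.log(from_present_element);               // 1234
```

So a NaN in the output most likely points at a stale selector, not at the number parsing itself.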

We have a RegEx that parses the 52.3k to 52300:

/**
 * `parse_int` parses a String e.g: 1.2k and returns an Int 1200
 * @param {String} str - the string to be parsed. e.g: "14.7k"
 * @return {Number} int - the integer representation of the String.
 */
function parse_int (str) {
  return parseInt(
    str
      .trim()
      .replace(/\.(\d)k$/, "$100") // $1 match the digit \d
      .replace(/k$/, "000")
      .replace(/\.(\d)m$/, "$100000") // $1 match the digit \d
      .replace(/m$/, "000000")
      .replace(/[^0-9]/g, '') // drop anything that is not a digit
  , 10)
}

and it has tests:

t.equal(parse_int("300"), 300, '"300" => 300')
t.equal(parse_int("1k"), 1000, '"1k" => 1000')
t.equal(parse_int("4.3k"), 4300, '"4.3k" => 4300')
t.equal(parse_int("89.6k"), 89600, '"89.6k" => 89600')
t.equal(parse_int("146k"), 146000, '"146k" => 146000')
t.equal(parse_int("310k"), 310000, '"310k" => 310000')
t.equal(parse_int("1m"), 1000000, '"1m" => 1000000')
t.equal(parse_int("1.1m"), 1100000, '"1.1m" => 1100000')
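
The tests above cover lowercase suffixes with at most one decimal digit. As a possible alternative, here is a sketch that multiplies by a suffix factor instead of chaining replacements. The name `parse_count` and the `SUFFIX_FACTORS` table are hypothetical, not part of the repo. It assumes the input is only the count text (e.g. "52.3k") and returns `null` for anything else, where the regex chain would return NaN or a misread value (tracing `parse_int("1.25k")` by hand gives 125000).

```javascript
// Hypothetical lookup table: multiplier for each supported suffix.
const SUFFIX_FACTORS = { k: 1e3, m: 1e6 };

/**
 * `parse_count` (hypothetical) parses "1.2k", "1.25K", "1,234" or "3m".
 * @param {String} str - the count text to parse.
 * @return {Number|null} the integer value, or null if str is not a count.
 */
function parse_count (str) {
  const match = String(str)
    .trim()
    .toLowerCase()
    .replace(/,/g, '')                 // "1,234" becomes "1234"
    .match(/^(\d+(?:\.\d+)?)([km])?$/); // number plus optional k/m suffix
  if (!match) return null;             // empty or unexpected text
  const factor = match[2] ? SUFFIX_FACTORS[match[2]] : 1;
  // Math.round removes floating point residue from the multiplication.
  return Math.round(parseFloat(match[1]) * factor);
}

console.log(parse_count('52.3k')); // 52300
console.log(parse_count('1.25K')); // 1250
console.log(parse_count('1,234')); // 1234
console.log(parse_count(''));      // null
```

Returning `null` for unmatched text is a design choice in this sketch: it makes a broken selector visible to the caller instead of silently yielding NaN.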

But as I say, GitHub have changed their UI/classes, so they have "broken" our scraper. 🤦
If you want to help fix this, these are the selectors in the repo file that need updating:

data.tags = $('.list-topics-container').text().trim()
  .replace(/\n /g, '').replace(/ +/g,', ');
data.usedby = parse_int($('.social-count').text());
data.watchers = parse_int(badges['0'].children[0].data);
data.stars = parse_int(badges['1'].children[0].data);
data.forks = parse_int(badges['2'].children[0].data);
data.commits = parse_int($('.commits .num').text());
data.branches = parse_int($('.octicon-git-branch').next().text());
data.releases = parse_int($('.octicon-tag').next().text());
data.langs = []; // languages used in the repo:

A pull request is very much welcome.
Thanks. ☀️