nelsonic / github-scraper

I have found that, for popular repositories with thousands of stars and watchers, the scraper does not return the correct counts, probably because it disregards the 'k' appended to the number.

Take, for instance, the Atom repository:

This is the output of my simple script, which just prints the data received from the scraper:

@gianlucascoccia thanks for opening this issue to inform us that the k values are no longer working.
As you can see from the screenshot you have kindly shared, the commits value is NaN too ... 😕

GitHub have very recently updated their UI and changed a bunch of classes
so our scraper/parser is no longer getting the correct data. #113

We have a RegEx that parses a string such as 52.3k into the integer 52300:

github-scraper/lib/utils.js

Lines 2 to 16 in 47d0a46

    
    /**
     * `parse_int` parses a String e.g: 1.2k and returns an Int 1200
     * @param {String} str - the string to be parsed. e.g: "14.7k"
     * @return {Number} int - the integer representation of the String.
     */
    function parse_int (str) {
      return parseInt(
        str
        .trim()
        .replace(/\.(\d)k$/, "$100") // $1 match the digit \d
        .replace(/k$/, "000")
        .replace(/\.(\d)m$/, "$100000") // $1 match the digit \d
        .replace(/m$/, "000000")
        .replace(/[^0-9]/g, '')
      , 10)
    }
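To make the replace chain easier to follow, here is a minimal sketch that runs the same patterns one at a time and records each intermediate string. `traceParseInt` is a hypothetical helper written for this comment, not part of the repo; it only mirrors the regexes shown above.

```javascript
// Hypothetical helper (not in github-scraper): applies the same
// replacements as parse_int, one step at a time, keeping each change.
function traceParseInt (str) {
  const steps = [
    [/\.(\d)k$/, "$100"],     // "52.3k" -> "52" + "3" + "00"
    [/k$/, "000"],            // "146k"  -> "146000"
    [/\.(\d)m$/, "$100000"],  // "1.1m"  -> "1" + "1" + "00000"
    [/m$/, "000000"],         // "1m"    -> "1000000"
    [/[^0-9]/g, ""]           // drop anything left that is not a digit
  ];
  let current = str.trim();
  const trace = [current];
  for (const [pattern, replacement] of steps) {
    const next = current.replace(pattern, replacement);
    if (next !== current) { trace.push(next); } // record only real changes
    current = next;
  }
  return { trace: trace, value: parseInt(current, 10) };
}

console.log(traceParseInt("52.3k").trace.join(" -> ")); // 52.3k -> 52300
console.log(traceParseInt("1.1m").value);               // 1100000
```

With a single capture group, `"$100"` is read as `$1` followed by the literal `00`, which is why the decimal digit ends up in the hundreds place.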

and it has tests:

github-scraper/test/utils.test.js

Lines 7 to 14 in 47d0a46

    
    t.equal(parse_int("300"), 300, '"300" => 300')
    t.equal(parse_int("1k"), 1000, '"1k" => 1000')
    t.equal(parse_int("4.3k"), 4300, '"4.3k" => 4300')
    t.equal(parse_int("89.6k"), 89600, '"89.6k" => 89600')
    t.equal(parse_int("146k"), 146000, '"146k" => 146000')
    t.equal(parse_int("310k"), 310000, '"310k" => 310000')
    t.equal(parse_int("1m"), 1000000, '"1m" => 1000000')
    t.equal(parse_int("1.1m"), 1100000, '"1.1m" => 1100000')
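Those tests cover well-formed counts. One case they do not show is the empty string, which would explain a NaN in the output. The sketch below assumes that a selector which no longer matches anything hands an empty string to `parse_int`; that is my guess at how the breakage shows up, not something confirmed in the scraper. `parse_int` is copied from lib/utils.js so the snippet runs on its own.

```javascript
// Copy of parse_int from lib/utils.js, repeated so this runs standalone.
function parse_int (str) {
  return parseInt(
    str
    .trim()
    .replace(/\.(\d)k$/, "$100")
    .replace(/k$/, "000")
    .replace(/\.(\d)m$/, "$100000")
    .replace(/m$/, "000000")
    .replace(/[^0-9]/g, '')
  , 10)
}

// Assumption: an unmatched selector yields "" as the text to parse.
const missing = parse_int("");
const present = parse_int("52.3k");

console.log(Number.isNaN(missing)); // true
console.log(present);               // 52300
```

So the k handling itself still behaves as tested; the NaN points at the input string rather than at the RegEx.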

But as I say, GitHub have changed their UI/classes so they have "broken" our scraper. 🤦
If you want to help fix this, the classes need updating in the repo file:

github-scraper/lib/repo.js

Lines 26 to 35 in 47d0a46

    
    data.tags = $('.list-topics-container').text().trim()
                .replace(/\n /g, '').replace(/ +/g,', ');
    data.usedby = parse_int($('.social-count').text());
    data.watchers = parse_int(badges['0'].children[0].data);
    data.stars    = parse_int(badges['1'].children[0].data);
    data.forks    = parse_int(badges['2'].children[0].data);
    data.commits  = parse_int($('.commits .num').text());
    data.branches = parse_int($('.octicon-git-branch').next().text());
    data.releases = parse_int($('.octicon-tag').next().text());
    data.langs = []; // languages used in the repo:
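When updating these selectors, it may help to see at a glance which fields came back as NaN. A rough sketch, assuming a `data` object shaped like the one built above; `find_nan_fields` is a hypothetical helper and the sample values are made up purely for illustration.

```javascript
// Hypothetical helper (not part of github-scraper): returns the names of
// numeric fields that failed to parse, i.e. whose value is NaN.
function find_nan_fields (data) {
  return Object.keys(data).filter(function (key) {
    return typeof data[key] === 'number' && Number.isNaN(data[key]);
  });
}

// Made-up sample object, for illustration only.
const sample = { stars: 52300, forks: NaN, commits: NaN, langs: [] };

console.log(find_nan_fields(sample)); // [ 'forks', 'commits' ]
```

Each name in the result corresponds to one line in repo.js whose selector probably needs a new class.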

A pull request is very much welcome.
Thanks. ☀️

Fixed. See: https://github.com/nelsonic/github-scraper/actions/runs/7549448498/job/20553449066#step:5:655


Scraper does not account for thousands in Star/Watcher/Fork count