tomas / needle

Nimble, streamable HTTP client for Node.js. With proxy, iconv, cookie, deflate & multipart support.

Home Page: https://www.npmjs.com/package/needle


nasdaq.com article url returns nothing

colorways opened this issue · comments

Issue

When trying to access the link below, Needle doesn't do anything - no response, no error. It's like the request goes into a black hole. I don't get back the done! or any other indication of success or failure.

Many other sites work just fine with the below code.

I've tried it with callbacks, pipes, and promises.

I've also tried it on different networks.

The below code was run unsuccessfully on Needle 2.5.0, 2.5.2, and 2.6.0.

It was also run unsuccessfully on Node 14.14.0, 14.15.4, and 15.7.0

const needle = require('needle');

needle.get("https://www.nasdaq.com/articles/wall-streets-retail-frenzy-deepens-as-gamestop-etsy-soar-2021-01-26", function (error, response) {
    // For this URL the callback never fires, so "done!" is never printed.
    console.log("done!");
    if (!error && response.statusCode === 200)
        console.log(response.body);
});

I'm either going to be rated 11 out of 10 as a dummy or something weird is happening. Does anyone have an idea what the problem could be?

The server is stalling requests, as far as I can tell.

$ curl -v https://www.nasdaq.com/articles/wall-streets-retail-frenzy-deepens-as-gamestop-etsy-soar-2021-01-26
*   Trying 23.56.218.34:443...
* TCP_NODELAY set
* Connected to www.nasdaq.com (23.56.218.34) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: C=US; ST=Connecticut; L=Shelton; O=Nasdaq, Inc.; CN=www.nasdaq.com
*  start date: Oct  7 00:00:00 2020 GMT
*  expire date: Nov  8 00:00:00 2021 GMT
*  subjectAltName: host "www.nasdaq.com" matched cert's "www.nasdaq.com"
*  issuer: C=US; O=DigiCert Inc; CN=DigiCert SHA2 Secure Server CA
*  SSL certificate verify ok.
> GET /articles/wall-streets-retail-frenzy-deepens-as-gamestop-etsy-soar-2021-01-26 HTTP/1.1
> Host: www.nasdaq.com
> User-Agent: curl/7.65.3
> Accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing

It just freezes there and does nothing. The connection doesn't close but it doesn't send any data.
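A stall like this can at least be turned into a visible error. The sketch below assumes needle's documented `open_timeout`, `response_timeout`, and `read_timeout` options; as far as I know the last two default to 0 (disabled), which would explain why a server that accepts the connection and then sends nothing produces neither a response nor an error. The helper names and the timeout values are illustrative, not something from this thread.

```javascript
// Hypothetical helper: merges illustrative timeout defaults into needle options.
// Caller-supplied options win over the defaults.
function withTimeouts(options = {}) {
  return {
    open_timeout: 5000,      // give up if the connection is not established within 5 s
    response_timeout: 10000, // give up if no response headers arrive within 10 s
    read_timeout: 10000,     // give up if the body stalls after the headers arrive
    ...options,
  };
}

// Hypothetical wrapper around needle.get; needle is loaded lazily so that
// withTimeouts() can be used and tested without the package installed.
function getWithTimeouts(url, options, callback) {
  const needle = require('needle');
  return needle.get(url, withTimeouts(options), callback);
}
```

With this in place, calling `getWithTimeouts(url, {}, callback)` against a stalling server should invoke the callback with an error once a timeout elapses instead of hanging silently.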

@tomas Tomás, first... I cannot thank you enough for your work on needle. The kindness and effort you've put into this library is incredible and I'm so grateful to you for how much needle has helped my programs.

As for this server stalling... Interesting. I took your idea and curled the same url but this time using all the headers from Firefox (where the page loads without any problem). With all of the headers in the curl, it pulls the page down successfully and immediately:

curl -v -A "Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0" -H "host: www.nasdaq.com" -H "Upgrade-Insecure-Requests: 1" -H "Sec-GPC: 1" -H "Connection: keep-alive" -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8" -H "Accept-Encoding: gzip, deflate, br" -H "Accept-Language: en-US,en;q=0.5" https://www.nasdaq.com/articles/wall-streets-retail-frenzy-deepens-as-gamestop-etsy-soar-2021-01-26

This gave me some hope. But when I tried setting these same headers in needle, the stalling appears to still happen. Am I doing this right or is there anything you can see that's wrong here?

const needle = require('needle');

var options = {
    headers: {
        'Host': 'www.nasdaq.com',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Sec-GPC': '1'
    },
    user_agent: 'Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0'
}
needle.get('https://www.nasdaq.com/articles/wall-streets-retail-frenzy-deepens-as-gamestop-etsy-soar-2021-01-26', options, function (error, response) {
    console.log("done!");
    if (!error && response.statusCode == 200)
        console.log(response.body);
});

Thank you for taking some time and helping to troubleshoot this, Tomás. Greatly appreciated!

Apparently it's the combination of the User-Agent + the Accept-Encoding header.

Try running with DEBUG=needle and make sure the headers are being sent correctly.
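If changing the launch command is inconvenient, the same flag can probably be set from inside the script. This is a minimal sketch that assumes the `debug` package reads `process.env.DEBUG` when it is first loaded, so the variable has to be set before needle is required. The function names are hypothetical.

```javascript
// Hypothetical helper: adds the "needle" namespace to a DEBUG value
// without discarding namespaces that are already enabled.
function enableNeedleDebug(env = process.env) {
  const namespaces = env.DEBUG ? env.DEBUG.split(',') : [];
  if (!namespaces.includes('needle')) namespaces.push('needle');
  env.DEBUG = namespaces.join(',');
  return env.DEBUG;
}

// Hypothetical loader: sets DEBUG first, then requires needle.
// If another module has already loaded `debug`, this may be too late to take effect.
function loadNeedleWithDebug() {
  enableNeedleDebug();
  return require('needle');
}
```

Using `const needle = loadNeedleWithDebug();` in place of the usual require should then print needle's debug output, including the outgoing headers, to stderr.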

And thanks!

It only took me a year to come back and close this issue. I finally found the problem (and a partial explanation).

As Needle uses HTTP/1.1, I noticed that both Firefox and curl were using HTTP/2 for requests. I forced both of them to use HTTP/1.1 (curl by adding --http1.1, and Firefox by going to about:config and setting network.http.spdy.enabled.http2: false). In both cases, the page still returned but, under certain header conditions, I could get curl to hang using HTTP/1.1 like needle does.

In the end, I found that using older, mobile-based user-agents along with connection: keep-alive returns the page with Needle where nothing else would. I found the connection: keep-alive was added by both Firefox and curl automatically where Needle didn't have it. I'm a Junior Varsity programmer so I can't say why regular, new Firefox/Chrome user-agents work in curl but not in Needle. The older, mobile-based ones do though. I don't know. For nasdaq.com, these minimum headers were strictly required or the request hung (meaning no error, no success):

headers: {
    'user-agent': 'Mozilla/5.0 (Linux; Android 5.1.1; SM-G928X Build/LMY47X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.83 Mobile Safari/537.36',
    'accept-language': 'en-US,en;q=0.5',
    'accept-encoding': 'gzip, deflate, br',
    'connection': 'keep-alive'
}

I'm going to close the issue because this seems a pretty esoteric edge case and may not be worth investigating. Tomas, thank you again for all the time you spent writing Needle. It's such a great package and helps so many of us!

Future Troubleshooting Tips

For those who may stumble on this looking to troubleshoot your own hanging problem, here are a few suggestions:

  1. Leverage the debug package Tomas has used in Needle by setting DEBUG=needle in the start script of your package.json (so something like "start": "DEBUG=needle node index.js", where index.js stands in for your own entry point). This will output the exact headers being sent by Needle and some additional, possibly helpful info.

  2. For more investigating of your specific url, use curl at the command line with the -v switch for Verbose output. Set headers using -H. Study the console output closely and compare it to Firefox/Chrome's Network tab in Developer Tools. If headers exist in curl or the browser, try them out in your Needle request. If you want to try to make the browser/curl fail, remove headers there (this can be done in Firefox with a right-click on the GET request in the Network tab > Edit and Resend and remove headers there). I used the following for curl:

    curl -v -H 'accept-encoding: gzip, deflate, br' \
            -H 'accept-language: en-US,en;q=0.5' \
            -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36' \
            https://www.nasdaq.com/articles/security-concerns-sink-solana-investors-on-watch \
            --output nasdaq.html

  3. Try including/excluding headers in both the curl and Needle requests. See what's absolutely required and what isn't - it'll minimize the surface area of the problem.

  4. Try a wide range of user-agents. I probably tried 10 before I found some older, mobile ones that worked. This one worked for me: Mozilla/5.0 (Linux; Android 5.1.1; SM-G928X Build/LMY47X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.83 Mobile Safari/537.36
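
The include/exclude approach from tips 3 and 4 can be roughly automated. The sketch below is only an illustration: `dropOneHeader`, `probe`, and `findRequiredHeaders` are hypothetical names, and it assumes needle's `response_timeout` option so that a hanging request is reported as an error instead of blocking the loop. It removes one header at a time and reports what each variant returns.

```javascript
// Builds one variant of the header set per header, each with that header removed.
function dropOneHeader(headers) {
  return Object.keys(headers).map((name) => {
    const rest = { ...headers };
    delete rest[name];
    return { removed: name, headers: rest };
  });
}

// Sends a single GET and resolves with a short description of the outcome.
// needle is loaded lazily so dropOneHeader() works without the package installed.
function probe(url, headers, timeoutMs = 10000) {
  const needle = require('needle');
  return new Promise((resolve) => {
    needle.get(url, { headers, response_timeout: timeoutMs }, (error, response) => {
      resolve(error ? `error: ${error.message}` : `status ${response.statusCode}`);
    });
  });
}

// Tries every one-header-removed variant in turn and logs the result.
async function findRequiredHeaders(url, headers) {
  for (const variant of dropOneHeader(headers)) {
    const outcome = await probe(url, variant.headers);
    console.log(`without ${variant.removed}: ${outcome}`);
  }
}
```

Calling `findRequiredHeaders(url, headers)` with a header set that is known to work should show which single removals make the request fail or time out. The same loop can be pointed at a list of user-agent strings instead of header names.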