alexferrari88 / sbstck-dl

CLI tool for downloading Substack newsletters for archival purposes, offline reading, or data analysis.


Unable to use --sid option

lenzj opened this issue

Thank you for creating sbstck-dl. I'm trying to download a private post that I've subscribed to so I can read it on my ereader. Below is the general command that I've used.

$ sbstck-dl download --sid xxxxxx....xxxx --url https://my-substack-url

The command completes without error; however, it only downloads the abbreviated public version. I suspect I'm using an incorrect cookie string from my browser session. Could you provide some pointers or guidance on how to locate the correct cookie? Also, would it be possible to add an error message to the sbstck-dl utility to notify the user if the cookie could not be used successfully?

Thanks!

Thank you for reporting the issue. Can you please post the output of this command (please obfuscate sensitive data before doing so, including the full text of the post, should it appear in the output)?

curl -X GET 'URL_OF_NEWSLETTER' -H 'Cookie: substack.sid=YOUR_VALUE'

If that doesn't work, try this:

curl -v --cookie "substack.sid=YOUR_VALUE" URL_OF_NEWSLETTER

Your suggestion was helpful. After playing around with curl, I think I figured out the issue. My browser session doesn't have a cookie called "substack.sid"; mine is named "connect.sid" instead.

When I use the first curl command with substack.sid=xxxxxxxxxx, I get the abbreviated public version.

When I run the same command with connect.sid=xxxxxxxxxx, the download includes the full text.

It looks like sbstck-dl has the cookie name hardcoded as "substack.sid" at the location below.

Name: "substack.sid",

If the CLI options could define both the cookie name and the cookie value, I think it would work. I'll see if I can submit a patch to add the option.

Thanks again!
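
For context, a minimal Go sketch of the request curl makes (hypothetical code, not taken from sbstck-dl; the URL and cookie value are placeholders) shows why the cookie's Name field is the part that has to match the browser:

package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	req, err := http.NewRequest(http.MethodGet, "https://my-substack-url", nil)
	if err != nil {
		panic(err)
	}
	// The Name field must match the browser's session cookie exactly:
	// "connect.sid" here, where sbstck-dl was sending "substack.sid".
	req.AddCookie(&http.Cookie{Name: "connect.sid", Value: "YOUR_VALUE"})

	res, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer res.Body.Close()

	body, err := io.ReadAll(res.Body)
	if err != nil {
		panic(err)
	}
	// With the right cookie name the body should contain the full post,
	// not just the public preview.
	fmt.Println(len(body))
}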

I tried the following patch and command, but it still wasn't working for some reason; it still only downloads the abbreviated public version. The curl command definitely downloads the full page, though. I'm probably missing something obvious. I'll have to investigate more tomorrow.

$ sbstck-dl download --sidname connect.sid --sidval xxxx....xxxxx --url https://my-substack-url
diff --git a/cmd/root.go b/cmd/root.go
index 29867b6..6fa21e1 100644
--- a/cmd/root.go
+++ b/cmd/root.go
@@ -19,7 +19,8 @@ var (
        ratePerSecond  int
        beforeDate     string
        afterDate      string
-       substackID     string
+       cookieName     string
+       cookieValue    string
        ctx            = context.Background()
        parsedProxyURL *url.URL
        fetcher        *lib.Fetcher
@@ -46,10 +47,13 @@ func Execute() {
        if ratePerSecond == 0 {
                log.Fatal("rate must be greater than 0")
        }
-       if substackID != "" {
+       if cookieValue != "" {
+               if cookieName == "" {
+                       cookieName = "substack.sid"
+               }
                cookie = &http.Cookie{
-                       Name:  "substack.sid",
-                       Value: substackID,
+                       Name:  cookieName,
+                       Value: cookieValue,
                }
        }
        fetcher = lib.NewFetcher(lib.WithRatePerSecond(ratePerSecond), lib.WithProxyURL(parsedProxyURL), lib.WithCookie(cookie))
@@ -62,7 +66,8 @@ func Execute() {
 
 func init() {
        rootCmd.PersistentFlags().StringVarP(&proxyURL, "proxy", "x", "", "Specify the proxy url")
-       rootCmd.PersistentFlags().StringVarP(&substackID, "sid", "i", "", "The substack.sid cookie value (required for private newsletters)")
+       rootCmd.PersistentFlags().StringVarP(&cookieValue, "sidval", "i", "", "The sid cookie value (required for private newsletters)")
+       rootCmd.PersistentFlags().StringVarP(&cookieName, "sidname", "n", "", "The sid cookie name (default is substack.sid)")
        rootCmd.PersistentFlags().BoolVarP(&verbose, "verbose", "v", false, "Enable verbose output")
        rootCmd.PersistentFlags().IntVarP(&ratePerSecond, "rate", "r", lib.DefaultRatePerSecond, "Specify the rate of requests per second")
        rootCmd.PersistentFlags().StringVar(&beforeDate, "before", "", "Download posts published before this date (format: YYYY-MM-DD)")

According to Substack's Privacy Policy:

Cookie Name: substack.sid / connect.sid
Cookie Type: Persistent
Cookie Purpose: Session identifier (login, etc)
Cookie Lifetime: 90 days max
Cookie Domain: .substack.com

This means that the cookie name can be either substack.sid or connect.sid (as you reported as well).

I'm pushing a new version soon that tries to fix the issue. I'll notify you here when it's ready for you to try.

@lenzj can you please pull the latest commits and check if this is fixed for you?

Thank you for making those updates. I pulled the latest commits and ran the command below, but unfortunately it still downloaded the abbreviated public version. I noticed the cookie I extracted from Firefox was URL-encoded (it contained various %NN values), so I decoded it and tried that as well, still with no luck. My guess is that curl is either interpreting the cookie differently than sbstck-dl or submitting it differently, and that is why it works with curl. Let me know if there's anything else I should try.

$ sbstck-dl download --cookie_name connect.sid --cookie_val xxxxxx.....xxxxxx --url https://my-substack-url
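
One way to test that guess (a sketch using only the Go standard library; the URL and cookie value are placeholders) is to dump the request exactly as Go would put it on the wire and compare its Cookie header line by line with the output of curl -v:

package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
)

func main() {
	req, err := http.NewRequest(http.MethodGet, "https://my-substack-url", nil)
	if err != nil {
		panic(err)
	}
	req.AddCookie(&http.Cookie{Name: "connect.sid", Value: "YOUR_VALUE"})

	// DumpRequestOut renders the request in wire format, so any
	// re-encoding of the cookie by net/http becomes visible.
	dump, err := httputil.DumpRequestOut(req, false)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s", dump)
}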

I had a random thought that maybe sbstck-dl is fetching the private HTML properly but missing content during the extraction step. So I made the crude hack below to feed the raw HTML file from curl into sbstck-dl and see whether it extracts the content properly. It worked perfectly, processing the raw download of the private post. So the extraction code is working fine; it's something in the fetching step.

diff --git a/lib/fetcher.go b/lib/fetcher.go
index 659baf1..6fe471e 100644
--- a/lib/fetcher.go
+++ b/lib/fetcher.go
@@ -6,7 +6,8 @@ import (
 	"io"
 	"net/http"
 	"net/url"
-	"strconv"
+	"os"
+	"log"
 	"time"
 
 	"github.com/cenkalti/backoff/v4"
@@ -206,6 +207,7 @@ func (f *Fetcher) FetchURL(ctx context.Context, url string) (io.ReadCloser, erro
 // fetch performs the actual HTTP GET request to the specified URL and returns the response body and any encountered error.
 // It checks for too many requests (status code 429) and handles it by returning a FetchError.
 func (f *Fetcher) fetch(ctx context.Context, url string) (io.ReadCloser, error) {
+/*
 	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
 	if err != nil {
 		return nil, err
@@ -238,6 +240,12 @@ func (f *Fetcher) fetch(ctx context.Context, url string) (io.ReadCloser, error)
 	}
 
 	return res.Body, nil
+*/
+	rawhtml, err := os.Open("./raw.html")
+	if err != nil {
+		log.Fatal(err)
+	}
+	return rawhtml, nil
 }
 
 // makeDefaultBackoff creates and returns the default exponential backoff configuration.

Too bad it still doesn't work, but very good guesses on your part!

You don't have to say yes, and if you don't, no hard feelings, but would it be possible for you to share your connect.sid with me so that I can get better insights? Right now I'm just shooting in the dark based on third-party directions :)

If you can, my email is the first letter of my name, period, ferrari88 at gmail. (I mean, in the age of LLMs, trying to obfuscate email addresses like this is sort of a joke 😅)

Yes, that would definitely make things easier if you could test and debug things directly. I sent you an email with the info. Thank you!

Thank you! As soon as I have some time, I will try to get to the bottom of it.

I have even consulted ChatGPT-4, but to no avail.

What I have tried:

  • setting the user agent to curl's
  • forcing HTTP/2
  • forcing the TLS version to 1.2
  • using a cookie jar

I'm afraid I'm out of my depth here, but I will keep thinking about possible solutions.

Just to confirm your findings: the problem is in the fetcher, because the page it downloads does not contain the full text.

Also, Postman works as curl does, but Thunder Client from VS Code fails.
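
For reference, a rough sketch combining the attempts from the list above into one client (the exact settings, including the User-Agent string, are assumptions reconstructed from the list, not sbstck-dl's actual code):

package main

import (
	"crypto/tls"
	"net/http"
	"net/http/cookiejar"
	"net/url"
)

func main() {
	// A cookie jar instead of a per-request AddCookie call.
	jar, err := cookiejar.New(nil)
	if err != nil {
		panic(err)
	}

	client := &http.Client{
		Jar: jar,
		Transport: &http.Transport{
			ForceAttemptHTTP2: true, // keep HTTP/2 despite the custom TLS config
			TLSClientConfig: &tls.Config{
				MinVersion: tls.VersionTLS12,
				MaxVersion: tls.VersionTLS12, // pin the TLS version to 1.2
			},
		},
	}

	u, err := url.Parse("https://my-substack-url")
	if err != nil {
		panic(err)
	}
	jar.SetCookies(u, []*http.Cookie{{Name: "connect.sid", Value: "YOUR_VALUE"}})

	req, err := http.NewRequest(http.MethodGet, u.String(), nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("User-Agent", "curl/8.0.1") // mimic curl's User-Agent

	res, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	res.Body.Close()
}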

I might have found the issue. I will post the fix soon.

Basically, there is nothing wrong with the way the request is made. The issue is in the way the Execute function was working: none of the persistent flags were actually used 🤦‍♂️
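
A hedged reconstruction of that bug pattern (flag and variable names are taken from the diff earlier in this thread; the PersistentPreRun placement is just one way to fix the ordering): Cobra only populates flag variables when rootCmd.Execute() parses the command line, so any setup code that reads them before that point sees the defaults.

package main

import (
	"fmt"
	"log"

	"github.com/spf13/cobra"
)

var cookieValue string

var rootCmd = &cobra.Command{
	Use: "sbstck-dl",
	// PersistentPreRun fires after Cobra has parsed the command line,
	// so by this point cookieValue holds what the user actually passed.
	PersistentPreRun: func(cmd *cobra.Command, args []string) {
		fmt.Println("cookie value:", cookieValue)
	},
	Run: func(cmd *cobra.Command, args []string) {},
}

func init() {
	rootCmd.PersistentFlags().StringVarP(&cookieValue, "cookie_val", "c", "", "The sid cookie value")
}

func main() {
	// The bug pattern: reading cookieValue here, before rootCmd.Execute(),
	// always sees the default "" because flags are parsed inside Execute().
	if err := rootCmd.Execute(); err != nil {
		log.Fatal(err)
	}
}

Moving the cookie and fetcher construction into a hook that runs after parsing (or into the command's Run function) ensures the user-supplied value actually reaches the request.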