timhutton / twitter-archive-parser

Python code to parse a Twitter archive and output in various ways

Needed: Map missing user_id -> user_handle with remote services

timhutton opened this issue · comments

This will improve followers.txt, following.txt, DMs.md where currently many handles are missing.

Suggestions and initial work on how to retrieve these handles have appeared in several places recently. Thank you!

Ping: @flauschzelle, @lenaschimmel, @press-rouch, @n1ckfg (but of course anyone can contribute)

[Edit: the PRs that were mentioned are now merged]

The remote lookup is of course much slower than any local operations, therefore I would suggest that it happens as an optional preprocessing step. My suggestion is a separate remote lookup script with arguments for what you want to retrieve, depending on your use case, available bandwidth, disk space, time, etc.
e.g.
remote_lookup.py --dm_users --following --likes --liked_photos --liked_videos --thread_replies
This would use the guest api to build a database of all the data on disk, which parser.py can then optionally use to populate its output.
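As a rough illustration of the proposed command line, here is a minimal sketch of how such a script could declare its options with argparse. The flag names come from the example above; the script itself, the function name and the help texts are assumptions, and no lookup logic is included.

```python
import argparse


def build_arg_parser():
    """Build a parser for the hypothetical remote_lookup.py script."""
    parser = argparse.ArgumentParser(
        prog="remote_lookup.py",
        description="Optional preprocessing step: fetch remote data for parser.py.")
    # Flag names are taken from the suggestion above; help texts are guesses.
    lookups = [
        ("--dm_users", "look up users that appear in direct messages"),
        ("--following", "look up accounts you follow"),
        ("--likes", "retrieve liked tweets"),
        ("--liked_photos", "retrieve photos from liked tweets"),
        ("--liked_videos", "retrieve videos from liked tweets"),
        ("--thread_replies", "retrieve replies in threads"),
    ]
    for flag, help_text in lookups:
        # Every lookup is opt-in, so nothing remote happens by default.
        parser.add_argument(flag, action="store_true", help=help_text)
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_arg_parser().parse_args(["--dm_users", "--following"])
requested = [name for name, enabled in vars(args).items() if enabled]
print("Requested lookups:", ", ".join(requested))
```

Making every lookup opt-in keeps the default run cheap, which matches the idea of choosing what to fetch based on bandwidth, disk space and time.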

We currently have four available ways to resolve user IDs to user handles:

| Method | Tested | Reliability | Available data | Standalone script | Integration into main script | Speed |
| --- | --- | --- | --- | --- | --- | --- |
| Search in existing archive data | yes | works offline -> perfect | handle, sometimes display name | yes | yes * | instant |
| via tweeterid.com | yes | frequent outages **, *** | only handle | yes | yes * | 1.5 sec per id |
| via Standard API | no | should be good *** | probably full profile data | yes | no | >= 1 sec per id |
| via Guest API | yes | should be good *** | full profile data | yes | no | <= 0.5 sec per id |

Footnotes:

  • * in my fork, already functioning well, but needs some cleanup
  • ** often works for a few minutes, fails for a few minutes, works...
  • *** as long as twitter.com still works, and as they don't change or switch off the API

My intuition would be to focus on Search in existing archive data plus via Guest API, and keep the other two approaches as a possible fallback if the Guest API should fail in the future.
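To make the "local first, remote as fallback" idea concrete, here is a small sketch of a resolver chain. All function names are hypothetical, and the remote resolver is a stand-in that always fails, so the example runs offline.

```python
def lookup_in_archive(user_id, known_handles):
    """Offline lookup in handles already extracted from the archive."""
    return known_handles.get(user_id)


def guest_api_placeholder(user_id):
    """Stand-in for a real Guest API call; it always fails in this sketch."""
    raise ConnectionError("remote lookup not implemented in this sketch")


def resolve_handle(user_id, resolvers):
    """Try each resolver in order and return the first handle found, else None."""
    for resolver in resolvers:
        try:
            handle = resolver(user_id)
        except Exception:
            # A failing service should not stop the remaining fallbacks.
            continue
        if handle:
            return handle
    return None


# Example data: this ID/handle pair is the one mentioned later in the thread.
known_handles = {"14114392": "thoughtbot"}
resolvers = [
    lambda user_id: lookup_in_archive(user_id, known_handles),
    guest_api_placeholder,
]
print(resolve_handle("14114392", resolvers))    # found in the local data
print(resolve_handle("2246902119", resolvers))  # no resolver succeeds
```

Ordering the resolvers from cheapest to most fragile means the remote services are only contacted for IDs the archive cannot answer.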

The Guest API seems extremely powerful, since it gives us access to full profile data as well as to full tweets, which is also useful for #6, #20, #22, #39, #72, #73.

(Thank you @lenaschimmel! I was just about to ask these questions and try to collate everything, and you've done it much better than I would have.)

Is the Guest API documented somewhere? @press-rouch's branch seems to have an entire module for twitter_guest_api - is that published as a package?

I couldn't find much information about the Twitter Guest API, but @nogira has a TypeScript and a Rust implementation of it. Interesting bit in their README, written three months ago:

-- iT SEEMS THE TWITTER STANDARD V1.1 API ACTUALLY WORKS WITH GUEST TOKEN TOO --

I did not check if this is still true, but that would be huge!

twitter_guest_api is my heavily adapted version of twitter-video-dl (kudos to @inteoryx) - I stuck it in a module so it wouldn't clutter the root. I believe it is reverse engineered from the website javascript, so it is likely to be fragile in the medium to long term (although the first version was written over a year ago, so there hasn't been much churn in that period).
The original code did a one-shot query - I refactored it to do one-time initialisation of the headers, refreshing of the guest token when it expires, and multiple query types.

@lenaschimmel Confirmed - https://api.twitter.com/1.1/users/show.json?user_id=<id> works with the same bearer token and guest token headers as my current implementation! It gets more fields, including the most recent post, but not such an overwhelming amount that it hits bandwidth too badly. This should allow me to strip down the twitter_guest_api implementation significantly. It still needs to get hold of the tokens, but it doesn't have to muck around with the exploratory requests and query mapping.

I'd be interested to see a minimal code snippet for retrieving a handle.

Here's my latest twitter_guest_api implementation.

Example:

import requests
import twitter_guest_api

with requests.Session() as session:
    api = twitter_guest_api.TwitterGuestAPI(session)
    user_id = "1389666600341544960"
    user = api.get_account(session, user_id)
    print(f"{user_id} mapped to {user['name']} ({user['screen_name']})")

I was just about to push the change into my fork but then I noticed the metadata for tweets was missing image URLs and the alt text, so I'm looking into that now. User account data seems to be complete though.

Hooray for undocumented features! Adding tweet_mode=extended appears to get me the image URLs and alt text.
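For illustration, this is roughly how the query URL could be built with that parameter added. It is only a sketch: the helper name and the tweet ID are made up, and it assumes tweet_mode=extended is passed as an ordinary query-string parameter alongside trim_user.

```python
from urllib.parse import urlencode

SHOW_STATUS_ENDPOINT = "https://api.twitter.com/1.1/statuses/show.json"


def build_status_url(tweet_id, include_user=True, extended=True):
    """Build the query URL for a single tweet (hypothetical helper)."""
    params = {"id": tweet_id}
    if not include_user:
        # Only a numerical user id is returned when trim_user is set.
        params["trim_user"] = 1
    if extended:
        # Undocumented for this endpoint; observed to add image URLs and alt text.
        params["tweet_mode"] = "extended"
    return f"{SHOW_STATUS_ENDPOINT}?{urlencode(params)}"


# "1234567890" is a placeholder, not a real tweet ID.
print(build_status_url("1234567890"))
```

Using urlencode rather than string concatenation keeps the separators and escaping correct as more parameters are added.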

As stand-alone code, with some debug prints:

"""Utilities for downloading from Twitter"""

import json
import logging
import re
import requests

# https://developer.twitter.com/en/docs/twitter-api/v1/tweets/post-and-engage/api-reference/get-statuses-show-id
SHOW_STATUS_ENDPOINT = "https://api.twitter.com/1.1/statuses/show.json"
# https://developer.twitter.com/en/docs/twitter-api/v1/accounts-and-users/follow-search-get-users/api-reference/get-users-show
SHOW_USER_ENDPOINT = "https://api.twitter.com/1.1/users/show.json"
# Undocumented!
GUEST_TOKEN_ENDPOINT = "https://api.twitter.com/1.1/guest/activate.json"
BEARER_TOKEN_PATTERN = re.compile(r'"(AAA\w+%\w+)"')

def send_request(url, session_method, headers):
    """Attempt an http request"""
    print('Sending request:', url, session_method.__name__, headers)
    response = session_method(url, headers=headers, stream=True)
    if response.status_code != 200:
        raise Exception(f"Failed request to {url}: {response.status_code} {response.reason}")
    return response.content.decode("utf-8")

def get_guest_token(session, headers):
    """Request a guest token and add it to the headers"""
    print('post:', GUEST_TOKEN_ENDPOINT, headers)
    guest_token_response = session.post(GUEST_TOKEN_ENDPOINT, headers=headers, stream=True)
    guest_token_json = json.loads(guest_token_response.content)
    # Use .get() so a response without the key raises the intended error, not a KeyError.
    guest_token = guest_token_json.get('guest_token')
    if not guest_token:
        raise Exception("Failed to retrieve guest token")
    logging.info("Retrieved guest token %s", guest_token)
    headers['x-guest-token'] = guest_token

def get_response(url, session, headers):
    """Attempt to get the requested url. If the guest token has expired, get a new one and retry."""
    print('get:', url, headers)
    response = session.get(url, headers=headers, stream=True)
    if response.status_code == 429:
        # rate limit exceeded?
        logging.warning("Error %i: %s", response.status_code, response.text.strip())
        logging.info("Trying new guest token")
        get_guest_token(session, headers)
        print('get:', url, headers)
        response = session.get(url, headers=headers, stream=True)
    return response

def initialise_headers(session, url):
    """Populate http headers with necessary information for Twitter queries"""
    headers = {}

    # One of the js files from original url holds the bearer token and query id.
    container = send_request(url, session.get, headers)
    js_files = re.findall("src=['\"]([^'\"()]*js)['\"]", container)

    bearer_token = None
    # Search the javascript files for a bearer token and query ids
    for jsfile in js_files:
        logging.debug("Processing %s", jsfile)
        file_content = send_request(jsfile, session.get, headers)
        find_bearer_token = BEARER_TOKEN_PATTERN.search(file_content)

        if find_bearer_token:
            bearer_token = find_bearer_token.group(1)
            logging.info("Retrieved bearer token: %s", bearer_token)
            break

    if not bearer_token:
        raise Exception("Did not find bearer token.")

    headers['authorization'] = f"Bearer {bearer_token}"

    get_guest_token(session, headers)
    return headers

class TwitterGuestAPI:
    """Class to query Twitter API without a developer account"""
    def __init__(self, session):
        self.headers = initialise_headers(session, "https://www.twitter.com")

    def get_account(self, session, account_id):
        """Get the json metadata for a user account"""
        query_url = f"{SHOW_USER_ENDPOINT}?user_id={account_id}"
        response = get_response(query_url, session, self.headers)
        if response.status_code == 200:
            status_json = json.loads(response.content)
            return status_json
        logging.error("Failed to get account %s: (%i) %s",
                      account_id, response.status_code, response.reason)
        return None

    def get_tweet(self, session, tweet_id, include_user=True, include_alt_text=True):
        """
        Get the json metadata for a single tweet.
        If include_user is False, you will only get a numerical id for the user.
        """
        query_url = f"{SHOW_STATUS_ENDPOINT}?id={tweet_id}"
        if not include_user:
            query_url += "&trim_user=1"
        if include_alt_text:
            query_url += "&include_ext_alt_text=1"
        response = get_response(query_url, session, self.headers)
        if response.status_code == 200:
            status_json = json.loads(response.content)
            return status_json
        logging.error("Failed to get tweet %s: (%i) %s",
                      tweet_id, response.status_code, response.reason)
        return None


with requests.Session() as session:
    api = TwitterGuestAPI(session)
    user_id = "1389666600341544960"
    user = api.get_account(session, user_id)
    print(f"{user_id} mapped to {user['name']} ({user['screen_name']})")

The output shows that 7 GET/POST requests are needed:

Sending request: https://www.twitter.com get {}
Sending request: https://abs.twimg.com/responsive-web/client-web-legacy/polyfills.c7dfc719.js get {}
Sending request: https://abs.twimg.com/responsive-web/client-web-legacy/vendor.d9a7d629.js get {}
Sending request: https://abs.twimg.com/responsive-web/client-web-legacy/i18n/en.89426ec9.js get {}
Sending request: https://abs.twimg.com/responsive-web/client-web-legacy/main.6de340c9.js get {}
post: https://api.twitter.com/1.1/guest/activate.json {'authorization': 'Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA'}
get: https://api.twitter.com/1.1/users/show.json?user_id=1389666600341544960 {'authorization': 'Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA', 'x-guest-token': '1594029370401869825'}
1389666600341544960 mapped to Dan Luu (altluu)

Is this as simple as it gets, do we think? I don't have much experience in this area.

The first 6 requests are one-offs; if you're getting multiple users, each further one only requires one more request.

I think we could hard-code the bearer token. It looks like it hasn't changed in 2 years. That would mean we then only have to make one initialisation request (getting the guest token), then a request per user.

I've just noticed that we could batch up 100 user lookups at a time via this endpoint. There's a matching one for tweets.

Okay, that lookup endpoint is a game-changer. I can get 250 accounts in about a second. Will try the same for tweets and get it all pushed.

With a hard-coded bearer token (and single-user per query access) as minimal stand-alone code:

import json
import requests


def get_twitter_api_guest_token(session, bearer_token):
    """Returns a Twitter API guest token for the current session."""
    guest_token_response = session.post("https://api.twitter.com/1.1/guest/activate.json",
                                        headers={'authorization': f'Bearer {bearer_token}'})
    if guest_token_response.status_code != 200:
        raise Exception(f'Failed to retrieve guest token from Twitter API: {guest_token_response}')
    return json.loads(guest_token_response.content)['guest_token']


def get_twitter_user(session, bearer_token, guest_token, user_id):
    """Asks the Twitter API for the user details associated with this user_id."""
    query_url = f"https://api.twitter.com/1.1/users/show.json?user_id={user_id}"
    response = session.get(query_url,
                           headers={'authorization': f'Bearer {bearer_token}', 'x-guest-token': guest_token})
    if response.status_code != 200:
        raise Exception(f'Failed to retrieve user from Twitter API: {response}')
    return json.loads(response.content)


with requests.Session() as session:
    bearer_token = 'AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA'
    guest_token = get_twitter_api_guest_token(session, bearer_token)
    user_ids = ['1389666600341544960', '2246902119']
    for user_id in user_ids:
        user = get_twitter_user(session, bearer_token, guest_token, user_id)
        print(f"{user_id} = {user['screen_name']}")

@press-rouch If we can keep the code in the PR down to this vague sort of size (all in parser.py) then I will be very happy. Let's take small steps and not have more sophistication than we need.

Here's the batch version:

import json
import requests

def get_twitter_api_guest_token(session, bearer_token):
    """Returns a Twitter API guest token for the current session."""
    guest_token_response = session.post("https://api.twitter.com/1.1/guest/activate.json",
                                        headers={'authorization': f'Bearer {bearer_token}'})
    # Use .get() so a response without the key raises the intended error, not a KeyError.
    guest_token = json.loads(guest_token_response.content).get('guest_token')
    if not guest_token:
        raise Exception("Failed to retrieve guest token")
    return guest_token

def get_twitter_users(session, bearer_token, guest_token, user_ids):
    """Asks Twitter for all metadata associated with user_ids."""
    users = {}
    while user_ids:
        max_batch = 100
        user_id_batch = user_ids[:max_batch]
        user_ids = user_ids[max_batch:]
        user_id_list = ",".join(user_id_batch)
        query_url = f"https://api.twitter.com/1.1/users/lookup.json?user_id={user_id_list}"
        response = session.get(query_url,
                               headers={'authorization': f'Bearer {bearer_token}', 'x-guest-token': guest_token})
        if response.status_code != 200:
            raise Exception(f'Failed to get user handle: {response}')
        response_json = json.loads(response.content)
        for user in response_json:
            users[user["id_str"]] = user
    return users

with requests.Session() as session:
    bearer_token = 'AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA'
    guest_token = get_twitter_api_guest_token(session, bearer_token)
    user_ids = ['1389666600341544960', '2246902119']
    users = get_twitter_users(session, bearer_token, guest_token, user_ids)
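    for user_id in user_ids:
        # The lookup endpoint may omit IDs it cannot resolve, so avoid a KeyError.
        user = users.get(user_id)
        print(f"{user_id} = {user['screen_name'] if user else '~unknown~handle~'}")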

which is a bit longer, but orders of magnitude faster when len(user_ids) > 100, and less over-engineered than my module.

Curious whether the work being done in this issue will reduce how many of my followers in followers.txt resolve to ~unknown~handle~ (perhaps they are protected accounts), or, if they should already be resolvable, whether I should document this in a new issue?

Example:

~unknown~handle~ https://twitter.com/i/user/14114392

This is the account for thoughtbot: https://twitter.com/thoughtbot

@press-rouch It says it's rate limited to 900 (or 300?) requests per 15-minute window. Does that mean with the batch call this limits us to 90,000 users (or 30,000) per 15-minute window? For many of us that will be plenty and for now it's fine but might have to consider that later. (I only mention because Neil Gaiman likely had lots of followers and is looking for a tool to parse his Twitter archive. :) )
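The arithmetic behind those numbers can be checked with a short sketch. It assumes 100 users per batch request and a 15-minute window as quoted above; the follower count of one million is an arbitrary example, not a figure from any real account.

```python
import math

BATCH_SIZE = 100  # users per lookup request, as mentioned in the thread


def users_per_window(requests_per_window):
    """Maximum number of users resolvable in one rate-limit window."""
    return requests_per_window * BATCH_SIZE


def windows_needed(user_count, requests_per_window):
    """Number of 15-minute windows that must be used to resolve user_count users."""
    requests_needed = math.ceil(user_count / BATCH_SIZE)
    return math.ceil(requests_needed / requests_per_window)


for limit in (300, 900):
    print(f"limit {limit}: {users_per_window(limit)} users per window, "
          f"{windows_needed(1_000_000, limit)} windows for 1,000,000 users")
```

So under either limit a typical follower list fits in a single window, and only very large accounts would span several.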

@stepheneb Yes! That and DMs.md. We should be able to resolve many of those ~unknown~handle~ placeholders with this work. It's not that they're protected - the archive itself doesn't contain that information. And actually protected accounts will likely not be resolved by this change I suspect.
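As a sketch of what resolving a placeholder could look like once an ID-to-handle mapping is available: the input line format is copied from the example above, while the function name and the format of the resolved line are assumptions, not what parser.py actually writes.

```python
import re

# Matches lines such as "~unknown~handle~ https://twitter.com/i/user/14114392".
PLACEHOLDER_LINE = re.compile(
    r"^~unknown~handle~ https://twitter\.com/i/user/(\d+)$")


def resolve_line(line, handles):
    """Replace a placeholder line with the handle, if the mapping knows the ID."""
    match = PLACEHOLDER_LINE.match(line.strip())
    if not match:
        return line  # not a placeholder line, leave it untouched
    handle = handles.get(match.group(1))
    if handle is None:
        return line  # still unknown (e.g. the remote lookup returned nothing)
    # Assumed output format: handle followed by the profile URL.
    return f"{handle} https://twitter.com/{handle}"


handles = {"14114392": "thoughtbot"}
print(resolve_line("~unknown~handle~ https://twitter.com/i/user/14114392", handles))
```

IDs that remain unresolved keep their placeholder, so the output never loses information that was already there.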

When I was making thousands of individual tweet requests, I found I would eventually hit a 429 "rate limit exceeded" status code. The trivial workaround for this was to just request a new guest token (which seems to make guest access superior to a registered user!).
I would suggest that users with massive followings might want the option to skip processing their follower lists, as I expect they are probably not particularly interested in the names of every one of their fans.
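The 429 workaround can be sketched independently of the network code. In this sketch the actual request and the token refresh are passed in as callables (both hypothetical here, replaced by fakes so it runs offline), and the number of refreshes is capped so a persistent failure cannot loop forever.

```python
def get_with_token_refresh(fetch, new_guest_token, url, guest_token, max_refreshes=3):
    """
    Call fetch(url, guest_token); on a 429, get a fresh guest token and retry.
    Returns (status_code, body, guest_token) so the caller keeps the latest token.
    """
    status, body = fetch(url, guest_token)
    refreshes = 0
    while status == 429 and refreshes < max_refreshes:
        guest_token = new_guest_token()
        refreshes += 1
        status, body = fetch(url, guest_token)
    return status, body, guest_token


# Fake callables standing in for the real session.get / guest token request.
def fake_fetch(url, guest_token):
    if guest_token == "stale":
        return 429, "rate limit exceeded"
    return 200, "ok"


def fake_new_guest_token():
    return "fresh"


print(get_with_token_refresh(
    fake_fetch, fake_new_guest_token, "https://example.invalid/lookup", "stale"))
# prints (200, 'ok', 'fresh')
```

In real use, fetch would wrap session.get with the bearer and guest token headers, and new_guest_token would wrap the guest/activate.json request shown earlier.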

@press-rouch

to be clearer, when i say guest token i mean the guest bearer token rather than the x-guest-token. i don't believe the x-guest-token is needed for the standard V1.1 API

this is the user timeline endpoint i tested it on

const token = "AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA";
const fetchTweetsFromUser = async (screenName: string, count: number) => {
  const response = await fetch(
    `https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=${screenName}&count=${count}`,
    {
      headers: {
        Authorization: `Bearer ${token}`,
      },
    }
  );
  const json = await response.json();
  return json;
}
await fetchTweetsFromUser("elonmusk", 10).then(console.log);

@nogira Interesting, thanks. I just had a quick try and it appears that the user timeline endpoint works using only the bearer token, but the user and tweet lookup endpoints return a 429 without x-guest-token.
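Based only on that quick test, the header requirements could be captured in a small helper like the one below. The function name is hypothetical, the token values in the usage lines are placeholders, and the rule "everything except user_timeline needs x-guest-token" is an assumption extrapolated from the three endpoints tried.

```python
def build_headers(endpoint_url, bearer_token, guest_token=None):
    """Build request headers, adding x-guest-token where it appears to be required."""
    headers = {"authorization": f"Bearer {bearer_token}"}
    # Observed: user_timeline worked with the bearer token alone, while the
    # user and tweet lookup endpoints returned 429 without x-guest-token.
    needs_guest_token = "/statuses/user_timeline.json" not in endpoint_url
    if needs_guest_token:
        if guest_token is None:
            raise ValueError("this endpoint appears to need an x-guest-token")
        headers["x-guest-token"] = guest_token
    return headers


# Placeholder tokens, for illustration only.
print(build_headers(
    "https://api.twitter.com/1.1/statuses/user_timeline.json", "BEARER"))
print(build_headers(
    "https://api.twitter.com/1.1/users/show.json", "BEARER", "GUEST"))
```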