timhutton / twitter-archive-parser

Python code to parse a Twitter archive and output in various ways

Feature request: Parse likes

lynnandtonic opened this issue

It would be cool to parse your likes archive in the same way as your own tweets (sorry if you can already do this, I’m a Python noob). I realize it might be more challenging because of multiple tweet authors though!

Thanks for making this parser and for your clear instructions, I really appreciate it! 💚

I started looking at this before I knew about this project. Unfortunately likes are significantly harder to process than your own tweets as they have much less metadata - here's an example:
{
  "like" : {
    "tweetId" : "1590480568017616898",
    "fullText" : "Looking at the bitcoin graph you can see pretty clearly that the actual thing driving the price of crypto is a love of round numbers https://t.co/V8m7LylGpG",
    "expandedUrl" : "https://twitter.com/i/web/status/1590480568017616898"
  }
},
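For reference, reading these entries out of the archive can be sketched as follows. This is a minimal sketch, not code from the project: it assumes the standard archive layout where like.js is a JavaScript assignment of the form `window.YTD.like.part0 = [ ... ]`, and the `data/like.js` path is just an illustration.

```python
import json

def read_likes(path="data/like.js"):
    """Return the list of like entries from an archive's like.js.

    Assumes the standard archive format: a JavaScript assignment
    like `window.YTD.like.part0 = [ ... ]`, so stripping everything
    before the first '[' leaves plain JSON.
    """
    with open(path, encoding="utf8") as f:
        data = f.read()
    entries = json.loads(data[data.index("["):])
    return [entry["like"] for entry in entries]
```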
So there are a few steps we have to do:

  1. For any t.co links, perform a request to discover the actual URL they are referencing.
  2. If the actual URL refers to media on Twitter's servers, then download that media.
  3. Build the final markdown files referencing the data we've managed to get hold of.

I've got something working for part 1. Part 2 is much harder, since the links do not get you a direct URL for the media file itself, whereas you can derive that URL for your own images. I've had mild success with https://github.com/inteoryx/twitter-video-dl (quite easy to tweak for photos as well) but it's quite slow, and it will need refactoring to initialise once before retrieving multiple files.
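For anyone who wants to experiment, part 1 can be sketched roughly like this. This is an assumption-laden sketch, not the script from the gist: it relies on t.co answering with an HTTP redirect whose Location header carries the real URL.

```python
import requests

def resolve_tco(session, short_url):
    """Resolve a t.co short link to the URL it redirects to.

    t.co responds with an HTTP redirect, so a HEAD request with
    redirects disabled exposes the destination in the Location
    header without downloading the target page.
    """
    response = session.head(short_url, allow_redirects=False)
    return response.headers.get("Location", short_url)
```

Reusing a single requests.Session across many links keeps connections alive, which matters when an archive holds thousands of t.co URLs.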

Would it be useful to separate these into two different improvements? I have separate needs for expanding t.co links from retrieving media.

@grayside That's a good idea. I don't have time to make the issue/PR at the moment but here's a gist with the redirect script if anyone would like to pick that up.

I've since decided that I'm probably not going to use this stage for parsing likes, since you get the expanded urls in the metadata for a given tweet id, but hopefully it will be useful independently.

Progress report - I've added a get_likes script to my fork:
https://github.com/press-rouch/twitter-archive-parser
Note that this does not yet produce final output. However it does download all the metadata and media for the tweets, which is the most important thing at the moment. Hopefully this weekend I can wrangle it into producing the markdown/HTML.

Following the coding guidelines discussed in issue #79, here's a minimal implementation to get all metadata (including the alt-text for issue #20) for a set of tweets. On my 50Mbit connection I can get ~100 tweets per second.

import json
import requests

def get_twitter_api_guest_token(session, bearer_token):
    """Returns a Twitter API guest token for the current session."""
    guest_token_response = session.post("https://api.twitter.com/1.1/guest/activate.json",
                                        headers={'authorization': f'Bearer {bearer_token}'})
    guest_token = json.loads(guest_token_response.content).get('guest_token')
    if not guest_token:
        raise Exception("Failed to retrieve guest token")
    return guest_token

def get_tweets(session, bearer_token, guest_token, tweet_ids):
    """Asks Twitter for all metadata associated with tweet_ids."""
    tweets = {}
    while tweet_ids:
        max_batch = 100
        tweet_id_batch = tweet_ids[:max_batch]
        tweet_id_list = ",".join(tweet_id_batch)
        query_url = f"https://api.twitter.com/1.1/statuses/lookup.json?id={tweet_id_list}&tweet_mode=extended&include_ext_alt_text=1"
        response = session.get(query_url,
                               headers={'authorization': f'Bearer {bearer_token}', 'x-guest-token': guest_token})
        if response.status_code == 429:
            # Rate limit exceeded - get a new guest token and retry this batch
            guest_token = get_twitter_api_guest_token(session, bearer_token)
            continue
        if response.status_code != 200:
            raise Exception(f'Failed to get tweets: {response}')
        response_json = json.loads(response.content)
        for tweet in response_json:
            tweets[tweet["id_str"]] = tweet
        tweet_ids = tweet_ids[max_batch:]
    return tweets

with requests.Session() as session:
    bearer_token = 'AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA'
    guest_token = get_twitter_api_guest_token(session, bearer_token)
    tweet_ids = ["1585744456203964418","1585744453133897731"]
    tweets = get_tweets(session, bearer_token, guest_token, tweet_ids)
    for tweet_id, tweet in tweets.items():
        print(f"{tweet_id}: {tweet['full_text']}")
        if "media" in tweet["entities"]:
            print(f"media url: {tweet['entities']['media'][0]['media_url_https']}")

(this does not include the follow-up stage to download media)
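As a rough illustration of that follow-up stage, here is a sketch only: it assumes the media_url_https values printed above, and the ':orig' suffix that Twitter's image server accepts for original-size photos; videos need the separate handling discussed earlier in this thread.

```python
import os

def download_media(session, media_url, output_dir="media"):
    """Download one image given its direct media_url_https value.

    Appends ':orig' to request the original-size file; the local
    filename is taken from the last path component of the URL.
    """
    os.makedirs(output_dir, exist_ok=True)
    filename = os.path.join(output_dir, media_url.split("/")[-1])
    response = session.get(media_url + ":orig", timeout=30)
    response.raise_for_status()
    with open(filename, "wb") as f:
        f.write(response.content)
    return filename
```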

To clarify: would this work by exploiting archive data, or by parsing everything from the API? Protected accounts being excluded from the feature would be a bit of a pain.

@komic The archive contains a minimal amount of data: the tweet id, the text, and a URL. I don't know what the archive contains for protected accounts. I would expect we would start out with the archive data and expand it using the API where possible.
An advanced feature would be providing your own developer key in order to gain access to that data, but I've no idea how feasible that is.

I don't know what the archive contains for protected accounts.

Looks like the same thing: like.js contains tweetId, fullText and expandedUrl for each entry. I'm asking because the main script suggested unlocking my account before running it, but I didn't and everything still seemed to work fine.

I'm asking because the main script suggested unlocking my account before running it, but I didn't and everything still seemed to work fine.

@komic Maybe I can shed some light on these two lines of text, as I introduced them with issue #32 and pull-request #33. Back then, the download_better_images.py script worked on all of my accounts except one, which happened to be the only one set to "protected". I did some trial and error: un-protecting the account and running the script again (which then worked just fine), then protecting it again and re-running (which failed just like on my initial run). I am still not completely sure whether the protected state is the (only) reason the images can't be accessed via their deeplink, or what is going on behind the scenes, but it was a quick "fix" for a rather random bug.

tl;dr: the "unlock your account" suggestion has nothing to do with parsing likes but may be a solution if downloading the full size images does not work.

@komic I think that parsing likes with lookups will work fine if your account is protected, but if you have liked tweets from someone else's protected account, then we won't be able to get the extended data.

Since download_better_images.py has been integrated with parser.py, it might be better if we reword the text from

This script may not work if your account is protected.

to

Downloading your images may not work if your account is protected.

Interesting @achisto, I can confirm I didn't run into this.

Just noticed that like.js is pretty much unsorted. Are you planning to rearrange everything based on when the tweet itself was posted, or leave them as is?
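For what it's worth, the liked tweets can be put in rough posting order without any extra lookups: tweet ids issued since late 2010 are snowflake ids, which embed a millisecond timestamp in their upper bits. A sketch (the constant is Twitter's published snowflake epoch; function names are illustrative):

```python
from datetime import datetime, timezone

# Twitter's snowflake epoch, 2010-11-04, in milliseconds
TWITTER_EPOCH_MS = 1288834974657

def tweet_timestamp(tweet_id):
    """Recover a tweet's creation time from its snowflake id.

    The top bits of the id hold milliseconds since the snowflake
    epoch, so sorting by id is sorting by posting time.
    """
    ms = (int(tweet_id) >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000.0, tz=timezone.utc)

def sort_likes(likes):
    """Order like.js entries by when the liked tweet was posted."""
    return sorted(likes, key=lambda like: int(like["tweetId"]))
```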

Would it be useful to separate these into two different improvements? I have separate needs for expanding t.co links from retrieving media.

related: #90

@press-rouch this would definitely be much clearer, yes. I totally forgot about the two scripts getting merged at some point.

@press-rouch of all the issues that are somehow connected to PR #97, this is the one that I did not wrap my head around yet. I think downloading liked tweets is a rather separate process, which does not need to be integrated with the three-pass thingy I do in parse_tweets in the branch downloadtweets, right?

@lenaschimmel I think there's a reasonable chance you might have liked a tweet that you have retweeted, quoted, or replied to, so we probably want some de-duplication between the two processes.
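One way to sketch that de-duplication (a hypothetical helper, assuming `likes` is the list of entries from like.js and `own_tweets` is a dict keyed by id_str, like the one built by get_tweets above):

```python
def dedupe_likes(likes, own_tweets):
    """Drop liked tweets that already appear among your own tweets,
    retweets, quotes, or replies.

    `own_tweets` is a dict keyed by tweet id string; `likes` is a
    list of like.js entries, each with a 'tweetId' field.
    """
    known_ids = set(own_tweets)
    return [like for like in likes if like["tweetId"] not in known_ids]
```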

One question: is there any way to parse the likes? No problem if I need to retry multiple times. Since the API is going to become paid instead of free soon, I'm not sure whether that will also affect the rest of the parser.

Extremely interested in this feature!

I did end up finding the get_likes script @press-rouch mentioned in #22 (comment), but I'm only getting a bunch of 404s, so I'm guessing the API shutdown killed it.