d60 / twikit

Twitter API Scraper | Without an API key | Twitter Internal API | Free | Twitter scraper | Twitter Bot

Home Page: https://twikit.readthedocs.io/en/latest/twikit.html


Get entire Tweet thread

ashutoshsaboo opened this issue · comments

Hi, I was wondering if this lib can be used in any way to unroll entire tweet threads into HTML/Markdown content of some sort, similar to what threadreaderapp and the like do.

Basically, if there's a thread of 10 tweets posted one after the other by the author, the lib should be able to get a list of those 10 tweets (along with their content) in serial order. I think the last time I checked the Twitter API, there was some thread_id/conversation_id (I forget precisely which) assigned to each tweet and a pointer to next_tweet in the same thread to accomplish this. But given this lib doesn't use the Twitter API, I was wondering whether this is currently possible. If not, would it be possible to add support for it in this lib?

Also many thanks for your great work on this lib! @d60

commented

Hi, @ashutoshsaboo

Added Tweet.thread in version 1.5.7.

Example:

>>> tweet = client.get_tweet_by_id('1779883783321272691')
>>> tweet.thread
[<Tweet id="1779910037533536356">, <Tweet id="1779910453306560552">]

Woah! That was super quick! 🚀 So any tweet that I query by, even if that's not the first tweet in the thread, will this still give all the tweets in the original thread in sequential order? @d60

commented

@ashutoshsaboo
tweet.thread only includes tweets that are below that tweet. If you want to retrieve tweets above it, you need to use tweet.reply_to. So, to get all tweets in the thread, it would be like this:

t = client.get_tweet_by_id('123456789')
thread = [*t.reply_to, t, *t.thread]
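Once the tweets are in order, turning them into the Markdown asked about at the top of this issue is mostly string formatting. A minimal sketch, assuming each item exposes a `.text` attribute as shown elsewhere in this thread; `thread_to_markdown` is a hypothetical helper, not part of twikit, and the stand-in objects below replace real `Tweet` instances so the example runs without a network call:

```python
from types import SimpleNamespace


def thread_to_markdown(tweets):
    """Render an ordered list of tweet-like objects as one Markdown string."""
    total = len(tweets)
    parts = []
    for position, tweet in enumerate(tweets, start=1):
        # Number each tweet "n/total" and put its text in its own paragraph.
        parts.append(f"**{position}/{total}**\n\n{tweet.text}")
    # A horizontal rule separates consecutive tweets.
    return "\n\n---\n\n".join(parts)


# Stand-in objects (assumption: real twikit Tweets expose .text the same way).
demo_thread = [SimpleNamespace(text="first"), SimpleNamespace(text="second")]
demo_markdown = thread_to_markdown(demo_thread)
print(demo_markdown)
```

With twikit, the input would be the list built as `[*t.reply_to, t, *t.thread]` above.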

Great this works perfectly! thanks!
One small edge case I found for t.thread: maybe you can return an empty list when no thread exists and it's a single tweet? It currently returns None, which requires extra handling before iterating. I think you are already doing that for t.reply_to, so it would be good to do it for t.thread as well. Would it be possible to make this small change? @d60

One small off-topic but relevant ask: is it possible to vend out the source links in tweets as well? When someone links to a page in a tweet, the link is generally auto-shortened to a t.co link, and I'm wondering if there's a way to get the original source links. I think for images and videos you vend out t.media, which helps with that; can something similar be done for links? If I recall correctly, this was possible via the Twitter APIs. Given this is accessing the source directly, it should ideally be possible here too?
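Until the library returns an empty list, the None case can be guarded on the caller's side. A minimal sketch, assuming `reply_to` and `thread` are either a list or None as described in this thread; `full_thread` is a hypothetical helper, and the stand-in objects are not real twikit `Tweet` instances:

```python
from types import SimpleNamespace


def full_thread(tweet):
    """Return the tweets above `tweet`, the tweet itself, and the tweets below it."""
    # `or []` turns a None (or missing) attribute into an empty list,
    # so the unpacking below never fails on a single, unthreaded tweet.
    above = getattr(tweet, "reply_to", None) or []
    below = getattr(tweet, "thread", None) or []
    return [*above, tweet, *below]


# Stand-in objects for illustration only.
first = SimpleNamespace(id="1", reply_to=[], thread=None)
last = SimpleNamespace(id="3", reply_to=[], thread=None)
middle = SimpleNamespace(id="2", reply_to=[first], thread=[last])
single = SimpleNamespace(id="9", reply_to=[], thread=None)

ordered_ids = [t.id for t in full_thread(middle)]
print(ordered_ids)
```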

commented

@ashutoshsaboo
Indeed, it would be better to have Tweet.thread return an empty list instead of None. I'll make that change in the next update.
And you can use Tweet.urls to get the original URLs.

>>> tweet = client.get_tweet_by_id('1780709582958100825')

>>> print(tweet.text)
Whether you're planning a drive or already on the road, we're making it easier to find information about electric vehicle charging stations. Learn more https://t.co/mdVCGnYY2V
>>> print(tweet.urls)
[{'display_url': 'goo.gle/3U0OcPe', 'expanded_url': 'https://goo.gle/3U0OcPe', 'url': 'https://t.co/mdVCGnYY2V', 'indices': [154, 177]}]
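Given entries shaped like the output above, the shortened links in the text can be swapped for their originals. A minimal sketch, assuming each entry is a dict with `url` and `expanded_url` keys as printed above; `expand_links` is a hypothetical helper, not a twikit function, and the sample text is abbreviated:

```python
def expand_links(text, urls):
    """Replace each shortened t.co link in `text` with its expanded URL."""
    for entry in urls or []:
        short = entry.get("url")
        expanded = entry.get("expanded_url")
        # Skip entries that lack either field rather than guessing.
        if short and expanded:
            text = text.replace(short, expanded)
    return text


# Sample data copied from the output above (tweet text shortened).
sample_text = "Learn more https://t.co/mdVCGnYY2V"
sample_urls = [{
    "display_url": "goo.gle/3U0OcPe",
    "expanded_url": "https://goo.gle/3U0OcPe",
    "url": "https://t.co/mdVCGnYY2V",
    "indices": [154, 177],
}]
expanded_text = expand_links(sample_text, sample_urls)
print(expanded_text)
```

Plain string replacement is used instead of the `indices` field, since the offsets refer to the original tweet text and shift after the first substitution.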

Ohh, I couldn't see the urls attribute in the Tweet class docs here, which is why I was confused - https://twikit.readthedocs.io/en/latest/_modules/twikit/tweet.html#Tweet . I see it now as part of the constructor, but it isn't in the docs, hence I couldn't spot it earlier. Should be a quick fix to add to the docs. @d60

This looks perfect but!

Hey @d60, I just noticed that for long tweets such as this one - https://twitter.com/EFarraro/status/1669480575542116353 - tweet.text seems to be only partial and not the entire text. Is there some bug causing that?

I believe browsing long tweets has no constraint even if your account is not verified, so given this is accessing the source directly, it should ideally work? Any idea how to go about getting the content of such tweets?

commented

@ashutoshsaboo You can retrieve the full text of a tweet using Tweet.full_text.

@d60 the problem is that full_text doesn't seem to have the t.co media URLs in place the way t.text has. Also, for short tweets, t.full_text is often empty and only t.text contains the text. It looks like t.full_text has the text only for longer tweets (at least from my limited testing).

Can this be made uniform, so that t.full_text is valid in all cases? I don't know if you need to make additional network requests to get full_text, but for a short tweet with no extra content, maybe just return t.text for t.full_text as well. Can full_text also keep the t.co media/URL links in place, just as t.text does? @d60

Without that, it's kind of challenging to access the entire tweet in a unified manner.
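On versions where `full_text` may be empty for short tweets, a caller-side fallback gives one uniform accessor. A minimal sketch, assuming `full_text` and `text` are either strings or None as described above; `tweet_body` is a hypothetical helper, and the stand-in objects are not real twikit `Tweet` instances:

```python
from types import SimpleNamespace


def tweet_body(tweet):
    """Prefer full_text; fall back to text when full_text is empty or None."""
    return getattr(tweet, "full_text", None) or getattr(tweet, "text", None) or ""


# Stand-in objects for illustration only.
short_tweet = SimpleNamespace(text="short tweet", full_text=None)
long_tweet = SimpleNamespace(text="long tweet, trunc", full_text="long tweet, complete")

short_body = tweet_body(short_tweet)
long_body = tweet_body(long_tweet)
print(short_body)
print(long_body)
```

This does not restore t.co links that are missing from `full_text`; it only picks whichever field is populated.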

Any idea about the above? Can you help @d60

commented

@ashutoshsaboo
Hi, sorry for the delay.

I've released Version 1.5.11, and as you suggested, I've made adjustments to Tweet.full_text and Tweet.urls. Now, Tweet.full_text will contain Tweet.text even for short tweets. Additionally, Tweet.urls will include all URLs within the tweet.