This Python script uses the Twitter API and BeautifulSoup to find and download dialogs from Twitter. It is built on top of the collect_twitter_dialogs module from DSTC6-End-to-End-Conversation-Modeling.

It is meant to overcome some limitations of the DSTC6 module and other scrapers. The script doesn't require a list of source accounts, and for the most part avoids rate-limiting by scraping public URLs. You can define the desired range of dialog lengths, and scale the execution across cores and threads.
- Create a Twitter account if you don't have one. You can sign up at https://twitter.com/signup.
- Create your application via the Twitter Developer's Site: https://dev.twitter.com/ (see https://iag.me/socialmedia/how-to-create-a-twitter-app-in-8-easy-steps/ for reference), and keep the following keys:
  - Consumer Key
  - Consumer Secret
  - Access Token
  - Access Token Secret
- Edit `./config.ini` to set your access keys (a sample is sketched after this list):
  - ConsumerKey
  - ConsumerSecret
  - AccessToken
  - AccessTokenSecret
- Install the dependencies:

  ```
  $ pip install -r requirements.txt
  ```
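A minimal `config.ini` could look like the sketch below. The key names come from the list above, but the section header is an assumption; check the `config.ini` shipped with the repository for the exact layout:

```ini
; Sketch only: the actual section name may differ from [twitter].
[twitter]
ConsumerKey = your-consumer-key
ConsumerSecret = your-consumer-secret
AccessToken = your-access-token
AccessTokenSecret = your-access-token-secret
```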
See all available options with:

```
python getdialogs.py --help
```
If you need dialogs with 4 to 6 turns, for example:

```
python getdialogs.py \
    --min_length=4 \
    --max_length=6 \
    output.csv
```
The script will collect data until interrupted (Ctrl+C). It periodically saves the collected dialogs to the given path, which is `output.csv` above. Dialogs are appended to the output file, so it's OK to stop and restart the script later; previous results will not be lost.
You can pass the path to a custom config file with `--config`. This is useful when you have several sets of credentials: each run can use a different set to avoid rate-limiting.
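For instance, assuming a second credentials file named `alt_config.ini` (the file name is just a placeholder):

```
python getdialogs.py --config=alt_config.ini output.csv
```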
By default, the script tries to maximize resource use by splitting the workload among processes. The number of processes can be set with `--max_processes`; by default, it is the number of cores available on the machine. This may be the best setting when running on a dedicated node. If you're running the script casually on a laptop, a less aggressive value is more appropriate.
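For example, to leave most cores free on a shared machine (the value 2 is illustrative, not a recommendation):

```
python getdialogs.py --max_processes=2 output.csv
```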
Similarly, you can set the maximum number of threads used by each process with `--max_threads`. This value should be chosen carefully: if it is too high (e.g. 10), the threads will compete with the thread that listens to the Streaming API, causing it to fall behind. When a client fails to keep up with the stream, Twitter disconnects it.
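A conservative run combining both settings might look like this (again, the values are only illustrative):

```
python getdialogs.py \
    --max_processes=2 \
    --max_threads=4 \
    output.csv
```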
The script does not guarantee that conversations are unique: if a user appears twice in the stream, her dialogs will be collected twice. To guarantee uniqueness, it's better to run a second script to remove duplicates. A Bloom filter can help scale this task; there's a good one written in Python.
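As a minimal sketch of that deduplication step, assuming each dialog occupies a single row of the output CSV (the real layout may differ) and using the third-party pybloom-live package as the Bloom filter:

```python
# dedup.py -- illustrative sketch, not part of this project.
# Assumption: one dialog per CSV row; adapt if dialogs span several rows.
import csv
import sys

from pybloom_live import BloomFilter  # pip install pybloom-live

def dedup(in_path, out_path, capacity=1_000_000, error_rate=0.001):
    # A Bloom filter keeps memory use constant at the cost of a small
    # false-positive rate, so a few unique dialogs may be dropped.
    seen = BloomFilter(capacity=capacity, error_rate=error_rate)
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            key = "\x1f".join(row)  # field separator unlikely to appear in tweets
            if key not in seen:
                seen.add(key)
                writer.writerow(row)

if __name__ == "__main__":
    dedup(sys.argv[1], sys.argv[2])
```

Usage: `python dedup.py output.csv deduped.csv`.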