timhutton / twitter-archive-parser

Python code to parse a Twitter archive and output in various ways

Document idempotency

cooljeanius opened this issue

Is it safe to run this script multiple times on the same archive? Is it safe to run new versions of the script on versions of an archive that have already been processed by an old version of the script? Whether or not the script is idempotent should be documented in the README, IMO.

These are good questions and I agree that this should be documented in the README. I'll try to give a quick (but not short) answer here, and I or someone else will probably update the README later. I've highlighted the sections which are (IMHO) most important.

Regarding updates: yes, we really try to make sure that upgrading the script and re-running it on the same folder is safe and convenient. But there's no guarantee, and with the quick pace of development and without automated software tests, there might be problems sometimes.

The script should be idempotent, and in its current version it probably is, broadly speaking, but with some catches.

Say you run the script, answer its questions a certain way, and everything goes well. If you run it again and give the same answers, the result will be the same as after the first run.

The script is also (somewhat) incremental: if some resources could not be downloaded on the first run, the result is incomplete. A second run may be able to fetch more online resources and generate a result that is (more) complete. This is definitely true for media resources (images and videos).
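
To illustrate the skip-if-present pattern that makes media downloads incremental, here's a minimal sketch. The function name and arguments are hypothetical, not the script's actual code; it only shows the general idea:

```python
import os

import requests  # third-party; pip install requests


def download_media(url, dest_path):
    """Download a media file only if it isn't already present locally."""
    if os.path.exists(dest_path) and os.path.getsize(dest_path) > 0:
        return  # fetched on a previous run; skipping keeps re-runs incremental
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raises before the file is opened, so no partial file is left behind
    with open(dest_path, "wb") as f:
        f.write(response.content)
```

Because a successful download is never repeated, each run can only add media files, never lose them.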

But there's the catch: some online resources are not cached / saved properly, and if, at the time of the second run, the online availability of certain resources (e.g. Twitter user profiles) is worse than on the first run, the resulting .md and .html files might be less complete than after the first run.
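
One way to avoid that regression would be to cache successful lookups on disk and reuse them when the network fails. A rough sketch, where the cache path and the `fetch_online` callable are assumptions for illustration:

```python
import json
import os

CACHE_PATH = "parser-cache/user_profiles.json"  # hypothetical location


def load_cache():
    """Load previously saved lookups, or start empty on the first run."""
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, encoding="utf8") as f:
            return json.load(f)
    return {}


def lookup_profile(user_id, fetch_online, cache):
    """Prefer the cached profile; only hit the network for unknown users."""
    if user_id in cache:
        return cache[user_id]
    profile = fetch_online(user_id)  # may return None if Twitter is down
    if profile is not None:
        cache[user_id] = profile  # cache only successful lookups
    return profile


def save_cache(cache):
    os.makedirs(os.path.dirname(CACHE_PATH), exist_ok=True)
    with open(CACHE_PATH, "w", encoding="utf8") as f:
        json.dump(cache, f)
```

With this in place, a second run with worse connectivity would at worst match the first run's output, never fall below it.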

My work on this script focuses on improving this, because I think it will become more important if / when Twitter's API becomes less reliable over time. Ideally, it should be safe to run this script even when Twitter is offline, or returns only empty / nonsensical responses. We're definitely not there yet, and it's not trivial.

In issue #144 and in this branch / fork we're working on downloading and saving referenced tweets. The new tweets are merged with the ones in the archive (without modifying the original file), so that the locally available data only becomes more complete over time. But there are still some bugs, so the tweet cache grows a bit on each run instead of settling into a "complete" state and staying there forever.
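
The merge idea can be sketched like this: index tweets by ID and take the union of the archive's tweets and the downloaded ones, writing the result to a separate cache file rather than back into the archive. The names here are illustrative, not the branch's actual code:

```python
import json


def merge_tweets(archive_tweets, downloaded_tweets, cache_path):
    """Union archive and downloaded tweets by ID; the archive file is never modified."""
    merged = {t["id_str"]: t for t in archive_tweets}
    for tweet in downloaded_tweets:
        merged.setdefault(tweet["id_str"], tweet)  # keep the archive's copy if both exist
    with open(cache_path, "w", encoding="utf8") as f:
        json.dump(list(merged.values()), f)
    return merged
```

Keying by ID is what should make this idempotent: re-running the merge with the same inputs produces the same cache, which is the "complete" steady state the current code doesn't quite reach yet.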

If you want to be 100% sure: copy the full output of the script before running it again. If you want to be 99.5% sure: copy the .html and .md output. The media folder will be fine either way.
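
If you'd rather script that backup, a couple of lines of Python would do it (the folder names here are just examples, not what the script actually produces):

```python
import shutil
from datetime import datetime

# Copy the output folder aside before re-running, e.g. parser-output -> backup-2023-01-01T12-00-00
stamp = datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
shutil.copytree("parser-output", f"backup-{stamp}")
```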

Thank you, I had this actual literal question and just came here to ask it. 👍🏻

OK, could this info go in the wiki or somewhere?