timhutton / twitter-archive-parser

Python code to parse a Twitter archive and output in various ways

Put generated output in its own folder(s)

lenaschimmel opened this issue · comments

Using this tool on a large Twitter account that has existed for some years generates a lot of *-Tweet-Archive-*.md files, and now that DMs and group DMs are processed, also a lot of DMs-Archive-*.md files.

Also, some of the future features might need to save some additional files for internal re-use, which are not relevant to the end user.

I'd like to have a nicer organization, like this:

  • <archive_folder>/
    • assets/
    • data/
    • media/
    • parser-result/
      • tweets/
        • TweetArchive.html
        • *-Tweet-Archive-*.md
      • dms/
        • DMs-Archive-*.md
    • parser-data/

What do you think?

Also, I'm not sure whether an HTML file at archive_folder/parser-result/tweets/TweetArchive.html could reference images that are higher up in the hierarchy via relative paths.

Agreed that it needs thinking about.

I prefer the references to not include relative paths that move up the hierarchy, since it makes moving things harder.

I'd quite like for our consumable output product to go in a separate folder. Maybe:

  • <archive_folder>
    • Your archive.html
    • data/
    • assets/
    • parser-output/
      • *.md *.html
      • media/
    • parsing-info/
      • logs *.txt
      • cache files *.json *.txt

It still leaves *.md and *.html mixed up but at least it declutters the root folder of the archive.

So this would mean that we have to copy or move files from <archive_folder>/media to <archive_folder>/parser-output/media/, right? Or use symlinks on those operating systems that support them?

I'm really not sure whether going up the hierarchy, like <img src="../../media/34534634364576-SEgrgrdGertgSrd.jpg">, is a problem when opening an HTML file via a file:// URL. I've found conflicting info about that, and it seems I have not yet found the right search terms for useful info.

When the script runs for the first time, we would just copy media into parser-output/media/ instead of the current copy into /media. The exception is people who have already downloaded hi-res versions: they would need to copy them manually to avoid downloading them all again. Maybe that's a pain.
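As a rough sketch of that copy step (not the project's actual code): the function name `copy_media` and both of its parameters are hypothetical, and skipping files that already exist at the destination is an assumption, chosen so that a manually copied hi-res file is not overwritten by a lower-resolution one.

```python
import os
import shutil

def copy_media(source_dir, dest_dir):
    """Copy files from source_dir into dest_dir, skipping names that already exist.

    Hypothetical helper: both folder arguments are placeholders for wherever
    the media currently lives and wherever parser-output/media/ ends up.
    """
    os.makedirs(dest_dir, exist_ok=True)
    copied = 0
    for name in os.listdir(source_dir):
        src = os.path.join(source_dir, name)
        dst = os.path.join(dest_dir, name)
        # Assumption: an existing destination file may be a hi-res version, so keep it.
        if os.path.isfile(src) and not os.path.exists(dst):
            shutil.copy2(src, dst)
            copied += 1
    return copied
```

Called as `copy_media(old_media_folder, os.path.join(archive_folder, 'parser-output', 'media'))`, it returns the number of files actually copied.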

Going up the hierarchy is fine in html and markdown, as long as the file exists at the relative location. The problem comes when someone wants to move all the files into their blog for example. They would need to replicate the same folder structure so that the relative path would still work.
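To make the trade-off concrete, here is a small sketch of how the relative link depends on where the HTML file sits. The helper name `relative_media_link` and the example file names are made up for illustration; only `os.path.relpath` is standard library.

```python
import os

def relative_media_link(html_path, media_path):
    """Return the link from an HTML file to a media file, relative to the HTML file's folder."""
    rel = os.path.relpath(media_path, start=os.path.dirname(html_path))
    # HTML and markdown links use forward slashes on every OS.
    return rel.replace(os.sep, '/')

# HTML nested two levels below the archive root, media at the root:
# the link has to climb back up the hierarchy.
link = relative_media_link(
    os.path.join('archive', 'parser-result', 'tweets', 'TweetArchive.html'),
    os.path.join('archive', 'media', 'example.jpg'),
)
```

Here `link` is `../../media/example.jpg`. With the media folder placed next to the output files (for example `archive/parser-output/media/`), the same helper yields `media/example.jpg`, which keeps working when the whole output folder is moved elsewhere.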

Ok, I had not even noticed that the media is already copied by the current code 😄

Then I'd suggest that the script creates the directories ./parser-output and ./parsing-info if needed, and then moves output from older versions there, if it exists. Then users who upgrade do not need to do or worry about anything.
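A minimal sketch of that migration step, under some assumptions: the function name `migrate_old_output` is hypothetical, the file patterns are the ones mentioned earlier in this thread, and leaving an old file in place when the target already exists is just one possible policy.

```python
import glob
import os
import shutil

def migrate_old_output(archive_folder):
    """Create the new folders and move output of older versions into parser-output/."""
    output_dir = os.path.join(archive_folder, 'parser-output')
    info_dir = os.path.join(archive_folder, 'parsing-info')
    os.makedirs(output_dir, exist_ok=True)
    os.makedirs(info_dir, exist_ok=True)

    # Assumption: older versions wrote files matching these patterns to the archive root.
    old_patterns = ['*-Tweet-Archive-*.md', 'DMs-Archive-*.md', 'TweetArchive.html']
    moved = []
    for pattern in old_patterns:
        for old_path in glob.glob(os.path.join(archive_folder, pattern)):
            new_path = os.path.join(output_dir, os.path.basename(old_path))
            if os.path.exists(new_path):
                continue  # never overwrite output that is already in the new location
            shutil.move(old_path, new_path)
            moved.append(new_path)
    return moved
```

Running it a second time moves nothing, so it could be called unconditionally at startup.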

(I'd like to start coding on this, because the > 400 files in the top level directory of my archive really slow me down and annoy me while implementing / testing other features.)

Yes, good suggestion.

Please do go ahead and implement this. Thank you!