mozilla / fathom

A framework for extracting meaning from web pages

Home Page: http://mozilla.github.io/fathom/

Make `fathom-unzip` renaming more flexible to avoid `File name too long` error

biancadanforth opened this issue

Currently, `fathom-unzip` only shortens a filename if it matches the regexp `0*(\d+) .*(\.[a-zA-Z0-9]+$)`; otherwise, it leaves the filename as-is, which can result in an error like the one below:

OSError: [Errno 63] File name too long: '001 2/fp79 https_id.sonyentertainmentnetwork.com_create_account_entry=_2Fcreate_account_tp_psn=true_ui=pr_client_id=93be7f95-7d1f-461b-baf0-aa07bd53af84_redirect_uri=https_io.playstation.com_playstation_psn_acceptLogin_request_locale=en_US_response_type=code_scope=psn_s2s_service_entity=urn_service-entity_psn_service_logo=ps_smcid=web_social_toolbar#_create_account_wizard_account_info_page1_entry=_2Fcreate_account.html'
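The matching behavior described above can be sketched as follows. This is only an illustration, not the actual `fathom-unzip` code: the `shorten` function is hypothetical, and it assumes the pattern is anchored at the start of the name (`re.match`) and that the shortened name is the captured number plus the extension.

```python
import re

# The regexp quoted in this issue.
PATTERN = re.compile(r'0*(\d+) .*(\.[a-zA-Z0-9]+$)')


def shorten(name):
    """Return a shortened name if `name` matches, else return it unchanged.

    Assumption: the short form is "<number><extension>".
    """
    match = PATTERN.match(name)
    if match is None:
        # Names that do not start with a number fall through untouched,
        # however long they are.
        return name
    return match.group(1) + match.group(2)


print(shorten('001 Example page.html'))
print(shorten('fp79 https_example.com_page.html'))
```

A name like the `fp79 https_id.sonyentertainmentnetwork.com_...` one in the traceback starts with a letter, so under these assumptions it never matches and keeps its full length.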

It'd be helpful if any filename exceeding the OS-specific filename length limit were always truncated.

Perhaps as part of the script, a file could be generated that maps each original filename to its new filename to avoid confusion.
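One possible shape for such a mapping file is a small CSV written alongside the extracted samples. This is a minimal sketch: the `write_rename_map` function, the column names, and the `renames.csv` filename in the usage line are all made up for illustration.

```python
import csv


def write_rename_map(renames, map_path):
    """Write (original, renamed) filename pairs to a CSV file.

    `renames` is any iterable of 2-tuples; `map_path` is where the
    mapping file should go. Both names are hypothetical.
    """
    with open(map_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['original', 'renamed'])
        writer.writerows(renames)
```

Usage might look like `write_rename_map([('001 Some page.html', '1.html')], 'renames.csv')`. CSV is chosen here only because the `csv` module handles quoting of commas and spaces in the original names.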

Though I hope fathom-unzip will cease to be useful in the near future (because we're no longer generating samples with URLs in their names), if you got use out of this even once, it'd be worth the 5 characters it would likely take to implement it.

Do we still want the tool to shorten filenames the same way for filenames starting with numbers? It seems like we're talking about an additional route the code can take for filenames that are too long and do not start with a number. For that type of filename, the tool could truncate the filename to 10 or 50 or however many characters.

As I said, this tool is quickly going the way of the dodo, but I'll answer nonetheless.

> Do we still want the tool to shorten filenames the same way for filenames starting with numbers?

Yes.

> additional route the code can take for filenames that are too long and do not start with a number. For that type of filename, the tool could truncate the filename to 10 or 50 or however many characters.

Yep. I'd truncate it to the max the FS will allow, which I think is generally 255 (Unicode code points?) for the common modern FSs. I don't really care if we crash if 2 filenames truncate to the same string and collide, as long as we don't quietly overwrite the first one. As I said, this tool is quickly going out of style as my bad decisions fade into the past. :-)
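A rough sketch of that approach, not taken from `fathom-unzip` itself: `truncate_filename` and `write_new_file` are hypothetical helpers. The 255 limit is counted in UTF-8 bytes here as a conservative assumption, since some filesystems measure the limit in bytes rather than characters. Opening the file in exclusive mode (`'x'`) makes a collision raise `FileExistsError` instead of silently overwriting the first file.

```python
import os


def truncate_filename(name, max_bytes=255):
    """Shorten `name` so its UTF-8 encoding fits in `max_bytes`.

    The extension is kept when possible. The default of 255 is an
    assumption, not a value queried from the filesystem.
    """
    if len(name.encode('utf-8')) <= max_bytes:
        return name
    stem, ext = os.path.splitext(name)
    if len(ext.encode('utf-8')) >= max_bytes:
        # Pathological "extension": treat the whole name as the stem.
        stem, ext = name, ''
    budget = max_bytes - len(ext.encode('utf-8'))
    # errors='ignore' drops a multibyte character cut in half by the slice.
    short_stem = stem.encode('utf-8')[:budget].decode('utf-8', errors='ignore')
    return short_stem + ext


def write_new_file(directory, name, data):
    """Write `data` under a truncated name, refusing to overwrite."""
    path = os.path.join(directory, truncate_filename(name))
    # Mode 'xb' fails with FileExistsError if the path already exists,
    # so two names that truncate to the same string crash loudly.
    with open(path, 'xb') as f:
        f.write(data)
    return path
```

On Unix, `os.pathconf(directory, 'PC_NAME_MAX')` could supply the real per-filesystem limit in place of the hard-coded default.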

This tool isn't long for this world.