Trinkle23897 / learn2018-autodown

Automatic course download script for Tsinghua University's new Web Learning platform / A Python script to clone all files from learn.tsinghua.edu.cn

Home Page: https://learn.tsinghua.edu.cn

Some files fail to download

lizy14 opened this issue

Date: Sun, 03 Mar 2019 08:09:41 GMT
Server: Apache/2.4.37 (Unix) mod_jk/1.2.46 Resin/3.0.28
Content-Disposition: =?utf-8?b?YXR0YWNobWVudDsgZmlsZW5hbWU9IjLDpcKG?=
 =?utf-8?b?IC0xIMOkwrjCusOkwrvCgMOkwrnCiMOlwr/Cq8OmwpLCrcOlwr8=?=
 =?utf-8?b?IMOpwqHCu8Omwq3CuyDDp8KOwovDpsKswqPDpcK/?=
 =?utf-8?b?IMOpwqHCu8Olwq3CpsOkwrzCmsOmwpTCuS5kb2N4Ig==?=
Content-Length: 24389
Connection: close
Content-Type: application/octet-stream
Traceback (most recent call last):
  File "C:\_A\learn.py", line 149, in <module>
    sync_file(c)
  File "C:\_A\learn.py", line 114, in sync_file
    download('/b/wlxt/kj/wlkc_kjxxb/student/downloadFile?sfgk=0&wjid=%s' % f[7], f[1])
  File "C:\_A\learn.py", line 89, in download
    filename = st.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 4: invalid start byte

The aforementioned case:

b'attachment; filename="2\xc3\xa5\xc2\x86 -1 \xc3\xa4\xc2\xb8\xc2\xba\xc3\xa4\xc2\xbb\xc2\x80\xc3\xa4\xc2\xb9\xc2\x88\xc3\xa5\xc2\xbf\xc2\xab\xc3\xa6\xc2\x92\xc2\xad\xc3\xa5\xc2\xbf \xc3\xa9\xc2\xa1\xc2\xbb\xc3\xa6\xc2\xad\xc2\xbb \xc3\xa7\xc2\x8e\xc2\x8b\xc3\xa6\xc2\xac\xc2\xa3\xc3\xa5\xc2\xbf \xc3\xa9\xc2\xa1\xc2\xbb\xc3\xa5\xc2\xad\xc2\xa6\xc3\xa4\xc2\xbc\xc2\x9a\xc3\xa6\xc2\x94\xc2\xb9.docx"'

is expected to be parsed into

'2内-1 为什么快播必须死 王欣必须学会改.docx'

Chrome and Edge can both handle it correctly.

I have already told 奶牛老师 about this... I tried a dirty fix at the time, but it did not work out; hand-writing a UTF-8 parser is just too hardcore...

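One option is to split the download functionality into a separate Python 2 script. As far as I remember, Python 2 does not have this problem; it looks like Python 3 does not handle this feature properly.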

Or is there a third-party library for this? I could not find one at the time.

The correct UTF-8 encoding is

'2内-1 为什么快播必须死 王欣必须学会改.docx'.encode('utf8')
b'2\xe5\x86\x85-1 \xe4\xb8\xba\xe4\xbb\x80\xe4\xb9\x88\xe5\xbf\xab\xe6\x92\xad\xe5\xbf\x85\xe9\xa1\xbb\xe6\xad\xbb \xe7\x8e\x8b\xe6\xac\xa3\xe5\xbf\x85\xe9\xa1\xbb\xe5\xad\xa6\xe4\xbc\x9a\xe6\x94\xb9.docx'
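Comparing these bytes with the header bytes reported earlier suggests the filename went through a second round of encoding: every correct UTF-8 byte seems to have been read as a Latin-1 character and encoded as UTF-8 again, and each byte 0x85 was replaced by a space on the way. The sketch below checks this on the first few characters of the name. The Latin-1 round trip is an assumption inferred from the two byte strings, not something the thread confirms, and `undo_double_encoding` is a hypothetical helper.

```python
name = '2内-1'

# Correct UTF-8 bytes, matching the start of the encoding shown above.
correct = name.encode('utf-8')                      # b'2\xe5\x86\x85-1'

# Assumed corruption: read the bytes as Latin-1, then encode as UTF-8 again.
double = correct.decode('latin-1').encode('utf-8')  # b'2\xc3\xa5\xc2\x86\xc2\x85-1'

# Start of the filename as it appears in the reported header.
received = b'2\xc3\xa5\xc2\x86 -1'


def undo_double_encoding(data: bytes) -> str:
    """Reverse one UTF-8 -> Latin-1 -> UTF-8 round trip (hypothetical helper)."""
    return data.decode('utf-8').encode('latin-1').decode('utf-8')


# An intact double-encoded name can be recovered...
recovered = undo_double_encoding(double)

# ...but the reported bytes cannot, because the byte 0x85 is missing.
try:
    undo_double_encoding(received)
    received_ok = True
except UnicodeDecodeError:
    received_ok = False

print(recovered, received_ok)
```

If this reading is right, the name cannot be restored from the header alone, since the lost byte is the third byte of characters such as 内 and 必. One possible explanation is that U+0085 counts as a line boundary for Python's `str.splitlines`, but the thread does not establish where the byte is dropped.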

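I cannot get any further with this, so I will add a fallback for now: when the filename cannot be parsed, use the file's title instead.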
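A minimal sketch of such a fallback, assuming the filename bytes have already been extracted from the Content-Disposition header. The function name `choose_filename` and its parameters are hypothetical and are not taken from learn.py; the actual fix in the repository may look different.

```python
def choose_filename(raw_name: bytes, title: str) -> str:
    """Return a decoded filename, or the title if decoding fails (hypothetical)."""
    try:
        text = raw_name.decode('utf-8')
    except UnicodeDecodeError:
        # The bytes are not valid UTF-8 at all.
        return title
    try:
        # Try to undo an assumed UTF-8 -> Latin-1 -> UTF-8 round trip.
        return text.encode('latin-1').decode('utf-8')
    except UnicodeEncodeError:
        # Characters beyond U+00FF: the name was not double-encoded.
        return text
    except UnicodeDecodeError:
        # Double-encoded but damaged, as in the reported header.
        return title


# The damaged prefix from the report falls back to the title.
print(choose_filename(b'2\xc3\xa5\xc2\x86 -1.docx', 'lecture notes'))
# A plain ASCII name passes through unchanged.
print(choose_filename(b'slides.pdf', 'lecture notes'))
```

This is a heuristic: a correctly encoded name made only of Latin-1 characters such as é would also end up with the title, so a real implementation may want a stricter check.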

I have fixed this issue. You can find it at recent commits: 73e8d83c92ad91, thank you very much!