Fetching blog entries fails
rainssong opened this issue · comments
Environment: Win10 Home
Command executed:
./renrenBackup.exe fetch -e email -p pwd -b
The error shows up somewhere between the 1st and 15th blog entry, so it does not seem to be caused by any specific entry.
Arguments: ()
crawled 9 comments on blog 77xxxxx25
Traceback (most recent call last):
File "manage.py", line 116, in <module>
File "site-packages\flask_script\__init__.py", line 417, in run
File "site-packages\flask_script\__init__.py", line 386, in handle
File "site-packages\flask_script\commands.py", line 216, in __call__
File "manage.py", line 41, in fetch
File "fetch.py", line 99, in fetch_user
File "fetch.py", line 76, in fetch_blog
File "crawl\blog.py", line 83, in get_blogs
File "crawl\blog.py", line 51, in load_blog_list
File "crawl\utils.py", line 103, in get_comments
File "crawl\crawler.py", line 119, in get_json
File "json\__init__.py", line 348, in loads
File "json\decoder.py", line 337, in decode
File "json\decoder.py", line 355, in raw_decode
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
[15240] Failed to execute script manage
I suspect the crawl is too fast and the server can't respond in time. Suggestion: add a retry on failure.
The same bug also shows up while fetching albums.
Could you add a print(resp.text) at line 119 of crawl/crawler.py?
That will show what the server actually returned to make the json parsing fail. This shouldn't be a matter of the server not responding in time: if there were no response at all, the lower-level HTTP request would already raise and be retried.
Without the response body, there's no way to pin down the root cause.
Alternatively, if it's convenient, you can DM me the failing blog id and I'll try fetching it myself.
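For reference, the suggested debugging change could be a small wrapper around the parse (a sketch only — `get_json_debug` and the fake response object are hypothetical names, not the project's actual code; the trailing-comma cleanup mirrors what the traceback shows at crawl/crawler.py):

```python
import json
from types import SimpleNamespace

def get_json_debug(resp):
    """json.loads wrapper: on failure, print the raw body before re-raising."""
    try:
        # same ',}' cleanup the project's get_json applies before parsing
        return json.loads(resp.text.replace(',}', '}'))
    except json.JSONDecodeError:
        print(resp.text)  # show what the server actually returned
        raise

# a body that is well-formed after cleanup parses normally
ok = get_json_debug(SimpleNamespace(text='{"count": 9,}'))
```

When the server returns an HTML error page instead of JSON, the print fires just before the same JSONDecodeError seen in the report, so the offending body lands in the console output.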
The error occurs at a different point each run, so it doesn't look like any specific entry is to blame. With the print added, I got the following output:
<html>
<head><title>500 Servlet Exception</title></head>
<body>
<h1>500 Servlet Exception</h1>
<code><pre>
java.io.FileNotFoundException: /500.jsp
at com.caucho.jsp.PageManager.getPage(PageManager.java:251)
at com.caucho.jsp.PageManager.getPage(PageManager.java:166)
at com.caucho.jsp.QServlet.getSubPage(QServlet.java:298)
at com.caucho.jsp.QServlet.getPage(QServlet.java:210)
at com.caucho.server.dispatch.PageFilterChain.compilePage(PageFilterChain.java:206)
at com.caucho.server.dispatch.PageFilterChain.doFilter(PageFilterChain.java:133)
at com.caucho.server.webapp.DispatchFilterChain.doFilter(DispatchFilterChain.java:115)
at com.caucho.server.dispatch.ServletInvocation.service(ServletInvocation.java:229)
at com.caucho.server.webapp.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:277)
at com.caucho.server.webapp.RequestDispatcherImpl.error(RequestDispatcherImpl.java:113)
at com.caucho.server.webapp.ErrorPageManager.sendServletError(ErrorPageManager.java:362)
at com.caucho.server.webapp.ErrorPageManager.handleErrorStatus(ErrorPageManager.java:558)
at com.caucho.server.webapp.ErrorPageManager.sendError(ErrorPageManager.java:449)
at com.caucho.server.connection.AbstractHttpResponse.sendError(AbstractHttpResponse.java:486)
at com.caucho.server.connection.AbstractHttpResponse.sendError(AbstractHttpResponse.java:440)
at com.caucho.servlets.FileServlet.service(FileServlet.java:259)
at com.caucho.server.dispatch.ServletFilterChain.doFilter(ServletFilterChain.java:106)
at com.caucho.server.cache.CacheFilterChain.doFilter(CacheFilterChain.java:209)
at com.caucho.server.webapp.WebAppFilterChain.doFilter(WebAppFilterChain.java:173)
at com.caucho.server.dispatch.ServletInvocation.service(ServletInvocation.java:229)
at com.caucho.server.http.HttpRequest.handleRequest(HttpRequest.java:274)
at com.caucho.server.port.TcpConnection.run(TcpConnection.java:511)
at com.caucho.util.ThreadPool.runTasks(ThreadPool.java:516)
at com.caucho.util.ThreadPool.run(ThreadPool.java:442)
at java.lang.Thread.run(Thread.java:662)
</pre></code>
<hr /><small>
Resin Professional 3.0.21 (built Thu, 10 Aug 2006 12:17:46 PDT)
</small>
</body></html>
Traceback (most recent call last):
File "manage.py", line 116, in <module>
manager.run()
File "C:\Users\rains\.virtualenvs\renrenBackup-master-90t-rHLO\lib\site-packages\flask_script\__init__.py", line 417, in run
result = self.handle(argv[0], argv[1:])
File "C:\Users\rains\.virtualenvs\renrenBackup-master-90t-rHLO\lib\site-packages\flask_script\__init__.py", line 386, in handle
res = handle(*args, **config)
File "C:\Users\rains\.virtualenvs\renrenBackup-master-90t-rHLO\lib\site-packages\flask_script\commands.py", line 216, in __call__
return self.run(*args, **kwargs)
File "manage.py", line 41, in fetch
fetched = fetch_user(uid, fetch_status=status, fetch_gossip=gossip, fetch_album=album, fetch_blog=blog)
File "D:\Setup\Tool\renrenBackup-master\renrenBackup-master\fetch.py", line 99, in fetch_user
fetch_blog(uid)
File "D:\Setup\Tool\renrenBackup-master\renrenBackup-master\fetch.py", line 76, in fetch_blog
blog_count = crawl_blog.get_blogs(uid)
File "D:\Setup\Tool\renrenBackup-master\renrenBackup-master\crawl\blog.py", line 83, in get_blogs
total = load_blog_list(cur_page, uid)
File "D:\Setup\Tool\renrenBackup-master\renrenBackup-master\crawl\blog.py", line 49, in load_blog_list
get_comments(bid, 'blog', owner=uid)
File "D:\Setup\Tool\renrenBackup-master\renrenBackup-master\crawl\utils.py", line 103, in get_comments
resp_json = crawler.get_json(comment_url, params=param)
File "D:\Setup\Tool\renrenBackup-master\renrenBackup-master\crawl\crawler.py", line 120, in get_json
r = json.loads(resp.text.replace(',}', '}'))
File "D:\Program Files\Python37\Lib\json\__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "D:\Program Files\Python37\Lib\json\decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "D:\Program Files\Python37\Lib\json\decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
OK, I'll add a retry mechanism in get_url then. Nothing like this has been reported before; I also need to test for myself whether response.status_code comes back as 500 or 200 when the server serves that 500 error page.
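A minimal retry sketch covering both possibilities raised above (a real 500 status, or a 200 that carries an HTML error page). This is an illustration only: `get_json_with_retry` and the `fetch` callable are hypothetical names, not the project's actual `get_url`.

```python
import json
import time

def get_json_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) until the body parses as JSON, retrying otherwise.

    `fetch` is any callable returning an object with .status_code and .text
    (e.g. a bound requests.Session().get). A JSONDecodeError on a 200 response
    covers the case where the server ships an HTML error page with an OK status.
    """
    last_exc = None
    for attempt in range(retries):
        resp = fetch(url)
        if resp.status_code == 200:
            try:
                # same trailing-comma cleanup as the existing get_json
                return json.loads(resp.text.replace(',}', '}'))
            except json.JSONDecodeError as exc:
                last_exc = exc  # HTML error page served with a 200 status
        if attempt < retries - 1:
            time.sleep(delay)  # back off before the next attempt
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_exc

# demo: first attempt returns the Resin 500 page, second returns valid JSON
calls = []
def fake_fetch(url):
    calls.append(url)
    class R:
        pass
    r = R()
    if len(calls) == 1:
        r.status_code, r.text = 200, "<html>500 Servlet Exception</html>"
    else:
        r.status_code, r.text = 200, '{"ok": true,}'
    return r

result = get_json_with_retry(fake_fetch, "http://example/comments", delay=0)
```

Retrying at this level (after the body is fetched) is what catches the reported failure mode, since a 200-with-HTML response would sail past any status-code check in the lower-level HTTP layer.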