whusnoopy / renrenBackup

A backup tool for renren.com

Fetching blogs fails

rainssong opened this issue · comments

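Environment: Windows 10 Home

Command executed:
./renrenBackup.exe fetch -e email -p pwd -b

The error shows up somewhere between the 1st and the 15th blog fetched, so it is probably not caused by one specific blog.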

Arguments: ()
crawled 9 comments on blog 77xxxxx25
Traceback (most recent call last):
File "manage.py", line 116, in <module>
File "site-packages\flask_script\__init__.py", line 417, in run
File "site-packages\flask_script\__init__.py", line 386, in handle
File "site-packages\flask_script\commands.py", line 216, in __call__
File "manage.py", line 41, in fetch
File "fetch.py", line 99, in fetch_user
File "fetch.py", line 76, in fetch_blog
File "crawl\blog.py", line 83, in get_blogs
File "crawl\blog.py", line 51, in load_blog_list
File "crawl\utils.py", line 103, in get_comments
File "crawl\crawler.py", line 119, in get_json
File "json\__init__.py", line 348, in loads
File "json\decoder.py", line 337, in decode
File "json\decoder.py", line 355, in raw_decode
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
[15240] Failed to execute script manage

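I suspect the crawl is too fast and the server cannot respond in time. Suggestion: retry on failure.

The same bug also appears while fetching albums.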

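I suggest adding a print(resp.text) at line 119 of crawl/crawler.py to see what is actually being returned and causing the JSON parsing to fail. This should have nothing to do with the server not responding in time: if nothing came back, the lower-level HTTP request would raise an error and be retried.

Without the returned content there is no way to determine the exact cause.

Or, if it is convenient, you can send me the failing blog id by private message and I will try fetching it myself.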
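
To illustrate the debugging step suggested above, here is a minimal sketch that parses a response body and, when parsing fails, prints the beginning of whatever came back. The helper name `loads_or_dump` and the `preview` parameter are hypothetical and not part of renrenBackup; only the `replace(',}', '}')` cleanup is taken from the crawler.py line shown in the traceback.

```python
import json


def loads_or_dump(text, preview=200):
    """Parse a response body as JSON; on failure, show what actually came back.

    Hypothetical helper for illustration, not the project's own code.
    """
    try:
        # Same trailing-comma cleanup that appears in the traceback for crawler.py.
        return json.loads(text.replace(',}', '}'))
    except json.JSONDecodeError:
        # Print only the start of the body so an HTML error page is recognisable.
        print('non-JSON response body:', text[:preview])
        raise
```

Called as `loads_or_dump(resp.text)`, this would keep the original exception but make the offending body visible in the log.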

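The error does not occur at a consistent point, so it does not look like it is caused by a specific blog. After adding the print, I got the following output: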

<html>
<head><title>500 Servlet Exception</title></head>
<body>
<h1>500 Servlet Exception</h1>
<code><pre>
java.io.FileNotFoundException: /500.jsp
        at com.caucho.jsp.PageManager.getPage(PageManager.java:251)
        at com.caucho.jsp.PageManager.getPage(PageManager.java:166)
        at com.caucho.jsp.QServlet.getSubPage(QServlet.java:298)
        at com.caucho.jsp.QServlet.getPage(QServlet.java:210)
        at com.caucho.server.dispatch.PageFilterChain.compilePage(PageFilterChain.java:206)
        at com.caucho.server.dispatch.PageFilterChain.doFilter(PageFilterChain.java:133)
        at com.caucho.server.webapp.DispatchFilterChain.doFilter(DispatchFilterChain.java:115)
        at com.caucho.server.dispatch.ServletInvocation.service(ServletInvocation.java:229)
        at com.caucho.server.webapp.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:277)
        at com.caucho.server.webapp.RequestDispatcherImpl.error(RequestDispatcherImpl.java:113)
        at com.caucho.server.webapp.ErrorPageManager.sendServletError(ErrorPageManager.java:362)
        at com.caucho.server.webapp.ErrorPageManager.handleErrorStatus(ErrorPageManager.java:558)
        at com.caucho.server.webapp.ErrorPageManager.sendError(ErrorPageManager.java:449)
        at com.caucho.server.connection.AbstractHttpResponse.sendError(AbstractHttpResponse.java:486)
        at com.caucho.server.connection.AbstractHttpResponse.sendError(AbstractHttpResponse.java:440)
        at com.caucho.servlets.FileServlet.service(FileServlet.java:259)
        at com.caucho.server.dispatch.ServletFilterChain.doFilter(ServletFilterChain.java:106)
        at com.caucho.server.cache.CacheFilterChain.doFilter(CacheFilterChain.java:209)
        at com.caucho.server.webapp.WebAppFilterChain.doFilter(WebAppFilterChain.java:173)
        at com.caucho.server.dispatch.ServletInvocation.service(ServletInvocation.java:229)
        at com.caucho.server.http.HttpRequest.handleRequest(HttpRequest.java:274)
        at com.caucho.server.port.TcpConnection.run(TcpConnection.java:511)
        at com.caucho.util.ThreadPool.runTasks(ThreadPool.java:516)
        at com.caucho.util.ThreadPool.run(ThreadPool.java:442)
        at java.lang.Thread.run(Thread.java:662)
</pre></code>
<hr /><small>
Resin Professional 3.0.21 (built Thu, 10 Aug 2006 12:17:46 PDT)
</small>
</body></html>

Traceback (most recent call last):
  File "manage.py", line 116, in <module>
    manager.run()
  File "C:\Users\rains\.virtualenvs\renrenBackup-master-90t-rHLO\lib\site-packages\flask_script\__init__.py", line 417, in run
    result = self.handle(argv[0], argv[1:])
  File "C:\Users\rains\.virtualenvs\renrenBackup-master-90t-rHLO\lib\site-packages\flask_script\__init__.py", line 386, in handle
    res = handle(*args, **config)
  File "C:\Users\rains\.virtualenvs\renrenBackup-master-90t-rHLO\lib\site-packages\flask_script\commands.py", line 216, in __call__
    return self.run(*args, **kwargs)
  File "manage.py", line 41, in fetch
    fetched = fetch_user(uid, fetch_status=status, fetch_gossip=gossip, fetch_album=album, fetch_blog=blog)
  File "D:\Setup\Tool\renrenBackup-master\renrenBackup-master\fetch.py", line 99, in fetch_user
    fetch_blog(uid)
  File "D:\Setup\Tool\renrenBackup-master\renrenBackup-master\fetch.py", line 76, in fetch_blog
    blog_count = crawl_blog.get_blogs(uid)
  File "D:\Setup\Tool\renrenBackup-master\renrenBackup-master\crawl\blog.py", line 83, in get_blogs
    total = load_blog_list(cur_page, uid)
  File "D:\Setup\Tool\renrenBackup-master\renrenBackup-master\crawl\blog.py", line 49, in load_blog_list
    get_comments(bid, 'blog', owner=uid)
  File "D:\Setup\Tool\renrenBackup-master\renrenBackup-master\crawl\utils.py", line 103, in get_comments
    resp_json = crawler.get_json(comment_url, params=param)
  File "D:\Setup\Tool\renrenBackup-master\renrenBackup-master\crawl\crawler.py", line 120, in get_json
    r = json.loads(resp.text.replace(',}', '}'))
  File "D:\Program Files\Python37\Lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "D:\Program Files\Python37\Lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "D:\Program Files\Python37\Lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

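I will add a retry mechanism in get_url. Nobody has reported a similar problem before. I may also need to test for myself whether, when this 500 error occurs, response.status_code comes back as 500 or as 200.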
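
A rough sketch of what such a retry could look like, written so that it works whether the error page arrives with status 500 or with status 200. This is not the change made in the project; `get_json_with_retry`, `fetch`, `retries` and `delay` are hypothetical names, and `fetch` stands in for whatever performs the HTTP request and returns `(status_code, text)`.

```python
import json
import time


def get_json_with_retry(fetch, retries=3, delay=1.0):
    """Call fetch() until it yields a parseable JSON body or retries run out.

    fetch is a hypothetical zero-argument callable returning (status_code, text).
    """
    if retries < 1:
        raise ValueError('retries must be at least 1')
    last_error = None
    for attempt in range(1, retries + 1):
        status_code, text = fetch()
        if status_code == 200:
            try:
                return json.loads(text.replace(',}', '}'))
            except json.JSONDecodeError as exc:
                # Covers an HTML error page served with status 200.
                last_error = exc
        else:
            # Covers an error page served with a non-200 status such as 500.
            last_error = RuntimeError('HTTP status %d' % status_code)
        if attempt < retries:
            # Pause before the next attempt to avoid hammering the server.
            time.sleep(delay)
    raise last_error
```

Checking both the status code and the parse result means the retry does not depend on the answer to the 500-versus-200 question.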

updated with 9ce7061