kezhenxu94 / house-renting

Possibly the best practice of Scrapy 🕷 and renting a house 🏡

The wiki should also document two Elasticsearch startup errors

hao-lee opened this issue · comments

My environment is Fedora 27 x86_64; the problems I ran into and their fixes are described below.

Permission denied on the data directory

Key error message: Caused by: java.nio.file.AccessDeniedException: /usr/share/elasticsearch/data/nodes

This error is caused by insufficient permissions on the data directory: change data/elastic/ under the repository root to mode 777 and it works. Crude but effective; see the commands below.

Reference: https://stackoverflow.com/q/41497520/4112667
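
A minimal sketch of the fix. The chown alternative assumes the elasticsearch user inside the container runs as UID/GID 1000, which is the default in the official 6.x images:

# Quick fix from above: make the host-mounted data directory world-writable
chmod -R 777 data/elastic/

# Less blunt alternative (assumption: the container user is UID/GID 1000)
chown -R 1000:1000 data/elastic/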

The full error output:

[2018-09-06T12:53:10,296][INFO ][o.e.n.Node               ] [] initializing ...
[2018-09-06T12:53:10,448][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: Failed to create node environment
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:125) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:112) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124) ~[elasticsearch-cli-6.2.4.jar:6.2.4]
	at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-6.2.4.jar:6.2.4]
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:85) ~[elasticsearch-6.2.4.jar:6.2.4]
Caused by: java.lang.IllegalStateException: Failed to create node environment
	at org.elasticsearch.node.Node.<init>(Node.java:267) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.node.Node.<init>(Node.java:246) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:213) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:213) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:323) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:121) ~[elasticsearch-6.2.4.jar:6.2.4]
	... 6 more
Caused by: java.nio.file.AccessDeniedException: /usr/share/elasticsearch/data/nodes
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:?]
	at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384) ~[?:?]
	at java.nio.file.Files.createDirectory(Files.java:674) ~[?:1.8.0_161]
	at java.nio.file.Files.createAndCheckIsDirectory(Files.java:781) ~[?:1.8.0_161]
	at java.nio.file.Files.createDirectories(Files.java:767) ~[?:1.8.0_161]
	at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:204) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.node.Node.<init>(Node.java:264) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.node.Node.<init>(Node.java:246) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:213) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:213) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:323) ~[elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:121) ~[elasticsearch-6.2.4.jar:6.2.4]
	... 6 more

Virtual memory map limit too low

Key error message: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

This error occurs because the kernel's limit on the number of memory map areas a process may have (vm.max_map_count) is too low for Elasticsearch. Run the following command to fix it:

sysctl -w vm.max_map_count=262144

Reference: docker-library/elasticsearch#111 (comment)
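
Note that sysctl -w only changes the running kernel, so the setting is lost on reboot. To persist it, the value can go into a file under /etc/sysctl.d/ (the conventional location on systemd distributions such as Fedora; the file name below is only an example):

# Persist the setting across reboots
echo 'vm.max_map_count=262144' | sudo tee /etc/sysctl.d/99-elasticsearch.conf
sudo sysctl --system   # reload all sysctl configuration files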

The full error output:

[2018-09-06T13:31:11,112][INFO ][o.e.n.Node               ] [] initializing ...
[2018-09-06T13:31:11,310][INFO ][o.e.e.NodeEnvironment    ] [shmG9r4] using [1] data paths, mounts [[/usr/share/elasticsearch/data (/dev/mapper/fedora-root)]], net usable_space [14.5gb], net total_space [19.5gb], types [ext4]
[2018-09-06T13:31:11,311][INFO ][o.e.e.NodeEnvironment    ] [shmG9r4] heap size [494.9mb], compressed ordinary object pointers [true]
[2018-09-06T13:31:11,315][INFO ][o.e.n.Node               ] node name [shmG9r4] derived from node ID [shmG9r4iTYWkz7FJUJJIoA]; set [node.name] to override
[2018-09-06T13:31:11,316][INFO ][o.e.n.Node               ] version[6.2.4], pid[1], build[ccec39f/2018-04-12T20:37:28.497551Z], OS[Linux/4.17.17-100.fc27.x86_64/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_161/25.161-b14]
[2018-09-06T13:31:11,316][INFO ][o.e.n.Node               ] JVM arguments [-Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.io.tmpdir=/tmp/elasticsearch.x1oF74x1, -XX:+HeapDumpOnOutOfMemoryError, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintTenuringDistribution, -XX:+PrintGCApplicationStoppedTime, -Xloggc:logs/gc.log, -XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles=32, -XX:GCLogFileSize=64m, -Des.cgroups.hierarchy.override=/, -Xms512m, -Xmx512m, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/usr/share/elasticsearch/config]
[2018-09-06T13:31:13,480][INFO ][o.e.p.PluginsService     ] [shmG9r4] loaded module [aggs-matrix-stats]
[2018-09-06T13:31:13,481][INFO ][o.e.p.PluginsService     ] [shmG9r4] loaded module [analysis-common]
[2018-09-06T13:31:13,481][INFO ][o.e.p.PluginsService     ] [shmG9r4] loaded module [ingest-common]
[2018-09-06T13:31:13,481][INFO ][o.e.p.PluginsService     ] [shmG9r4] loaded module [lang-expression]
[2018-09-06T13:31:13,481][INFO ][o.e.p.PluginsService     ] [shmG9r4] loaded module [lang-mustache]
[2018-09-06T13:31:13,481][INFO ][o.e.p.PluginsService     ] [shmG9r4] loaded module [lang-painless]
[2018-09-06T13:31:13,481][INFO ][o.e.p.PluginsService     ] [shmG9r4] loaded module [mapper-extras]
[2018-09-06T13:31:13,482][INFO ][o.e.p.PluginsService     ] [shmG9r4] loaded module [parent-join]
[2018-09-06T13:31:13,482][INFO ][o.e.p.PluginsService     ] [shmG9r4] loaded module [percolator]
[2018-09-06T13:31:13,482][INFO ][o.e.p.PluginsService     ] [shmG9r4] loaded module [rank-eval]
[2018-09-06T13:31:13,482][INFO ][o.e.p.PluginsService     ] [shmG9r4] loaded module [reindex]
[2018-09-06T13:31:13,482][INFO ][o.e.p.PluginsService     ] [shmG9r4] loaded module [repository-url]
[2018-09-06T13:31:13,482][INFO ][o.e.p.PluginsService     ] [shmG9r4] loaded module [transport-netty4]
[2018-09-06T13:31:13,482][INFO ][o.e.p.PluginsService     ] [shmG9r4] loaded module [tribe]
[2018-09-06T13:31:13,483][INFO ][o.e.p.PluginsService     ] [shmG9r4] loaded plugin [ingest-geoip]
[2018-09-06T13:31:13,483][INFO ][o.e.p.PluginsService     ] [shmG9r4] loaded plugin [ingest-user-agent]
[2018-09-06T13:31:18,990][INFO ][o.e.d.DiscoveryModule    ] [shmG9r4] using discovery type [zen]
[2018-09-06T13:31:19,756][INFO ][o.e.n.Node               ] initialized
[2018-09-06T13:31:19,756][INFO ][o.e.n.Node               ] [shmG9r4] starting ...
[2018-09-06T13:31:19,955][INFO ][o.e.t.TransportService   ] [shmG9r4] publish_address {172.24.0.2:9300}, bound_addresses {0.0.0.0:9300}
[2018-09-06T13:31:19,974][INFO ][o.e.b.BootstrapChecks    ] [shmG9r4] bound or publishing to a non-loopback address, enforcing bootstrap checks
ERROR: [1] bootstrap checks failed
[1]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
[2018-09-06T13:31:19,994][INFO ][o.e.n.Node               ] [shmG9r4] stopping ...
[2018-09-06T13:31:20,041][INFO ][o.e.n.Node               ] [shmG9r4] stopped
[2018-09-06T13:31:20,042][INFO ][o.e.n.Node               ] [shmG9r4] closing ...
[2018-09-06T13:31:20,057][INFO ][o.e.n.Node               ] [shmG9r4] closed

One more addition:

Crawler exits abnormally

Key error message:

    def write(self, data, async=False):
                              ^
SyntaxError: invalid syntax

This error occurs because twisted, which scrapy depends on, does not support Python 3.7: async became a reserved keyword in 3.7, so twisted's manhole.py no longer parses. Upstream has not fixed this yet, so there is nothing to be done on this side for now.

The full error output:

2018-09-06 13:37:19 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: house_renting)
2018-09-06 13:37:19 [scrapy.utils.log] INFO: Overridden settings: {'AUTOTHROTTLE_DEBUG': True, 'AUTOTHROTTLE_ENABLED': True, 'AUTOTHROTTLE_MAX_DELAY': 10, 'AUTOTHROTTLE_START_DELAY': 10, 'AUTOTHROTTLE_TARGET_CONCURRENCY': 2.0, 'BOT_NAME': 'house_renting', 'COMMANDS_MODULE': 'house_renting.commands', 'CONCURRENT_REQUESTS_PER_DOMAIN': 1, 'COOKIES_ENABLED': False, 'DOWNLOAD_DELAY': 10, 'DOWNLOAD_TIMEOUT': 30, 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'house_renting.spiders', 'RETRY_TIMES': 3, 'SPIDER_MODULES': ['house_renting.spiders'], 'TELNETCONSOLE_ENABLED': False, 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1 Safari/605.1.15 '}
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 149, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 156, in _run_command
    cmd.run(args, opts)
  File "/house-renting/crawler/house_renting/commands/crawl.py", line 17, in run
    self.crawler_process.crawl(spider_name, **opts.spargs)
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 167, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 195, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 200, in _create_crawler
    return Crawler(spidercls, self.settings)
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 52, in __init__
    self.extensions = ExtensionManager.from_crawler(self)
  File "/usr/local/lib/python3.7/site-packages/scrapy/middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/usr/local/lib/python3.7/site-packages/scrapy/middleware.py", line 34, in from_settings
    mwcls = load_object(clspath)
  File "/usr/local/lib/python3.7/site-packages/scrapy/utils/misc.py", line 44, in load_object
    mod = import_module(module)
  File "/usr/local/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/lib/python3.7/site-packages/scrapy/extensions/telnet.py", line 12, in <module>
    from twisted.conch import manhole, telnet
  File "/usr/local/lib/python3.7/site-packages/twisted/conch/manhole.py", line 154
    def write(self, data, async=False):
                              ^
SyntaxError: invalid syntax

Thanks, I will add these to the wiki later.

Pulling the redis image is very slow

Switch to a Docker registry mirror inside China: open (or create) the file /etc/docker/daemon.json and write the following:

{
    "registry-mirrors": [
        "http://18817714.m.daocloud.io"
    ],
    "insecure-registries": []
}

Note: I found the URL http://18817714.m.daocloud.io on some blogger's site; if it stops working, apply for your own mirror.
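
Docker then needs to be restarted for the mirror to take effect; a sketch assuming a systemd-managed Docker, as on Fedora:

# Restart Docker so daemon.json is re-read
sudo systemctl restart docker

# Verify that the mirror is active
docker info | grep -A1 'Registry Mirrors'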

@kezhenxu94 I'm not very familiar with Docker. Is there a way to pin the Python version to 3.6? Otherwise the crawler can't run.

Solved it: specifying the Python version in the Dockerfile is enough.
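
For reference, a minimal sketch of the pin, assuming the crawler image is built from the official python base image (requirements.txt and the COPY layout are illustrative; adapt them to the project's actual Dockerfile):

# Pin the base image to 3.6 so twisted's manhole.py still parses
# (async became a reserved keyword in Python 3.7)
FROM python:3.6

WORKDIR /house-renting/crawler
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .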

@hao-lee The Python version has been changed.

@kezhenxu94 After struggling with this for two evenings, tonight it took me 20 minutes to finally get everything running, and it is unbelievably powerful. I had long heard that Elasticsearch was a formidable tool, second to none, and having seen it today it lives up to the reputation. I think this project is great and extremely useful for renting: I was going to write my own crawler, but it would never have been this complete. The project deserves to keep growing; the code itself is a good learning example, and in practical terms it beats the agencies hands down.

My Docker knowledge is learn-as-I-go beginner stuff, but I do know Python, so I want to study the code and submit some PRs over time. I'd also like to improve the documentation and promote the project on Zhihu; its value deserves to be discovered. Down the road it would be worth setting up an Organization dedicated to maintaining it.

There is now a WeChat official account called "暖房" that uses machine learning to filter out agency listings; I think that's worth borrowing from, but that's a topic for later.

In short, this project is awesome. Before I got it running it seemed like a hassle; once it was up, it blew me away.

Oh, man, have you ever heard of the English language?.. :)

@andkirby emmm... As this is a tool for renting a house in China, I think it may not be very useful to people from the US or other countries... 😅

@hao-lee, yeah, indeed. :D I found out what it's for later. :)
I just hit the same error here and would have liked to find a solution, but... :) Anyway, Google Translate works fine, and sometimes it's funny. :^D

@andkirby 😅 Okay......

A question: I deployed this to Azure with k8s and ran into the same problem, but surely I shouldn't have to go and change the setting on every node by hand?
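
Assuming it is the vm.max_map_count bootstrap check, a common pattern (a sketch, not tested on Azure; names are illustrative) is a privileged init container on the Elasticsearch pod, so the sysctl is raised on whichever node the pod lands on:

# Fragment of the Elasticsearch pod spec
initContainers:
  - name: set-max-map-count
    image: busybox
    command: ["sysctl", "-w", "vm.max_map_count=262144"]
    securityContext:
      privileged: true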