rjurney / Agile_Data_Code_2

Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition

Home Page: http://bit.ly/agile_data_science

chapter 4 save from pyspark to MongoDB exhausts Vagrant VM memory

pjhinton opened this issue

In Chapter 4, where flight data is published to MongoDB from pyspark:

https://learning.oreilly.com/a/agile-data-science/21213057/

if I use a Vagrant image based on the ubuntu/bionic64 box, built from this source tree:

pjhinton/Agile_Data_Code_2@9fe4c71

MongoDB runs out of memory:

Dec 21 17:09:02 ubuntu-bionic mongod[29354]: src/central_freelist.cc:333] tcmalloc: allocation failed 32768
Dec 21 17:09:03 ubuntu-bionic mongod[29354]: message repeated 2 times: [ src/central_freelist.cc:333] tcmalloc: allocation failed 32768]
Dec 21 17:09:06 ubuntu-bionic systemd[1]: mongodb.service: Main process exited, code=exited, status=14/n/a
Dec 21 17:09:06 ubuntu-bionic systemd[1]: mongodb.service: Failed with result 'exit-code'.

The version of MongoDB in use is the one supplied by Ubuntu:

# dpkg -l mongodb
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                          Version                     Architecture                Description
+++-=============================================-===========================-===========================-================================================================================================
ii  mongodb

Versions of pymongo and pymongo-spark in use are:

$ pip show pymongo
Name: pymongo
Version: 3.7.2
Summary: Python driver for MongoDB <http://www.mongodb.org>
Home-page: http://github.com/mongodb/mongo-python-driver
Author: Bernie Hackett
Author-email: bernie@mongodb.com
License: Apache License, Version 2.0
Location: /home/vagrant/anaconda/lib/python3.5/site-packages
Requires: 
Required-by: pymongo-spark
$ pip show pymongo-spark
Name: pymongo-spark
Version: 0.1.dev0
Summary: Utilities for using Spark with PyMongo
Home-page: https://github.com/mongodb/mongo-hadoop
Author: MongoDB, Inc.
Author-email: mongodb-user@googlegroups.com
License: http://www.apache.org/licenses/LICENSE-2.0.html
Location: /home/vagrant/anaconda/lib/python3.5/site-packages/pymongo_spark-0.1.dev0-py3.5.egg
Requires: pymongo
Required-by: 

The version of Spark is the one installed by bootstrap.sh: 2.2.1.
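
For context, the save step in question can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the book's exact code: it relies on pymongo-spark's `activate()` patching RDDs with `saveToMongoDB()`, and the helper names `mongo_uri` and `save_parquet_to_mongo`, along with any path, database, or collection name passed to them, are hypothetical.

```python
def mongo_uri(database, collection, host="localhost", port=27017):
    """Build a connection string of the form mongodb://host:port/db.collection.

    Hypothetical helper; host and port defaults are assumptions.
    """
    return "mongodb://{}:{}/{}.{}".format(host, port, database, collection)


def save_parquet_to_mongo(spark, parquet_path, uri):
    """Load a Parquet file with Spark and write each row to MongoDB.

    Hypothetical helper; `spark` is an existing SparkSession.
    """
    # Imported lazily so mongo_uri() can be used without Spark installed.
    import pymongo_spark

    # activate() adds saveToMongoDB() to Spark's RDD class.
    pymongo_spark.activate()

    dataframe = spark.read.parquet(parquet_path)

    # Convert each Row to a plain dict so it can be stored as a document.
    as_dicts = dataframe.rdd.map(lambda row: row.asDict())

    # This is the write that pushes the whole data set into mongod.
    as_dicts.saveToMongoDB(uri)
```

From a pyspark shell this would be called as something like `save_parquet_to_mongo(spark, "data/on_time_performance.parquet", mongo_uri("agile_data_science", "on_time_performance"))`, where the file path and the database and collection names are assumed for illustration.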