Detailed instructions for setup on Ubuntu are described in a separate section below. Other distributions may have slightly different requirements.
For Windows, I recommend using Anaconda to manage your Python environments because it comes with many packages preinstalled that are difficult to set up on Windows otherwise. For Mac and Linux, you can use it if you'd like, but it's less necessary.
Clone the ClipboardApp repository into your preferred directory with Git Bash on Windows or a normal terminal otherwise: git clone https://github.com/ClipboardProject/ClipboardApp.git
For Windows Home, download from here. Documentation is here.
For Windows Professional or Enterprise, download from here. Documentation is here.
For Mac, download from here. Documentation is here.
For Linux, download from your package manager. Documentation is here (Other distros have links on the left side of the page).
Make sure you follow any OS and distro-specific instructions for setting up Docker. It may be helpful to go through the getting started guide here.
If you're new to Docker or you're recovering from a failed installation attempt, it's best to start by uninstalling older versions of Docker: sudo apt-get remove docker docker-engine docker.io
Run: sudo apt-get update
Install the following packages:
sudo apt-get install apt-transport-https
sudo apt-get install ca-certificates
sudo apt-get install curl
sudo apt-get install software-properties-common
These packages allow apt to use a repository over HTTPS.
Add Docker's official GNU Privacy Guard (GPG) key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
This should print "OK" to the terminal.
Run: sudo apt-key fingerprint 0EBFCD88
Verify that the Key Fingerprint line shows: 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88
Set up the stable Docker repository:
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
Run sudo apt-get update again.
Install the latest version of Docker CE: sudo apt-get install docker-ce
If there were problems during the installation, try removing docker and starting over.
sudo apt-get purge docker-ce
sudo rm -rf /var/lib/docker
Run: sudo curl -L https://github.com/docker/compose/releases/download/1.21.2/docker-compose-$(uname -s)-$(uname -m) -o /usr/local/bin/docker-compose
Add executable permissions to the docker-compose binary: sudo chmod +x /usr/local/bin/docker-compose
Run docker-compose --version to verify it installed correctly. It should show a version and build number similar to: docker-compose version 1.21.2, build 1719ceb
If the docker-compose command doesn't work, make sure the directory containing the binary is on your PATH by adding the following line to your ~/.bashrc file:
export PATH="/usr/local/bin:$PATH"
Close and reopen your terminal(s) to apply the changes.
I was unable to get Kitematic to work on Docker Toolbox, so I would recommend skipping it. Make sure virtualization is enabled in the BIOS. If you need to change virtualization settings, do a full reboot cycle; otherwise Windows may not report that the settings have changed. If you're running Windows 10 Professional, make sure Hyper-V is enabled in the "Turn Windows Features On or Off" dialog. If you're using Docker Toolbox on Windows Home edition, start the VirtualBox instance manually before starting Docker each time Windows boots, or Docker will complain about not having an IP address.
For Docker Toolbox on Windows Home, go to the environment variables section in the control panel. Look for a variable called DOCKER_HOST. Add another variable called DOCKER_IP with the same value as DOCKER_HOST, but with the tcp:// prefix and the port number removed. For example, if DOCKER_HOST is tcp://192.168.1.11:2376, DOCKER_IP should be 192.168.1.11. Add another variable called DB_CLIENT_IP with a value of localhost.
For Docker on Windows Professional, follow those same steps, except both DOCKER_IP and DB_CLIENT_IP should be localhost.
Add the lines export DOCKER_IP=localhost and export DB_CLIENT_IP=localhost to your ~/.bashrc file. If you haven't used your .bashrc file before, you may need to source it. To do so, add
if [ -f ~/.bashrc ]; then
    source ~/.bashrc
fi
to your ~/.bash_profile.
Add the lines DOCKER_IP=localhost and DB_CLIENT_IP=localhost to /etc/environment. This file may be in a different location on some distros.
If you are using Linux, all of the subsequent Docker commands in this guide must be run with sudo. If you would like to be able to use Docker without sudo, look through the answers here. You will also need to move your environment variables to your ~/.bashrc file if you go this route.
Verify that Docker installed correctly with docker run hello-world. You should see "Hello from Docker!"
After setting the environment variables, close all terminals and every editor or process that's running Python. This is needed to ensure that your Python environment correctly reloads your environment variables. The DOCKER_IP variable is necessary because the versions of Docker that run on VirtualBox generate an IP address that is dependent on the host's configuration, so the code will read this variable to know where to make HTTP requests. The DB_CLIENT_IP variable is necessary because the client needs to run on 0.0.0.0 inside the Docker container, which is the internal IP address that Docker containers use to communicate with each other, but it needs to run on localhost outside the container.
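As an illustration of why the code reads these variables, here is a minimal sketch of how a component might pick them up with os.environ. The fallback values, port number, and endpoint path are invented for the example; only the variable names come from this guide.

```python
import os

# DOCKER_IP tells the code where to send HTTP requests (VirtualBox-based
# Docker generates a host-dependent IP); DB_CLIENT_IP covers the
# inside-container (0.0.0.0) vs outside-container (localhost) split.
docker_ip = os.environ.get("DOCKER_IP", "localhost")
db_client_ip = os.environ.get("DB_CLIENT_IP", "localhost")

# Hypothetical endpoint and port, purely for illustration.
events_url = f"http://{docker_ip}:5000/getevents"
print(events_url)
```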
Open a Docker terminal on Windows Home, Git Bash or some kind of bash emulator on Windows Professional, or a normal terminal otherwise, and cd into the Git repo. Run ./start.bash. If you get a permissions error, you may need to run chmod +x start.bash to grant execution permissions to the file. If all goes well, the database will be created, the scrapers will start running, and the website will start up. This process will take some time.
Eventually, you should start seeing messages like POST /putevents. That means the data is being saved. Once a message says Data retrieved successfully, the code is done running. You can go to your Docker IP on Windows (usually 192.168.99.100), or localhost otherwise, to see the site.
If you want to see the data in the database, download Robo 3T from here using the link on the right. You can use another MongoDB client if you'd prefer. When you start Robo 3T, a popup to configure connections should appear. If you're on Windows Home, right click on the New Connection row and click "Edit". Change localhost to your Docker IP and hit "Save". Now press "Connect". Otherwise, leave the settings alone. It should connect successfully and you should see a database called Clipboard on the pane on the right. The database should contain a collection called "event" and an index that includes the start and end timestamp, along with other field(s).
On Linux, you'll probably want to move the extracted Robo 3T folder to /opt/your-extracted-folder-name and run sudo ln -s /opt/your-extracted-folder-name/bin/robo3t /usr/local/bin/robo3t. Then you can run robo3t from the command line.
When the program is finished running, go to your Robo 3T instance, right click on the "event" collection, and click "View Documents". The screen should populate with data.
For debugging, you'll want to run some or all of the code locally instead of inside Docker. To set up the dependencies, cd into the ClipboardApp repository.
On Windows, you'll need to have Visual Studio installed, because Scrapy depends on Visual Studio's C++ compiler.
If you're using Anaconda, open an Anaconda terminal and run conda install -c anaconda python=3.7 to ensure you're running Python 3.7, then run anaconda-install.sh.
Otherwise, run ./install.sh. You can use the shell script up-db-and-client.sh to run the database and database client inside Docker, allowing you to run the other components locally instead of inside Docker. up-db-only.sh only runs the database inside Docker, allowing you to run the data engine and database client locally. These scripts are just shortcuts for running specific docker-compose up commands. You can use docker-compose up with any combination of services, like this: docker-compose up clipboard_db clipboard_site to run just the database and site in Docker.
IMPORTANT: When running a component from an IDE or text editor, you must have the component folder (e.g. data_engine) set as the base project folder. Opening the entire clipboard app folder will not work because of how Python looks for files to import.
The following settings are defined in data_engine/config.py:
- ENABLE_API_CACHE: If True, any API calls made will be cached to a local file. This is useful to speed up development and to prevent hitting sites repeatedly.
- API_CACHE_EXPIRATION: Time in seconds that API data will be cached for.
- API_DELAY_SECONDS: The amount of time between API calls. This is used by calling ApiBase.wait(). This is necessary when making large numbers of API calls in quick succession so as not to overrun the server.
- ENABLE_SCRAPY_CACHE: If True, any Scrapy calls made will be cached using Scrapy's builtin cache system. This is useful to speed up development and to prevent hitting sites repeatedly.
- SCRAPY_CACHE_EXPIRATION: Time in seconds that Scrapy data will be cached for.
- VERBOSE_SCRAPY_OUTPUT: If True, Scrapy will show verbose logs during the scraping process. This may be useful for debugging, but the vast amount of output makes it difficult to spot errors as they occur.
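As a rough sketch, the settings above might look like this in data_engine/config.py. The values shown here are illustrative, not the repo's actual defaults:

```python
# Illustrative values only -- check data_engine/config.py for the real ones.
ENABLE_API_CACHE = True                # cache API responses to a local file
API_CACHE_EXPIRATION = 24 * 60 * 60    # one day, in seconds
API_DELAY_SECONDS = 2                  # pause between API calls via ApiBase.wait()
ENABLE_SCRAPY_CACHE = True             # use Scrapy's builtin HTTP cache
SCRAPY_CACHE_EXPIRATION = 24 * 60 * 60
VERBOSE_SCRAPY_OUTPUT = False          # keep scraping logs quiet
```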
clipboard_common_lib/clipboardcommonlib/shared_config.py contains settings that are shared by multiple services. This file is distributed as a pip package. After modifying this file, re-install it with clipboard_common_lib/install-common-libs.sh.
Our current development tasks and bugs are kept in the issues list here.
The easiest way to learn the code base and get started contributing is to add a new scraper as defined in this issue.
The issue contains instructions on how to pick a specific site.
This project consists of four parts:
- Data Engine: This is the heart of the application. It asynchronously scrapes websites and pulls in data from APIs, cleans and formats the data, then sends it to the database client.
- Database Client: This is a standalone service that receives data from the data engine for insertion into MongoDB and processes requests from the clipboard site to display data to the user. Any time data is received from a website, the old data from that site is deleted and replaced with the new data.
- MongoDB Instance: This holds a single collection of all data from the sites. Only the database client interacts with the database.
- Clipboard Site: The website that displays the aggregated data. It interacts with the database via the database client.
As stated previously, adding a scraper is the best way to start contributing. If you're not familiar with web scraping, this gives a decent overview of what web scraping is. We're using Scrapy for this project, which is a complex and sophisticated web scraping framework. If you'd like to start with a tutorial that will help you learn how to write a scraper without worrying about the complexities of Scrapy, take a look at this guide, which uses a library called BeautifulSoup. If you're comfortable with the concepts used in web scraping, take a look at this tutorial. Ignore the installation instructions because you should have installed Scrapy earlier in this guide.
Scrapy uses the CssSelect module to implement CSS selectors. Docs can be found here. CssSelect defines its selectors according to the W3C specification here, with a few exceptions that are listed in CssSelect's documentation.
Most websites that we're dealing with will need to be scraped because the data on them is statically loaded from the server as HTML. However, some sites use APIs to dynamically load data. We should use these whenever possible, because scrapers are fragile and must be updated any time the content on a page changes. APIs are more stable and less likely to introduce breaking changes.
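To make the static-HTML case concrete, here is a tiny concept demo of pulling event titles out of markup using only Python's standard library. This is not how the project does it (we use Scrapy); the markup and class name are made up for illustration:

```python
from html.parser import HTMLParser

# Concept demo only: extract text from <h3 class="event-title"> elements.
class EventTitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3" and ("class", "event-title") in attrs:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_title = False

html_doc = '<div><h3 class="event-title">Book Club</h3><h3 class="event-title">Yoga</h3></div>'
parser = EventTitleParser()
parser.feed(html_doc)
print(parser.titles)  # ['Book Club', 'Yoga']
```

Scrapy's CSS selectors let you express the same idea as a one-liner, but the underlying task -- walk the markup, match elements, collect their text -- is the same.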
Here is an example of how to detect if a site has an API we can use.
- Go to https://chipublib.bibliocommons.com/events/search/index in Google Chrome
- Open the developer tools using F12 on Windows/Linux and Command+Option+I on Mac
- Click on the "Network" tab at the top of the toolbox
- Reload the page. The grid should be populated with data.
- Click on the "Name" column for any of the requests. A detailed view should appear and the "Headers" tab should be selected.
- Click on the "Response" tab. This view can contain a variety of data. For resource requests like images, it will say there is no data available; javascript files will show the javascript code; css files will show the stylesheet; etc. The only response data we care about right now is json.
- Look for a request name that starts with "search?". Looking through the response, you should see a json object.
- Click on the "Headers" tab. The Request URL is what was requested by your browser to retrieve the json data. We can use that same url to get that data in our application.
- If you keep clicking through more requests, you should see several more that also returned json data.
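Once you've found such a request, using it from code amounts to requesting the URL from the "Headers" tab and decoding the JSON. This sketch skips the actual HTTP call and just shows the decoding step; the response body and field names are invented for illustration:

```python
import json

# In real code you would fetch the Request URL shown in the "Headers" tab
# and decode its body. Here we decode a made-up events payload instead.
raw_response = '{"events": [{"title": "Story Time", "start": "2019-06-01T10:00:00"}]}'

data = json.loads(raw_response)
for event in data["events"]:
    print(event["title"], event["start"])
```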
This is the code that was used to create an API client for that site.
You can use this as a guide if you need to create your own API client. Some sites have APIs that are well-documented and designed for external use. These should be used if they are available.
Some sites may provide an iCalendar feed. Try to use the iCal reader if it is possible to do so.
All new scrapers should inherit from SpiderBase. All new API clients should inherit from ApiBase.
The end goal of all scrapers and API clients is to transform the raw data into event objects that conform to this class.
For each item, you'll want to parse out the following data (as much as is available). You'll notice that these fields correspond to the first parameter in the extract methods in SpiderBase.py.
- organization: The name of the organization that's putting on the event
- title: The name of the event
- description: Detailed description of the event
- address: Location of the event (okay if exact address is not known)
- url: Link to url for the event. A link to the specific event is preferred, but a link to a page containing general event listings is okay.
- price: Cost to attend, if provided
- category: Category of event, as defined here. (Work in progress. We'll flesh out categories more eventually.)
- Start/End Time and Date: Dates and times can be supplied with several parameters. Choose one date format and one time format. Eventually, all dates and times will be converted into Unix timestamps.
  - time: Use if only one time is supplied for the event (not a time range)
  - start_Time and End_Time: Use if the site supplies distinct data for these two values
  - time_Range: Use if the start and end time are supplied in a single string, ex: 6:00-8:00 PM
  - date: Use if the event could be one day or multiple days but it is contained in a single string. This is done this way because some sites have data that could be single days or multiple days.
  - start_date and end_date: Use if the site supplies distinct data for these two values
  - start_timestamp and end_timestamp: Use if the data is formatted like a Unix timestamp (unlikely for scrapers but possible for an API)
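As an example of the eventual conversion to Unix timestamps, here is a hedged sketch of turning a time_Range string like "6:00-8:00 PM" into start/end timestamps. The date, UTC assumption, and helper name are invented for illustration; the project's real parsing lives in the shared code:

```python
from datetime import datetime, timezone

def time_range_to_timestamps(date_str, time_range):
    """Illustrative helper: split '6:00-8:00 PM' into start/end Unix timestamps.

    Assumes both times share the meridiem given at the end of the string and
    treats everything as UTC for simplicity.
    """
    start_part, end_part = time_range.split("-")
    meridiem = end_part.strip()[-2:]              # e.g. 'PM'
    start_part = f"{start_part.strip()} {meridiem}"
    fmt = "%Y-%m-%d %I:%M %p"
    start = datetime.strptime(f"{date_str} {start_part}", fmt).replace(tzinfo=timezone.utc)
    end = datetime.strptime(f"{date_str} {end_part.strip()}", fmt).replace(tzinfo=timezone.utc)
    return int(start.timestamp()), int(end.timestamp())

start_ts, end_ts = time_range_to_timestamps("2019-06-01", "6:00-8:00 PM")
print(start_ts, end_ts)  # two hours apart: end_ts - start_ts == 7200
```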
Once you've decided how to find these fields for your site, look at the methods in SpiderBase.py or ApiBase.py and how they're used in existing spiders and API clients to see how to process the data.