The Social Warehouse

This is a data warehouse and data lake system for social and civic analysis, from Siege Analytics.

The data warehouse is built to enable longitudinal analysis of data from the Census Bureau and the Bureau of Labor Statistics. Intended areas of growth:

  • FEC information
  • Election results
  • Media markets
  • Officials and jurisdictions

Using

It's recommended to use make to run the docker compose commands below, because the docker-compose.yml and .env files are auto-generated. Thanks to the magic of make, changes to the source includes will automatically trigger a re-make of the compose files. See below for more on how to work with the auto-generation.

As always, compose.override.yml can be used if you need changes to the auto-generated configs.
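For instance, a compose.override.yml could remap a port on your machine without touching the generated files (the service name and ports here are illustrative, not taken from the repo):

# compose.override.yml -- local tweaks layered over the generated config
services:
  postgres:
    ports:
      - "15432:5432"   # expose PostgreSQL on a non-default host port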

Here are the make commands that wrap docker compose (a sketch of how they might be implemented follows the list):

  • down - terminates the containers, volumes, and networks and removes them. It's a last-resort command.
  • up - starts the containers, networks, and volumes and runs them in detached mode.
  • build - builds the containers, networks, and volumes.
  • rebuild - builds the containers, networks, and volumes from nothing, without relying on cached resources. [docker compose build --no-cache]
  • clean - terminates the containers, volumes, and networks, and removes them.
  • prune - removes all stopped containers, without removing running containers.
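A minimal sketch of how such wrappers might look, assuming the generated files are declared as prerequisites as described below (these recipes are illustrative, not the repo's actual Makefile):

up: docker-compose.yml .env
	docker compose up -d

down: docker-compose.yml .env
	docker compose down -v --remove-orphans

build: docker-compose.yml .env
	docker compose build

rebuild: docker-compose.yml .env
	docker compose build --no-cache

prune:
	docker container prune -f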

Here are some of the important make commands for working with the containers (see the sketch after the list):

  • pg_shell - opens a shell session inside the PostgreSQL server container.
  • python_term - opens a shell session inside the Python container.
  • fetch_jars - uses Maven to fetch the jar files that Spark needs to operate. It saves them in the default location and copies them to the jars directory in the project.
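A plausible sketch of these targets, assuming the services are named postgres and python (the actual service names and recipes may differ):

pg_shell:
	docker compose exec postgres bash

python_term:
	docker compose exec python bash

fetch_jars:
	mvn dependency:copy-dependencies -DoutputDirectory=jars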

Auto-generated Compose Files

By default, all .yml files in the docker/ directory are used to generate the docker-compose.yml file. The .env file is generated from .env files in the conf/ directory.

All of the wrapped compose commands declare .env and docker-compose.yml as dependencies, so they will be re-generated if any of the .yml files or .env files change.

Note that a service may be defined across multiple files, so the order of the files matters. The docker-compose.yml file is generated by concatenating the files in the order they are listed in the COMPOSE_FILES variable. The auto-generated file list sorts .profile.yml files to the end of the list, so that they take precedence over the plain .yml files.
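One plausible shape for the file lists and generation rules, under the assumptions just described (a sketch, not the repo's exact Makefile):

# sort .profile.yml files to the end so they take precedence
PLAIN_YML     := $(filter-out %.profile.yml,$(wildcard docker/*.yml))
PROFILE_YML   := $(filter %.profile.yml,$(wildcard docker/*.yml))
COMPOSE_FILES ?= $(PLAIN_YML) $(PROFILE_YML)
COMPOSE_ENV_FILES ?= $(wildcard conf/*.env)

docker-compose.yml: $(COMPOSE_FILES)
	cat $(COMPOSE_FILES) > $@

.env: $(COMPOSE_ENV_FILES)
	cat $(COMPOSE_ENV_FILES) > $@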

You can explicitly select the include files by using the COMPOSE_FILES and COMPOSE_ENV_FILES variables when you run make. You might add such overrides to new targets in the Makefile, so you can define custom stacks for different environments or purposes.

mycustom:
	# use -B to force a rebuild of docker-compose.yml
	$(MAKE) -B COMPOSE_FILES="docker/mycustom.yml docker/mycustom.profile.yml" up

If you don't want your additional configs auto-included in default runs, just use a different naming convention.

Adding Services

To compile the compose-file snippets from the docker/ sub-directory, we set --project-directory to the repo root. When defining paths in the compose snippets, beware that the current working directory is therefore the repo root. Dockerfile paths, by contrast, are relative to the build context you set.

It is recommended that you create a generalized .yml file for each image you build. Then put additional configurations into a .profile.yml file. The .profile.yml files will always have precedence over service descriptions in plain .yml files. You might group related services into a profile, or add project-specific or environment-specific configurations, such as volumes, networks, or environment variables.

For instance, we define image-building configurations in docker/spark-build-image.yml and define our integration of the image into the various services in docker/spark.profile.yml. You can see how we use one image in multiple services with different environment variables and volumes.
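The split might look roughly like this (service names, image tags, and variables are illustrative; only the two file names come from the repo):

# docker/spark-build-image.yml -- build the image once
services:
  spark:
    build:
      context: docker/spark
    image: socialwarehouse/spark

# docker/spark.profile.yml -- reuse the image across services
services:
  spark-master:
    image: socialwarehouse/spark
    environment:
      SPARK_MODE: master
    volumes:
      - ./jars:/opt/spark/jars
  spark-worker:
    image: socialwarehouse/spark
    environment:
      SPARK_MODE: worker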

License: Apache License 2.0

