5forcegees/cloud_native_search

This is a web Crawl and Search tool.
It is defined by a set of terraform files configured to deploy the required resources into an AWS account.

Description of structure: There are a set of common shared files (common.vars.tf, common.output.tf, provider.config.tf) that are symlinked into the resource directories. There is also a configuration file in each numbered resource directory that is not shared: terraform.config.tf.
It cannot be shared because it contains the s3 state file configuration which is specific to the resources in that directory only and variable extrapolation is not possible here. As an alternative approach, these settings can be configured by using parameters passed in the terraform command line commands (see terraform help for details). The resource directories are numbered to clearly show the deployment sequence. The resource declarations in the resource directories are organized into code blocks that define a resource and all its dependent vars, dependent resources, and outputs together so that any changes to a resource can be made in a single location. Shared resources (such as the VPC/subnet where the Cluster, Lambda functions, etc are isolated) are defined separately from their dependent resources.

Top level structure:

README.md - this document

each of the numbered directories has a symlink to these shared files in the top level directory

    common.vars.tf - defines shared variables
    
    common.output.tf - defines shared outputs
    
    provider.config.tf - defines shared provider configuration
    
files - templates and files for several resources

  api_gateway/apigateway_swagger_template.yaml - rendered to include the lambda integration endpoint
  
  hadoop - scripts and files for managing the hadoop cluster 
  
	/nutch/conf/ - configuration files that get transferred into the EMR (Hadoop) cluster to configure the nutch web crawl
    
  cloudsearch/ - the output search and doc urls from the cloudsearch setup
  
  urls/seed.txt - lists the seed urls that the nutch job will crawl
  
  tools/ - tools and scripts used during development
  
    delete_all_documents.sh can be used to clean out the cloudsearch domain document list
    
1terraform-state-setup - deploys the S3 state storage and DynamoDB state locking resources

    terraform.config.tf - each of the numbered directories also has its individual S3 state file configuration here
    
2cloudsearch_setup - deploys the cloudsearch domain

3application_support_setup - deploys the application support resources

  including:
  
    the VPC where the cluster and optional build server will be deployed
    
    the SSH key used to access the build server and cluster
    
      currently configured to use the local ~/.ssh/id_rsa.pub file
      
      '$ terraform apply' in 3application_support_setup/ to update the ssh key to your local file
      
      or update configuration with your public key if not using the ~/.ssh/id_rsa.pub default path 
      
    the EC2 'build server' for managing the hadoop cluster remotely
    
    the s3 objects that the hadoop cluster references
    
    the lambda function that integrates the cloudsearch endpoint to the API Gateway
    
    the rendering of template files for programatically defining the configuration of each resource
    
4applications - deploys the hadoop cluster and configures cron job to trigger autoscaling and crawling

5forcegees / cloud_native_search

About

Languages