saeedt / CFS_Sampling

An optimal stratified sample design for the Commodity Flow Survey (CFS) based on Simulated Annealing and a Genetic Algorithm. A PL/pgSQL (procedural PostgreSQL) script generates a frame of 100,000 records from publicly available data.

CFS Sample Design Data and Scripts

The data and scripts for the proposed CFS sample design are stored in this repository. The folders and their contents are listed below.

Raw_Data

Main data sources used for generating the sample data are stored in this folder.

SQL

SQL scripts used to create tables and analyze the raw data are stored in this folder. We used PostgreSQL, a free, open-source database management system (DBMS). The queries and functions can be run on PostgreSQL 9.6 or later. Running them on other SQL-compatible DBMSs such as MySQL/MariaDB or MS SQL Server may require minor modifications.

  • SQL_Scripts.sql includes the scripts for creating tables and all queries developed for cleaning and aggregating the data. The comments in this file provide a high-level explanation of each step. We used Common Table Expressions (CTEs) to merge multiple related queries into one step.
  • Generate_est.sql includes a function written in PL/pgSQL (the procedural PostgreSQL language) that generates a sampling frame with user-defined parameters based on the CBP and FAF datasets.
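
As an illustration of the CTE approach mentioned above, the following minimal sketch chains a cleaning step and an aggregation step in a single statement. It uses Python's built-in sqlite3 module only so that it runs without a database server; the same WITH syntax is valid in PostgreSQL. The table cbp_raw, its columns, and the sample rows are hypothetical and are not taken from SQL_Scripts.sql.

```python
import sqlite3


def employment_by_county(rows):
    """Clean and aggregate hypothetical CBP-like rows with one CTE query."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table layout, for illustration only.
    conn.execute("CREATE TABLE cbp_raw (county TEXT, naics TEXT, emp INTEGER)")
    conn.executemany("INSERT INTO cbp_raw VALUES (?, ?, ?)", rows)
    query = """
        WITH cleaned AS (          -- step 1: drop rows with missing employment
            SELECT county, naics, emp
            FROM cbp_raw
            WHERE emp IS NOT NULL
        ),
        totals AS (                -- step 2: aggregate the cleaned rows
            SELECT county, SUM(emp) AS total_emp
            FROM cleaned
            GROUP BY county
        )
        SELECT county, total_emp FROM totals ORDER BY county
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


rows = [
    ("01001", "311", 120),
    ("01001", "484", None),   # missing value, removed by the first CTE
    ("01003", "311", 80),
    ("01003", "484", 40),
]
print(employment_by_county(rows))
```

Each named block in the WITH clause plays the role of a separate intermediate query, which keeps related steps together without creating temporary tables.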

Final_Data

Includes the final output of the scripts in the SQL folder applied to the data in Raw_Data.

  • fafcbp.csv is the combined FAF and CBP datasets in CSV format. It contains the FAF data disaggregated by county and NAICS code based on the CBP data. This file is required by the generate_est function in the SQL folder.
  • 100K_Frame_newCFS.csv is a set of 100,000 establishments generated with the generate_est function.
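
The disaggregation behind fafcbp.csv can be pictured as a proportional split. The sketch below assumes, purely as an illustration, that a zone-level FAF total is shared among county and NAICS cells in proportion to CBP employment. The weighting actually used by the repository's scripts may differ, and the function name and sample values are hypothetical.

```python
def disaggregate(zone_total, cbp_employment):
    """Split a zone-level total across cells in proportion to employment.

    cbp_employment maps (county, naics) -> employment and serves as the
    (assumed) weight for each cell.
    """
    total_emp = sum(cbp_employment.values())
    if total_emp == 0:
        raise ValueError("employment weights sum to zero")
    return {
        cell: zone_total * emp / total_emp
        for cell, emp in cbp_employment.items()
    }


# Hypothetical example: 1000 units of flow shared by two county/NAICS cells.
shares = disaggregate(1000.0, {("01001", "311"): 30, ("01003", "311"): 70})
print(shares)
```

Because the shares are proportions of the same total, the disaggregated values always add back up to the original zone-level figure.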

R_Scripts

Includes the R scripts and functions used in the document.

License: MIT License


Languages

R 75.7%, PLpgSQL 24.3%