stefanprodan / mgob

MongoDB dockerized backup agent. Runs scheduled backups with retention, S3 & SFTP upload, notifications, Prometheus instrumentation, and more.
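For context, a minimal sketch of how the agent might be started as a container with a mounted directory of backup plans; the volume paths, port, and image tag here are assumptions, so check the repository README for the authoritative setup:

```sh
# Minimal sketch; paths, port, and tag are assumptions, not taken from the README.
# Run the mgob agent with a mounted directory of backup plan definitions.
docker run -d --name mgob \
  -p 8090:8090 \
  -v "$(pwd)/config:/config" \
  -v "$(pwd)/storage:/storage" \
  stefanprodan/mgob:latest
```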

Archiving and Querying MongoDB collection data from S3

pcinnusamy opened this issue · comments

@stefanprodan
Please help guide me on the archival process described below and how we can achieve it with your approach.

Description

Background:

MongoDB is being deprecated in favour of DocumentDB, therefore Product Engineering will be moving data to DocumentDB. After the move, the corresponding collections in Mongo will be archived for safekeeping.
We are on Redshift, but planning to migrate to Snowflake in future.

Problem Statement:

The current archival process is to run mongodump on a collection, which produces a bson.gz file per partition.
However, this is not very access-friendly when the data needs to be read back:
you have to run mongorestore on all of the required bson.gz files into a Mongo / DocumentDB cluster before you can query the data (a rough sketch of this flow follows below).
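For illustration, a sketch of the current flow described above; the host names, database, collection, and archive file names are placeholders rather than the real setup:

```sh
# Archive: dump a single collection as a gzipped BSON archive
# (host, database, collection, and file names are placeholders)
mongodump --host source-host --port 27017 \
  --db app_db --collection events \
  --gzip --archive=events_2023-01.bson.gz

# Read back: each required archive must first be restored into a running
# Mongo / DocumentDB cluster before the data can be queried
mongorestore --host restore-host --port 27017 \
  --gzip --archive=events_2023-01.bson.gz
```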

Questions to Resolve:

1. What should be the right process to archive data? When to archive? How frequently? By whom?
2. mongodump is currently run on an EC2 instance; how should this be run without SSH-ing into a server, and in a way that is usable by all teams planning to archive their data?
3. What should be the right way to store the archived data?
4. How should the dump (if any) be partitioned?
5. What is the right output format to improve readability, and with what tool (query engine) can we read it?
6. How should a user read the data back when this is needed?
7. The data organisation structure should assume working with multiple pods, applications, tables / collections, etc.

Other Contexts:

We are on Redshift, but planning to migrate to Snowflake (do these support querying the archived data, and if so, how?)