Discussion: computing architecture requirements for the data analysis group

Question

Stockard opened this issue 4 years ago · comments

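For people who work with Hadoop and the like, our current data volume is just a drizzle, but we still need someone to help sort out the computing architecture.
The goal of this architecture is to make it easy to feed existing data into online models or into models under test, and to produce results that can plug directly into visualization or be output continuously to other projects.
We want to deploy quickly, so we would also like to reuse existing tools where possible.
I have almost no experience in this area, so please feel free to suggest anything. Solutions you have used yourself are especially welcome; let's pool our ideas.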

The current situation:

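There are many data sources, including third-party APIs and scattered data tables contributed by group members. This data needs to be aggregated and stored so that it can be used directly in models.
Some examples:
Third-party API: DXY (丁香园) epidemic data (the API is currently down). It returns JSON, which is awkward for models and first has to be tidied into a data frame, for example the data here and here. This data needs to be stored promptly because the third-party API is not very stable.
Scattered data tables: group members contributed data scraped from Baidu Migration (百度迁徙). Many people need this data, and it is also needed to build the migration model. Contributions are ongoing, so good data needs to be ingested promptly.
Economic data: city-level data such as GDP and number of hospitals, plus the same at province level. National-level data may follow in the future.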
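
To make the "tidy JSON into a data frame and store it promptly" step concrete, here is a minimal sketch using only the Python standard library. The payload layout, the field names (`results`, `provinceName`, `cities`, `cityName`, `confirmedCount`) and all values are assumptions made up for illustration; the real API response is not shown in this thread.

```python
import csv
import io
import json
from datetime import datetime, timezone

# Hypothetical payload with made-up values; the real API layout may differ.
SAMPLE_JSON = """
{
  "results": [
    {"provinceName": "Province A", "confirmedCount": 100,
     "cities": [{"cityName": "City A1", "confirmedCount": 60},
                {"cityName": "City A2", "confirmedCount": 40}]},
    {"provinceName": "Province B", "confirmedCount": 10,
     "cities": [{"cityName": "City B1", "confirmedCount": 10}]}
  ]
}
"""

FIELDS = ["fetched_at", "province", "city", "confirmed"]


def flatten(payload, fetched_at):
    """Turn the nested province -> cities structure into one flat row per city."""
    rows = []
    for province in payload["results"]:
        for city in province.get("cities", []):
            rows.append({
                "fetched_at": fetched_at,
                "province": province["provinceName"],
                "city": city["cityName"],
                "confirmed": city["confirmedCount"],
            })
    return rows


def to_csv(rows):
    """Serialize flat rows to CSV text so each fetch can be kept as a snapshot."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Stamp every row with the fetch time so snapshots survive an API outage.
fetched_at = datetime.now(timezone.utc).isoformat()
rows = flatten(json.loads(SAMPLE_JSON), fetched_at)
snapshot = to_csv(rows)
```

A list of flat dicts like `rows` can be passed straight to `pandas.DataFrame(rows)` if pandas is available, and `snapshot` can be written to a dated file on each fetch.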
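Output requirements. There are currently a few main ones, in priority order:
1) Connect to the main project and output content for epidemic data visualization. For example, we can provide multiple location-based data layers, such as infected population density and the status of epidemic control measures. Many groups are producing this kind of data, but the data science group can judge which features matter most.
2) Build models on this data that can produce output continuously. For example, the supplies and medical groups currently need an epidemic severity TAG. Such a model has a fixed output format, but we need to be able to adjust the computation flexibly ourselves.
3) Build tools for debugging simple models online. This is our own visualization and computing project.
I don't know the current state of the main project's data warehouse or which datasets are already cleaned up and available to us; it would help if people from the tech group could weigh in.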
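
One possible way to meet requirement 2), a fixed output format with a computation we can swap freely, is to separate the output record from the scoring function. The sketch below is only an illustration: `SeverityTag`, `cases_per_100k`, the thresholds, and the region figures are all hypothetical placeholders, not an epidemiological model.

```python
from dataclasses import dataclass, asdict
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SeverityTag:
    """Fixed output record that downstream consumers can rely on."""
    region: str
    score: float
    tag: str  # one of "low", "medium", "high"


# A scoring function maps one region's features to a number; this is the swappable part.
ScoreFn = Callable[[Dict[str, float]], float]


def cases_per_100k(features: Dict[str, float]) -> float:
    """Placeholder scoring rule: confirmed cases per 100,000 residents."""
    return features["confirmed"] / features["population"] * 100_000


def to_tag(score: float, medium: float = 1.0, high: float = 10.0) -> str:
    """Map a score to a label; the thresholds are arbitrary illustration values."""
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"


def run_model(regions: Dict[str, Dict[str, float]], score_fn: ScoreFn) -> List[SeverityTag]:
    """Apply any scoring function but always emit the same output format."""
    output = []
    for name, features in regions.items():
        score = score_fn(features)
        output.append(SeverityTag(region=name, score=round(score, 2), tag=to_tag(score)))
    return output


# Made-up inputs for two hypothetical regions.
regions = {
    "region_a": {"confirmed": 500, "population": 1_000_000},
    "region_b": {"confirmed": 5, "population": 1_000_000},
}
tags = run_model(regions, cases_per_100k)
records = [asdict(t) for t in tags]  # plain dicts, ready to serialize as JSON
```

Changing the model then means passing a different `score_fn` to `run_model`; consumers of `records` see the same three keys either way.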

I know what needs to be done but have little idea how to do it, so I would like to hear everyone's thoughts.

LM · Answer 1 · Mon Feb 03 2020 17:05:44 GMT+0800 (China Standard Time)

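There are many data sources, including third-party APIs and scattered data tables contributed by group members. This data needs to be aggregated and stored so that it can be used directly in models.
Some examples:
Third-party API: DXY (丁香园) epidemic data (the API is currently down). It returns JSON, which is awkward for models and first has to be tidied into a data frame, for example the data here and here. This data needs to be stored promptly because the third-party API is not very stable.
Scattered data tables: group members contributed data scraped from Baidu Migration (百度迁徙). Many people need this data, and it is also needed to build the migration model. Contributions are ongoing, so good data needs to be ingested promptly.
Economic data: city-level data such as GDP and number of hospitals, plus the same at province level. National-level data may follow in the future.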

Given the amount of data we have now and expect to have, I would go for the following solution.

1. Set up a data aggregation GitHub repo that uses GitHub Actions, and publish the data on GitHub Pages so that we can pull it from the repo easily. Here is a minimum viable example: https://github.com/datumorphism/2019-ncov/actions
2. Write a Python script that pulls the data from GitHub and transforms it into data APIs. The script can be hosted on Zeit Now; its free plan is good enough.
3. The data has to be standardized. We probably need a meeting or something similar to agree on that.
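
As a rough illustration of the Python script in the proposal above, the sketch below pulls a CSV file published on GitHub Pages and turns it into a JSON body that a data API endpoint could return. `DATA_URL`, the column names, and the sample rows are assumptions for illustration only; the real ones depend on the repo and on the standardization the group agrees on.

```python
import csv
import io
import json
from urllib.request import urlopen

# Hypothetical URL; the real GitHub Pages location depends on the repo that is set up.
DATA_URL = "https://example.github.io/data-aggregation/cases.csv"


def fetch_csv(url=DATA_URL):
    """Download the published CSV as text (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def csv_to_records(text):
    """Parse CSV text into a list of dicts, one per row (all values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def api_response(records, province=None):
    """Build the JSON body for an endpoint, optionally filtered by province."""
    if province is not None:
        records = [r for r in records if r["province"] == province]
    return json.dumps({"count": len(records), "data": records}, ensure_ascii=False)


# Offline demonstration with made-up rows instead of a live download.
sample = "date,province,confirmed\n2020-02-01,A,1\n2020-02-01,B,2\n"
body = api_response(csv_to_records(sample), province="A")
```

On a serverless host, a request handler would call `fetch_csv()` and return `api_response(...)`. The filter only works if every source uses the same column names, which is why the data has to be standardized first.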

Gamehu · Answer 2 · Mon Feb 03 2020 17:24:18 GMT+0800 (China Standard Time)

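1. Let consumers fetch data with GraphQL: they request exactly what they need, and the backend does not have to be customized for each of them.

2. The backend cleans the data and loads it into a single store. I don't know what we use for storage: HBase, or PostgreSQL directly?

If building a data model is too much trouble, we could simply set up PostgreSQL with V8 so that we can write JS, and get customized data through JS scripts.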
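
The two ideas above (one shared store, consumers asking only for the fields they need) can be sketched without any GraphQL library. This is not GraphQL itself and not PostgreSQL: it uses Python's built-in `sqlite3` as a stand-in database and a plain field whitelist as a stand-in for a GraphQL selection. The table name, columns, and rows are made up for illustration.

```python
import sqlite3

# Columns that consumers may request; whitelisting keeps the query safe from injection.
ALLOWED_COLUMNS = ("date", "province", "city", "confirmed")


def load(conn, rows):
    """Load cleaned rows from any source into one shared table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cases "
        "(date TEXT, province TEXT, city TEXT, confirmed INTEGER)"
    )
    conn.executemany("INSERT INTO cases VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def select_fields(conn, fields, province=None):
    """Return only the requested columns, similar in spirit to a GraphQL selection."""
    unknown = [f for f in fields if f not in ALLOWED_COLUMNS]
    if not fields or unknown:
        raise ValueError(f"invalid field selection: {fields}")
    sql = f"SELECT {', '.join(fields)} FROM cases"
    params = ()
    if province is not None:
        sql += " WHERE province = ?"
        params = (province,)
    cursor = conn.execute(sql, params)
    return [dict(zip(fields, row)) for row in cursor.fetchall()]


# In-memory database with made-up rows for demonstration.
conn = sqlite3.connect(":memory:")
load(conn, [
    ("2020-02-01", "A", "a1", 3),
    ("2020-02-01", "B", "b1", 7),
])
result = select_fields(conn, ["city", "confirmed"], province="B")
```

A real GraphQL server would add a typed schema and nested queries on top of this, so the sketch only shows the "ask for what you need" part of the idea.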

Rex · Answer 3 · Wed Feb 05 2020 02:46:04 GMT+0800 (China Standard Time)

@emptymalei 's proposal sounds feasible. In fact, from what I have observed of how other sub-projects in this org have evolved, the frontend and data-sync seem to have gone down the path of self-serving their data without talking to an actual "backend API". That said, I think the api-server is better suited to serving the data science needs.

@Gamehu 's comments also sound good. GraphQL might suit this use case better than RESTful APIs, especially since the data model has not been standardized yet and is subject to change every day. Since I'm not that familiar with GraphQL, I cannot estimate how much work it would take on my side.

When it comes to databases, whatever we decide to use (NoSQL or SQL), we should prefer a cloud-managed instance (such as GCP Cloud SQL, Firebase, AWS RDS, DynamoDB, etc.) over maintaining our own.

Rex · Answer 4 · Wed Feb 05 2020 02:51:33 GMT+0800 (China Standard Time)

@Stockard Reading through your top priority, could you clarify what "疫情数据可视化" (epidemic data visualization) means? In particular, what kind of information about the epidemic do we want to visualize? I could prioritize building the specific API endpoints once that is clearer.