dssg / bikeshare

Statistical models and webapp for predicting when bikeshare stations will be empty or full.

Home Page: http://bikeshare.dssg.io/


Provide historical data on S3 so people can get it.

hunterowens opened this issue

Hello! I tried to email you via your Google group, but it's not set up to accept messages from non-subscribers, so I'm mentioning it here.

I'm working with a team at Code for DC (the local chapter of Code for America). We independently decided we wanted to build a statistical model very similar to what you're working on, and ran across your project!

Would you be willing to share your data set with us, so we don't have to start scraping from scratch? Is there anything we can do to get this issue moving?

(The thing we're currently interested in is the commuter issue -- for each station, when is the dock likely to become empty/full in the morning/evening? Big problem, in practice!)

Thanks!

@HarlanH Sorry about the delay, I've been meaning to get this set up. For now, do you have a Postgres install ready and configured to handle the dataset? I've started a pg_dump that I'll put on S3 once you're ready to receive it (it's north of 100 GB, south of 1 TB last time I checked the size of the database).
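
For reference, the dump-and-upload could look something like this (a minimal sketch in Python; the database name, file name, and S3 bucket are placeholders, and a configured AWS CLI is assumed):

```python
import subprocess

# Placeholders: the actual database name and S3 bucket aren't given in this thread.
DB_NAME = "bikeshare"
DUMP_FILE = "bikeshare.dump"
S3_URI = "s3://example-bucket/bikeshare.dump"

# Custom-format dumps are compressed by default, which matters for a 100 GB+ database.
subprocess.check_call(["pg_dump", "--format=custom", "--file", DUMP_FILE, DB_NAME])

# Push the dump to S3; any S3 client (boto, s3cmd) would work equally well.
subprocess.check_call(["aws", "s3", "cp", DUMP_FILE, S3_URI])
```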

Wow, that's pretty good. How many records do you have? Tens of millions? If you're pushing it to S3, would it make sense to make a view of the data available on Redshift?


Around ~120 million (and counting) in the case of DC. An m1.large instance + 1 TB drive has given us pretty good performance so far, especially with indexing.


Hm, OK. We can do that. In theory it'd maybe be better to export this in something other than Postgres format, but beggars, choosers, and all that. :)

I could also do a CSV dump, I believe. Would that work better for your setup?


It'd be more flexible, certainly! For both us and others, daily CSV files would give us the most options, from loading into arbitrary databases (Redshift would be fun) to direct processing one file at a time. I wonder what would be an easy, cost-effective option for you guys on an ongoing basis?


Harlan-

I'm thinking the best long-term solution is to provide an API that will let users get the data they need in JSON format. I'm going to take a whack at that sometime soon.

Short term, what works best so the CFA DC team can get working?

Thanks,
Hunter
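
A JSON API along those lines could be quite small. A minimal sketch using Flask and psycopg2, where the table name comes from later in this thread, and the column names and connection string are assumptions:

```python
from flask import Flask, jsonify
import psycopg2

app = Flask(__name__)

@app.route("/stations/<int:station_id>/observations")
def observations(station_id):
    # Hypothetical DSN; column names are assumed, not confirmed by the thread.
    conn = psycopg2.connect("dbname=bikeshare")
    try:
        cur = conn.cursor()
        cur.execute(
            """SELECT timestamp, bikes, spaces
                 FROM bike_ind_washingtondc
                WHERE tfl_id = %s
                ORDER BY timestamp DESC
                LIMIT 100""",
            (station_id,),
        )
        rows = [
            {"timestamp": ts.isoformat(), "bikes": bikes, "spaces": spaces}
            for ts, bikes, spaces in cur.fetchall()
        ]
    finally:
        conn.close()
    return jsonify(observations=rows)

if __name__ == "__main__":
    app.run()
```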


Hunter, yes, an API would be great. (It'd be even better if CaBi ran their own API, but whatever.)

Short term, would you be willing to just run a SQL query and dump to a CSV file? We'd like all DC stations, but we only need snapshots every 10 minutes (I'm guessing you're pulling every minute) from, say, 6am to 10am, for maybe the last 6 months. I'm not sure what columns are available...

Thanks!
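
Concretely, that one-off export might look like the following sketch. The table name is taken from later in the thread; the column names, connection string, and output file are assumptions:

```python
import psycopg2

# Morning snapshots (6:00-9:59am) at 10-minute resolution for the last 6 months.
# Casting the minute to int keeps one observation per 10-minute bucket, assuming
# the scrape lands on whole minutes.
QUERY = """
    SELECT tfl_id, timestamp, bikes, spaces
      FROM bike_ind_washingtondc
     WHERE timestamp >= now() - interval '6 months'
       AND extract(hour FROM timestamp) BETWEEN 6 AND 9
       AND extract(minute FROM timestamp)::int % 10 = 0
     ORDER BY tfl_id, timestamp
"""

conn = psycopg2.connect("dbname=bikeshare")  # hypothetical DSN
with open("dc_morning_snapshots.csv", "w") as f:
    conn.cursor().copy_expert("COPY ({}) TO STDOUT WITH CSV HEADER".format(QUERY), f)
conn.close()
```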


@HarlanH - Sorry for the delay; I've been away from a decent internet connection for a while.

I gave you the entire dataset - from roughly when the system went online to today. Due to bugs in the collection process, some observations are missing. Originally, Oliver O'Brien was collecting data at 2-minute intervals, which has now been replaced with 1-minute intervals. The data documentation in the wiki/README is pretty good, but feel free to ask questions. FYI, you're looking at the bike_ind_washingtondc table for reference.

Raw: 3.5 GB on S3
Gzipped: ~500 MB

@HarlanH Also, if anybody in your group is interested in building an API, let me know.

Just a note to thank you for sending this data! I've extracted a model-friendly summary, which is on the Code for DC CKAN site: http://opendatadc.org/dataset/capital-bikeshare-first-empty-time. We'll be working on building some models and visualizations around this soon!


Closing this issue. See #68 for more info.

I have been using the washingtondc CSV you posted and building off your code. However, I'm wondering whether the timezone conversion that's in your code has already been applied here, since the timestamps aren't lining up with some overlapping data I got from Alta (yours appear 5 hours earlier). Do you happen to know? Thank you.
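
For what it's worth, a constant 5-hour gap is exactly what you'd see if the dump's timestamps are UTC and Alta's are US/Eastern during standard time (UTC-5). A quick check with a made-up timestamp, using pytz:

```python
from datetime import datetime
import pytz

utc = pytz.utc
eastern = pytz.timezone("US/Eastern")

# Hypothetical timestamp from the CSV, interpreted as UTC.
dump_ts = utc.localize(datetime(2013, 1, 15, 13, 0))
print(dump_ts.astimezone(eastern))  # 2013-01-15 08:00:00-05:00 (5 hours earlier)
```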