bmschmidt / baseballdatabank-parquet

Development for baseball databank, an Open Data collection of historical baseball data

Baseball Databank

Baseball Databank is a compilation of historical baseball data in a convenient, tidy format, distributed under Open Data terms.

This work is licensed by Chadwick Baseball Bureau under the Creative Commons Attribution-ShareAlike 3.0 Unported License. For details see http://creativecommons.org/licenses/by-sa/3.0/

About this data

  • This is a legacy resource. Data in this format have been circulated by various people for many years, and many applications and users have tools that take data in this format. Chadwick Baseball Bureau maintains it to support compatibility with those tools and programs. As such, the schema is not open to amendments, either in the scope of coverage or in the data categories available.
  • This is a free resource. Statistical data will be updated once, at some point during each MLB offseason. To borrow the slogan used by ProMods, "It's ready when it's ready." New releases will be announced via our Twitter account at @chadwickbureau. Please note that we will not be able to respond to any enquiries as to when new versions of the data will be released.
  • These data are maintained wholly by Chadwick Baseball Bureau, for the benefit of the community. Users who require data of a different scope, in a different format, and/or with more specific schedules for updates are encouraged to enquire about our various licensing options.

Organisation of the files

There are three directories in the repository.

  • core/ contains the databank itself. These files are automatically produced from our larger dataset.
  • contrib/ contains files which are manually maintained by others using the same identifier system as the core. We bundle these for the convenience of the community.
  • upstream/ contains files used to construct the databank.
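
As a rough illustration of the layout above, the following Python sketch lists the CSV tables found in each of the three directories of a local checkout and reads one of them. It uses only the standard library. The function names (list_tables, read_table), the UTF-8 encoding, and the assumption that tables are stored as .csv files directly inside each directory are illustrative choices, not something this repository specifies.

```python
import csv
from pathlib import Path

# The three top-level directories named in this README.
DATABANK_DIRS = ("core", "contrib", "upstream")


def list_tables(repo_root):
    """Map each databank directory to the sorted CSV file names found in it.

    Directories missing from the checkout map to an empty list.
    """
    root = Path(repo_root)
    tables = {}
    for name in DATABANK_DIRS:
        directory = root / name
        if directory.is_dir():
            tables[name] = sorted(p.name for p in directory.glob("*.csv"))
        else:
            tables[name] = []
    return tables


def read_table(repo_root, directory, filename):
    """Read one CSV table into a list of dicts keyed by the header row.

    Assumption: the file is UTF-8 encoded and its first row is a header.
    """
    path = Path(repo_root) / directory / filename
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

For example, with a checkout in a hypothetical local folder named baseballdatabank, list_tables("baseballdatabank") would report what each directory holds, and read_table("baseballdatabank", "core", "Parks.csv") would load that table, assuming it sits in core/. All values come back as strings, so numeric columns need converting by the caller.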

Maintenance and sources

Most of the data in the Databank are provided by Chadwick Baseball Bureau (http://www.chadwick-bureau.com). They differ from the data the Bureau provides to its clients in that they contain less detail, are updated less frequently, and are provided on an as-is basis.

The Databank is historically based in part on the Lahman Baseball Database, version 2015-01-24, which is Copyright (C) 1996-2015 by Sean Lahman.

The tables Parks.csv and HomeGames.csv are based on the game logs and park code table published by Retrosheet. This information is copyrighted by Retrosheet and is available from Retrosheet free of charge. Interested parties may contact Retrosheet at http://www.retrosheet.org.

Enquiries and suggested revisions

Enquiries and suggested revisions to the data can be posted in the issue tracker at https://github.com/chadwickbureau/baseballdatabank/issues.

Files in core/ are all generated by scripts. As such they are not edited manually (and therefore pull requests should not be submitted against these files).

Files in upstream/ are manually maintained and contain information specific to constructing the Databank. Because they are maintained by hand, pull requests containing corrections or additions to these files may be submitted.
