Darwin Core Archive I/O (dwca-io)

Formerly known as dwca-reader

The dwca-io library provides:

  • Reader for Darwin Core Archive files, with or without extensions.
  • Reader for single tabular files that use Darwin Core terms as headers.
  • Support for discovery of a metadata document (e.g. EML).
  • Writer for simple Darwin Core Archive files, with or without extensions.

To build the project

Note: this project requires Java 8.

mvn clean install

Usage

Reading a simple Darwin Core Archive

Read an archive and display data from the core record:

Path myArchiveFile = Paths.get("myArchive.zip");
Path extractToFolder = Paths.get("/tmp/myarchive");
Archive dwcArchive = DwcFiles.fromCompressed(myArchiveFile, extractToFolder);

// Loop over core records and display id, genus, specific epithet
for (Record rec : dwcArchive.getCore()) {
  System.out.printf("%s: %s %s%n", rec.id(), rec.value(DwcTerm.genus), rec.value(DwcTerm.specificEpithet));
}

Reading a Darwin Core Archive with extensions

Read a compressed archive and display data from the core and the extension:

Path myArchiveFile = Paths.get("myArchive.zip");
Path extractToFolder = Paths.get("/tmp/myarchive");
Archive dwcArchive = DwcFiles.fromCompressed(myArchiveFile, extractToFolder);

System.out.println("Archive rowtype: " + dwcArchive.getCore().getRowType() + ", "
    + dwcArchive.getExtensions().size() + " extension(s)");

// Loop over star records and display id, core record data, and extension data
for (StarRecord rec : dwcArchive) {
  System.out.printf("%s: %s %s%n", rec.core().id(), rec.core().value(DwcTerm.genus), rec.core().value(DwcTerm.specificEpithet));
  if (rec.hasExtension(DwcTerm.Occurrence)) {
    for (Record extRec : rec.extension(DwcTerm.Occurrence)) {
      System.out.println(" - " + extRec.value(DwcTerm.country));
    }
  }
}

Other supported file types

The DwcFiles.fromLocation method also supports the following file types:

Notes

  • The delimitedBy attribute of a field is not supported.
  • The dateFormat attribute of a file is not supported.
  • Iterating over an Archive with extensions requires pre-sorting the data files. This can take seconds to minutes, depending on the size of the archive. If you prefer, you can use Archive#initialize() to sort the archive beforehand.
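
To illustrate why the data files need to be sorted before star records can be iterated, here is a minimal sketch in plain Java. It is not the library's implementation: the class StarJoinSketch, the Row type, and the join method are hypothetical names used only for this example. It assumes that core and extension rows share a core id, and shows that once both sides are sorted by that id, each core row can be paired with its extension rows in a single forward pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; none of these names come from dwca-io. */
class StarJoinSketch {

  /** A data row reduced to its core id and a single value. */
  static final class Row {
    final String coreId;
    final String value;

    Row(String coreId, String value) {
      this.coreId = coreId;
      this.value = value;
    }
  }

  /** Groups extension values under their core id using a sort followed by one merge pass. */
  static Map<String, List<String>> join(List<Row> core, List<Row> extension) {
    // Sorting copies of both inputs by core id is the expensive step.
    List<Row> sortedCore = new ArrayList<>(core);
    List<Row> sortedExt = new ArrayList<>(extension);
    sortedCore.sort(Comparator.comparing((Row r) -> r.coreId));
    sortedExt.sort(Comparator.comparing((Row r) -> r.coreId));

    Map<String, List<String>> star = new LinkedHashMap<>();
    int e = 0; // cursor into the sorted extension rows; it only ever moves forward
    for (Row c : sortedCore) {
      List<String> matches = new ArrayList<>();
      // Skip extension rows whose id sorts before the current core id (no matching core row).
      while (e < sortedExt.size() && sortedExt.get(e).coreId.compareTo(c.coreId) < 0) {
        e++;
      }
      // Collect every extension row that carries the current core id.
      while (e < sortedExt.size() && sortedExt.get(e).coreId.equals(c.coreId)) {
        matches.add(sortedExt.get(e).value);
        e++;
      }
      star.put(c.coreId, matches);
    }
    return star;
  }

  public static void main(String[] args) {
    List<Row> core = Arrays.asList(new Row("2", "Puma concolor"), new Row("1", "Felis catus"));
    List<Row> ext = Arrays.asList(new Row("2", "Chile"), new Row("1", "Denmark"), new Row("2", "Canada"));
    System.out.println(join(core, ext)); // prints {1=[Denmark], 2=[Chile, Canada]}
  }
}
```

The sort dominates the cost, which is consistent with the note above that preparing a large archive can take a while; the merge pass itself reads each row once.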

Maven

Ensure the GBIF repository is declared in your pom.xml:

<repositories>
  <repository>
    <id>gbif-repository</id>
    <url>https://repository.gbif.org/content/groups/gbif</url>
  </repository>
</repositories>

Add the dwca-io artifact:

<dependency>
  <groupId>org.gbif</groupId>
  <artifactId>dwca-io</artifactId>
  <version>{latest-version}</version>
</dependency>

where {latest-version} can be found here

Change Log

Change Log

Documentation

JavaDocs

Unsupported archives

Darwin Core Text specifies several features which are not supported by this library.

  • A <core> or <extension> setting a dateFormat
  • A <files> <location> which is a URL

These features are very rarely used, and will not be implemented without good reason.
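
If you want to detect these two cases before handing an archive to the library, one option is to inspect its meta.xml yourself. The following is a rough sketch using only the JDK's XML parser. The class MetaXmlCheck and the method findUnsupported are hypothetical and not part of dwca-io. It assumes that any dateFormat attribute on a <core> or <extension> element counts as "setting a dateFormat", and that a location containing a scheme followed by "://" is a URL.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper; not part of dwca-io. */
class MetaXmlCheck {

  /** Returns one message per unsupported feature found in the given meta.xml content. */
  static List<String> findUnsupported(String metaXml) {
    Document doc;
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
      doc = factory.newDocumentBuilder()
          .parse(new ByteArrayInputStream(metaXml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalArgumentException("meta.xml could not be parsed", e);
    }

    List<String> problems = new ArrayList<>();

    // Case 1: a <core> or <extension> element that carries a dateFormat attribute.
    for (String tag : new String[] {"core", "extension"}) {
      NodeList elements = doc.getElementsByTagNameNS("*", tag);
      for (int i = 0; i < elements.getLength(); i++) {
        Element el = (Element) elements.item(i);
        if (el.hasAttribute("dateFormat")) {
          problems.add("<" + tag + "> sets dateFormat: " + el.getAttribute("dateFormat"));
        }
      }
    }

    // Case 2: a <location> whose text looks like a URL (scheme followed by "://").
    NodeList locations = doc.getElementsByTagNameNS("*", "location");
    for (int i = 0; i < locations.getLength(); i++) {
      String location = locations.item(i).getTextContent().trim();
      if (location.matches("[A-Za-z][A-Za-z0-9+.-]*://.*")) {
        problems.add("<location> is a URL: " + location);
      }
    }
    return problems;
  }
}
```

An empty list means neither feature was found; it does not guarantee that the archive is otherwise readable.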

License: Apache License 2.0

