ropensci / software-review

rOpenSci Software Peer Review.

submission: arkdb

cboettig opened this issue · comments

Summary

  • What does this package do? (explain in 50 words or less):

Reads large, compressed text files into a database in chunks. Writes all tables in a database back out as compressed tsv files.
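
A minimal sketch of that round trip, assuming the argument order discussed later in this thread (the file paths and example database here are hypothetical):

library(arkdb)
library(DBI)

# archive every table of an existing database into compressed text files
db  <- dbConnect(RSQLite::SQLite(), "nycflights.sqlite")   # hypothetical path
dir <- file.path(tempdir(), "archive")
dir.create(dir)
ark(db, dir, lines = 50000)

# stream those files back into a fresh database, chunk by chunk
new_db <- dbConnect(RSQLite::SQLite(), "new.sqlite")       # hypothetical path
unark(fs::dir_ls(dir), new_db, lines = 50000)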

  • Paste the full DESCRIPTION file inside a code block below:
Package: arkdb
Version: 0.0.0.9000
Title: Archive and Unarchive Databases Using Flat Files
Description: Flat text files provide a more robust, compressible,
  and portable way to store tables.  This package provides convenient
  functions for exporting tables from relational database connections
  into compressed text files and streaming those text files back into
  a database without requiring the whole table to fit in working memory.
Authors@R: c(person("Carl", "Boettiger", 
                    email = "cboettig@gmail.com", 
                    role = c("aut", "cre", "cph"),
                    comment=c(ORCID = "0000-0002-1642-628X")),
              person("Richard", "FitzJohn", role = "ctb"))
URL: https://github.com/cboettig/arkdb
BugReports: https://github.com/cboettig/arkdb/issues
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
ByteCompile: true
VignetteBuilder: knitr
RoxygenNote: 6.0.1.9000
Roxygen: list(markdown = TRUE)
Imports: 
    dbplyr,
    DBI,
    readr,
    RSQLite,
    progress,
    dplyr,
    methods
Suggests: 
    nycflights13,
    testthat,
    knitr,
    covr,
    fs,
    rmarkdown
  • URL for the package (the development repository, not a stylized html page):

https://github.com/cboettig/arkdb

  • Please indicate which category or categories from our package fit policies this package falls under and why (e.g., data retrieval, reproducibility). If you are unsure, we suggest you make a pre-submission inquiry:

Data deposition! Because good archiving formats are crucial. It's kind of crazy how many researchers and repositories distribute data only through APIs and uncompressed SQL dumps.

  •   Who is the target audience and what are scientific applications of this package?  

Researchers working with large data files

  • Are there other R packages that accomplish the same thing? If so, how does yours differ?

I don't think so?

  •   If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

Requirements

Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has a CRAN and OSI accepted license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration, including reporting of test coverage, using services such as Travis CI, Coveralls and/or CodeCov.
  • I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Publication options

  • Do you intend for this package to go on CRAN?
  • Do you wish to automatically submit to the Journal of Open Source Software? If so:
    • The package has an obvious research application according to JOSS's definition.
    • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI:
    • (Do not submit your package separately to JOSS)
  • Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:
    • The package is novel and will be of interest to the broad readership of the journal.
    • The manuscript describing the package is no longer than 3000 words.
    • You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
    • (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE's scope.)
    • (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
    • (Please do not submit your package separately to Methods in Ecology and Evolution)

Detail

  • Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

R CMD check results
0 errors | 0 warnings | 0 notes

  • Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:

  • If this is a resubmission following rejection, please explain the change in circumstances:

  • If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

    • @richfitz (I've listed him as a contributor because I've borrowed code from his other packages here, so I don't think that counts as a conflict of interest(?))
    • @jimhester
    • Maybe an easy package for a first-time reviewer, since this is rather minimal?

Editor checks:

  • Fit: The package meets criteria for fit and overlap
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly

Editor comments

Hi @cboettig. Thanks for the submission. I'll be editing this package.

I get errors in the tests or examples when running devtools::check(cran = FALSE, run_dont_test = TRUE). My guess is that those errors are in the \dontrun{} sections of the examples. Could you please fix those errors before the review starts?

One issue raised by goodpractice::gp(). Could you look into this please?

fix this R CMD check NOTE: Namespace in Imports field not imported from:
    ‘dbplyr’ All declared Imports should be used.

I also note some genuine spelling errors. You may wish to look at devtools::spell_check() yourself.

avialable      ark.Rd:29
databaase      unark.Rd:5,89
ouput          ark.Rd:12

I will be looking for reviewers.

@lmullen I continue to learn tricks from you! I should remember to run the don't-test examples and spell checks myself. I think all these issues should be fixed in the current master, ropensci/arkdb@1a33432

@cboettig Thanks for fixing those. Everything looks good to start the review, and I've solicited reviewers.

Reviewers: @richfitz and @expectopatronum.
Due date: 2018-07-06

Thanks to Verena and Rich for being willing to do this review. I've set a due date 3 weeks from now. Here are the reviewer's guide and the reviewer's template.

@cboettig Could you please add the rOpenSci review badge to your repository if you haven't done so already?

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 3


Review Comments

I was able to install the package without errors on Windows 10. I ran the following commands with the following output:

  • goodpractice::gp() -> output screenshot

  • vignette("arkdb") -> Just a matter of taste I guess, but I'd find it easier to have an introductory vignette that is named after the package.
    Warning message: vignette ‘arkdb’ not found
  • devtools::check() -> 1 Warning about qpdf (I guess I would need to install that but it didn't seem straightforward on Windows)
  • devtools::test() -> OK
  • example from help -> OK

I did not yet tick the check box next to Packaging guidelines. As far as I can see most/all required items are fulfilled, but I have a few remarks:

  • @return is not documented for ark
  • no documentation/help for ?arkdb is provided
  • consider adding tests for error cases

Additionally, I have a few minor remarks about things I would change in the code (unless I'm missing something and this is the better way to do it):

  • The order of parameters in ark(db_con, dir, lines) and unark(files, db_con, lines) does not seem intuitive to me. Unless it makes more sense for piping (magrittr) with db_con %>% ark and files %>% unark, I would change the order to (db_con, dir/files, lines).
  • Consider replacing the repeat loop with while (start <= end) (in ark.R:71). Unless I'm missing some corner case, there is no need for a repeat loop (a sketch of both shapes follows this list).
  • There is also a repeat in unark.R:74, but there it seems to make more sense.
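
For reference, a minimal sketch of the two loop shapes (get_chunk and process are placeholder names here, not arkdb functions):

# current shape: repeat with an explicit break
start <- 1
repeat {
  process(get_chunk(start))     # placeholder work per chunk
  start <- start + 1
  if (start > end) break
}

# suggested shape: the termination condition is visible up front
start <- 1
while (start <= end) {
  process(get_chunk(start))
  start <- start + 1
}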

In general it is a well-written, light-weight package. The intention of the package is clear and the documentation useful. To highlight just a few of the things I liked:

  • I learned about the function match.arg
  • the package provides information to the user using progress_bar
  • good code style (usage of dplyr, lapply, ...)
  • usage of invisible to avoid unnecessary output

Thanks for the helpful review, @expectopatronum.

@expectopatronum Thanks for this awesome review 🎉. Very helpful. I thought I'd wait for @richfitz's comments, but might as well check these off now and ask a few follow-up questions. (Rich has meanwhile given me a hint on at least one other thing I should improve.)

  • no documentation/help for ?arkdb is provided

Good catch, fixed!

  • consider adding tests for error cases

I'm terrible at adding these, at least until I get bug reports on particular cases. Are there some obvious example scenarios you can think of that should provide better error handling or error messages?

  • @return is not documented for ark
  • The order of parameters in ark(db_con, dir, lines) and unark(files, db_con, lines) seems not intuitive for me. Unless it makes more sense for using piping (magrittr) with db_con %>% ark and files %>% unark, I would change the order to (db_con, dir/files, lines)

Grouping these, since they are related (ark returns a directory/path, now documented!; unark returns the db_con). I think there's a good reason for this, but I see the confusion too. As you noted, I expect most uses not to actually use the return object, so it is returned invisibly, as something that could potentially be used in piping. For instance, imagine using arkdb to migrate from, say, a MySQL DB (ick!) to a MonetDBLite database:

old <-  dbConnect(MySQL(), ...)
new <-  dbConnect(MonetDBLite(), "my_db")

old %>% 
  ark(tempdir()) %>% 
  fs::dir_ls() %>% 
  unark(new)

This is why ark returns dir while unark returns db_con, and is also why ark takes the arguments in the order db_con, dir, while unark takes files first and db_con as the second argument. I could be talked out of this, as I can see how reversing the order is confusing, but I think of it as: the first argument is the input (as required by piping), and the second is the destination. ark operates on an existing database and outputs files; unark operates on files. What do you think?

repeat vs while: Yeah, looks like I could have used a while loop -- but as this is the block of code I stole from @richfitz, I'll let him comment, as there may be a good case for repeat. I'm not well-informed on the difference here.

Okay, pushing these few edits, lemme know what you think!

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

RGF: As @cboettig notes above, this is adapted from a similar bit of code I wrote a few months back but I don't think that counts as a COI as I regularly hate on my own code...

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1


Review Comments

I had not read the other review as I wrote this one (and having now looked through it, it seems Carl has already mostly dealt with those comments). Apologies that this is so late.

This package provides an interface for loading data into a database from disk, and extracting data from a database to disk, via R, without needing to do the entire operation in memory. This will help with very large databases (on the order of hundreds of millions of rows or more) where fitting data into memory is not an option. It will also increase the responsiveness of such operations (probably at the cost of overall speed) making large scale import and extraction from R a more pleasant experience.

This package, with about a hundred lines of code and two exported functions, fits a "micropackage" description, which makes reviewing a lot easier.

Major comments

The biggest flaw I see with the package is that it violates the Open–closed principle (I know this is not OOP, but I feel this is one principle that really transcends any paradigm). Carl has used readr::write_tsv and readr::read_tsv for the archive/import - presumably they satisfy his personal preferences for speed/reliability/future compatibility. But there's no real reason why utils::read.csv/utils::write.csv might not be used, or data.table::fread, or probably even something like feather (I don't really see that readr is a necessary strong dependency here either).

The package could be extended to take three extra arguments (or some sort of wrapper object): an extension and two functions (read/write); a sketch follows. This interacts somewhat with the compression handling, but I think it's probably still doable. As it is, if someone has an aversion to tsv (e.g., they want to point something that requires csv at the archived database) they cannot use the package without changing the source.
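
One possible shape for such a wrapper, as a rough sketch (streamable, streamable_tsv, and streamable_csv are names invented here for illustration, not an existing API):

# bundle an extension with paired read/write functions
streamable <- function(read, write, extension) {
  list(read = read, write = write, extension = extension)
}

streamable_tsv <- streamable(
  read      = function(file, ...) readr::read_tsv(file, ...),
  write     = function(x, path, ...) readr::write_tsv(x, path, ...),
  extension = "tsv"
)

streamable_csv <- streamable(
  read      = function(file, ...) utils::read.csv(file, ...),
  write     = function(x, path, ...) utils::write.csv(x, path, row.names = FALSE, ...),
  extension = "csv"
)

# ark()/unark() would then call fmt$read()/fmt$write() rather than readr
# functions directly, and use fmt$extension to name the archive files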

Misc comments:

The example in the README produces a significant number of warnings: Warning: Closing open result set, pending rows - these should be silenced if expected, eliminated if not.

The progress and general chattiness need to be tuneable (e.g., with a verbose and a progress argument). Ideally both would be tuneable separately, as progress bars are not very log-friendly but messages are (you can see this play out a bit with the vignette build - the progress output ends up on the parent terminal for me).

The log output as it stands is not incredibly informative after it has been run:

Exporting in 50000 line chunks:
airlines
...Done! (in 0.005350828 secs)
Exporting in 50000 line chunks:
airports
...Done! (in 0.02871299 secs)
Exporting in 50000 line chunks:
flights
[\] chunk 7...Done! (in 5.914837 secs)
Exporting in 50000 line chunks:
planes
...Done! (in 0.05566597 secs)
Exporting in 50000 line chunks:
weather
...Done! (in 0.4096012 secs)

Consider a little bit of reordering and indenting to get to

Exporting 'airlines' in 50000 line chunks:
   ...Done! (in 0.005350828 secs)
Exporting 'airports' in 50000 line chunks:
   ...Done! (in 0.02871299 secs)
Exporting 'flights' in 50000 line chunks:
   [\] chunk 7...Done! (in 5.914837 secs)

(and so on)

How is the default of lines chosen? Is there some heuristic for best performance? My gut instinct is that as it increases in size, the performance will get better, to a point. 10K lines might well be at the point where increasing the chunk size has limited benefit but I am curious if you looked? (I think I settled around a similar number - perhaps 50K in my application).
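
If you wanted to probe that heuristic empirically, something crude like the following would do (a sketch; db is a placeholder connection, and each chunk size writes to its own directory):

# time the full export at a few chunk sizes
for (lines in c(1e3, 1e4, 5e4, 1e5)) {
  dir <- file.path(tempdir(), paste0("chunk_", lines))
  dir.create(dir, showWarnings = FALSE)
  elapsed <- system.time(ark(db, dir, lines = lines))[["elapsed"]]
  message("lines = ", lines, ": ", round(elapsed, 2), " s")
}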

In ark.R

  if (is(db_con, "SQLiteConnection") |
      is(db_con, "MySQLConnection") |
      Sys.getenv("arkdb_windowing") == "FALSE") {
    query <- paste("SELECT * FROM", tablename, "LIMIT", 
                   lines, "OFFSET", (start-1)*lines)
  } else {
  ## Postgres can do windowing
    query <- paste("SELECT * FROM", 
                   tablename, 
                    "WHERE rownum BETWEEN",
                   (start - 1) * lines, 
                   "AND", 
                   start * lines)
  }

(The | should be ||.) But surely this is also not a complete test? There will be other DBI wrappers (thinking here of MariaDB; there are at least two Postgres DB wrappers, so there may be similar duplication for MySQL). I think the polarity of this test is also incorrect: every SQL dialect can do the first form, so that should be the fallback, not the Postgres extension. So rather than testing "known not to support", test for "known to support". You might also include a package-cached capability test (keyed against class) for a SELECT * FROM table WHERE rownum BETWEEN 1 AND 2 - if you don't get a syntax error, it should be ok to use; see the sketch below.
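
Something along these lines could serve as that cached capability test (a sketch; all names are illustrative):

.capability_cache <- new.env(parent = emptyenv())

has_windowing <- function(db_con, tablename) {
  key <- class(db_con)[[1]]                    # key the cache on the connection class
  if (!is.null(.capability_cache[[key]]))
    return(.capability_cache[[key]])
  ok <- tryCatch({
    DBI::dbGetQuery(db_con, paste(
      "SELECT * FROM", tablename, "WHERE rownum BETWEEN 1 AND 2"))
    TRUE
  }, error = function(e) FALSE)                # syntax error => no window support
  .capability_cache[[key]] <- ok
  ok
}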

Super minor comments

In unark.R:normalize_con

if(is(db_con, "src_dbi")){

I thought that inherits() was still the way to test this?
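
i.e. something like the following, assuming (as with dbplyr's src_dbi) that the underlying DBI connection lives in $con:

# inherits() is base R and avoids needing methods::is() at all
normalize_con <- function(db_con) {
  if (inherits(db_con, "src_dbi")) db_con$con else db_con
}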

Your postgres windowed code is not tested. You might try that on travis?

Your normalize_con code is not tested for the non-src_dbi path

I am confused by the point of the untested early exit in unark_file - when would this be wanted?

There are also a few error paths untested (see covr output)

In unark.R

    d <- reader()
    body <- paste0(c(header, d$data), "\n", collapse = "")
    p$tick()
    chunk <- readr::read_tsv(body, ...)
    DBI::dbWriteTable(db_con, tbl_name, chunk, append=TRUE)

I might be missing something but surely one can compute the headers ahead of time (0'th iteration) and then do:

    chunk <- reader()
    names(chunk) <- header
    p$tick()
    DBI::dbWriteTable(db_con, tbl_name, chunk, append=TRUE)

with modifications to read_chunked like

read_chunked <- function(con, n) {
  assert_connection(con)
  next_chunk <- readLines(con, n)
  if (length(next_chunk) == 0L) {
    # if we don't stop, we will hit an error!
    stop("connection has already been completely read")
  }
  function() {
    data <- next_chunk
    next_chunk <<- read_tsv(con, n, header = FALSE) # <-- change here
    complete <- length(next_chunk) == 0L # <-- probable change here too
    list(data = data, complete = complete)
  }
}

(This is quite possibly something I could do in my code that this comes from too!)

I would probably be inclined to split ark_chunk into a "read" and a "save" step.

@richfitz and @expectopatronum Thank you both for your thorough reviews.

@cboettig: You've already started on Verena's review, of course, but could you please address these two reviews and make any necessary changes by July 30?

@richfitz Thanks, this is excellent, and the detail is much appreciated! It addresses a bunch of things that have puzzled or bugged me, so I really appreciate your expertise here.

Just skimming the review for now, but one point where I could use a bit of additional feedback is the hard-wiring of read_tsv(). I agree that should be abstracted out (it's issue #3), but I could use a bit more guidance/ideas on how to make it more general, and how general it is worth making it.

For instance, supporting other delimited calls from readr should be simple, since they all take the same argument structure. I think you're right about base types, though currently the read_ side of the code relies on readr's algorithm to determine the compression type. I could do this by parsing the file name instead (see the sketch below); I'm not sure I want to introduce a dependency on libmagic / wand just to detect compression type by magic number...
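
For the file-name route, the guess could be as simple as the extension (a sketch covering the compressions readr writes; guess_compression is a hypothetical helper):

guess_compression <- function(path) {
  switch(tools::file_ext(path),
         gz  = "gzip",
         bz2 = "bzip2",
         xz  = "xz",
         "none")                        # anything else: assume uncompressed
}

guess_compression("flights.tsv.bz2")    # "bzip2"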

Other parsers get more tricky -- e.g. I wanted to support the fst package to considerably speed up the process, but it doesn't seem to support the notion of appending. I haven't looked into feather feasibility-wise, but I think its authors say it's not a stable/archival format anyway, which cuts against the spirit of ark (and it is meanwhile slower than other modern binary formats like fst). Of course compression is a bigger issue with many of these, particularly those that do not operate over standard R connections. Ideas here are welcome, but I'd be tempted to punt on this until a later date...

A somewhat related challenge that's not directly raised by the reviews is the issue of type stability. While some of the binary formats I just mentioned support this out of the box, types (dates, factors, integer vs numeric, etc.) can change over a write-out / read-in round trip. At the moment I feel this is an out-of-scope issue for a first release, but maybe a metadata file should be written with column types when archiving a database? A rough sketch of that idea is below. (This could work within a family like readr; it's harder to support automatically across different engines. The heterogeneity of file-based I/O really makes me appreciate the standardization provided by DBI / ODBC -- on the database side the interoperability just works without me having to do anything!)
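
To make the metadata idea concrete, a rough sketch (the file layout and helper names are invented; real type round-tripping is harder than this):

# write one "column<TAB>class" line per column alongside the archive
write_type_metadata <- function(df, path) {
  types <- vapply(df, function(x) class(x)[[1]], character(1))
  writeLines(paste(names(types), types, sep = "\t"), path)
}

# read it back as a named character vector of classes
read_type_metadata <- function(path) {
  meta <- read.delim(path, header = FALSE, stringsAsFactors = FALSE,
                     col.names = c("column", "class"))
  setNames(meta$class, meta$column)
}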

Up against a deadline now but should be able to get this done by July 30.

@cboettig - I've sketched out a proof-of-concept for generalising away from tsv here ropensci/arkdb#5

While doing that I noticed this section:

  compress <- match.arg(compress)
  rs <- DBI::dbSendQuery(db_con, paste("SELECT COUNT(*) FROM", tablename))
  size <- DBI::dbFetch(rs)
  end <- size[[1]][[1]]
  • You can use SELECT 1 FROM for efficiency I believe
  • I think this is the source of your result set warnings - why not just DBI::dbGetQuery(db_con, ...)?

@richfitz Thanks much! I'll take a look tonight.

Quick question: do you mean SELECT COUNT(1) FROM? My SQL is terrible, but my understanding is that it's the same to most interpreters (though I believe COUNT doesn't count NA/NULLs; not sure, but perhaps I avoid this a bit by selecting all columns?). Remarkable how difficult it is to ask a DB how many rows a table has -- apparently there are other DB-specific ways that are only approximate. I guess it comes from a world where DBs are constantly being written to and the number of rows in a table is a more ephemeral concept anyhow...

Re dbGetQuery: ...because I don't use DBI much and don't know how it works? Is that the same as dbSendQuery() + dbFetch() (i.e. for when no lazy evaluation of queries is desired, like here)?

Don't listen to me on the SELECT 1 thing - I should know better than to suggest micro-optimisations without checking first:

microbenchmark::microbenchmark(
  carl = DBI::dbGetQuery(db$con, "SELECT COUNT(*) FROM flights"),
  rich = DBI::dbGetQuery(db$con, "SELECT COUNT(1) FROM flights"))
## Unit: milliseconds
##  expr      min       lq     mean   median        uq       max neval
##  carl 2.404747 2.536132 2.818021 2.687870  3.002489  4.136245   100
##  rich 8.938237 9.306043 9.851665 9.705794 10.161812 12.921496   100

We ran into this problem at work but the tables in question have something on the order of 4 billion rows - so the approximate counts get quite useful...

Re dbGetQuery - use that approach, as above; otherwise (from ?dbSendQuery):

The ‘dbSendQuery()’ method only submits and synchronously executes
the SQL query to the database engine. It does not extract any
records - for that you need to use the ‘dbFetch()’ method, and
then you must call ‘dbClearResult()’ when you finish fetching the
records you need. For interactive use, you should almost always
prefer ‘dbGetQuery()’.

So you're missing the dbClearResult(), which would ideally go inside a tryCatch or withCallingHandlers, at which point you're better off just using dbGetQuery... (see the sketch below).
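
Side by side, the two patterns look like this (count_rows here is a made-up helper, not arkdb code):

# manual version: the result set must always be cleared
count_rows <- function(db_con, tablename) {
  rs <- DBI::dbSendQuery(db_con, paste("SELECT COUNT(*) FROM", tablename))
  on.exit(DBI::dbClearResult(rs))      # cleanup runs even if dbFetch() errors
  DBI::dbFetch(rs)[[1]]
}

# dbGetQuery() does the send/fetch/clear dance for you
count_rows2 <- function(db_con, tablename) {
  DBI::dbGetQuery(db_con, paste("SELECT COUNT(*) FROM", tablename))[[1]]
}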

I'm terrible at adding these, at least until I get bug reports on particular cases. Are there some obvious example scenarios you can think of that should provide better error handling or error messages?

As I come from software engineering, I can think of several cases that could be tested (some of which might be overkill). I would at least test what happens when the file/database connections are not working, though. Other tests could pass wrong parameters (e.g., names of tables that don't exist), ... a sketch of the kind of cases I mean follows. But as I said, that might be overkill :)
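
For instance, something along these lines with testthat (a sketch of the kinds of cases meant, not tests from the package):

test_that("ark and unark fail informatively on bad inputs", {
  db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  expect_error(ark("not a connection", tempdir()))    # wrong type for db_con
  expect_error(unark("no-such-file.tsv.bz2", db))     # file that does not exist
  expect_error(unark(tempfile(), "not a connection")) # wrong type for db_con
  DBI::dbDisconnect(db)
})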

This is why ark returns dir while unark returns db_con, and is also why ark takes the arguments in the order db_con, dir, while unark takes files first and db_con as the second argument. I could be talked out of this, as I can see how reversing the order is confusing, but I think of it as: the first argument is the input (as required by piping), and the second is the destination. ark operates on an existing database and outputs files; unark operates on files. What do you think?

Yes, that's completely reasonable and fine with me!

Great job!

@richfitz @expectopatronum @lmullen

Okay, I think I've hit just about all of these now.

@expectopatronum I have added error handling assertions for the argument types at the top of ark and unark (roughly of the shape sketched below).
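
Roughly of this shape, that is (a sketch of the kind of assertion meant, not the package's exact code):

assert_dbi <- function(db_con) {
  if (!inherits(db_con, "DBIConnection"))
    stop("db_con must be a DBI connection", call. = FALSE)
}

assert_dir <- function(dir) {
  if (!is.character(dir) || !dir.exists(dir))
    stop("dir must be an existing directory", call. = FALSE)
}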

@richfitz I've summarized my replies to your issues in ropensci/arkdb#6. In particular, see ropensci/arkdb#10, which implements your pluggable formats (supporting csv and tsv and base methods), and ropensci/arkdb#8, which I think largely sidesteps the issue of the different routines trying to window the DB queries (a big thanks to @krlmlr for the pointers there). The original methods are still supported as an option. Messaging and error handling are also improved following @richfitz's suggestions.

Okay, I think this is a much better package now, though there are no doubt things that could be better still or new issues I've introduced. Over to you all now; thanks for all the input and looking forward to your feedback.

@cboettig Perfect, I ticked the final checkbox. From my point of view the package is good to go!

Glad to see these revisions, @cboettig.

@expectopatronum Thanks very much for completing your thorough review.

@richfitz Thanks for going back and forth to improve this package. Would you be able to look over Carl's changes within roughly one week?

Looks good to me!

Thanks, @richfitz and @expectopatronum. @cboettig, we will consider this package successfully onboarded. 🚀

Carl, could you please work on the following?

To-dos:

  • Transfer the repo to rOpenSci's "ropensci" GitHub organization under "Settings" in your repo.
  • Add the rOpenSci footer to the bottom of your README
[![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)
  • Fix any links in badges for CI and coverage to point to the ropensci URL. We no longer transfer Appveyor projects to ropensci Appveyor account so after transfer of your repo to rOpenSci's "ropensci" GitHub organization the badge should be [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/ropensci/arkdb?branch=master&svg=true)](https://ci.appveyor.com/project/individualaccount/arkdb).
  • We're starting to roll out software metadata files to all ropensci packages via the Codemeta initiative; see https://github.com/ropensci/codemetar/#codemetar for how to include it in your package. After installing the package, it should be as easy as running codemetar::write_codemeta() in the root of your package.

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent). More info on this here.

Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or a long-form post with more narrative about its development (https://ropensci.org/blog/). If you are interested, @stefaniebutland will be in touch about content and timing.

We've started putting together a gitbook with our best practices and tips; this chapter starts the third section, which is about guidance for after onboarding. Please tell us what could be improved; the corresponding repo is here.

Thanks @lmullen! I've ticked off the items above and transferred the package over to rOpenSci now. Badges / CI etc. seem to be working fine. Planning a CRAN release soon.

@lmullen can this issue be closed? 🙏 I'm asking because of my packages registry experiments, where I noticed the review badge of arkdb was still orange. 😉

@maelle Yes, definitely. Closing now.