rjrudin / ml-multiline-record-ingest-starter

Starter project for using Spring Batch to ingest multiline records from a delimited file into MarkLogic

This is a starter kit for creating an application that uses Spring Batch and marklogic-spring-batch for ingesting records from a delimited file where each record spans multiple lines. Each record then becomes a single document in MarkLogic. The intent is to simplify the process of creating an application using Spring Batch by leveraging the reusable components in marklogic-spring-batch, and by organizing a Gradle-based project for you that you can clone/fork/etc to quickly extend and customize for your specific needs.
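
To make "multiline record" concrete: a single logical record is spread across several consecutive lines of a delimited file, and the job stitches a fixed number of lines back together into one document. A purely hypothetical input snippet (the real sample files live in ./data/persons/; this layout is illustrative only):

    Jane,Doe,555-0100
    123 Main St,Anytown,CA
    jane.doe@example.com
    John,Smith,555-0101
    456 Oak Ave,Sometown,NY
    john.smith@example.com

Here each person spans three lines, so a row count of 3 would recombine the lines into one document per person.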

This project has the following defaults in place that you can use as a starting point:

  1. Defaults to writing to MarkLogic using localhost/8000/admin/admin
  2. Defaults to reading example files from ./data/persons/
  3. Defaults to combining every 10 lines into 1 document
  4. Has a Gradle task for launching the ingest - "./gradlew ingest"
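
All of these defaults live in gradle.properties. As a rough sketch of how they might appear there (the connection key names and the input path pattern are assumptions for illustration; check the actual file for the exact keys and values):

    # Connection defaults (key names assumed for illustration)
    host=localhost
    port=8000
    username=admin
    password=admin

    # Ingest defaults referenced throughout this README
    # (the exact input path pattern is an assumption)
    input_file_path=file:./data/persons/*.*
    row_count=10
    document_type=xml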

How do I try this out?

To try this out locally, just do the following:

  1. Clone this repo
  2. Verify you have ML 8 or later installed locally and that port 8000 (the default app server) points to the Documents database (you can of course modify this to write to any database you want)
  3. Verify that the username/password properties in gradle.properties are correct for your MarkLogic cluster (it's best not to use the admin user unless absolutely necessary, but this defaults to it for the sake of convenience)
  4. Run ./gradlew ingest

The configuration properties are all in gradle.properties. You can modify those properties on the command line via Gradle's -P mechanism. For example, to load the data as JSON instead of XML:

./gradlew ingest -Pdocument_type=json

If you have ML 9, you can try out the new Data Movement SDK (DMSDK):

./gradlew ingest -Papi=dmsdk

Or load the data via XCC instead of the REST API:

./gradlew ingest -Papi=xcc

For both the REST API and XCC, you can specify multiple hosts to send requests to:

./gradlew ingest -Phosts=host1,host2,host3

You can easily modify the file thread count, the MarkLogic thread count, and the chunk size (batch size):

./gradlew ingest -Pfile_thread_count=8 -Pthread_count=32 -Pchunk=50

And you can modify the row count - the number of rows that are combined into a single document:

./gradlew ingest -Prow_count=17

Or point to a different path:

./gradlew ingest -Pinput_file_path=file:/path/to/lots/of/files/**/*.*

Or customize the root and child element names:

./gradlew ingest -Proot_local_name=my-root -Pchild_record_name=my-child
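
With those names, each generated document would look roughly like the following (the exact structure is determined by the ColumnMapProcessor described below; the field names are illustrative):

    <my-root>
      <my-child>
        <first_name>Jane</first_name>
        <last_name>Doe</last_name>
      </my-child>
      <!-- one my-child element per row, row_count rows per document -->
    </my-root>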

Or just modify gradle.properties and start building your own application.

You can also see all the supported arguments:

./gradlew help

But how do I modify the XML that's inserted into MarkLogic?

The way the batch job works is defined by the org.example.IngestConfig class. This class creates a Spring Batch Reader, Writer, and Processor (for more information on these concepts, definitely check out the Spring Batch user manual).
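
For orientation, here is a minimal sketch of what such a chunk-oriented step wiring looks like, assuming Spring Batch's Java config API - this is illustrative, not the actual IngestConfig source:

    // Hypothetical wiring sketch; the real configuration lives in org.example.IngestConfig
    import java.util.Map;

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.ItemProcessor;
    import org.springframework.batch.item.ItemReader;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    import com.marklogic.client.document.DocumentWriteOperation;

    @Configuration
    public class IngestConfigSketch {

        @Bean
        public Step ingestStep(StepBuilderFactory steps,
                               ItemReader<Map<String, Object>> reader,
                               ItemProcessor<Map<String, Object>, DocumentWriteOperation> processor,
                               ItemWriter<DocumentWriteOperation> writer) {
            return steps.get("ingestStep")
                    // The chunk size is the number of items written per batch
                    .<Map<String, Object>, DocumentWriteOperation>chunk(50)
                    .reader(reader)       // emits one ColumnMap per multiline record
                    .processor(processor) // converts a ColumnMap into a write operation
                    .writer(writer)       // writes a batch of documents to MarkLogic
                    .build();
        }
    }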

The XML is currently generated by the org.example.ColumnMapProcessor class. This is a quick-and-dirty Spring Batch Processor implementation that uses a simple StAX-based approach for converting a Spring ColumnMap (a map of column names and values; I'm using this term as shorthand for a Map<String, Object>) into an XML document.
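
For flavor, here is a minimal re-creation of that StAX-based idea - a sketch of the approach, not the actual ColumnMapProcessor source:

    import java.io.StringWriter;
    import java.util.Map;

    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamWriter;

    import org.springframework.batch.item.ItemProcessor;

    // Sketch of a ColumnMap-to-XML conversion using StAX
    public class ColumnMapToXmlSketch implements ItemProcessor<Map<String, Object>, String> {

        @Override
        public String process(Map<String, Object> columnMap) throws Exception {
            StringWriter out = new StringWriter();
            XMLStreamWriter xml = XMLOutputFactory.newFactory().createXMLStreamWriter(out);
            // Root element name; the real project makes this configurable via root_local_name
            xml.writeStartElement("record");
            for (Map.Entry<String, Object> column : columnMap.entrySet()) {
                // Assumes column names are valid XML element names
                xml.writeStartElement(column.getKey());
                xml.writeCharacters(String.valueOf(column.getValue()));
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.close();
            return out.toString();
        }
    }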

To modify how this works, you'll need to write code, which opens the door to all the batch-processing power and flexibility provided by Spring Batch. Here are a few paths to consider:

  1. Modify the ColumnMapProcessor with your own method for converting a ColumnMap into a String of XML or JSON
  2. Write your own Processor implementation from scratch and modify IngestConfig to use it
  3. Write your own Reader that returns something other than a ColumnMap. Modify IngestConfig to use this new Reader, and modify the Processor as well, since it expects a ColumnMap.
  4. You can even replace the Writer, which depends on a MarkLogic Java Client DocumentWriteOperation instance. Typically, though, you'll be able to retain this part by having your Reader and/or Processor return a DocumentWriteOperation, which encapsulates all the information needed to write a single document to MarkLogic (see the sketch after this list).
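
As a concrete illustration of options 2 and 4, here is a hedged sketch of a from-scratch Processor that emits DocumentWriteOperation instances so the existing Writer can stay as-is. The class name, URI scheme, and JSON serialization are all assumptions, not project code:

    import java.util.Map;
    import java.util.UUID;

    import org.springframework.batch.item.ItemProcessor;

    import com.marklogic.client.document.DocumentWriteOperation;
    import com.marklogic.client.impl.DocumentWriteOperationImpl;
    import com.marklogic.client.io.Format;
    import com.marklogic.client.io.StringHandle;

    // Hypothetical Processor that writes each ColumnMap as a JSON document
    public class JsonColumnMapProcessorSketch
            implements ItemProcessor<Map<String, Object>, DocumentWriteOperation> {

        @Override
        public DocumentWriteOperation process(Map<String, Object> columnMap) {
            return new DocumentWriteOperationImpl(
                    DocumentWriteOperation.OperationType.DOCUMENT_WRITE,
                    "/person/" + UUID.randomUUID() + ".json", // URI scheme is up to you
                    null, // no document metadata
                    new StringHandle(toJson(columnMap)).withFormat(Format.JSON));
        }

        // Naive serialization for illustration only - a real implementation should
        // use a JSON library such as Jackson to handle escaping and value types
        private String toJson(Map<String, Object> columnMap) {
            StringBuilder json = new StringBuilder("{");
            String separator = "";
            for (Map.Entry<String, Object> column : columnMap.entrySet()) {
                json.append(separator)
                    .append('"').append(column.getKey()).append("\":\"")
                    .append(column.getValue()).append('"');
                separator = ",";
            }
            return json.append('}').toString();
        }
    }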

About

License: Apache License 2.0

