An HTTP API for querying and updating PURLs. See the API section below for docs.
- Ruby (3.2 or greater)
- bundler gem
- Apache Kafka (0.10 or greater), or Docker
Clone the repository:
git clone https://github.com/sul-dlss/purl-fetcher.git
cd purl-fetcher
Install dependencies:
bundle install
Set up the database:
rake db:migrate
The API communicates with a Kafka broker to dispatch and process updates asynchronously. You can run a Kafka broker locally or use the provided docker-compose configuration:
docker-compose up
Then, in a separate terminal, start a development API server:
bin/rails server
Finally, in another terminal, you can run the Kafka consumer to process updates from the Kafka broker:
bundle exec racecar PurlUpdatesConsumer
You can make requests to the API using curl or a similar tool. To add an object to the database, first download its public Cocina JSON from production PURL:
curl https://purl.stanford.edu/bb112zx3193.json > bb112zx3193.json
Then use the POST /purls/:druid endpoint to add the object to the database:
curl -X POST -H "Content-Type: application/json" -d @bb112zx3193.json http://localhost:3000/purls/bb112zx3193
After the object has been added, it will show up in the list of changes:
curl http://localhost:3000/docs/changes
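The POST request above can also be issued from Ruby. The following is a minimal sketch using only the standard library's Net::HTTP; the helper names build_purl_update_request and send_request are hypothetical, and the base URL is assumed to be the development server from the curl examples.

```ruby
require 'net/http'
require 'uri'

# Assumed base URL of the development server started with bin/rails server.
BASE_URL = 'http://localhost:3000'

# Build (but do not send) the POST /purls/:druid request.
# cocina_json is the public Cocina JSON document as a String.
def build_purl_update_request(druid, cocina_json, base_url: BASE_URL)
  uri = URI.join(base_url, "/purls/#{druid}")
  request = Net::HTTP::Post.new(uri)
  request['Content-Type'] = 'application/json'
  request.body = cocina_json
  request
end

# Send a prepared request and return the Net::HTTPResponse.
def send_request(request)
  uri = request.uri
  Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
end
```

With the server running, send_request(build_purl_update_request('bb112zx3193', File.read('bb112zx3193.json'))) would submit the file downloaded in the earlier step.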
The full test suite (with RuboCop style enforcement) can be run with the default rake task:
rake
The tests can be run without RuboCop style enforcement:
rake spec
The RuboCop style enforcement can be run without running the tests:
rake rubocop
POST /purls/:druid
Purl Document Update
The POST /purls/:druid endpoint provides the ability to create or update a PURL document from public Cocina JSON. This endpoint is used by dor-services-app as part of SDR workflows.
Name | Located In | Description | Required | Schema | Default |
---|---|---|---|---|---|
druid | url | Druid of a specific PURL | Yes | string, e.g. druid:cc1111dd2222 | null |
version | header | Version of the API request, e.g. version=1 | No | integer | 1 |
true
GET /docs/changes
Purl Document Changes
The /docs/changes endpoint provides information about public PURL documents that have been changed, along with their release tags and collection associations. This endpoint can be queried using purl_fetcher-client.
Name | Located In | Description | Required | Schema | Default |
---|---|---|---|---|---|
first_modified | query | Limit response by a beginning datetime | No | datetime in iso8601 | earliest possible date |
last_modified | query | Limit response by an ending datetime | No | datetime in iso8601 | current time |
page | query | Request a specific page of results | No | integer | 1 |
per_page | query | Limit the number of results per page | No | integer (1 - 10000) | 100 |
target | query | Release tag to filter on | No | string | nil |
version | header | Version of the API request, e.g. version=1 | No | integer | 1 |
{
"changes": [
{
"druid": "druid:dd111ee2222",
"latest_change": "2014-01-01T00:00:00Z",
"true_targets": ["SearchWorksPreview"],
"collections": ["druid:oo000oo0001"]
},
{
"druid": "druid:bb111cc2222",
"latest_change": "2015-01-01T00:00:00Z",
"true_targets": ["SearchWorks", "Revs", "SearchWorksPreview"],
"collections": ["druid:oo000oo0001", "druid:oo000oo0002"]
},
{
"druid": "druid:aa111bb2222",
"latest_change": "2016-06-06T00:00:00Z",
"true_targets": ["SearchWorksPreview"]
}
],
"pages": {
"current_page": 1,
"next_page": null,
"prev_page": null,
"total_pages": 1,
"per_page": 100,
"offset_value": 0,
"first_page?": true,
"last_page?": true
}
}
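As an illustration of how the query parameters and the response fit together, here is a hedged Ruby sketch. The helper names changes_uri and druids_released_to are hypothetical (they are not part of purl-fetcher or purl_fetcher-client), and the sketch assumes the server accepts UTC timestamps as produced by Time#iso8601.

```ruby
require 'json'
require 'time'
require 'uri'

# Build the URI for GET /docs/changes from the documented query parameters.
# Parameters left as nil are omitted so the server-side defaults apply.
def changes_uri(base_url, first_modified: nil, last_modified: nil,
                page: nil, per_page: nil, target: nil)
  params = {
    first_modified: first_modified&.iso8601,
    last_modified: last_modified&.iso8601,
    page: page,
    per_page: per_page,
    target: target
  }.compact
  uri = URI.join(base_url, '/docs/changes')
  uri.query = URI.encode_www_form(params) unless params.empty?
  uri
end

# Return the druids in a response body (JSON String) whose true_targets
# include the given release target. Entries without true_targets are skipped.
def druids_released_to(body, target)
  JSON.parse(body).fetch('changes')
      .select { |change| change.fetch('true_targets', []).include?(target) }
      .map { |change| change.fetch('druid') }
end
```

Applied to the sample response above, druids_released_to(body, 'SearchWorks') would return only druid:bb111cc2222, since it is the only entry listing SearchWorks among its true_targets.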
GET /docs/deletes
Purl Document Deletes
The /docs/deletes endpoint provides information about public PURL documents that have been deleted. This endpoint can be queried using purl_fetcher-client.
Name | Located In | Description | Required | Schema | Default |
---|---|---|---|---|---|
first_modified | query | Limit response by a beginning datetime | No | datetime in iso8601 | earliest possible date |
last_modified | query | Limit response by an ending datetime | No | datetime in iso8601 | current time |
page | query | Request a specific page of results | No | integer | 1 |
per_page | query | Limit the number of results per page | No | integer (1 - 10000) | 100 |
target | query | Release tag to filter on | No | string | nil |
version | header | Version of the API request, e.g. version=1 | No | integer | 1 |
{
"deletes": [
{
"druid": "druid:ee111ff2222",
"latest_change": "2014-01-01T00:00:00Z"
},
{
"druid": "druid:ff111gg2222",
"latest_change": "2014-01-01T00:00:00Z"
},
{
"druid": "druid:cc111dd2222",
"latest_change": "2016-01-02T00:00:00Z"
}
],
"pages": {
"current_page": 1,
"next_page": null,
"prev_page": null,
"total_pages": 1,
"per_page": 100,
"offset_value": 0,
"first_page?": true,
"last_page?": true
}
}
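Responses from the listing endpoints are paginated through the pages object. The sketch below shows one way a client might walk every page; it assumes that next_page holds the number of the following page when more results exist (the sample above only shows the null case), and the helper name all_records is hypothetical.

```ruby
require 'json'

# Return a lazy Enumerator over every record stored under `key`
# ("changes", "deletes", or "purls"), following pages.next_page until it is null.
# fetch_page is any callable that takes a page number and returns the
# response body as a JSON String, e.g. a lambda wrapping Net::HTTP.get.
def all_records(key, fetch_page)
  Enumerator.new do |yielder|
    page = 1
    while page
      body = JSON.parse(fetch_page.call(page))
      body.fetch(key).each { |record| yielder << record }
      # Assumption: next_page is the next page number, or null on the last page.
      page = body.fetch('pages')['next_page']
    end
  end
end
```

Injecting the fetcher keeps the paging logic independent of the HTTP library, so it can be exercised with canned responses before pointing it at a real server.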
GET /collections/:druid/purls
Collection Purls route
The /collections/:druid/purls endpoint provides a listing of Purls for a specific collection. This endpoint is used by the Exhibits application.
Name | Located In | Description | Required | Schema | Default |
---|---|---|---|---|---|
druid | url | Druid of a specific collection | Yes | string, e.g. druid:cc1111dd2222 | null |
page | query | Request a specific page of results | No | integer | 1 |
per_page | query | Limit the number of results per page | No | integer (1 - 10000) | 100 |
version | header | Version of the API request, e.g. version=1 | No | integer | 1 |
{
"purls": [
{
"druid": "druid:ee111ff2222",
"published_at": "2013-01-01T00:00:00.000Z",
"deleted_at": "2016-01-03T00:00:00.000Z",
"object_type": "set",
"catkey": "",
"title": "Some test object number 4",
"collections": [
"druid:ff111gg2222"
],
"true_targets": [
"SearchWorksPreview"
]
},
...
{
"druid": "druid:cc111dd2222",
"published_at": "2016-01-01T00:00:00.000Z",
"deleted_at": "2016-01-02T00:00:00.000Z",
"object_type": "item",
"catkey": "567",
"title": "Some test object number 2",
"collections": [
"druid:ff111gg2222"
],
"true_targets": [
"SearchWorksPreview"
],
"false_targets": [
"SearchWorks"
]
}
],
"pages": {
"current_page": 1,
"next_page": null,
"prev_page": null,
"total_pages": 1,
"per_page": 100,
"offset_value": 0,
"first_page?": true,
"last_page?": true
}
}
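To make the request and response shapes concrete, the following Ruby sketch builds the collection URL while enforcing the documented per_page bounds, then indexes the returned Purls by release target. The helper names collection_purls_uri and druids_by_target are hypothetical and not part of purl-fetcher.

```ruby
require 'json'
require 'uri'

# Documented bounds for the per_page query parameter.
PER_PAGE_RANGE = (1..10_000)

# Build the URI for GET /collections/:druid/purls.
def collection_purls_uri(base_url, druid, page: 1, per_page: 100)
  unless PER_PAGE_RANGE.cover?(per_page)
    raise ArgumentError, "per_page must be within #{PER_PAGE_RANGE}"
  end

  uri = URI("#{base_url}/collections/#{druid}/purls")
  uri.query = URI.encode_www_form(page: page, per_page: per_page)
  uri
end

# Map each release target to the druids that list it under true_targets.
# body is the response as a JSON String.
def druids_by_target(body)
  index = Hash.new { |hash, target| hash[target] = [] }
  JSON.parse(body).fetch('purls').each do |purl|
    purl.fetch('true_targets', []).each { |target| index[target] << purl.fetch('druid') }
  end
  index
end
```

Rejecting an out-of-range per_page on the client side is a design choice made for this sketch; the document does not say how the server responds to such values.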
You can create Kafka messages that cause all the Purls to be reindexed by running:
Purl.unscoped.find_in_batches.with_index do |group, batch|
  puts "Processing group ##{batch}"
  group.each(&:produce_indexer_log_message)
end
Or only for SearchWorks:
Purl.target('Searchworks').find_in_batches.with_index do |group, batch|
  puts "Processing group ##{batch}"
  Racecar.wait_for_delivery do
    group.each { |purl| purl.produce_indexer_log_message(async: true) }
  end
end
The API's internals use an ActiveRecord data model to manage information about published PURLs. This model consists of Purl, Collection, and ReleaseTag active records. See app/models/ and db/schema.rb for details.
This approach gives administrators a couple of ways to explore the data outside of the API.
With Rails' runner, you can query the database using ActiveRecord. For example, running the Ruby in script/reports/summary.rb using:
RAILS_ENV=environment bundle exec rails runner script/reports/summary.rb
produces output like this:
Summary report as of 2016-08-24 09:52:49 -0700 on purl-fetcher-dev.stanford.edu
PURLs: 193960
Deleted PURLs: 1
Published PURLs: 193959
Published PURLs in last week: 0
Released to SearchWorks: 5
With Rails' dbconsole, you can query the database using SQL. For example, running the SQL in script/reports/summary.sql using:
RAILS_ENV=environment bundle exec rails dbconsole -p < script/reports/summary.sql
produces output like this:
PURLs 193960
Deleted PURLs 1
Published PURLs 193959
Published this year 9
Released to SearchWorks 5