standardebooks / web

The source code for the Standard Ebooks website.

Home Page: https://standardebooks.org

Cover art database

acabal opened this issue · comments

Also see #232 and #233

We need a cover art database that can do the following:

  • List approved paintings that can be used for covers, browsable/searchable by:
    • Keyword like castle or seascape
    • Artist name
    • Painting name
  • Detail page for each artwork listing artist, dates, hi-res download link, whether the cover is in use in the corpus
  • Form for anonymous contributors to submit artworks for approval
    • Submissions can be anonymous
    • Captcha to prevent bots - see the newsletter signup form for a basic captcha implementation
    • Submissions enter a queue to be approved by a moderator before going live on the site
  • Moderator admin page
    • List of artworks up for approval
    • Yes/no button to approve an artwork

Out of scope, to be explored at a later date:

  • User login/management system - for now, submissions will be anonymous with no account or login required, and we can simply password-protect the mod admin page via Apache
  • Editing metadata of artwork that has already been submitted
  • Any other mod function besides approving/declining artwork in the queue

Artwork requirements

  • Artist name - we should store artists with their info in a table so we don't duplicate all that information

  • Artist death year

  • Artwork name

  • Artwork completed year (may be circa, a range, or unknown as well)

  • Color image upload

  • A list of subject tags for the image, like "castle" or "seascape"

  • PD proof:

    • Link to an approved museum page

    OR ALL OF THE FOLLOWING

    • Year book was published
    • Link to direct page scan of artwork (not just the start of the book, the direct page)
    • Link to direct page scan of page mentioning book publication year (not just the start of the book, the direct page)
    • Link to direct page scan of book copyright/rights statement page (may be the same as the publication year page but not always)
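The "museum link OR all of the book-scan fields" rule above can be sketched as a single check. This is only a minimal illustration: the function name and the array keys are hypothetical, not real column names, and checking that the museum is on the approved list is assumed to happen elsewhere.

```php
<?php
/**
 * Hypothetical helper: returns true when the submitted PD proof is complete.
 * Either a museum URL is given, or all four book-scan fields are given.
 * The array keys are illustrative, not the real column names.
 */
function hasCompletePdProof(array $proof): bool
{
    $filled = static function (string $key) use ($proof): bool {
        return isset($proof[$key]) && trim((string) $proof[$key]) !== '';
    };

    // Option 1: a link to a museum page is sufficient on its own.
    if ($filled('museumUrl')) {
        return true;
    }

    // Option 2: every book-scan field must be present.
    $bookFields = ['publicationYear', 'artworkPageUrl', 'publicationYearPageUrl', 'copyrightPageUrl'];
    foreach ($bookFields as $field) {
        if (!$filled($field)) {
            return false;
        }
    }

    return true;
}
```

A check like this would naturally live in the object's validation step rather than in the page that handles the POST.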

DB schema

Use the se database on localhost

Tables:

  • Artworks

  • Artists

  • ArtworkTags

  • Users - SE already has a Users table we can reuse, or we may skip this if we just password-protect the mod page

URL schema

Use a REST-like schema for our URLs.

See the polls system on the SE website for a model to follow for a RESTful service with forms for users to submit data.

  • GET <root>/artworks/new -- artwork submission form
  • POST <root>/artworks -- create new artwork
  • GET <root>/artworks -- browse the list of approved artwork
  • GET <root>/artworks/<artist-slug>/<artwork-slug> -- view a specific artwork entry
  • GET <root>/admin/artworks -- view the queue of unapproved artwork
  • POST <root>/admin/artworks/<artwork-id> -- approve or reject artwork with status=<approved,declined>. Restfully this would be a PATCH request

Out of scope for now:

  • GET <root>/artworks/<artist-slug> -- view all artworks by an artist
  • GET <root>/admin -- moderator homepage which may later contain more mod abilities if necessary
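As a rough sketch of how the <artist-slug>/<artwork-slug> part of these URLs could be derived from names: the helpers below are hypothetical and ASCII-only (accented characters would simply become hyphens), so treat them as an illustration of the URL shape rather than the site's actual slug logic.

```php
<?php
/**
 * Hypothetical slug helper (ASCII-only sketch, not the site's real function).
 */
function makeSlug(string $text): string
{
    $slug = strtolower($text);
    // Collapse every run of characters other than a-z and 0-9 into one hyphen.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);
    // Drop hyphens left over at either end, e.g. from a trailing period.
    return trim($slug, '-');
}

/** Builds the path for a single artwork entry from the two names. */
function artworkUrl(string $artistName, string $artworkName): string
{
    return '/artworks/' . makeSlug($artistName) . '/' . makeSlug($artworkName);
}
```

For example, artworkUrl('John McLure Hamilton', 'Edward Horner Coates, 10th President of the P.A.F.A.') yields /artworks/john-mclure-hamilton/edward-horner-coates-10th-president-of-the-p-a-f-a.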

Implementation details

Apache rewrites requests in the following ways:

  • POST /objects -> POST <FS_ROOT>/objects/post.php
  • GET /objects -> GET <FS_ROOT>/objects/index.php
  • GET /objects/<OBJECT_ID> -> GET <FS_ROOT>/objects/get.php?objectid=<OBJECT_ID>

The first two types of requests are done automatically by Apache so you don't have to change the config. The third type (getting an object by ID) is not rewritten automatically - you have to add a line to the Apache config.
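To make the mapping easier to read, here are the three rules mirrored as a small PHP function. This is purely illustrative: the real rewriting is done by Apache, the function name is made up, and the returned paths are relative to <FS_ROOT>.

```php
<?php
/**
 * Illustrative only: mirrors the three rewrite rules so the mapping is easy
 * to read. Apache does the real work; no PHP code like this exists on the site.
 */
function resolveScript(string $method, string $path): ?string
{
    $segments = explode('/', trim($path, '/'));

    // Collection-level requests: POST /objects and GET /objects.
    if (count($segments) === 1 && $segments[0] !== '') {
        if ($method === 'POST') {
            return $segments[0] . '/post.php';
        }
        if ($method === 'GET') {
            return $segments[0] . '/index.php';
        }
    }

    // Object-level request: GET /objects/<OBJECT_ID>.
    if (count($segments) === 2 && $method === 'GET') {
        // The ID is handed to get.php as a query-string parameter.
        return $segments[0] . '/get.php?objectid=' . rawurlencode($segments[1]);
    }

    // Anything else has no rule in this sketch.
    return null;
}
```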

RESTful operations should occur at the object level. Pages create objects using GET/POST data, and the objects have methods like Validate() to confirm the input is acceptable, Create() to create a new object, Save() to overwrite an existing object, and Delete() to delete the object.

Artworks can have an auto-incrementing int ID. Image uploads will live directly on the filesystem, not as DB blobs.

I wrote @jobcurtis via email with details on the small framework and structure the SE site uses. Here is a copy-paste of that email:

The codebase is in PHP using a lightweight custom templating system.

You can see how everything works here: github.com/standardebooks/web

Of interest will be our polls system, which is a basic example of how our RESTful file system structure works in our system: https://github.com/standardebooks/web/tree/master/www/polls

Members vote in polls by creating PollVote objects.

The form to create a new poll vote lives at GET https://standardebooks.org/polls/<POLL_ID>/votes/new

That form creates PollVotes by doing POST https://standardebooks.org/polls/<POLL_ID>/votes -> <FS_ROOT>/www/polls/votes/post.php

To list all poll votes (i.e., the results of a particular poll):

GET https://standardebooks.org/polls/<POLL_ID>/votes -> <FS_ROOT>/www/polls/votes/get.php?pollid=<POLL_ID>

Our object model uses a base class called PropertiesBase which provides some syntactic sugar for getters and setters. For example, consider this class:

class Foo extends PropertiesBase{
    protected $_Bar;

    protected function GetBar(){
        return $this->_Bar;
    }

    protected function SetBar($val){
        $this->_Bar = $val;
    }
}

PropertiesBase allows you to call the class in this way:

$foo = new Foo();
$foo->Bar = 'baz';
print($foo->Bar); // 'baz'
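For readers wondering how this kind of sugar can work, here is a minimal sketch using PHP's __get() and __set() magic methods. The class names are hypothetical stand-ins, and this is an assumption about the general technique, not a copy of the real PropertiesBase.

```php
<?php
/**
 * Hypothetical stand-in for PropertiesBase: routes reads and writes of
 * undeclared properties to GetX()/SetX() methods on the object.
 */
abstract class PropertiesBaseSketch
{
    public function __get(string $name)
    {
        $getter = 'Get' . $name;
        if (method_exists($this, $getter)) {
            return $this->$getter();
        }
        throw new LogicException('No getter for property ' . $name);
    }

    public function __set(string $name, $value): void
    {
        $setter = 'Set' . $name;
        if (method_exists($this, $setter)) {
            $this->$setter($value);
            return;
        }
        throw new LogicException('No setter for property ' . $name);
    }
}

class FooSketch extends PropertiesBaseSketch
{
    protected $_Bar;

    protected function GetBar()
    {
        return $this->_Bar;
    }

    protected function SetBar($val)
    {
        $this->_Bar = $val;
    }
}

$foo = new FooSketch();
$foo->Bar = 'baz';   // routed through __set() to SetBar()
print($foo->Bar);    // routed through __get() to GetBar()
```

Because Bar is never declared as a property, every outside access goes through the magic methods, which is what lets the getter and setter stay protected.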

@jobcurtis @colagrosso Can you two coordinate via this issue and submit PRs to the new covers branch?

I'm going to be fairly hands off here as I'm extremely busy. But, please feel free to ask me anything. Since we're using a small custom templating system and a not-quite-MVC pattern, some things might not be obvious or well documented.

Yep, understood. Thanks for the write-up.

I have a PR for an initial database schema and PHP classes. Still a lot to do on them, but enough for any early feedback.

Not in the PR, but I've been using this test script:

#!/usr/bin/php
<?

require_once('/standardebooks.org/web/lib/Core.php');

use Safe\DateTime;
$artist = new Artist('John McLure Hamilton', new DateTime('1936-01-01'));
$artist->Create();

$artwork = new Artwork($artist, 'Edward Horner Coates, 10th President of the P.A.F.A.', '1913', 'images/covers/filename.jpg', ['man', 'portrait', 'profile']);
$artwork->Create();

which has an example that I took from the first row of this spreadsheet: Standard Ebooks PD Art Research. It works fine, and produces these rows:

MariaDB [se]> select * from Artworks;
+-----------+----------+------------------------------------------------------+-----------------------------------------------------------------------------------+---------------+----------------------------+---------------------+------------+
| ArtworkId | ArtistId | Name                                                 | UrlName                                                                           | CompletedYear | ImageFilesystemPath        | Created             | Status     |
+-----------+----------+------------------------------------------------------+-----------------------------------------------------------------------------------+---------------+----------------------------+---------------------+------------+
|         1 |        1 | Edward Horner Coates, 10th President of the P.A.F.A. | /artworks/john-mclure-hamilton/edward-horner-coates-10th-president-of-the-p-a-f-a | 1913          | images/covers/filename.jpg | 2023-06-11 04:00:49 | unverified |
+-----------+----------+------------------------------------------------------+-----------------------------------------------------------------------------------+---------------+----------------------------+---------------------+------------+
1 row in set (0.00 sec)

MariaDB [se]> select * from Artists;
+----------+----------------------+---------------------+
| ArtistId | Name                 | DeathYear           |
+----------+----------------------+---------------------+
|        1 | John McLure Hamilton | 1936-01-01 00:00:00 |
+----------+----------------------+---------------------+
1 row in set (0.00 sec)

MariaDB [se]> select * from ArtworkTags;
+-----------+------------+
| ArtworkId | SubjectTag |
+-----------+------------+
|         1 | man        |
|         1 | portrait   |
|         1 | profile    |
+-----------+------------+
3 rows in set (0.00 sec)

Obvious shortcomings:

  1. I haven't added the 5 PD fields to the Artworks table yet
  2. The Artwork and Artist classes still need static functions to load them from the database. That might also require new indices.
  3. The script needs a few more test examples before I move on to a minimal upload form

Hi @colagrosso, as Alex said I've also been working on this a bit. I started with the upload form which is getting pretty close to complete (I should think I'll have a PR ready for tomorrow).

Your PR #235 looks pretty good to me. I'll rebase what I've got onto that so we can keep all our PRs working together. I'm not sure if @acabal wants to merge stuff as it gets done or wait for everything to be complete before reviewing it.

Since I'm almost done with the initial upload form, do you think it would make sense for you to work on loading the artwork from existing books into the database? Or alternatively, the page for admins to review submissions.

Great to hear from you, Job! Looking forward to working with you. Thanks for your comments on #235 already.

Will your work on the initial upload form cover these two URLs?

  • GET <root>/artworks/new -- artwork submission form
  • POST <root>/artworks -- create new artwork

If so, great. That would let us get an initial milestone working end-to-end where we can use the form and it will insert rows into the database.

In the meantime, I'll work on these two admin URLs so that we don't duplicate work:

  • GET <root>/admin/artworks -- view the queue of unapproved artwork
  • POST <root>/admin/artworks/<artwork-id> -- approve or reject artwork with status=<approved,declined>. Restfully this would be a PATCH request

Yeah that's right. I've also added a very basic page at GET /artworks (pretty much just hello world) so that submitting the form doesn't redirect you to a 404 until we get the real page done.

We should probably agree on how we store PD proof. The simplest way would just be 5 columns on the Artwork table. I think they will all have to be nullable as any given artwork might not have a particular field (and I guess artwork already in use won't have PD proof at all).

Clarifying question about validation; the issue says PD proof should have book publication date plus (link to museum page OR links to pages in books). I was under the impression that a link to a museum page meant it wasn't necessary to find the painting printed in a book. Is that correct?

Yes, I made a mistake. Book pub year is not necessary if we have a museum URL.

Do note that if a museum URL is supplied, we should do a simple check on the museum domain to make sure it's on the approved list. So we might need another table, Museums? That could store the base domain for us to check against, and, for implementation at some future date, some kind of URL regex to see if we can normalize museum URLs a little (they often come with search queries and other stuff attached in the query string that are just clutter).
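A minimal sketch of the domain check, assuming the approved base domains have already been loaded into an array (from a Museums table or otherwise). The function name and the example domains used with it are placeholders.

```php
<?php
/**
 * Hypothetical check: is the URL's host an approved museum domain, or a
 * subdomain of one? $approvedDomains stands in for data from the database.
 */
function isApprovedMuseumUrl(string $url, array $approvedDomains): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        // Not a parseable absolute URL.
        return false;
    }
    $host = strtolower($host);

    foreach ($approvedDomains as $domain) {
        $suffix = '.' . strtolower($domain);
        // Accept the bare domain and any subdomain such as "www.".
        if ($host === strtolower($domain) || substr($host, -strlen($suffix)) === $suffix) {
            return true;
        }
    }

    return false;
}
```

Matching on a leading dot keeps look-alike hosts out: with museum.example on the list, www.museum.example passes but evilmuseum.example does not.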

Off the back of #236, here are a few enhancements that we might want to make to the artwork submission form:

  • Suggest artists that already exist (there is already rudimentary support for autocomplete on the artist name field only)
  • Tooltips / wordier labels / guidance for some of the fields that aren't necessarily obvious
  • Setup instructions for the /www/images/uploads directory on Ubuntu
  • More comprehensive validation on PD proof fields (check it's an approved museum website, etc)
  • Maybe persist filled-in fields after failing validation - it's a pain that you have to fill the whole thing in again if you fail the CAPTCHA, for example
  • Configure Apache to allow uploads > 2MB

Persisting fields I would say is a requirement, as that's pretty basic web form functionality. An example of how to do this is in the newsletter subscription form. The post function saves an object into the session, and the form uses it to prefill fields. If there's no object in the session, the form creates an empty object which will fill the fields with defaults (usually just blank).

$subscription = $_SESSION['subscription'] ?? new NewsletterSubscription();

In this case the object would be an Artwork object.
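Here is a rough sketch of that round trip for the artwork form. Everything in it is hypothetical (the class, the function names, the form field names), and the session is passed in as a plain array so the flow is easy to follow; real pages would use $_SESSION directly.

```php
<?php
/** Hypothetical, stripped-down stand-in for the real Artwork class. */
class ArtworkFormSketch
{
    public string $Name = '';
    public string $ArtistName = '';
}

/** POST side: on a failed submission, remember what the user typed. */
function rememberSubmission(array &$session, array $post): void
{
    $artwork = new ArtworkFormSketch();
    $artwork->Name = trim($post['artwork-name'] ?? '');
    $artwork->ArtistName = trim($post['artist-name'] ?? '');
    $session['artwork'] = $artwork;
}

/** Form side: prefill from the session, or fall back to blank defaults. */
function artworkForForm(array $session): ArtworkFormSketch
{
    return $session['artwork'] ?? new ArtworkFormSketch();
}

/** Renders one prefilled field of the form. */
function renderNameField(ArtworkFormSketch $artwork): string
{
    // Escape user input before echoing it back into the form.
    $value = htmlspecialchars($artwork->Name, ENT_QUOTES);
    return '<input type="text" name="artwork-name" value="' . $value . '"/>';
}
```

One design point worth keeping from the newsletter example: the form never needs to know whether it is showing a fresh object or a failed submission, since both arrive as the same kind of object.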

Hi, @jobcurtis, no rush on reviewing #240. I have some other work for the next few days.

I also made good progress on these two URLs:

  • GET <root>/artworks -- browse the list of approved artwork
  • GET <root>/artworks/<artist-slug>/<artwork-slug> -- view a specific artwork entry

So I uploaded that PR as a draft. It depends on #240 for things like a common ArtworkDetail template.

Apologies if I took on work you were eager to do or sent too many large PRs. I'll hold off on further large changes until we've addressed some of the enhancements we've identified.

It's getting exciting, though. After #241, we'll have a rough lifecycle of upload -> review -> browse working.

@colagrosso The artworks approval queue looks good! I'll have a play around with the approved artworks list draft this evening.

Don't worry at all about picking up work if it's not been claimed. I'll try to update on this issue when I'm working on something. Speaking of which, I'm currently working on making the artwork submission form more user-friendly.

Right on. Something to consider for the artwork submission form: In templates/ArtworkDetail.php in #240, I did two things:

  1. Separated the artwork metadata from the PD proof
  2. Added a large TIP section explaining what's required for PD proof (copied from Alex's post above)

We could even make the TIP section its own template to re-use it.

Great work so far everyone!

A few notes on the 'POST artwork' logic:

  • When creating a new object, we should pass it the raw parameters from POST and let it create itself entirely within its Create() function. So, logic like validating tags, creating tags, deleting tags, etc., should occur at Artwork::Create() and not in post.php.

    $artwork->ArtworkTags = parseArtworkTags();

    This also applies to the image upload logic. We can pass Artwork::Create() the temp file path of the image, and it can do its own image creation/saving logic in Create(). For example:

    try{
        $artwork->Create($_FILES['color-upload']['tmp_name']);
    }
    catch(...){
        ...
    }

    Note that we don't have to delete the temp files in $_FILES because PHP deletes them automatically at the end of the request.

  • If Artwork::Create() throws an exception, we should be able to assume that the DB has not been changed, i.e. records have not been created or deleted. Therefore, we should not need to call Artwork::Delete() in the catch() block, because nothing should have been created.

    $artwork->Delete();

    Since Artwork::Create() calls Artwork::Validate() to confirm its inputs are acceptable before performing DB or filesystem operations, image validation logic should also go in Artwork::Validate(). Be careful to clean up any temp files you create (remember PHP deletes its own temp files).

Sure, I can move creation/validation to the Artwork class.

The Artwork::Delete call is there at the moment because we have to create the database object before we can calculate the filename, and creating files can fail, leaving us with an invalid Artwork in the database.

I could change this to look like:

  1. Create upload & thumbnail files in uploads dir with temporary names
  2. Create database object(s)
  3. Rename files using generated ID.

It's still possible that renaming the files could fail, but it's less likely if we've already created/moved them successfully. If it did fail, we'd be left with invalid database records, but they'd only show up in the admin queue. The other option would be to go back to using UUIDs. Thoughts?

If something fails with the file upload, deleting incomplete state should still belong in Artwork::Create(). That way if we call it from somewhere else, we don't have to remember that sometimes it might leave things half-finished if something goes wrong. So you should do a try/catch around file manipulations in Artwork::Create() and if they fail, delete the orphan records at that time before throwing an exception further up to post.php. (The Safe library provides functions that throw exceptions, unlike their regular PHP counterparts, for example Safe\rename to move files and throw an exception on failure instead of returning false.)

Also you probably don't need to put file uploads in a temporary dir. They're already in a temporary dir. Why not just manipulate them directly and save copies in the final destination? Artwork::Validate() can do some basic checks like "is the upload an image file" and "is the upload below the filesize max". If it fails then PHP will just delete the temp files at the end of the request anyway. If Artwork::Validate() succeeds then we can first create the DB entry, then manipulate PHP's temp upload files and save them into their final destination. I don't think an intermediate step of saving them in a 2nd temporary directory is necessary.
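The flow described above (validate, create the DB entry, then save the files, cleaning up if the file step fails) could be sketched like this. The store class and function names are hypothetical, the database is faked with an array, and $saveImage is a placeholder for the real file logic that is assumed to throw on failure, as the Safe functions do.

```php
<?php
/** Hypothetical in-memory stand-in for the Artworks table. */
class ArtworkStoreSketch
{
    public array $rows = [];
    private int $nextId = 1;

    public function insert(string $name): int
    {
        $id = $this->nextId++;
        $this->rows[$id] = $name;
        return $id;
    }

    public function delete(int $id): void
    {
        unset($this->rows[$id]);
    }
}

/**
 * Sketch of the Create() flow. $saveImage is a placeholder for the real
 * image-saving logic and must throw an exception on failure.
 */
function createArtwork(ArtworkStoreSketch $store, string $name, string $tmpPath, callable $saveImage): int
{
    // Validate before touching the database or the filesystem.
    if (trim($name) === '') {
        throw new InvalidArgumentException('Artwork name is required');
    }

    $id = $store->insert($name);

    try {
        // The final filename depends on the generated ID.
        $saveImage($tmpPath, $id . '.jpg');
    } catch (Throwable $ex) {
        // Remove the orphan row so callers never see half-finished state.
        $store->delete($id);
        throw $ex;
    }

    return $id;
}
```

With the cleanup inside the function, the calling page only needs a plain try/catch and can assume that a thrown exception means nothing was left behind.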

I needed some good test data for browsing, sorting, filtering, reviewing, etc., so I grabbed 62 useful images from the Standard Ebooks PD Art Research spreadsheet (#232 has links to previous discussions). I picked the 62 by finding the rows that had the "URL of high-res scan" column filled out that still linked to a valid image. (By definition, this is also the same set of rows where the "Thumbnail of hi-res image" column has an actual thumbnail.)

I made a new CSV with just those 62 entries and wrote a little PHP that will read the file and insert the artwork into the Artworks, Artists, and ArtworkTags tables. I also downloaded all the hi-res scans, made thumbnails, renamed them to 1.jpg, 1.thumb.jpg, 2.jpg, 2.thumb.jpg, etc. (Love that naming scheme now).

If you'd like to use the test data, too, first copy the images to the right place:

cp images/uploads/*.jpg /standardebooks.org/web/www/images/uploads
# Fix up image permissions if needed, e.g.:
sudo chown -R www-data:www-data /standardebooks.org/web/www/images/uploads

Then either run the PHP script to insert the records from the CSV:

./insert-artwork-from-csv.php

OR insert the records from the database backup:

mysql se < se_artwork62_tables.sql

Here's a link to a zip file that has the images, CSV, PHP code, and DB backup: se_artwork62.zip (112 MB, Google Drive link)

With that, it's much more fun to scroll through the Browse Artwork page:

[Screen recording: Screen.recording.2023-07-10.12.12.22.AM.webm]

Job noted a problem in #252 that I wrote I would follow up in a separate PR, so copying it here to track it.

  • Prevent the insertion of duplicate artworks, e.g., artworks with the same artist and artwork URL names like /rembrandt-harmenszoon-van-rijn/sacrifice-of-isaac-abrahams-sacrifice. Today the URL lookup function returns the first matching artwork, and other duplicates aren't available by URL name.

Hey Job and Alex,

We're coming up on 3 months of this project, and we have a good milestone to celebrate. I spun up a temporary DigitalOcean Droplet with all the synced ebooks plus my pending changes in #270 that insert them into the database. I also have a new copy of the zip file of the 60 approved images with cleaner code thanks to Job's Artwork::Build() method: se_artwork60_v2.zip (106 MB, Google Drive link). (Also, 2 of the 62 files I linked above are in use, so we're down to 60 available images.)

You can browse through the available and in-use artwork here:

https://143.198.245.16/artworks

Search/filter to in_use artwork by Monet:
https://143.198.245.16/artworks?status=in_use&query=Monet&sort=created-newest&per-page=50

Practice reviewing some submitted artwork (artworkmod:artworkmod):

https://143.198.245.16/admin/artworks

Or submit new artwork (I'm getting better at the CAPTCHAs):

https://143.198.245.16/artworks/new

It's all pretty functional, so that's a good feeling.

It's an old joke but true: We're about 90% done, now it's time for the remaining 90%. We have a few unchecked TODOs above, but we should start a new, prioritized list. I'll start with my thoughts:

  1. The submission form at /artworks/new should give more guidance to the submitter (discussed previously). I wrote a TIP aside on the artwork detail page, e.g.,

https://143.198.245.16/artworks/anonymous/achilles-lamenting-the-death-of-patroclus

and I think we should just put that TIP on the submission page, too. That way, the submitter and reviewer work from the same guidance. What do you two think?

  2. It's not easy to get from an artwork detail page back to browsing all artwork at /artworks. The back button is the current best option. I'm thinking of adding a link to /artworks on the artwork detail page, probably at the bottom, but I'm open to suggestions. Is there a better organization/navigation design?

I have a few other ideas of things to polish like #269 out for review, but not that major. What else is high priority on your lists?

Great work, looks like we're close to finishing!

Have you had time to do detailed testing?

I don't think the PD tip needs to be on an approved artwork page, it's not relevant to people merely browsing. It should only appear on the submissions form, where it's relevant.

For now using the back button is fine. We can think of breadcrumbs later after this first version is out.

> Have you had time to do detailed testing?

Yes, I've tested submitting, reviewing, browsing, and searching in detail. Submitting is kind of a pain because it's a big form, but that's no different from the old form that populated the Standard Ebooks PD Art Research spreadsheet.

Everything else is pretty clear and snappy. I'm probably blind to some issues, so I'll leave the Droplet running for a while for you and Job to click around on. Or we could invite an additional trusted tester or two.

> I don't think the PD tip needs to be on an approved artwork page, it's not relevant to people merely browsing. It should only appear on the submissions form, where it's relevant.

Fair point about merely browsing. Right now, we reuse the same artwork detail template for browsing and reviewing, e.g.,

https://143.198.245.16/artworks/anonymous/achilles-lamenting-the-death-of-patroclus
https://143.198.245.16/admin/artworks/887

I thought the PD tip was a useful reminder to reviewers, which is why it's there in the first place. I'll change it so that it's on the submission form and the admin review page, but not when browsing.

OK great. I'll let you guys test this all. Let me know when you think it's ready for a final review.

How is this coming along? I'm getting the feeling we're almost ready for a v1!

Hi Alex & Mike - apologies for going quiet recently; it's been a busy few weeks. I agree that we're pretty close to the initial version. I should have time this weekend to get back up to date with the state of the project and do a bit of testing. I'll take a look at those outstanding PRs too.

Ok, I've been doing some testing today and it all looks pretty good to me. I only found one minor issue with Firefox rejecting invalid XHTML, which I've fixed in #274.

As for the outstanding issues, I think the important ones that I can think of are the following - I've raised PRs for them both:

  • Allow uploads > 2MB (#275)
  • Guidance around PD proof in submission form (#276)

Hi Job and Alex:

I see a path to a v1. I've tested all this on 2 test droplets and my home machine, so it should work on Alex's test site and eventually production. Job, could you check my work first before we ask Alex to try the steps below?

Here are the steps I propose:

  1. Merge these 3 open PRs into the covers branch: #282, #283, #284
    Shouldn't be controversial, but good to get another pair of eyes on the changes, and I'm open to feedback.

  2. Merge covers into master. This will create a merge commit on master, marking the time we did the merge.

git fetch origin covers  # Only needed if covers isn't already on the machine
git checkout master
git merge covers

  3. There are two new commits that we need to apply to master after the merge. This is to make the files consistent with changes that occurred on master during the summer. I pushed these commits to a demo branch so you can see what they are:

Covers files: SeException -> AppException

Covers files: Remove unneeded Core.php lines

I can open a PR for these changes, or we can apply them in whatever way is quickest. They are not complicated.

4a. Add the new MySQL tables:

cd /standardebooks.org/web/config/sql/se
for SQL in ArtistAlternateSpellings.sql Artists.sql Artworks.sql ArtworkTags.sql Museums.sql Tags.sql 
do
  mysql se < $SQL
done

4b. Insert initial MySQL data from this gist:

git clone https://gist.github.com/a2931d4bb730913d0a334448cb6ce7cf.git
cd a2931d4bb730913d0a334448cb6ce7cf/
mysql se < insert-se-table-data.sql

  5. Insert some sample, unverified artwork

Grab a copy of se_artwork56_v9.zip (Google Drive link, 101MB). These are high-quality submissions, previously discussed above on Jul 10.

unzip se_artwork56_v6.zip
cd se_artwork56_v6
./insert-artwork-from-csv.php

That will add 56 unverified artworks. Browse around at /admin/artworks (username:password is artworkmod:artworkmod on the test site). Approve a few and see them on /artworks.

  6. Add artwork from existing ebooks. (This will take a few minutes if run over all 800+ ebooks.) If the ebooks are already deployed to the site, this will skip rebuilding them and just add the images to the database, which will be faster:

# -mindepth 1 keeps find from returning the ebooks directory itself
for EBOOK in $(find /standardebooks.org/ebooks -mindepth 1 -maxdepth 1 -type d)
do
  /standardebooks.org/web/scripts/deploy-ebook-to-www -v --no-images --no-build --no-epubcheck --no-recompose --no-feeds --no-bulk-downloads --upsert-to-cover-art-database "$EBOOK"
done

  7. Browse, search, and sort at /artworks for both "Approved" and "In use" artwork.

  8. Test submitting new artwork at /artworks/new. Verify the submitted artwork is available for review at /admin/artworks.

  9. When it's time, create usernames and passwords for artwork moderators on the production site:

htpasswd [-c] config/apache/htpasswd-standardebooks.org <username>

One warning on the steps: I had some trouble keeping the unix file ownership and permissions correct between the se and www-data users. If you get stuck on permission errors, I can tell you how I fixed them, e.g.,

  • git config --global --add safe.directory '*' for the se user
  • liberal use of sudo chown -R www-data:www-data

but I probably did it wrong and just don't grok the correct structure, perhaps with a group that se and www-data are both members of.

Sorry if the large number of steps and the warning don't inspire confidence. It actually is a nice little site and not a pile of hacks. I'll leave this test droplet running a bit longer if you'd like to click around and see how it should work:

https://143.198.245.16/artworks

Note that Remove unneeded Core.php lines coincides with an update to ./config/php/fpm/standardebooks.org.conf, so make sure you get that update otherwise the site will break for you.

Yep, thanks, good to call that out. I did add a link in the commit body ("See 042816c on master"), but I could have been clearer.

I think the plan above is reasonable: merge covers into master, then apply those other 2 commits to master after the merge. I considered other approaches, like cherry-picking 042816c and 12cf83a into covers and doing more work in covers before the merge to master, but the potential for making a mistake is higher.

Another option would be to update our branch with changes from master before merging (either merge master into covers or rebase covers onto master for a slightly cleaner history). This would allow us to make the required updates on our branch and then use the standard GitHub PR workflow.

Anyway, I've run through the steps outlined above and didn't encounter any problems 👍

> rebase covers onto master for a slightly cleaner history

Thanks, Job. I'll simulate this to see what it would look like, including bringing those other 2 commits above into the covers branch. Do you have any other pending changes you'd like to make? Rebasing will rewrite history, so it would be inconvenient if you have pending changes on the covers branch.

> Anyway, I've run through the steps outlined above and didn't encounter any problems 👍

Thanks for testing!

> rebase covers onto master for a slightly cleaner history

Ok, I've come around to this way of thinking. The steps will be simpler, and there's not much chance of making a mistake. Here's how it would go:

  1. When it's time (not yet), I'll create a PR like this draft one to rebase covers onto master. It will bring everything into the master branch in one step.
  2. Continue with Step 4 in my Oct 10 post above, i.e., add the new MySQL tables.

Pretty easy. As long as we are OK losing the original timestamps of the commits and we don't have any pending changes, this will work fine.

The reason I wrote "not yet" above is that I do have a few more pending changes I'll send out tonight. Sorry for the delay, but there are a handful of things worth changing now.

The state of alternate spellings is not great. Even though we have an AlternateSpellings table, if we follow my steps above, it will be empty. Instead, we'll end up inserting a handful of artists and their alternate spellings as duplicate rows. This will create more work to clean up later. Here are the 11 I've found:

Artist name                        | Alternate spelling
-----------------------------------+------------------------
Antonio Zeno Shindler              | Antonion Zeno Shindler
Edward John Poynter                | Edward Poynter
Élisabeth Louise Vigée Le Brun     | Élisabeth Vigée Le Brun
Francisco José de Goya y Lucientes | Francisco Goya
Frederic Leighton                  | Frederick Leighton
Ivan Ivanovich Shishkin            | Ivan Shishkin
Joaquín Sorolla y Bastida          | Joaquín Sorolla
John Singer Sargent                | John Sargent
Pierre-Auguste Renoir              | Auguste Renoir
Raphael                            | Raffaello Sanzio
Rembrandt Harmenszoon van Rijn     | Rembrandt van Rijn

(I could be convinced to swap the Artist name and Alternate spelling in some of the rows above.)

Here is some SQL we could add to config/sql/se/AlternateSpellings.sql

INSERT INTO AlternateSpellings VALUES 
((SELECT ArtistId FROM Artists WHERE Name = 'Antonio Zeno Shindler'), 'Antonion Zeno Shindler'),
((SELECT ArtistId FROM Artists WHERE Name = 'Edward John Poynter'), 'Edward Poynter'),
((SELECT ArtistId FROM Artists WHERE Name = 'Élisabeth Louise Vigée Le Brun'), 'Élisabeth Vigée Le Brun'),
((SELECT ArtistId FROM Artists WHERE Name = 'Francisco José de Goya y Lucientes'), 'Francisco Goya'),
((SELECT ArtistId FROM Artists WHERE Name = 'Frederic Leighton'), 'Frederick Leighton'),
((SELECT ArtistId FROM Artists WHERE Name = 'Ivan Ivanovich Shishkin'), 'Ivan Shishkin'),
((SELECT ArtistId FROM Artists WHERE Name = 'Joaquín Sorolla y Bastida'), 'Joaquín Sorolla'),
((SELECT ArtistId FROM Artists WHERE Name = 'John Singer Sargent'), 'John Sargent'),
((SELECT ArtistId FROM Artists WHERE Name = 'Pierre-Auguste Renoir'), 'Auguste Renoir'),
((SELECT ArtistId FROM Artists WHERE Name = 'Raphael'), 'Raffaello Sanzio'),
((SELECT ArtistId FROM Artists WHERE Name = 'Rembrandt Harmenszoon van Rijn'), 'Rembrandt van Rijn');

but it would require populating the Artists table before AlternateSpellings. Is that worth the extra complexity?
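To illustrate the ordering dependency, here is a minimal sketch in Python that uses an in-memory SQLite database as a stand-in for MySQL. The two-column schemas and the `load` helper are assumptions made for illustration, not the site's actual table definitions or loading code. If Artists is empty when the insert runs, the subquery yields NULL and the row is rejected by the NOT NULL constraint assumed here.

```python
import sqlite3

# Hypothetical, simplified schemas; the real tables likely have more columns.
SCHEMA = """
CREATE TABLE Artists (ArtistId INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE AlternateSpellings (ArtistId INTEGER NOT NULL, AlternateSpelling TEXT NOT NULL);
"""

# Same shape as one row of the INSERT proposed above.
INSERT_SPELLING = """
INSERT INTO AlternateSpellings VALUES
((SELECT ArtistId FROM Artists WHERE Name = 'Raphael'), 'Raffaello Sanzio')
"""

def load(populate_artists_first: bool) -> bool:
    """Return True if the alternate spelling row was inserted."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    if populate_artists_first:
        db.execute("INSERT INTO Artists (Name) VALUES ('Raphael')")
    try:
        db.execute(INSERT_SPELLING)
    except sqlite3.IntegrityError:
        # The subquery matched no artist, so ArtistId was NULL.
        return False
    return True

ok_when_ordered = load(populate_artists_first=True)
ok_when_empty = load(populate_artists_first=False)
```

How MySQL reacts to the NULL depends on the real column definitions and SQL mode, so this only shows why the load order matters.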

If we did this, the rest of the site would do the right thing when adding new books with an alternate spelling and displaying them.

I'm leaning toward adding these alternate spellings to config/sql/se/AlternateSpellings.sql, so I created a draft pull request: #289

After this initial data, handling alternate spellings will be a moving target and will require some manual maintenance. For example, deleting a row from Artists, adding a row to AlternateSpellings, and updating the ArtistId column in Artworks. Is that OK?
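The manual maintenance described above could look roughly like the following sketch, again in Python with an in-memory SQLite database standing in for MySQL. The schemas, the `merge_duplicate_artist` helper, and the sample rows are hypothetical. The order is the point: artworks are repointed first, then the spelling is recorded, and only then is the duplicate artist deleted.

```python
import sqlite3

def merge_duplicate_artist(db, duplicate_name, canonical_name):
    """Fold a duplicate Artists row into the canonical artist."""
    (dup_id,) = db.execute(
        "SELECT ArtistId FROM Artists WHERE Name = ?", (duplicate_name,)
    ).fetchone()
    (keep_id,) = db.execute(
        "SELECT ArtistId FROM Artists WHERE Name = ?", (canonical_name,)
    ).fetchone()
    with db:  # commit all three statements together, or roll back on error
        db.execute("UPDATE Artworks SET ArtistId = ? WHERE ArtistId = ?", (keep_id, dup_id))
        db.execute("INSERT INTO AlternateSpellings VALUES (?, ?)", (keep_id, duplicate_name))
        db.execute("DELETE FROM Artists WHERE ArtistId = ?", (dup_id,))
    return keep_id

# Hypothetical, simplified schemas and sample rows for the demonstration.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Artists (ArtistId INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE AlternateSpellings (ArtistId INTEGER NOT NULL, AlternateSpelling TEXT NOT NULL);
CREATE TABLE Artworks (ArtworkId INTEGER PRIMARY KEY, ArtistId INTEGER NOT NULL, Name TEXT NOT NULL);
INSERT INTO Artists (Name) VALUES ('Pierre-Auguste Renoir'), ('Auguste Renoir');
INSERT INTO Artworks (ArtistId, Name) VALUES (1, 'Le Moulin de la Galette'), (2, 'Bal du moulin de la Galette');
""")
keep_id = merge_duplicate_artist(db, "Auguste Renoir", "Pierre-Auguste Renoir")
```

Running the three statements in one transaction avoids leaving artworks that point at a deleted artist if a step fails partway through.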

@acabal: I believe we're ready for a v1!

  1. This PR will merge everything into master: #291
  2. Continue with Step 4 in my Oct 10 post above, i.e., add the new MySQL tables.

I tested the steps again in a new droplet, and they worked. I still had to do a bit of file ownership and permissions shuffling between the se and www-data users, but didn't think that was a big deal.

@jobcurtis: Anything I missed or any other thoughts on going for a v1?

No, I think you've covered everything. I'm happy with this as a v1.

No code changes or changes to the steps, but I updated the link in Step 5 in my Oct 10 post above. We're down to 56 high-quality submissions because I noticed that these 3 were already in use:

| Status | Temporary link |
| --- | --- |
| in use | https://143.198.245.16/artworks/william-mcgregor-paxton/the-string-of-pearls |
| duplicate | https://143.198.245.16/artworks/william-mcgregor-paxton/a-string-of-pearls |
| in use | https://143.198.245.16/artworks/auguste-renoir/bal-du-moulin-de-la-galette |
| duplicate | https://143.198.245.16/artworks/pierre-auguste-renoir/le-moulin-de-la-galette |
| in use | https://143.198.245.16/artworks/childe-hassam/isles-of-shoals-broad-cove |
| duplicate | https://143.198.245.16/artworks/childe-hassam/broad-cove-isles-of-shoals |

Testers and future volunteers may want to submit more artwork from the Standard Ebooks PD Art Research spreadsheet. There are 529 rows in that spreadsheet. The 56 that I'm including are the ones that:

  1. Have a working URL of a high-res scan (which also means a working thumbnail in the last column, "Thumbnail of hi-res image")
  2. Aren't already in use

100% coverage of the in-use artwork might not be attainable or even high-priority (more approved artwork for producers is more important), but I made 2 changes in #292 that get us from 96% coverage to 99.9%.

  1. Handle images/cover.source.png and images/cover.source.tif files

I ignored these back in #270, but there are 30 ebooks that have images/cover.source.png or images/cover.source.tif instead of images/cover.source.jpg, including the beautiful Celestial Eyes from The Great Gatsby. I was dreading these because our code is set up only for JPEG files, but the solution was simple: The script can easily convert the PNG or TIFF file to JPEG before calling our PHP code. (They are all temporary files in the working directory anyway.) I tested this, and it works well.
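As a rough illustration of the lookup step, the sketch below picks whichever cover source file exists and reports whether it needs converting to JPEG. It is written in Python for illustration and is not the actual script; the `find_cover_source` name and the preference order are assumptions. The conversion itself is left out, since it would be delegated to an image tool.

```python
import tempfile
from pathlib import Path

# Extensions in an assumed order of preference; JPEG needs no conversion.
SOURCE_EXTENSIONS = ("jpg", "png", "tif")

def find_cover_source(images_dir):
    """Return (path, needs_conversion) for the first cover source found, or None."""
    for ext in SOURCE_EXTENSIONS:
        candidate = Path(images_dir) / f"cover.source.{ext}"
        if candidate.is_file():
            return candidate, ext != "jpg"
    return None

# Demonstration with a temporary directory that only contains a PNG source.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "cover.source.png").touch()
    path, needs_conversion = find_cover_source(tmp)
    found_name = path.name
```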

  2. Handle artist-1 in content.opf

There are 3 ebooks that have an artist-1 and an artist-2 in content.opf instead of a single artist. (In one case, there's even an artist-3.) I updated the script to take artist-1 as the painting's artist. This isn't ideal, but it's usually correct. Sometimes artist-2 is for later artwork after the cover. Sometimes it indicates that two artists worked on the cover, which we aren't set up to handle.
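The fallback can be sketched as follows, in Python for illustration rather than as the actual script. It assumes the cover artist appears in content.opf as a `dc:contributor` element whose `id` is `artist` or `artist-1`; the `cover_artist` helper and the sample names are hypothetical.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

def cover_artist(opf_xml):
    """Return the name with id="artist", else id="artist-1", else None."""
    root = ET.fromstring(opf_xml)
    names = {
        el.get("id"): (el.text or "").strip()
        for el in root.iter(f"{DC}contributor")
    }
    # Prefer the single artist; otherwise take artist-1 and ignore the rest.
    return names.get("artist") or names.get("artist-1")

# Made-up sample modeled on the multi-artist case described above.
SAMPLE = """<package xmlns="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <metadata>
    <dc:contributor id="artist-1">First Artist</dc:contributor>
    <dc:contributor id="artist-2">Second Artist</dc:contributor>
  </metadata>
</package>"""

artist = cover_artist(SAMPLE)
```

Returning None when neither id is present leaves it to the caller to decide how to handle ebooks with no artist at all.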

With those two changes, there are only 2 ebooks whose in-use artwork the script can't insert into the database:

  1. Narrative of the Life of Frederick Douglass, ebook, repo

Problem: There is no artist in content.opf, and colophon.xhtml lists the artist as "an unidentified artist".

Proposal: Add an Unknown artist. I opened a pull request to do so.

This approach was also taken in The Prophet by Khalil Gibran.

  2. Thuvia, Maid of Mars, ebook, repo

Problem: There is no se:name.visual-art.painting in colophon.xhtml

It reads:

<p>The cover page is adapted from<br/>
the original 1920 first edition cover art by<br/>
<a href="http://id.loc.gov/authorities/names/no2018133319"><abbr epub:type="z3998:given-name">P. J.</abbr> Monahan</a>.<br/>

That makes sense, and it's a nice adaptation of the original. The upsert-to-cover-art-database script can't do anything without a se:name.visual-art.painting, though.

Proposal: Ignore this cover for now and consider inserting it manually later. I couldn't find an official title of the artwork, but we could reuse Thuvia, Maid of Mars as both the ebook and artwork title.
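To show why a colophon like this one gives the script nothing to work with, here is a minimal Python sketch of the title lookup. It is not the actual script; it assumes the title is marked up as an element whose `epub:type` includes `se:name.visual-art.painting`, and the `painting_title` helper and both sample colophons are made up for illustration.

```python
import xml.etree.ElementTree as ET

EPUB_TYPE = "{http://www.idpf.org/2007/ops}type"
PAINTING = "se:name.visual-art.painting"

def painting_title(colophon_xhtml):
    """Return the text of the first painting-name element, or None if absent."""
    root = ET.fromstring(colophon_xhtml)
    for el in root.iter():
        # epub:type may hold several space-separated semantic tokens.
        if PAINTING in el.get(EPUB_TYPE, "").split():
            return "".join(el.itertext()).strip()
    return None

# Made-up colophon fragments; the second mirrors the case with no painting name.
WITH_TITLE = """<section xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
  <p>The cover page is adapted from<br/>
  <i epub:type="se:name.visual-art.painting">Example Painting</i>.</p>
</section>"""

WITHOUT_TITLE = """<section xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
  <p>The cover page is adapted from<br/>
  the original first edition cover art.</p>
</section>"""

title = painting_title(WITH_TITLE)
missing = painting_title(WITHOUT_TITLE)
```

A None result is the signal to skip the ebook and leave it for manual insertion, as proposed above.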

I deleted my test droplet, so the https://143.* links above don't work anymore. Sorry if anyone was depending on those links for testing or demoing. I can spin up another test droplet if it would help.

This has been released!