Spike: Get the prod. archive fully reindexed by Google, while mitigating the load on the servers from crawling by the bot

Question

landreev opened this issue 9 months ago

Googlebot crawling has recently been generating a surprising amount of load and causing real problems. To mitigate this, we've been experimenting with limiting or stopping bot access to the holdings while we look for more efficient ways of feeding the metadata to them. This is now causing problems of its own: Google appears to have started dropping some previously indexed datasets from searches (rather than just taking longer to index newly published content, as was intended). So it is somewhat urgent to get everything indexed again while keeping the servers alive.

There's some overlap with #222, as I'm specifically trying to feed the schema.org metadata exports to Google.

landreev · Answer 1 · Sat Sep 16 2023 06:59:39 GMT+0800 (China Standard Time)

Google is in the process of reindexing the prod. archive. I'm going to keep an eye on the datasets that were specifically reported; if they don't get reindexed in the next couple of days, I'll force-request the bot to come crawl them.

landreev · Answer 2 · Thu Sep 21 2023 00:48:36 GMT+0800 (China Standard Time)

I got the dataset https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/D24VWO re-crawled by Googlebot repeatedly over the last couple of days. Unfortunately, it's still not showing up prominently in the Google search results when I search for the title. Getting a page reindexed in their search engine can take time (they don't give any guarantees about how long), but I am a little bit worried about it. If this and a few other datasets like it don't start showing up in searches in a few days, I will read their documentation some more and try to figure out how to address this.

C. Boyd · Answer 3 · Thu Sep 21 2023 23:10:56 GMT+0800 (China Standard Time)

2023/09/21: @landreev I added this to sprint ready with a tentative size of 3. Please resize as needed for this sprint. Also, I changed the title to indicate that it's a spike/investigation.

Philip Durbin · Answer 4 · Fri Sep 22 2023 19:24:50 GMT+0800 (China Standard Time)

@landreev is it possible we're suffering from this bug for some of our datasets?

IQSS/dataverse#8936

landreev · Answer 5 · Sat Sep 23 2023 01:05:53 GMT+0800 (China Standard Time)

Hmm. That's another issue I was not aware of (thank you for mentioning it). But it doesn't look like the sitemap is the issue in our case: the bots appear to be reading it, and they appear to be responsive to what's in it. If I change the date for a dataset in it, they come and get it, not instantly, but fairly quickly.
The datasets I'm keeping an eye on have been recrawled, but are still not appearing in searches.

(It would be weird if they kept using sitemaps with >50k entries for crawling but did not index the crawled content. Anyway, I clearly need to keep reading up on it.)

landreev · Answer 6 · Wed Oct 25 2023 03:39:49 GMT+0800 (China Standard Time)

This is the dataset mentioned earlier, the one somebody complained about specifically, which in turn prompted opening this dedicated issue.

C. Boyd · Answer 7 · Thu Jan 04 2024 04:09:12 GMT+0800 (China Standard Time)

2024/01/03: Moved to waiting status during kickoff; need to wait for a while to review.