Archives for DOIs are set to the wrong values
sdruskat opened this issue · comments
Hi,
Stumbled over two bad `software_archive`s:

One seems to have been inadvertently set to https://joss.theoj.org/papers/10.21105/v1.0.7 after it was previously correctly set to https://doi.org/10.5281/zenodo.10162614.
The other Zenodo DOI has been truncated to https://doi.org/10.5281/zenodo.714. The correct one is https://doi.org/10.5281/zenodo.7143971.
Perhaps this can be fixed in the metadata?
Also, please let me know if
- there is a better place or way for reporting these things, or
- if there is a place I could fix this myself and put up a PR (I'll also have a peep at the infra docs).
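For reference, the bad values above fall into two classes: a version string substituted for the DOI URL, and a truncated DOI. The first class can be caught with a simple format check. This is only a hypothetical helper sketch, not part of any JOSS tooling; the function name and host list are my own:

```python
import re
from urllib.parse import urlparse

# Hosts that resolve DOIs; a software_archive pointing anywhere else is suspect.
DOI_HOSTS = {"doi.org", "dx.doi.org", "www.doi.org"}
# Simplified DOI shape: "10.<registrant>/<suffix>".
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_doi_url(value: str) -> bool:
    """Return True if value looks like an https://doi.org/10.xxxx/... URL."""
    parsed = urlparse(value)
    if parsed.netloc.lower() not in DOI_HOSTS:
        return False
    return DOI_RE.match(parsed.path.lstrip("/")) is not None
```

A truncated DOI like `10.5281/zenodo.714` is still format-valid, so catching the second class requires actually resolving the DOI.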
See above: I put up openjournals/joss-papers#5158, as that was the only place I could find the offending strings in the GH org :).
Found another one (reported in openjournals/buffy#103 (comment)):
- JOSS paper: https://joss.theoj.org/papers/10.21105/joss.00041
- Existing `software_archive`: https://doi.org/10.5281/zenodo.23671 (an unrelated French paper from 1910)
- Correct DOI: https://doi.org/10.5281/zenodo.61965
Fixed in openjournals/joss-papers@2232b31
@sdruskat Thanks for reporting this!
The easy way to correct the wrong values is for an EiC to regenerate the PDF and the metadata files. That can be done by re-accepting the paper, so the best place to report these things is the review issue of the affected papers. I'll ping the EiCs for these two cases.
> the best place to report these things is the review issue of the affected papers.
Thanks for the pointer @xuanxu! I'll do this for future issues.
> I'll ping the EiC for these two cases.
Three cases with the one above, and thanks!
Continuing this, I just went and checked all the archive links in joss-papers, and these are the papers that have problems:
| file | archive | status |
|---|---|---|
| 10.21105.joss.00040.crossref.xml | http://dx.doi.org/10.5281/zenodo.59387 | 500 |
| 10.21105.joss.00612.crossref.xml | (Missing link) | |
| 10.21105.joss.00971.crossref.xml | (Missing link) | |
| 10.21105.joss.02314.crossref.xml | https://doi.org/10.5281/zenodo.3877690 | 500 |
| 10.21105.joss.04439.crossref.xml | https://dx.doi.org/10.5281/zenodo.6767313. | 404 |
| 10.21105.joss.04591.crossref.xml | https://dx.doi.org/v1.1.0 | 404 |
| 10.21105.joss.04684.crossref.xml | https://dx.doi.org/10.5281/zenodo.714 | 404 |
| 10.21105.joss.05395.crossref.xml | https://dx.doi.org/10.5281/zenodo.10050346 | 410 |
| 10.21105.joss.05883.crossref.xml | https://dx.doi.org/v1.0.7 | 404 |
not bad overall :)
here's the script (nothing special, just a one-off thing):
"""
Check whether the archive DOI for each paper resolves to a page.
run this from within the joss-papers directory.
because of the handling of ratelimiting, you'll have to run this a few times
until you no longer skip for ratelimits.
generates
- `joss_archive_links.csv` - see `Results` for columns
- `joss_archive_links_clean.csv` - see `clean_csv`
- `joss_doi_pages` - xz compressed cache of the resolved archive pages
requires:
- requests
- tqdm
- pandas
"""
import csv
from xml.etree import ElementTree
from pathlib import Path
from dataclasses import dataclass, fields, asdict
from typing import Optional, Literal, Union
import lzma
from multiprocessing import Pool, Lock, Event
from time import sleep, time
from math import ceil
import requests
from tqdm import tqdm
import pandas as pd
data_file = Path('joss_archive_links.csv')
cache_dir = Path('joss_doi_pages')
NAMESPACES = {
'rel': "http://www.crossref.org/relations.xsd"
}
@dataclass
class Results:
file: str
archive: Optional[str] = None
valid: bool = False
status: Optional[int] = None
error: Optional[str] = None
retry_after: Optional[float] = None
def process_paper(path:Path) -> Optional[Results]:
out_file = cache_dir / path.with_suffix('.html.xz').name
if out_file.exists():
return
paper = ElementTree.parse(path).getroot()
res = {}
res['file'] = path.name
try:
archive = paper.find(".//rel:inter_work_relation[@relationship-type='references']", NAMESPACES).text
archive = archive.lstrip('โ').rstrip('โ')
if not archive.startswith('http'):
archive = 'https://dx.doi.org/' + archive
res['archive'] = archive
# hold if we are currently in a ratelimit cooldown.
lock.wait()
req = requests.get(res['archive'])
res['status'] = req.status_code
match res['status']:
case 429:
res['retry_after'] = float(req.headers['x-ratelimit-reset'])
case 200:
res['valid'] = True
if res['status'] != 429:
with lzma.open(out_file, 'w') as cache_file:
cache_file.write(req.content)
except Exception as e:
res['error'] = str(e)
return Results(**res)
def init_lock(l):
"""make a lock (now an event) available as a global across processes in a pool"""
global lock
lock = l
def wait(lock:Event, result:Results, message:tqdm):
"""if we get a 429, acquire the lock until we can start again"""
lock.clear()
wait_time = ceil(result.retry_after - time())
message.reset(wait_time)
for i in range(int(wait_time)):
sleep(1)
message.update()
lock.set()
def main():
rate_lock = Event()
rate_lock.set()
cache_dir.mkdir(exist_ok=True)
# ya i know i ruin the generator but i like progress bars with totals
files = list(Path('.').glob("joss*/*crossref.xml"))
try:
all_pbar = tqdm(total=len(files), position=0)
good = tqdm(position=1)
bad = tqdm(position=2)
message = tqdm(position=3)
pool = Pool(16, initializer=init_lock, initargs=(rate_lock,))
if not data_file.exists():
with open(data_file, 'w', newline='') as dfile:
writer = csv.DictWriter(dfile, [field.name for field in fields(Results)])
writer.writeheader()
with open(data_file, 'a', newline='') as dfile:
writer = csv.DictWriter(dfile, [field.name for field in fields(Results)])
for result in pool.imap_unordered(process_paper, files):
all_pbar.update()
if result is None:
continue
if result.retry_after:
wait(rate_lock, result, message)
if result.valid:
good.update()
else:
bad.update()
writer.writerow(asdict(result))
finally:
all_pbar.close()
good.close()
bad.close()
clean_csv()
def clean_csv(path:Path = data_file):
"""
- remove 429s
- deduplicate rows (if identical)
- sorts by `valid` and then `file`
"""
df = pd.read_csv(path)
df = df.loc[df['status'] != 429]
df = df.drop_duplicates()
df = df.sort_values(['valid', 'file'], ignore_index=True)
out_fn = (path.parent / (path.stem + '_clean')).with_suffix('.csv')
df.to_csv(out_fn, index=False)
if __name__ == "__main__":
main()
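The failure table above is just the cleaned CSV filtered to non-resolving rows; a minimal sketch of that step (the function name and sample path are mine, the column names are those of the `Results` dataclass):

```python
import pandas as pd

def failing_rows(csv_path: str = "joss_archive_links_clean.csv") -> pd.DataFrame:
    """Return the rows whose archive link did not come back with HTTP 200."""
    df = pd.read_csv(csv_path)
    # `valid` is written as True/False by the script; keep the reporting columns.
    return df.loc[~df["valid"].astype(bool), ["file", "archive", "status"]]
```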
Closing this issue, as the PDFs and metadata for the three papers have been corrected and re-deposited.
Quick update:

- `10.21105.joss.05883.crossref.xml`: Now fixed.
- `10.21105.joss.05395.crossref.xml`: Is a 410, which looks to be some kind of "User was blocked" thing. I'm not sure what to do about this one.
- `10.21105.joss.04684.crossref.xml`: Looks like there was some kind of error with the reaccept compilation here. @xuanxu, any ideas what is going on there?
- `10.21105.joss.04439.crossref.xml`: Looks like the paper is missing. I've asked the author to re-add it: openjournals/joss-reviews#4439 (comment)
- `10.21105.joss.02314.crossref.xml`: Seems to resolve for me now?
- `10.21105.joss.00971.crossref.xml`: Looks like it's missing from the PDF and the XML files? This probably needs manual handling.
- `10.21105.joss.00612.crossref.xml`: Same issue as `10.21105.joss.00971.crossref.xml`. It's missing from the paper and the Crossref XML but the DOI is resolving.
- `10.21105.joss.00040.crossref.xml`: I think we should report this to Zenodo as an issue.