geoschem / GCHP

The "superproject" wrapper repository for GCHP, the high-performance instance of the GEOS-Chem chemical-transport model.

Home Page: https://gchp.readthedocs.io


Use of bashdatacatalog to find invalid files

cbutenhoff opened this issue · comments

Your name

Chris Butenhoff

Your affiliation

Portland State University

Please provide a clear and concise description of your question or discussion topic.

I recently used bashdatacatalog to download input files for GCHP v14.3.1 for a multi-year simulation. The download took a couple of days, and when it finished I noticed that some (many?) of the file names were corrupted.

For example, the file named HEMCO/OFFLINE_BIOVOC/v2021-12/0.5x0.625/2006/01/biovoc_05.20060103.nc is actually biovoc_05.20060102.nc according to its netCDF header info; the file named MERRA2.20070101.I3.05x0625.nc4 is actually MERRA2.20070101.A3mstE.05x0625.nc4, and so on.

I believe this happened because I used the parallel option (xargs -P with curl) to download the files, and some communication or timing error occurred.

I would rather not download all the files again. I noticed that bashdatacatalog-list has the -w option to identify files with incorrect checksums. I tried it on files that I know are invalid, but bashdatacatalog-list did not flag them.

Here is my usage to find the corrupt biovoc files:

> bashdatacatalog-list -aw -p "OFFLINE_BIOVOC/v2021-12/0.5x0.625/2006/01" InputDataCatalogs/**/*.csv

I ran it in my ExtData directory, as I did when I downloaded the files. I also tried the pattern "biovoc", but it didn't return any file names either.

I don't know much about how checksums work. If a file is intact but has the wrong file name, would the checksum still match?

Thanks for any help you can provide.

As a follow-up, I was able to write a Python script that renamed the MERRA2 files based on the real file name listed under global attributes in the netCDF metadata. Unfortunately, the metadata in the HEMCO netCDF files does not provide the file name in a consistent format so renaming these files will be more difficult.
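One practical wrinkle with this kind of repair is that the wrong names can be swapped between files, so a direct rename may overwrite a file that has not been fixed yet. The sketch below is not the script mentioned above; it is a minimal illustration of a two-phase rename that stages every file under a temporary name first. The function `apply_renames` and its `renames` mapping (current name to correct name, however that was derived from the metadata) are hypothetical.

```python
import os
import uuid
from pathlib import Path


def apply_renames(directory, renames):
    """Rename files inside one directory, tolerating swapped names.

    renames: dict of current file name -> correct file name. This mapping
    is a hypothetical input, e.g. built from netCDF global attributes.
    """
    directory = Path(directory)
    targets = list(renames.values())
    if len(set(targets)) != len(targets):
        raise ValueError("two files map to the same target name")

    # A target that already exists must itself be scheduled to move,
    # otherwise the rename would overwrite an unrelated file.
    for target in targets:
        if (directory / target).exists() and target not in renames:
            raise FileExistsError(target)

    # Phase 1: move every file that needs a new name to a unique temp name.
    staged = {}
    for current, target in renames.items():
        if current == target:
            continue
        temp = directory / f".rename-{uuid.uuid4().hex}.tmp"
        os.rename(directory / current, temp)
        staged[temp] = directory / target

    # Phase 2: move the temporaries to their final names.
    for temp, final in staged.items():
        os.rename(temp, final)
```

Running this on a copy of one small directory first is a cheap way to check the mapping before touching the full archive.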

Thanks for pointing this out @cbutenhoff. I haven't encountered this issue with xargs -P curl before. Could you let us know how many streams you used to download the data?

We use MD5 checksums, which only verify the content of the file, not the file name.
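Because the digest depends only on the bytes, an intact file with the wrong name can in principle be matched back to the catalog entry whose checksum it carries. The following is a rough sketch of that idea, not a feature of bashdatacatalog: `find_misnamed` is a hypothetical helper, and the `expected` mapping (relative path to MD5 hex digest) is assumed to have been built by the reader from the catalog CSVs, whose column layout is not shown here.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_misnamed(root, expected):
    """Match files under root to catalog paths by content checksum.

    expected: dict of relative path -> MD5 hex digest (assumed input).
    Returns {current relative path: catalog relative path} for files whose
    content matches exactly one catalog entry with a different path.
    """
    by_digest = {}
    for rel_path, digest in expected.items():
        by_digest.setdefault(digest.lower(), []).append(rel_path)

    misnamed = {}
    root = Path(root)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        candidates = by_digest.get(md5_of(path), [])
        # Report only unambiguous matches that differ from the current name.
        if len(candidates) == 1 and candidates[0] != rel:
            misnamed[rel] = candidates[0]
    return misnamed
```

Files whose digest matches no catalog entry are left out of the result; those are the ones whose content is actually damaged and would need to be downloaded again.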

Unfortunately, the metadata format is different across collections as they are from different sources. Perhaps you can try extracting the key information with regular expressions.

Thanks @yidant. At different times I used 4 and 8 streams. I'm not positive the parallel download caused the problem, but files I downloaded using 'wget' seem fine.

In some (most?) of the HEMCO nc files, there is a 'history' attribute that contains the actual file name, though it's not consistently located. I'm trying to do some checks based on this.
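As a minimal sketch of the regular-expression approach suggested earlier in the thread, the snippet below pulls every netCDF-looking file name out of a 'history' string. It assumes the string has already been read from the file with whatever netCDF tool is at hand, and that the name appears as a token ending in .nc or .nc4; both the helper `filenames_in_history` and the pattern are assumptions, not a description of how any particular HEMCO collection writes its history.

```python
import re

# Assumed pattern: a path-like token ending in .nc or .nc4.
_NC_NAME = re.compile(r"[\w.\-/]+\.nc4?\b")


def filenames_in_history(history):
    """Return netCDF-looking base names found in a history string, in order."""
    names = []
    for match in _NC_NAME.finditer(history):
        # Drop any directory part and keep only the file name itself.
        base = match.group(0).rsplit("/", 1)[-1]
        if base not in names:
            names.append(base)
    return names
```

A history string can mention several files (inputs as well as outputs), so the caller still has to decide which candidate is the file's own name, for example by comparing against the naming template of the collection.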

In the end I'll probably spend more time trying to rename corrupt filenames than I would have spent just redownloading all the input data :).

This comment may be better placed as its own issue, but I noticed that the data catalog for GCHP v14.3.0 appears to incorrectly include 2022 data in the GFED4/v2023-03/2023 folder:

./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202201.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202202.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202203.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202204.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202205.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202206.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202207.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202208.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202209.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202210.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202211.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202212.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202201.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202202.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202203.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202204.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202205.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202206.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202207.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202208.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202209.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202210.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202211.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202212.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202201.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202202.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202203.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202204.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202205.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202206.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202207.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202208.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202209.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202210.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202211.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202212.nc
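A listing like this can be checked mechanically by comparing the year in each file name with the year of its parent folder. The sketch below assumes the file name ends in .YYYYMM.nc (or .nc4) and that the parent folder name is a four-digit year; `year_mismatches` is a hypothetical helper, not part of bashdatacatalog.

```python
import re
from pathlib import PurePosixPath

# Assumed naming: "<prefix>.YYYYMM.nc" or "<prefix>.YYYYMM.nc4".
_YYYYMM = re.compile(r"\.(\d{4})(\d{2})\.nc4?$")


def year_mismatches(paths):
    """Return (path, folder_year, file_year) where the two years differ."""
    mismatches = []
    for raw in paths:
        path = PurePosixPath(raw)
        folder = path.parent.name
        match = _YYYYMM.search(path.name)
        # Skip paths whose folder is not a year or whose name has no date.
        if match and re.fullmatch(r"\d{4}", folder) and match.group(1) != folder:
            mismatches.append((raw, folder, match.group(1)))
    return mismatches
```

Feeding it the output of a find command over ExtData/HEMCO (one path per line) would flag every entry in the list above.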