FR: get columns from label file easily

Question

michaelaye opened this issue 8 years ago · comments

I have repeatedly run into the task of parsing the columns out of a PDS cumulative index label file, so that pandas gets the right columns when parsing the TAB file.
Hence I believe it could be a nice, easy little addition (maybe in a submodule "utils") to collect helpers like that in the pvl module. I believe (CMIIW) that the structures of the PDS index files and the cumulative index files are stable enough to justify a common helper function for this.

Ross Beyer commented 4 years ago

Perfect!

Trevor Olson · Answer 1 · Sat Aug 13 2016 04:40:45 GMT+0800 (China Standard Time)

Hey @michaelaye. I've done some work on cube file tables before. Is this similar to a PDS cumulative index label file? Also, do you have an example image we could play with?

Austin Godber · Answer 2 · Mon Aug 15 2016 00:41:20 GMT+0800 (China Standard Time)

I think the examples would be like the EDR_CMDX.LBL and EDR_CMDX.TAB files shown here:

http://pds-imaging.jpl.nasa.gov/data/msl/MSLMST_0002/INDEX/
http://pds-imaging.jpl.nasa.gov/data/msl/MSLMST_0003/INDEX/
http://pds-imaging.jpl.nasa.gov/data/msl/MSLMST_0004/INDEX/
...

I don't think these really belong in the PVL module though. The way these files work is that the .LBL is a pvl syntax description of the .TAB file which is just CSV. So, in this case the low level IO routines would be csv and pvl, with a higher level PDS archive object representing those indexes. Keep in mind these are very common representations of mission and instrument specific indices. So every PDS archive has an incremental and cumulative index.

I think this would be more of a pds_archive module. That module could store the appropriate base URLs and use the same key words to access mission/instrument specific archives. It could grab indexes, and from those find and grab products. Though to do this properly the indexes should all be in some backend and exposed as an API. There are things that try to do this, like http://pds-imaging.jpl.nasa.gov/search/ and I think I recall seeing some group that made a modern API backend. Actually, @michaelaye, the fact that you're even asking for this is an indication of the PDS' biggest weakness. I should have participated in their review last year or done a PDART or something.

@michaelaye probably understands these details, I am just giving context for others who may read the ticket.

Michael Aye · Answer 3 · Wed Oct 12 2016 02:13:49 GMT+0800 (China Standard Time)

Unfortunately, the .TAB files are NOT just CSV, because of multiple value arrays inside ONE BLOODY COLUMN, grrr.. which breaks simple pandas parsing. I have finished code to parse the colspecs to read in 'normal' columns, similar to @wtolson 's code for cube file tables, but still am working on dealing with these multi-value thingies. :(

See the example column named "EXPECTED_MAXIMUM" in the attached label and example data file. (I had to append .txt to the file names to allow the upload.)

index_head.tab.txt
index.lbl.txt
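
For readers following along, here is a minimal sketch of the colspecs step mentioned above. It is not the code from this thread. It assumes each column description in the label carries the PDS3 keywords START_BYTE (1-based) and BYTES, and that the descriptions have already been pulled out of the label into plain dicts. The function name label_to_colspecs and the sample values are hypothetical.

```python
def label_to_colspecs(columns):
    """Turn 1-based START_BYTE/BYTES pairs into 0-based, half-open
    (start, end) tuples, the form pd.read_fwf expects for colspecs."""
    colspecs = []
    for col in columns:
        start = int(col["START_BYTE"]) - 1  # label counts bytes from 1
        colspecs.append((start, start + int(col["BYTES"])))
    return colspecs


# Hypothetical column descriptions; real ones would come from the .LBL file.
columns = [
    {"NAME": "VOLUME_ID", "START_BYTE": 2, "BYTES": 11},
    {"NAME": "PRODUCT_ID", "START_BYTE": 16, "BYTES": 7},
]
colspecs = label_to_colspecs(columns)

# Made-up table line, sliced by hand to show what each colspec selects.
line = '"MSLMST_0002","P01_001"'
fields = [line[start:end] for start, end in colspecs]
```

The resulting list could then be passed as colspecs= to pd.read_fwf, together with the column names. Multi-value columns still need separate handling, since a single START_BYTE/BYTES pair spans all of their items.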

Michael Aye · Answer 4 · Wed Oct 12 2016 02:16:03 GMT+0800 (China Standard Time)

Probably the group with a modern API you are thinking of is the PDS rings and moons archive, that has an open JSON URL based API to their database:
http://tools.pds-rings.seti.org/opus/api/#/view=search&browse=gallery&colls_browse=gallery&page=1&gallery_data_viewer=true&limit=100&order=time1&cols=&widgets=planet,target&widgets2=&detail=

Michael Aye · Answer 5 · Wed Oct 12 2016 07:47:25 GMT+0800 (China Standard Time)

Correction: you can use pd.read_csv(), but the multi-item columns will automatically be read in split up into separate fields, so you need to build the column names accordingly.
For example, in the attached example the "FILTER_NAME" field has 2 entries and will be split, so I am currently creating FILTER_NAME_1 and FILTER_NAME_2 automatically from the label file.
With the column names filled in like that, I can then use pd.read_csv (which, interestingly, is roughly 3 times faster than pd.read_fwf) to read the index table in.
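
A minimal sketch of the name expansion described here, for illustration only and not the author's actual code. It assumes the column descriptions are available as plain dicts with the PDS3 keywords NAME and, for multi-item columns, ITEMS. The function name expand_column_names and the sample row are hypothetical. The standard-library csv module stands in for pandas so the example has no dependencies.

```python
import csv
import io


def expand_column_names(columns):
    """Return one name per CSV field, numbering the entries of
    multi-item columns as NAME_1, NAME_2, ..."""
    names = []
    for col in columns:
        n_items = int(col.get("ITEMS", 1))
        if n_items > 1:
            names.extend(f"{col['NAME']}_{i}" for i in range(1, n_items + 1))
        else:
            names.append(col["NAME"])
    return names


# Hypothetical column descriptions; real ones would come from the .LBL file.
columns = [
    {"NAME": "PRODUCT_ID"},
    {"NAME": "FILTER_NAME", "ITEMS": 2},
    {"NAME": "EXPOSURE_DURATION"},
]
names = expand_column_names(columns)

# Made-up table row; the two FILTER_NAME items arrive as separate CSV fields.
tab_text = '"P01_001","RED","BG",0.25\n'
rows = list(csv.DictReader(io.StringIO(tab_text), fieldnames=names))
```

With pandas, the same list would be passed as pd.read_csv(path, names=names, header=None), since the TAB file itself has no header row.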

Michael Aye · Answer 6 · Wed Oct 12 2016 07:50:08 GMT+0800 (China Standard Time)

To stay on-topic, I'm still unsure if the capability of dealing with multiple items in one column should not be a functionality in pvl, while I agree that reading index tables in should be a pds_tools or pds_archive package/module.

Ross Beyer · Answer 7 · Fri Feb 28 2020 10:28:33 GMT+0800 (China Standard Time)

I agree with @godber: this feature that you're after is a 'PDS thing', not a 'PVL thing', and isn't appropriate for this pvl library. Sure, I appreciate that you often encounter this issue when you are also dealing with PVL text, but this pvl library is about being able to correctly parse data out of the PVL text. What you do with it afterwards is a different problem.

Maybe this kind of capability would build on pvl but might properly live in some Python PDS library?

Since this issue is so old, I suspect that you've solved it some other way, so I'm going to close this Issue. Reopen if you still think this is a 'PVL thing.'

Michael Aye · Answer 8 · Fri Feb 28 2020 10:50:35 GMT+0800 (China Standard Time)

This has been solved in https://github.com/michaelaye/planetarypy/blob/master/planetarypy/pdstools/indices.py