dputhier / libgtftk

gtftk C Library and program

Question regarding select_by_key

dputhier opened this issue

Hi,
I was wondering what the behaviour of libgtftk.select_by_key is when invert_match is set to 2. I guess it means that the function returns the record if and only if the key exists for that record. If the key exists but its value is set to ".", the record is still returned (please confirm). This is the expected behaviour, and I should implement select_by_regexp in the same way. However, I have no way to know whether a key exists for a record. The only thing I can do is extract a column (using extract_data) and check whether it contains a ".", which does not tell me whether the key exists for that record or not... This argues for a distinct encoding of the two cases when using extract_data.
We should be able to distinguish:

  • A key that exists with a value set to "." (which should obviously return ".")
  • A key that does not exist and thus has no corresponding value (could be "??" or anything else...)

On my side, I would like to be able to implement two different arguments in select_by_reg_exp, select_by_reg_key, and extract_data: no_na (controls whether I want any "." in the output) and if_key_exists (perform the test only if the key exists). This would require several modifications on the Python side. I don't know about the C side; tell me. We can discuss it next week.
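
To make the two flags concrete, here is a minimal sketch in plain Python. It does not call libgtftk: records are modelled as simple dicts, the function name keep_record is hypothetical, and I am assuming that "perform the test only if the key exists" means a record lacking the key is kept without being tested.

```python
import re


def keep_record(record, key, pattern, no_na=False, if_key_exists=False):
    """Decide whether a record passes a regexp filter on one key.

    `record` is a plain dict standing in for a GTF record (an assumption,
    not the libgtftk data model).
    """
    if key not in record:
        # Key absent: the test cannot be performed. With if_key_exists the
        # record is kept untested (assumed reading), otherwise it is dropped.
        return if_key_exists
    value = record[key]
    if no_na and value == ".":
        # Key present but undefined, and the caller wants no "." in the output.
        return False
    return re.search(pattern, value) is not None


records = [{"gene_name": "abc"}, {"gene_name": "."}, {}]
kept = [r for r in records if keep_record(r, "gene_name", "^a", if_key_exists=True)]
```

With these three records, the first is kept because it matches and the last is kept because it has no gene_name key at all; the second is dropped because "." does not match the pattern.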

Trying to be a little more explicit: my wish would be an additional argument in libgtftk.extract_data (e.g. 'explicit') to be more explicit regarding ".". If explicit is set to true (default false, for backward compatibility), then records for which the key does not exist would return something like "??" or "^$". I guess these kinds of values should be rare...

I made a very small change in extract_data to distinguish missing attributes from "." values.
Now, if an attribute is missing, its value is "?" in the results of extract_data. I think that the "explicit" parameter can be implemented on the Python side. It's probably easier than in C ...
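
A possible shape for that Python-side parameter, as a rough sketch only: it works on a flat list of strings rather than on the real extract_data output, and the function name normalize_values is made up for illustration. The assumption is that explicit=False should restore the old behaviour by turning "?" back into ".".

```python
MISSING = "?"    # marker for an absent attribute, as described in this thread
UNDEFINED = "."  # value of an attribute that exists but is not set


def normalize_values(values, explicit=False):
    """Return a copy of `values`, hiding the missing marker unless explicit."""
    if explicit:
        # Caller wants to tell missing keys apart from "." values.
        return list(values)
    # Backward-compatible mode: a missing key looks like an undefined value.
    return [UNDEFINED if v == MISSING else v for v in values]


column = ["geneA", ".", "?"]
legacy = normalize_values(column)
detailed = normalize_values(column, explicit=True)
```

One limitation of this approach: an attribute whose real value is "?" cannot be told apart from a missing one, which should only matter if such values actually occur.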

It also adds some complexity to my Python code, as extract_data returns several types of objects (dict, list, list of lists...). But I will go with Python for this anyway.
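
One way to cope with the different return shapes is a small recursive helper that applies the same substitution to every leaf. This is only a sketch: map_values is a hypothetical name, and the nested sample data is invented to mimic the dict / list / list of lists cases rather than taken from real extract_data output.

```python
def map_values(obj, func):
    """Apply `func` to every leaf of nested dicts and lists."""
    if isinstance(obj, dict):
        return {key: map_values(value, func) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        # Tuples are returned as lists to keep the helper simple.
        return [map_values(value, func) for value in obj]
    return func(obj)


def hide_missing(value):
    # Turn the "?" marker for an absent key back into ".".
    return "." if value == "?" else value


nested = {"g1": ["a", "?"], "g2": [["?", "."], ["b"]]}
flattened_markers = map_values(nested, hide_missing)
```

The helper returns new containers and leaves the input untouched, so the raw result stays available if the "?" markers are needed later.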

This feature is available starting from the 0.9.0 release. Records for which the key does not exist return "?".