EpistasisLab / pmlb

PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms.

Home Page: https://epistasislab.github.io/pmlb/

Question regarding feature metadata

jwehrmann opened this issue · comments

First of all, thank you for putting so much effort into sharing your great work. I really appreciate it!

I am considering using this benchmark to evaluate some neural network approaches, though I found that there is no distinction between integer features that were originally categorical information (education, relationship) and natural integers (rating, score, age, etc). The latter present sequential information, while categorical features often don't. In neural nets this particular distinction is rather important. My question is: Is it possible to retrieve the original feature dtypes? In other words, is it possible to distinguish between categorical and quantitative integers?

For instance, in the Irish dataset we have "Prestige_score:discrete" and "Type_school:discrete". Both are integers, though "Type_school" is categorical while "Prestige_score" is quantitative.

I could make use of the original datasets as well if you have them.

> First of all, thank you for putting so much effort into sharing your great work. I really appreciate it!

You're welcome!

> I am considering using this benchmark to evaluate some neural network approaches, though I found that there is no distinction between integer features that were originally categorical information (education, relationship) and natural integers (rating, score, age, etc). The latter present sequential information, while categorical features often don't. In neural nets this particular distinction is rather important. My question is: Is it possible to retrieve the original feature dtypes? In other words, is it possible to distinguish between categorical and quantitative integers?

> For instance, in the Irish dataset we have "Prestige_score:discrete" and "Type_school:discrete". Both are integers, though "Type_school" is categorical while "Prestige_score" is quantitative.

This difference is important, but at the moment we aren't capturing it. Right now we have an automated script that checks the data type naively, so we need to check manually whether a discrete variable is nominal/categorical or ordinal/quantitative. We are pushing to make this information easier to contribute on our PMLB 2.0 branch. So we don't capture that data yet, but soon it will be easy for contributors to add it to the metadata.
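To illustrate why a value-based check cannot settle this, here is a minimal sketch of what a naive type check might look like. The `naive_type` function and the toy values are hypothetical and made up for illustration; this is not the script PMLB actually runs.

```python
def naive_type(values):
    """Guess a feature type from the values alone (hypothetical heuristic)."""
    distinct = set(values)
    if len(distinct) == 2:
        return "binary"
    # Any column whose values are all whole numbers looks the same from here.
    if all(float(v).is_integer() for v in distinct):
        return "discrete"
    return "continuous"


# Toy values, invented for this example (not taken from the irish dataset).
type_school = [1, 2, 3, 1, 2]          # category codes: no meaningful order
prestige_score = [28, 50, 41, 28, 63]  # scores: order and distance matter

print(naive_type(type_school))     # discrete
print(naive_type(prestige_score))  # discrete
```

Both columns come back as "discrete", so the nominal versus ordinal distinction has to come from a person who knows what the codes mean.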

> I could make use of the original datasets as well if you have them.

Same answer here. Our schema for PMLB 2.0 will capture source information, but at the moment it isn't available.

Hope this helps

Hi @jwehrmann thank you for the feedback! I agree that feature type is important for method development.

We're working on adding details to each dataset's metadata on the PMLB 2.0 branch. We're currently putting together a robust framework so that others can contribute and specify feature types (e.g. continuous, ordinal, nominal). We would love to have your contribution once the framework is in place.

I know you would like it for all the datasets, but I believe this is the source for the irish dataset: https://www.openml.org/d/451

Update: I just saw @lacava's response. I guess this corroborates what has been said.

@lacava @trang1618 Thank you for the answer! I took a look at the referred branch, and I guess one should update the metadata.yaml to include that information, maybe something like the example below:

- name: age
  type: continuous
  description: null # optional but recommended, what the feature measures/indicates, unit
  code: null # optional, coding information, e.g., Control = 0, Case = 1
  transform: ~ # optional, any transformation performed on the feature, e.g., log scaled
  nature: ordinal 
- name: workclass
  type: discrete
  nature: categorical

I can try to help retrieve that kind of information and submit a PR whenever PMLB 2.0 is stable enough.

Hi @jwehrmann, thank you once again for raising this point, and thank you for your patience as we worked on streamlining the contribution workflow. Just a quick update that PMLB v1.0 (previously PMLB 2.0 - sorry for the confusion) has been released. If you have the bandwidth, we would love to have more contributions to make the metadata.yaml files more complete.

> Is it possible to retrieve the original feature dtypes? In other words, is it possible to distinguish between categorical and quantitative integers?

I want to reiterate that, unfortunately, we cannot perform an automatic check on this because different datasets are coded differently from different repositories. A reviewer would have to manually check if a categorical feature is nominal or ordinal. Michael Hoffman also suggested distinguishing between level of measurement (nominal/ordinal/ratio/interval) and domain (real/whole number/etc.), which is somewhat similar to your type vs. nature suggestion. We have not implemented this distinction, but if you think it's helpful for your application, perhaps open another issue (or better yet submit a PR), and we would love to incorporate your idea!
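For what it's worth, the level-of-measurement versus domain split could be modelled as two independent fields per feature. The sketch below is only an illustration of that idea: `Level`, `Domain`, and `FeatureSpec` are hypothetical names, none of this exists in PMLB, and the level assigned to each irish feature is assumed for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"


class Domain(Enum):
    REAL = "real"
    WHOLE = "whole number"


@dataclass(frozen=True)
class FeatureSpec:
    name: str
    level: Level    # what comparisons between values are meaningful
    domain: Domain  # how the values are stored

    def needs_encoding(self):
        # Only nominal features carry no order, so only they need
        # one-hot encoding or an embedding before entering a network.
        return self.level is Level.NOMINAL


# Both features share a domain but differ in level (levels assumed here).
specs = [
    FeatureSpec("Type_school", Level.NOMINAL, Domain.WHOLE),
    FeatureSpec("Prestige_score", Level.INTERVAL, Domain.WHOLE),
]

print([s.name for s in specs if s.needs_encoding()])  # ['Type_school']
```

Keeping the two fields separate means a whole-number column can be marked nominal or interval without changing how it is stored.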

Regarding specifically the irish dataset, its metadata has been reviewed by @daniel0710goldberg.

I'll close the issue for now, but please feel free to reopen/submit related (and unrelated) PRs.

A possible way to do this could be something like:

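import urllib.request

import pandas as pd
import yaml  # PyYAML

# Check the metadata and apply a pandas dtype that supports NaN:
#   categorical/binary -> category whose labels are strings of integers
#   ordinal            -> object column holding Python ints
#   continuous         -> unchanged; float64 supports NaN by default

# floatToStrCol and floatToIntCol were not shown in the original post;
# the two definitions below are assumed implementations.
def floatToStrCol(col):
  # 1.0 -> '1'; missing values are left as NaN
  return pd.Series([str(int(v)) if pd.notna(v) else v for v in col],
                   index=col.index, dtype='object')

def floatToIntCol(col):
  # 1.0 -> 1; missing values are left as NaN
  return pd.Series([int(v) if pd.notna(v) else v for v in col],
                   index=col.index, dtype='object')

def applyDType(df, dfName):
  url = 'https://raw.githubusercontent.com/EpistasisLab/pmlb/master/datasets/'
  url = url + dfName + '/metadata.yaml'
  with urllib.request.urlopen(url) as dsmd:
    # safe_load avoids the Loader argument that yaml.load requires in PyYAML 6+
    dsyl = yaml.safe_load(dsmd)['features']
  ftypes = {f['name']: f['type'] for f in dsyl}
  df = df.copy(deep=True)  # work on a copy so the caller's frame is untouched
  for c in df.columns:
    ft = ftypes.get(c)
    if ft in ('categorical', 'binary'):
      df[c] = floatToStrCol(df[c]).astype('category')
    elif ft == 'ordinal':
      df[c] = floatToIntCol(df[c])
  return df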

I know that fetch_data has the default dropna=True, but in case np.nan values are present you can do something like this and then apply an encoder depending on the dtype. Maybe someone from the Epistasis team can vet this. Here the parsed YAML metadata is pulled in via an external URL, but that might not be necessary if something similar gets incorporated into the package.
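The "apply an encoder depending on the dtype" step could look roughly like the sketch below. It assumes the frame already has its nominal columns stored with the pandas category dtype; `encode_by_dtype` is a hypothetical helper and the toy values are made up for illustration.

```python
import pandas as pd


def encode_by_dtype(df):
    """One-hot encode category columns and pass every other column through."""
    cat_cols = list(df.select_dtypes(include="category").columns)
    # dummy_na=True adds an indicator column for missing values, so rows
    # with NaN in a categorical feature are not silently encoded as all zeros.
    return pd.get_dummies(df, columns=cat_cols, dummy_na=True)


# Toy frame: one nominal column (category dtype) and one numeric column.
df = pd.DataFrame({
    "Type_school": pd.Series(["1", "2", None, "1"], dtype="category"),
    "Prestige_score": [28.0, 50.0, 41.0, float("nan")],
})

out = encode_by_dtype(df)
print(list(out.columns))
```

The numeric column is left as float64 with its NaN intact, so imputation or scaling for quantitative features can be handled as a separate step.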