ITC-CRIB / fairly

A package to create, publish, and clone research datasets

Home Page:https://fairly.readthedocs.io

Precedence in includes and excludes

manuGil opened this issue

To include and exclude files from a dataset we are using the following operations:

import fairly

local = fairly.dataset("./test-dataset/mydataset/")

# to include:
local.includes.append("*.jpg")
# to exclude:
local.excludes.append("*.jpg")
# then we save changes to manifest.yaml
local.save()

The above results in the following in manifest.yaml:

files:
  includes:
  - ARP1_.info
  - ARP1_d01.zip
  - my_code.py
  - Survey_AI.csv
  - '*.jpg'
  excludes:
  - '*.jpg'

How does fairly manage precedence in this case? Is it based on the order in the file, or do excludes take precedence over includes?

Currently, includes take precedence over excludes. These rules are applied by the _get_files() method in dataset/local.py. For each file under the dataset folder, the method first checks the include rules. If the file matches any of them, it is added to the file list. If there is no match, the exclude rules are checked, and the file is excluded if it matches any of them.
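The order of checks described above can be sketched roughly as follows. This is not the actual _get_files() code: the function name is_selected, the use of fnmatch for pattern matching, and the default parameter (what happens when no rule matches, which is not stated here) are all assumptions made for illustration.

```python
from fnmatch import fnmatch


def is_selected(path, includes, excludes, default=True):
    """Hypothetical helper mirroring the described precedence.

    `default` is an assumption: the behaviour for files that match
    neither list is not described in this thread.
    """
    # Include rules are checked first, so a match here wins outright.
    if any(fnmatch(path, rule) for rule in includes):
        return True
    # Exclude rules are only consulted when no include rule matched.
    if any(fnmatch(path, rule) for rule in excludes):
        return False
    return default


includes = ["ARP1_.info", "ARP1_d01.zip", "my_code.py", "Survey_AI.csv", "*.jpg"]
excludes = ["*.jpg"]

# "*.jpg" appears in both lists; the include rule is checked first.
print(is_selected("photo.jpg", includes, excludes))
```

With the manifest from the question, a .jpg file would be kept under this logic, because the include match is found before the exclude rules are ever consulted.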

We can give the user some feedback if there are any conflicting include and exclude rules, or duplicate rules (e.g. *.jpg appearing twice in the includes). I think we can eliminate duplicate rules automatically, but it is better to let the user know and resolve the conflicts themselves. We can consider adding a check_rules() method for that purpose.
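One possible shape for such a check, as a sketch only: check_rules() does not exist yet, so the signature and return value below are assumptions. It only detects rules that are literally identical strings; overlapping patterns (for example *.jpg against photo.jpg) would need a different approach.

```python
from collections import Counter


def check_rules(includes, excludes):
    """Hypothetical check: report duplicate and conflicting rules.

    Returns (duplicates, conflicts), where duplicates maps each list
    name to the rules repeated in it, and conflicts lists the rules
    present in both lists.
    """
    duplicates = {
        "includes": sorted(r for r, n in Counter(includes).items() if n > 1),
        "excludes": sorted(r for r, n in Counter(excludes).items() if n > 1),
    }
    # A rule that appears verbatim in both lists is a direct conflict.
    conflicts = sorted(set(includes) & set(excludes))
    return duplicates, conflicts


def dedupe(rules):
    """Drop repeated rules while keeping the original order."""
    return list(dict.fromkeys(rules))


duplicates, conflicts = check_rules(["*.jpg", "my_code.py", "*.jpg"], ["*.jpg"])
print(duplicates)
print(conflicts)
```

Duplicates could then be removed silently with something like dedupe(), while conflicts are only reported so the user can decide which rule to keep.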