olgaliak / active-learning-detect

Active learning + object detection

Early training bias due to random sampling with uneven # from sites or cameras

abfleishman opened this issue

Another thought I have been pondering with the active-learning pipeline: how do you avoid biasing your detector toward the types of images that were labeled first? For instance, say I have 10 cameras, and over a 1-year deployment each camera has taken a different number of images, anywhere from 1,000 to 100,000. If you label 100 randomly selected images to start with, the majority will come from the camera that took the most images, and perhaps the background in that camera is distinct. If you train a model on those initial 100 images, it may be highly biased toward detecting things in images from that camera (because of some characteristic of those images). Images from the other cameras might not get any detections at all, and so might never get "served" to the person tagging.
Essentially I see this as the same idea as class imbalance, except the imbalance is in the raw data rather than in the labels. How does this normally get addressed in active learning, if at all?

How about down-sampling the number of images for the camera that took 100K images (versus the camera that took 1K)?
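A minimal sketch of that down-sampling idea: draw the initial labeling set stratified by camera, so each camera contributes roughly the same number of images regardless of how many it captured. This is not part of the active-learning-detect pipeline; the function name and the `(image_id, camera_id)` input format are hypothetical, chosen just to illustrate the approach.

```python
import random
from collections import defaultdict


def balanced_seed_sample(image_camera_pairs, n_total, seed=0):
    """Draw a seed set for labeling with roughly equal images per camera.

    image_camera_pairs: iterable of (image_id, camera_id) tuples.
    Cameras with fewer images than their quota contribute everything
    they have; the shortfall is redistributed among the larger cameras.
    """
    rng = random.Random(seed)
    by_camera = defaultdict(list)
    for image_id, camera_id in image_camera_pairs:
        by_camera[camera_id].append(image_id)

    # Visit the smallest cameras first so any unused quota rolls over
    # to cameras that can actually fill it.
    cameras = sorted(by_camera, key=lambda c: len(by_camera[c]))
    picked = []
    remaining = n_total
    for i, cam in enumerate(cameras):
        quota = remaining // (len(cameras) - i)  # even split of what's left
        take = min(quota, len(by_camera[cam]))
        picked.extend(rng.sample(by_camera[cam], take))
        remaining -= take

    rng.shuffle(picked)
    return picked
```

With a 1,000-image camera and a 10-image camera and a budget of 100, the small camera contributes all 10 images and the large one the remaining 90, instead of the ~99/1 split plain random sampling would give.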