[Feature Request]: enhance with EXIF data, specifically geodata and datetimeoriginal

ohade opened this issue 10 months ago

Ohad Edelstein commented 10 months ago

Feature Name

[Feature Request]: Enhance fastdup with EXIF Data Integration, Including Geodata and DateTimeOriginal

Feature Description

What does the feature do?
Integrates EXIF data (geodata and DateTimeOriginal) into fastdup, allowing for more nuanced sorting, filtering, and deduplication by recognizing the original images.
Why do you think it's important?
EXIF data provides essential context and can help identify the original image among duplicates, so that crucial metadata is preserved. It's vital for industries that require location- and time-specific insights.
How will it benefit users?
Users will gain richer insights, more accurate deduplication, and the preservation of important metadata. This will increase dataset quality, streamline data operations, and potentially reduce costs.

Contact Information [Optional]

No response

Danny Bickson · Answer 1 · Sun Aug 20 2023 01:23:54 GMT+0800 (China Standard Time)

Hi @ohade, this sounds like a good feature request to add.

Can you please point us to a few example images containing EXIF data that we can use to test the support?
In addition, what functionality would you like to have once we read the EXIF information? For example, assume two images are duplicates but their EXIF data differs. What would you like to do in this case? Would it be useful to show EXIF data in the galleries, for example for outliers?
Thanks

Ohad Edelstein · Answer 2 · Sun Aug 20 2023 05:07:05 GMT+0800 (China Standard Time)

Hi,
Regarding the Exif format: https://en.wikipedia.org/wiki/Exif
Regarding the data that can be extracted and how:
Take any picture taken on a mobile phone and run the following Python code:

from PIL import Image
from PIL.ExifTags import TAGS, GPSTAGS

GPS_IFD_TAG = 0x8825  # GPSInfo (34853): pointer to the GPS sub-directory


def get_exif_data(image_path):
    """Print every top-level EXIF tag that Pillow can name."""
    with Image.open(image_path) as img:
        for key, val in img.getexif().items():
            if key in TAGS:
                print(f"ID: {key}, TAG: {TAGS[key]}, VAL: {val}")


def get_geotagging(exif):
    """Return the GPS tags as a {tag name: value} dict."""
    if not exif:
        raise ValueError("No EXIF metadata found")

    # The top-level GPSInfo value is only an offset; get_ifd() resolves it
    # and returns an empty dict when the image has no GPS directory.
    gps_ifd = exif.get_ifd(GPS_IFD_TAG)
    if not gps_ifd:
        raise ValueError("No EXIF geotagging found")

    return {GPSTAGS[key]: val for key, val in gps_ifd.items() if key in GPSTAGS}


def get_location(image_path):
    """Print the GPS tags stored in the image."""
    # Use the public getexif() instead of the private, JPEG-only _getexif().
    with Image.open(image_path) as image:
        geotagging = get_geotagging(image.getexif())
    for key, val in geotagging.items():
        print(key, val)


image_path = ...
get_exif_data(image_path)
get_location(image_path)

For example, here is the data I extracted from a picture on my Samsung Android phone:

**get_exif_data(image_path)**
ID: 256, TAG: ImageWidth, VAL: 4000
ID: 257, TAG: ImageLength, VAL: 3000
ID: 34853, TAG: GPSInfo, VAL: 696
ID: 296, TAG: ResolutionUnit, VAL: 2
ID: 34665, TAG: ExifOffset, VAL: 238
ID: 271, TAG: Make, VAL: samsung
ID: 272, TAG: Model, VAL: SM-G998B
ID: 305, TAG: Software, VAL: G998BXXU5CVDD
ID: 274, TAG: Orientation, VAL: 6
ID: 306, TAG: DateTime, VAL: 2022:06:18 10:56:28
ID: 531, TAG: YCbCrPositioning, VAL: 1
ID: 282, TAG: XResolution, VAL: 72.0
ID: 283, TAG: YResolution, VAL: 72.0
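
Note that the listing above contains DateTime (tag 306) but not DateTimeOriginal. In the EXIF layout, DateTimeOriginal (tag 36867) is stored in the Exif sub-directory that ExifOffset (tag 34665) points to, so it does not appear when iterating over the top-level tags. Below is a minimal sketch of reading it, assuming a Pillow version whose Exif object provides get_ifd(). The helper name get_capture_time is made up for illustration.

```python
EXIF_IFD_TAG = 0x8769        # ExifOffset (34665): pointer to the Exif sub-directory
DATETIME_ORIGINAL = 0x9003   # DateTimeOriginal (36867), stored in the sub-directory
DATETIME = 0x0132            # DateTime (306), stored at the top level


def get_capture_time(exif):
    """Return DateTimeOriginal if present, else DateTime, else None.

    `exif` is expected to behave like the object returned by Pillow's
    Image.getexif(): a mapping of top-level tags with a get_ifd() method.
    """
    sub_ifd = exif.get_ifd(EXIF_IFD_TAG)
    return sub_ifd.get(DATETIME_ORIGINAL) or exif.get(DATETIME)


# Usage with Pillow (not executed here):
#   from PIL import Image
#   with Image.open(image_path) as img:
#       print(get_capture_time(img.getexif()))
```

Falling back to DateTime is a design choice for this sketch; editing software may rewrite DateTime, so the two values can differ.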

**get_location(image_path)**
GPSLatitudeRef N
GPSLatitude (32.0, 5.0, 32.412119)
GPSLongitudeRef E
GPSLongitude (34.0, 49.0, 3.7128)
GPSAltitudeRef 0
GPSAltitude 64.0
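
The GPS values above are (degrees, minutes, seconds) triples plus a hemisphere reference, while most tools expect signed decimal degrees. Below is a small sketch of that conversion. The function name dms_to_decimal is an assumption for illustration, and the inputs are the sample values printed above.

```python
def dms_to_decimal(dms, ref):
    """Convert an EXIF (degrees, minutes, seconds) triple to signed decimal degrees."""
    degrees, minutes, seconds = (float(part) for part in dms)
    decimal = degrees + minutes / 60 + seconds / 3600
    # Southern latitudes and western longitudes are negative by convention.
    return -decimal if ref in ("S", "W") else decimal


# Values taken from the sample output above.
latitude = dms_to_decimal((32.0, 5.0, 32.412119), "N")
longitude = dms_to_decimal((34.0, 49.0, 3.7128), "E")
print(round(latitude, 6), round(longitude, 6))
```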

So, a few use cases:

I use it to determine which copy of an image is the original. Copied pictures are usually stripped of their geodata, so the copies are easy to spot. The DateTime tag also helps to track down the original.
Knowing which picture is the original helps in understanding what the original content was and in seeing the transformations it went through over time. I am talking about a sequence of copies, each one a copy of the previous generation with small changes; together they form a sort of history path that can lead you back to the source.
Two identical images with different EXIF data probably indicate a bug in my code. EXIF data is not a fingerprint, but it is close enough.
Time and location can be strong features for users who want to cluster their images.
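
The first use case can be sketched as a ranking rule over a group of images that were already found to be duplicates: prefer copies that still carry GPS data, then the earliest capture time. This is only an illustration under those assumptions. The record fields (path, has_gps, taken_at) and the function name are hypothetical and are not part of fastdup.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ImageRecord:
    path: str
    has_gps: bool                  # True if the GPS directory survived
    taken_at: Optional[datetime]   # parsed capture time (naive), if any


def pick_probable_original(duplicates):
    """Return the record most likely to be the original of a duplicate group."""
    def rank(record):
        return (
            not record.has_gps,                 # copies with geodata sort first
            record.taken_at is None,            # then copies that have a timestamp
            record.taken_at or datetime.max,    # then the earliest timestamp
        )
    return min(duplicates, key=rank)


group = [
    ImageRecord("copy_2.jpg", False, datetime(2022, 6, 20, 9, 0, 0)),
    ImageRecord("original.jpg", True, datetime(2022, 6, 18, 10, 56, 28)),
    ImageRecord("copy_1.jpg", False, None),
]
print(pick_probable_original(group).path)  # prints "original.jpg"
```

The ordering of the criteria is a heuristic: a copy that kept its geodata could still be ranked ahead of an original whose metadata was stripped.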

Ohad Edelstein · Answer 3 · Sun Aug 20 2023 05:18:36 GMT+0800 (China Standard Time)

Also, I use the geodata together with the datetime to convert the timestamp to UTC, like so:

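from datetime import datetime, timezone

import pendulum
from timezonefinder import TimezoneFinder

# Building the finder is expensive, so create it once and reuse it.
tf = TimezoneFinder()


def fix_timestamp_using_geoDataExif(latitude, longitude, timestamp):
    # (0, 0) is treated as "no geodata": leave the timestamp untouched.
    if latitude == 0 and longitude == 0:
        return timestamp

    time_zone_str = tf.timezone_at(lat=latitude, lng=longitude)
    if not time_zone_str:
        return timestamp

    # Assumption: `timestamp` encodes the EXIF wall-clock time as if it were
    # UTC. Recover that wall-clock time, attach the zone found from the
    # coordinates, and convert to real UTC. (Passing the timestamp straight to
    # pendulum.from_timestamp(..., tz=...) would return the same instant.)
    wall_clock = datetime.fromtimestamp(timestamp, tz=timezone.utc).replace(tzinfo=None)
    local_time = pendulum.instance(wall_clock, tz=time_zone_str)
    utc_time = local_time.in_timezone('UTC')
    return int(utc_time.timestamp())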
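
The function above takes an integer timestamp, while EXIF stores the capture time as a string such as 2022:06:18 10:56:28 with no time zone. One way to bridge the two, sketched here with the standard library only, is to parse the string and encode its wall-clock fields as if they were UTC, so that a later step can correct the value using the time zone derived from the geodata. The helper name exif_datetime_to_naive_epoch is an assumption for illustration.

```python
from datetime import datetime, timezone

EXIF_DATETIME_FORMAT = "%Y:%m:%d %H:%M:%S"


def exif_datetime_to_naive_epoch(value):
    """Parse an EXIF date string and encode its wall-clock time as if it were UTC."""
    wall_clock = datetime.strptime(value, EXIF_DATETIME_FORMAT)
    return int(wall_clock.replace(tzinfo=timezone.utc).timestamp())


# The DateTime value from the sample output earlier in the thread.
print(exif_datetime_to_naive_epoch("2022:06:18 10:56:28"))  # prints 1655549788
```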