Skew/Drift Calculation is unclear.
The-Grand-Zand opened this issue
Description of issue (what needs changing):
I have been receiving odd skew results even though the data has a normal standard deviation as well as similar mean, median, and missing-value counts. From what I have read, the skew is calculated using the L-infinity norm and the Jensen-Shannon divergence. I have tried to use these two equations to manually calculate the skew/drift of specific features and reproduce the answer TFDV returns, to no avail.
Would it be possible to explain, with an example, what variables go into calculating the skew/drift? This would greatly help my understanding of, and trust in, TFDV for data validation and modeling pipelines.
Current Links to skew/drift explanations
Here are the links I have found with descriptions of the calculation. Unfortunately, none give the granularity that I am looking for.
- https://github.com/tensorflow/tfx/blob/cf04e1c99e7faf644bc91e498cf5cc58ea044eae/docs/tutorials/data_validation/tfdv_basic.ipynb
- https://www.tensorflow.org/tfx/guide/tfdv
- https://stackoverflow.com/questions/59528660/understanding-l-infinity-norm-which-is-used-in-tfdv
[Screenshot: visual example of the odd skew calculation]
Hello @The-Grand-Zand,
I am sharing what I have found about the skew/drift calculation in TFDV.
I tried to implement the L-infinity and Jensen-Shannon divergence calculations manually, and succeeded in implementing them in Python.
The metrics calculation logic is implemented in C++; you can refer to https://github.com/tensorflow/data-validation/blob/9fbc050580fb2433f7fbd6276bbf5b0e654787d1/tensorflow_data_validation/anomalies/metrics.cc.
The main source of confusion was aligning two different histograms and distributing values during that alignment. The details are below: I describe the input variables and the full logic, including the preprocessing steps.
The full Python implementation is available in my repository:
https://github.com/jinwoo1990/mlops-with-tensorflow/blob/main/tfdv/tfdv_skew_metrics.ipynb
L-Infinity Distance for categorical features
Input
features[feature_idx].string_stats.rank_histogram
- A collection of (label, sample_count) buckets
- e.g. buckets { label: "Private", sample_count: 18117.0 } { ... }
features[feature_idx].string_stats.common_stats.num_non_missing
- The value used to normalize the counts
Logic
- Normalize the rank_histogram with each feature's num_non_missing value
- sample_count / num_non_missing
def get_normalized_label_list_counts(datasets, feature_idx):
    label_len = len(datasets.features[feature_idx].string_stats.rank_histogram.buckets)
    feature_num_non_missing = datasets.features[feature_idx].string_stats.common_stats.num_non_missing
    label_list = []
    normalized_count_dict = {}
    # normalize each label's count by num_non_missing
    for i in range(label_len):
        label = datasets.features[feature_idx].string_stats.rank_histogram.buckets[i].label
        count = datasets.features[feature_idx].string_stats.rank_histogram.buckets[i].sample_count
        label_list.append(label)
        normalized_count_dict[label] = count / feature_num_non_missing
    return label_list, normalized_count_dict
- Union the two rank_histograms
  - This aligns the lengths of histogram_1 and histogram_2
- Calculate the normalized difference between histogram_1 and histogram_2
  - A label that was not present in one of the original histograms gets a normalized value of 0
- Find the max among the normalized differences between the two histograms (this equals the L-infinity distance)
def calculate_l_infinity_dist(current_stats, target_stats, dataset_idx, feature_idx):
    current_datasets = current_stats.datasets[dataset_idx]
    target_datasets = target_stats.datasets[dataset_idx]
    current_label_list, current_norm_count_dict = get_normalized_label_list_counts(current_datasets, feature_idx)
    target_label_list, target_norm_count_dict = get_normalized_label_list_counts(target_datasets, feature_idx)
    # each dataset may contain different labels, so take the union
    union_label_list = set(current_label_list) | set(target_label_list)
    # compute the absolute differences over the union;
    # a label missing from one histogram contributes 0 for that side
    res_dict = {}
    for label in union_label_list:
        res_dict[label] = abs(current_norm_count_dict.get(label, 0) - target_norm_count_dict.get(label, 0))
    return max(res_dict.values())
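To sanity-check this, here is a minimal usage sketch (not from the notebook). It assumes the statistics come from tfdv.generate_statistics_from_dataframe, that there is a single unsliced dataset (dataset_idx=0), and that the categorical column lands at feature index 0; with these counts the expected distance is |2/3 - 1/3| ≈ 0.333.

import pandas as pd
import tensorflow_data_validation as tfdv

train_df = pd.DataFrame({'workclass': ['Private', 'Private', 'Self-emp']})
serving_df = pd.DataFrame({'workclass': ['Private', 'Self-emp', 'Self-emp']})

train_stats = tfdv.generate_statistics_from_dataframe(train_df)
serving_stats = tfdv.generate_statistics_from_dataframe(serving_df)

# dataset_idx=0: the single unsliced dataset; feature_idx=0: 'workclass'
print(calculate_l_infinity_dist(train_stats, serving_stats, 0, 0))  # ~0.3333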
Original TFDV source code: data-validation/tensorflow_data_validation/anomalies/metrics.cc, lines 40 to 50 at commit 7ee1514
Jensen-Shannon Divergence for numeric features
Input
features[feature_index].num_stats.histogram
- A collection of (low_value, high_value, sample_count) buckets
- e.g. buckets { low_value: 17.0, high_value: 24.3, sample_count: 4435.9744 } { ... }
- The original length is 10.
- Two histograms are likely to have different low_value and high_value boundaries, so they usually need to be reshaped over the union of their boundaries.
features[feature_index].num_stats.common_stats.num_missing
- Used to decide whether to add a NaN-value bucket
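As a side note, TFDV statistics come back as protos, while the functions below work on plain dicts. Here is a rough converter sketch (histogram_proto_to_dict is my own name, not part of the notebook); it assumes the first entry of the repeated histograms field is the standard equal-width histogram used for the comparison.

def histogram_proto_to_dict(feature_stats):
    # first entry of the repeated histograms field (assumed STANDARD type)
    hist = feature_stats.num_stats.histograms[0]
    return {
        'buckets': [{'low_value': b.low_value,
                     'high_value': b.high_value,
                     'sample_count': b.sample_count} for b in hist.buckets],
        # used later to decide whether to add a NaN-value bucket
        'num_nan': feature_stats.num_stats.common_stats.num_missing,
    }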
Logic
- Get the union of the two histograms' boundaries using low_value and high_value.
  - Each low_value and high_value is treated as a boundary.
  - If the two histograms have different boundaries, the boundaries need to be unioned so that the buckets can be compared.
- e.g.
[10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
union[15, 25, 35, 45, 55, 65, 75, 85, 95, 105]
-->[10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105]
def get_union_boundaries(current_histogram, target_histogram):
    boundaries = set()
    for bucket in current_histogram['buckets']:
        boundaries.add(bucket['low_value'])
        boundaries.add(bucket['high_value'])
    for bucket in target_histogram['buckets']:
        boundaries.add(bucket['low_value'])
        boundaries.add(bucket['high_value'])
    boundaries = sorted(boundaries)
    return boundaries
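A smaller version of the example above, run as a quick check on toy two-bucket histograms (in the dict format used throughout):

hist_a = {'buckets': [{'low_value': 10, 'high_value': 20, 'sample_count': 5},
                      {'low_value': 20, 'high_value': 30, 'sample_count': 5}]}
hist_b = {'buckets': [{'low_value': 15, 'high_value': 25, 'sample_count': 5},
                      {'low_value': 25, 'high_value': 35, 'sample_count': 5}]}
print(get_union_boundaries(hist_a, hist_b))  # [10, 15, 20, 25, 30, 35]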
- Add values to the newly created buckets (rebucketing).
  - Fill in empty buckets up to the first bucket of the existing histogram.
  - If an existing bucket has to be split by a new boundary, distribute its values assuming a uniform distribution:
    {low_value: 10, high_value: 20, sample_count: 100}
    -> {low_value: 10, high_value: 15, sample_count: 50}, {low_value: 15, high_value: 20, sample_count: 50}
  - Fill in empty buckets after the last bucket of the existing histogram.
def add_buckets_to_histogram(bucket_boundaries,
                             total_sample_count,
                             total_range_covered,
                             histogram):
    """
    e.g.
    # original bucket values
    low_value: 19214.0, high_value: 165763.1, sample_count: 4158.425273604061
    # boundaries added from the comparison histogram (the new 156600.0 boundary
    # in the middle means this bucket has to be split in two)
    [19214.0, 156600.0, 165763.1]
    # assuming a uniform distribution, distribute sample_count in proportion
    # to (high_value - low_value) / covered_range
    # covered_range: 165763.1 - 19214.0
    {'low_value': 19214.0, 'high_value': 156600.0, 'sample_count': 3898.4163985951977}
    {'low_value': 156600.0, 'high_value': 165763.1, 'sample_count': 260.0088750088632}
    """
    num_new_buckets = len(bucket_boundaries) - 1  # there is one more boundary than buckets
    for i in range(num_new_buckets):
        new_bucket = {}
        new_bucket['low_value'] = bucket_boundaries[i]
        new_bucket['high_value'] = bucket_boundaries[i + 1]
        # split the count assuming a uniform distribution
        new_bucket['sample_count'] = ((bucket_boundaries[i + 1] - bucket_boundaries[i]) / total_range_covered) * total_sample_count
        histogram['buckets'].append(new_bucket)
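Running the docstring example standalone reproduces the uniform split:

hist = {'buckets': []}
add_buckets_to_histogram([19214.0, 156600.0, 165763.1],
                         total_sample_count=4158.425273604061,
                         total_range_covered=165763.1 - 19214.0,
                         histogram=hist)
for b in hist['buckets']:
    print(b)
# {'low_value': 19214.0, 'high_value': 156600.0, 'sample_count': 3898.416...}
# {'low_value': 156600.0, 'high_value': 165763.1, 'sample_count': 260.008...}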
def rebucket_histogram(boundaries, histogram):
    rebuck_hist = {}
    rebuck_hist['buckets'] = []
    rebuck_hist['num_nan'] = histogram['num_nan']
    index = 0
    max_index = len(boundaries) - 1
    for bucket in histogram['buckets']:
        low_value = bucket['low_value']
        high_value = bucket['high_value']
        sample_count = bucket['sample_count']
        # fill buckets below this bucket's original range with 0
        while low_value > boundaries[index]:
            new_bucket = {}
            new_bucket['low_value'] = boundaries[index]
            index += 1
            new_bucket['high_value'] = boundaries[index]
            new_bucket['sample_count'] = 0.0
            rebuck_hist['buckets'].append(new_bucket)
        # extra edge-case handling, disabled for now
        # if low_value == high_value and low_value == boundaries[index]:
        #     new_bucket = {}
        #     new_bucket['low_value'] = boundaries[index]
        #     index += 1
        #     new_bucket['high_value'] = boundaries[index]
        #     new_bucket['sample_count'] = sample_count
        #     rebuck_hist.append(new_bucket)
        #     continue
        # collect the boundaries this bucket has to be split across
        covered_boundaries = []
        while high_value > boundaries[index]:
            covered_boundaries.append(boundaries[index])
            index += 1
        covered_boundaries.append(boundaries[index])  # include the closing boundary (covers the equal-value case)
        if len(covered_boundaries) > 0:
            add_buckets_to_histogram(covered_boundaries, sample_count, high_value - low_value, rebuck_hist)
    # fill buckets beyond this histogram's original range with 0
    for i in range(index, max_index):
        new_bucket = {}
        new_bucket['low_value'] = boundaries[i]
        new_bucket['high_value'] = boundaries[i + 1]
        new_bucket['sample_count'] = 0.0
        rebuck_hist['buckets'].append(new_bucket)
    return rebuck_hist
def align_histogram(current_histogram, target_histogram):
    boundaries = get_union_boundaries(current_histogram, target_histogram)
    current_histogram_rebucketted = rebucket_histogram(boundaries, current_histogram)
    target_histogram_rebucketted = rebucket_histogram(boundaries, target_histogram)
    return current_histogram_rebucketted, target_histogram_rebucketted
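Continuing the toy histograms from the get_union_boundaries check above (rebucket_histogram reads a num_nan key, so it has to be added first):

hist_a['num_nan'] = 0
hist_b['num_nan'] = 0
aligned_a, aligned_b = align_histogram(hist_a, hist_b)
# Both histograms now have five buckets over [10, 35]: hist_a's (10, 20)
# bucket is split at the new 15 boundary into two buckets of 2.5 samples
# each, and ranges a histogram never covered, e.g. (30, 35) for hist_a,
# become zero-count buckets.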
- Normalize the rebucketted histograms' sample_counts so they sum to 1.0 (a sketch of this step follows the divergence code below).
- Calculate the result with the Jensen-Shannon divergence formula, summing the per-bucket terms:
  - Kullback-Leibler divergence: Dkl(p||q) = E[log(pi) - log(qi)] = sum(pi * log2(pi/qi)) (base-2 log, matching the code below)
  - Jensen-Shannon divergence: JSD(p||q) = 1/2 * Dkl(p||m) + 1/2 * Dkl(q||m), where m = (p+q)/2
- Implementation of both divergences:

import numpy as np
from math import log2

# calculate the KL divergence; buckets where either probability is 0 are skipped
def kl_divergence(p, q):
    res = 0.0
    for i in range(len(p)):
        if p[i] > 0 and q[i] > 0:
            res += p[i] * log2(p[i] / q[i])
    return res

# calculate the JS divergence (p and q must be numpy arrays so that p + q works)
def js_divergence(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
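The normalization step mentioned above is not shown in the snippet, so here is a minimal sketch of it plus an end-to-end call on the toy histograms; normalize_histogram_counts is my own helper name, and appending num_nan as an extra bucket is my reading of how the C++ code accounts for NaNs.

def normalize_histogram_counts(histogram):
    # sample_count vector plus one extra bucket for NaNs, scaled to sum to 1.0
    counts = [b['sample_count'] for b in histogram['buckets']]
    counts.append(histogram.get('num_nan', 0))  # NaN bucket (my assumption)
    return np.array(counts, dtype=float) / sum(counts)

p = normalize_histogram_counts(aligned_a)  # from the align_histogram example
q = normalize_histogram_counts(aligned_b)
print(js_divergence(p, q))  # 0.25 for the toy histograms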
Original TFDV source code: data-validation/tensorflow_data_validation/anomalies/metrics.cc, lines 248 to 256 at commit 7ee1514
Thank you so much for this info! This is exactly what I was looking for. :)