tensorflow / data-validation

Library for exploring and validating machine learning data


Skew/Drift Calculation is unclear.

The-Grand-Zand opened this issue · comments

Description of issue (what needs changing):

I have been receiving odd skew results even though the data has a normal standard deviation and similar mean, median, and missing-value counts. From what I have read, the skew is calculated using the L-infinity norm and the Jensen-Shannon divergence. I have tried to use these two equations to manually calculate the skew/drift of specific features and reproduce the value TFDV returns, to no avail.

Would it be possible to explain, with an example, which variables go into calculating the skew/drift? This would greatly help my understanding of, and trust in, TFDV for data validation and modeling pipelines.

Current Links to skew/drift explanations

Here are the links I have found with descriptions of the calculation. Unfortunately, none give the granularity that I am looking for.

Visual example of odd skew calculation

(Attached screenshots: Policy_tenure_custom_graphs, Policy_tenure_tfdv_graphs)

Hello @The-Grand-Zand ,

I am sharing what I have found about the skew/drift calculation in TFDV.
I tried to implement the L-infinity distance and Jensen-Shannon divergence calculations manually, and succeeded in reproducing them in Python.
The metric calculation logic is implemented in C++; you can refer to https://github.com/tensorflow/data-validation/blob/9fbc050580fb2433f7fbd6276bbf5b0e654787d1/tensorflow_data_validation/anomalies/metrics.cc.
The main source of confusion was aligning the two histograms and distributing values during that alignment. The details are below; I have written out the input variables and the detailed logic, including the preprocessing steps.

The full Python implementation is available in my repository:
https://github.com/jinwoo1990/mlops-with-tensorflow/blob/main/tfdv/tfdv_skew_metrics.ipynb
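
For context, TFDV only computes these metrics when a skew or drift comparator is configured in the schema. A minimal sketch of that setup, assuming a recent TFDV version; the feature names, thresholds, and DataFrames here are placeholders, not taken from this issue:

import tensorflow_data_validation as tfdv

# train_df and serving_df are hypothetical pandas DataFrames.
train_stats = tfdv.generate_statistics_from_dataframe(train_df)
serving_stats = tfdv.generate_statistics_from_dataframe(serving_df)
schema = tfdv.infer_schema(train_stats)

# Categorical features: skew is the L-infinity distance between normalized value counts.
tfdv.get_feature(schema, 'workclass').skew_comparator.infinity_norm.threshold = 0.01
# Numeric features: drift is the Jensen-Shannon divergence between standard histograms
# (available in recent schema versions).
tfdv.get_feature(schema, 'policy_tenure').drift_comparator.jensen_shannon_divergence.threshold = 0.1

# Skew is checked against serving_statistics; drift would be checked against
# previous_statistics passed to the same call.
anomalies = tfdv.validate_statistics(
    statistics=train_stats, schema=schema, serving_statistics=serving_stats)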

L-Infinity Distance for categorical feature

Input

  • features[feature_index].string_stats.rank_histogram
    • Collections of (label, sample_count) buckets
    • e.g. buckets { label: "Private", sample_count: 18117.0 } { ... }
  • features[feature_idx].string_stats.common_stats.num_non_missing
    • The value used for normalization

Logic

  • Normalize the rank_histogram by each feature's num_non_missing value
    • sample_count / num_non_missing
def get_normalized_label_list_counts(datasets, feature_idx):
    label_len = len(datasets.features[feature_idx].string_stats.rank_histogram.buckets)
    feature_num_non_missing = datasets.features[feature_idx].string_stats.common_stats.num_non_missing
    label_list = []
    normalized_count_dict = {}
    # Normalize each label's sample_count by num_non_missing
    for i in range(label_len):
        label = datasets.features[feature_idx].string_stats.rank_histogram.buckets[i].label
        count = datasets.features[feature_idx].string_stats.rank_histogram.buckets[i].sample_count
        label_list.append(label)
        normalized_count_dict[label] = count / feature_num_non_missing
    
    return label_list, normalized_count_dict
  • Take the union of the labels in the two rank_histograms
    • This aligns histogram_1 and histogram_2 over the same set of labels
  • Calculate the absolute difference between the normalized counts of histogram_1 and histogram_2
    • A label that is missing from one of the histograms contributes a normalized count of 0
  • Take the maximum of these differences (this is the L-infinity distance)
def calculate_l_infinity_dist(current_stats, target_stats, dataset_idx, feature_idx):
    current_datasets = current_stats.datasets[dataset_idx]
    target_datasets = target_stats.datasets[dataset_idx]
    
    current_label_list, current_norm_count_dict = get_normalized_label_list_counts(current_datasets, feature_idx)
    target_label_list, target_norm_count_dict = get_normalized_label_list_counts(target_datasets, feature_idx)

    # The two datasets may contain different labels, so take the union
    union_label_list = set(current_label_list) | set(target_label_list)

    # Compute absolute differences over the union;
    # a label missing from one histogram counts as 0
    res_dict = {}
    for label in union_label_list:
        res_dict[label] = abs(current_norm_count_dict.get(label, 0) - target_norm_count_dict.get(label, 0))
    
    return max(res_dict.values())
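
A quick way to sanity-check this against TFDV-generated statistics; the toy DataFrames and the feature name 'workclass' are mine, not from the issue:

import pandas as pd
import tensorflow_data_validation as tfdv

# Toy categorical feature: 80/20 split in training vs. 60/40 in serving.
train_df = pd.DataFrame({'workclass': ['Private'] * 80 + ['State-gov'] * 20})
serving_df = pd.DataFrame({'workclass': ['Private'] * 60 + ['State-gov'] * 40})

train_stats = tfdv.generate_statistics_from_dataframe(train_df)
serving_stats = tfdv.generate_statistics_from_dataframe(serving_df)

# Single dataset and single feature, so both indices are 0.
print(calculate_l_infinity_dist(train_stats, serving_stats, dataset_idx=0, feature_idx=0))
# Expected: max(|0.8 - 0.6|, |0.2 - 0.4|) = 0.2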

Original TFDV Source Code

std::pair<string, double> GetLInftyNorm(const std::map<string, double>& vec) {
  std::pair<string, double> best_so_far;
  for (const auto& pair : vec) {
    const string& key = pair.first;
    const double value = std::abs(pair.second);
    if (value >= best_so_far.second) {
      best_so_far = {key, value};
    }
  }
  return best_so_far;
}
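
For reference, here is a rough Python equivalent of GetLInftyNorm; unlike the max() above, it also reports which label produced the maximum difference. The helper name is mine:

def get_l_infinity_norm(diff_by_label):
    """Return (label, value) for the largest absolute difference,
    mirroring the C++ GetLInftyNorm above."""
    best_label, best_value = '', 0.0
    for label, value in diff_by_label.items():
        if abs(value) >= best_value:
            best_label, best_value = label, abs(value)
    return best_label, best_value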

Jensen-Shannon Divergence for numeric feature

Input

  • features[feature_index].num_stats.histogram
    • Collections of (low_value, high_value, sample_count) buckets.
    • e.g. buckets { low_value: 17.0, high_value: 24.3, sample_count: 4435.9744 } { ... }
    • The original length is 10 buckets.
    • The two histograms usually have different low_value/high_value boundaries, so the histograms have to be modified by taking the union of their boundaries.
  • features[feature_index].num_stats.common_stats.num_missing
    • Used to check whether a NaN-value bucket needs to be added
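
The functions below work on plain dicts rather than the statistics proto directly. A small converter sketch; the proto field names (histograms, low_value, high_value, sample_count, num_missing) and the assumption that the standard histogram is the first entry are mine, so treat them as assumptions:

def get_histogram_dict(datasets, feature_idx):
    """Convert a numeric feature's standard histogram into the plain-dict
    format used by the rebucketing code below. Field names are assumptions
    based on the statistics proto, not taken from this issue."""
    num_stats = datasets.features[feature_idx].num_stats
    histogram = num_stats.histograms[0]  # assumed to be the STANDARD histogram
    return {
        'num_nan': num_stats.common_stats.num_missing,
        'buckets': [{'low_value': b.low_value,
                     'high_value': b.high_value,
                     'sample_count': b.sample_count} for b in histogram.buckets],
    }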

Logic

  • Get the union of the two histograms' boundaries using low_value and high_value.
    • Each low_value and high_value is treated as a boundary.
    • If the two histograms have different boundaries, the boundaries must be unioned so that the buckets can be compared one-to-one.
    • e.g. [10, 20, 30, 40, 50, 60, 70, 80, 90, 100] union [15, 25, 35, 45, 55, 65, 75, 85, 95, 105] --> [10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105]
def get_union_boundaries(current_histogram, target_histogram):
    boundaries = set()
    for bucket in current_histogram['buckets']:
        boundaries.add(bucket['low_value'])
        boundaries.add(bucket['high_value'])
        
    for bucket in target_histogram['buckets']:
        boundaries.add(bucket['low_value'])
        boundaries.add(bucket['high_value'])
    
    boundaries = sorted(list(boundaries))
    
    return boundaries
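
A tiny illustration of the boundary union (toy numbers, mine):

hist_a = {'buckets': [{'low_value': 10, 'high_value': 20, 'sample_count': 5.0},
                      {'low_value': 20, 'high_value': 30, 'sample_count': 5.0}]}
hist_b = {'buckets': [{'low_value': 15, 'high_value': 25, 'sample_count': 7.0}]}
print(get_union_boundaries(hist_a, hist_b))  # [10, 15, 20, 25, 30]
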
  • Add values to the newly created buckets (rebucketing).
    • Fill in empty buckets up to the first bucket in the existing histogram.
    • If an existing bucket has to be split by a new boundary, distribute its sample_count assuming a uniform distribution within the bucket.
      • {low_value: 10, high_value: 20, sample_count: 100} -> {low_value: 10, high_value: 15, sample_count:50}, {low_value: 15, high_value: 20, sample_count: 50}
    • Fill in empty buckets after the last bucket in the existing histogram.
def add_buckets_to_histogram(bucket_boundaries, 
                            total_sample_count,
                            total_range_covered,
                            histogram):
    """
    e.g.
    # Original bucket
    low_value: 19214.0, high_value: 165763.1, sample_count: 4158.425273604061
    # Boundaries added while referencing the other histogram (a new boundary at
    # 156600.0 means the bucket has to be split in two)
    [19214.0, 156600.0, 165763.1]
    # Assuming a uniform distribution, each new bucket receives
    # (high_value - low_value) / cover_range of the sample_count
    # cover_range: 165763.1 - 19214.0
    {'low_value': 19214.0, 'high_value': 156600.0, 'sample_count': 3898.4163985951977}
    {'low_value': 156600.0, 'high_value': 165763.1, 'sample_count': 260.0088750088632}

    """
    
    num_new_buckets = len(bucket_boundaries) - 1  # boundaries has one more entry than buckets
    for i in range(num_new_buckets):
        new_bucket = {}
        new_bucket['low_value'] = bucket_boundaries[i]
        new_bucket['high_value'] = bucket_boundaries[i+1]
        # Split sample_count assuming a uniform distribution within the original bucket
        new_bucket['sample_count'] = ((bucket_boundaries[i+1] - bucket_boundaries[i]) / total_range_covered) * total_sample_count
        
        histogram['buckets'].append(new_bucket)

def rebucket_histogram(boundaries, histogram):
    rebuck_hist = {}
    rebuck_hist['buckets'] = []
    rebuck_hist['num_nan'] = histogram['num_nan']
    
    index = 0
    max_index = len(boundaries) - 1

    for bucket in histogram['buckets']:
        low_value = bucket['low_value']
        high_value = bucket['high_value']
        sample_count = bucket['sample_count']
        
        # Fill in empty buckets below this bucket's original range (sample_count 0)
        while low_value > boundaries[index]:
            new_bucket = {}
            new_bucket['low_value'] = boundaries[index]
            index += 1
            new_bucket['high_value'] = boundaries[index]
            new_bucket['sample_count'] = 0.0

            rebuck_hist['buckets'].append(new_bucket)

        # Additional edge-case handling; disabled for now
        # if low_value == high_value and low_value == boundaries[index]:
        #     new_bucket = {}
        #     new_bucket['low_value'] = boundaries[index]
        #     index += 1
        #     new_bucket['high_value'] = boundaries[index]
        #     new_bucket['sample_count'] = sample_count

        #     rebuck_hist['buckets'].append(new_bucket)
        #     continue

        # Determine the boundary range this bucket has to be split over
        covered_boundaries = []
        while high_value > boundaries[index]:
            covered_boundaries.append(boundaries[index])
            index += 1
        covered_boundaries.append(boundaries[index])  # include the boundary equal to high_value

        if len(covered_boundaries) > 0:
            add_buckets_to_histogram(covered_boundaries, sample_count, high_value - low_value, rebuck_hist)
    
    # Fill in empty buckets beyond this histogram's original range (sample_count 0)
    for i in range(index, max_index):
        new_bucket = {}
        new_bucket['low_value'] = boundaries[i]
        new_bucket['high_value'] = boundaries[i + 1]
        new_bucket['sample_count'] = 0.0

        rebuck_hist['buckets'].append(new_bucket)
    return rebuck_hist

def align_histogram(current_histogram, target_histogram):
    boundaries = get_union_boundaries(current_histogram, target_histogram)
    current_histogram_rebucketted = rebucket_histogram(boundaries, current_histogram)
    target_histogram_rebucketted = rebucket_histogram(boundaries, target_histogram)
    return current_histogram_rebucketted, target_histogram_rebucketted
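
A toy example of the alignment (numbers are mine): a single bucket covering [10, 20) with 100 samples, compared against a histogram covering [15, 25):

hist_a = {'num_nan': 0, 'buckets': [{'low_value': 10, 'high_value': 20, 'sample_count': 100.0}]}
hist_b = {'num_nan': 0, 'buckets': [{'low_value': 15, 'high_value': 25, 'sample_count': 100.0}]}

a_aligned, b_aligned = align_histogram(hist_a, hist_b)
# Both now share the boundaries [10, 15, 20, 25]:
#   a_aligned sample_counts: [50.0, 50.0, 0.0]  (the original bucket is split 50/50)
#   b_aligned sample_counts: [0.0, 50.0, 50.0]
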
  • Normalize the rebucketed histograms' sample_counts so that each histogram sums to 1.0.
  • Calculate the Jensen-Shannon divergence from the normalized histograms (a small end-to-end sketch follows the code below).
    • Kullback-Leibler divergence: D_KL(P||Q) = sum_i p_i * log(p_i / q_i)
    • Jensen-Shannon divergence: JSD(P||Q) = 1/2 * D_KL(P||M) + 1/2 * D_KL(Q||M), where M = (P + Q) / 2
    • The KL terms are summed over buckets, giving the final JSD value.
from math import log2

# Calculate the KL divergence
# p and q are probability vectors (e.g. numpy arrays of normalized bucket counts)
def kl_divergence(p, q):
    res = 0
    for i in range(len(p)):
        # Skip buckets where either probability is zero to avoid log(0)
        if p[i] > 0 and q[i] > 0:
            res += p[i] * log2(p[i]/q[i])
    
    return res

# Calculate the JS divergence
def js_divergence(p, q):
    m = 0.5 * (p + q)  # element-wise mean, so p and q should be numpy arrays
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
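
As mentioned in the normalization step above, the aligned sample_counts still have to be turned into probabilities before they go into js_divergence. A minimal end-to-end sketch; the helper name and the toy (already aligned) histograms are mine:

import numpy as np

def normalize_counts(histogram):
    """Turn a rebucketed histogram's sample_counts into a probability vector summing to 1."""
    counts = np.array([b['sample_count'] for b in histogram['buckets']], dtype=float)
    return counts / counts.sum()

# Toy histograms already aligned over the boundaries [10, 15, 20, 25].
hist_a = {'num_nan': 0, 'buckets': [
    {'low_value': 10, 'high_value': 15, 'sample_count': 50.0},
    {'low_value': 15, 'high_value': 20, 'sample_count': 50.0},
    {'low_value': 20, 'high_value': 25, 'sample_count': 0.0}]}
hist_b = {'num_nan': 0, 'buckets': [
    {'low_value': 10, 'high_value': 15, 'sample_count': 0.0},
    {'low_value': 15, 'high_value': 20, 'sample_count': 50.0},
    {'low_value': 20, 'high_value': 25, 'sample_count': 50.0}]}

p = normalize_counts(hist_a)  # [0.5, 0.5, 0.0]
q = normalize_counts(hist_b)  # [0.0, 0.5, 0.5]
print(js_divergence(p, q))    # 0.5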

Original TFDV Source Code

Status UpdateJensenShannonDivergenceResult(const FeatureStatsView& a,
                                           const FeatureStatsView& b,
                                           double& result) {
  const absl::optional<Histogram> maybe_histogram_1 = a.GetStandardHistogram();
  const absl::optional<Histogram> maybe_histogram_2 = b.GetStandardHistogram();
  if (!maybe_histogram_1 || !maybe_histogram_2) {
    return tensorflow::errors::InvalidArgument(
        "Both input statistics must have a standard histogram in order to "
        "calculate the Jensen-Shannon divergence.");

Python code implementation for skew calculation

The full Python implementation is available in my repository:
https://github.com/jinwoo1990/mlops-with-tensorflow/blob/main/tfdv/tfdv_skew_metrics.ipynb

Thank you so much for this info! This is exactly what I was looking for. :)