tensorflow / data-validation

Library for exploring and validating machine learning data

TFDV on Dataflow getting OOMs frequently

cyc opened this issue

I am using TFDV 1.2.0 and have a problem where workers consistently OOM on Dataflow, even with very large instance types (e.g. n2-highmem-16 and n2-highmem-32). I've tried decreasing --number_of_worker_harness_threads to half the number of available CPUs in the hope of reducing memory usage, to no avail. The OOMs typically happen at the very end of the job, when it has finished loading the data but is still combining the stats.

The dataset itself is quite large: 4000-5000 features spanning ~1e9 rows (and some varlen features are quite long). Do you have any tips for reducing memory usage on the workers? I suspect the issue may come from some high-cardinality (~1e8) string features we have, since computing frequency counts for these features is probably very memory-intensive.
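
For reference, the job is launched roughly like this (a sketch, not my exact pipeline; the project, bucket, and paths are placeholders):

```python
import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import PipelineOptions

pipeline_options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-gcp-project',            # placeholder
    '--region=us-central1',                # placeholder
    '--temp_location=gs://my-bucket/tmp',  # placeholder
    '--machine_type=n2-highmem-16',
    '--number_of_worker_harness_threads=8',  # half the vCPUs, as described above
])

stats = tfdv.generate_statistics_from_tfrecord(
    data_location='gs://my-bucket/data/*.tfrecord.gz',  # placeholder
    output_path='gs://my-bucket/stats/stats.tfrecord',  # placeholder
    stats_options=tfdv.StatsOptions(),
    pipeline_options=pipeline_options,
)
```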

Hi Chris!

Do you have the beam stage names where the OOM is occurring?

During the combine phase (when calculating statistics), we have a limit on the size of the record batch (which is what is stored in memory by each beam worker) [1]. Would tweaking this value help?

We also have an experimental StatsOption: experimental_use_sketch_based_topk_uniques [2]. This might help because it makes top_k a mergeable statistic with a compact property (i.e. beam may be able to compress the in-memory representation when it chooses to).

[1] https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/statistics/stats_impl.py#L602-L608

[2] https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/statistics/stats_options.py#L141-L142
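
For example, both could be set roughly like this (a sketch; note that [1] is a private module constant, so overriding it is a monkey-patch, and on Dataflow such a patch has to ship to the workers, e.g. via a custom container or setup.py, not just the launcher):

```python
import tensorflow_data_validation as tfdv
from tensorflow_data_validation.statistics import stats_impl

# [2] Sketch-based top-k/uniques: mergeable and more compact in memory.
stats_options = tfdv.StatsOptions(
    experimental_use_sketch_based_topk_uniques=True)

# [1] Lowering the merge threshold forces stats to be computed on smaller
# in-memory chunks. This is a private constant (see the link above), so
# treat it as an unofficial workaround that may break across versions.
stats_impl._MERGE_RECORD_BATCH_BYTE_SIZE_THRESHOLD = 10 << 20  # ~10 MiB
```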

Hi @andylou2 I believe it is RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators/CombinePerKey.

I did try using experimental_use_sketch_based_topk_uniques but it still ran out of memory unfortunately. I will try running it again with some of the high-cardinality string features removed from the dataset and see if that fixes the issue; if so, that may be the root cause.

I am not sure whether changing the record batch size threshold would help. Could you clarify what this does exactly? When we load the recordbatch from the dataset it comes in a specific batch size (typically tfdv.constants.DEFAULT_DESIRED_INPUT_BATCH_SIZE or some batch size specified by the user in StatsOptions). Is this saying that the recordbatches will be concatenated up to no more than 20MB in size prior to computing stats on them? Given that the highmem instances have quite a lot of RAM, would you expect that the 20MB recordbatch size would be problematic?

Could you clarify what this does exactly? When we load the recordbatch from the dataset it comes in a specific batch size (typically tfdv.constants.DEFAULT_DESIRED_INPUT_BATCH_SIZE or some batch size specified by the user in StatsOptions). Is this saying that the recordbatches will be concatenated up to no more than 20MB in size prior to computing stats on them?

When _MERGE_RECORD_BATCH_BYTE_SIZE_THRESHOLD is reached, the beam worker will compute the stats on the current batch (even if the desired batch size isn't reached) and then clear the record batches from RAM.
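
Conceptually (this is an illustration, not the actual TFDV code), the behavior is something like:

```python
import pyarrow as pa

_MERGE_RECORD_BATCH_BYTE_SIZE_THRESHOLD = 20 << 20  # ~20 MiB

def _chunks_for_stats(record_batches):
  """Illustration only: buffer record batches until the byte threshold is hit,
  then emit the merged chunk for stats computation and drop it from memory."""
  buffered, buffered_bytes = [], 0
  for rb in record_batches:
    buffered.append(rb)
    buffered_bytes += rb.nbytes
    if buffered_bytes >= _MERGE_RECORD_BATCH_BYTE_SIZE_THRESHOLD:
      yield pa.Table.from_batches(buffered)  # stats computed on this chunk
      buffered, buffered_bytes = [], 0
  if buffered:
    yield pa.Table.from_batches(buffered)
```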

Given that the highmem instances have quite a lot of RAM, would you expect that the 20MB recordbatch size would be problematic?

It probably isn't. But I do believe each instance can have multiple beam workers, and this threshold only applies per worker.

I don't know as much about Dataflow as I wish I did, but I think --number_of_worker_harness_threads may be the wrong flag. It looks like --max_num_workers may be the right one: https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#autoscaling. Although the note below says that autoscaling is based on the number of threads... so maybe not so helpful.
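
i.e., something like this, assuming flag-style pipeline options (32 is just an example cap):

```python
from apache_beam.options.pipeline_options import PipelineOptions

pipeline_options = PipelineOptions([
    # ... other Dataflow flags (runner, project, region, temp_location, ...)
    '--max_num_workers=32',  # upper bound for autoscaling
])
```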

I've tried a bunch of different things to fix this issue on my end, but haven't had much success. The only thing that consistently works is setting StatsOptions.sample_rate=0.001, which seems pretty extreme.
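
i.e., roughly:

```python
import tensorflow_data_validation as tfdv

# Aggressive downsampling (~0.1% of rows) is the only setting that has
# reliably avoided the OOMs so far.
stats_options = tfdv.StatsOptions(sample_rate=0.001)
```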

The problem actually seems to be in BasicStatsGenerator, not TopKUniques. If I set StatsOptions.add_default_generators=False and manually add just BasicStatsGenerator to the list of generators (excluding TopK), I still get an OOM (even on an n1-highmem-64, which has something like 416GB of RAM). The OOM occurs after all the data is loaded and processed, while the resulting stats are being combined by just a few workers. Roughly what I tested is sketched below.
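
This is a sketch, not my exact code; the generator import path is from the TFDV source tree:

```python
import tensorflow_data_validation as tfdv
from tensorflow_data_validation.statistics.generators import basic_stats_generator

# Only BasicStatsGenerator, no top-k/uniques generators: still OOMs.
stats_options = tfdv.StatsOptions(
    add_default_generators=False,
    generators=[basic_stats_generator.BasicStatsGenerator()],
)
```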

I will open a GCP support ticket and reference the job ids of the failed dataflow runs in question since that might help debug it a little better.

Support case number is 28706271 and includes some more details. I can cc people on the ticket if need be.

Can you clarify whether there are any workarounds for this issue? We are pretty consistently seeing this OOM with 1.2.0. The only solution seems to be setting StatsOptions.sample_rate to a very small value, such that the num_instances counter is less than 1e6. Given that I can probably load 1e6 rows entirely into memory on an n1-highmem-16, it seems strange that the memory overhead for computing/combining stats would be so high.

I was having this same issue with my datasets and opened a GCP ticket and everything. Eventually it came down to my instances not being large enough, even though I thought they were. After experimenting I ended up on n2-highmem-8. I wonder if, given how many features you have, you are running into the same issue.

You said you tried n2-highmem-16. As a test, have you tried spinning up much, much larger instances just to see if they can get through your dataset? Something exaggerated like n2-highmem-48.

@cyc - TFDV 1.4.0 was released last week, and we expect that it will reduce the number of OOM errors. Has anyone tried it yet?

Hi @rcrowe-google, thanks for the heads up. Could you point to the specific changes or commits that should help with OOM errors? Or let me know if there are any particular settings I should try adding, such as the experimental sketch-based generators? I can experiment with upgrading to 1.4.0 later this week.

I tried TFDV 1.4.0 and compared it to 1.2.0 on a relatively small dataset (~100m rows, ~5k columns), and still ran into OOM issues on Dataflow with n2-highmem-64 machine types for both.

@cyc - Thanks for taking the time to test that, and I'm very disappointed that it didn't work. I think what we need most at this point is code and data that we can use to reproduce the exact problem. Can you share that either here or in the GCP case?