Large portion of the time spent on "All others" category

Question

Large portion of the time spent on "All others" category

ethanyanjiali opened this issue 4 years ago · comments

profiling log:
https://drive.google.com/file/d/1A8gilaW6BguoPc1x8G6DxPajNnsKMoQJ/view?usp=sharing

I'm using profiler with my custom loop training. training step is wrapped by tf.function just like the distributed strategy tutorial. and for profiling, i only ran 5 fake short epochs. I'm also using tf.data APIs with all the prefetch and cache tricks. so in general i don't see much custom python overhead in my code.

When inspecting the tracer, it seems like Iterator::Prefetch is most expensive one, but I can't figure out what does it mean. My questions are 1) does this "All others" category also mistakenly include the time spent on my tf.data input pipeline? 2) could you give me some suggestion on how to debug in this situation? "All others" category doesn't give me too much information about where to optimize.

I also checked this issue: #2
however i'm not using any py_function in my tf.data pipeline, just all the standard stuff like resizing, decoding, random jittering, etc

Ethan Yanjia Li · Answer 1 · Wed May 13 2020 01:36:07 GMT+0800 (China Standard Time)

btw i also see Segmentation Fault and stack traces like this occasionally:

7fcb1a679000-7fcb1a6a7000 rw-p 00000000 00:00 0 
7fcb1a6a7000-7fcb1a6ca000 r-xp 00000000 08:01 393316                     /lib/x86_64-linux-gnu/ld-2.24.so
7fcb1a6ca000-7fcb1a6cb000 rw-s 00000000 00:06 39962                      /dev/nvidiactl
7fcb1a6cb000-7fcb1a6cc000 r--s 00000000 00:06 11387                      /dev/nvidia7
7fcb1a6cc000-7fcb1a6dc000 -w-s 00000000 00:06 39963                      /dev/nvidia0
7fcb1a6dc000-7fcb1a71c000 rw-p 00000000 00:00 0 
7fcb1a71c000-7fcb1a8b7000 r--p 00000000 08:01 524291                     /usr/lib/locale/locale-archive
7fcb1a8b7000-7fcb1a8bb000 rw-p 00000000 00:00 0 
7fcb1a8bb000-7fcb1a8bc000 r--s 00000000 00:06 11386                      /dev/nvidia6
7fcb1a8bc000-7fcb1a8bd000 r--s 00000000 00:06 26741                      /dev/nvidia5
7fcb1a8bd000-7fcb1a8be000 r--s 00000000 00:06 11385                      /dev/nvidia4
7fcb1a8be000-7fcb1a8bf000 r--s 00000000 00:06 26740                      /dev/nvidia3
7fcb1a8bf000-7fcb1a8c0000 r--s 00000000 00:06 26739                      /dev/nvidia2
7fcb1a8c0000-7fcb1a8c1000 r--s 00000000 00:06 26738                      /dev/nvidia1
7fcb1a8c1000-7fcb1a8c2000 r--s 00000000 00:06 39963                      /dev/nvidia0
7fcb1a8c2000-7fcb1a8c3000 rwxp 00000000 00:00 0 
7fcb1a8c3000-7fcb1a8ca000 r--s 00000000 08:01 526696                     /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache
7fcb1a8ca000-7fcb1a8cb000 r--p 00023000 08:01 393316                     /lib/x86_64-linux-gnu/ld-2.24.so
7fcb1a8cb000-7fcb1a8cc000 rw-p 00024000 08:01 393316                     /lib/x86_64-linux-gnu/ld-2.24.so
7fcb1a8cc000-7fcb1a8cd000 rw-p 00000000 00:00 0 
7ffd6dce3000-7ffd6dd04000 rw-p 00000000 00:00 0                          [stack]
7ffd6dda2000-7ffd6dda4000 r--p 00000000 00:00 0                          [vvar]
7ffd6dda4000-7ffd6dda6000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
Aborted

ckluk · Answer 2 · Wed May 13 2020 10:26:12 GMT+0800 (China Standard Time)

Hi, I suspect that "All others" includes the time spent on the tf.data input pipeline. We have recently fixed a related issue. To confirm, can you install tf-nightly from https://pypi.org/project/tf-nightly/ and collect the profile again? Once we confirm that this is a tf.data issue, we can ask for advice from the tf.data team. Thanks, -ck

…

On Tue, May 12, 2020 at 10:18 AM Ethan Yanjia Li ***@***.***> wrote: [image: image] <https://user-images.githubusercontent.com/8679679/81724011-8b0a2a00-9438-11ea-8b66-652c9a8b1d5a.png> profiling log: https://drive.google.com/file/d/1A8gilaW6BguoPc1x8G6DxPajNnsKMoQJ/view?usp=sharing I'm using profiler with my custom loop training. training step is wrapped by tf.function just like the distributed strategy tutorial. and for profiling, i only ran 5 fake short epochs. I'm also using tf.data APIs with all the prefetch and cache tricks. so in general i don't see much custom python overhead in my code. When inspecting the tracer, it seems like Iterator::Prefetch is most expensive one, but I can't figure out what does it mean. My questions are 1) does this "All others" category also mistakenly include the time spent on my tf.data input pipeline? 2) could you give me some suggestion on how to debug in this situation? "All others" category doesn't give me too much information about where to optimize. I also checked this issue: #2 <#2> however i'm not using any py_function in my tf.data pipeline, just all the standard stuff like resizing, decoding, random jittering, etc — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#16>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AE33L3LWCESAICOQDNNN4EDRRGAEZANCNFSM4M7AQCEA> .

Cheng Ren · Answer 3 · Tue Jun 16 2020 06:26:05 GMT+0800 (China Standard Time)

@ckluk can you share the PR to the fix? Also since which version the fix was in?

ckluk · Answer 4 · Wed Jun 17 2020 09:16:36 GMT+0800 (China Standard Time)

This PR just says that the "All Others" time may be due to tf.data input pipeline; it doesn't actually fix any bug.

…

On Mon, Jun 15, 2020 at 3:26 PM Cheng Ren ***@***.***> wrote: @ckluk <https://github.com/ckluk> can you share the PR to the fix? Also since which version the fix was in? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#16 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AE33L3MGLC3EB3ZQAZE3NODRW2NYVANCNFSM4M7AQCEA> .

Benjamin Weinstein · Answer 5 · Tue Nov 24 2020 03:42:53 GMT+0800 (China Standard Time)

Was there any update here? My dominant class is "Other"

I can't seem to scale beyond a single GPU with any performance gain. Looking at trace I see alot of "ExecutorDoneCallback" which has a 'user friendly category' of 'other'. I cannot find any docs with this function in it.

ckluk · Answer 6 · Wed Nov 25 2020 07:30:07 GMT+0800 (China Standard Time)

If you compare the left and right pie charts, you can see that the "others" in the right pie chart is "Dataset" ops. They are the tf.data ops used in your input pipeline.

…

On Mon, Nov 23, 2020 at 11:43 AM Ben Weinstein ***@***.***> wrote: Was there any update here? My dominant class is "Other" [image: Screen Shot 2020-11-23 at 11 40 18 AM] <https://user-images.githubusercontent.com/1208492/100007299-bb843680-2d80-11eb-80e3-27c7e00760d9.png> I can't seem to scale beyond a single GPU with any performance gain. Looking at trace I see alot of "ExecutorDoneCallback" which has a 'user friendly category' of 'other'. I cannot find any docs with this function in it. [image: Screen Shot 2020-11-23 at 11 41 51 AM] <https://user-images.githubusercontent.com/1208492/100007441-eff7f280-2d80-11eb-8939-75a44183e027.png> — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#16 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AE33L3N542WYWX6Y5QRODA3SRK3MXANCNFSM4M7AQCEA> .

Benjamin Weinstein · Answer 7 · Wed Nov 25 2020 08:17:34 GMT+0800 (China Standard Time)

Thanks for following up. By the tf.data ops do you mean the time elapsed during things like dataset.batch
and dataset.zip, but not the map functions, since thats what the rest of the pie shows? I've played around with a number of designs, but looking for a way to allow customizable parsers (below) to build up a dataset. Perhaps that's what's going on. I've taken these operations out and haven't noticed any difference.

def tf_dataset(tfrecords,
               batch_size=2,
               shuffle=True,
               RGB=True,
               HSI=True,
               labels=True,
               ids = False,
               metadata=True,
               submodel=False,
               augmentation = True,
               cache=False,
               cores=32):
    """Create a tf.data dataset that yields sensor data and ground truth
    Args:
        tfrecords: path to tfrecords, see generate.py
        RGB: Include RGB data
        HSI: Include HSI data
        ids: include box ids
        metadata: include metadata 
        labels: training record labels
        submodel: Logical. "spectral" or "spatial submodels" have three label inputs
    Returns:
        dataset: a tf.data dataset yielding crops and labels for train: True, crops and raster indices for train: False
        """
    AUTO = tf.data.experimental.AUTOTUNE

    inputs = [ ]
    
    dataset = tf.data.TFRecordDataset(tfrecords, num_parallel_reads=cores)   
    
    if shuffle:
        dataset = dataset.shuffle(10)      
    
    if ids:
        ids_dataset = dataset.map(_box_index_parse_, num_parallel_calls=cores) 
            
    if HSI:
        HSI_dataset = dataset.map(_HSI_parse_, num_parallel_calls=cores) 
        if augmentation:
            HSI_dataset = HSI_dataset.map(augment, num_parallel_calls=cores)   
                
        inputs.append(HSI_dataset)        
        
    if RGB:
        RGB_dataset = dataset.map(_RGB_parse_, num_parallel_calls=cores) 
        if augmentation:
            RGB_dataset = RGB_dataset.map(augment, num_parallel_calls=cores)    
        inputs.append(RGB_dataset)    
        
    if metadata:        
        height_dataset = dataset.map(_height_parse_, num_parallel_calls=cores)     
        inputs.append(height_dataset)   
        
        elevation_dataset = dataset.map(_elevation_parse_, num_parallel_calls=cores)                 
        inputs.append(elevation_dataset)   
        
        site_dataset = dataset.map(_site_parse_, num_parallel_calls=cores)                 
        inputs.append(site_dataset)   
        
    if labels:
        labels_dataset = dataset.map(_label_parse_, num_parallel_calls=cores) 
        
        if submodel:
            labels_dataset = tf.data.Dataset.zip((labels_dataset, labels_dataset, labels_dataset))
            
    if ids:
        if labels:
            zipped_dataset = tf.data.Dataset.zip((ids_dataset, tuple(inputs), labels_dataset))
        else:
            zipped_dataset = tf.data.Dataset.zip((ids_dataset, tuple(inputs)))        
    else:
        if labels:
            zipped_dataset = tf.data.Dataset.zip((tuple(inputs), labels_dataset))
        else:
            zipped_dataset = tf.data.Dataset.zip(tuple(inputs))              
          
    #batch and shuffle
    if shuffle:
        zipped_dataset = zipped_dataset.shuffle(buffer_size=10)   
    
    zipped_dataset = zipped_dataset.batch(batch_size=batch_size)
    if cache:
        zipped_dataset = zipped_dataset.cache()
    zipped_dataset = zipped_dataset.prefetch(buffer_size=1)    
    
    return zipped_dataset

tensorflow / profiler

Large portion of the time spent on "All others" category

GPU utilization

CPU utilization