tensorflow / profiler

A profiling and performance analysis tool for TensorFlow

Large portion of the time spent on "All others" category

ethanyanjiali opened this issue · comments

profiling log:
https://drive.google.com/file/d/1A8gilaW6BguoPc1x8G6DxPajNnsKMoQJ/view?usp=sharing

I'm using the profiler with my custom training loop. The training step is wrapped in tf.function, just like in the distributed strategy tutorial, and for profiling I only ran 5 short fake epochs. I'm also using the tf.data APIs with all the prefetch and cache tricks, so in general I don't see much custom Python overhead in my code.

When inspecting the trace, Iterator::Prefetch seems to be the most expensive op, but I can't figure out what it means. My questions are: 1) does this "All others" category also mistakenly include time spent in my tf.data input pipeline? 2) could you give me some suggestions on how to debug this situation? The "All others" category doesn't tell me much about where to optimize.
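One way to tell step time apart from input-pipeline time in the trace viewer is to label each training step explicitly with the profiler's Trace API. Below is a minimal sketch; `train_step` is a stand-in for the real tf.function-wrapped step, and the log directory is an assumption:

```python
import tensorflow as tf

@tf.function
def train_step(batch):
    # Placeholder for the real training step; in the actual loop this would
    # run the forward/backward pass under the distribution strategy.
    return tf.reduce_sum(batch)

def profile_few_steps(dataset, logdir="logs/profile", num_steps=5):
    """Profile a handful of steps so each one is labeled in the trace viewer."""
    tf.profiler.experimental.start(logdir)
    for step, batch in enumerate(dataset.take(num_steps)):
        # Naming each step makes it visible whether time falls inside the
        # step (compute) or between steps (input pipeline / "All others").
        with tf.profiler.experimental.Trace("train", step_num=step, _r=1):
            train_step(batch)
    tf.profiler.experimental.stop()
```

With steps labeled this way, gaps between consecutive "train" events in the trace viewer are time the accelerator spends waiting, typically on the input pipeline.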

I also checked issue #2; however, I'm not using any py_function in my tf.data pipeline, just standard ops like resizing, decoding, and random jittering.
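For reference, a pipeline built only from standard graph ops (no tf.py_function) looks something like the sketch below; the exact sizes and augmentations are illustrative assumptions:

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

def preprocess(image):
    # All standard TF graph ops -- no tf.py_function, so nothing here
    # should fall back to the Python interpreter during iteration.
    image = tf.image.resize(image, [224, 224])
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    return image

def make_pipeline(images):
    ds = tf.data.Dataset.from_tensor_slices(images)
    ds = ds.map(preprocess, num_parallel_calls=AUTOTUNE)
    ds = ds.batch(8)
    return ds.prefetch(AUTOTUNE)
```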

By the way, I also occasionally see a segmentation fault with stack traces like this:

7fcb1a679000-7fcb1a6a7000 rw-p 00000000 00:00 0 
7fcb1a6a7000-7fcb1a6ca000 r-xp 00000000 08:01 393316                     /lib/x86_64-linux-gnu/ld-2.24.so
7fcb1a6ca000-7fcb1a6cb000 rw-s 00000000 00:06 39962                      /dev/nvidiactl
7fcb1a6cb000-7fcb1a6cc000 r--s 00000000 00:06 11387                      /dev/nvidia7
7fcb1a6cc000-7fcb1a6dc000 -w-s 00000000 00:06 39963                      /dev/nvidia0
7fcb1a6dc000-7fcb1a71c000 rw-p 00000000 00:00 0 
7fcb1a71c000-7fcb1a8b7000 r--p 00000000 08:01 524291                     /usr/lib/locale/locale-archive
7fcb1a8b7000-7fcb1a8bb000 rw-p 00000000 00:00 0 
7fcb1a8bb000-7fcb1a8bc000 r--s 00000000 00:06 11386                      /dev/nvidia6
7fcb1a8bc000-7fcb1a8bd000 r--s 00000000 00:06 26741                      /dev/nvidia5
7fcb1a8bd000-7fcb1a8be000 r--s 00000000 00:06 11385                      /dev/nvidia4
7fcb1a8be000-7fcb1a8bf000 r--s 00000000 00:06 26740                      /dev/nvidia3
7fcb1a8bf000-7fcb1a8c0000 r--s 00000000 00:06 26739                      /dev/nvidia2
7fcb1a8c0000-7fcb1a8c1000 r--s 00000000 00:06 26738                      /dev/nvidia1
7fcb1a8c1000-7fcb1a8c2000 r--s 00000000 00:06 39963                      /dev/nvidia0
7fcb1a8c2000-7fcb1a8c3000 rwxp 00000000 00:00 0 
7fcb1a8c3000-7fcb1a8ca000 r--s 00000000 08:01 526696                     /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache
7fcb1a8ca000-7fcb1a8cb000 r--p 00023000 08:01 393316                     /lib/x86_64-linux-gnu/ld-2.24.so
7fcb1a8cb000-7fcb1a8cc000 rw-p 00024000 08:01 393316                     /lib/x86_64-linux-gnu/ld-2.24.so
7fcb1a8cc000-7fcb1a8cd000 rw-p 00000000 00:00 0 
7ffd6dce3000-7ffd6dd04000 rw-p 00000000 00:00 0                          [stack]
7ffd6dda2000-7ffd6dda4000 r--p 00000000 00:00 0                          [vvar]
7ffd6dda4000-7ffd6dda6000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
Aborted
commented

@ckluk can you share the PR for the fix? Also, since which version has the fix been included?

commented

Was there any update here? My dominant category is "Other".

I can't seem to scale beyond a single GPU with any performance gain. Looking at the trace, I see a lot of "ExecutorDoneCallback", which has a 'user friendly category' of 'other'. I cannot find any docs mentioning this function.

commented

Thanks for following up. By the tf.data ops, do you mean the time spent in operations like dataset.batch and dataset.zip, but not the map functions, since that's what the rest of the pie chart shows? I've played around with a number of designs, but I'm looking for a way to allow customizable parsers (below) to build up a dataset. Perhaps that's what's going on. I've taken these operations out and haven't noticed any difference.

import tensorflow as tf

def tf_dataset(tfrecords,
               batch_size=2,
               shuffle=True,
               RGB=True,
               HSI=True,
               labels=True,
               ids=False,
               metadata=True,
               submodel=False,
               augmentation=True,
               cache=False,
               cores=32):
    """Create a tf.data dataset that yields sensor data and ground truth.

    Args:
        tfrecords: path to tfrecords, see generate.py
        batch_size: batch size
        shuffle: shuffle records and batches
        RGB: include RGB data
        HSI: include HSI data
        labels: include training record labels
        ids: include box ids
        metadata: include metadata
        submodel: logical; "spectral" or "spatial" submodels have three label inputs
        augmentation: apply augmentation to the image inputs
        cache: cache batches in memory
        cores: number of parallel reads/calls
    Returns:
        dataset: a tf.data dataset yielding crops and labels when labels=True,
            crops and raster indices otherwise
    """
    AUTO = tf.data.experimental.AUTOTUNE

    inputs = []

    dataset = tf.data.TFRecordDataset(tfrecords, num_parallel_reads=cores)

    if shuffle:
        dataset = dataset.shuffle(10)

    if ids:
        ids_dataset = dataset.map(_box_index_parse_, num_parallel_calls=cores)

    if HSI:
        HSI_dataset = dataset.map(_HSI_parse_, num_parallel_calls=cores)
        if augmentation:
            HSI_dataset = HSI_dataset.map(augment, num_parallel_calls=cores)
        inputs.append(HSI_dataset)

    if RGB:
        RGB_dataset = dataset.map(_RGB_parse_, num_parallel_calls=cores)
        if augmentation:
            RGB_dataset = RGB_dataset.map(augment, num_parallel_calls=cores)
        inputs.append(RGB_dataset)

    if metadata:
        height_dataset = dataset.map(_height_parse_, num_parallel_calls=cores)
        inputs.append(height_dataset)

        elevation_dataset = dataset.map(_elevation_parse_, num_parallel_calls=cores)
        inputs.append(elevation_dataset)

        site_dataset = dataset.map(_site_parse_, num_parallel_calls=cores)
        inputs.append(site_dataset)

    if labels:
        labels_dataset = dataset.map(_label_parse_, num_parallel_calls=cores)

        if submodel:
            labels_dataset = tf.data.Dataset.zip((labels_dataset, labels_dataset, labels_dataset))

    if ids:
        if labels:
            zipped_dataset = tf.data.Dataset.zip((ids_dataset, tuple(inputs), labels_dataset))
        else:
            zipped_dataset = tf.data.Dataset.zip((ids_dataset, tuple(inputs)))
    else:
        if labels:
            zipped_dataset = tf.data.Dataset.zip((tuple(inputs), labels_dataset))
        else:
            zipped_dataset = tf.data.Dataset.zip(tuple(inputs))

    # shuffle, batch, cache, and prefetch
    if shuffle:
        zipped_dataset = zipped_dataset.shuffle(buffer_size=10)

    zipped_dataset = zipped_dataset.batch(batch_size=batch_size)
    if cache:
        zipped_dataset = zipped_dataset.cache()
    zipped_dataset = zipped_dataset.prefetch(buffer_size=AUTO)

    return zipped_dataset
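One thing worth noting about the function above: each conditional branch runs its own `dataset.map(...)` over the same records, so every record is traversed several times and then the branches are recombined with `zip`. A way to cut down on the number of tf.data ops is to parse every field in a single map call and select fields afterwards. Below is a minimal sketch; `_parse_all_` and its feature names are hypothetical stand-ins, not the actual parsers from this thread:

```python
import tensorflow as tf

def _parse_all_(serialized):
    # Hypothetical combined parser: one tf.io.parse_single_example covering
    # every feature (here just HSI, RGB, label) in a single pass per record.
    return tf.io.parse_single_example(
        serialized,
        {
            "HSI": tf.io.FixedLenFeature([4], tf.float32),
            "RGB": tf.io.FixedLenFeature([3], tf.float32),
            "label": tf.io.FixedLenFeature([1], tf.int64),
        },
    )

def tf_dataset_single_parse(tfrecords, batch_size=2, cores=4):
    """Parse every field once per record instead of zipping several branches."""
    dataset = tf.data.TFRecordDataset(tfrecords, num_parallel_reads=cores)
    dataset = dataset.map(_parse_all_, num_parallel_calls=cores)
    # Arrange the parsed fields into (inputs, label); no extra decode passes.
    dataset = dataset.map(
        lambda f: ((f["HSI"], f["RGB"]), f["label"]),
        num_parallel_calls=cores)
    dataset = dataset.shuffle(10).batch(batch_size)
    return dataset.prefetch(tf.data.experimental.AUTOTUNE)
```

Whether this actually shrinks the "Other"/tf.data slice would need to be confirmed in the profiler, but it removes the repeated record traversals and most of the zip machinery.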

GPU utilization

CPU utilization