yui-mhcp / data_processing

Data processing utilities in keras3

😋 Data processing utilities

This repository contains a set of utilities for audio, image and text data processing. It also provides generic functions for data visualization (see utils/plot_utils.py).

It is part of the main Deep Learning project: it is a submodule used by all other projects for data processing / monitoring.

Check the CHANGELOG file for an overview of the latest updates / new features! 😋

Project structure

├── example_data        : some example data / plots / ...
├── loggers             : special funny and useful loggers
│   ├── telegram_handler.py : logs messages via a Telegram bot
│   ├── time_logger.py      : utilities to log function performance
│   └── tts_handler.py      : logs messages via the TTS models
├── unitests            : unit test framework (experimental)
│   ├── __init__.py         : defines a custom TestCase with a more powerful assertEqual
│   └── test_*.py           : unit testing files
├── utils               : main utilities directory
│   ├── audio           : audio processing utilities
│   │   ├── audio_annotation.py     : annotates an audio by adding information to frames
│   │   ├── audio_augmentation.py   : augmentation utilities
│   │   ├── audio_io.py             : loading / writing audio utilities
│   │   ├── audio_processing.py     : some processing functions such as silence trimming
│   │   ├── audio_search.py         : utility class used in the Speech-To-Text project *
│   │   ├── mkv_utils.py            : processing functions for .mkv files
│   │   └── stft.py                 : multiple implementations of the STFT and MFCC in tensorflow
│   ├── distance        : distance utilities (clustering / comparison)
│   │   ├── clustering.py           : generic clustering functions + wrapper
│   │   ├── distance_method.py      : some distance functions
│   │   ├── k_means.py              : K-means implementation in tensorflow
│   │   ├── knn.py                  : K-Nearest Neighbors implementation in tensorflow
│   │   ├── label_propagation.py    : custom algorithm to cluster based on label propagation
│   │   └── spectral_clustering.py  : tensorflow implementation of the spectral clustering algorithm
│   ├── image           : image utilities
│   │   ├── box_utils           : utilities for Bounding Box manipulation
│   │   │   ├── bounding_box.py     : BoundingBox class used in the YOLO project
│   │   │   ├── box_filters.py      : experimental box filtering strategies for OCR streaming
│   │   │   ├── box_functions.py    : general functions to convert format, draw, crop, ...
│   │   │   ├── geo_utils.py        : general functions for geometrical operations on boxes
│   │   │   └── nms_methods.py      : implementations of the Non Maximum Suppression strategies
│   │   ├── custom_cameras.py       : custom objects usable in the stream_camera function
│   │   ├── image_augmentation.py   : some augmentation functions for images
│   │   ├── image_io.py             : image / video writing / loading functions
│   │   ├── image_normalization.py  : normalization schemes used by model architectures
│   │   ├── image_utils.py          : some utilities
│   │   ├── mask_utils.py           : utility functions for masking
│   │   └── video_utils.py          : video functions (copy / extract audio, ...)
│   ├── text            : functions for text processing (encoding / decoding)
│   │   ├── abreviations            : JSON files of abbreviations in multiple languages (experimental)
│   │   ├── document_parser         : functions to extract text from documents (experimental)
│   │   │   ├── docx_parser.py
│   │   │   ├── html_parser.py
│   │   │   ├── parser.py
│   │   │   ├── parser_utils.py
│   │   │   ├── pdf_parser.py
│   │   │   └── txt_parser.py
│   │   ├── bpe.py                  : Byte-Pair-Encoding (BPE) functions
│   │   ├── cleaners.py             : many functions to clean text
│   │   ├── cmudict.py
│   │   ├── ctc_decoder.py          : CTC decoding methods (greedy / beam-search)
│   │   ├── f1.py                   : F1-metric function (inspired by the SQuAD evaluation script)
│   │   ├── numbers.py              : functions to convert numbers to text
│   │   ├── sentencepiece_encoder.py    : TextEncoder subclass to support sentencepiece encoders
│   │   ├── text_augmentation.py    : some operations for text masking
│   │   ├── text_encoder.py         : main TextEncoder class
│   │   └── text_processing.py      : some text processing functions (such as splitting)
│   ├── thread_utils        : producer-consumer framework
│   │   ├── consumer.py         : main Consumer class that consumes a stream and produces another
│   │   ├── grouper.py          : special Consumer that groups multiple items
│   │   ├── pipeline.py         : special Consumer that executes multiple tasks with some features
│   │   ├── producer.py         : main Producer class that produces a stream based on a generator
│   │   ├── splitter.py         : special Consumer class that splits an item into multiple items
│   │   ├── threaded_dict.py    : special thread-safe dict-like class with blocking get
│   │   └── threaded_queue.py   : deprecated, prefer the Consumer class, which is more stable
│   ├── comparison_utils.py : utility functions to compare data
│   ├── embeddings.py       : utility functions to manipulate embeddings (save / load / select)
│   ├── file_utils.py       : loading / saving data in multiple file formats
│   ├── generic_utils.py    : generic utility functions
│   ├── pandas_utils.py     : utilities to manipulate pandas.DataFrame
│   ├── plot_utils.py       : plot functions that group multiple features from matplotlib.pyplot
│   ├── sequence_utils.py   : utilities for sequence manipulation (such as pad_batch)
│   ├── stream_utils.py     : utilities for function streaming
│   ├── tensorflow_utils.py : convenient tensorflow functions / features
│   └── wrapper_utils.py    : useful wrappers to make beautiful and well-documented code!
├── example_audio.ipynb
├── example_clustering.ipynb
├── example_generic.ipynb
├── example_image.ipynb
├── example_producer_consumer.ipynb
├── example_text.ipynb
└── example_unitest.ipynb

* This feature is provided in the Speech To Text project
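The thread_utils directory listed above is described as a producer-consumer framework. As a rough illustration of that pattern only, here is a minimal standard-library sketch. It is not the repository's Producer / Consumer implementation: the names `producer`, `consumer`, `run_pipeline` and `_SENTINEL` are hypothetical and chosen for this example.

```python
import queue
import threading

# Hypothetical end-of-stream marker (an assumption of this sketch,
# not something taken from the repository).
_SENTINEL = object()

def producer(generator, out_queue):
    """Push every item of `generator` to `out_queue`, then signal the end."""
    for item in generator:
        out_queue.put(item)
    out_queue.put(_SENTINEL)

def consumer(fn, in_queue, out_queue):
    """Apply `fn` to each incoming item and forward the result downstream."""
    while True:
        item = in_queue.get()
        if item is _SENTINEL:
            out_queue.put(_SENTINEL)  # propagate the end-of-stream signal
            break
        out_queue.put(fn(item))

def run_pipeline(generator, fn):
    """Run one producer and one consumer in threads; return results in order."""
    q_in, q_out = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=producer, args=(generator, q_in)),
        threading.Thread(target=consumer, args=(fn, q_in, q_out)),
    ]
    for t in threads:
        t.start()

    results = []
    while True:
        item = q_out.get()  # blocks until the consumer emits something
        if item is _SENTINEL:
            break
        results.append(item)

    for t in threads:
        t.join()
    return results

squares = run_pipeline(range(5), lambda x: x * x)
print(squares)  # [0, 1, 4, 9, 16]
```

With a single consumer thread the output order matches the input order; several consumers reading the same queue would not give that guarantee without extra bookkeeping.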

Available features

This is not an exhaustive list of the available features / functions, but an overview of the most useful ones. Check the example notebooks for more concrete examples.

  • Audio (module utils.audio) :
| Feature | Function / class | Description |
|---|---|---|
| display | display_audio | display audio in jupyter notebook |
| loading | load_audio | load audio from multiple formats such as wav, mp3, ... |
| writing | write_audio | save audio to wav or mp3 |
| noise reduction | reduce_noise | use the noisereduce library to reduce noise |
| silence trimming | trim_silence | trim silence (start / end) or remove it |
| STFT / MFCC | MelSTFT class | abstract class supporting multiple STFT implementations (compatible with different models), all supporting tensorflow 2.x graph |
  • Image (module utils.image) :
| Feature | Function / class | Description |
|---|---|---|
| display | display_image | display image in jupyter notebook |
| loading | load_image | load image from multiple formats in tensorflow 2.x |
| writing | write_image | wrapper around cv2.imwrite supporting multiple types |
| writing video | write_video | save a list of images as a video |
| camera streaming | stream_camera | apply a given function to each camera frame and show the result |
| image augmentation | augment_image | apply multiple transformations on an image |
| mask color | create_color_mask | create a mask on a given color (with threshold) |
| mask transform | apply_mask | apply a transformation on a part of the image |
| box manipulation | box_utils | many utilities to create, show and transform boxes |
  • Text (module utils.text) :
| Feature | Function / class | Description |
|---|---|---|
| cleaning | cleaners.py | multiple cleaners for text normalization |
| encoding / decoding | TextEncoder | class to facilitate text encoding / decoding / cleaning |
| splitting | split_text | multi-phase text splitting for a more consistent split |
| Byte-Pair Encoding (BPE) | bpe / TextEncoder | you can use the TextEncoder for token-based encoding and BPE-based encoding |
  • Distance (module utils.distance) :
| Feature | Function / class | Description |
|---|---|---|
| distance | distance | compute distances with multiple metrics (Euclidean, Manhattan, Levenshtein) |
| K-NN | KNN | fully optimized KNN implementation in tensorflow 2.x graph |
| KMeans | kmeans | implementation of the KMeans clustering algorithm in tensorflow (also supports a methodology to determine the best value for k) |
| Label Propagation | label_propagation | custom clustering algorithm to propagate labels based on a similarity_matrix (useful for the Siamese Networks) |
| Spectral Clustering | spectral_clustering | implementation of the spectral clustering algorithm in pure tensorflow |
  • Generic (module utils) :
| Feature | Function / class | Description |
|---|---|---|
| plot | plot | utility that combines multiple matplotlib functions in one bigger function |
| plot spectrogram | plot_spectrogram | utility that calls plot with some default parameters specific to spectrograms |
| plot embeddings | plot_embedding | utility that calls plot after creating a 2D representation of N-D vectors (the embeddings) |
| plot confusion matrix | plot_confusion_matrix | creates a heatmap representing the confusion matrix (based on labels / predictions) |
| load / dump data | {load / dump}_data | utilities for safe json file manipulation |
| DataFrame manipulation | pandas_utils.py | utilities to aggregate / filter DataFrame |
| embeddings manipulation | embeddings.py | utilities to load / save / ... embeddings data |
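The distance feature listed above mentions three metrics: Euclidean, Manhattan and Levenshtein. To make these concrete, here is a small pure-Python sketch of what each one computes. It only illustrates the metrics and makes no claim about how the repository's tensorflow implementation works or what its signature is; `compute_distance` and the `METRICS` table are hypothetical names.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence `s` into sequence `t`."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]

# Hypothetical dispatch table, for illustration only.
METRICS = {
    "euclidean": euclidean,
    "manhattan": manhattan,
    "levenshtein": levenshtein,
}

def compute_distance(a, b, method="euclidean"):
    """Dispatch to one of the metrics above by name."""
    return METRICS[method](a, b)

print(compute_distance((0, 0), (3, 4)))                        # 5.0
print(compute_distance((0, 0), (3, 4), method="manhattan"))    # 7
print(compute_distance("kitten", "sitting", "levenshtein"))    # 3
```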

Installation and usage

  1. Clone this repository: git clone https://github.com/yui-mhcp/data_processing.git
  2. Go to the root of this repository: cd data_processing
  3. Install requirements: pip install -r requirements.txt
  4. Open an example notebook and follow the instructions!

The utils/{data_type} modules are not loaded by default, so you do not need to install the requirements of a submodule you do not want to use. In this case, you can simply remove the submodule and run the pipreqs command to compute a new requirements.txt file!

For audio processing: ffmpeg is required for some audio processing functions (especially the .mp3 support).
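Since ffmpeg is an external program rather than a pip package, it can be useful to check that it is reachable before running audio code. A minimal sketch using only the standard library follows; `ffmpeg_available` is a hypothetical helper, not a function provided by this repository.

```python
import shutil

def ffmpeg_available():
    """Return True if an `ffmpeg` executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None

if not ffmpeg_available():
    print("ffmpeg not found on PATH: some audio functions (e.g. .mp3 support) may fail")
```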

Important note: some heavy requirements (e.g. torch and transformers) are removed from the requirements to avoid installing them unnecessarily, as they are only used in very specific functions. An ImportError may therefore occur when using such functions, for example TextEncoder.from_transformers_pretrained(...).
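One way to anticipate such an ImportError is to check whether an optional package is installed before calling a function that needs it. The sketch below uses only the standard library; `is_available` and `require` are hypothetical helpers written for this example and are not part of the repository.

```python
import importlib
import importlib.util

def is_available(module_name):
    """Return True if the top-level module `module_name` can be imported."""
    return importlib.util.find_spec(module_name) is not None

def require(module_name):
    """Import and return `module_name`, or raise an ImportError with a hint."""
    if not is_available(module_name):
        raise ImportError(
            f"`{module_name}` is needed for this feature but is not installed; "
            f"try `pip install {module_name}`"
        )
    return importlib.import_module(module_name)

# Example: only take the transformers-dependent code path when it is installed.
if is_available("transformers"):
    print("transformers is installed")
else:
    print("transformers is missing: functions that depend on it will raise ImportError")
```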

TO-DO list

  • Make example for audio processing
  • Make example for image processing
  • Make example for text processing
  • Make example for plot utils
  • Make example for embeddings
  • Make example for the producer-consumer utility
  • Make example for audio annotation
  • Comment all functions to facilitate their comprehension / usage

Future improvements

  • Audio processing :
    • Improve the audio annotation procedure
    • Allow extracting the audio from videos
    • Allow adding subtitles to a video
    • Enable audio playback without the IPython.display autoplay feature
  • Image processing :
    • Clean and optimize the code.
    • Add image loading / writing support.
    • Add video loading / writing support.
    • Improve the stream_camera to better synchronize the frames with audio (when play_audio = True)
    • Add support for segmentation masking
      • Add support for polygon masks
      • Add support for RLE masks
    • Add support for rotated bounding boxes
    • Add Locality-Aware NMS
  • Text processing :
    • Add support for token-splitting instead of word-splitting in the TextEncoder.
    • Add better support for Transformers.
    • Add support for sentencepiece encoders.
    • Add text extraction from documents (experimental)
      • Add support for .txt
      • Add support for .pdf
      • Add support for .docx
      • Add support for .html
      • Add support for .epub
  • Thread utilities :
    • Add the possibility to create a pipeline based on a list of functions.
    • Allow plotting a producer-consumer pipeline with graphviz.
  • Plot functions :
    • Better support subplots.
    • Allow plotting embeddings as a subplot.
    • Add support for audio plot (to plot the time (in sec) on the x-axis)
    • Add 3D plot support.

Contacts and licence

You can contact me at yui-mhcp@tutanota.com or on Discord at yui#0732.

The objective of these projects is to facilitate the development and deployment of useful applications using Deep Learning for solving real-world problems and helping people. For this purpose, all the code is under the Affero GPL (AGPL) v3 licence.

All my projects are "free software", meaning that you can use, modify, deploy and distribute them on a free basis, in compliance with the Licence. They are not in the public domain and are copyrighted. Some conditions apply to their distribution, but their objective is to make sure that everyone is able to use and share any modified version of these projects.

Furthermore, if you want to use any project in a closed-source project, or in a commercial project, you will need to obtain another Licence. Please contact me for more information.

For my protection, it is important to note that all projects are available on an "As Is" basis, without any warranties or conditions of any kind, either explicit or implied. However, do not hesitate to report issues on the project's repository or make a Pull Request to solve them 😄

If you use this project in your work, please add this citation to give it more visibility! 😋

    author  = {yui},
    title   = {A Deep Learning projects centralization},
    year    = {2021},
    publisher   = {GitHub},
    howpublished    = {\url{https://github.com/yui-mhcp}}

Notes and references

  • [1] The text cleaning part (text.cleaners) is heavily inspired by the NVIDIA tacotron2 repository, which I modified to be used in tensorflow. I have also optimized audio/stft.py in pure tensorflow and extended it to support other kinds of STFT computation (inspired by projects such as Jasper, DeepSpeech, ...).

  • [2] The embeddings provided in example_data/embeddings/embeddings_256_voxforge.csv are samples from the VoxForge dataset, embedded with my AudioSiamese audio_siamese_256_mel_lstm model.

