Daniel Servén's starred repositories
the-algorithm
Source code for Twitter's Recommendation Algorithm
super-gradients
Easily train or fine-tune SOTA computer vision models with one open-source training library. The home of YOLO-NAS.
deepchecks
Deepchecks: Tests for Continuous Validation of ML Models & Data. Deepchecks is a holistic open-source solution for all of your AI & ML validation needs, enabling you to thoroughly test your data and models from research to production.
segment-geospatial
A Python package for segmenting geospatial data with the Segment Anything Model (SAM)
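A minimal usage sketch of the package's high-level workflow, following the project's published examples; the SamGeo class, the generate() and tiff_to_vector() calls, and the checkpoint and file names should be treated as assumptions rather than a verified API reference:

    # Sketch: automatic mask generation on a GeoTIFF with samgeo.
    # Class, arguments, and method names follow the project's examples
    # but are assumptions, not verified here.
    from samgeo import SamGeo

    sam = SamGeo(
        model_type="vit_h",                # SAM backbone variant
        checkpoint="sam_vit_h_4b8939.pth", # fetched automatically if absent
    )

    # Segment everything in a georeferenced image, save the masks as a
    # raster, then vectorize them for use in GIS tools.
    sam.generate(source="satellite.tif", output="masks.tif")
    sam.tiff_to_vector("masks.tif", "masks.gpkg")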
stable-diffusion-tensorflow
Stable Diffusion in TensorFlow / Keras
deepscatter
Zoomable, animated scatterplots in the browser that scale to over a billion points
poetry-dynamic-versioning
Plugin for Poetry to enable dynamic versioning based on VCS tags
SpanMarkerNER
SpanMarker for Named Entity Recognition
gnome-shell-extension-alt-tab-scroll-workaround
Quick fix for the bug where scrolling in one application is repeated in another after switching between them with Alt+Tab (e.g., VS Code and Chrome)
social-media-tutorials
Code dumps from YouTube/Twitter tutorials
flash-genomics-model
My own attempt at a long-context genomics model, leveraging recent advances in long-context attention modeling (Flash Attention + other hierarchical methods)
sequence-learn
With sequence-learn, you can build models for named entity recognition as quickly as if you were building a scikit-learn classifier.
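To illustrate the kind of fit/predict workflow that claim alludes to, here is a self-contained toy token tagger written with plain scikit-learn; it is not sequence-learn's own API, just a sketch of the sklearn-style interface the description promises:

    # A crude token-level NER tagger built with plain scikit-learn, to
    # show the fit/predict workflow; not sequence-learn's actual API.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy training data: one feature dict per token, one BIO tag per token.
    tokens = [
        {"word": "Barack", "is_title": True},
        {"word": "Obama", "is_title": True},
        {"word": "visited", "is_title": False},
        {"word": "Paris", "is_title": True},
    ]
    tags = ["B-PER", "I-PER", "O", "B-LOC"]

    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(tokens, tags)  # same interface as any sklearn classifier
    print(model.predict([{"word": "Berlin", "is_title": True}]))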
E3C-Corpus
E3C is a freely available multilingual corpus (Italian, English, French, Spanish, and Basque) of semantically annotated clinical narratives for the linguistic analysis, benchmarking, and training of information extraction systems. It consists of two types of annotations: (i) clinical entities (pathologies, symptoms, procedures, body parts, etc.) according to standard clinical taxonomies (e.g., SNOMED-CT, ICD-10); and (ii) temporal information and factuality (events, time expressions, and temporal relations) according to the THYME standard. The corpus is organised into three layers with different purposes.
Layer 1: about 25K tokens per language with full manual annotation of clinical entities, temporal information, and factuality, for benchmarking and linguistic analysis.
Layer 2: 50-100K tokens per language with semi-automatic annotations of clinical entities, to be used to train baseline systems.
Layer 3: about 1M tokens per language of non-annotated medical documents, to be exploited by semi-supervised approaches.
Researchers can use the benchmark training and test splits of the corpus to develop and test their own models. We trained several deep-learning-based models and provide baselines on the benchmark. Both the corpus and the trained models will be available through the ELG platform.
bulk-labeling
A tool for quickly adding labels to unlabeled datasets
mlconfound
Tools for analyzing and quantifying effects of confounder variables on machine learning model predictions.