SalvatoreRa / ML-news-of-the-week

A collection of the best ML and AI news every week (research, news, resources).


ML & AI news of the week

A collection of the best ML & AI news every week (research, news, resources). Star this repository if you find it useful.

Here you can find articles and tutorials about artificial intelligence.

For each week you will find different sections:

  • Research: the most important published research of the week.
  • News: the most important news related to companies, institutions, and much more.
  • Resources: released resources for artificial intelligence and machine learning.
  • Perspectives: a collection of deep and informative articles about open questions in artificial intelligence.

and a meme to start the week off right.

Suggestions and corrections

Feel free to open an issue if you find errors or have suggestions, topics, or any other comments.

Index

2024

2023

Back to index

2024

ML news: Week 10 - 16 June

Research

Scaling neural machine translation to 200 languages. presents a massive multilingual model, based on a sparsely gated Mixture-of-Experts architecture and trained on data mined with a method designed for low-resource languages, that uses transfer learning across 200 languages; evaluated across 40K translation directions, it achieves an average 44% improvement in translation quality.
MatMul-free LLMs. suggests an implementation that removes matrix multiplication operations from LLMs while maintaining performance at billion-parameter scales; claims that memory consumption can be reduced by more than 10x using an optimized kernel during inference; the performance gap between full-precision Transformers and the MatMul-free models narrows as model size increases.
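
The core idea can be illustrated with ternary weights: when every weight is -1, 0, or +1, a linear layer needs no weight multiplications at all. A minimal PyTorch sketch of that idea (emulated with masks here for clarity; the paper relies on fused kernels and further architectural changes, so this is illustrative, not their implementation):

```python
import torch

def ternary_linear(x, w_ternary):
    """Sketch of a MatMul-free linear layer with weights in {-1, 0, +1}.

    With ternary weights, y = x @ W.T reduces to signed sums of selected
    inputs: no multiplications by learned weights are needed. A real kernel
    exploits this directly; here we emulate it with masks.
    """
    pos = (w_ternary == 1).to(x.dtype)
    neg = (w_ternary == -1).to(x.dtype)
    return x @ pos.T - x @ neg.T  # conceptually additions/subtractions only

x = torch.randn(2, 8)
w = torch.randint(-1, 2, (4, 8))   # ternary weight matrix
print(ternary_linear(x, w).shape)  # torch.Size([2, 4])
```
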
Buffer of Thoughts. presents a thought-augmented reasoning approach that improves the accuracy, efficiency, and robustness of LLM-based reasoning; it maintains a meta-buffer of high-level thought templates distilled from prior problem-solving processes, and for each new task the relevant template is retrieved and instantiated with a task-specific reasoning structure. Shows SOTA performance on 10 difficult tasks at 12% of the cost of multi-query prompting methods such as Tree-of-Thoughts.
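
In outline, the loop distills the task, retrieves the closest thought template, and instantiates it. A hypothetical sketch (the `embed`/`llm` callables, prompts, and buffer layout are illustrative assumptions, not the paper's API):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve_with_bot(task, meta_buffer, embed, llm):
    # meta_buffer: list of (description, template) pairs of high-level thoughts
    # 1. Distill the task into a high-level problem description.
    task_info = llm(f"Summarize the core problem in one sentence: {task}")
    # 2. Retrieve the most relevant thought template from the meta-buffer.
    _, template = max(meta_buffer,
                      key=lambda item: cosine(embed(item[0]), embed(task_info)))
    # 3. Instantiate it with a task-specific reasoning structure and solve.
    return llm(f"Reasoning template:\n{template}\n\nApply it to solve:\n{task}")
```
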
SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales. a training framework that teaches LLMs to express more accurate fine-grained confidence estimates along with self-reflective rationales; it first performs supervised finetuning on a dataset containing summaries of the differences between multiple reasoning chains, then applies reinforcement learning to calibrate the confidence estimates, encouraging accurate, high-confidence predictions and penalizing overconfidence in erroneous outputs.
The Geometry of Categorical and Hierarchical Concepts in Large Language Models. investigates the geometry of categorical concepts and how the hierarchical relations between them are encoded in LLMs; finds that simple categorical concepts are represented by the LLMs as simplices, while complex concepts are represented as polytopes built from direct sums of simplices, reflecting the hierarchical structure.
Show, Don't Tell: Aligning Language Models with Demonstrated Feedback. suggests a technique that uses a very small number of demonstrations as feedback to align LLMs to a particular setting; it outperforms few-shot prompting, SFT, and self-play methods on the tested benchmarks and aligns LLM outputs to a user's demonstrated behaviors. Additionally, it can learn fine-grained style and task alignment across domains.
Towards Scalable Automated Alignment of LLMs. gives a summary of the techniques used to align LLMs and examines four directions: 1) inductive bias alignment; 2) behavior imitation alignment; 3) model feedback alignment; and 4) environment feedback alignment.
AgentGym: Evolving Large Language Model-based Agents across Diverse Environments. a novel framework with multiple tasks and environments for wide-ranging, concurrent, real-time agent exploration; constructs a generally competent LLM-based agent with the ability to self-evolve and investigates how it generalizes to previously unseen tasks and environments.
Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment. A Synthetic-Domain Alignment (SDA) framework has been developed by researchers to improve test-time adaptation (TTA) techniques. By fine-tuning pretrained models with synthetic data produced by a conditional diffusion model, SDA efficiently aligns source and synthetic domains.
ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization. Reward-based Noise Optimization (ReNO) is a novel technique to improve Text-to-Image (T2I) models during inference by employing signals from reward models with human preferences to optimize the baseline noise.
YOLO-World: Real-Time Open-Vocabulary Object Detection. With YOLO-World, researchers have improved the widely used YOLO object detectors and included open-vocabulary detection. This method, which combines large-scale dataset training with vision-language modeling, enables it to swiftly and accurately detect a wide range of objects, even in situations for which it was not designed.
Improved Scene Landmark Detection for Camera Localization. Using distinctive scene landmarks, researchers have developed a novel, privacy-friendly technique for camera localization. This method, which does not rely on real 3D point clouds for localization, is very accurate and storage-efficient since it makes use of 3D scene landmarks and a CNN-based heatmap.
Proofread: Fixes All Errors with One Tap. The Gboard team has described how they correct sentence- and paragraph-level problems in written text on the device using SFT on a PaLM2-XS model. They discovered that latency optimizations led to significant gains in utilization.
BitsFusion: 1.99 bits Weight Quantization of Diffusion Model. Using a new quantization approach, the Snap Research team was able to increase speed while reducing the size of the Stable Diffusion UNet model from 1.72 GB to 219 MB. Although the quantization technique is a little complicated, it shows great promise for generative model execution on consumer hardware.
Introducing Apple’s On-Device and Server Foundation Models. During WWDC 2024, Apple debuted "Apple Intelligence". Apple Intelligence is an AI system that is built into macOS Sequoia, iOS 18, and iPadOS 18. It has sophisticated generative models for a variety of commonplace activities, like text refinement, picture generation, and notification summary. With an emphasis on user privacy and responsible AI development, this system integrates cloud and on-device capabilities to improve the user experience across all Apple products.
OVMR: Open-Vocabulary Recognition with Multi-Modal References. OVMR is a novel approach that combines textual descriptions with sample photos to improve open-vocabulary recognition.
Predictive Dynamic Fusion. The Predictive Dynamic Fusion (PDF) architecture solves stability and reliability problems to improve multimodal learning.
Compute Better Spent: Replacing Dense Layers with Structured Matrices. Linear layers are where most Transformer computation happens. This approach creates structured representations with better scaling laws than naive dense layers, building on tools such as muP and Monarch matrices to spend compute more effectively.
CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models. A thorough methodology called CARES is used to assess the reliability of Medical Large Vision Language Models (Med-LVLMs).
Learning to Route Among Specialized Experts for Zero-Shot Generalization. PHATGOOSE is an approach that dramatically increases an AI's capacity to generalize and learn new tasks without prior exposure by efficiently routing between different specialized language models for each portion of a task.
Diabetic Retinopathy Detection. A unique framework that enhances the grading of diabetic retinopathy (DR), a condition that can result in visual impairment, has been developed by researchers.
BERTs are Generative In-Context Learners. This paper investigates whether BERT models, rather than their decoder-only GPT counterparts, can serve as in-context learners. It finds that BERTs perform remarkably well at information retrieval but poorly at knowledge acquisition, most likely as a result of their bidirectional attention mechanism.
TextGrad: Automatic "Differentiation" via Text. investigates treating a language model that is capable of updating text as a backpropagation system. The researchers report significant benchmark gains, though the comparisons are not compute-matched against baseline models.
Improve Mathematical Reasoning in Language Models by Automated Process Supervision. DeepMind found a way to scale process supervision, which is normally labor-intensive because it requires human intervention. With strong base models, it was able to automate a significant portion of the procedure, yielding strong mathematical reasoning performance in tuned Gemini Pro models.
Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation. For image generation, LlamaGen is an autoregressive model that scales better than diffusion alternatives. By training a class-conditioned model on ImageNet, its researchers set a new bar for FID.
When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models. Researchers have explored combining speculative decoding with linear attention techniques to address efficiency concerns in autoregressive large language models. The work presents an augmentation strategy for linear attention that is compatible with speculative decoding, improving training and performance.
What If We Recaption Billions of Web Images with LLaMA-3? Using a vision model to recaption web-scraped images significantly enhances downstream model performance. This is particularly true for models like CLIP.
Hearing Anything Anywhere. This research presents DiffRIR, a new framework that uses a planar scene reconstruction with a limited number of room impulse response (RIR) recordings to recreate the spatial acoustic properties of environments.
Simple and Effective Masked Diffusion Language Models. By using an efficient training recipe and incorporating a simpler Rao-Blackwellized objective, researchers have shown that masked discrete diffusion models can compete with autoregressive approaches in language modeling.

News

First NHS physiotherapy clinic run by AI to start this year. New platform to provide same-day appointments with digital physiotherapist in effort to cut waiting times
Apple to launch iOS 18 AI features marketed as ‘Apple Intelligence’. Bloomberg’s Mark Gurman today reports that Apple will launch its upcoming AI initiatives in iOS 18 and other operating systems under the brand name ‘Apple Intelligence’, which is obviously a convenient twist on the ‘AI’ acronym.
Claude’s Character. Claude is not simply your average, sycophantic AI that nods in agreement with the user. A character variant of Constitutional AI was specifically used to create Claude's personality and character. This essay goes into great detail on how post-training is used to control the kind of output Claude typically produces in order to portray this desired character.
Databricks + Tabular. With the acquisition of Tabular, Databricks has brought together major players from Apache Iceberg and Delta Lake to concentrate on data format interoperability for its lakehouse architecture. With Delta Lake UniForm's compatibility solution at the forefront, the objective is to establish a single, open standard for data interoperability in order to prevent data silos.
How the voices for ChatGPT were chosen. OpenAI worked with industry-leading casting and directing professionals to narrow down over 400 submissions before selecting the 5 voices.
OpenAI and Apple announce partnership to integrate ChatGPT into Apple experiences. Apple is integrating ChatGPT into experiences within iOS, iPadOS, and macOS, allowing users to access ChatGPT’s capabilities—including image and document understanding—without needing to jump between tools.
Apple Intelligence: every new AI feature coming to the iPhone and Mac. Apple announced “Apple Intelligence” at WWDC 2024, its name for a new suite of AI features for the iPhone, Mac, and more. Starting later this year, Apple is rolling out what it says is a more conversational Siri, custom AI-generated “Genmoji,” and GPT-4o access that lets Siri turn to OpenAI’s chatbot when it can’t handle a request.
Asana says its new AI teammates are ready to manage your projects. With the goal of enhancing productivity and output quality, Asana has introduced "AI teammates" to take care of duties like proactive project detail organization and request triaging. This feature is integrated into the workflow and functions like a human team member while still being supervised by humans. It was showcased at Asana's Work Innovation Summit.
Apple stock reaches record high after the announcement of new AI features. Tech giant’s shares climb 7% a day after reveal of artificial intelligence features meant to increase appeal of the iPhone
Elon Musk abruptly withdraws lawsuit against Sam Altman and OpenAI. Tesla CEO had accused the company of abandoning mission of creating artificial intelligence for the greater good of humanity
Mistral raises €600m series B. Mistral announced €600M in Series B funding on its first anniversary.
Mozilla Builders. The first Mozilla Builders Accelerator is embracing local AI, which enhances accessibility and privacy by bringing AI models and applications directly onto personal devices. Key areas of advancement include developer productivity tools, locally based AI agents, dynamic user interfaces, fine-tuning adaptation, retrieval-augmented generation, and enhanced function calling. The initiative's goal is for participants to create an open-source, decentralized AI ecosystem with a focus on user empowerment.
CaseMark Raises $1.7M to Empower Attorneys with AI. In order to increase the scope of its AI solutions for the legal sector, Gradient Ventures led the pre-seed investment in CaseMark, an AI firm that is transforming legal operations.
OpenAI ex-employees worry about company’s control over their millions of dollars in shares. With OpenAI’s valuation soaring and an IPO nowhere in sight, the company is giving employees the chance to sell some equity in secondary transactions. Ex-employees sitting on millions of dollars worth of stock worry about OpenAI’s ability to force them to give up their shares, according to sources and internal messages. OpenAI recently circulated a document indicating that ex-employees who work at competitors are not included in the tender offers.
Announcing the Open Release of Stable Diffusion 3 Medium. Stable Diffusion 3 Medium is Stability AI’s most advanced text-to-image open model yet. The small size of this model makes it perfect for running on consumer PCs and laptops as well as enterprise-tier GPUs.
Shutterstock ImageAI, Powered by Databricks. Databricks and Shutterstock announced a text-to-image generative AI model optimized for enterprise use.
OpenAI Annualized Revenue Doubles. OpenAI has more than doubled its annualized revenue to hit $3.4B.
Perplexity was planning revenue-sharing deals with publishers when it came under media fire. Perplexity, the AI search startup that recently came under fire from Forbes for allegedly misusing its content, was already working on revenue-sharing deals with high-quality publishers.
Microsoft’s Nadella Is Building an AI Empire. OpenAI Was Just the First Step. After landing the deal that launched his company to the front of the artificial intelligence race, the tech chief is spreading his bets. Will it be enough?
OpenAI adds former NSA chief to its board. OpenAI said on Thursday that it is adding former NSA head and retired Gen. Paul Nakasone to its board of directors as well as its newly formed Safety and Security Committee. Why it matters: OpenAI is looking to convince skeptics that it is taking sufficient steps to ensure its models are safe as it works toward its goal of superintelligence.
Apple Made Once-Unlikely Deal With Sam Altman to Catch Up in AI. An OpenAI agreement is due to be announced at Apple’s developer conference next week.
LLM-Squared. Sakana AI used an evolutionary approach to find a preference optimization scheme that works better than DPO: a language model proposes candidate loss code, models are trained on the proposals, and after about 100 generations several suggested variants achieve very high performance.
Gemini 1.5 Pro and 1.5 Flash GA, 1.5 Flash tuning support, higher rate limits, and more API updates. Google AI has released updates to the Gemini API and Google AI Studio, including support for model tuning, the stable release of Gemini 1.5, increased API rate limits, additional JSON schema features, and mobile compatibility. The changes give developers more options for building efficient, customized applications at scale.
AI generated sound effects are here. A new AI audio model from ElevenLabs can generate a variety of voices, tunes, and sound effects from text cues. A partnership with Shutterstock's audio library helps media professionals create better content by enabling quick, scalable production of high-quality audio. ElevenLabs' platform makes it simple for users to create sounds, streamlining the audio design process.
OpenAI welcomes Sarah Friar (CFO) and Kevin Weil (CPO). With the appointment of Kevin Weil as CPO and Sarah Friar as CFO, OpenAI has strengthened its leadership team to further its goal of developing AI products and doing research that is useful to developers, businesses, and consumers.
Why the pope has the ears of G7 leaders on the ethics of AI. Pope Francis is leaning on the thinking of Paolo Benanti, a friar adept at explaining how technology can change the world.
AI used to predict potential new antibiotics in a groundbreaking study. Scientists used an algorithm to mine ‘the entirety of the microbial diversity’ on Earth, speeding up antibiotic resistance research

Resources

Spreadsheet Is All You Need. Complete GPT-2 style transformer model with all weights, parameters, and connections included in a spreadsheet. It is a tiny model that runs entirely within the rows and columns of a spreadsheet and is based on NanoGPT.
Inspectus. Inspectus is a versatile visualization tool for large language models. It runs smoothly in Jupyter notebooks via an easy-to-use Python API. Inspectus provides multiple views, offering diverse insights into language model behaviors.
SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model. SpatialRGPT is a powerful vision-language model adept at understanding both 2D and 3D spatial arrangements. It can process any region proposal, such as boxes or masks, and provide answers to complex spatial reasoning questions.
Thread. Thread is a Jupyter Notebook that combines the experience of OpenAI's code interpreter with the familiar development environment of a Python notebook. With Thread, you can use natural language to generate cells, edit code, ask questions or fix errors all while being able to edit or re-run code as you would in a regular Jupyter Notebook.
How AI Image Models Work. Since 2022, AI image generation has advanced far beyond producing images from text descriptions. Using a children's-game analogy, this article explains how these models refine chaotic inputs into precise, detailed visuals, illustrating the rapid progress and promise of AI in visual creation.
Active Stereo Without Pattern Projector. Without the need for a hardware pattern projector, researchers have presented a new framework that incorporates active stereo concepts into passive cameras that are commonly used.
GLM-4-9B-Chat. Excellent model with support for 26 languages, trained on 10T tokens by the Tsinghua KEG group.
DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data. DIRECT-3D is a new text-to-3D generative model that directly generates 3D contents in a single forward pass without optimization.
Together MoA. Together has presented Mixture of Agents (MoA), a technique that combines many LLMs for optimal performance, outperforming GPT-4o with an AlpacaEval 2.0 score of 65.1%. MoA employs a tiered architecture in which aggregators in later layers refine the initial answers from different models, improving output quality through cooperation. Even with improved accuracy, MoA still struggles with latency; reducing it and improving the model design are potential future directions.
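
A minimal sketch of the tiered pattern described above, with a single aggregation layer (the callables stand in for LLM API wrappers, and the prompt wording is illustrative):

```python
def mixture_of_agents(question, proposers, aggregator):
    # Tier 1: several different LLMs each draft a candidate answer.
    drafts = [propose(question) for propose in proposers]
    numbered = "\n\n".join(f"[{i + 1}] {d}" for i, d in enumerate(drafts))
    # Tier 2: an aggregator model synthesizes the candidates into one answer.
    return aggregator(
        "You are given candidate answers from several assistants. "
        "Critically combine their strengths into a single, higher-quality "
        f"answer.\n\nQuestion: {question}\n\nCandidates:\n{numbered}"
    )
```
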
Mistral.rs. Mistral.rs is a fast, Rust-based LLM inference platform supporting inference on a variety of devices, quantization, and easy application development via an OpenAI-compatible HTTP server and Python bindings.
Generalizable Human Gaussians from Single-View Image. The Human Gaussian Model (HGM) is a diffusion-guided framework for building 3D human models from a single image.
Lighting Every Darkness with 3DGS: Fast Training and Real-Time Rendering for HDR View Synthesis. The LE3D approach achieves real-time HDR view synthesis from RAW images and works especially well for nighttime scenes.
TORAX. Google DeepMind's differentiable fusion tokamak simulator, written in Python and JAX, is now publicly available. The simulator solves the governing transport PDEs and has good auto-diff capabilities.
AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising. A novel acceleration approach called AsyncDiff makes it possible to perform parallel processing in diffusion models. By splitting the noise prediction model into several parts and executing them on different devices, it drastically cuts latency without sacrificing quality.
PowerInfer-2: Fast Large Language Model Inference on a Smartphone. Fast on-phone inference for a specially sparsified Mixtral 47B MoE model.
The AXLearn Library for Deep Learning. AXLearn is a library built on top of JAX and XLA to support the development of large-scale deep-learning models.
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling. Samba is a simple yet powerful hybrid model with an unlimited context length. Its architecture is frustratingly simple: Samba = Mamba + MLP + Sliding Window Attention + MLP stacking at the layer level.
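
A sketch of that layer pattern (residual placement and MLP shape are assumptions here; `mamba` and `swa` stand in for real implementations, e.g. the mamba-ssm package and a local-attention module):

```python
import torch.nn as nn

def mlp(d_model):
    return nn.Sequential(nn.LayerNorm(d_model),
                         nn.Linear(d_model, 4 * d_model),
                         nn.GELU(),
                         nn.Linear(4 * d_model, d_model))

class SambaBlock(nn.Module):
    """Illustrative 'Mamba + MLP + sliding-window attention + MLP' stacking.

    The Mamba layer handles long-range recurrent mixing; the sliding-window
    attention layer handles precise local retrieval.
    """
    def __init__(self, d_model, mamba, swa):
        super().__init__()
        self.layers = nn.ModuleList([mamba, mlp(d_model), swa, mlp(d_model)])

    def forward(self, x):
        for layer in self.layers:
            x = x + layer(x)  # residual connection around each sublayer
        return x
```
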
DiffusionKit. Framework and tooling for running diffusion models on Apple's MLX framework.
Splash Attention. A new DeepMind kernel in JAX for sparse Flash Attention.
Hugging Face acquires Argilla. Argilla, a company specializing in data for preference optimization, has been acquired.

Perspectives

Building AI products. Though they can't give exact answers to questions, large language models (LLMs) like ChatGPT are excellent at producing responses that seem correct. In order to improve user experience and enhance functionality while reducing errors, AI in the future will integrate LLMs into specialized tools or embed them into already-existing applications. This will contextualize AI outputs within controllable, specified areas.
Why passwords still matter in the age of AI. As Apple’s new Passwords app tries to solve our identity crisis, why are we still proving who we are via strings of random characters?
Examining LLM performance on public benchmarks. How overfit are popular LLMs on public benchmarks? According to new research from Scale AI's SEAL lab, Mistral and Phi overfit benchmarks, while GPT, Claude, Gemini, and Llama do not. The researchers created a new eval, GSM1k, to assess public LLMs for overfitting on GSM8k.
How to track the economic impact of public investments in AI. National statistics systems should recognize the researchers whose ideas drive artificial intelligence applications, not just machines and factory outputs.
Maintaining Large-Scale AI Capacity At Meta. To meet AI demands, Meta is modernizing its data centers throughout the world. For AI training tasks, it intends to scale to 600,000 GPUs. In order to assure minimal disruptions and constant performance while enabling quick infrastructure scalability, this calls for creative maintenance tactics and tools like OpsPlanner.

meme-of-the-week

Back to index

ML news: Week 3 - 9 June

Research

Contextual Position Encoding: Learning to Count What's Important. suggests a new position encoding method, CoPE, that conditions positions on context by incrementing position only on certain tokens; the resulting general method can attend to the i-th particular word, noun, or sentence, is context-dependent, and can represent different levels of position abstraction; it improves perplexity on language modeling and coding tasks.
Faithful Logical Reasoning via Symbolic Chain-of-Thought. suggests a way to enhance LLMs' capacity for logical reasoning by combining logical rules and symbolic expressions with chain-of-thought (CoT) prompting; this prompting method, known as Symbolic Chain-of-Thought, is a fully LLM-based framework that consists of the following steps: 1) converts the natural-language context to a symbolic format, 2) creates a step-by-step solution plan based on symbolic logical rules, and 3) employs a verifier to validate the translation and the reasoning chain.
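
A hypothetical sketch of the three-step pipeline (the prompts are illustrative; the paper's actual prompting is more elaborate):

```python
def symbolic_cot(llm, context, question):
    """Translate -> plan -> verify, with `llm` any callable prompt -> text."""
    # 1) Translate the natural-language context into symbolic logical form.
    symbolic = llm(f"Translate these statements into first-order logic:\n{context}")
    # 2) Derive a step-by-step solution plan using symbolic logical rules.
    plan = llm(f"Premises:\n{symbolic}\n"
               f"Using logical inference rules, derive step by step: {question}")
    # 3) Verify both the translation and the reasoning chain.
    return llm("Check each translation and inference step, then state the "
               f"final answer.\nPremises:\n{symbolic}\nDerivation:\n{plan}")
```
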
Transformers Can Do Arithmetic with the Right Embeddings. addresses transformers' inability to track the exact position of digits by adding an embedding to each digit that encodes its position relative to the start of the number; achieves 99% accuracy on 100-digit addition problems by training on only 20-digit numbers with a single GPU; the gains also transfer to multi-step reasoning tasks that include sorting and multiplication.
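
The embedding trick can be sketched as follows: each digit token gets an extra learned embedding indexed by its offset from the start of its number, so digits of equal significance line up (names and details here are illustrative assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class DigitPositionEmbedding(nn.Module):
    def __init__(self, d_model, max_digits=100):
        super().__init__()
        self.emb = nn.Embedding(max_digits, d_model)

    def forward(self, token_emb, is_digit):
        # token_emb: (batch, seq, d_model); is_digit: bool (batch, seq)
        offsets = torch.zeros(is_digit.shape, dtype=torch.long)
        for t in range(1, is_digit.size(1)):
            # The offset increments inside a run of digits and resets
            # otherwise, encoding each digit's position within its number.
            inside_run = is_digit[:, t] & is_digit[:, t - 1]
            offsets[:, t] = torch.where(inside_run, offsets[:, t - 1] + 1,
                                        torch.zeros_like(offsets[:, t]))
        digit_emb = self.emb(offsets) * is_digit.unsqueeze(-1).to(token_emb.dtype)
        return token_emb + digit_emb
```
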
GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning. blends the reasoning powers of GNNs with the language understanding skills of LLMs in a RAG fashion; the GNN extracts relevant and useful graph information, and the LLM uses the information to answer questions over knowledge graphs (KGQA); GNN-RAG outperforms or matches GPT-4 performance with a 7B tuned LLM, and improves vanilla LLMs on KGQA.
Attention as an RNN. presents a new attention mechanism that can be trained in parallel (like Transformers) and updated with new tokens using constant memory at inference (like RNNs); it is based on the parallel prefix scan algorithm, which enables efficient computation of attention's many-to-many RNN output; achieves performance comparable to Transformers on 38 datasets while being more time- and memory-efficient.
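
The constant-memory recurrence is essentially online softmax: per query, keep a running maximum, numerator, and denominator, updating in O(1) memory as each token arrives. A NumPy sketch for a single query (illustrative; the paper parallelizes this with a prefix scan):

```python
import numpy as np

def attention_as_rnn(q, keys, values):
    m, num, den = -np.inf, 0.0, 0.0   # running max, numerator, denominator
    for k, v in zip(keys, values):
        s = q @ k                      # attention score for the new token
        m_new = max(m, s)
        scale = np.exp(m - m_new)      # rescale old state for stability
        num = num * scale + np.exp(s - m_new) * v
        den = den * scale + np.exp(s - m_new)
        m = m_new
    return num / den                   # equals softmax(q @ K.T) @ V
```
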
Are Long-LLMs A Necessity For Long-Context Tasks? suggests a reasoning framework to allow short-LLMs to handle long-context tasks by adaptively accessing and utilizing the context based on the tasks presented; it breaks down the long context into short contexts and processes them using a decision-making process. The argument claims that long LLMs are not necessary to solve long-context tasks.
Sparse maximal update parameterization: A holistic approach to sparse training dynamics. All frontier model labs use muP, a potent tool, to transfer hyperparameters fine-tuned on tiny models to bigger, more costly training runs. This study investigates how to achieve that for sparse models, resulting in significantly better training results and lower computation expenses.
Exploring Color Invariance through Image-Level Ensemble Learning. To address color bias in computer vision, researchers have created a novel learning technique called Random Color Erasing. By selectively excluding color information from training data, this technique strikes a balance between the significance of color and other parameters, producing models that perform better in challenging situations like industrial and wide-area surveillance.
Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models. Conifer enhances LLMs' comprehension of intricate instructions by utilizing a progressive learning methodology and a customized dataset.
LLM Merging Competition: Building LLMs Efficiently through Merging. Sakana AI is sponsoring the LLM Merging challenge at NeurIPS this year.
Tribeca to Screen AI-Generated Short Films Created by OpenAI’s Sora. Short films generated by artificial intelligence are popping up at more and more film festivals, and the largest event yet is dedicating an entire section to AI-generated movies.
Adapting Large Multimodal Models to Distribution Shifts: The Role of In-Context Learning. A technique called InvariantSelectPR is intended to make Large Multimodal Models (LMMs) more adaptive in domain-specific fields such as healthcare.
TAIA: Large Language Models are Out-of-Distribution Data Learners. A technique called TrainAllInfAttn improves the performance of large language models in specialized domains with scarce data.
MegActor: Harness the Power of Raw Video for Vivid Portrait Animation. A new model called MegActor uses unprocessed driving videos to create more lifelike portrait animation. It addresses identity leakage and background interference, producing remarkable results with a unique data creation framework and background encoding techniques.
MeshXL: Neural Coordinate Field for Generative 3D Foundation Models. MeshXL is a new model that generates high-quality 3D meshes.
Position-Guided Prompt Learning for Anomaly Detection in Chest X-Rays. Position-guided Prompt learning method for Anomaly Detection in Chest X-rays (PPAD). PPAD leverages learnable text prompts and image prompts to minimize the gap between pre-training data and task-specific data. Through position-guided prompts, the model can focus on various regions, simulating the diagnostic process of experts.
Tree Diffusion: Diffusion Models For Code. Wonderful diffusion paper that diffuses over the code that draws an image, so edits can be made directly as part of the diffusion process. Although it is slow, it can easily be combined with search to significantly increase reasoning capacity.
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models. Expanding upon the Greedy Coordinate Gradient (GCG) approach, researchers have enhanced methods for optimization-based jailbreaking of large language models.
ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation. A training-free video interpolation technique for generative video diffusion models has been developed by researchers. This novel method improves frame rates without requiring a lot of training or big datasets and works with different models.
A whole-slide foundation model for digital pathology from real-world data. Prov-GigaPath, a whole-slide pathology foundation model pre-trained on 1.3 billion 256 × 256 pathology image tiles in 171,189 whole slides. To pretrain Prov-GigaPath, we propose GigaPath, a novel vision transformer architecture for pretraining gigapixel pathology slides. We further demonstrate the potential of Prov-GigaPath on vision–language pretraining for pathology by incorporating the pathology reports. In sum, Prov-GigaPath is an open-weight foundation model that achieves state-of-the-art performance on various digital pathology tasks, demonstrating the importance of real-world data and whole-slide modeling.
DreamMat: High-quality PBR Material Generation with Geometry- and Light-aware Diffusion Models. DreamMat is a clever way to enhance 3D object texture generation. Given a 3D model, it produces classic PBR material maps, including metallic, roughness, and albedo, to generate a very appealing result.
LlamaCare: A Large Medical Language Model for Enhancing Healthcare Knowledge Sharing. To solve classification problems in large language models (LLMs), researchers have developed LlamaCare, a refined LLM for medical information, in conjunction with Extended Classification Integration (ECI).
XRec: Large Language Models for Explainable Recommendation. XRec is a model-agnostic framework that improves explainable recommender systems by utilizing the language capabilities of large language models.
MetaMixer Is All You Need. Using simply convolutions, researchers have created a novel method called FFNification that preserves the query-key-value structure while converting self-attention processes into more effective token mixers.
GrootVL: Tree Topology is All You Need in State Space Model. By dynamically constructing a tree topology based on spatial correlations and input information, GrootVL is a network that enhances state space models.
ProGEO: Generating Prompts through Image-Text Contrastive Learning for Visual Geo-localization. Researchers have created a new two-stage training process to improve Visual Geo-localization (VG) and boost its performance in applications such as SLAM, augmented reality, and autonomous driving.
ReLUs Are Sufficient for Learning Implicit Neural Representations. A re-examination of using ReLU activation functions for learning implicit neural representations (INRs). Inspired by second-order B-spline wavelets, the authors counter spectral bias by introducing simple constraints on ReLU neurons.

News

OpenAI Is Restarting Its Robotics Research Group. The San Francisco-based company has been a pioneer in generative artificial intelligence and is returning to robotics after a three-year break.
AI Overviews: About last week. In order to improve search results and give users more precise and pertinent information, particularly for complex inquiries, Google created AI Overviews. While there were certain problems, such as incorrect results and misread content, Google has fixed these difficulties with over a dozen technical updates, like improving the identification of absurd questions and reducing the amount of user-generated content in AI Overviews.
Nvidia said to be prepping AI PC chip with Arm and Blackwell cores. Competition could be heating up in the Windows on Arm space amid talk in the industry that Nvidia is readying a chip pairing next-gen Arm cores with its Blackwell GPU architecture.
Ex-OpenAI board member reveals what led to Sam Altman's brief ousting. In a recent interview, former OpenAI board member Helen Toner provided fresh information into the circumstances surrounding CEO Sam Altman's November dismissal. It appears that the board was informed via Twitter about the release of ChatGPT. According to Toner, Altman had repeatedly lied to the board. It has been alleged that Altman had been lying about events within the organization for years and hiding facts. The board found it difficult to make decisions as a result of his lies, and they concluded that he wasn't the best person to take the firm to AGI.
AI hardware firm Nvidia unveils next-gen products at Taiwan tech expo. CEO Jensen Huang tells packed stadium in Taipei ‘next Industrial Revolution has begun’
AMD unveils new AI chips to compete with Nvidia. AMD has been vying to compete against Nvidia, which currently dominates the lucrative market for AI semiconductors and commands about 80% of its share.
Anthropic’s Claude 3 Opus and tool use are generally available on Vertex AI. Google Cloud now offers Claude 3 Opus with tool use along with the smaller models as part of its Vertex AI offering.
State Space Duality (Mamba-2). Mamba is an effective state space model. Its team has released a second version along with a lengthy, comprehensive explanation of the model and its enhancements.
No physics? No problem. AI weather forecasting is already making huge strides. AI models like WindBorne's WeatherMesh, which leverages the extensive ERA5 dataset to outperform conventional models while using far less processing power, are transforming the weather forecasting industry.
Amazon’s Project PI AI looks for product defects before they ship. Project PI combines computer vision and generative AI to catch damaged items and prevent returns.
The Opaque Investment Empire Making OpenAI’s Sam Altman Rich. One of Silicon Valley's most active and successful individual investors is Sam Altman. At the beginning of this year, his stakes in his investment empire were valued at least $2.8 billion. A large portion of the portfolio is unknown. Readers are guided through Altman's investment knowledge in this article.
Even the Raspberry Pi is getting in on AI. Raspberry Pi partnered with Hailo to provide an optional AI add-on to its microcomputers.
Using AI to decode dog vocalizations. Leveraging a human speech model to identify different types of barks. University of Michigan researchers are exploring the possibilities of AI, developing tools that can identify whether a dog’s bark conveys playfulness or aggression.
The future is … sending AI avatars to meetings for us, says Zoom boss. Eric Yuan suggests technology is five or six years away and will free up time to spend with family
AI researchers build ‘future self’ chatbot to inspire wise life choices. Scientists at MIT hope that talking to one's 60-year-old self will shift thinking on health, money and work.
Cartwheel generates 3D animations from scratch to power up creators. Animating a 3D character from scratch is generally both laborious and expensive, requiring the use of complex software and motion capture tools.
Mistral launches fine-tuning API. Mistral has launched customization for its models via its platform and API.
If you aren't seeing AI Overviews in your search results, it's probably thanks to Google. After receiving heavy criticism since their mid-May public launch, AI Overviews in Google Search have dropped in visibility across search results. Since I/O, the average percentage of queries where AI Overviews appear has dropped from 27 percent to just 11 percent. Despite the reduction, healthcare-related queries are a large percentage of AI results, raising concerns about both accuracy and reliability across Google.
Google optimizes shipping routes. Google's operations research group improved the mathematical optimization of cargo shipping routes, finding a 13% drop in fuel costs and consumption.
BrightEdge Releases Post Google I/O Data on The Impact of AI Overviews. The main businesses affected by AI Overviews, what generates results, and where Google automatically anticipates and responds to search inquiries are all revealed by new research from BrightEdge Generative Parser.
Nvidia emails: Elon Musk diverting Tesla GPUs to his other companies. The Tesla CEO is accused of diverting resources from the company again. Elon Musk is yet again being accused of diverting Tesla resources to his other companies. This time, it's high-end H100 GPU clusters from Nvidia.
Securing Research Infrastructure for Advanced AI. In its description of the security architecture of its AI training supercomputers, OpenAI highlights the use of Azure-based infrastructure and Kubernetes for orchestration to safeguard critical model weights and other assets.
Extracting Concepts from GPT-4. The team at OpenAI has discovered 16 million interpretable features in GPT-4 including price increases, algebraic rings, and who/what correspondence. This is a great step forward for SAE interpretability at scale. They shared the code in a companion GitHub repository.
Mesop: Gradio Competition. A rival to the well-liked AI prototyping framework Gradio has been made available by Google. Gradio is more mature than Mesop, which is pure Python and slightly more composable.
Nvidia is now more valuable than Apple at $3.01 trillion. The AI boom has pushed Nvidia’s market cap high enough to make it the second most valuable company in the world.

Resources

An Introduction to Vision-Language Modeling. we present this introduction to VLMs which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them.
Aya 23: Open Weight Releases to Further Multilingual Progress. a family of multilingual language models supporting up to 23 languages; it demonstrates that, by purposefully concentrating on fewer languages and allocating greater capacity to them, it can outperform other massively multilingual models on those languages.
Financial Statement Analysis with Large Language Models. claims that by analyzing trends and financial ratios, LLMs can produce insightful analyses; demonstrates that GPT-4 outperforms more specialized models; and develops a profitable trading strategy based on GPT's predictions.
SimPO: Simple Preference Optimization with a Reference-Free Reward. SimPO is a simpler, more efficient method for preference optimization: it uses the average log probability of a sequence as an implicit reward (i.e., no reference model is required), which makes it more compute- and memory-efficient. It outperforms other methods like DPO and is claimed to produce the strongest 8B open-source model.
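
A sketch of the length-normalized objective described above (the hyperparameter defaults are illustrative; `logp_*` are summed token log-probabilities under the policy for the chosen/rejected responses):

```python
import torch
import torch.nn.functional as F

def simpo_loss(logp_chosen, logp_rejected, len_chosen, len_rejected,
               beta=2.0, gamma=1.0):
    # Implicit reward: length-normalized log probability under the policy;
    # unlike DPO, no reference model is involved.
    r_chosen = beta * logp_chosen / len_chosen
    r_rejected = beta * logp_rejected / len_rejected
    # Bradley-Terry-style objective with a target reward margin gamma.
    return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()
```
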
Experimenting with local alt text generation. A model that runs in the browser and can provide alt text for web photos automatically has been trained by Mozilla.
Mora: More like Sora for Generalist Video Generation. Mora is a multi-agent framework designed to facilitate generalist video generation tasks, leveraging a collaborative approach with multiple visual agents. It aims to replicate and extend the capabilities of OpenAI's Sora.
FABRIC: Personalizing Diffusion Models with Iterative Feedback. FABRIC (Feedback via Attention-Based Reference Image Conditioning) is a technique to incorporate iterative feedback into the generative process of diffusion models based on StableDiffusion.
KL is All You Need. KL divergence is a quick, affordable, and effective method of measuring a certain type of distance between objects. In both conventional and contemporary AI, it is widely employed. This piece examines the potent idea both mathematically and graphically.
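
For discrete distributions the quantity is simply KL(P || Q) = sum_i p_i * log(p_i / q_i), as in this small self-contained example:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                       # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# How far is a 90/10 biased coin from a fair one?
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))  # ~0.368 nats
```
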
7 Ways AI-Native Companies Can Improve User Retention. a guide for founders and product executives, with examples of how businesses like Perplexity, Civit, Lapse, and Omnivore are improving retention.
FineWeb: decanting the web for the finest text data at scale. The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. Recently, we released 🍷 FineWeb, a new, large-scale (15 trillion tokens, 44TB disk space) dataset for LLM pretraining. FineWeb is derived from 96 CommonCrawl snapshots and produces better-performing LLMs than other open pretraining datasets.
An entirely open-source AI code assistant inside your editor. Continue enables you to easily create your own coding assistant directly inside Visual Studio Code and JetBrains with open-source LLMs. All this can run entirely on your own laptop or have Ollama deployed on a server to remotely power code completion and chat experiences based on your needs.
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. A popular benchmark for reasoning tasks is MMLU. It is frequently seen as the gold standard and as something that models overfit. A new, more rigorous, and refined benchmark called MMLU Pro is used to gauge language model reasoning.
Omost. Omost gives you control over how your images are generated. It comes from the creator of ControlNet. It first rewrites prompts into a collection of illustrative code, then renders the finished image from that code. Crucially, you can modify the code before or after generation to subtly alter the model's output.
Control-GIC. A novel generative image compression framework called Control-GIC enables fine-grained bitrate modification while preserving high-quality output.
LLM inference speed of light. Grounding performance analysis in a theoretical "speed of light" model is extremely useful for problems where the amount of computation and memory access is known a priori, as it helps assess the quality of implementations and predict the impact of architectural modifications.
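
The grounding argument reduces to simple arithmetic: during single-stream decoding, every generated token must read all model weights once, so memory bandwidth sets a hard ceiling on tokens per second. An illustrative back-of-the-envelope calculation (the numbers are assumptions, not from the article):

```python
params = 7e9           # 7B-parameter model
bytes_per_param = 2    # fp16/bf16 weights
bandwidth = 1.0e12     # 1 TB/s memory bandwidth (a high-end GPU)

# Decoding one token touches every weight once, so the theoretical
# "speed of light" is bandwidth / bytes-moved-per-token.
tokens_per_sec = bandwidth / (params * bytes_per_param)
print(f"upper bound: {tokens_per_sec:.0f} tokens/s")  # ~71 tokens/s
```
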
Neural Surface Reconstruction. Without the need for 3D supervision, GenS is an end-to-end generalizable neural surface reconstruction model that performs exceptionally well at reconstructing surfaces from multi-view images.
MatMul-Free LM. Researchers have managed to remove matrix multiplication (MatMul) from large language models without sacrificing performance, even at the billion-parameter scale.
stable-audio-open-1.0. Stability AI has released the weights for Stable Audio, which was trained to produce sound effects using permissively licensed audio samples.
CV-VAE: A Compatible Video VAE for Latent Generative Video Models. With its spatio-temporally compressed latent spaces, CV-VAE is a video VAE that works with current image and video models to efficiently train new ones utilizing pre-trained ones.
Qwen2. Pretrained and instruction-tuned models in 5 sizes: Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B. Trained on data in 27 additional languages besides English and Chinese, with state-of-the-art performance in a large number of benchmark evaluations.
Dragonfly: A large vision-language model with multi-resolution zoom. We are also launching two new open-source models: Llama-3-8b-Dragonfly-v1, a general-domain model trained on 5.5 million image-instruction pairs, and Llama-3-8b-Dragonfly-Med-v1, finetuned on an additional 1.4 million biomedical image-instruction pairs. Dragonfly demonstrates promising performance on vision-language benchmarks like commonsense visual QA and image captioning. Dragonfly-Med outperforms prior models, including Med-Gemini, on multiple medical imaging tasks, showcasing its capabilities for high-resolution medical data.
MMLU Pro. MMLU is the industry standard for assessing knowledge and reasoning in language models; MMLU-Pro is a more robust and challenging refinement of it.

Perspectives

Beyond the Cloud: Distributed AI and On-Device Intelligence. The transition of AI workflows from the cloud to the edge, with specialized chip infrastructure and models, multi-modality, and ambient intelligence across devices.
Sure, Google’s AI overviews could be useful – if you like eating rocks. The company that shaped the development of search engines is banking on chatbot-style summaries. But so far, its suggestions are pretty wild
AI's Communication Revolution: We're All Talking to Computers Now. With its real-time integration of text, vision, and audio, OpenAI's GPT-4o is driving a revolution in communication through AI. Human-to-AI communication is becoming a fundamental form of digital connection, with the potential to bring about substantial societal changes and new companies focused on AI-centric communication. This transition enables more natural interactions with AI.
A Right to Warn about Advanced Artificial Intelligence. A group of AI workers, both present and past, is pleading with advanced AI companies to adopt values that guarantee openness and safeguard workers who voice concerns about risks. They emphasize how important it is for businesses to refrain from enforcing non-disparagement agreements, to make anonymous reporting procedures easier, to encourage candid criticism, and to shield whistleblowers from reprisals.
Will Scaling Solve Robotics? The Conference on Robot Learning, which included 11 workshops and nearly 200 submitted papers, drew over 900 attendees last year. Whether it was possible to tackle robotics problems by training a huge neural network on a large data set was one of the main points of contention throughout the event. To help readers better comprehend the topic, this piece offers the opposing viewpoints. Scaling has been successful in several related domains. It is not feasible, though, because there is a lack of readily available robotics data and no obvious method for obtaining it. Scaling, even if it performs as well as it does in other domains, is probably not going to solve robotics.
Plentiful, high-paying jobs in the age of AI. Due to comparative advantage, it's plausible that many professions humans currently perform will continue to be performed by humans indefinitely, regardless of how much better AIs become at those tasks.
What I learned from looking at 900 most popular open source AI tools. The goal of this study of open-source AI repositories is to provide readers with a broad overview of the intimidating AI ecosystem.
Meta AI system is a boost to endangered languages — as long as humans aren’t forgotten. Automated approaches to translation could provide a lifeline to under-resourced languages, but only if companies engage with the people who speak them.
Misinformation poses a bigger threat to democracy than you might think. In today’s polarized political climate, researchers who combat mistruths have come under attack and been labeled as unelected arbiters of truth. But the fight against misinformation is valid, warranted, and urgently required.
Is AI misinformation influencing elections in India? A sample of roughly two million WhatsApp messages highlights urgent concerns about the spread and prevalence of AI-generated political content.
I'm Bearish OpenAI. A shift toward products and a research brain drain should ring your alarm bells
The future of foundation models is closed-source. If the centralizing forces of data and compute hold, open and closed-source AI cannot both dominate long-term
A Grand Unified Theory of the AI Hype Cycle. Over the years, the AI sector has experienced multiple hype cycles, each of which produced really useful technology and outlasted the previous one. Instead of following an exponential process, every cycle adheres to a sigmoid one. There is an inevitable limit to any technology development strategy, and it is not too difficult to find. Although this AI hype cycle is unlike any other that has come before it, it will probably go in the same direction.
Hi, AI: Our Thesis on AI Voice Agents. The current state of AI speech agents is described in a blog post and deck created by Andreessen Horowitz, along with potential areas for advancement and investment. It outlines the present state of the B2B and B2C application layer landscape and covers the top infrastructure stack.

meme-of-the-week

Back to index

ML news: Week 27 May - 2 June

Research

Golden Gate Claude. we released a major new research paper on interpreting large language models, in which we began to map out the inner workings of our AI model, Claude 3 Sonnet. In the “mind” of Claude, we found millions of concepts that activate when the model reads relevant text or sees relevant images, which we call “features”.
A Better Match for Drivers and Riders: Reinforcement Learning at Lyft. The Lyft team matched drivers and riders using online reinforcement learning, rewarded by drivers' future earnings. The approach generated an extra $30 million a year for drivers and significantly improved matching in real time.
Lessons from the Trenches on Reproducible Evaluation of Language Models. Language model evaluation is a challenging task, and information on the process is scarce outside of the biggest companies. This work presents a robust and repeatable set of assessment criteria. In the appendix, there is a useful discussion of perplexity evaluation.
RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance. A novel method for tailoring diffusion models to produce identity-preserving images from user-supplied references is presented by researchers. This strategy steers diffusion models without further training by using classifier guidance, in contrast to classic methods that need considerable domain-specific training.
LoRA-Ensemble: Efficient Uncertainty Modelling for Self-attention Networks. A parameter-efficient deep ensemble technique for self-attention networks is called LoRA-Ensemble. This method provides accurate and well-calibrated predictions without the significant computational cost associated with typical ensemble methods. It does this by extending Low-Rank Adaptation (LoRA) for implicit ensembling.
Agent Planning with World Knowledge Model. introduces a parametric world knowledge model to facilitate agent planning; the agent model self-synthesizes knowledge from expert and sampled trajectories, which is used to train the world knowledge model; prior task knowledge guides global planning and dynamic state knowledge guides local planning; demonstrates superior performance compared to various strong baselines when adopting open-source LLMs such as Mistral-7B and Gemma-7B.
Enhancing Answer Selection in LLMs. suggests a hierarchical reasoning aggregation framework, Aggregation of Reasoning (AoR), to enhance LLMs' reasoning capabilities; rather than majority voting, which fails when the correct answer is in the minority, AoR chooses answers by evaluating the reasoning chains themselves; it employs dynamic sampling, using evaluation-phase results to decide whether to sample more reasoning chains in proportion to task complexity; AoR can be employed with different LLMs and outperforms several well-known ensemble approaches on difficult reasoning problems.
Efficient Inference of LLMs. suggests a layer-condensed KV cache for efficient inference in LLMs; it only computes and caches the key-values (KVs) of a small number of layers, reducing memory consumption and improving inference throughput; achieves up to 26x higher throughput than baseline transformers while maintaining satisfactory performance.
Mapping the Mind of a Large Language Model. By mapping millions of features that correspond to a wide range of concepts, Anthropic has shown a way to understand the inner workings of its large language model, Claude Sonnet. This interpretability, which permits manipulating these features to steer model behavior, may result in safer AI. The research marks a notable advance in understanding and improving the security protocols of AI language models.
Object Segmentation in Complex Scenarios. To enhance Generalized Referring Expression Segmentation (GRES), researchers have developed the Hierarchical Semantic Decoding with Counting Assistance (HDC) framework. As opposed to earlier techniques, HDC combines semantic correspondences and transmits complementing modality information across granularities for improved multi-level decoding.
Label-efficient Semantic Scene Completion with Scribble Annotations. A novel semantic scene completion method called Scribble2Scene lessens the requirement for thorough labeling.
Semantic and Spatial Adaptive Pixel-level Classifier for Semantic Segmentation. The constraints of semantic segmentation have been addressed with the introduction of a new Semantic and Spatial Adaptive (SSA) classifier. This novel method makes use of coarse masks to direct prototype adjustment, improving fine-grained recognition and delineating mask boundaries.
RSBuilding: Towards General Remote Sensing Image Building Extraction and Change Detection with Foundation Model. By integrating building extraction and change detection into a single model, RSBuilding presents a novel method for deciphering buildings from remote sensing photos.
Meteor: Mamba-based traversal of the rationale for Large Language and Vision Models. This research presents Meteor, an efficient large language and vision model that employs multiple rationales to enhance comprehension and response quality.
gzip Predicts Data-dependent Scaling Laws. Scaling laws are a means of forecasting model performance at specific sizes with a given quantity of data, but fitting them is costly. This research investigates using the gzip compression ratio of the training data as a strong signal for forecasting a data-dependent scaling law.
The Road Less Scheduled. A few weeks prior, a brand-new Meta optimizer was circulating as a possible replacement for Adam. The method, including the part about online updates, is described in more depth in this paper. Overall, this appears like a good outcome, particularly in cases when the complete number of planned training steps is not known at the start of the training process.
Transformers Can Do Arithmetic with the Right Embeddings. Researchers improved transformer performance on arithmetic tasks by adding embeddings that encode the position of each digit relative to the start of the number; a sketch of the indexing idea follows.
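A rough sketch of the indexing idea (illustrative only; the paper learns an embedding table over these per-digit positions):

```python
def digit_position_ids(tokens):
    """Assign each digit its 1-based offset from the start of the number it
    belongs to (0 for non-digit tokens). In a model, these ids would index
    a learned embedding table added to the regular token embeddings."""
    ids, offset = [], 0
    for tok in tokens:
        offset = offset + 1 if tok.isdigit() else 0
        ids.append(offset)
    return ids

tokens = list("12+345=357")
print(list(zip(tokens, digit_position_ids(tokens))))
# [('1', 1), ('2', 2), ('+', 0), ('3', 1), ('4', 2), ('5', 3), ('=', 0),
#  ('3', 1), ('5', 2), ('7', 3)]
```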
DMPlug: A Plug-in Method for Solving Inverse Problems with Diffusion Models. DMPlug is a new plug-in technique that solves inverse problems (IPs) by using pre-trained diffusion models (DMs). DMPlug efficiently addresses both manifold feasibility and measurement feasibility by treating the reverse diffusion process as a function, in contrast to other interleaving techniques.
PatchScaler: An Efficient Patch-independent Diffusion Model for Super-Resolution. PatchScaler is a diffusion-based technique that greatly improves inference efficiency for single image super-resolution (SR).
Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model. Reason3D is a novel multimodal large language model designed for comprehensive understanding of 3D environments.
Yuan 2.0-M32: Mixture of Experts with Attention Router. Yuan 2.0-M32 is a Mixture-of-Experts model with 40B parameters, 3.7B of which are active at any time. Trained on 2T tokens, it performs similarly to Llama 3 70B while using only about 1/19th of the compute, which is remarkably strong for its size.
Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations. The cosine learning rate schedule used in the original scaling-law papers prevents optimal loss if the cosine period does not match the total number of training steps, which makes training enough models to fit useful scaling laws expensive. To cut the GPU cost of scaling-law development, this study proposes a constant learning rate followed by a cooldown; a sketch of the schedule follows.
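A sketch of the schedule shape (hyperparameters here are illustrative, not the paper's):

```python
def lr_constant_with_cooldown(step, max_steps, base_lr=3e-4,
                              warmup=100, cooldown_frac=0.2):
    """Linear warmup, a long constant phase, then a linear cooldown over the
    final fraction of training. Because the constant phase is schedule-agnostic,
    one long run can be branched into cooldowns of different lengths, so many
    points of a scaling law can come from a single training run."""
    cooldown_start = int(max_steps * (1 - cooldown_frac))
    if step < warmup:
        return base_lr * step / warmup
    if step < cooldown_start:
        return base_lr
    return base_lr * (max_steps - step) / (max_steps - cooldown_start)

for s in (0, 50, 500, 8500, 9999):
    print(s, f"{lr_constant_with_cooldown(s, 10_000):.2e}")
```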
Towards Ultra-High-Definition Image Deraining: A Benchmark and An Efficient Method. To address the problem of deraining ultra-high-definition (UHD) photographs, researchers have released a new dataset dubbed 4K-Rain13k, which consists of 13,000 pairs of 4K resolution images.
EasyAnimate: An End-to-End Solution for High-Resolution and Long Video Generation. EasyAnimate adapts the transformer-based DiT architecture for advanced 3D video generation, integrating a motion module block to capture temporal dynamics and ensure smooth motion transitions and consistent frames.
Self-Exploring Language Models (SELM). Online feedback is used in Self-Exploring Language Models (SELM), a technique that improves preference optimization in LLMs.
T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback. When applied to video models, consistency distillation significantly lowers the number of steps required to generate content.

News

Link description
Meta and Elon Musk’s xAI fight to partner with chatbot group Character.ai. AI pioneer Noam Shazeer launched Character.ai, a rapidly expanding role-playing startup, and Silicon Valley companies are vying for a partnership. This occurs at a time when numerous big businesses are investing heavily in startups.
Scarlett Johansson told OpenAI not to use her voice — and she’s not happy they might have anyway. OpenAI has denied that its ChatGPT voice is based on Johansson, but it certainly sounds a lot like her.
xAI Series B funding round. xAI is pleased to announce our series B funding round of $6 billion.
iPhone to get a better Siri, AI emoji creator, smart recaps, and more with iOS 18. In June 2024, the Cupertino giant will finally unveil its approach to AI.
New startup builds digital pets for Apple's Vision Pro. A new startup is coming out of stealth with a plan to offer digital pets for the Apple Vision Pro that use AI to read and respond to human emotion.
Humane is looking for a buyer after the AI Pin’s underwhelming debut. The startup apparently thinks it’s worth between $750 million and $1 billion despite the deep software flaws and hardware issues of its first product.
OpenAI Board Forms Safety and Security Committee. OpenAI announced that it has begun training its next foundation model and has established a Safety and Security Committee. As model capabilities advance, this committee will be responsible for recommending to the board what safety steps should be taken.
Anthropic hires former OpenAI safety lead to head up new team. Jan Leike, a leading AI researcher who earlier this month resigned from OpenAI before publicly criticizing the company’s approach to AI safety, has joined OpenAI rival Anthropic to lead a new “superalignment” team.
New agent capabilities in Microsoft Copilot. At Build 2024, Microsoft introduced new Copilot capabilities, such as Copilot Extensions and Connectors for simple customization, Team Copilot for team communication, and custom AI Agents to automate operations. These improvements aim to increase productivity and efficiency in company processes; they are presently in limited private preview and are anticipated to be generally available later in 2024.
“I lost trust”: Why the OpenAI team in charge of safeguarding humanity imploded. Company insiders explain why safety-conscious employees are leaving.
OpenAI sends internal memo releasing former employees from controversial exit agreements. OpenAI on Thursday backtracked on a controversial decision to, in effect, make former employees choose between signing a non-disparagement agreement that would never expire, or keeping their vested equity in the company.
Opera adds Google’s Gemini to its browsers. Users can access Gemini through the Aria AI assistant on Opera browsers.
Two receptors are better than one for AI-designed obesity drugs. Compounds predicted by machine learning attach to two receptors involved in appetite and weight.
Mistral's New AI Non-Production License. Mistral is attempting to strike a balance between corporate success and transparency. It has a new license designed to achieve that equilibrium. It will keep releasing further projects under the new MNPL license in addition to Apache 2.0.
Sonic: A Low-Latency Voice Model for Lifelike Speech. The makers of Mamba, subquadratic Transformer variants, and SSMs have released a new model. Crucially, their newly founded company, Cartesia, has developed a realistic-sounding, lightning-fast speech-generation system, suggesting they intend to compete in the voice-assistant space.
Vox Media and The Atlantic sign content deals with OpenAI. OpenAI continues to establish media partnerships as it looks to lock down training data — and avoid lawsuits.
Mistral's Code Model. We introduce Codestral, our first-ever code model. Codestral is an open-weight generative AI model explicitly designed for code generation tasks.
OpenAI signs 100K PwC workers to ChatGPT’s enterprise tier as PwC becomes its first resale partner. OpenAI on Wednesday announced that it has signed a major enterprise customer that it hopes will indicate how a similar effect could play out in the world of work. PwC, the management consulting giant, will become OpenAI’s biggest customer to date, covering 100,000 users.
Apple's AI plans involve 'black box' for cloud data. Apple intends to process data from AI applications inside a virtual black box. The concept, known as "Apple Chips in Data Centers" (ACDC) internally, would involve only Apple's hardware being used to perform AI processing in the cloud. The idea is that it will control both the hardware and software on its servers, enabling it to design more secure systems.
Introducing Perplexity Pages. The Perplexity search engine has announced a new AI-powered product for producing shareable, long-lasting research artifacts.
Autodesk acquires AI-powered VFX startup Wonder Dynamics. Autodesk — the 3D tools behemoth — has acquired Wonder Dynamics, a startup that lets creators quickly and easily make complex characters and visual effects using AI-powered image analysis.
Anthropic’s AI now lets you create bots to work for you. Anthropic is releasing a new feature for its AI chatbot Claude that will let anyone create an email assistant, a bot to purchase shoes or other personalized solutions. It’s called “tool use” (or the nerdier “function calling”), and it hooks up to any external API of your choosing.
Patronus AI Raises $17 million To Detect LLM Mistakes at Scale. Series A financing led by Glenn Solomon at Notable Capital underscores the urgent need for companies to deploy large language models with confidence
Neuralink rival sets brain-chip record with 4,096 electrodes on human brain. Brain-computer interface company Precision Neuroscience says that it has set a new world record for the number of neuron-tapping electrodes placed on a living human's brain—4,096, surpassing the previous record of 2,048 set last year, according to an announcement from the company on Tuesday.
Google adds AI-powered features to Chromebook. Google announced new AI-powered features today for its Chromebook Plus line of devices, such as a writing assistant, a wallpaper creator, and easy access to Google’s Gemini chatbot.
What is science? Tech heavyweights brawl over definition. AI pioneer Yann LeCun and Elon Musk went head-to-head in a debate about modern research that drew thousands of comments.
Google to refine AI-generated search summaries in response to bizarre results. After new feature tells people to eat rocks or add glue to pizza sauce, company to restrict which searches return summaries
A new AI service allows viewers to create TV shows. Are we doomed? Showrunner will let users generate episodes with prompts, which could be an alarming next step or a fleeting novelty

Resources

Link description
Mistral-finetune. mistral-finetune is a light-weight codebase that enables memory-efficient and performant finetuning of Mistral's models. It is based on LoRA, a training paradigm where most weights are frozen and only 1-2% additional weights in the form of low-rank matrix perturbations are trained.
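As a generic illustration of the low-rank idea (a sketch, not mistral-finetune's actual code), a LoRA layer wraps a frozen linear layer with a small trainable update:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank perturbation (B @ A)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                     # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init:
        self.scale = alpha / rank                       # the update starts at 0

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.2%}")  # well under the 1-2% noted above
```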
Modula. A novel technique called modular norm allows neural networks to scale training effectively over a range of network sizes by normalizing weight updates.
MobileNet-V4. MobileNet-V4 is an extremely fast and performant computer vision model that can run on edge devices. This blog post describes the new model and the contemporary modifications made to it.
Multi-Dimensional Features. This project challenges the linear representation hypothesis by examining whether language models compute using multi-dimensional features.
llamafile 0.8.6 CPU benchmark. It is now possible to run inference for the flagship model from Mistral at 20 tokens per second on a commodity CPU, thanks to recent developments from Mozilla's Llamafile project.
Risks and Opportunities of Open-Source Generative AI. examines the potential and hazards associated with open-source generative AI models and makes the case that these models' overall advantages exceed their drawbacks.
How Far Are We From AGI. offers a summary of the tactics required to attain artificial general intelligence (AGI), including a thorough survey, discussion, and unique viewpoints. It also addresses significant questions regarding the near future of AGI.
Efficient Multimodal LLMs. offers a thorough and methodical analysis of the current state of efficient multimodal large language models; it covers applications, constraints, possible future directions, and efficient structures and techniques.
Scientific Applications of LLMs. introduces INDUS, a full suite of LLMs that comprises small distilled models, an encoder model, and embedding models for Earth science, biology, physics, and planetary sciences, among other subjects.
Guide for Evaluating LLMs. offers advice and lessons for assessing large language models (LLMs); it also covers best practices and potential problems, and it introduces an open-source framework for LLM evaluation.
Information Retrieval Measures. To build a RAG system, you need to understand how well the retrieval component is performing. This toolbox provides an extensive range of effective performance metrics for information retrieval; two of the standard metrics are sketched below.
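Two of the standard measures such a toolbox covers, written out directly (these definitions are generic, not this library's API):

```python
def recall_at_k(relevant: set, retrieved: list, k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    return len(relevant & set(retrieved[:k])) / len(relevant)

def mrr(relevant: set, retrieved: list) -> float:
    """Reciprocal rank of the first relevant result (0.0 if none is retrieved)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

relevant = {"doc2", "doc7"}
retrieved = ["doc5", "doc2", "doc1", "doc7", "doc9"]
print(recall_at_k(relevant, retrieved, k=3))  # 0.5 -- only doc2 makes the top 3
print(mrr(relevant, retrieved))               # 0.5 -- first relevant hit at rank 2
```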
A Guide to Creating Neural Circuit Diagrams. This is a guide to drawing Neural Circuit Diagrams by Vincent Abbott from the paper Neural Circuit Diagrams: Robust Diagrams for the Communication, Implementation, and Analysis of Deep Learning Architectures. It allows for deep learning algorithms to be comprehensively expressed using a novel diagrammatic scheme.
InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation. A novel model called InstructAvatar uses text direction to generate 2D avatars that are emotionally expressive.
Marigold Pipelines for Computer Vision Tasks. Diffusers can now use one of the best depth models as a pipeline. This tutorial goes over how to utilize the model, what you can do with it, and how to condition the latents of the first frame to make it work with videos effortlessly.
Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20. Andrej Karpathy has released a version of llm.c, a single-file, self-contained GPT-2 implementation designed to replicate the 2019 model suite. With this latest release, the library can train the smallest of these models in about 90 minutes. It has few dependencies and runs end to end.
Content-Style Decoupling for Unsupervised Makeup Transfer without Generating Pseudo Ground Truth. An innovative technique for improving makeup transfer tasks without depending on genuine target images is Content-Style Decoupled Makeup Transfer (CSD-MT).
LaVague. LaVague is an open-source Large Action Model framework to develop AI Web Agents. Our web agents take an objective, such as "Print installation steps for Hugging Face's Diffusers library" and perform the required actions to achieve this goal by leveraging our two core components.
PRISM: A foundation model for life’s chemistry. Enveda’s PRISM (Pretrained Representations Informed by Spectral Masking) model was trained on 1.2 billion small molecule mass spectra, the largest training set of small molecule mass spectra ever assembled.
Scale Private Leaderboard. Scale has created a private language model evaluation leaderboard. Although the ordering isn't all that surprising, it's worth noting that the Llama 3 70B frequently outperforms Claude Opus in terms of instruction following.
controlnet-scribble-sdxl-1.0. Drawing random lines can be used as conditioning data for image creation using Scribble ControlNet. It has strong performance and was trained on a significantly larger number of images than other ControlNets.

Perspectives

Link description
AI's Communication Revolution: We're All Talking to Computers Now. The most recent AI model from OpenAI, GPT-4o, allows for real-time communication between people and machines by adding vision and audio to its text-based capabilities. The AI revolution brings with it a new wave of interactions between humans and AI, and eventually AI itself. These interactions will probably have an impact on our social habits and business structures. The impact of this technology on human communication will develop as it advances, possibly spurring the development of creative businesses and software.
I Don’t Want To Spend My One Wild And Precious Life Dealing With Google’s AI Search. The unwelcome three-second delay that Google's AI search tool adds to search results is driving users crazy by interfering with their experience and displaying irrelevant content.
LLMs are not suitable for (advanced) brainstorming. When it comes to truly creative brainstorming, large language models frequently end up producing consensus-based ideas rather than original notions.
Could AI help cure ‘downward spiral’ of human loneliness?. One computer scientist says we should embrace human-machine relationships, but other experts are more cautious
Scarlett Johansson’s OpenAI clash is just the start of legal wrangles over artificial intelligence. Hollywood star’s claim ChatGPT update used an imitation of her voice highlights tensions over rapidly accelerating technology
TechScape: What we learned from the global AI summit in South Korea. One day and six (very long) agreements later, can we call the meeting to hammer out the future of AI regulation a success?
Trying to tame AI: Seoul summit flags hurdles to regulation. UK touts ‘Bletchley effect’ of safety institutes, but division remains over whether to limit AI abilities
How to Build a Category-Defining AI Startup. As the AI field changes quickly, AI startups need a marketing-led strategy to stand out from the competition and establish themselves as category leaders; such a strategy can accelerate market acceptance, reshape the industry narrative, and position a startup as a visionary leader in its field.
Ways to think about AGI. The consensus is unclear because there isn't a well-developed theoretical model of general intelligence or a clear explanation for why or how LLMs work so well, despite the fact that some experts think AGI may be achievable. The conversation highlights the enormous amount of unanswered questions surrounding AGI, recognizing both its possible advantages and disadvantages while drawing comparisons between theology and the empirical methodology of the Apollo Program.
The AI revolution is coming to robots: how will it change them?. The melding of artificial intelligence and robotics could catapult both to new heights.
What GPT-4o illustrates about AI Regulation. This article compares model-level, use-level, and conduct-level frameworks for AI regulation. It contends that use-level regulation can lead to unneeded complexity and unworkable constraints on AI deployment, and is inferior to conduct-level regulation, which applies existing laws to new technologies. The recent EU AI Act's limits on AI's capacity to infer emotions are offered as one example of the drawbacks of a use-level approach.
How does ChatGPT ‘think’? Psychology and neuroscience crack open AI large language models. Researchers are striving to reverse-engineer artificial intelligence and scan the ‘brains’ of LLMs to see what they are doing, how and why.
Anglo-American bias could make generative AI an invisible intellectual cage. Studies show that applications in generative artificial intelligence (AI) such as ChatGPT and other large language models perform remarkably well in English, but are not as proficient in other languages. This masks a more insidious problem.
AI won’t eat your job, but it will eat your salary. AI poses a danger to the skill premium associated with tasks as well as the existence of jobs themselves, which could result in lower compensation for skilled workers. AI has the potential to reorganize job duties and reduce obstacles to task completion, which would result in commoditization and a reduction in the ability to demand a premium wage. Managerial advantages may likewise disappear as AI develops, particularly through AI agents, which would put the human-in-the-loop advantage to the test and further erode skill premiums.
‘All eyes on Rafah’: how AI-generated image swept across social media. Celebrity posts of graphic following IDF strike help make it among most-shared content of Israel-Gaza war

meme-of-the-week

Back to index

ML news: Week 20 - 26 May

Research

Link description
LoRA Learns Less and Forgets Less. LoRA is a well-liked technique for enhancing models to add flair or expertise. The trade-off between forgetting and power while utilizing LoRAs is examined in this research. LoRAs are found to retain more of the initial "out of distribution" performance while learning less than full fine-tuning.
Chameleon: Mixed-Modal Early-Fusion Foundation Models. Like GPT-4o, Meta's Chameleon is a natively multimodal model that works with text and images at the same time, and it outperforms many other models. The Meta team's work on internal models has advanced considerably.
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. The technical report for Google's most current model family has been updated. While there is a dearth of information regarding the models and data utilized, there is a wealth of information regarding the assessment and safety precautions implemented, providing an intriguing glimpse into large-scale alignment.
Introducing the Frontier Safety Framework. Google DeepMind unveiled the Frontier Safety Framework to mitigate the risks posed by future advanced AI models. The framework assesses models against critical capability levels (CCLs) for potentially dangerous AI capabilities and applies mitigation techniques when thresholds are crossed.
ART3D: 3D Gaussian Splatting for Text-Guided Artistic Scenes Generation. AI can be creatively and entertainingly used to generate artistic 2D visuals. This work uses text-guided Gaussian Splatting to bring that capacity to 3D.
Grounded 3D-LLM with Referent Tokens. It's difficult to figure out where items are in a 3D setting. You can identify semantic labels for things in 3D space by employing language-guided 3D understanding.
LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation. LeMeViT is a novel method that uses learnable meta tokens to lower the computational costs associated with Vision Transformers. By effectively capturing important data, these tokens accelerate inference.
Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers. A fresh security risk has been identified for the well-known AI model Vision Transformers by researchers. The attack, known as SWARM, is extremely sneaky and harmful to consumers since it discreetly activates backdoor behavior in a model using a "switch token".
Mapping the Mind of a Large Language Model. Today we report a significant advance in understanding the inner workings of AI models. We have identified how millions of concepts are represented inside Claude Sonnet, one of our deployed large language models. This is the first-ever detailed look inside a modern, production-grade large language model. This interpretability discovery could, in future, help us make AI models safer.
Smart Expert System: Large Language Models as Text Classifiers. Text classification is a fundamental task in Natural Language Processing (NLP), and the advent of Large Language Models (LLMs) has revolutionized the field. This paper introduces the Smart Expert System, a novel approach that leverages LLMs as text classifiers.
CSTA: CNN-based Spatiotemporal Attention for Video Summarization. In order to enhance video summarization, this project presents a novel CNN-based SpatioTemporal Attention (CSTA) technique. In contrast to conventional attention mechanisms, CSTA uses a 2D CNN to efficiently extract the visual meaning of frames and to capture relationships and important features in videos.
Microsoft introduces Phi-Silica, a 3.3B parameter model made for Copilot+ PC NPUs. Microsoft is making more investments in the development of small language models (SLMs). At its Build developer conference, the company announced the general availability of its Phi-3 models and previewed Phi-3-vision. However, on the heels of Microsoft’s Copilot+ PC news, it’s introducing an SLM built specifically for these device’s powerful Neural Processing Units (NPUs).
Aurora: A Foundation Model of the Atmosphere. By training a foundation model for atmospheric predictions, Microsoft has achieved a new state-of-the-art in global weather prediction tests lasting five and ten days.
MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark. A new benchmark called MathBench aims to give a comprehensive evaluation of the mathematical capabilities of large language models.
Wav-KAN: Wavelet Kolmogorov-Arnold Networks. Wav-KAN is a neural network framework that leverages wavelet functions to enhance performance and interpretability, according to research. Wav-KAN captures both high-frequency and low-frequency data components, which speeds up training and boosts robustness in contrast to standard models.
Global-Local Semantic Consistent Learning (GLSCL). Global-Local Semantic Consistent Learning (GLSCL), a novel technique created by researchers, greatly lowers computational costs while improving text-video retrieval.
ProtT3: Protein-to-Text Generation for Text-based Protein Understanding. ProtT3, a novel framework that combines conventional Language Models (LMs) with Protein Language Models (PLMs) to improve text-based protein understanding, is presented by researchers. Using a cross-modal projector known as Q-Former, ProtT3 combines a PLM for analyzing amino acid sequences with a language model to produce high-quality textual descriptions.
Diffusion-RSCC: Diffusion Probabilistic Model for Change Captioning in Remote Sensing Images. In order to better explain how the environment changes over time, a new probabilistic diffusion model for Remote Sensing Image Change Captioning (RSICC) is presented in this study.

News

Link description
First companies sign up to AI safety standards on eve of Seoul summit. Rishi Sunak says 16 international firms have committed, but standards have been criticized for lacking teeth
World is ill-prepared for breakthroughs in AI, say experts. Governments have made insufficient regulatory progress, ‘godfathers’ of the technology say before summit
Productivity soars in sectors of the global economy most exposed to AI, says the report. Employers in the UK, one of 15 countries studied, willing to pay 14% wage premium for jobs requiring AI skills
ChatGPT suspends Scarlett Johansson-like voice as actor speaks out against OpenAI. OpenAI says ‘Sky’ is not an imitation of actor’s voice after users compare it to AI companion character in film Her
$16k G1 humanoid rises up to smash nuts, twist and twirl. Humanoid development at Chinese robotics company Unitree continues apace. Following its entry into the melee just last year, its fast-walking H1 bot recently got its backflip groove on. Now the faceless and hand-less humanoid is being joined by an impressive all-rounder.
Google I/O 2024: Here’s everything Google just announced. Google kicked off its annual developer conference with a rapid-fire stream of announcements, unveiling much of what it has recently been working on.
Gamma raised $12M in Series A funding to reimagine presentations, powered by AI. Gamma received $12 million from Accel to use AI to reinvent presentations. Over 18 million people have contributed over 60 million Gammas (AI-generated slides) to date.
Inflection AI reveals new team and plans to embed emotional AI in business bots. Inflection AI unveiled its new leadership team, composed of seasoned Silicon Valley veterans.
Scarlett Johansson says Altman insinuated that AI soundalike was intentional. OpenAI has paused a voice mode option for ChatGPT-4o, Sky after backlash accusing the AI company of intentionally ripping off Scarlett Johansson's critically acclaimed voice-acting performance in the 2013 sci-fi film Her.
Perplexity CEO Aravind Srinivas takes shots at Google. Google's planned roll-out of AI-summarized search results doesn't faze Perplexity AI CEO and co-founder Aravind Srinivas — whose startup has offered a popular AI-driven search tool providing similar digests for nearly two years.
Google still hasn’t fixed Gemini’s biased image generator. Back in February, Google paused its AI-powered chatbot Gemini’s ability to generate images of people after users complained of historical inaccuracies. Well, the problem’s likely more complex than Hassabis alluded to.
SoundHound AI and Perplexity Partner to Bring Online LLMs to Next Gen Voice Assistants Across Cars and IoT Devices. Perplexity’s capabilities added to SoundHound Chat AI will respond to questions conversationally with real-time knowledge from the web
Stability AI discusses sale amid cash crunch, The Information reports. Artificial Intelligence startup Stability AI held discussions with at least one potential buyer in recent weeks about a sale as it faces a cash crunch, The Information reported on Wednesday, citing a person involved in the talks.
Scale AI raises $1B. Accel and earlier investors provide the gigantic series F. There is a huge need for the services offered, and Scale is in a unique position to keep driving the current AI data surge.
Elon Musk’s xAI is working on making Grok multimodal. Users may soon be able to input images into Grok for text-based answers.
Google CEO Sundar Pichai on AI-powered search and the future of the web. The head of Google sat down with Decoder last week to talk about the biggest advancements in AI, the future of Google Search, and the fate of the web.
Apple announces new accessibility features, including Eye Tracking, Music Haptics, and Vocal Shortcuts. Apple today announced new accessibility features coming later this year, including Eye Tracking, a way for users with physical disabilities to control iPad or iPhone with their eyes.
Microsoft announces $3.3 billion investment in Wisconsin to spur artificial intelligence innovation and economic growth. Microsoft today announced a broad investment package designed to strengthen the role of Southeast Wisconsin as a hub for AI-powered economic activity, innovation, and job creation. These investments include $3.3B in cloud computing and AI infrastructure, the creation of the country’s first manufacturing-focused AI co-innovation lab, and an AI skilling initiative to equip more than 100,000 of the state’s residents with essential AI skills.
ElevenLabs has launched a free iPhone app that speaks text on the screen — 11 voices and PDF capabilities available. The unicorn startup ElevenLabs, best known for its AI dubbing site, has launched its first public app.
The US Congress is taking on AI — this computer scientist is helping. Kiri Wagstaff, who temporarily shelved her academic career to provide advice on federal AI legislation, talks about life inside the halls of power.
OpenAI Partners with News Corp. News Corp, which publishes articles from WSJ, NYP, The Times, and other publications, has partnered with OpenAI to provide News Corp's news content on OpenAI's platform, which they say will improve the accuracy and usefulness of generated output.
Stanford HAI Releases Updated Foundation Model Transparency Index. The most recent version of Stanford HAI's Foundation Model Transparency Index, which assesses the transparency of 14 significant AI developers, including Google and OpenAI, was released. These businesses showed a considerable improvement and readiness to engage in a dialogue about their models by disclosing fresh information that was not previously known to the public. The average transparency score was just 58 out of 100, indicating serious deficiencies in areas including downstream impact, data access, and model credibility despite these advancements.
The ChatGPT desktop app is more helpful than I expected - here's why and how to try it. Among OpenAI's many big updates this week was a new ChatGPT app for MacOS. Here's how to use it and when Windows users can get in on the fun.
Suno has raised $125 million to build a future where anyone can make music. A platform for creating music called Suno has raised $125 million to keep constructing a world in which anyone can compose music.
Nvidia reports stratospheric growth as AI boom shows no sign of stopping. Chipmaker reports strong demand and higher-than-expected revenue even as other companies spend to develop their own chips
Mistral AI and Harvey Partnership. Mistral has teamed up with Harvey, a legal AI company. Although the announcement offers few specifics, they will likely collaborate on a custom legal model.
French AI startup H raises $220M seed round. H, a startup based in Paris and previously known as Holistic AI, announced a $220 million seed round just a few months after the company’s inception.
Reflections on our Responsible Scaling Policy. With an emphasis on continuous improvement and cooperation with business and government, Anthropic's Responsible Scaling Policy attempts to prevent catastrophic AI safety failures by identifying high-risk capabilities, testing models often, and enforcing tight safety requirements.
Introducing Aya. A global initiative led by Cohere For AI involving over 3,000 independent researchers across 119 countries. Aya is a state-of-the-art model and dataset, pushes the boundaries of multilingual AI for 101 languages through open science.
PaliGemma: An Open Multimodal Model by Google. PaliGemma is a vision language model (VLM) developed and released by Google that has multimodal capabilities. Unlike other VLMs, such as OpenAI’s GPT-4o, Google Gemini, and Anthropic’s Claude 3 which have struggled with object detection and segmentation, PaliGemma has a wide range of abilities, paired with the ability to fine-tune for better performance on specific tasks.
Casper Labs Announces AI Governance Solution, Prove AI. In an effort to improve enterprise AI applications' auditability and transparency, Casper Labs has launched Prove AI, a joint venture with IBM.
Google AI search tool reportedly tells users to jump off a bridge and eat rocks. Firm’s AI overviews feature has been rolled out to users in US, but many have reported strange responses

Resources

Link description
model-explorer. A new model explorer from Google makes it simple to see the computation graph of your models. Performance engineering and debugging may find use for it.
real-time inference demo for paligemma. You can run the latest Google VLM in real time using GPT-Fast. Given how simple it is to fine-tune the model for particular tasks, this opens up a multitude of powerful downstream applications.
Multi AI Agent Systems using OpenAI's Assistants API (Experts.js). Experts.js is the easiest way to create and deploy OpenAI's Assistants and link them together as Tools to create a Panel of Experts system with expanded memory and attention to detail.
First-ever AI Code Interpreter for R. Julius is the leading generative AI tool for data analysis. Designed to perform statistical analysis, data science, and computational tasks, it combines cutting-edge foundational models like GPT-4o, Claude 3, and Gemini 1.5 with robust coding capabilities in Python and R.
Moondream WebGPU. 1.86 billion parameter VLM (Vision-Language Model) that is optimized for inference on the web. Once downloaded, the model (1.8 GB) will be cached and reused when you revisit the page. Everything runs directly in your browser using 🤗 Transformers.js and ONNX Runtime Web, meaning your conversations aren't sent to a server. You can even disconnect from the internet after the model has loaded!
Devon: An open-source pair programmer. You can select different models for Multi-file editing, Codebase exploration, Config writing, Test writing, Bug fixing, and Architecture exploration
llama3 implemented from scratch. This project implements Llama 3 from scratch, one tensor and matrix multiplication at a time, loading tensors directly from the model file that Meta provided for Llama 3.
PSG4D - 4D Panoptic Scene Graph Generation. The PSG4D (4D Panoptic Scene Graph Generation) Task is a novel task that aims to bridge the gap between raw visual inputs in a dynamic 4D world and high-level visual understanding. It involves generating a comprehensive 4D scene graph from RGB-D video sequences or point cloud video sequences.
microsoft/Phi-3-medium-128k-instruct. The Phi-3-Medium-128K-Instruct is a 14B parameter, lightweight, state-of-the-art open model trained with the Phi-3 datasets that include both synthetic data and the filtered publicly available websites data with a focus on high-quality and reasoning dense properties.
Debiasing Large Visual Language Models. A post-hoc debiasing method and a Visual Debias Decoding strategy. These strategies not only help minimize hallucinations but also contribute to the generation of more helpful and precise outputs.
DeepSeek-VL. an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. DeepSeek-VL possesses general multimodal understanding capabilities, capable of processing logical diagrams, web pages, formula recognition, scientific literature, natural images, and embodied intelligence in complex scenarios.
MiniCPM-V. MiniCPM-V is a series of end-side multimodal LLMs (MLLMs) designed for vision-language understanding. Models take images and text as inputs and provide high-quality text outputs. Since February 2024, we have released 4 versions of the model, aiming to achieve strong performance and efficient deployment
OLAPH: Improving Factuality in Biomedical Long-form Question Answering. A new benchmark dataset called MedLFQA was created to enhance the factual accuracy of long-form replies from large language models in the medical domain. OLAPH is a framework that uses preference optimization and automatic evaluations to teach LLMs to reduce errors.
Tarsier. Tarsier, a new tool from Reworkd, visually tags webpage items with brackets and IDs to improve LLMs for online interface jobs. Through OCR-generated text representations, Tarsier enables an LLM without vision to comprehend the structure of a webpage, beating vision-language models in benchmarks.
mistralai/Mistral-7B-Instruct-v0.3. The Mistral-7B-Instruct-v0.3 Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-7B-v0.3.
Distributed inference on Llama cpp. Llama.cpp now supports distributed inference across several machines. Although it is currently restricted to FP16, this is a significant step toward large-scale deployment of open-source models.
Enhancing Long-Term Memory for Language Models. A novel method called Streaming Infinite Retentive LLM (SirLLM) aids large language models in retaining lengthier memory over the course of lengthy conversations.
Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering. Visual text rendering poses a fundamental challenge for contemporary text-to-image generation models, with the core problem lying in text encoder deficiencies. To achieve accurate text rendering, we identify two crucial requirements for text encoders: character awareness and alignment with glyphs.

Perspectives

Link description
The people charged with making sure AI doesn’t destroy humanity have left the building. If OpenAI can’t keep its own team together, what hope is there for the rest of the industry? Plus, AI-generated ‘slop’ is taking over the internet
Spam, junk … slop? The latest wave of AI behind the ‘zombie internet’. Tech experts hope new term for carelessly automated AI webpages and images can illuminate its damaging impact
As the AI world gathers in Seoul, can an accelerating industry balance progress against safety? Companies such as OpenAI and Meta push ahead, but it is clear that biggest changes are yet to come
What happened to OpenAI’s long-term AI risk team? Former team members have either resigned or been absorbed into other research groups.
What’s up with Llama 3? Arena data analysis. When it comes to open-ended creative activities, Meta's Llama 3-70B language model outperforms competitors in the English Chatbot Arena, but it struggles with more technical suggestions. The results of the analysis show that Llama 3's victory rate drops as the instructions get harder and that it excels at friendly, conversational responses. Even if Llama 3's approachability might have helped it succeed, more research is needed to determine its true competitive advantage.
ChatGPT can talk, but OpenAI employees sure can’t. The exits of Ilya Sutskever and Jan Leike have brought to light OpenAI's stringent exit agreement, which forbade former employees from criticizing the company on pain of forfeiting their vested equity. In response to the article, CEO Sam Altman said there would be a correction.
AlphaFold3 — why did Nature publish it without its code? The latest iteration of the protein-structure-prediction algorithm AlphaFold has generated a great deal of interest since its release, accompanied by a paper in Nature, earlier this month. But its release has also prompted questions, and criticism, of both the AlphaFold team at Google DeepMind in London and Nature.
China’s ChatGPT: what a boom in Chinese chatbots means for AI. ChatGLM is one of hundreds of AI language models being developed for the Chinese language. It comes close to ChatGPT on many measures, say its creators.
The Old-Fashioned Library at the Heart of the A.I. Boom. OpenAI's remodeled mayonnaise factory headquarters, with its library-themed interior design, is a symbol of the company's success with ChatGPT, which focuses on language. On the other hand, the office reminds people of the current legal disputes around the use of copyrighted content in AI training. The library is seen as a place for inspiration by OpenAI employees, despite these disagreements, which supports their conviction that AI-driven and human creativity can work together harmoniously.
Chaos and tension at OpenAI. Concerns over OpenAI's dedication to AI safety have led to Ilya Sutskever's departure, which could be concerning given that three other important employees have also quit recently. Concerns are raised regarding how the company's marketing efforts may affect its nonprofit status and safety-focused purpose given these departures. These incidents might also have an impact on the legal and regulatory systems, drawing attention from Washington stakeholders.
AI is the reason interviews are harder now. This essay addresses how technical interview questions are becoming more complicated and how employers are expecting candidates to answer harder challenges faster. It emphasizes how non-technical users can benefit from using AI technologies like Ultracode to help them pass these kinds of interviews. The article recommends in-person interviews as a way to make sure applicants genuinely have the programming abilities required for the position.
What I've Learned Building Interactive Embedding Visualizations. An enthusiast for interactive embedding visualizations describes their well-honed process for producing these kinds of visuals, which illustrate the complex relationships between items depicted as points in three-dimensional areas. Data gathering, co-occurrence matrix construction, sparsification, PyMDE embedding, and 2D projection are the steps in the process that provide a clear visual representation. The author advocates for the accessibility and GPU-accelerated rendering capabilities of web apps by using them for the user interface.

meme-of-the-week

Back to index

ML news: Week 13 - 19 May

Research

Link description
Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models. Language models rely on separately trained tokenizers, which can produce tokens that are never encountered during language model training; even the most capable contemporary models contain many such under-trained tokens. This study investigates the phenomenon and offers methods for detecting and handling these tokens.
Unlearning in Recommender Systems. With a novel technique called E2URec, large language model-based recommendation systems can now effectively and efficiently forget user data while maintaining privacy and performance.
Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers. A project called Lumina seeks to provide a single text-to-X generation mechanism. Its training process involves interleaving text, video, audio, and pictures, which enhances downstream performance.
MatterSim: A Deep Learning Atomistic Model Across Elements, Temperatures, and Pressures. In AI, simulators can be very effective tools for gathering training data or facilitating interactions between models. A wide range of elemental atomic interactions can be modeled with this simulator.
SGTR+: End-to-end Scene Graph Generation with Transformer. A new, more effective technique for producing scene graphs has been discovered by researchers. Their transformer-based approach aims to enhance the model's comprehension and interconnection of many parts in a picture, resulting in enhanced performance on complex tasks.
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model. A vision-language model called InternLM-XComposer2 is very good at producing and comprehending intricate text-image information. It surpasses current approaches in multimodal content production and interpretation by introducing a Partial LoRA technique for a balanced vision and text comprehension.
MambaOut: Do We Really Need Mamba for Vision? The Mamba architecture is typically suited to tasks with long-sequence and autoregressive characteristics. Researchers examined this design and its application to vision tasks, finding that while Mamba is not effective for image classification, it shows promise for detection and segmentation tasks, which do have long-sequence characteristics.
State-Free Inference of State-Space Models: The Transfer Function Approach. For deep learning, a new state-space model with a dual transfer function representation has been created. A state-free sequence parallel inference approach is one of its features.
Learning A Spiking Neural Network for Efficient Image Deraining. A Spiking Neural Network (SNN) called ESDNet is intended for image deraining applications. It increases spike signal strength by taking advantage of the special qualities of rain pixel values.
Controllable and Interactive 3D Assets Generation with Proxy-Guided Conditioning. Making 3D models is difficult. A coarse mesh can be entered initially, and then the generation process can be carried out, giving users more precise control and higher-quality model output.
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding. Particularly for Chinese and English, the recently created Hunyuan-DiT establishes a standard for text-to-image diffusion transformers. It has a sophisticated data pipeline and transformer structures for ongoing model enhancement.
Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance. A method to improve the quality of images produced by diffusion models without extra training or external modules is called Perturbed-Attention Guidance (PAG). PAG leads to a significant improvement in the structure and fidelity of both unconditional and conditional samples by innovative manipulation of the self-attention mechanisms within the model.
SqueezeTime. SqueezeTime is a new lightweight network that enhances temporal analysis for mobile video understanding by condensing the time axis of videos into the channel dimension.

News

Link description
OpenAI confirms May 13 event for ‘some ChatGPT and GPT-4 updates’. Following a report that the company plans to launch a Google Search competitor next week, OpenAI has just confirmed a May 13 event for new “ChatGPT and GPT-4” updates.
Bye-bye bots: Altera’s game-playing AI agents get backing from Eric Schmidt. Autonomous, AI-based players are coming to a gaming experience near you, and a new startup, Altera, is joining the fray to build this new guard of AI agents.
BLIP3. Salesforce has trained and released the 3rd non-commercial version of the popular BLIP models, vision, and language models mainly used for image understanding and captioning.
Asterisk/Zvi on California's AI Bill. Regulations on AI models with a processing capacity of more than 10^26 FLOPs are proposed by the California SB1047 law. By demanding secure surroundings, quick deactivation capabilities, and thorough misuse possibility testing, it focuses on ensuring these models are used securely. The measure aims to address worries about the possible impact of AI on society by balancing innovation with safeguards against exploitation, and it only targets high-risk scenarios.
Bedrock Studio is Amazon’s attempt to simplify generative AI app development. Amazon is launching a new tool, Bedrock Studio, designed to let organizations experiment with generative AI models, collaborate on those models, and ultimately build generative AI-powered apps.
New GPT-4o AI model is faster and free for all users, OpenAI announces. Tech company reveals new flagship model that ‘is the future of interaction between ourselves and the machines’
Introducing GPT-4o and more tools to ChatGPT free users. Today we are introducing our newest model, GPT-4o, and will be rolling out more intelligence and advanced tools to ChatGPT for free.
Open sourcing IBM’s Granite code models. In order to make coding across several platforms easier and more efficient, IBM is making its Granite code models—which span a range of programming activities and have between 3 and 34 billion parameters—available to the open-source community.
Bloomberg: Apple finalizing a deal with OpenAI to bring ChatGPT features to iOS 18. Apple is finalizing an agreement with OpenAI to bring some of its technology to the iPhone this year, according to a new report from Bloomberg. With this deal, the report explains that Apple will be able to offer “a popular chatbot” powered by ChatGPT as part of its AI-focused features in iOS 18.
OpenAI says it can now identify images generated by OpenAI — mostly. The company said its new tool correctly identified 98% of images generated by DALL-E 3
Microsoft is ‘turning everyone into a prompt engineer’ with new Copilot AI features. Copilot for Microsoft 365 is getting auto-complete, rewrite, and more to improve AI prompts.
Gemini breaks new ground with a faster model, longer context, AI agents, and more. At I/O 2024, Google unveiled a slew of new features, including Imagen 3, Veo video generation, Gemini Flash, and Project Astra, its newest assistant. Among the many noteworthy enhancements are the 2-million-token context length, significantly reduced model costs, and improved multimodality.
Anthropic is expanding to Europe and raising more money. Anthropic said Monday that Claude, its AI assistant, is now live in Europe with support for “multiple languages,” including French, German, Italian, and Spanish across Claude.ai, its iOS app, and its business plan for teams.
Elon Musk’s xAI nears $10 bln deal to rent Oracle’s AI servers, The Information reports. Elon Musk's artificial intelligence startup xAI has been talking to Oracle executives about spending $10 billion to rent cloud servers from the company over a period of years, The Information reported on Tuesday, citing a person involved in the talks.
OpenAI co-founder who had a key role in the attempted firing of Sam Altman departs. Ilya Sutskever helped orchestrate dramatic firing and rehiring of ChatGPT maker’s CEO last year
Google rolls out AI-generated, summarized search results in US. Tech giant also reveals AI assistant in progress, currently called Project Astra, and AI video generator Veo at annual I/O conference
OpenAI chief scientist Ilya Sutskever is officially leaving. Ilya Sutskever, OpenAI’s co-founder and chief scientist who helped lead the infamous failed coup against Sam Altman and then later changed his mind, is officially leaving the company.
Project IDX, Google’s next-gen IDE, is now in open beta. At it’s annual Google I/O 2024 developer conference on Tuesday, Google announced that Project IDX, the company’s next-gen, AI-centric browser-based development environment, is now in open beta. The company first launched it as an invite-only service gated by a waitlist in August.
Researchers build AI-driven sarcasm detector. Being able to detect the lowest form of wit could help AI interact with people more naturally, say scientists
Hugging Face is sharing $10 million worth of computing to help beat the big AI companies. ZeroGPU gives everyone the chance to create AI apps without the burden of GPU costs.
OpenAI partners with Reddit to integrate unique user-generated content into ChatGPT. Reddit, the widely popular social news aggregation and discussion platform, and OpenAI, the renowned AI research laboratory, have announced a strategic partnership that promises to revolutionize the way users interact with online communities and experience AI-powered features.
Meta is reportedly working on camera-equipped AI earphones. The company believes earphones are the future of AI-wearable technology.
Cursor's instant full file edits with speculative editing. Using a bespoke Llama 3 70B model with a speculative prior, the researchers were able to rewrite files almost instantly at a rate of 1,000 tokens per second. They achieved this with some creative output formatting and no diffs; a toy sketch of the idea follows.
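A toy sketch of the trick under assumed interfaces (`verify_batch` stands in for a batched model forward pass; this is not Cursor's code): the unchanged file acts as the speculative draft, agreeing spans are accepted in bulk, and decoding only slows where the edit actually diverges.

```python
def speculative_edit(verify_batch, draft_tokens, chunk=16):
    """Rewrite a file using its current contents as the speculative draft."""
    out, i = [], 0
    while i < len(draft_tokens):
        proposal = draft_tokens[i:i + chunk]
        n_agree, correction = verify_batch(out, proposal)
        out.extend(proposal[:n_agree])      # accept the agreeing span in bulk
        i += n_agree
        if n_agree < len(proposal):         # divergence: take the model's token
            out.append(correction)
            i += 1                          # (toy assumption: 1-for-1 replacement)
    return out

# Dummy "model" that wants to rename x -> y; a real verify_batch would run
# one forward pass over the whole proposed span.
def dummy_verify(prefix, proposal):
    for k, tok in enumerate(proposal):
        if tok == "x":
            return k, "y"
    return len(proposal), None

file_tokens = list("let x = x + 1")
print("".join(speculative_edit(dummy_verify, file_tokens)))  # let y = y + 1
```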
Improvements to data analysis in ChatGPT. Interact with tables and charts and add files directly from Google Drive and Microsoft OneDrive.

Resources

Link description
ThunderKittens CUDA DSL. Hazy Research has unveiled a novel DSL for CUDA kernel development. Its flash attention kernel, written in only 100 lines of code, is 30% faster.
AnythingLLM. A full-stack application that enables you to turn any document, resource, or piece of content into a context that any LLM can use as references during chatting. This application allows you to pick and choose which LLM or Vector Database you want to use as well as supporting multi-user management and permissions.
Mirage: A Multi-level Superoptimizer for Tensor Algebra. Mirage is a tensor algebra superoptimizer that automatically discovers highly optimized tensor programs for DNNs. Mirage automatically identifies and verifies sophisticated optimizations, many of which require joint optimization at the kernel, thread block, and thread levels of the GPU compute hierarchy. For an input DNN, Mirage searches the space of potential tensor programs that are functionally equivalent to the DNN to discover highly optimized candidates. This approach allows Mirage to find new custom kernels that outperform existing expert-designed ones.
audio-diffusion-pytorch. A fully featured audio diffusion library for PyTorch. Includes models for unconditional audio generation, text-conditional audio generation, diffusion autoencoding, upsampling, and vocoding. The provided models are waveform-based; however, the U-Net (built using a-UNet), DiffusionModel, diffusion method, and diffusion samplers are all generic to any dimension and highly customizable to work on other formats.
Pipecat. Pipecat is a framework for building voice (and multimodal) conversational agents. Things like personal coaches, meeting assistants, story-telling toys for kids, customer support bots, intake flows, and snarky social companions.
MRSegmentator: Robust Multi-Modality Segmentation of 40 Classes in MRI and CT Sequences. A novel tool called MRSegmentator has been developed to improve the segmentation of MRI scans. It can successfully detect 40 distinct organs and structures in the abdominal, pelvic, and thoracic areas.
Time-Evidence-Fusion-Network. A unique deep learning model called the Time-Evidence Fusion Network (TEFN) is intended to improve long-term time series forecasting. Information fusion and evidence theory are combined, and a specific module is used to increase prediction stability and accuracy.
moondream2-coyo-5M-captions. 5M novel captions based on the alt-text and images of a portion of the COYO dataset.
WebLlama. We are thrilled to release Llama-3-8B-Web, the most capable agent built with 🦙 Llama 3 and finetuned for web navigation with dialogue.
Ollama on Google Firebase. For Firebase, Genkit is a new toolkit for developing and implementing generative applications. Open-source language model servers can be launched with it.
Finetune PaliGemma. This notebook shows how to finetune PaliGemma on a vision-language task. The training data consists of 90 pairs of images and long captions describing them. To make it runnable on a T4 colab runtime with 16GB HBM and 12GB RAM, only the attention layers of the language model are finetuned, with the other parameters frozen; a sketch of that freezing pattern follows.
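A sketch of that freezing pattern (the `self_attn` naming is an assumption; check the parameter names of the model you actually load):

```python
import torch.nn as nn

def freeze_all_but_attention(model: nn.Module, attn_keyword: str = "self_attn"):
    """Enable gradients only for parameters whose names mark them as attention."""
    for name, param in model.named_parameters():
        param.requires_grad = attn_keyword in name
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / {total:,} ({trainable / total:.1%})")

# Demo on a small stand-in model; a real run would load PaliGemma instead.
toy = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=64, nhead=4),
                            num_layers=2)
freeze_all_but_attention(toy)
```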
Gemini Flash. Google has released a new lightweight model called Gemini Flash, which has a lengthy context window of up to one million tokens and multimodal reasoning.
DeepMind Veo. Google DeepMind has released Veo, a new AI model for creating videos that can produce clips of over one minute in 1080p HD.
IC-Light. IC-Light is a project to manipulate the illumination of images.
EfficientTrain++. EfficientTrain++ presents a novel curriculum learning technique that can cut the training time of popular visual models such as ResNet and Swin on ImageNet by up to three times.
NousResearch/Hermes-2-Theta-Llama-3-8B. Hermes-2 Θ is a merged and then further RLHF'ed version of our excellent Hermes 2 Pro model and Meta's Llama-3 Instruct model, combining the best of both worlds of each model.
Energy-based Hopfield Boosting for Out-of-Distribution Detection. A method called Hopfield Boosting makes use of contemporary Hopfield energy to improve machine learning models' ability to recognize out-of-distribution (OOD) data.
OpenAI’s custom GPT Store is now open to all for free. OpenAI is making a number of its previously subscription-only features available to free users of ChatGPT, with the biggest being the ability to browse its GPT Store and use custom bots, said CTO Mira Murati during the company’s Spring update livestream today. The company also published today’s updates in a blog on its website.
llama3.np. llama3.np is a pure NumPy implementation of the Llama 3 model. To verify the accuracy of the implementation, the author ran the stories15M model trained by Andrej Karpathy.

Perspectives

Link description
ChatGPT and the like could free up coders to new heights of creativity. Far from making programmers an endangered species, AI will release them from the grunt work that stifles innovation
Superhuman? Top AI labs are focused on achieving artificial general intelligence (AGI), with estimates for its realization ranging from 2027 to 2047. Even though AI hasn't yet reached AGI, certain systems already exhibit superhuman abilities on particular tasks, suggesting that AI's best use right now is as a co-intelligence that complements human efforts rather than replacing them.
Large language models (e.g., ChatGPT) as research assistants. Artificial intelligence (AI) systems, such as GPT-4, are helping and even surpassing academics in tasks like producing research articles. According to Liang et al., AI is used in up to 18% of publications in some domains. This AI integration may result in a cycle where academic publications are both produced and reviewed by software. The effect on scientific advancement is complex, though: while it may increase output, there is also a risk of an era in which more papers are published but less genuine knowledge is created.
What OpenAI did. The integration of voice and vision in GPT-4o's multimodal skills holds great potential for improving AI's ability to interact with the outside world and laying the groundwork for AI to become a more commonplace presence in day-to-day life.
OpenAI’s new GPT-4o model offers promise of improved smartphone assistants. System can operate directly in speech, speeding up responses and noticing voice quirks, but it still needs the power of Siri
Why mathematics is set to be revolutionized by AI. Cheap data and the absence of coincidences make maths an ideal testing ground for AI-assisted discovery — but only humans will be able to tell good conjectures from bad ones.
Major AlphaFold upgrade offers boost for drug discovery. The latest version of the AI models how proteins interact with other molecules — but DeepMind restricts access to the tool.
Lethal AI weapons are here: how can we control them? Autonomous weapons guided by artificial intelligence are already in use. Researchers, legal experts, and ethicists are struggling with what should be allowed on the battlefield.
AI spending grew 293% last year. Here's how companies are using AI to stay ahead. According to Ramp's Q1 data, its clients' expenditure on AI has increased by 293% year over year, surpassing the rise of all software investment. AI is also being widely used in non-tech businesses including financial services and healthcare, suggesting a wider integration of AI across a range of industries. Even though there is a general slowdown in new investments in AI, businesses that are already utilizing the technology are doubling down. The average amount spent on AI tools has climbed by 138% year over year, and businesses are still cautious when it comes to travel expenses.
AI Copilots Are Changing How Coding Is Taught. Professors are shifting away from syntax and emphasizing higher-level skills
Test Driving ChatGPT-4o. Inspired by ChatGPT vs Math (2023), let’s see how ChatGPT-4o performs.
As the AI world gathers in Seoul, can an accelerating industry balance progress against safety? Companies such as OpenAI and Meta push ahead, but it is clear that the biggest changes are yet to come.

meme-of-the-week

Back to index

ML news: Week 6 - 12 May

Research

Link description
Mantis: Interleaved Multi-Image Instruction Tuning. A newly developed dataset and trained visual language model that allow for better instruction over a series of images.
FeNNol: an Efficient and Flexible Library for Building Force-field-enhanced Neural Network Potentials. A state-of-the-art library called FeNNol makes it easier to create and use hybrid neural network potentials in molecular simulations.
Spider: A Unified Framework for Context-dependent Concept Understanding. Spider is a revolutionary unified paradigm intended to improve comprehension of context-dependent (CD) concepts that rely largely on visual context, like medical lesions and items concealed in the environment.
Frequency-mixed Single-source Domain Generalization for Medical Image Segmentation. A novel algorithm known as RaffeSDG has been created by researchers to enhance the precision of medical imaging models when evaluating data from various sources.
SlotGAT: Slot-based Message Passing for Heterogeneous Graph Neural Network. SlotGAT is a new approach that improves heterogeneous graph neural networks by addressing the semantic mixing issue in traditional message passing.
Frequency Masking for Universal Deepfake Detection. By concentrating on masked image modeling, particularly in the frequency domain, this novel technique detects deepfakes. The approach differs from conventional methods and demonstrates a notable improvement in recognizing synthetic images, even those from recently developed AI generative techniques.
Auto-Encoding Morph-Tokens for Multimodal LLM. Researchers have created "Morph-Tokens" to enhance AI's capacity for image creation and visual comprehension. These tokens take advantage of the sophisticated processing capabilities of the MLLM framework to convert abstract notions required for comprehension into intricate graphics for image creation.
Introducing AlphaFold 3. In a paper published in Nature, we introduce AlphaFold 3, a revolutionary model that can predict the structure and interactions of all life’s molecules with unprecedented accuracy. For the interactions of proteins with other molecule types we see at least a 50% improvement compared with existing prediction methods, and for some important categories of interaction, we have doubled prediction accuracy.
ImageInWords: Unlocking Hyper-Detailed Image Descriptions. An extraordinarily detailed coupling of images and text was produced via a novel labeling technique that made use of two passes of VLMs. Strong multimodal models can be trained with the help of the captions, which include significantly more detail than any previous dataset.
Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer. To get beyond memory constraints in the creation of ultra-high-resolution images, a novel diffusion model presents a unidirectional block attention mechanism.
DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks. A novel model called DocRes handles five tasks in one system: de-warping, deshadowing, appearance enhancement, deblurring, and binarization, making document image restoration easier.
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving. QoQ is a unique quantization approach that leverages a 4-bit KV cache, 8-bit activations, and 4-bit weights to accelerate big language model inference.
Navigating Chemical Space with Latent Flows. ChemFlow is a new framework that uses deep generative models to rapidly navigate chemical space, improving molecular science.
Consistency Large Language Models: A Family of Efficient Parallel Decoders. One intriguing paradigm of ongoing research is predicting many tokens at once. If it works, generation times for many large language models would be significantly reduced. This post's method accelerates generation by running a parallel, Jacobi-style decoding mechanism on fine-tuned LLMs, akin to consistency models from image synthesis. Initial results show roughly 3x speedups, on par with speculative decoding.
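A minimal sketch of the Jacobi-style parallel decoding loop, with an assumption: `predict_all` is a hypothetical helper that re-predicts every position of the guessed block in one forward pass. The CLLM fine-tuning is what makes these iterations converge in very few steps.

```python
def jacobi_decode(predict_all, prompt_tokens, block_len, max_iters=32):
    """Iterate a guessed block of tokens to a fixed point.

    predict_all(prompt, guess) -> the model's greedy prediction for every
    position of `guess`, computed in one forward pass (hypothetical helper).
    """
    guess = [0] * block_len              # arbitrary initialization
    for _ in range(max_iters):
        new_guess = predict_all(prompt_tokens, guess)
        if new_guess == guess:           # fixed point matches sequential greedy decoding
            break
        guess = new_guess
    return guess
```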
You Only Cache Once: Decoder-Decoder Architectures for Language Models. The decoder-decoder YOCO architecture maintains global attention capabilities while using less GPU RAM. It is made up of a cross-decoder and a self-decoder, which enable effective key-value pair caching and reuse. With notable gains in throughput, latency, and inference memory over standard Transformers, YOCO performs favorably and is appropriate for big language models and extended context lengths.
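A toy sketch of the decoder-decoder layout, assuming ordinary PyTorch attention in place of the paper's efficient self-decoder: the self-decoder's output is cached once as a global KV tensor, and every cross-decoder layer reuses it, so cache memory no longer grows with depth.

```python
import torch
import torch.nn as nn

class ToyYOCO(nn.Module):
    def __init__(self, d=256, heads=4, n_self=2, n_cross=4):
        super().__init__()
        self.self_layers = nn.ModuleList(
            nn.MultiheadAttention(d, heads, batch_first=True) for _ in range(n_self))
        self.cross_layers = nn.ModuleList(
            nn.MultiheadAttention(d, heads, batch_first=True) for _ in range(n_cross))

    def forward(self, x):                            # x: (batch, seq, d)
        seq = x.size(1)
        causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool), 1)
        h = x
        for attn in self.self_layers:                # self-decoder
            h = h + attn(h, h, h, attn_mask=causal)[0]
        kv = h                                       # cached ONCE: "you only cache once"
        for attn in self.cross_layers:               # cross-decoder reuses the same kv
            x = x + attn(x, kv, kv, attn_mask=causal)[0]
        return x

out = ToyYOCO()(torch.randn(2, 10, 256))             # (2, 10, 256)
```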
Optimal Group Fair Classifiers from Linear Post-Processing. This innovative post-processing approach ensures compliance with many group fairness criteria, including statistical parity, equal opportunity, and equalized odds, by recalibrating output scores after imposing a "fairness cost" to address model bias.
DiffMatch: Visual-Language Guidance Makes Better Semi-supervised Change Detector. DiffMatch is a new semi-supervised change detection technique that generates pseudo labels for unlabeled data by using visual language models, hence offering extra supervision signals.
Gemma-10M Technical Overview. A technical overview of Gemma-10M, which extends the Gemma model to context windows of up to 10 million tokens while keeping memory requirements manageable.
Vision Mamba: A Comprehensive Survey and Taxonomy. a thorough examination of Mamba's uses in a range of visual tasks and its changing significance. Keep up with the latest discoveries and developments about the Mamba project.

News

Link description
Lamini Raises $25M For Enterprises To Develop Top LLMs In-House. Lamini, an Enterprise AI platform, lets software teams within enterprises create new LLM capabilities that lessen hallucinations on proprietary data, run their LLMs securely from cloud VPCs to on-premise, and scale their infrastructure with model evaluations that put ROI and business outcomes ahead of hype. Amplify Partners led a $25 million Series A financing round.
Microsoft-backed OpenAI may launch a search engine, taking on Google's 'biggest product'. Speculation in the tech world suggests that OpenAI is gearing up for a major announcement, possibly a new search engine. According to Jimmy Apples, who reports the claim as an insider, the company is planning an event this month (May), tentatively scheduled for May 9, 2024, at 10 am.
An AI-controlled fighter jet took the Air Force leader for a historic ride. What that means for war. AI marks one of the biggest advances in military aviation since the introduction of stealth in the early 1990s, and the Air Force has aggressively leaned in. Even though the technology is not fully developed, the service is planning for an AI-enabled fleet of more than 1,000 unmanned warplanes, the first of them operating by 2028.
Stack Overflow and OpenAI Partner to Strengthen the World’s Most Popular Large Language Models. Stack Overflow and OpenAI today announced a new API partnership that will empower developers with the collective strengths of the world’s leading knowledge platform for highly technical content and the world’s most popular LLMs for AI development.
Elon Musk’s Plan For AI News. Musk emails with details on AI-powered news inside X. An AI bot will summarize news and commentary, sometimes looking through tens of thousands of posts per story.
Microsoft says it did a lot for responsible AI in the inaugural transparency report. The report covers its responsible AI achievements in 2023 but doesn’t talk about Mario flying a plane to the Twin Towers.
Cohere’s Command R Model Family is Now Available In Amazon Bedrock. Cohere’s Command R family of models can now be accessed through Amazon Bedrock.
Fake Monet and Renoir on eBay among 40 counterfeits identified using AI. Paintings identified as fake using cutting-edge technology are ‘tip of the iceberg’ specialist Dr Carina Popovici says
‘A chilling prospect’: should we be scared of AI contestants on reality shows? Netflix’s hit show The Circle recently introduced an AI chatbot contestant, a potentially worrying sign of where we’re heading
‘ChatGPT for CRISPR’ creates new gene-editing tools. In the never-ending quest to discover previously unknown CRISPR gene-editing systems, researchers have scoured microbes in everything from hot springs and peat bogs to poo and even yogurt. Now, thanks to advances in generative artificial intelligence (AI), they might be able to design these systems with the push of a button.
Microsoft Working on ‘Far Larger’ In-House AI Model. Microsoft is reportedly working on a new, in-house artificial intelligence (AI) model that is “far larger” than the other open source models it has trained.
Apple unveils M4: Its first chip made for AI from the ground up. Apple on Tuesday unveiled M4, the next generation of its Apple Silicon chip. Built with the 3-nanometer chip architecture, M4 is the first Apple chip to be built for AI from the ground up. M4 is the chip that powers the new generation iPad Pro and will soon be inside Macs
OpenAI Model Spec. This is the first draft of the Model Spec, a document that specifies desired behavior for our models in the OpenAI API and ChatGPT. It includes a set of core objectives, as well as guidance on how to deal with conflicting objectives or instructions.
AI engineers report burnout and rushed rollouts as ‘rat race’ to stay competitive hits tech industry. Artificial intelligence engineers at top tech companies told CNBC that the pressure to roll out AI tools at breakneck speed has come to define their jobs. They say that much of their work is assigned to appease investors rather than to solve problems for end users and that they are often chasing OpenAI. Burnout is an increasingly common theme as AI workers say their employers are pursuing projects without regard for the technology’s effect on climate change, surveillance, and other potential real-world harms.
The teens making friends with AI chatbots. Teens are opening up to AI chatbots as a way to explore friendship. But sometimes, the AI’s advice can go too far.
GPT-2-Chatbot Confirmed As OpenAI. The gpt-2-chatbot recently appeared in the LMSYS arena; after information surfaced via a 429 rate-limit error from OpenAI's API, it was confirmed that this was a new model from OpenAI.
OpenAI Is Readying a Search Product to Rival Google, Perplexity. The feature would let ChatGPT users search the web and cite sources in its results.
DatologyAI raises $46M Series A. The data curation platform adds to the $11 million seed round it raised in September, with the goal of growing its workforce and advancing corporate development.
Yellow raises $5M from A16z for Gen AI-powered 3D modeling tool. Yellow has raised $5 million in seed funding from A16z Games to fund further development of its Gen AI-powered 3D modeling tool. With its YellowSculpt tool, artists can generate clean, pre-rigged 3D character meshes based on a text prompt in under three minutes.
Stable Artisan: Media Generation and Editing on Discord. Stable Artisan enables media generation on Discord powered by Stability AI’s cutting-edge image and video models, Stable Diffusion 3, Stable Video Diffusion, and Stable Image Core. In addition to media generation, Stable Artisan offers tools to edit your creations like Search and Replace, Remove Background, Creative Upscale, and Outpainting.
ElevenLabs previews music-generating AI model. Voice AI startup ElevenLabs is offering an early look at a new model that turns a prompt into song lyrics. To raise awareness, it’s following a similar playbook Sam Altman used when OpenAI introduced Sora, its video-generating AI, soliciting ideas on social media and turning them into lyrics.
Sources: Mistral AI raising at a $6B valuation, SoftBank ‘not in’ but DST is. Paris-based Mistral AI, a startup working on open source large language models — the building block for generative AI services — has been raising money at a $6 billion valuation, three times its valuation in December, to compete more keenly against the likes of OpenAI and Anthropic, TechCrunch has learned from multiple sources.
Leaked Deck Reveals How OpenAI Is Pitching Publisher Partnerships. The generative artificial intelligence firm OpenAI has been pitching partnership opportunities to news publishers through an initiative called the Preferred Publishers Program, according to a deck obtained by ADWEEK and interviews with four industry executives.
Alibaba rolls out the latest version of its large language model to meet robust AI demand. Alibaba Cloud on Thursday said its large language model has seen more than 90,000 deployments in companies across industries. Alibaba Cloud said the latest version of its Tongyi Qianwen model, Qwen2.5, possesses “remarkable advancements in reasoning, code comprehension, and textual understanding compared to its predecessor Qwen2.0.”

Resources

Link description
Prometheus-Eval. GPT-4 is widely used as a judge for evaluating generation quality. Prometheus, built upon Mistral, is a model that excels at this particular purpose.
Bonito. Bonito is an open-source model for conditional task generation: the task of converting unannotated text into task-specific training datasets for instruction tuning. This repo is a lightweight library for Bonito to easily create synthetic datasets built on top of the Hugging Face transformers and vllm libraries.
Penzai. Penzai is a JAX library that provides clear, useful Pytree structures for training and interpreting models. It comes with a wide range of tools for component analysis, debugging, and model visualization. Penzai is easy to install and use, and it offers comprehensive tutorials for learning how to create and interact with neural networks.
Realtime Video Stream Analysis with Computer Vision. This in-depth article shows you how to create a system that generates reports on the density of vehicle traffic. It counts cars over time using state-of-the-art computer vision.
DOCCI - Descriptions of Connected and Contrasting Images. A great new dataset from Google that contains detailed and comprehensive labels.
Unsloth.ai: Easily finetune & train LLMs. An animation by Unsloth's founder demonstrating how the team builds kernels, designs API surfaces, and utilizes PyTorch. The framework and library of Unsloth are incredibly robust and user-friendly.
LeRobot. LeRobot aims to provide models, datasets, and tools for real-world robotics in PyTorch. The goal is to lower the barrier to entry to robotics so that everyone can contribute and benefit from sharing datasets and pre-trained models. LeRobot contains state-of-the-art approaches that have been shown to transfer to the real-world with a focus on imitation learning and reinforcement learning.
Vibe-Eval. A benchmark for evaluating multimodal chat models, including especially challenging examples.
DeepSeek-V2-Chat. DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token. Compared with DeepSeek 67B, DeepSeek-V2 achieves stronger performance and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times.
Visual Reasoning Benchmark. The ability of models to comprehend and interact with text and visuals is quickly developing, as demonstrated by GPT-4V. Their important limits in visual deductive reasoning are revealed by a recent study. Using challenging visual puzzles similar to those in IQ tests, researchers assessed these models and found that they had trouble with multi-step reasoning and abstract pattern recognition.
AI Index: State of AI in 13 Charts. In the new report, foundation models dominate, benchmarks fall, prices skyrocket, and on the global stage, the U.S. overshadows.
Buzz Pretraining Dataset. Preference data is a new addition to the pretraining mix in Buzz. Multiple models that were trained on this data have also been made available by its researchers. They discovered that the models show good results on several tasks related to human preferences.

Perspectives

Link description
From Baby Talk to Baby A.I. Could a better understanding of how infants acquire language help us build smarter A.I. models?
The AI Hardware Dilemma. Even while recent AI-powered hardware releases, such as the Humane Pin and Rabbit R1, have drawn criticism, the industry is still receiving a lot of venture capital investment, and well-known individuals like Sam Altman are considering making sizable investments. The appeal is in AI's ability to transform consumer hardware through the innovative use of sensors, silicon, and interfaces. Though hardware startups find it difficult to compete with well-established tech giants, AI still needs to evolve, making it difficult to provide a compelling alternative to flexible smartphones.
AI Prompt Engineering Is Dead. Automating prompt optimization for AI models points to more effective, model-driven prompt generation techniques in the future, possibly rendering human prompt engineering unnecessary.
The Next Big Programming Language Is English. GitHub Copilot Workspace is a robust programming tool that allows users to code in plain English via the browser, from planning to implementation. It is currently available in a limited technical preview. In contrast to ChatGPT, the AI easily integrates with codebases, suggesting block-by-block code execution and managing complex tasks with less active user interaction.
Is AI lying to me? Scientists warn of growing capacity for deception. Researchers find instances of systems double-crossing opponents, bluffing, pretending to be human and modifying behavior in tests

meme-of-the-week

Back to index

ML news: Week 29 April - 5 May

Research

Link description
Let's Think Dot by Dot: Hidden Computation in Transformer Language Models. This paper demonstrates that filler '...' tokens can stand in for chain-of-thought (CoT) reasoning. This requires specific model training, but it illustrates that a model can conceal its reasoning, making the CoT steps difficult to interpret.
Tracking with Human-Intent Reasoning. TrackGPT transforms object tracking by integrating the capabilities of Large Vision-Language Models. It can interpret implicit tracking instructions, simplifying the procedure and improving performance, as demonstrated by its outstanding performance on the new InsTrack benchmark and other hard datasets.
AAPL: Adding Attributes to Prompt Learning for Vision-Language Models. By employing adversarial token embedding, researchers have created a novel technique known as AAPL that improves vision-language models' ability to generalize to classes unseen during training.
NExT: Teaching Large Language Models to Reason about Code Execution. A fundamental skill among human developers is the ability to understand and reason about program execution. we propose NExT, a method to teach LLMs to inspect the execution traces of programs (variable states of executed lines) and reason about their run-time behavior through chain-of-thought (CoT) rationales. Specifically, NExT uses self-training to bootstrap a synthetic training set of execution-aware rationales that lead to correct task solutions (e.g., fixed programs) without laborious manual annotation.
Open Gato Replication: JAT. DeepMind's GATO was hailed as a generalist agent. JAT is a Jack-of-All-Trades model that has been trained and assessed by a team affiliated with Hugging Face. It has demonstrated reasonable performance across an extensive range of tasks.
FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design. Although it can be unstable, reducing floating point precision speeds up training. This work demonstrates that without common instabilities or slowdowns from naive approaches, full tensor core usage may be achieved in a new packing structure.
StarCoder2-Instruct: Fully Transparent and Permissive Self-Alignment for Code Generation. Both synthetic and human data are used to train this model. Released under a permissive license, it achieves a HumanEval score of 72.6. The creators provide excellent details on how to reproduce their data pipeline and apply the ideas to other problems where synthetic data may be beneficial.
Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations. Using trained sparse embeddings, Seismic is a novel way to organize inverted indexes that greatly improves text retrieval speed and accuracy.
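For intuition, a minimal inverted index over sparse embeddings might look like the sketch below. This is an assumption-level simplification: Seismic's contribution is the clustered, pruned organization of the posting lists, which the toy omits.

```python
from collections import defaultdict

index = defaultdict(list)               # term id -> posting list of (doc id, weight)

def add_document(doc_id, sparse_vec):   # sparse_vec: {term_id: weight}
    for term, weight in sparse_vec.items():
        index[term].append((doc_id, weight))

def search(query_vec, k=10):
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():     # touch only matching posting lists
        for doc_id, weight in index[term]:
            scores[doc_id] += q_weight * weight  # lazily accumulate the dot product
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

add_document(0, {3: 0.5, 7: 1.2})
add_document(1, {7: 0.9, 9: 0.4})
print(search({7: 1.0}))                 # doc 0 scores higher than doc 1
```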
Learning Invariant Representations of Graph Neural Networks via Cluster Generalization. A novel technique called Cluster Information Transfer (CIT) mechanism is intended to improve Graph Neural Networks' (GNNs') ability to adapt to various and dynamic graph architectures.
Meta-Prompting. Using a technique called meta-prompting, a single language model can become a multi-skilled team. By decomposing intricate activities into smaller components that are managed by specialized instances of the same model, this technique greatly enhances performance on a variety of tasks.
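A minimal sketch of the conductor-and-experts loop, with assumptions: `ask` is a hypothetical single-call LLM helper, and the paper's actual conductor prompt and expert instructions are far more elaborate.

```python
def meta_prompt(task: str, ask) -> str:
    # The "conductor" instance decomposes the task...
    plan = ask(f"Break this task into independent expert subtasks, one per line:\n{task}")
    results = []
    for subtask in plan.splitlines():
        if not subtask.strip():
            continue
        # ...and each subtask goes to a fresh "expert" instance of the SAME
        # model, specialized purely through its prompt.
        results.append(ask(f"You are a domain expert. Solve only this subtask:\n{subtask}"))
    # Finally the conductor synthesizes the expert outputs.
    return ask(f"Task: {task}\nExpert results:\n" + "\n".join(results)
               + "\nCombine these into a single final answer.")
```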
KAN: Kolmogorov-Arnold Networks. Today's AI makes extensive use of multi-layer perceptrons, notably in the Transformer, where MLP blocks sit between the attention layers. These MLPs, however, use fixed activation functions. Motivated by the Kolmogorov-Arnold representation theorem (a multivariate continuous function can be written as a superposition of univariate functions), this study proposes placing learned activation functions, parameterized as splines, on the edges of the network in place of ordinary weights. Although the architecture is more intricate, it has intriguing properties that might help with interpretability.
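The toy sketch below illustrates the core idea under a stated simplification: Gaussian radial basis functions stand in for the paper's B-splines, and the base activation term is omitted. Each edge (input i, output j) carries its own learnable one-dimensional function.

```python
import torch
import torch.nn as nn

class ToyKANLayer(nn.Module):
    def __init__(self, in_dim, out_dim, num_basis=8, grid=(-2.0, 2.0)):
        super().__init__()
        self.register_buffer("centers", torch.linspace(grid[0], grid[1], num_basis))
        self.width = (grid[1] - grid[0]) / num_basis
        # one coefficient vector per edge: shape (out_dim, in_dim, num_basis)
        self.coef = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, num_basis))

    def forward(self, x):                     # x: (batch, in_dim)
        # evaluate every basis function at every input: (batch, in_dim, num_basis)
        basis = torch.exp(-((x.unsqueeze(-1) - self.centers) / self.width) ** 2)
        # output_j = sum_i phi_{j,i}(x_i), the learned function on each edge
        return torch.einsum("bik,oik->bo", basis, self.coef)

out = ToyKANLayer(4, 3)(torch.randn(2, 4))    # shape (2, 3)
```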
Lightplane: Highly-Scalable Components for Neural 3D Fields. A new technique significantly reduces the memory usage of 2D-3D mappings via two components: the Lightplane Renderer, which creates images from neural 3D fields, and the Lightplane Splatter, which projects image features back into 3D hash structures.
CLIP-Mamba: CLIP Pretrained Mamba Models with OOD and Hessian Evaluation. The new Mamba model, trained using contrastive language-image pretraining (CLIP), shows impressive efficiency and performance in zero-shot image classification.
MicroDreamer. Scientists have created a novel 3D creation method called MicroDreamer that greatly speeds up the procedure by lowering the quantity of function evaluations needed.
Model Quantization and Hardware Acceleration for Vision Transformers: A Comprehensive Survey. This paper explores how optimized hardware combined with algorithmic modifications can improve the performance of ViTs, especially via model quantization.
Spikformer V2: Join the High Accuracy Club on ImageNet with an SNN Ticket. Spikformer V2 blends the biological efficacy of Spiking Neural Nets (SNNs) with the self-attention mechanism. This novel model improves its energy-efficient visual feature processing through the use of a Convolutional Stem and a Spiking Self-Attention mechanism.
Full-frequency dynamic convolution: a physical frequency-dependent convolution for sound event detection. A novel technique called Full-Frequency Dynamic Convolution (FFDConv) improves 2D convolution for sound event identification. FFDConv increases sound event detection accuracy by creating distinct frequency kernels for every band, particularly with regard to the frequency properties of the sounds.
Boosting Segment Anything Model with Adversarial Tuning. One well-known foundation model in computer vision, Meta AI's Segment Anything Model (SAM), performs well at image segmentation but worse in other domains. This project introduces ASAM, which boosts SAM's performance through adversarial tuning.
SUNDAE: Spectrally Pruned Gaussian Fields with Neural Compensation. This work presents SUNDAE, a novel technique that uses neural compensation and spectral pruning to improve memory efficiency.
Long-Context Data Engineering. The technique presented in this work allows language models to be greatly extended to context lengths of up to 128K, highlighting the significance of training data diversity and quantity.
StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control. StreamMultiDiffusion is a framework that enables real-time region-based text-to-image generation.

News

Link description
BBC presenter’s likeness used in advert after firm tricked by AI-generated voice. Science presenter Liz Bonnin’s accent, as regular BBC viewers know, is Irish. But this voice message, ostensibly granting permission to use her likeness in an ad campaign, seemed to place her on the other side of the world.
Tesla Autopilot feature was involved in 13 fatal crashes, US regulator says. Federal transportation agency finds Tesla’s claims about feature don’t match their findings and opens second investigation
Apple and OpenAI are reportedly in talks for iOS 18 integration. Apple has been talking to several big AI companies in pursuit of a potential partnership for on-device chatbot capabilities. According to Bloomberg, Apple and OpenAI discussed a potential deal earlier this year. Those talks have since reopened, according to people with knowledge of the matter. The possible agreement could be about OpenAI integrations into iOS 18.
The little smart home platform that could. This week, Home Assistant announced it is now part of the Open Home Foundation. The newly formed non-profit will own and govern all of Home Assistant and its related entities. Its creators and inaugural board members — Schoutsen, Guy Sie, Pascal Vizeli, and J. Nick Koston — all work on Home Assistant, and the foundation has no other members so far.
Jensen Huang and Sam Altman among tech chiefs invited to federal AI Safety Board. Leaders of the world's most prominent AI companies are being recruited for the Homeland Security Department's new advisory group.
OpenAI to use Financial Times journalism to train artificial intelligence systems. Under deal, ChatGPT users will receive summaries and quotes from Financial Times content and links to articles. The deal is the ChatGPT maker's latest with a media company.
Japan to trial AI bear warning system after record number of attacks. Six people have been killed and more than 200 injured in attacks by bears over the past year
Copilot Workspace. GitHub has revealed Copilot Workspace, a new effort to let language models implement features and fix bugs in a semi-autonomous manner.
OpenAI introduces "Memory" feature for ChatGPT Plus users. OpenAI has enabled the "Memory" feature for all ChatGPT Plus users, the company announced via X. Memory allows users to tell ChatGPT things they want it to remember across chats. The feature can be turned on and off in the settings.
Intel brings quantum-computing microchips a step closer. By adapting methods for fabricating and testing conventional computer chips, researchers have brought silicon-based quantum computers closer to reality — and to accessing the immense benefits of a mature chipmaking industry.
NATO is boosting AI and climate research as scientific diplomacy remains on ice. As the military alliance created to counter the Soviet Union expands, it is prioritizing studies on how climate change affects security, cyberattacks and election interference.
ChatGPT’s chatbot rival Claude to be introduced on iPhone. Challenger to market leader OpenAI says it wants to ‘meet users where they are’ and become part of users’ everyday life
Amazon sales soar with boost from artificial intelligence and advertising. Revenue at Amazon Web Services increases to $25bn as retail giant releases earnings report surpassing Wall Street expectations
Eight US newspapers sue OpenAI and Microsoft for copyright infringement. The Chicago Tribune, Denver Post and others file suit saying the tech companies ‘purloin millions’ of articles without permission
Apple poaches AI experts from Google, creates secretive European AI lab. Apple has poached dozens of artificial intelligence experts from Google and has created a secretive European laboratory in Zurich, as the tech giant builds a team to battle rivals in developing new AI models and products.
Diddo’s new funding will bring its shoppable TV API to streaming platforms. Diddo is an API for streaming services and other platforms to integrate shoppable videos, enabling consumers to buy their favorite characters’ clothing and accessories directly on their screens. The company announced Wednesday that it raised $2.8 million in seed funding.
Cognition Seeks $2 Billion Valuation for AI Code-Writing Tool. Cognition Labs is reportedly aiming to become the next multibillion-dollar artificial intelligence (AI) startup. The company, which is developing an AI tool for writing code, is in discussions with investors to raise money at a valuation of up to $2 billion, The Wall Street Journal (WSJ) reported Sunday (March 31).
Apple to unveil AI-enabled Safari browser alongside new operating systems. Apple is testing a version of its Safari web browser that includes UI tweaks, advanced content blocking features, and a new AI-powered tool dubbed Intelligent Search, AppleInsider has learned. The software — expected to debut as Safari 18 later in 2024 — is currently undergoing evaluation alongside internal builds of Apple's next-generation operating system updates, namely iOS 18 and macOS 15, according to people familiar with the matter. Should all of the new features make it to the release candidate stage, users will be treated to a new user interface (UI) for customizing popular page controls, a "Web eraser" feature, and AI-driven content summarization tools.
This AI startup backed by Nvidia is now worth $19 billion. Nvidia Corp.-backed AI startup CoreWeave has nearly tripled in value to $19 billion following its latest round of funding. CoreWeave, which rents out chips housed in data centers across the U.S. that customers use to create and deploy AI systems, raised $642 million from investors in its prior funding round.
How Field AI Is Conquering Unstructured Autonomy. One of the biggest challenges for robotics right now is practical autonomous operation in unstructured environments. But over the past few years, this has started to change, thanks in large part to a couple of pivotal robotics challenges put on by DARPA. The DARPA Subterranean Challenge ran from 2018 to 2021, putting mobile robots through a series of unstructured underground environments.
Amazon Q, a generative AI-powered assistant for businesses and developers. With the use of a company's internal data, AWS has introduced Amazon Q, a generative AI assistant designed to enhance software development and decision-making. With natural language interaction, Amazon Q provides data-driven help for business users and makes coding, testing, and app development easier for developers. Amazon Q Apps is another feature of the service that makes it possible to create unique AI apps without any coding experience.
GPT-2? When the enigmatic gpt2-chatbot model surfaced on lmsys.org, rumors spread that it resembles GPT-4.5 and is an unofficial OpenAI test of an upcoming release. Important indicators, including answer quality, OpenAI-specific features, and rate limits, point to a high degree of sophistication and could be signs of a covert OpenAI benchmarking project. The AI community is still investigating and debating the origins and capabilities of the gpt2-chatbot.
OpenAI's GPT-4 can exploit real vulnerabilities by reading security advisories. AI agents, which combine large language models with automation software, can successfully exploit real world security vulnerabilities by reading security advisories, academics have claimed.
Apple reports slumping iPhone sales as global demand weakens. iPhone sales fell 10% compared with the same time period last year, but the company still beat Wall Street’s expectations
Microsoft bans US police departments from using enterprise AI tool for facial recognition. Microsoft has reaffirmed its ban on U.S. police departments from using generative AI for facial recognition through Azure OpenAI Service, the company’s fully managed, enterprise-focused wrapper around OpenAI tech.
Meta plans to build $800 million, next-generation data center in Montgomery. MONTGOMERY, Alabama — Governor Kay Ivey announced today that technology company Meta Platforms plans to open an $800 million data center in Alabama’s capital city that will support 100 operational jobs and build on the company’s previous investment in the state.

Resources

Link description
Cohere Launches Developer Toolkit to Accelerate Building Gen AI Apps. This toolkit is an open-source repository of production-ready applications that you can deploy across cloud providers.
Video-Language models with PLLaVA. A novel pooling technique has been developed by researchers to enable the adaptation of image-language AI models for video applications, making the new model known as PLLaVA stand out.
luminal. Luminal is a deep learning library that uses composable compilers to achieve high performance.
torchtitan. torchtitan is a proof-of-concept for Large-scale LLM training using native PyTorch. It is (and will continue to be) a repo to showcase PyTorch's latest distributed training features in a clean, minimal codebase.
OpenLIT. OpenLIT is an OpenTelemetry-native GenAI and LLM Application Observability tool. It's designed to make the integration process of observability into GenAI projects as easy as pie – literally, with just a single line of code. Whether you're working with popular LLM Libraries such as OpenAI and HuggingFace or leveraging vector databases like ChromaDB, OpenLIT ensures your applications are monitored seamlessly, providing critical insights to improve performance and reliability.
Llamafile’s progress, four months in. Self-contained executables called Llamafiles allow models to run instantly on a variety of platforms. It promises significant portability advantages and a two-fold speed increase.
Implementing FrugalGPT: Reducing LLM Costs & Improving Performance. FrugalGPT describes steps you can take to significantly lower LLM API expenses, including prompt compression, caching, and model cascades.
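A minimal sketch of two of those levers, caching and a model cascade, with assumptions: `call_model`, `score`, and the model names are placeholders rather than any real API.

```python
import hashlib

_cache: dict = {}

def cached_call(prompt, model, call_model):
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    if key not in _cache:                    # pay the API only for novel prompts
        _cache[key] = call_model(prompt, model)
    return _cache[key]

def cascade(prompt, call_model, score, threshold=0.8):
    """Try cheap models first; escalate only when the scorer distrusts the answer."""
    answer = None
    for model in ("cheap-model", "mid-model", "big-model"):   # hypothetical names
        answer = cached_call(prompt, model, call_model)
        if score(prompt, answer) >= threshold:
            break
    return answer
```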
Graph Machine Learning in the Era of Large Language Models (LLMs). Therefore, in this survey, we first review the recent developments in Graph ML. We then explore how LLMs can be utilized to enhance the quality of graph features, alleviate the reliance on labeled data, and address challenges such as graph heterogeneity and out-of-distribution (OOD) generalization. Afterward, we delve into how graphs can enhance LLMs, highlighting their abilities to enhance LLM pre-training and inference. Furthermore, we investigate various applications and discuss the potential future directions in this promising field.
A Survey on Self-Evolution of Large Language Models. In this work, we present a comprehensive survey of self-evolution approaches in LLMs. We first propose a conceptual framework for self-evolution and outline the evolving process as iterative cycles composed of four phases: experience acquisition, experience refinement, updating, and evaluation. Second, we categorize the evolution objectives of LLMs and LLM-based agents.
Effort. A possibly new algorithm for LLM inference. Effort allows real-time adjustment of how much computation is performed during LLM inference on Apple Silicon CPUs, striking a balance between speed and quality. The technique loads fewer weights into the models so they run faster; it involves precomputation and conversion but does not require retraining. The implementation may be downloaded from GitHub; the creators are looking for help from Swift/Metal engineers to optimize it.
whisper.cpp-cli. A fully self-contained speech-to-text system built on top of Whisper
memary: Open-Source Longterm Memory for Autonomous Agents. Agents use LLMs that are currently constrained to finite context windows. memary overcomes this limitation by allowing your agents to store a large corpus of information in knowledge graphs, infer user knowledge through our memory modules, and only retrieve relevant information for meaningful responses.
mistral.rs. Mistral.rs is a fast LLM inference platform supporting inference on a variety of devices, quantization, and easy-to-use application with an Open-AI API compatible HTTP server and Python bindings.
Autodidax: JAX core from scratch. Ever want to learn how JAX works, but the implementation seemed impenetrable? Well, you’re in luck! By reading this tutorial, you’ll learn every big idea in JAX’s core system. You’ll even get clued into our weird jargon!
cjpais/moondream2-llamafile. a completely standalone VLM executable built on the Moondream 2 model, offering strong performance for its size and suitable for edge devices.
The open-source language model computer. The 01 Project is building an open-source ecosystem for AI devices.
Meta Releases ExecuTorch Framework for LLM on Edge Devices. Meta's ExecuTorch framework, an on-device runtime with post-training quantization tooling, makes it possible to run Llama models on a variety of iPhone and Galaxy devices. It can reach up to 11 tokens per second with 7B-parameter language models on mobile devices.
A Survey on Vision Mamba: Models, Applications and Challenges. Without the computational limitations of conventional Transformers, the Mamba model represents a cutting-edge method that performs exceptionally well when handling lengthy sequences.
The cuda-checkpoint Utility. a brand-new Nvidia toolbox that enables CUDA state checkpointing for resuming and transferring. Distributed training of very big AI models can benefit from it.
Friends Don't Let Friends Make Bad Graphs. In the field of AI research nowadays, visualizing model evaluation scores is essential. But a lot of charts do a poor job of communicating the desired data. This repository includes some excellent charts as well as dos and don'ts for result visualization.
phospho: Text Analytics Platform for LLM Apps. Phospho is the text analytics platform for LLM apps. Detect issues and extract insights from text messages of your users or your app. Gather user feedback and measure success. Iterate on your app to create the best conversational experience for your users.
FlowTestAI. FlowTestAI is the world's first open-source, GenAI-powered Integrated Development Environment (IDE) designed especially for creating, visualizing, and managing API-first workflows.
A transformer walk-through, with Gemma. Understanding the Transformer is an endeavor that often takes several tries. This blog post walks through the Gemma architecture and explains everything in detail. It is clear and has code and figures.
Vibe-Eval: A new open and hard evaluation suite for measuring progress of multimodal language models. Vibe-Eval is comprised of 269 ultra high quality image-text prompts and their ground truth responses. The quality of prompts and responses has been extensively checked multiple times by our team. Moreover, Vibe-Eval was designed to be difficult, challenging even to the current frontier models, and to induce greater separability among frontier-class models.
RALM_Survey. This is a repository of RALM surveys containing a summary of state-of-the-art RAG and other technologies according to our survey paper: RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing. In this repository, we present the central research approaches of our paper and keep up to date with work on RALM in the most accessible way possible.
NousResearch/Hermes-2-Pro-Llama-3-8B. The next iteration of Hermes, which was trained on a freshly cleaned dataset atop Llama 3, is now accessible. This model would be a valuable agent since it is very good at invoking functions.
databonsai. databonsai is a Python library that uses LLMs to perform data cleaning tasks.
InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions. The InstructDr model is engineered to perform exceptionally well in a range of visual document interpretation tasks, including information extraction and question answering. Through the use of big language models combined with document images, InstructDr can outperform existing models and adapt to new tasks and datasets.

Perspectives

Link description
The demise of Twitter: how a ‘utopian vision’ for social media became a ‘toxic mess’. In the early days it was seen as a place for ‘genuine public discourse’, but users have fled since Elon Musk took over. What went wrong?
AI isn't useless. But is it worth it? This article offers a critical analysis of artificial intelligence (AI) and machine learning, contending that although these technologies can be helpful for specific tasks, they frequently fall short of the lofty claims made by AI businesses.
Binding Public Sector AI Diffusion. The public sector is the target of the OMB's new AI executive order policy, which could significantly hamper AI progress owing to bureaucratic roadblocks and strict safety regulations. The rules, which are being implemented in the face of declining IT funding, have the potential to stall initiatives that are essential to updating government services in addition to slowing the adoption of AI. Opponents fear that these limitations, in addition to funding reductions, may make it impossible for agencies to stay up with technology advancements in industries like healthcare.
A.I. Start-Ups Face a Rough Financial Reality Check. The table stakes for small companies to compete with the likes of Microsoft and Google are in the billions of dollars. And even that may not be enough.
The rewards of reusable machine learning code. Research papers can make a long-lasting impact when the code and software tools supporting the findings are made readily available and can be reused and built on. Our reusability reports explore and highlight examples of good code sharing practices.
The curious case of the test set AUROC. The area under the receiver operating characteristic curve (AUROC) of the test set is used throughout machine learning (ML) for assessing a model’s performance. However, when concordance is not the only ambition, this gives only a partial insight into performance, masking distribution shifts of model outputs and model instability.
Federated learning is not a cure-all for data ethics. Although federated learning is often seen as a promising solution to allow AI innovation while addressing privacy concerns, we argue that this technology does not fix all underlying data ethics concerns. Benefiting from federated learning in digital health requires acknowledgement of its limitations.
How scholars armed with cutting-edge technology are unfurling secrets of ancient scrolls. Researchers and Silicon Valley are using tools powered by AI to uncover lives of ancient philosophers
Friends From the Old Neighborhood Turn Rivals in Big Tech’s A.I. Race. Demis Hassabis and Mustafa Suleyman, who both grew up in London, feared a corporate rush to build artificial intelligence. Now they’re driving that competition at Google and Microsoft.
The Great Talent Dividend and NYC's AI Opportunity. NYC's leadership in AI is a testament to its rich talent pool and expanding stature as a hub for AI. Tech professionals and AI unicorns have been drawn to NYC's tech ecosystem. Resources such as top institutions and a $400 million fund from the AI Research Consortium power it.
How AI apps make money. With an emphasis on per-user fees, most AI apps have embraced traditional subscription-based pricing models in recent years, reflecting their function as digital assistants rather than human worker replacements. Newer AI companies are starting to use creative pricing techniques, like outcome-based models, which charge only for good outcomes, potentially increasing client adoption and revenue.
Danger and opportunity for news industry as AI woos it for vital human-written copy. With large language models needing quality data, some publishers are offering theirs at a price while others are blocking access

meme-of-the-week

Back to index

ML news: Week 21 - 28 April

Research

Link description
Moving Object Segmentation: All You Need Is SAM (and Flow). The temporal consistency of videos makes object segmentation difficult. This work presents the use of optical flow in conjunction with a potent image segmentation model to achieve compelling performance on this task.
From r to Q∗: Your Language Model is Secretly a Q-Function. A somewhat technical reinforcement learning paper demonstrating that a language model trained with DPO implicitly parameterizes a token-level Q-function, providing a theoretical bridge between reward models and base models.
decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points. A quantization technique called DecoupleQ dramatically improves large model accuracy at ultra-low bit levels. By dividing the model parameters into integer and floating-point components, which are subsequently optimized using conventional techniques, this approach reorganizes the quantization process.
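For intuition, a toy round-to-nearest version of the decoupling: each weight becomes an integer grid index (the integer part) plus a per-row scale and offset (the floating-point part). This is only a sketch; decoupleQ's contribution is jointly optimizing the floating-point part with conventional solvers.

```python
import torch

def quantize_2bit(w: torch.Tensor):
    """Toy per-row 2-bit uniform quantization: w ≈ q * s + z."""
    levels = 3                                            # 2 bits -> integers 0..3
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    s = ((w_max - w_min) / levels).clamp(min=1e-8)        # floating-point scale
    z = w_min                                             # floating-point offset
    q = torch.clamp(torch.round((w - z) / s), 0, levels)  # integer part
    return q, s, z

w = torch.randn(4, 16)
q, s, z = quantize_2bit(w)
print((w - (q * s + z)).abs().max())                      # reconstruction error
```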
MoVA: Adapting Mixture of Vision Experts to Multimodal Context. MoVA is a multimodal large language model (MLLM) that integrates various visual encoders selectively to enhance the understanding of image material. By employing a context-aware expert routing method and a mixture-of-vision expert adaptor to dynamically fuse knowledge from many sources, it overcomes the drawbacks of existing encoders such as CLIP.
MambaMOS: LiDAR-based 3D Moving Object Segmentation with Motion-aware State Space Model. MambaMOS is a novel method that researchers have created for segmenting moving objects in LiDAR point clouds.
Training-and-prompt Free General Painterly Image Harmonization Using image-wise attention sharing. TF-GPH is a novel painterly image harmonization technique that uses a new "share-attention module" to avoid the need for training data or prompts.
FinLangNet: A Novel Deep Learning Framework for Credit Risk Prediction Using Linguistic Analogy in Financial Data. A model called FinLangNet was created to improve risk prediction in the financial industry. FinLangNet mirrors linguistic structures, applying natural language processing techniques to model credit loan trajectories.
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. Phi 3 is a family of models ranging in size from 3.8B to 14B parameters that does remarkably well on contemporary benchmarks; the smallest model is said to outperform the original ChatGPT model. Weights for the smallest model have been released, and a variant with a 128k context length is offered.
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation. SEED-X addresses practical application issues to develop multimodal foundation models. It can generate images with different levels of detail and comprehend images of any size and aspect ratio.
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. OpenAI found that training models to weight system prompts more strongly than lower-privilege messages significantly increases resistance to adversarial attacks and jailbreaks.
MultiBooth: Towards Generating All Your Concepts in an Image from Text. In order to improve multi-concept image generation, MultiBooth presents a two-phase methodology that addresses the issues of idea integrity and high costs associated with alternative approaches.
6Img-to-3D. With just six input photographs, a unique technique called 6Img-to-3D employs transformers to produce 3D-consistent graphics.
Simple probes can catch sleeper agents. Language models known as "sleeper agents" can be trained to carry out malevolent actions in response to predetermined trigger words. Simple linear probes on the model's activations, combined with the question "Are you going to do something dangerous?", identify these otherwise hidden backdoored models with remarkable accuracy.
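A minimal sketch of such a probe, with assumptions: `get_residual_activations` is a hypothetical helper returning a hidden-state vector for the probe question in a given context, and the labels mark whether the sleeper behavior is triggered.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

PROBE_PROMPT = "Are you going to do something dangerous?"

def fit_probe(contexts, labels, get_residual_activations):
    # one activation vector per context, read while the model processes the
    # probe question; labels: 1 = trigger present, 0 = safe
    X = np.stack([get_residual_activations(PROBE_PROMPT, ctx) for ctx in contexts])
    y = np.asarray(labels)
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    print("train accuracy:", probe.score(X, y))
    return probe
```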
Taming Diffusion Probabilistic Models for Character Control. A character control framework has been introduced that exploits probabilistic motion diffusion models to produce a series of high-quality animations that respond instantly to dynamic user commands.
CutDiffusion: A Simple, Fast, Cheap, and Strong Diffusion Extrapolation Method. CutDiffusion is a new approach that transforms low-resolution diffusion models to meet high-resolution needs without the complexities of traditional tuning.
Graph Neural Networks for Vulnerability Detection: A Counterfactual Explanation. A new tool called CFExplainer enhances the ability of AI models—more especially, Graph Neural Networks—to comprehend and recognize security flaws in software.
Conformal Predictive Systems Under Covariate Shift. Weighted CPS (WCPS) is a kind of conformal predictive system that adapts to changes in the data distribution, particularly covariate shift.
Masked Modeling with Multi-View Video for Autonomous Driving Representation Learning. MIM4D is a novel method that uses dual masked image modeling to extract temporal and spatial features from multi-view videos, improving visual representation learning for autonomous driving.
FR-NAS: Forward-and-Reverse Graph Predictor for Efficient Neural Architecture Search. This work on Neural Architecture Search (NAS) introduces a forward-and-reverse Graph Neural Network (GNN) predictor that improves the efficiency of finding the best neural network configurations for particular tasks.
Raformer: Redundancy-Aware Transformer for Video Wire Inpainting. A new dataset and technique for enhancing wire removal in videos—a frequent visual effect problem in movies and TV shows—have been presented by researchers.

News

Link description
Updates from Google DeepMind Alignment research. Following Anthropic, GDM has published some of the results of its alignment efforts. The most insightful part is the application of sparse autoencoders to Gemini Ultra, which represents a significant scale-up for interpretability work.
NVIDIA To Collaborate With Japan On Their Cutting-Edge ABCI-Q Quantum Supercomputer. Japan is progressing rapidly in quantum and AI computing through large-scale developments built with the help of NVIDIA's AI and HPC infrastructure.
Brave Search is adopting AI to answer your queries. Privacy-focused search engine Brave announced Wednesday that it is revamping its answer engine to return AI-powered synthesized answers. The new feature is available to users across the globe.
Llama 3 is not very censored. Llama 3 feels significantly less censored than its predecessor. The Llama 3 models have substantially lower false refusal rates, with less than 1⁄3 the number of false refusals when compared to Llama 2, making it possible to discuss a wider range of interesting topics!
OpenAI's GPT-4 can exploit real vulnerabilities by reading security advisories. Researchers have shown that OpenAI's GPT-4 model outperforms other models and tools like vulnerability scanners, with an 87% success rate in autonomously exploiting security vulnerabilities listed in CVE advisories.
US Air Force confirms first successful AI dogfight. The US Air Force is putting AI in the pilot’s seat. In an update on Thursday, the Defense Advanced Research Projects Agency (DARPA) revealed that an AI-controlled jet successfully faced a human pilot during an in-air dogfight test carried out last year.
Intel completes assembly of first commercial High-NA EUV chipmaking tool — addresses cost concerns, preps for 14A process development in 2025. Intel Foundry announced Thursday that it had completed the assembly of the industry's first commercial High Numerical Aperture (High-NA) Extreme Ultraviolet (EUV) machine in its D1X fab in Oregon -- an important milestone as the company readies research and development for its 14A process in 2025.
Adobe previews AI innovations to advance professional video workflows. With the help of its Firefly video model, Adobe is incorporating generative AI video tools into Premiere Pro, which include new features for shot extension, object addition/removal, and text-to-video functionality. The changes are intended to improve the effectiveness and creativity of video creation. They include a technological preview and the broad availability of AI-powered audio workflows.
The Ray-Ban Meta Smart Glasses have multimodal AI now. It can be handy, confidently wrong, and just plain finicky — but smart glasses are a much more comfortable form factor for this tech.
OpenAI shrugs off Meta’s Llama 3 ascent with new enterprise AI features. Even as Meta’s new Llama 3 has quickly rocketed up the charts of most-used and most customized large language models (LLMs), the rival company that ushered in the generative AI era, OpenAI, is shrugging off the competition by introducing new enterprise-grade features for building and programming atop its GPT-4 Turbo LLM and other models.
Gurman: Apple Working on On-Device LLM for Generative AI Features. Writing in his "Power On" newsletter, Gurman said that Apple's LLM underpins upcoming generative AI features. "All indications" apparently suggest that it will run entirely on-device, rather than via the cloud like most existing AI services.
Los Angeles is using AI in a pilot program to try to predict homelessness and allocate aid. In Los Angeles, the Homelessness Prevention Program uses predictive AI to identify individuals and families at risk of becoming homeless, offering aid to help them get stabilized and remain housed.
Startup Uses AI To Edit Human Data. A team of researchers at a Berkeley-based startup called Profluent say they've used generative AI technologies to edit human DNA. As the New York Times reports, the startup fed huge amounts of biological data into a large language model (LLM) to come up with new editors based on the groundbreaking gene-editing technique CRISPR, as detailed in a yet-to-be-peer-reviewed paper.
Apple releases OpenELM: small, open source AI models designed to run on-device. Just as Google, Samsung and Microsoft continue to push their efforts with generative AI on PCs and mobile devices, Apple is moving to join the party with OpenELM, a new family of open-source large language models (LLMs) that can run entirely on a single device rather than having to connect to cloud servers.
Eric Schmidt-backed Augment, a GitHub Copilot rival, launches out of stealth with $252M. In a recent StackOverflow poll, 44% of software engineers said that they use AI tools as part of their development processes now and 26% plan to soon. Gartner estimates that over half of organizations are currently piloting or have already deployed AI-driven coding assistants and that 75% of developers will use coding assistants in some form by 2028.
Sakana releases Japanese image model. A high-speed image generation model optimized for Japanese-language prompts.
Generative A.I. Arrives in the Gene Editing World of CRISPR. Much as ChatGPT generates poetry, a new A.I. system devises blueprints for the microscopic biological mechanisms that can edit your DNA, pointing to a future in which scientists can battle illness and disease with greater precision and speed than they can today.
FlexAI Launches with $30 Million in Seed Funding to Deliver Universal AI Compute. Ex-Apple, Intel, NVIDIA, and Tesla veterans rearchitect compute infrastructure to accelerate AI innovation. FlexAI, the universal AI compute company, today launched with $30 million (€28.5 million) in seed funding led by Alpha Intelligence Capital (AIC), Elaia Partners, and Heartcore Capital.
Report: Google will update Gemini Nano in time for Galaxy S25. Google’s Gemini AI models are constantly advancing, so it comes as no surprise that a new report claims Google will have a “version 2” of Gemini Nano available by the time the Galaxy S25 launches next year.
Microsoft’s heavy bet on AI pays off as it beats expectations in the latest quarter. World’s largest public company reports $61.86bn revenue after investing billions into artificial intelligence
Alphabet hails ‘once-in-a-generation’ AI opportunity as revenue rises. Shares surge after tech giant issues first-ever dividend and posts revenue of $80.5bn, up 15% since last year, despite staff turmoil
Meta value falls $190bn as investors react to plan to increase spending on AI. Shares slumped 15% after Mark Zuckerberg said AI spending would have to grow before Meta could make much revenue from products
Snowflake Arctic - LLM for Enterprise AI. The enterprise-grade LLM known as Snowflake Arctic, developed by the Snowflake AI Research Team, outperforms competitors in instruction-following benchmarks, coding, and SQL creation at a quarter of the usual cost. Arctic makes sophisticated LLM capabilities available to a larger audience by utilizing an open-source methodology and a distinctive design. Hugging Face offers the model, which will also be incorporated into other platforms and services.
Nvidia acquires AI workload management startup Run:ai for $700M, sources say. Nvidia is acquiring Run:ai, a Tel Aviv-based company that makes it easier for developers and operations teams to manage and optimize their AI hardware infrastructure. Terms of the deal aren’t being disclosed publicly, but two sources close to the matter tell TechCrunch that the price tag was $700 million.
Apple has acquired the Paris-based artificial intelligence startup Datakalab amid its push to deliver on-device AI tools.
Drake Uses AI Tupac and Snoop Dogg Vocals on ‘Taylor Made Freestyle,’ References Taylor Swift’s New Album ‘The Tortured Poets Department’. On Friday night (April 19), the rapper released a song on his social media entitled “Taylor Made Freestyle,” which uses AI vocals from Tupac Shakur and Snoop Dogg on a stopgap between diss records as he awaits Kendrick Lamar’s reply to his freshly released “Push Ups.”

Resources

Link description
Fine-tune Llama 3 with ORPO. ORPO is an exciting new fine-tuning technique that combines the traditional supervised fine-tuning and preference alignment stages into a single process, reducing the computational resources and time required for training. Moreover, empirical results demonstrate that ORPO outperforms other alignment methods on various model sizes and benchmarks. An illustrative training sketch appears after this list.
Mistral Common. Mistral-common is a set of tools to help you work with Mistral models. Our first release contains tokenization. Our tokenizers go beyond the usual text <-> tokens, adding parsing of tools and structured conversation. We also release the validation and normalization code that is used in our API.
LongEmbed. This repository is the official implementation for the paper "LongEmbed: Extending Embedding Models for Long Context Retrieval"
FineWeb: 15T high quality web tokens. 15T tokens were used to train the most recent Llama 3 models. This new dataset yields high-quality models and includes a large deduplicated corpus from the common crawl.
A Visual Guide to Vision Transformers. This is a visual guide to Vision Transformers (ViTs), a class of deep learning models that have achieved state-of-the-art performance on image classification tasks. This guide will walk you through the key components of Vision Transformers in a scroll story format, using visualizations and simple explanations to help you understand how these models work and what the flow of the data through the model looks like.
The Cauldron VLM data. 50 language and vision datasets merged into a single format to enable better model training.
MAexp: A Generic Platform for RL-based Multi-Agent Exploration. MAexp is a generic, high-efficiency platform designed for multi-agent exploration, encompassing a diverse range of scenarios and MARL algorithms.
Practitioners Guide to Triton. Triton is a high-level, Python-style language for writing low-level GPU kernels, and it can significantly improve the efficiency of your AI models. A minimal kernel sketch appears after this list.
Efficiently fine-tune Llama 3 with PyTorch FSDP and Q-Lora. A great blog covering a fast, memory-efficient fine-tuning recipe for the recent Llama 3 model using PyTorch; a Q-LoRA setup sketch appears after this list.
Layer Pruning of Large Language Models. This repository hosts the unofficial implementation of a layer pruning strategy for Large Language Models (LLMs) based on the insights from the paper "The Unreasonable Ineffectiveness of the Deeper Layers" by Andrey Gromov et al.
A Trivial Jailbreak Against Llama 3. A trivial programmatic Llama 3 jailbreak.
LLaMA3-Quantization. Given the wide application of low-bit quantization for LLMs in resource-limited scenarios, this work explores Llama 3's capabilities when quantized to low bit-width. The exploration can unveil new insights and challenges for low-bit quantization of Llama 3 and other forthcoming LLMs, especially the performance degradation that occurs under compression.
Instructor: Structured LLM Outputs. Instructor is a Python library that makes it a breeze to work with structured outputs from large language models (LLMs). Built on top of Pydantic, it provides a simple, transparent, and user-friendly API to manage validation, retries, and streaming responses. A usage sketch appears after this list.
How does ChatGPT work? As explained by the ChatGPT team. Sometimes the best explanations of how a technology solution works come from the software engineers who built it. To explain how ChatGPT (and other large language models) operate, I turned to the ChatGPT engineering team.
BitBLAS. Microsoft has released a collection of GPU-accelerated kernels for BitNet-style model training. These kernels significantly reduce memory usage without sacrificing much accuracy.
CoreNet: A library for training deep neural networks. CoreNet is a deep neural network toolkit from Apple that allows researchers and engineers to train standard and novel small and large-scale models for a variety of tasks, including foundation models (e.g., CLIP and LLM), object classification, object detection, and semantic segmentation.
MaxText. MaxText is a high-performance, highly scalable, open-source LLM written in pure Python/Jax and targeting Google Cloud TPUs and GPUs for training and inference. MaxText achieves high MFUs and scales from single hosts to very large clusters while staying simple and "optimization-free" thanks to the power of Jax and the XLA compiler.
Cohere Toolkit. Cohere has released a chat interface with numerous useful capabilities for building AI-powered chat apps.
BAAI/Bunny-Llama-3-8B-V. Bunny is a family of lightweight but powerful multimodal models. It offers multiple plug-and-play vision encoders, like EVA-CLIP, SigLIP, and language backbones, including Llama-3-8B, Phi-1.5, StableLM-2, and Phi-2. To compensate for the decrease in model size, we construct more informative training data by curated selection from a broader data source.
Finetune Llama 3 - 2x faster + 6x longer context + 68% less VRAM. 6x longer context length with dramatically less VRAM usage than Hugging Face with Flash Attention.
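
Below is an illustrative sketch of the ORPO recipe from the entry above, using TRL's ORPOTrainer. The model and dataset names are placeholders, and exact argument names may differ across TRL versions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

# Illustrative names; any causal LM and any preference dataset with
# "prompt", "chosen", and "rejected" columns should work.
model_id = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = ORPOConfig(
    output_dir="llama3-orpo",
    beta=0.1,  # weight of the odds-ratio (preference) term added to the SFT loss
    per_device_train_batch_size=2,
    num_train_epochs=1,
)
# ORPO needs no separate reference model: one trainer, one training stage.
trainer = ORPOTrainer(model=model, args=config, train_dataset=dataset, tokenizer=tokenizer)
trainer.train()
```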
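For a taste of what the Triton guide covers, here is the canonical vector-addition kernel: each program instance loads one block of elements, adds them, and stores the result.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                    # which block this instance handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                    # guard the tail of the vector
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x, y = torch.rand(4096, device="cuda"), torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)             # one program per 1024-element block
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
assert torch.allclose(out, x + y)
```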
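And a minimal sketch of the Q-LoRA half of the FSDP + Q-Lora recipe (the FSDP wrapping is omitted): the base model is loaded in 4-bit NF4 and only low-rank adapters are trained. Model name and hyperparameters are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4 from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()            # only the adapters are trainable
```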
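Instructor's core pattern, per its documentation, looks roughly like this: patch the OpenAI client, then pass a Pydantic model as response_model so the completion is validated against the schema.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

class UserInfo(BaseModel):
    name: str
    age: int

# Patching the client makes completions validate (and retry) against the schema.
client = instructor.from_openai(OpenAI())

user = client.chat.completions.create(
    model="gpt-4-turbo",
    response_model=UserInfo,
    messages=[{"role": "user", "content": "Jason is 25 years old."}],
)
print(user.name, user.age)  # -> Jason 25
```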

Perspectives

Link description
Self-Reasoning Tokens, teaching models to think ahead. This paper presents "reasoning tokens" for language models, which produce more tokens intended to forecast future tokens instead of the one that is immediately next, improving the model's anticipatory capacity. Experiments show notable increases in prediction accuracy, indicating that more sophisticated reasoning may be possible without the need for explicit step-by-step training.
Looking for AI use-cases. This article explores the potential for transformation and the existing constraints of generative AI, such as ChatGPT. It points out that although ChatGPT performs well on simple tasks like coding and creating drafts, it has trouble with more complicated tasks that call for specialized programming. It emphasizes the necessity of a vision that links AI solutions with useful applications and stresses how difficult it is to find and incorporate these into regular workflows.
Building reliable systems out of unreliable agents. Although AI agents aren't always dependable, they can be used to create dependable systems. A few strategies are to start with basic prompts and build an iterative improvement evaluation system; to deploy with observability; to use Retrieval Augmented Generation (RAG); to think about fine-tuning the model; and to use complementary agents to strengthen each other's weaknesses and increase the overall reliability of the system.
AI leads a service-as-software paradigm shift. Many VCs are talking about AI taking a bite out of the services business. Foundation Capital believes there is $4.6 trillion worth of work to be automated, thanks to AI: both for in-house functions and outsourced services. We're entering the era of Service-as-Software.
How AI is improving climate forecasts. Researchers are using various machine-learning strategies to speed up climate modeling, reduce its energy costs and hopefully improve accuracy.
Will AI accelerate or delay the race to net-zero emissions? As artificial intelligence transforms the global economy, researchers need to explore scenarios to assess how it can help, rather than harm, the climate.
The Biggest Open-Source Week in the History of AI. The last week of March 2024 will go down as a unique moment for Open-source LLMs. China's open-source scene hits the ground running.
‘Miss AI’ is billed as a leap forward – but feels like a monumental step backward. AI models take every toxic gendered beauty norm and bundle them up into a completely unrealistic package
Why reliable AI requires a paradigm shift. Hallucinations are the fundamental barrier to the widespread use of AI, and they won't be solved anytime soon.
Should Apple Kill Siri and Start Over? The vision was grand: A personal assistant in your pocket, capable of understanding and acting upon a wide array of voice commands with ease and accuracy. So what happened?

meme-of-the-week

Back to index

ML news: Week 15 - 21 April

Research

Link description
DGMamba: Domain Generalization via Generalized State Space Model. DGMamba is a new framework that makes use of the novel state space model Mamba to address domain generalization problems.
Manipulating Large Language Models to Increase Product Visibility. The large language models used by search engines can be manipulated by adding strategic text sequences to product descriptions that promote specific products.
MindBridge: A Cross-Subject Brain Decoding Framework. MindBridge is a single model that can interpret brain activity from several subjects.
Taming Stable Diffusion for Text to 360° Panorama Image Generation. With the help of text prompts, this project presents PanFusion, a dual-branch diffusion model that creates 360-degree panoramic images. To minimize visual distortion, the technique combines the Stable Diffusion approach with a customized panoramic branch, which is further improved by a special cross-attention mechanism.
The Physics of Language Models. Scaling laws describe the relationship between the size of language models and their capabilities. Unlike prior studies that evaluate a model's capability via loss or benchmarks, we estimate the number of knowledge bits a model stores.
The Influence Between NLP and Other Fields. attempts to measure the level of influence that NLP has over 23 different fields of study; the cross-field engagement of NLP has decreased from 0.58 in 1980 to 0.31 in 2022; the study also reveals that CS dominates NLP citations, accounting for over 80% of citations with a focus on information retrieval, AI, and ML; in general, NLP is becoming more isolated, with a rise in intra-field citations and a fall in multidisciplinary works.
EventEgo3D: 3D Human Motion Capture from Egocentric Event Streams. Researchers present a unique technique utilizing a fisheye event camera to address the difficulties in monocular egocentric 3D human motion capture, particularly in challenging lighting conditions and with rapid motions.
MPPE-DST: Mixture of Prefix Prompt Experts for LLM in Zero-Shot Dialogue State Tracking. Mixture of Prefix Prompt Experts (MPPE) is a novel approach that has been created by researchers to improve zero-shot dialogue state tracking. This technique allows knowledge to be transferred to new domains without requiring additional dataset annotations.
Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding. A novel technique called Any2Point effectively transfers vision, language, and audio model capabilities into the 3D space while preserving spatial geometries.
Google’s new technique gives LLMs infinite context. A new paper by researchers at Google claims to give large language models (LLMs) the ability to work with the text of infinite length. The paper introduces Infini-attention, a technique that configures language models in a way that extends their “context window” while keeping memory and compute requirements constant.
Compression Represents Intelligence Linearly. The concept of compressing a training dataset into a model is the foundation of most contemporary AI; the better the compression, the better the model. This research thoroughly demonstrates that relationship, establishing a high correlation between benchmark scores and a model's capacity to compress novel text.
TransformerFAM: Feedback attention is working memory. TransformerFAM's feedback mechanism lets Transformers attend to their own latent representations. In theory, this recurrence could allow the model to process extremely long inputs in context.
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length. Another long-context paper, this one introduces a new architecture built on two cutting-edge weight-update techniques. It outperforms Llama 2 at the same training token count (2T) and scales to unlimited context length at inference time.
STORM: Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking. Retrieval-guided language models are used by Stanford's innovative research system, Storm, to generate reports for particular subjects.
Homography Guided Temporal Fusion for Road Line and Marking Segmentation. Accurate segmentation of road lines and markings is essential for autonomous driving, but sunlight, shadows, and vehicle occlusions make it difficult. The Homography Guided Fusion (HomoFusion) module employs a pixel-by-pixel attention mechanism and a novel surface normal estimator to recognize and classify obscured road lines from video frames.
LaSagnA: vLLM-based Segmentation Assistant for Complex Queries. Vision Language Models (vLLMs) sometimes face difficulties in distinguishing absent objects and handling many queries per image. To address these problems, this work presents a novel question style and integrates semantic segmentation into the training procedure.
A collective AI via lifelong learning and sharing at the edge. Here we review recent machine learning advances converging towards creating a collective machine-learned intelligence. We propose that the convergence of such scientific and technological advances will lead to the emergence of new types of scalable, resilient, and sustainable AI systems.
Challenges and opportunities in translating ethical AI principles into practice for children. This Perspective first maps the current global landscape of existing ethics guidelines for AI and analyses their correlation with children.
Mistral 8x22B Report and Instruction Model. Mixtral 8x22B is our latest open model. It sets a new standard for performance and efficiency within the AI community. It is a sparse Mixture-of-Experts (SMoE) model that uses only 39B active parameters out of 141B, offering unparalleled cost efficiency for its size.
Long-form music generation with latent diffusion. Stability AI's diffusion transformer model for audio synthesis.
LaDiC: A Diffusion-based Image Captioning Model. The use of diffusion models for image-to-text generation is revisited in this work. It presents the LaDiC architecture, which improves the image captioning tasks performance of diffusion models.
LINGO-2: Driving with Natural Language. This blog introduces LINGO-2, a driving model that links vision, language, and action to explain and determine driving behavior, opening up a new dimension of control and customization for an autonomous driving experience. LINGO-2 is the first closed-loop vision-language-action driving model (VLAM) tested on public roads.
Towards a general-purpose foundation model for computational pathology. We introduce UNI, a general-purpose self-supervised model for pathology, pre-trained using more than 100 million images from over 100,000 diagnostic H&E-stained WSIs (>77 TB of data) across 20 major tissue types.
A visual-language foundation model for computational pathology. We introduce CONtrastive learning from Captions for Histopathology (CONCH), a visual-language foundation model developed using diverse sources of histopathology images, biomedical text and, notably, over 1.17 million image–caption pairs through task-agnostic pretraining.
FedPFT: Federated Proxy Fine-Tuning of Foundation Models. Federated Proxy Fine-Tuning (FedPFT), a novel technique created by researchers, enhances foundation models' ability to adjust for certain tasks while maintaining data privacy.
In-Context Learning State Vector with Inner and Momentum Optimization. In this research, a novel method for improving In-Context Learning (ICL) in big language models such as GPT-J and Llama-2 is presented. The authors introduce a novel optimization technique that enhances compressed representations of the model's knowledge, referred to as "state vectors."
Decomposing and Editing Predictions by Modeling Model Computation. To determine each component's precise contribution to the final result, component modeling dissects a model's prediction process into its most fundamental parts, such as attention heads and convolution filters.

News

Link description
Grok-1.5 Vision Preview. Introducing Grok-1.5V, our first-generation multimodal model. In addition to its strong text capabilities, Grok can now process a wide variety of visual information, including documents, diagrams, charts, screenshots, and photographs. Grok-1.5V will be available soon to our early testers and existing Grok users.
Google’s new chips look to challenge Nvidia, Microsoft, and Amazon. Google’s new AI chip is a rival to Nvidia, and its Arm-based CPU will compete with Microsoft and Amazon
OpenAI Fires Researchers For Leaking Information. After months of leaks, OpenAI has apparently fired two researchers who are said to be linked to company secrets going public.
BabyLM Challenge. The goal of this shared task is to incentivize researchers with an interest in pretraining or cognitive modeling to focus their efforts on optimizing pretraining given data limitations inspired by human development. Additionally, we hope to democratize research on pretraining—which is typically thought to be practical only for large industry groups—by drawing attention to open problems that can be addressed on a university budget.
Dr. Andrew Ng appointed to Amazon’s Board of Directors. Dr. Andrew Ng is currently the Managing General Partner of AI Fund and is joining Amazon's Board of Directors.
Creating sexually explicit deepfake images to be made an offence in UK. Offenders could face jail if the image is widely shared under a proposed amendment to the criminal justice bill
Leisure centers scrap biometric systems to keep tabs on staff amid UK data watchdog clampdown. Firms such as Serco and Virgin Active pull facial recognition and fingerprint scan systems used to monitor staff attendance
Introducing OpenAI Japan. We are excited to announce our first office in Asia and we’re releasing a GPT-4 custom model optimized for the Japanese language.
Adobe’s working on generative video, too. Adobe says it’s building an AI model to generate video. But it’s not revealing when this model will launch, exactly — or much about it besides the fact that it exists.
OpenAI and Meta Reportedly Preparing New AI Models Capable of Reasoning. OpenAI and Meta are on the verge of releasing the next versions of their AI models that will supposedly be capable of reasoning and planning, the Financial Times reports. But, as with any hype coming out of big tech, take it all with a grain of salt.
Humane’s Ai Pin Isn't Ready to Replace Your Phone, But One Day It Might. AI-powered wearable Humane's Ai Pin has numerous technical problems, ranging from AI assistant glitches to music streaming concerns. Though future software updates are promised, the first-generation gadget lacks crucial functions and experiences performance gaps despite its intention to create an ambient computing experience. The Ai Pin is positioned as a companion device for a more present and less screen-focused lifestyle, yet it struggles to replace conventional smartphones despite its meticulous design.
TikTok may add AI avatars that can make ads. The new feature will let advertisers and TikTok Shop sellers generate scripts for a virtual influencer to read.
Google launches Code Assist, its latest challenger to GitHub’s Copilot. At its Cloud Next conference, Google on Tuesday unveiled Gemini Code Assist, its enterprise-focused AI code completion and assistance tool.
AI traces mysterious metastatic cancers to their source. An algorithm examines images of metastatic cells to identify the location of the primary tumor. Some stealthy cancers remain undetected until they have spread from their source to distant organs. Now scientists have developed an artificial intelligence (AI) tool that outperforms pathologists at identifying the origins of metastatic cancer cells that circulate in the body.
Apple's iOS 18 AI will be on-device preserving privacy, and not server-side. Apple's AI push in iOS 18 is rumored to focus on privacy with processing done directly on the iPhone, that won't connect to cloud services.
Introducing ALOHA Unleashed. Google DeepMind's ALOHA Unleashed is a program that pushes the boundaries of dexterity with low-cost robots and AI.
France's Mistral AI seeks funding at $5 bln valuation, The Information reports. French tech startup Mistral AI has been speaking to investors about raising several hundred million dollars at a valuation of $5 billion, The Information reported on Tuesday.
Stability AI is giving more developers access to its next-gen text-to-image generator. Developers can now access the API for the latest version of Stability AI’s text-to-image model.
European car manufacturer will pilot Sanctuary AI’s humanoid robot. Sanctuary AI announced that it will be delivering its humanoid robot to a Magna manufacturing facility. Based in Canada, with auto manufacturing facilities in Austria, Magna manufactures and assembles cars for several of Europe’s top automakers, including Mercedes, Jaguar, and BMW. As is often the nature of these deals, the parties have not disclosed how many of Sanctuary AI’s robots will be deployed.
Google Maps will use AI to help you find out-of-the-way EV chargers . The company will use AI to summarize directions to EV chargers as well as reliability and wait times.
Introducing Meta Llama 3: The most capable openly available LLM to date. Today, we’re introducing Meta Llama 3, the next generation of our state-of-the-art open-source large language model. Llama 3 models will soon be available on AWS, Databricks, Google Cloud, Hugging Face, Kaggle, IBM WatsonX, Microsoft Azure, NVIDIA NIM, and Snowflake, and with support from hardware platforms offered by AMD, AWS, Dell, Intel, NVIDIA, and Qualcomm.
Google’s DeepMind AI can help engineers predict “catastrophic failure”. AI and a popular card game can help engineers predict catastrophic failure by finding the absence of a pattern.
OpenAI winds down AI image generator that blew minds and forged friendships in 2022. When OpenAI's DALL-E 2 debuted on April 6, 2022, the idea that a computer could create relatively photorealistic images on demand based on just text descriptions caught a lot of people off guard. The launch began an innovative and tumultuous period in AI history, marked by a sense of wonder and a polarizing ethical debate that reverberates in the AI space to this day. Last week, OpenAI turned off the ability for new customers to purchase generation credits for the web version of DALL-E 2, effectively killing it.
Stability AI lays off roughly 10 percent of its workforce. Stability AI laid off 20 employees just a day after announcing the expansion of access to its new flagship model. This comes after weeks of upheaval that saw its founding CEO leave the company.
The Humane AI Pin is lost in translation. Though the Humane AI Pin has a lot of drawbacks, its translation feature might be the worst.

Resources

Link description
LLM-friendly HTML conversion. Reader converts any URL to an LLM-friendly input with the simple prefix https://r.jina.ai/. Get improved output for your agent and RAG systems at no cost; a usage sketch appears after this list.
Minimal Implementation of a D3PM (Structured Denoising Diffusion Models in Discrete State-Spaces), in pytorch. This is a minimal (400 LOC) but fully faithful PyTorch implementation of D3PM (Structured Denoising Diffusion Models in Discrete State-Spaces).
Cerule - A Tiny Mighty Vision Model. We train and release "Cerule", a tiny yet powerful Vision Language Model based on Google's newly released Gemma-2b and SigLIP.
Diffusion Models for Video Generation. This article looks at adapting image models, training diffusion models to produce video, and even producing video directly from an image model without further training.
Pile-T5. T5 is a workhorse of contemporary AI. Eleuther retrained it with a more recent tokenizer and a longer training run, yielding a significantly improved base model for encoding tasks.
GitHub Repository to File Converter. This Python script allows you to download and process files from a GitHub repository, making it easier to share code with chatbots that have large context capabilities but don't automatically download code from GitHub.
AI Index Report. The 2024 Index is our most comprehensive to date and arrives at an important moment when AI’s influence on society has never been more pronounced. This year, we have broadened our scope to more extensively cover essential trends such as technical advancements in AI, public perceptions of the technology, and the geopolitical dynamics surrounding its development.
Accelerating AI: Harnessing Intel(R) Gaudi(R) 3 with Ray 2.10. Ray 2.10, the most recent version from Anyscale, now supports Intel Gaudi 3. In addition to provisioning Ray Core Task and Actors on a Gaudi fleet directly through Ray Core APIs, developers can now spin up and manage their own Ray Clusters. For an enhanced experience, they can also utilize Ray Serve on Gaudi via Ray Serve APIs and set up Intel Gaudi accelerator infrastructure for use at the Ray Train layer.
Code with CodeQwen1.5. Notwithstanding these advancements, dominant coding assistants like Github Copilot, built upon proprietary LLMs, pose notable challenges in terms of cost, privacy, security, and potential copyright infringement. Today, we are delighted to introduce a new member of the Qwen1.5 open-source family, the CodeQwen1.5-7B, a specialized codeLLM built upon the Qwen1.5 language model. CodeQwen1.5-7B has been pre-trained with around 3 trillion tokens of code-related data. It supports an extensive repertoire of 92 programming languages, and it exhibits exceptional capacity in long-context understanding and generation with the ability to process information of 64K tokens.
OLMo 1.7–7B: A 24 point improvement on MMLU. Today, we’ve released an updated version of our 7 billion parameter Open Language Model, OLMo 1.7–7B. This model scores 52 on MMLU, sitting above Llama 2–7B and approaching Llama 2–13B, and outperforms Llama 2–13B on GSM8K.
Effort. The Effort library lets you adjust, in real time, how many calculations are performed during LLM inference, which can significantly increase performance while maintaining a high level of quality. Initial findings indicate a large potential speed-up at modest implementation overhead, and the author invites others to test the 0.0.1B version and offer feedback.
luminal. Luminal is a deep-learning library that uses composable compilers to achieve high performance.
SoccerNet Game State Reconstruction: End-to-End Athlete Tracking and Identification on a Minimap. A new dataset called SoccerNet-GSR aims to improve game state reconstruction from football video footage captured by a single camera.
AI Gateway. Gateway streamlines requests to 100+ open & closed source models with a unified API. It is also production-ready with support for caching, fallbacks, retries, timeouts, load balancing, and can be edge-deployed for minimum latency.
moondream. a tiny vision language model that kicks ass and runs anywhere
Sentence Embeddings. An introduction to sentence embeddings. This series aims to demystify embeddings and show you how to use them in your projects. The first post covers how to use and scale up open-source embedding models, the criteria for picking an existing model, current evaluation methods, and the state of the ecosystem. A short embedding sketch follows this list.
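
Using Reader is just a matter of prefixing the target URL, as the entry above describes. A minimal sketch (the target URL is illustrative):

```python
import requests

# Prefix any public URL with https://r.jina.ai/ to get back an
# LLM-friendly plain-text/markdown rendering of the page.
target = "https://en.wikipedia.org/wiki/Large_language_model"  # illustrative
response = requests.get("https://r.jina.ai/" + target)
print(response.text[:500])  # feed this into your agent or RAG pipeline
```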
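A minimal sketch of the open-source embedding workflow the series describes, using the sentence-transformers library (the model choice is illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small open-source embedding model
sentences = [
    "How do I fine-tune an LLM?",
    "What is the best way to adapt a language model to my data?",
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# With L2-normalized vectors, cosine similarity reduces to a dot product.
similarity = float(embeddings[0] @ embeddings[1])
print(embeddings.shape, similarity)
```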

Perspectives

Link description
Does AI need a “body” to become truly intelligent? Meta researchers think so. AIs that can generate videos, quickly translate languages or write new computer code could be world-changing, but can they ever be truly intelligent? Not according to the embodiment hypothesis, which argues that human-level intelligence can only emerge if intelligence is able to sense and navigate a physical environment, the same way babies can.
Micromanaging AI. Working with today's AI amounts to micromanagement: people must define tasks, review work frequently, and guide development at every stage, much like managing high-school interns whose motivation is high but whose competence is low.
‘Eat the future, pay with your face’: my dystopian trip to an AI burger joint. If the experience of robot-served fast food dining is any indication, the future of sex robots is going to be very unpleasant
AI now beats humans at basic tasks — new benchmarks are needed, says the major report. Stanford University’s 2024 AI Index charts the meteoric rise of artificial intelligence tools. Artificial intelligence (AI) systems, such as the chatbot ChatGPT, have become so advanced that they now very nearly match or exceed human performance in tasks including reading comprehension, image classification, and competition-level mathematics, according to a new report.
Lethal dust storms blanket Asia every spring — now AI could help predict them. As the annual phenomenon once again strikes East Asia, scientists are hard at work to better predict how the storms will affect people.
From boom to burst, the AI bubble is only heading in one direction. No one should be surprised that artificial intelligence is following a well-worn and entirely predictable financial arc
You can't build a moat with AI. Differentiating AI is difficult, but the secret is in the unique data that is supplied into these models—not in the AI models themselves, which are becoming commodity-like. Take LLMs, for example. The performance of AI is strongly impacted by effective data engineering since applications need to integrate customer-specific data to respond accurately. Thus, rather than the AI technology itself, gaining a competitive edge in AI applications depends on creative data utilization.
Towards 1-bit Machine Learning Models. Recent work on extreme low-bit quantization, such as BitNet and 1.58-bit models, has attracted a lot of attention in the machine learning community. The main idea is that matrix multiplication with quantized weights can be implemented without multiplications, which could be a game-changer for the compute efficiency of large machine learning models; a toy sketch follows this list.
From Idea to Integration: Four Steps for Founders Integrating AI. There is currently a strong push to incorporate AI into existing products. This brief, step-by-step guide will help you make the first move.
Use game theory for climate models that really help reach net zero goals. Many countries and companies have committed to eliminating their greenhouse gas emissions by the middle of the century. Yet most of these pledges lack a clear policy pathway.
A step along the path towards AlphaFold — 50 years ago. Paring down the astronomical complexity of the protein-folding problem
The democratization of global AI governance and the role of tech companies. Can non-state multinational tech companies counteract the potential democratic deficit in the emerging global governance of AI? We argue that although they may strengthen core values of democracy such as accountability and transparency, they currently lack the right kind of authority to democratize global AI governance.
The new NeuroAI. After several decades of developments in AI, has the inspiration that can be drawn from neuroscience been exhausted? Recent initiatives make the case for taking a fresh look at the intersection between the two fields.
Connecting molecular properties with plain language. AI tools such as ChatGPT can provide responses to queries on any topic, but can such large language models accurately ‘write’ molecules as output to our specification? Results now show that models trained on general text can be tweaked with small amounts of chemical data to predict molecular properties, or to design molecules based on a target feature.
MLOps vs. Eng: Misaligned Incentives and Failure to Launch? An in-depth discussion on the difficulties and solutions associated with implementing AI models in production, as well as how MLOps varies from traditional engineering, with industry experts. They talk about how to focus as a company to truly launch and why so few ML ideas ever reach production.
Is Attention All You Need? In order to overcome Transformers' shortcomings in long-context learning, generation, and inference speed, researchers are creating alternative designs that exhibit competitive quality at smaller scales but questionable scalability. Because of the quick development in this area, the Pareto frontier will likely keep growing, opening up more opportunities for lengthier context modeling and higher throughput inference, which will ultimately lead to a bigger variety of AI use cases.
The Shifting Dynamics And Meta-Moats Of AI. Managing complex short-, mid-, and long-term dynamics while retaining elite speed and execution, owning more of the stack, obtaining unique data, and utilizing synthetic data production are all necessary for building a successful AI business. As the AI sector develops, businesses will need to adjust to changing labor dynamics, comprehend the machine they are creating, and recognize the competitive axes on which they are based in order to forge long-lasting moats and differentiate themselves from the crowd.
Integration of AI in healthcare requires an interoperable digital data ecosystem. Electronic health information, including electronic health records, is needed to develop AI tools for health, but the seamless flow of data will require standards and interoperability.
To do no harm — and the most good — with AI in health care. Drawing from real-life scenarios and insights shared at the RAISE (Responsible AI for Social and Ethical Healthcare) conference, we highlight the critical need for AI in health care (AIH) to primarily benefit patients and address current shortcomings in healthcare systems such as medical errors and access disparities.
How to support the transition to AI-powered healthcare. To make health systems more sustainable in the long-term, incentivize artificial intelligence (AI) and digital technologies that are grounded on careful testing and real-world validation.
The increasing potential and challenges of digital twins. This issue of Nature Computational Science includes a Focus that highlights recent advancements, challenges, and opportunities in the development and use of digital twins across different domains.
The Space Of Possible Minds. Sophisticated AIs are stretching the boundaries of our understanding of what it is to be human and forcing us to consider how we embody agency and true understanding in a spectrum of intelligent beings. Creating mutually beneficial relationships between radically different entities, recognizing the similarities and differences among various forms of intelligence, and developing principled frameworks for scaling our moral concern to the essential qualities of being are all necessary to navigate this new terrain.
CUDA is Still a Giant Moat for NVIDIA. NVIDIA's proprietary interconnects and CUDA software environment, in addition to its hardware, continue to solidify the company's leadership in the AI market. The ease of use and performance optimization of CUDA makes it superior to alternatives like AMD's ROCM, guaranteeing that NVIDIA's GPUs continue to be the go-to option for AI tasks. NVIDIA's dominance in AI computing is strengthened by its investments in the CUDA ecosystem and community education.
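
To make the no-multiplication idea from the 1-bit entry above concrete, here is a toy NumPy sketch: with weights restricted to {-1, 0, +1}, a matrix-vector product reduces to signed sums. This is only a didactic illustration, not the optimized kernels the papers describe.

```python
import numpy as np

def ternary_matvec(W, x):
    """Mat-vec where W has entries in {-1, 0, +1}: no multiplications needed."""
    out = np.zeros(W.shape[0], dtype=x.dtype)
    for i in range(W.shape[0]):
        # Each output is (sum of inputs where w = +1) - (sum where w = -1).
        out[i] = x[W[i] == 1].sum() - x[W[i] == -1].sum()
    return out

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))            # ternary ("1.58-bit") weights
x = rng.standard_normal(8).astype(np.float32)
assert np.allclose(ternary_matvec(W, x), W @ x, atol=1e-5)
```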

meme-of-the-week

Back to index

ML news: Week 8 - 14 April

Research

Link description
Smartphone app could help detect early-onset dementia cause, study finds. App-based cognitive tests found to be proficient at detecting frontotemporal dementia in those most at risk. Scientists have demonstrated that cognitive tests done via a smartphone app are at least as sensitive at detecting early signs of frontotemporal dementia in people with a genetic predisposition to the condition as medical evaluations performed in clinics.
Unsegment Anything by Simulating Deformation. A novel strategy called "Anything Unsegmentable" aims to protect digital photos from being segmented into discrete categories by powerful AI models, potentially addressing copyright and privacy concerns.
Evaluating LLMs at Detecting Errors in LLM Responses. A benchmark called ReaLMistake has been introduced by researchers to methodically identify mistakes in lengthy language model answers.
Dynamic Prompt Optimizing for Text-to-Image Generation. Researchers have created Prompt Auto-Editing (PAE), a technique that uses diffusion models such as Imagen and Stable Diffusion to advance text-to-image generation. With the use of online reinforcement learning, this novel method dynamically modifies the weights and injection timings of particular words to automatically improve text prompts.
No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation. A system called Seg-NN simplifies the 3D segmentation procedure. These models don't have the usual domain gap problems and can quickly adapt to new, unseen classes because they don't require a lot of pre-training.
Foundation Model for Advancing Healthcare: Challenges, Opportunities, and Future Directions. The potential of Healthcare Foundation Models (HFMs) to transform medical services is examined in this extensive survey. These models are well-suited to adapt to different healthcare activities since they have been pre-trained on a variety of data sets. This could lead to an improvement in intelligent healthcare services in a variety of scenarios.
SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing. A new algorithm called SwapAnything may swap out objects in an image with other objects of your choosing without affecting the image's overall composition. Compared to other tools, it is superior since it can replace any object, not only the focal point, and it excels at ensuring that the replaced object blends seamlessly into the original image. Pretrained diffusion model, idea vectors, and inversion are employed.
UniFL: Improve Stable Diffusion via Unified Feedback Learning. UniFL is a technique that uses a fairly complex cascade of feedback steps to enhance the output quality of diffusion models, improving the aesthetics, preference alignment, and visual quality of generated images. The method can be applied to enhance any image generation model.
Object-Aware Domain Generalization for Object Detection. In order to tackle the problem of object detection in single-domain generalization (S-DG), the novel OA-DG approach presents two new techniques: OA-Mix for data augmentation and OA-Loss for training.
VAR: a new visual generation method elevates GPT-style models beyond diffusion🚀 & Scaling laws observed. Code for the "next-resolution prediction" project, which frames image generation as progressive prediction at increasingly higher resolutions. A demo notebook and inference scripts are included in the repository; the training code will be released soon.
SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget. SqueezeAttention is a newly developed technique that optimizes the Key-Value cache of big language models, resulting in a 30% to 70% reduction in memory usage and a doubling of throughput.
Measuring the Persuasiveness of Language Models. A study of persuasiveness found that the Claude 3 Opus AI model closely approaches human persuasiveness, as determined by statistical tests and multiple-comparison adjustments. Humans remained marginally more convincing, though not by a statistically significant margin, highlighting a trend in which larger, more capable models are becoming more persuasive. A control condition of undisputed facts showed predictably low persuasiveness, supporting the study's methodology.
DreamView: Injecting View-specific Text Guidance into Text-to-3D Generation. DreamView presents a novel method for turning text descriptions into 3D objects that may be extensively customized from various angles while maintaining the object's overall consistency.
Hash3D: Training-free Acceleration for 3D Generation. By adopting a hashing algorithm that takes advantage of feature-map redundancy across similar camera positions and diffusion time-steps, Hash3D presents a novel way to accelerate 3D generative modeling.
MoCha-Stereo: Motif Channel Attention Network for Stereo Matching. An innovative method that keeps geometric structures that are sometimes lost in conventional stereo matching techniques is the Motif Channel Attention Stereo Matching Network (MoCha-Stereo).
Efficient and Generic Point Model for Lossless Point Cloud Attribute Compression. PoLoPCAC is a lossless point cloud attribute compression technique that combines excellent adaptability and great efficiency at different point cloud densities and scales.
Scaling Multi-Camera 3D Object Detection through Weak-to-Strong Eliciting. In order to boost surround refinement in Multi-Camera 3D Object Detection (MC3D-Det), a field enhanced by bird's-eye view technologies, this study introduces a weak-to-strong eliciting framework.
InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models. This project introduces InstantMesh, a framework with unparalleled quality and scalability that creates 3D meshes instantaneously from a single image.
Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers? A recent study examined the ways in which different layers within huge language models understand distinct concepts. It was discovered that while more complicated tasks demand deeper processing, simpler tasks are handled by earlier layers.
SplatPose & Detect: Pose-Agnostic 3D Anomaly Detection. SplatPose is a revolutionary approach that uses 3D Gaussian splatting to address the problem of anomaly identification in 3D objects from different positions.

News

Link description
Facebook and Instagram to label digitally altered content ‘made with AI’. Parent company Meta also to add ‘high-risk’ label to AI-altered content that deceives the public on "a matter of importance"
Google considering charge for internet searches with AI, reports say. Cost of artificial intelligence service could mean leaders in sector turning to subscription models
Apple lays off 600 workers in California after shuttering self-driving car project. Tech company cuts employees from eight offices in Santa Clara in its first big wave of post-pandemic job cuts
AMD to open source Micro Engine Scheduler firmware for Radeon GPUs. AMD plans to document and open source its Micro Engine Scheduler (MES) firmware for GPUs, giving users more control over Radeon graphics cards.
Investors in talks to help Elon Musk's xAI raise $3 billion: report. Investors close to Elon Musk are in talks to help his artificial-intelligence startup xAI raise $3 billion in a round that would value the company at $18 billion, the Wall Street Journal reported on Friday.
Introducing Command R+: A Scalable LLM Built for Business. Command R+, a potent, scalable LLM with multilingual coverage in ten important languages and tool use capabilities, has been launched by Cohere. It is intended for use in enterprise use scenarios.
Qwen1.5-32B: Fitting the Capstone of the Qwen1.5 Language Model Series. A growing consensus within the field now points to a model with approximately 30 billion parameters as the optimal “sweet spot” for achieving both strong performance and manageable resource requirements. In response to this trend, we are proud to unveil the latest additions to our Qwen1.5 language model series: Qwen1.5-32B and Qwen1.5-32B-Chat.
Nvidia Tops Llama 2, Stable Diffusion Speed Trials. Now that we’re firmly in the age of massive generative AI, it’s time to add two such behemoths, Llama 2 70B and Stable Diffusion XL, to MLPerf’s inferencing tests. Version 4.0 of the benchmark tests more than 8,500 results from 23 submitting organizations. As has been the case from the beginning, computers with Nvidia GPUs came out on top, particularly those with its H200 processor. But AI accelerators from Intel and Qualcomm were in the mix as well.
Rabbit partners with ElevenLabs to power voice commands on its device. Hardware maker Rabbit has tapped a partnership with ElevenLabs to power voice commands on its devices. Rabbit is set to ship the first set of r1 devices next month after getting a ton of attention at the Consumer Electronics Show (CES) at the start of the year.
DALL-E now lets you edit images in ChatGPT. Tweak your AI creations without leaving the chat.
Jony Ive and OpenAI's Sam Altman Seeking Funding for Personal AI Device. OpenAI CEO Sam Altman and former Apple design chief Jony Ive have officially teamed up to design an AI-powered personal device and are seeking funding, reports The Information.
Hugging Face TGI Reverts to Open Source License. Hugging Face temporarily granted a non-commercial license for their well-known and potent inference server in an effort to deter bigger companies from running a rival offering. While community involvement decreased, business outcomes remained unchanged. It is now back to a license that is more liberal.
Securing Canada’s AI advantage. To support Canada's AI industry, Prime Minister Justin Trudeau unveiled a $2.4 billion investment package beginning with Budget 2024. The package comprises tools to enable ethical AI adoption, support for AI start-ups, and financing for computational skills. These policies are intended to maintain Canada's competitive advantage in AI globally, boost productivity, and hasten the growth of jobs. The money will also be used to fortify the Artificial Intelligence and Data Act's enforcement as well as establish a Canadian AI Safety Institute.
Yahoo is buying Artifact, the AI news app from the Instagram co-founders. Instagram’s co-founders built a powerful and useful tool for recommending news to readers — but could never quite get it to scale. Yahoo has hundreds of millions of readers — but could use a dose of tech-forward cool to separate it from all the internet’s other news aggregators.
Now there’s an AI gas station with robot fry cooks. There’s a little-known hack in rural America: you can get the best fried food at the gas station (or in the case of a place I went to on my last road trip, shockingly good tikka masala). Now, one convenience store chain wants to change that with a robotic fry cook that it’s bringing to a place once inhabited by a person who may or may not smell like a recent smoke break and cooks up a mean fried chicken liver.
Elon Musk predicts superhuman AI will be smarter than people next year. His claims come with a caveat that shortages of training chips and growing demand for power could limit plans in the near term
Gemma Family Expands with Models Tailored for Developers and Researchers. Google announced the first round of additions to the Gemma family, expanding the possibilities for ML developers to innovate responsibly: CodeGemma for code completion and generation tasks as well as instruction following, and RecurrentGemma, an efficiency-optimized architecture for research experimentation.
Meta confirms that its Llama 3 open source LLM is coming in the next month. At an event in London on Tuesday, Meta confirmed that it plans an initial release of Llama 3 — the next generation of its large language model used to power generative AI assistants — within the next month.
Intel details Gaudi 3 at Vision 2024 — new AI accelerator sampling to partners now, volume production in Q3. Intel made a slew of announcements during its Vision 2024 event today, including deep-dive details of its new Gaudi 3 AI processors, which it claims offer up to 1.7X the training performance, 50% better inference, and 40% better efficiency than Nvidia’s market-leading H100 processors, but for significantly less money.
Apple's new AI model could help Siri see how iOS apps work. Apple's Ferret LLM could allow Siri to understand the layout of apps on an iPhone display, potentially increasing the capabilities of Apple's digital assistant. Apple has been working on numerous machine learning and AI projects that it could tease at WWDC 2024, and a just-released paper suggests some of that work could let Siri understand what apps and iOS itself look like.
Aerospace AI Hackathon Projects. Together, 200 AI and aerospace experts created an amazing array of tools, including AI flight planners, AI air traffic controllers, and Apple Vision Pro flight simulators, as a means of prototyping cutting-edge solutions for the aviation and space industries.
AI race heats up as OpenAI, Google, and Mistral release new models. Launches within 12 hours of one another, and more activity expected in industry over summer
next-generation Meta Training and Inference Accelerator. The next iteration of Meta's AI accelerator chip has been revealed. Its development was centered on throughput (11 TFLOPs at int8) and chip memory (128GB at 5nm).
Google’s Gemini Pro 1.5 enters public preview on Vertex AI. Gemini 1.5 Pro, Google’s most capable generative AI model, is now available in public preview on Vertex AI, Google’s enterprise-focused AI development platform. The company announced the news during its annual Cloud Next conference, which is taking place in Las Vegas this week.
Microsoft is working on sound recognition AI technologies capable of detecting natural disasters. The Redmond-based tech giant is working on performant sound recognition AI that would let Copilot (and other AI models, such as ChatGPT) detect upcoming natural disasters, such as earthquakes and storms.
Amazon scrambles for its place in the AI race. With its multibillion-dollar bet on Anthropic and its forthcoming Olympus model, Amazon is pushing hard to be a leader in AI.
Elon Musk's updated Grok AI claims to be better at coding and math. It'll be available to early testers 'in the coming days.' Elon Musk's answer to ChatGPT is getting an update to make it better at math, coding and more. Musk's xAI has launched Grok-1.5 to early testers with "improved capabilities and reasoning" and the ability to process longer contexts. The company claims it now stacks up against GPT-4, Gemini Pro 1.5, and Claude 3 Opus in several areas.
Anthropic's Haiku Beats GPT-4 Turbo in Tool Use - Sometimes. Anthropic's beta tool-use API beats GPT-4 Turbo in 50% of cases on the Berkeley Function Calling benchmark.
UK has real concerns about AI risks, says competition regulator. Concentration of power among just six big tech companies ‘could lead to winner takes all dynamics’
New bill would force AI companies to reveal use of copyrighted art. Adam Schiff introduces bill amid growing legal battle over whether major AI companies have made illegal use of copyrighted works
Randomness in computation wins computer-science ‘Nobel’. Computer scientist Avi Wigderson is known for clarifying the role of randomness in algorithms, and for studying their complexity. A leader in the field of computational theory is the latest winner of the A. M. Turing Award, sometimes described as the ‘Nobel Prize’ of computer science.
Introducing Rerank 3: A New Foundation Model for Efficient Enterprise Search & Retrieval. Rerank 3, the newest foundation model from Cohere, was developed with enterprise search and Retrieval Augmented Generation (RAG) systems in mind. The model can be integrated into any legacy program with built-in search functionality and is compatible with any database or search index. With a single line of code, Rerank 3 can boost search quality or lower the cost of running RAG applications with minimal impact on latency.
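As a rough illustration of that single-line integration, here is a hedged sketch using Cohere's Python SDK; the model ID `rerank-english-v3.0` and the response fields are assumptions based on the announcement, and the candidate documents are invented:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

candidates = [
    "Rerank models reorder retrieved documents by relevance to the query.",
    "Semi-structured JSON records can also be passed as documents.",
    "RAG pipelines often rerank retriever output before generation.",
]

# The single call that upgrades an existing search or RAG pipeline.
response = co.rerank(
    model="rerank-english-v3.0",   # assumed ID for the Rerank 3 model
    query="How does reranking improve RAG quality?",
    documents=candidates,
    top_n=2,
)

for hit in response.results:
    print(hit.index, round(hit.relevance_score, 3))
```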
Meta to broaden labeling of AI-made content. Meta admits its current labeling policies are "too narrow" and that a stronger system is needed to deal with today's wider range of AI-generated content and other manipulated content, such as a January video that appeared to show President Biden inappropriately touching his granddaughter.
Mistral's New Model. The Mixtral-8x22B Large Language Model (LLM) is a pre-trained generative Sparse Mixture of Experts.
Waymo self-driving cars are delivering Uber Eats orders for the first time. Uber Eats customers may now receive orders delivered by one of Waymo’s self-driving cars for the first time in the Phoenix metropolitan area. It is part of a multiyear collaboration between the two companies unveiled last year.
JetMoE: Reaching LLaMA2 Performance with 0.1M Dollars. This mixture-of-experts model was trained on a modest compute budget using publicly available datasets, yet it performs on par with the considerably larger and more costly Meta Llama 2 7B model.
Google blocking links to California news outlets from search results. Tech giant is protesting proposed law that would require large online platforms to pay ‘journalism usage fee’
House votes to reapprove law allowing warrantless surveillance of US citizens. Fisa allows for monitoring of foreign communications, as well as collection of citizens’ messages and calls
Tesla settles lawsuit over 2018 fatal Autopilot crash of Apple engineer. Walter Huang was killed when his car steered into a highway barrier and Tesla will avoid questions about its technology in a trial

Resources

Link description
SWE-agent. SWE-agent turns LMs (e.g. GPT-4) into software engineering agents that can fix bugs and issues in real GitHub repositories.
Schedule-Free Learning. Faster training without schedules - no need to specify the stopping time/steps in advance!
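A minimal sketch of the idea in practice, assuming Meta's published `schedulefree` package and its `AdamWScheduleFree` optimizer; note there is no scheduler and no total-step count anywhere in the loop:

```python
# pip install schedulefree
import torch
import schedulefree

model = torch.nn.Linear(10, 1)
# Schedule-free optimizer: no LR schedule, no horizon to pre-specify.
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3)

optimizer.train()  # the optimizer itself has train/eval modes
for _ in range(1000):
    x = torch.randn(32, 10)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

optimizer.eval()  # switch to the averaged iterates before evaluation
```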
State-of-the-art Representation Fine-Tuning (ReFT) methods. ReFT is a novel, parameter-efficient approach to language model fine-tuning. It achieves strong performance at a significantly lower cost than even PEFT methods.
The Top 100 AI for Work – April 2024. Following our AI Top 150, we spent the past few weeks analyzing data on the top AI platforms for work. This report shares key insights, including the AI tools you should consider adopting to work smarter, not harder.
LLocalSearch. LLocalSearch is a completely locally running search aggregator using LLM Agents. The user can ask a question and the system will use a chain of LLMs to find the answer. The user can see the progress of the agents and the final answer. No OpenAI or Google API keys are needed.
llm.c. LLM training in simple, pure C/CUDA. There is no need for 245MB of PyTorch or 107MB of CPython. For example, training GPT-2 (CPU, fp32) is ~1,000 lines of clean code in a single file. It compiles and runs instantly, and exactly matches the PyTorch reference implementation.
AIOS: LLM Agent Operating System. AIOS, a Large Language Model (LLM) Agent operating system, embeds a large language model into the operating system as the brain of the OS, enabling an OS "with soul" -- an important step towards AGI. AIOS is designed to optimize resource allocation, facilitate context switching across agents, enable concurrent execution of agents, provide tool services for agents, maintain access control, and offer a rich set of toolkits for LLM Agent developers.
Anthropic Tool use (function calling). Anthropic has released a public beta that lets Claude call customized client-side tools supplied in API requests. To use the feature, developers include the 'anthropic-beta: tools-2024-04-04' header and describe each tool with a comprehensive JSON schema, which extends Claude's capabilities. A sketch of such a request is shown below.
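A minimal sketch of a tool-use request against the Messages API, using the beta header from the announcement; the weather tool and its schema are invented for illustration:

```python
import requests

resp = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": "YOUR_API_KEY",
        "anthropic-version": "2023-06-01",
        "anthropic-beta": "tools-2024-04-04",  # opts into the beta
        "content-type": "application/json",
    },
    json={
        "model": "claude-3-haiku-20240307",
        "max_tokens": 1024,
        "tools": [{
            "name": "get_weather",  # hypothetical client-side tool
            "description": "Get the current weather for a city.",
            "input_schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }],
        "messages": [{"role": "user", "content": "Weather in Paris?"}],
    },
)
print(resp.json())  # look for a `tool_use` content block to execute client-side
```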
Flyflow. Flyflow is API middleware that optimizes LLM applications: the same response quality with 5x lower latency, added security, and much higher token limits.
ChemBench. LLMs gain importance across domains. To guide improvement, benchmarks have been developed. One of the most popular ones is BIG-bench which currently only includes two chemistry-related tasks. The goal of this project is to add more chemistry benchmark tasks in a BIG-bench compatible way and develop a pipeline to benchmark frontier and open models.
Longcontext Alpaca Training. On an H100, train with context windows of more than 200k tokens using a new gradient accumulation offloading technique.
attorch. attorch is a subset of PyTorch's NN module, written purely in Python using OpenAI's Triton. Its goal is to be an easily hackable, self-contained, and readable collection of neural network modules whilst maintaining or improving upon the efficiency of PyTorch.
Policy-Guided Diffusion. A novel approach to agent training in offline environments is provided by policy-guided diffusion, which generates synthetic trajectories that closely match target policies and behavior. By producing more realistic training data, this method greatly enhances the performance of offline reinforcement learning models.
Ada-LEval. Ada-LEval is a pioneering benchmark to assess the long-context capabilities with length-adaptable questions. It comprises two challenging tasks: TSort, which involves arranging text segments into the correct order, and BestAnswer, which requires choosing the best answer to a question among multiple candidates.

Perspectives

Link description
"Time is running out": can a future of undetectable deep-fakes be avoided?. Tell-tale signs of generative AI images are disappearing as the technology improves, and experts are scrambling for new methods to counter disinformation
Four Takeaways on the Race to Amass Data for A.I. To make artificial intelligence systems more powerful, tech companies need online data to feed the technology. Here’s what to know.
TechScape: Could AI-generated content be dangerous for our health? From hyperrealistic deep fakes to videos that not only hijack our attention but also our emotions, tech seems increasingly full of "cognito-hazards"
AI can help to tailor drugs for Africa — but Africans should lead the way. Computational models that require very little data could transform biomedical and drug development research in Africa, as long as infrastructure, trained staff, and secure databases are available.
Breaking news: Scaling will never get us to AGI. Because neural networks generalize poorly beyond their training data, which limits their reasoning and trustworthiness, methods beyond scaling will be needed to reach artificial general intelligence.
Americans’ use of ChatGPT is ticking up, but few trust its election information. It’s been more than a year since ChatGPT’s public debut set the tech world abuzz. And Americans’ use of the chatbot is ticking up: 23% of U.S. adults say they have ever used it, according to a Pew Research Center survey conducted in February, up from 18% in July 2023.
Can Demis Hassabis Save Google? Demis Hassabis, the founder of DeepMind, now leads Google's unified AI research division and hopes to keep the tech giant ahead of the competition with innovations like AlphaGo and AlphaFold. Despite those achievements, obstacles remain in integrating AI into physical products, along with competition from offerings like OpenAI's ChatGPT. Having contributed substantially to AI, Hassabis must now work within Google's product strategy to capitalize on DeepMind's research breakthroughs.
Is ChatGPT corrupting peer review? Telltale words hint at AI use. A study of review reports identifies dozens of adjectives that could indicate text written with the help of chatbots.
AI-fuelled election campaigns are here — where are the rules? Political candidates are increasingly using AI-generated ‘softfakes’ to boost their campaigns. This raises deep ethical concerns.
How to break big tech’s stranglehold on AI in academia. Deep-learning artificial intelligence (AI) models have become an attractive tool for researchers in many areas of science and medicine. However, the development of these models is prohibitively expensive, owing mainly to the energy consumed in training them.
Ready or not, AI is coming to science education — and students have opinions. As educators debate whether it’s even possible to use AI safely in research and education, students are taking a role in shaping its responsible use.
‘Without these tools, I’d be lost’: how generative AI aids in accessibility. A rush to place barriers around the use of artificial intelligence in academia could disproportionately affect those who stand to benefit most.

meme-of-the-week

Back to index

ML news: Week 1 - 7 April

Research

Link description
TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes. Researchers introduce a new approach to dense captioning of outdoor scenes, overcoming challenges such as variable conditions and sparse data that had previously impeded progress.
Lane-Change in Dense Traffic with Model Predictive Control and Neural Networks. This work presents a control system that emphasizes collaboration with neighboring drivers to enable safe and seamless lane changes in congested traffic by combining AI and predictive algorithms.
Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs. It is difficult to run language models on phones because of latency, bandwidth, and power limitations. This study demonstrates how to obtain 30 tokens/second generation for the potent Gemma 2B model using quantization, the removal of the kv cache, and other optimizations. This is about three times quicker than other frameworks.
Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models. Given certain input images, Visual Language Models (VLMs) cannot answer the question asked; even cutting-edge VLMs like GPT-4V struggle here. This paper introduces the idea of Unsolvable Problem Detection (UPD), proposes a benchmark for it, and suggests possible enhancements.
Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction. With its revolutionary approach to 3D scene reconstruction, Total-Decom makes it simple to edit and manipulate photographs by precisely breaking down objects from several views with little effort on the part of the user.
Mechanism for feature learning in neural networks and backpropagation-free machine learning models. Proposes the deep neural feature ansatz: neural feature learning occurs by up-weighting the features most influential on model output. The process is formulated mathematically in terms of the average gradient outer product and supported by numerical experiments and theoretical results. The mechanism yields a backpropagation-free approach to feature learning in various machine learning models, including some that previously lacked such capabilities.
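For reference, the average gradient outer product at the center of the ansatz has a standard definition; the sketch below states it for a trained predictor f and training inputs x_1, ..., x_n, and the ansatz then relates a layer's learned feature matrix to (a power of) the AGOP taken with respect to that layer's input.

```latex
\mathrm{AGOP}(f) \;=\; \frac{1}{n} \sum_{i=1}^{n} \nabla_x f(x_i)\, \nabla_x f(x_i)^{\top}
```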
Teaching robots the art of human social synchrony. Humanoid robots can now learn the art of social synchrony using neural networks.
Many-shot jailbreaking. Anthropic identified a jailbreaking method that exploits long-context models. It has shared the findings with other organizations, and this post describes the method and a few countermeasures that were implemented.
R2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding. R2-Tuning is a technique for video temporal grounding: given a verbal prompt, the system identifies the specific moments in a video it refers to.
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want. SPHINX-V, a multimodal big language model developed as part of the Draw-and-Understand project, aims to improve human-AI interaction through visual cues.
RealKIE: Five Novel Datasets for Enterprise Key Information Extraction. Enterprise AI solutions depend on the ability to extract information from datasets. It is possible to gauge general algorithmic performance for RAG applications using these five new benchmark datasets.
DiJiang: Efficient Large Language Models through Compact Kernelization. Researchers have created a novel method called DiJiang that makes use of current Transformers to create faster, leaner models without requiring a significant amount of retraining.
WcDT: World-centric Diffusion Transformer for Traffic Scene Generation. This paper presents a novel approach to autonomous vehicle driving path generation that integrates transformers and diffusion models into a system dubbed the "World-Centric Diffusion Transformer" (WcDT).
SeaBird: Segmentation in Bird's View with Dice Loss Improves Monocular 3D Detection of Large Objects. In situations where conventional monocular detectors struggle to identify large objects, a novel 3D detection technique called SeaBird succeeds.
ASTRA - 3rd place solution for SoccerNet Action Spotting Challenge 2023. ASTRA is a Transformer-based model that overcomes issues such as action localization and data imbalance to recognize key moments in soccer matches.
Multi-Granularity Guided Fusion-in-Decoder. MGFiD introduces a multi-level evidence discernment strategy that improves the understanding and selection of pertinent information by question-answering systems.
Linear Attention Sequence Parallelism. With its creative application of linear attention, Linear Attention Sequence Parallel (LASP) presents a novel approach to effectively handling lengthy sequences in language models, outperforming conventional techniques.
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models. One drawback of contemporary transformers is that every token consumes the same amount of predictive computation, even though some tokens are far easier to predict than others. With this work, DeepMind paves the way for dynamic compute with a bounded maximum: models can exit early for certain tokens and spend fewer FLOPs on them, needing 50% fewer FLOPs at generation time for the same performance. A rough sketch of the routing idea follows.
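As an illustration (a sketch of the routing idea, not DeepMind's implementation; the sigmoid gating and fixed capacity here are simplifications), a learned router scores tokens so only the top fraction pass through the expensive block while the rest ride the residual stream unchanged:

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Mixture-of-Depths-style wrapper: only top-k tokens get full compute."""
    def __init__(self, block: nn.Module, d_model: int, capacity: float = 0.5):
        super().__init__()
        self.block = block        # any token-wise module (e.g. an MLP layer)
        self.router = nn.Linear(d_model, 1)
        self.capacity = capacity  # fraction of tokens routed through the block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        k = max(1, int(T * self.capacity))
        scores = self.router(x).squeeze(-1)          # (B, T) routing scores
        top = scores.topk(k, dim=-1).indices         # tokens that get compute
        idx = top.unsqueeze(-1).expand(-1, -1, D)    # (B, k, D) gather index
        selected = x.gather(1, idx)
        processed = self.block(selected)
        # Gate by router score so routing stays differentiable, then scatter
        # processed tokens back; unselected tokens pass through unchanged.
        gate = torch.sigmoid(scores.gather(1, top)).unsqueeze(-1)
        return x.scatter(1, idx, selected + gate * (processed - selected))

# Usage: wrap an existing layer; ~half the tokens skip it on each pass.
mlp = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
layer = MoDBlock(mlp, d_model=64, capacity=0.5)
out = layer(torch.randn(2, 16, 64))
```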
InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation. With InstantStyle, image personalization takes a new turn by addressing the issue of style consistency without requiring intricate fine-tuning. This framework guarantees precise and consistent visual stylization, merging style intensity with text management with a seamless integration of style-specific sections and a clever division of style and content in images.
T-GATE: Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models. By splitting the process into parts for planning and revising, TGATE presents an effective method for creating visuals. By correcting some outputs early on, this strategy not only makes the creation process simpler but also surprisingly enhances image quality.

News

Link description
Announcing Grok-1.5. Grok-1.5 comes with improved reasoning capabilities and a context length of 128,000 tokens. Available on 𝕏 soon.
Microsoft & OpenAI planning $100 billion 'Stargate' AI supercomputer. According to a report by The Information, Microsoft and OpenAI are planning a joint data center project that could cost $100 billion, culminating in the launch of a massive artificial intelligence supercomputer named "Stargate" by 2028.
In One Key A.I. Metric, China Pulls Ahead of the U.S.: Talent. China has produced a huge number of top A.I. engineers in recent years. New research shows that, by some measures, it has already eclipsed the United States.
Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters. Compared to Qwen1.5-7B, which contains 6.5 billion non-embedding parameters, Qwen1.5-MoE-A2.7B contains only 2.0 billion non-embedding parameters, approximately one-third of Qwen1.5-7B’s size. Notably, it achieves a 75% decrease in training expenses and accelerates inference speed by a factor of 1.74, offering substantial improvements in resource utilization without compromising performance.
“The king is dead”—Claude 3 surpasses GPT-4 on Chatbot Arena for the first time. Anthropic's Claude 3 is first to unseat GPT-4 for #1 since launch of Chatbot Arena in May '23.
Microsoft Copilot AI will soon run locally on PCs. Microsoft's Copilot AI service is set to run locally on PCs, Intel told Tom's Hardware. The company also said that next-gen AI PCs would require built-in neural processing units (NPUs) with over 40 TOPS (trillion operations per second) of power — beyond the capabilities of any consumer processor on the market.
Navigating the Challenges and Opportunities of Synthetic Voices. Using a 15-second audio sample, OpenAI's Voice Engine model creates speech that sounds like a speaker. Applications for it include support for non-verbal people, translation, and educational aids. Because of the possibility of abuse, OpenAI is deploying its technology cautiously.
Apple AI researchers boast useful on-device model that 'substantially outperforms' GPT-4. Apple forges ahead with its AI plans: in a newly published research paper, Apple's researchers describe a system in which Siri can do much more than try to recognize what's in an image. The best part? They claim one of the models behind it benchmarks better than GPT-4.
Introducing Bezi AI. Bezi AI lets designers ideate at the speed of thought with a limitless asset library, a major turning point for 3D design.
Robot, can you say ‘Cheese’? Columbia engineers build Emo, a silicon-clad robotic face that makes eye contact and uses two AI models to anticipate and replicate a person’s smile before the person actually smiles -- a major advance in robots predicting human facial expressions accurately, improving interactions, and building trust between humans and robots.
Billie Eilish, Nicki Minaj, Stevie Wonder, and more musicians demand protection against AI. Letter signed by more than 200 artists makes broad ask that tech firms pledge to not develop AI tools to replace human creatives
US and UK announce formal partnership on artificial intelligence safety. Countries sign memorandum to develop advanced AI model testing amid growing safety concerns
OpenAI deems its voice cloning tool too risky for general release. Delaying the Voice Engine technology rollout minimizes the potential for misinformation in an important global election year
DrugGPT: new AI tool could help doctors prescribe medicine in England. New tool may offer prescription ‘safety net’ and reduce the 237m medication errors made each year in England
New York City to test AI-enabled gun scanners in the subway system. Mayor Eric Adams announced the pilot program as part of an effort to deter violence, with plans to evaluate scanners at some stations
Twitter usage in the US 'fallen by a fifth' since Elon Musk's takeover. App usage of the social media site, rebranded as X, is down 23% since November 2022, according to Sensor Tower.
Scientists turn to AI to make beer taste even better. Researchers in Belgium use artificial intelligence to improve taste, but say the skill of the brewer remains vital
Google AI could soon use a person’s cough to diagnose disease. Machine-learning system trained on millions of human audio clips shows promise for detecting COVID-19 and tuberculosis.
Microsoft is working on an Xbox AI chatbot. Xbox employees have been testing a virtual chatbot that can help with support queries and game refunds.
Sam Altman gives up control of OpenAI Startup Fund, resolving unusual corporate venture structure. OpenAI CEO Sam Altman has transferred formal control of the firm's eponymously named corporate venture fund to Ian Hathaway, OpenAI confirmed to TechCrunch.
You can now use ChatGPT without an account. On Monday, OpenAI began opening up ChatGPT to users without an account. It described the move as part of its mission to “make tools like ChatGPT broadly available so that people can experience the benefits of AI.” It also gives the company more training data (for those who don’t opt out) and perhaps nudges more users into creating accounts and subscribing for superior GPT-4 access instead of the older GPT-3.5 model free users get.
GENERATIVE SF: MARKETPLACES IN AI EDITION. How Instacart and Faire use AI to boost productivity and better serve their customers.
Replit launches new product in race for AI coding assistants. A Silicon Valley AI coding startup is launching a new tool that it hopes will change the way companies develop software. Replit, valued at over $1 billion and backed by venture firms like Andreessen Horowitz and Khosla Ventures, says its new product, called Replit Teams, will allow developers to collaborate in real-time on software projects while an AI agent automatically fixes coding errors.
Samsung might ‘redefine’ Bixby with Galaxy AI after all. Samsung’s big Galaxy AI push this year skipped over its voice assistant, Bixby, but that might not be forever. Earlier this year when Galaxy AI made its debut, Samsung confirmed that Bixby wasn’t going away, but that it also didn’t really have plans for any new AI features within the voice assistant. Speaking to CNBC more recently, though, Samsung is looking at changing that.
George Carlin’s estate settles lawsuit over comedian’s AI doppelganger. Suit claimed Dudesy podcast violated Carlin’s copyright, calling it ‘a casual theft of a great American artist’s work’
Opera allows users to download and use LLMs locally. Web browser company Opera announced today it will now allow users to download and use large language models (LLMs) locally on their computer. This feature is first rolled out to Opera One users who get developer stream updates and will allow users to select from over 150 models from more than 50 families.
Introducing Stable Audio 2.0. Stable Audio 2.0 sets a new standard in AI-generated audio, producing high-quality, full tracks with coherent musical structures up to three minutes in length at 44.1kHz stereo. The new model introduces audio-to-audio generation by allowing users to upload and transform samples using natural language prompts. Stable Audio 2.0 was exclusively trained on a licensed dataset from the AudioSparx music library, honoring opt-out requests and ensuring fair compensation for creators.
Scientists create AI models that can talk to each other and pass on skills with limited human input. Scientists modeled human-like communication skills and the transfer of knowledge between AIs — so they can teach each other to perform tasks without a huge amount of training data.
Worldcoin Foundation open sources core components of the Orb’s software. For the Worldcoin Orb, Tools for Humanity has created a robust and safe computing environment that makes use of Arm Cortex M4 microcontrollers for real-time operations and NVIDIA Jetson for processing. The Orb does neural network inference using NVIDIA's TensorRT and runs Rust applications. It runs on Orb OS, a customized GNU/Linux distribution with an emphasis on security. For cryptography, the system incorporates a secure element, and for backend authentication, it provides trusted execution environments.
Report: Google might make SGE a paid feature, not working on ad-free Search. As the Search Generative Experience (SGE) nears its one-year anniversary, Google is reportedly considering making it a paid feature, but is not considering an ad-free offering.
Lambda Announces $500M GPU-Backed Facility to Expand Cloud for AI. Lambda, the GPU cloud company founded by AI engineers and powered by NVIDIA GPUs, today announced that it has secured a special purpose GPU financing vehicle of up to $500 million to fund the expansion of its on-demand cloud offering.
OpenAI expands its custom model training program. OpenAI is expanding a program, Custom Model, to help enterprise customers develop tailored generative AI models using its technology for specific use cases, domains, and applications.
Former Snap AI chief launches Higgsfield to take on OpenAI’s Sora video generator. OpenAI captivated the tech world a few months back with a generative AI model, Sora, that turns scene descriptions into original videos — no cameras or film crews required. But Sora has so far been tightly gated, and the firm seems to be aiming it toward well-funded creatives like Hollywood directors — not hobbyists or small-time marketers, necessarily.
Tesla Raising Pay for AI Engineers To Counter Poaching, Musk Says. Tesla is raising pay for its artificial intelligence (AI) engineers as it fends off poaching from the likes of OpenAI, Chief Executive Officer (CEO) Elon Musk said in a series of posts on X. The plan to boost the pay of AI staff comes as the talent wars for people well-versed in the technology heats up.
YouTube Says OpenAI Training Sora With Its Videos Would Break Rules. The use of YouTube videos to train OpenAI’s text-to-video generator would be an infraction of the platform's terms of service, YouTube Chief Executive Officer Neal Mohan said.
AI-generated YC Demo Day video. A team from the latest YC cohort used AI to create their demo day video, a first for a YC company.

Resources

Link description
Your AI Product Needs Evals. How to construct domain-specific LLM evaluation systems. This post outlines my thoughts on building evaluation systems for LLM-powered AI products.
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild. VoiceCraft is a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on in-the-wild data, including audiobooks, internet videos, and podcasts. To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference audio.
Interrupting Cow. Interruptions make conversations feel natural. Much work has focused on AI voice assistants that can be interrupted by humans, but systems that know much more than us should be able to interrupt us too.
EvoEval: Evolving Coding Benchmarks via LLM. With the help of a new benchmark suite called EvoEval, Large Language Models' coding prowess is put to the ultimate test.
Optimum-NVIDIA. Optimum-NVIDIA delivers the best inference performance on the NVIDIA platform through Hugging Face. Run LLaMA 2 at 1,200 tokens/second (up to 28x faster than the baseline framework) by changing just a single line in your existing transformers code. A sketch of that change is below.
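A hedged sketch of the advertised one-line change, assuming `optimum.nvidia` mirrors the transformers API; the `use_fp8` flag is taken from the project's examples and may vary by version:

```python
# Before: from transformers import AutoModelForCausalLM
from optimum.nvidia import AutoModelForCausalLM  # the single-line change
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    use_fp8=True,  # assumed flag enabling FP8 kernels on supported GPUs
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```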
OpenUI. Building UI components can be a slog. OpenUI aims to make the process fun, fast, and flexible. It's also a tool we're using at W&B to test and prototype our next-generation tooling for building powerful applications on top of LLMs.
openchat-3.5-0106-gemma. The highest-performing Gemma model in the world, trained with OpenChat's C-RLFT on openchat-3.5-0106 data. It achieves performance similar to the Mistral-based OpenChat, and much better than Gemma-7b and Gemma-7b-it.
Generative AI for Beginners (Version 2) - A Course. Microsoft's well-liked course on low-code apps, prompting, vector databases, and LLMs is available on GitHub in version 2. There are eighteen lessons in it. Even though some of the material is aspirational, it's still a useful starting point for the industry.
Industry Documents Library (IDL). A huge, high-quality OCR'd dataset of industry PDF documents: 26M pages and 18B tokens.
SWE-agent. SWE-agent turns LMs (e.g. GPT-4) into software engineering agents that can fix bugs and issues in real GitHub repositories.
chug. A library to help with efficient training for multi-modal data. Initially focused on image & document + text tasks. Minimal sharded dataset loaders, decoders, and utils for multi-modal document, image, and text datasets.
Cosmopedia: how to create large-scale synthetic data for pre-training. The Hugging Face team demonstrates how to create synthetic data for language model pre-training by seeding, synthesizing, filtering, and scaling.
AutoQuant. HuggingFace models can be exported from this notebook into the following five quantization formats: GGUF, GPTQ, EXL2, AWQ, and HQQ.
AI Infrastructure Explained. Innovative applications of AI have captured the public’s imagination over the past year and a half. What’s less appreciated or understood is the infrastructure powering these AI-enabled technologies. But as foundational models get more powerful, we’ll need a strong technology stack that balances performance, cost, and security to enable widespread AI adoption and innovation.
Introducing world's largest synthetic open-source Text-to-SQL dataset. Gretel has gathered a sizable dataset, now on Hugging Face, with roughly 23 million text-to-SQL tokens ready for use, to help models produce SQL queries from natural-language tasks. It can support synthetic data generation as well as RAG applications. A loading sketch follows.
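A sketch of pulling the data with the `datasets` library; the dataset ID `gretelai/synthetic_text_to_sql` is an assumption based on the announcement:

```python
from datasets import load_dataset

# Dataset ID assumed from the announcement; adjust if it differs.
ds = load_dataset("gretelai/synthetic_text_to_sql", split="train")
print(ds[0])  # expect a natural-language prompt, context, and target SQL
```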
Write OpenAPI with TypeSpec. Compared to JSON or YAML, TypeSpec, an API specification language created at Microsoft, provides a more succinct and understandable format for writing OpenAPI. It solves the verbosity and lack of reusable components in OpenAPI by allowing the specification of API patterns as reusable components, which streamlines code production and governance at scale. This is done by drawing inspiration from TypeScript's syntax. The flexibility and productivity gains of TypeSpec may increase the appeal of developing applications using APIs first.

Perspectives

Link description
How Autonomous Racing Is Pushing Self-Driving Cars Forward. The gritty reality of racing without drivers teaches us a lot about the future of autonomous cars.
Does AI need a “body” to become truly intelligent? Meta researchers think so. AIs that can generate videos, quickly translate languages or write new computer code could be world-changing, but can they ever be truly intelligent? Not according to the embodiment hypothesis, which argues that human-level intelligence can only emerge if intelligence is able to sense and navigate a physical environment, the same way babies can.
Nobody Knows How to Safety-Test AI. In line with government goals, Beth Barnes' NGO METR is working with prominent AI firms like OpenAI and Anthropic to create safety checks for sophisticated AI systems. The emphasis is on evaluating hazards, including AI autonomy and self-replication, with the understanding that safety assessments are still in their infancy and cannot ensure AI safety. Despite worries that the existing testing could not be sufficiently trustworthy to support the rapid progress of AI technologies, METR's work is viewed as pragmatic.
Beyond RPA: How LLMs are ushering in a new era of intelligent process automation. RPA failed to achieve the enterprise-wide deployments that were anticipated, notwithstanding a few early triumphs. Only 3% of businesses were able to successfully grow their RPA operations, according to a Deloitte report. Recent developments in AI have the potential to alter this. Because of its innovative features, LLMs are expected to drive at least a tenfold increase in market share for intelligent process automation over the next ten years.
We’re Focusing on the Wrong Kind of AI Apocalypse. When talking about AI's future, people frequently discuss dystopian scenarios rather than the present effects on jobs and misinformation. Instead of bringing about the end of the world, AI has the ability to change work into more fulfilling and productive tasks with careful integration.
How did a small developer of graphics cards for gamers suddenly become the third most valuable firm on the planet? By turning his computer chip-making company Nvidia into a vital component in the AI arms race, Jensen Huang has placed himself at the forefront of the biggest gold rush in tech history
‘It’s very easy to steal someone’s voice’: how AI is affecting video game actors. The increased use of AI to replicate the voice and movements of actors has benefits but some are concerned over how and when it might be used and who might be left short-changed
AI in Africa: Basics Over Buzz. AI’s transformative power is its utility for virtually every economic sector. However, nearly half of the population in sub-Saharan Africa lacks access to electricity, and businesses struggle under the burden of an electricity supply that is among the most expensive and unreliable on earth.
How scientists are making the most of Reddit. As X wanes, researchers are turning to Reddit for insights and data, and to better connect with the public.
Can lessons from infants solve the problems of data-greedy AI? Words and images experienced by an infant wearing sensors during their daily life have led to efficient machine learning, pointing to the power of multimodal training signals and the potentially exploitable statistics of real-life experience.
Full Steam Ahead: The 2024 MAD (Machine Learning, AI & Data) Landscape. This is our tenth annual landscape and “state of the union” of the data, analytics, machine learning, and AI ecosystem.
Building AI Models is faster and cheaper than you probably think. By training or optimizing their foundation models with YC's assistance, YC companies are dispelling the myth that creating AI models takes enormous resources. In just three months, they have accomplished amazing feats like creating original proteins and producing music of a high caliber. These 25 firms have produced creative AI solutions in a variety of industries by utilizing YC's finance and technical capabilities. They show that smaller teams can achieve major improvements in AI through creativity and strategic insights.
Chinese mourners turn to AI to remember and ‘revive’ loved ones. Growing interest in services that create digital clones of the dead as millions visit graves this week for tomb-sweeping festival
When Will the GenAI Bubble Burst? Generative AI might not live up to expectations. The unprofitability of the technology, security flaws, and the innate problem of hallucinations in language models are all causes for concern. The excitement around generative AI may begin to fade unless a ground-breaking model such as GPT-5 is published by the end of 2024, addressing important difficulties and providing a game-changing application.
Inside the shadowy global battle to tame the world's most dangerous technology. This article explores the intricate global attempts to control artificial intelligence (AI), which is considered to be one of the most powerful and dangerous technologies of our day.
How to win at Vertical AI. Vertical B2B applications, where AI agents and open APIs play a critical role in rebundling and generating new business value, are where artificial intelligence truly shines. Domain-specific models provide vertical AI with an advantage in the near term, but horizontal integration into larger ecosystems is necessary for long-term success. AI agents make it possible to rebundle workflows, which transforms management procedures and gives businesses new competitive advantages across a range of industries.
Where AI Thrives, Religion May Struggle. According to a study headed by Adam Waytz and Joshua Conrad Jackson, there may be a correlation between a drop in religious beliefs and growing exposure to robotics and AI. Higher robotization countries have higher declines in religiosity. According to the study, those whose occupations involved a lot of AI had a much lower likelihood of believing in God. These associations suggest that automation technologies could have an impact on the loss of religion.

meme-of-the-week

Back to index

ML news: Week 25 - 31 March

Research

Link description
Mora: Enabling Generalist Video Generation via A Multi-Agent Framework. This paper introduces Mora, a new multi-agent framework designed to close the gap in the field of generalist video generation, mimicking the capabilities of the leading model, Sora, across a range of tasks including text-to-video and video editing. Despite achieving performance close to Sora in various tasks, Mora still faces a holistic performance gap, marking a step towards future advancements in collaborative AI agents for video generation.
Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models. Text-to-image diffusion models such as Stable Diffusion are altered by Open-Vocabulary Attention Maps (OVAM), which overcome earlier restrictions by enabling the creation of attention maps for any word.
HETAL: Efficient Privacy-preserving Transfer Learning with Homomorphic Encryption. Securing data privacy with Homomorphic Encryption, HETAL's novel method of transfer learning represents a major advancement in safe AI training.
HAC: Hash-grid Assisted Context for 3D Gaussian Splatting Compression. This paper presents the Hash-grid Assisted Context (HAC) framework, which outperforms existing standards by achieving over 75X compression of 3D Gaussian Splatting (3DGS) data.
Shadow Generation for Composite Image Using Diffusion model. This work overcomes earlier difficulties with form and intensity accuracy to present a novel approach to producing realistic shadows in picture composition. The addition of intensity modulation modules to ControlNet and the expansion of the DESOBA dataset allowed the researchers to achieve a considerable improvement in shadow production in pictures.
View-decoupled Transformer for Person Re-identification under Aerial-ground Camera Network. The View-Decoupled Transformer (VDT) was created by researchers to address the problem of detecting subjects from disparate camera perspectives, such as those obtained from ground and aerial cameras.
ElasticDiffusion: Training-free Arbitrary Size Image Generation. Text-to-image diffusion models can now generate images in different sizes and aspect ratios without the need for extra training thanks to ElasticDiffusion, an inventive decoding technique.
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model. The Large Multi-modal Model (LMM) is extended by PSALM, which adds a mask decoder and a flexible input schema to perform well in a range of picture segmentation tasks. This method not only gets beyond the drawbacks of text-only outputs, but also makes it possible for the model to comprehend and categorize complicated images with ease.
Compositional Inversion for Stable Diffusion Models. In order to solve overfitting problems, researchers have devised a novel technique to enhance the way AI generates individualized visuals. This method guarantees that the thoughts are represented in the images in a more varied and balanced manner.
Residual Dense Swin Transformer for Continuous Depth-Independent Ultrasound Imaging. With arbitrary-scale super-resolution, RDSTN is a novel network that addresses the trade-off between field-of-view and picture quality in ultrasound imaging.
UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity. A new standard for text-based person retrieval is UFineBench. To aid AI in comprehending and locating persons in photos, it makes use of thorough descriptions.
SegRefiner: Towards Model-Agnostic Segmentation Refinement with Discrete Diffusion Process. By understanding refinement as a data creation process, SegRefiner is a novel model-agnostic approach that enhances object mask quality in a variety of segmentation applications. Through the use of a discrete diffusion method, it fine-tunes coarse masks pixel by pixel, improving border metrics and segmentation.
VMRNN: Integrating Vision Mamba and LSTM for Efficient and Accurate Spatiotemporal Forecasting. The VMRNN cell is a new recurrent unit that combines the strengths of LSTM and Vision Mamba blocks. Comprehensive tests show that the approach achieves competitive results on a range of pivotal benchmarks while retaining a smaller model size.
Salience-DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement. In order to balance computing economy and accuracy, this research presents Salience DETR, which uses hierarchical salience filtering to improve query selection in object identification.
Universal Cell Embeddings: A Foundation Model for Cell Biology. We present the Universal Cell Embedding (UCE) foundation model. UCE was trained on a corpus of cell atlas data from humans and other species in a completely self-supervised way without any data annotations. UCE offers a unified biological latent space that can represent any cell, regardless of tissue or species. This universal cell embedding captures important biological variation despite the presence of experimental noise across diverse datasets.
AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation. Using just one reference image and voice input, the AniPortrait framework can produce realistic animated portraits. This technique creates animations that are exceptional in terms of authentic facial expressions, a variety of poses, and great visual quality by first converting audio into 3D representations and then mapping them onto 2D facial landmarks.
PAID: (Prompt-guided) Attention Interpolation of Text-to-Image. Two methods, AID and its variant PAID, are designed to enhance image interpolation by incorporating text and pose conditions. Without any additional training, these techniques produce interpolated images with improved consistency, smoothness, and fidelity.
The Need for Speed: Pruning Transformers with One Recipe. With the help of the OPTIN framework, transformer-based AI models can now be more effective across a range of domains without requiring retraining. Through the use of an intermediate feature distillation technique, OPTIN is able to compress networks under certain conditions with minimal impact on accuracy.
Long-form factuality in large language models. Language models can produce long-form factual content. Google has released a benchmark and dataset that show how each model performs. The research demonstrates that language models outperform human annotators in most situations and offers advice on how to improve a model's factuality.
CoDA: Instructive Chain-of-Domain Adaptation with Severity-Aware Visual Prompt Tuning. A novel method for Unsupervised Domain Adaptation (UDA) is called CoDA. It learns from variances at both the scene and image levels, which aids AI models in becoming more adaptive to unlabeled, difficult settings.
Backtracing: Retrieving the Cause of the Query. This method finds the precise content—from lectures to news articles—that prompts users to ask questions online. Backtracking is a technique that seeks to assist content producers in improving their work by locating and comprehending the reasons for misunderstandings, inquisitiveness, or emotional responses.
CT-CLIP. A foundation model utilizing chest CT volumes and radiology reports for supervised-level zero-shot detection of abnormalities.

News

Link description
Stability AI CEO resigns to 'pursue decentralized AI'. Emad Mostaque's resignation comes after key departures at the AI startup. See also the company announcement.
GTC Wrap-Up: ‘We Created a Processor for the Generative AI Era,’ NVIDIA CEO Says. Kicking off the biggest GTC conference yet, NVIDIA founder and CEO Jensen Huang unveils NVIDIA Blackwell, NIM microservices, Omniverse Cloud APIs, and more.
After raising $1.3B, Inflection is eaten alive by its biggest investor, Microsoft. In June 2023, Inflection announced it had raised $1.3 billion to build what it called “more personal AI.” The lead investor was Microsoft. Today, less than a year later, Microsoft announced that it was feasting on Inflection’s body and sucking the marrow from the bones (though I think they phrased it differently).
OpenAI is pitching Sora to Hollywood. The AI company is scheduled to meet with a number of studios, talent agencies, and media executives in Los Angeles next week to discuss partnerships, sources familiar with the matter told Bloomberg.
GitHub’s latest AI tool can automatically fix code vulnerabilities. It’s a bad day for bugs. Earlier today, Sentry announced its AI Autofix feature for debugging production code and now, a few hours later, GitHub is launching the first beta of its code-scanning autofix feature for finding and fixing security vulnerabilities during the coding process.
Researchers gave AI an 'inner monologue' and it massively improved its performance. Scientists trained an AI system to think before speaking using a technique called Quiet-STaR. The inner monologue improved common-sense reasoning and doubled math performance.
A California city is training AI to spot homeless encampments. For the last several months, a city at the heart of Silicon Valley has been training artificial intelligence to recognize tents and cars with people living inside, in what experts believe is the first experiment of its kind in the United States.
Sora: First Impressions. A compilation of Sora content generated by visual artists, designers, creative directors, and filmmakers.
Open Interpreter O1 Light. The 01 Light is a portable speech interface that controls your home computer. It can view your screen, use your applications, and pick up new skills. The open-source 01 serves as the basis for a new generation of AI devices.
Character Voice For Everyone. Character Voice is a set of capabilities that elevates the Character.AI experience by enabling users to hear Characters conversing with them one-on-one. The company's bigger goal is to create a multimodal interface that will enable more smooth, simple, and interesting interactions. This is the first step toward that goal.
Cerebras Systems Unveils World's Fastest AI Chip with Whopping 4 Trillion Transistors. Cerebras' new wafer-scale chip can train language models of up to 24T parameters. PyTorch is supported natively.
The GPT-4 barrier has finally been broken. Four weeks ago, GPT-4 remained the undisputed champion: consistently at the top of every key benchmark, but more importantly the clear winner in terms of “vibes”. Today that barrier has finally been smashed. We have four new models, all released to the public in the last four weeks, that are benchmarking near or even above GPT-4.
China puts trust in AI to maintain largest high-speed rail network on Earth. The railway system is in better condition than when it was first built, according to a peer-reviewed paper. Vast amounts of real-time data are processed by an artificial intelligence system in Beijing to identify problems before they arise, the engineers say.
Microsoft to hold a special Windows and Surface AI event in May. Ahead of Build 2024, Microsoft CEO Satya Nadella will share the company’s ‘AI vision’ for both software and hardware.
AI ‘apocalypse’ could take away almost 8m jobs in the UK, says the report. Women, younger workers and lower paid are at most risk from artificial intelligence, says IPPR thinktank
Elon Musk says all Premium subscribers on X will gain access to AI chatbot Grok this week. Following Elon Musk’s xAI’s move to open source its Grok large language model earlier in March, the X owner on Tuesday said that the company formerly known as Twitter will soon offer the Grok chatbot to more paying subscribers.
OpenAI’s chatbot store is filling up with spam. TechCrunch found that the GPT Store, OpenAI’s official marketplace for GPTs, is flooded with bizarre, potentially copyright-infringing GPTs that imply a light touch where it concerns OpenAI’s moderation efforts.
Apple's big WWDC 2024 announcement may be an AI App Store. Apple's AI strategy may not necessarily be to only offer the best AI apps it can produce, but instead deliver an enhanced AI App Store that may debut at WWDC.
Mathematicians use AI to identify emerging COVID-19 variants. Scientists at The Universities of Manchester and Oxford have developed an AI framework that can identify and track new and concerning COVID-19 variants and could help with other infections in the future.
iOS 18 Reportedly Won't Feature Apple's Own ChatGPT-Like Chatbot. Bloomberg's Mark Gurman today reported that Apple is not planning to debut its own generative AI chatbot with its next major software updates, including iOS 18 for the iPhone. Instead, he reiterated that Apple has held discussions with companies such as Google, OpenAI, and Baidu about potential generative AI partnerships.
Introducing DBRX: A New State-of-the-Art Open LLM. DBRX, an open, general-purpose LLM created by Databricks. Across a range of standard benchmarks, DBRX sets a new state-of-the-art for established open LLMs.
Amazon invests another $2.75B in Anthropic — reportedly ‘largest’ in company history. Today, Amazon announced it has finalized that investment at the full planned amount, putting in another $2.75 billion atop the $1.25 billion it originally committed last year. According to CNBC, it is Amazon’s “largest venture investment yet.”
OpenAI Is Starting To Test GPT Earnings Sharing. We're partnering with a small group of US builders to test usage-based GPT earnings. Our goal is to create a vibrant ecosystem where builders are rewarded for their creativity and impact, and we look forward to collaborating with builders on the best approach to get there.
Nvidia Tops MLPerf’s Inferencing Tests. Now that we’re firmly in the age of massive generative AI, it’s time to add two such behemoths, Llama 2 70B and Stable Diffusion XL, to MLPerf’s inferencing tests. Version 4.0 of the benchmark tests more than 8,500 results from 23 submitting organizations. As has been the case from the beginning, computers with Nvidia GPUs came out on top, particularly those with its H200 processor. But AI accelerators from Intel and Qualcomm were in the mix as well.
AI21 releases Jamba Language Model. Jamba pairs Mamba-style layers, designed to outperform Transformers on efficiency while maintaining performance parity, with MoE layers. With a context length of 128k tokens, it runs at 1.6k tokens per second and scores 67% on the MMLU benchmark. Weights are available; a loading sketch is below.
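A sketch of loading the open weights through transformers; the checkpoint ID `ai21labs/Jamba-v0.1` is assumed from the release, and `trust_remote_code` is needed because the hybrid Mamba/MoE layers ship as custom code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/Jamba-v0.1"  # assumed checkpoint ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, device_map="auto"
)

inputs = tokenizer("In the beginning", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```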
Hume introduces Empathic Voice Interface. Meet Hume’s Empathic Voice Interface (EVI), the first conversational AI with emotional intelligence.
Google starts testing AI overviews from SGE in main Google search interface. Google is now testing AI overviews in the main Google Search results, even if you have not opted into the Google Search Generative Experience labs feature. Google said this is an experience on a “subset of queries, on a small percentage of search traffic in the U.S.,” a Google spokesperson told Search Engine Land.
LLaVA-HR: High-Resolution Large Language-Vision Assistant. This repository contains the implementation of LLaVA-HR, a strong and efficient MLLM powered by our mixture-of-resolution adaptation.
Meta is adding AI to its Ray-Ban smart glasses next month. The Ray-Ban Meta Smart Glasses can do things like identify objects, monuments, and animals, as well as translate text.
Google bringing Gemini Nano to Pixel 8 with next Feature Drop. The Pixel 8 will get Gemini Nano, in developer preview, to power Summarize in Recorder and Gboard Smart Reply. The latter allows for “higher-quality smart replies” that have “conversational awareness” and should be generated faster. On the Pixel 8 Pro, it works with WhatsApp, Line, and KakaoTalk. Meanwhile, Summarize can take a recording and generate bullet points.

Resources

Link description
Building and testing C extensions for SQLite with ChatGPT Code Interpreter. This essay goes into great detail on how to use ChatGPT (or any other language model) to write code in an unfamiliar language for a difficult task. Its author writes, compiles, and downloads new bindings for the well-known database SQLite using ChatGPT's code interpreter.
Official Mistral Fine-tuning Code. Mistral recently organized a hackathon, where it published official code for fine-tuning its language models along with version 0.2 of the 7B model. The code is clear and easy to read.
Scalable Optimal Transport. A curated list of research works and resources on optimal transport in machine learning.
AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation. AdaIR presents an all-in-one image restoration network that addresses several types of picture deterioration such as noise, blur, and haze by using frequency mining and modulation.
Turbocharged Training: Optimizing the Databricks Mosaic AI stack with FP8. The Databricks Mosaic team continues to push language model training forward. In this post, they discuss their FP8 training stack and the potential benefits of reduced-precision training.
Low-latency Generative AI Model Serving with Ray, NVIDIA Triton Inference Server, and NVIDIA TensorRT-LLM. A new collaboration between Anyscale and NVIDIA will allow users to scale generative AI models into production. Customers can enhance resource management, observability, and autoscaling by utilizing the combined capabilities of Anyscale's managed runtime environment and Ray through this integration.
Discover The Best AI Websites & Tools. 11006 AIs and 233 categories in the best AI tools directory. AI tools list & GPTs store are updated daily by ChatGPT.
codel. Fully autonomous AI Agent that can perform complicated tasks and projects using a terminal, browser, and editor.
binary vector search is better than your FP32 vectors. Searching over embedding vectors is a crucial component of RAG pipelines. By substituting a single 0 or 1 for each fp32 value, then running a KNN search over the binary vectors and reranking the shortlist with the full-precision embeddings, you can cut memory needs by ~30x while retaining performance. A minimal sketch follows.
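A minimal numpy sketch of the pattern on toy data: binarize once, search by Hamming distance, then rerank a shortlist with the original fp32 vectors (an illustration, not a tuned implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, 768)).astype(np.float32)
query = rng.standard_normal(768).astype(np.float32)

# 1) Binarize: keep only the sign, packed 8 dims per byte (32x smaller).
corpus_bits = np.packbits(corpus > 0, axis=1)
query_bits = np.packbits(query > 0)

# 2) Coarse search: Hamming distance via XOR + bit counting.
xor = np.bitwise_xor(corpus_bits, query_bits)
hamming = np.unpackbits(xor, axis=1).sum(axis=1)
shortlist = np.argsort(hamming)[:100]

# 3) Rerank the shortlist with full-precision dot products.
scores = corpus[shortlist] @ query
top10 = shortlist[np.argsort(-scores)[:10]]
print(top10)
```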
Deepfake Generation and Detection: A Benchmark and Survey. This thorough analysis explores the developments and difficulties around deepfake technology and its detection, emphasizing the arms race between those who produce deepfakes and those who are creating systems to identify them.
Evaluate LLMs in real-time with Street Fighter III. Make LLMs fight each other in real-time in Street Fighter III. Each player is controlled by an LLM. We send the LLM a text description of the screen, and the LLM decides the next moves its character will make. The next moves depend on its previous moves, the moves of its opponents, and its power and health bars.
Superpipe. Superpipe is a lightweight framework to build, evaluate, and optimize LLM pipelines for structured outputs: data labeling, extraction, classification, and tagging. Evaluate pipelines on your own data and optimize models, prompts, and other parameters for the best accuracy, cost, and speed.

Perspectives

Link description
How People Are Really Using GenAI. There are many use cases for generative AI, spanning a vast number of areas of domestic and work life. Looking through thousands of comments on sites such as Reddit and Quora, the author’s team found that the use of this technology is as wide-ranging as the problems we encounter in our lives. The 100 categories they identified can be divided into six top-level themes, which give an immediate sense of what generative AI is being used for: Technical Assistance & Troubleshooting (23%), Content Creation & Editing (22%), Personal & Professional Support (17%), Learning & Education (15%), Creativity & Recreation (13%), Research, Analysis & Decision Making (10%).
Untangling concerns about consolidation in AI. Microsoft's recent acquisition of Inflection's talent sparked discussions about the largest tech giants having too much influence over AI research and development. Although they have the resources to work quickly on basic language models, there are legitimate concerns that the concentration of power would stifle transparency and innovation. This article examines the intricate trade-offs that arise as artificial intelligence becomes more widely used.
‘A landmark moment’: scientists use AI to design antibodies from scratch. Modified protein-design tool could make it easier to tackle challenging drug targets — but AI antibodies are still a long way from reaching the clinic.
TechScape: Is the US calling time on Apple’s smartphone domination? The tech giant fights regulators on both sides of the Atlantic, as the US government launches a grab-bag of accusations. Plus, Elon Musk’s bad day in court
Go, Python, Rust, and production AI applications. The roles of Python, Go, and Rust in developing AI applications are covered in this article: Go is used for larger-scale production, Python is used for developing AI models, and Rust is used for tasks requiring high performance. It highlights the significance of choosing the appropriate language for the task based on the ecosystem and tool fit, speculating that Go may replace Python as the production language. The author promotes connecting the Go and Python communities to improve the development of AI applications.
Trends in Synthetic Biology & AI in Drug Discovery in 2024. 2024 promises to be a historic year for artificial intelligence in drug discovery, with significant progress being made in synthetic biology. The synthesis of modular biological components and the impact of generative AI on research are two prominent themes that are highlighted in this article. The entry of Insilico Medicine's AI-powered candidate into Phase II clinical trials demonstrates how the combination of artificial intelligence and synthetic biology is speeding up the drug discovery process.
LLMs have special intelligence, not general, and that's plenty. In sophisticated cognitive tests, Anthropic's new AI model Claude 3 performs better than other models, including GPT-4, and above the average human IQ. Even with this success, Claude 3 still struggles with simple puzzles and other basic tasks that people take for granted. Rather than having general intelligence like that of humans, LLMs can have a "Special Intelligence": they can creatively reflect back to us what they know.
AI SaaS Companies Will Be More Profitable. The deflationary impacts of AI in marketing, sales, operations, and software development could mean that while AI software companies may initially incur higher costs, they could end up being more profitable than traditional SaaS companies.
AI image generators often give racist and sexist results: can they be fixed? Researchers are tracing sources of racial and gender bias in images generated by artificial intelligence, and making efforts to fix them.
How AI is improving climate forecasts. Researchers are using various machine-learning strategies to speed up climate modelling, reduce its energy costs and hopefully improve accuracy.
Here’s why AI search engines really can’t kill Google. The AI search tools are getting better — but they don’t yet understand what a search engine really is and how we really use them.
Inside the shadowy global battle to tame the world's most dangerous technology. The problem of controlling AI is one that the world is now facing. Global leaders, tech executives, and legislators convened many high-profile meetings and conferences that exposed disagreements and differences over how to regulate this game-changing technology.
Hackers can read private AI-assistant chats even though they’re encrypted. All non-Google chat GPTs affected by side channel that leaks responses sent to users.
Towards 1-bit Machine Learning Models. Recent works on extreme low-bit quantization, such as BitNet and 1.58-bit models, have attracted a lot of attention in the machine learning community. The main idea is that matrix multiplication with quantized weights can be implemented without multiplications, which can potentially be a game-changer for the compute efficiency of large machine learning models. A toy sketch of multiplication-free matrix products appears after this list.
AI escape velocity. The law of accelerating returns, which holds that progress proceeds at an exponential pace over time, was formulated by AI futurist Ray Kurzweil. In a recent talk, Kurzweil covered a wide range of subjects, such as prospects that will only keep improving, the future of the AI economy, human relationships with AIs, longevity escape velocity, and much more.
Plentiful, high-paying jobs in the age of AI. Experts in AI are investigating automating human functions, raising fears about job losses and declining wages. The belief that advances in AI would eventually render human labor obsolete, however, may not be accurate. Constraints like computer power and opportunity costs may mean that humans will still have jobs in an AI-dominated future, but this is not a given.

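As a companion to the 1-bit quantization entry above, here is a toy numpy illustration of why ternary weights make multiplication unnecessary: with weights restricted to {-1, 0, +1}, a matrix-vector product reduces to additions and subtractions. This shows only the concept; real BitNet-style kernels pack the weights and fuse these operations for actual speedups.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))        # ternary weights, (out, in)
x = rng.standard_normal(8).astype(np.float32)

# Multiplication-free matrix-vector product: for each output row,
# add the activations where w == +1 and subtract them where w == -1.
y = np.array([x[row == 1].sum() - x[row == -1].sum() for row in W])

# Same result as an ordinary matmul with the ternary weights.
assert np.allclose(y, W @ x)
print(y)
```
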
meme-of-the-week

Back to index

ML news: Week 18 - 24 March

Research

Link description
ScoreHMR: Score-Guided Diffusion for 3D Human Recovery. We present Score-Guided Human Mesh Recovery (ScoreHMR), an approach for solving inverse problems for 3D human pose and shape reconstruction. ScoreHMR mimics model fitting approaches, but alignment with the image observation is achieved through score guidance in the latent space of a diffusion model. Here, we show the application of our approach on videos, utilizing keypoint detections and score guidance with keypoint reprojection and temporal smoothness terms.
Cappy: Outperforming and boosting large multi-task language models with a small scorer. A small model called Cappy has been trained to take an instruction and a candidate completion, then return a score for how well the completion satisfies the instruction. It performs better on this task than significantly larger models, suggesting it can serve as a feedback mechanism for both generation and training.
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation. demonstrates how LLM reasoning and generation in long-horizon generation tasks can be greatly enhanced by iteratively revising a chain of thoughts with information retrieval; the key idea is that each thought step is revised using information retrieved with the task query together with the current and past thought steps; Retrieval Augmented Thoughts (RAT) is a zero-shot prompting approach that offers notable improvements over baselines including vanilla RAG and zero-shot CoT prompting. RAT can be applied to various models such as GPT-4 and CodeLlama-7B to improve long-horizon generation tasks (e.g., creative writing and embodied task planning). A sketch of the loop appears after this list.
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking. outlines Quiet-STaR, a generalization of STaR that enables language models (LMs) to acquire reasoning skills that are more scalable and general; Quiet-STaR gives LMs the ability to produce justifications for each token to explain the future text; it suggests a token-wise parallel sampling approach that enhances LM predictions by producing internal thoughts effectively; REINFORCE is used to improve the rationale creation.
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM. suggests combining expert LLMs with a Mixture-of-Experts LLM as a more computationally efficient way to train LLMs. This method, called BTX, is shown to be more effective than training a single specialized LLM or a larger generalist LLM. It works by first training (in parallel) multiple copies of a seed LLM with specialized knowledge in different domains (i.e., expert LLMs), then combining them into a single LLM using MoE feed-forward layers. Finally, the entire unified model is fine-tuned.
Large language models surpass human experts in predicting neuroscience results. suggests using BrainBench as a benchmark to assess LLMs' capacity to forecast neuroscience outcomes; discovers that LLMs outperform experts in forecasting the results of experiments; an LLM that has been modified based on neuroscience literature has been demonstrated to do even better.
Uni-SMART: Universal Science Multimodal Analysis and Research Transformer. Comprehensive literature analysis is challenged by the constant growth of the scientific literature. LLMs offer a promising option because of their ability to summarize, yet they are not well-suited to the multimodal elements common in scientific content. Uni-SMART (Universal Science Multimodal Analysis and Research Transformer) was created to fill this gap by understanding and analyzing the complex multimodal data found in scientific publications.
Mechanics of Next Token Prediction with Self-Attention. Next-token prediction is a simple objective that gives rise to complex behavior. This work shows that a single self-attention layer trained with gradient descent decomposes the problem into two parts, hard retrieval and soft composition, which enables strong overall performance and in-context learning.
Knowledge Distillation in YOLOX-ViT for Side-Scan Sonar Object Detection. By combining visual transformers with knowledge distillation, YOLOX-ViT presents a novel method for object recognition in underwater robots.
GroupContrast. GroupContrast combines semantic-aware contrastive learning with segment grouping to redefine self-supervised 3D representation learning.
Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification. This study presents a novel approach to multi-modal object re-identification across spectra such as RGB, near-infrared, and thermal imaging, with an emphasis on object-centric information. The goal is to increase recognition accuracy by mitigating the effects of background noise.
Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation. Stable Diffusion 3 is a potent model for creating images. This study presents Latent Adversarial Diffusion Distillation, which reduces the number of diffusion steps to four while keeping image quality constant.
Distilling Datasets Into Less Than One Image. Poster Dataset Distillation (PoDD): We propose PoDD, a new dataset distillation setting for a tiny, under 1 image-per-class (IPC) budget. In this example, the standard method attains an accuracy of 35.5% on CIFAR-100 with approximately 100k pixels, and PoDD achieves an accuracy of 35.7% with less than half the pixels (roughly 40k)
MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control. MineDreamer is an AI bot that creatively uses state-of-the-art language and vision models to follow complex instructions in the Minecraft world.
DreamDA: Generative Data Augmentation with Diffusion Models. DreamDA presents a novel method of data augmentation by creating high-quality, diversified synthetic visuals that closely resemble the original data distribution using diffusion models.
Chain-of-Spot: Interactive Reasoning Improves Large Vision-language Models. The Interactive Reasoning method known as Chain-of-Spot (CoS) greatly improves the way Large Vision-Language Models (LVLMs) analyze and comprehend pictures. With CoS, LVLMs may obtain precise visual information without sacrificing picture quality by concentrating on specific regions of interest inside images in response to predetermined inquiries or commands.
StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On. A new method for image-based virtual try-on is called StableVITON. This approach takes advantage of the creative capacity of diffusion models that have already been trained while paying attention to garment details. StableVITON discovers semantic correspondences in the latent space of a pre-trained model between clothing and the human body.
Diffusion-based Video Translation. FRESCO is a unique method that greatly enhances the spatial-temporal consistency in video translation tasks by combining intra-frame and inter-frame correspondences.
Generalized Consistency Trajectory Models. With the introduction of Generalized Consistency Trajectory Models (GCTMs), this effort improves the capabilities of diffusion models for tasks such as image restoration and editing. By translating between any two distributions in a single step, these models simplify the procedure and enable remarkably accurate and efficient image modification.
Introducing SceneScript, a novel approach for 3D scene reconstruction. A model developed by Meta Reality Labs can convert visual input into a three-dimensional (3D) representation of a scene. The 70M-parameter model is exceptionally stable and runs quickly on-device.
Scalable Diffusion Models with State Space Backbone. A novel kind of diffusion model, Diffusion State Space Models (DiS), uses a state space backbone for image data instead of the conventional U-Net. These models can capture long-range dependencies and are effective at producing high-quality images with little compute.
PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns. PuzzleVQA is a dataset created to evaluate the abstract reasoning capacity of large multimodal models such as GPT-4V.

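For the RAT entry above, here is a hedged sketch of the revise-with-retrieval loop as described: draft a chain of thoughts, then revise each step using documents retrieved with the task query plus the current and past steps. `llm` and `retrieve` are hypothetical stand-ins for a real model call and retriever, and the prompt strings are illustrative, not the paper's templates.

```python
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in a real LLM call here")

def retrieve(query: str, k: int = 3) -> list[str]:
    raise NotImplementedError("plug in a real retriever here")

def rat(task: str, num_steps: int = 5) -> list[str]:
    # 1. Draft an initial chain of thoughts without retrieval.
    draft = llm(f"Task: {task}\nWrite {num_steps} reasoning steps.").split("\n")
    revised: list[str] = []
    # 2. Revise each step using documents retrieved with the task query,
    #    the current draft step, and the steps revised so far.
    for step in draft:
        docs = retrieve(f"{task}\n{' '.join(revised)}\n{step}")
        revised.append(llm(
            f"Task: {task}\nPrevious steps: {revised}\n"
            f"Current step: {step}\nRelevant context: {docs}\n"
            "Rewrite the current step so it is consistent with the context."
        ))
    return revised
```
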
News

Link description
Open Release of Grok-1. We are releasing the weights and architecture of our 314 billion parameter Mixture-of-Experts model, Grok-1.
Did OpenAI just accidentally leak the next big ChatGPT upgrade? OpenAI may have accidentally leaked details about a new AI model called GPT-4.5 Turbo. The leak suggests that GPT-4.5 Turbo will be faster, more accurate, and have a larger knowledge base than its predecessor.
Claude 3 Haiku: our fastest model yet. Today we’re releasing Claude 3 Haiku, the fastest and most affordable model in its intelligence class. With state-of-the-art vision capabilities and strong performance on industry benchmarks
Midjourney debuts feature for generating consistent characters across multiple gen AI images. The popular AI image generating service Midjourney has deployed one of its most oft-requested features: the ability to recreate characters consistently across new images.
Apple researchers achieve breakthroughs in multimodal AI as company ramps up investments. Apple researchers have developed new methods for training large language models on both text and images, enabling more powerful and flexible AI systems, in what could be a significant advance for artificial intelligence and for future Apple products.
Introducing Stable Video 3D: Quality Novel View Synthesis and 3D Generation from Single Images. Today we are releasing Stable Video 3D (SV3D), a generative model based on Stable Video Diffusion, advancing the field of 3D technology and delivering greatly improved quality and view-consistency.
Google researchers unveil ‘VLOGGER’, an AI that can bring still photos to life. Google researchers have developed a new artificial intelligence system that can generate lifelike videos of people speaking, gesturing and moving — from just a single still photo. The technology, called VLOGGER, relies on advanced machine learning models to synthesize startlingly realistic footage, opening up a range of potential applications while also raising concerns about deepfakes and misinformation.
Microsoft has added the GPT-4 Turbo LLM to the free version of Copilot. Microsoft is boosting the performance of its Copilot generative AI chatbot today. It has been confirmed that all free Copilot users can now access the GPT-4 Turbo large language model from OpenAI.
Korean researchers power-shame Nvidia with new neural AI chip — claim 625 times less power draw, 41 times smaller. The new C-Transformer chip is claimed to be the world's first ultra-low power AI accelerator chip capable of large language model (LLM) processing.
Inflection co-founders leave for Microsoft AI. Karén Simonyan and Mustafa Suleyman are leaving Inflection to launch Microsoft AI. Inflection's next CEO will be Sean White. Additionally, a few Inflection senior team members are joining Microsoft AI.
Lilac acquired by Databricks. Lilac is a scalable, user-friendly tool for data scientists to search, cluster, and analyze any kind of text dataset with a focus on generative AI.
IBM and NASA build language models to make scientific knowledge more accessible. In a new collaboration, IBM and NASA created a suite of efficient language models by training on scientific literature. Based on the transformer architecture, these models can be used in a variety of applications, from classification and entity extraction to question-answering and information retrieval. These models achieve high performance across a variety of domains and can respond promptly. We have open-sourced the models on Hugging Face for the benefit of the scientific and academic community.
Introducing RAG 2.0. Retrieval augmented generation (RAG) is a technique for adding knowledge to a language model whose built-in knowledge would otherwise become stale. Unfortunately, outside of demonstrations, the current paradigm of "frozen RAG," in which just a portion of the pipeline is trained and the model itself is not updated, performs badly. This blog describes the next generation of RAG, where all the components are fine-tuned for the job at hand. In this system, an open model such as Mistral 7B can perform better than the conventional GPT-4 RAG.
Fitbit Using Google Gemini for New AI That Could Become Your Fitness Coach. Google is training Gemini on health data, and it's creating a new AI model for the Fitbit app that can give advice tailored to your needs.
Stable Diffusion maker leaves Stability AI. Robin Rombach helped build the tech that made Stability AI famous, now he's leaving the company
Introducing Copilot4D: A Foundation Model for Self-Driving. Waabi's Copilot4D is a ground-breaking foundation model that advances the capabilities of autonomous machines by using LiDAR data to comprehend and forecast the 3D dynamics of the environment across time.
NLX Raises $15M in Series A Funding. In March 2024, NLX extended its Series A funding to $15M, adding Comcast Ventures.
Triton Puzzles. Triton is an open-source alternative language that lets you write code at a higher level and compile it for accelerators like GPUs. This set of puzzles is meant to teach you how to use Triton from first principles in an interactive fashion. You will start with trivial examples and build your way up to real algorithms like Flash Attention and quantized neural networks. These puzzles do not need to run on a GPU since they use a Triton interpreter. A minimal kernel in this style appears after this list.
New Breakthrough Brings Matrix Multiplication Closer to Ideal. Researchers from Tsinghua University and UC Berkeley have made great strides in matrix multiplication, introducing a novel method that has already inspired improvements. Significant time, power, and cost savings in a variety of applications could result from this development in a fundamental computer procedure. Since the previous milestone in 2010, this is the most significant advancement in lowering the computational cost of matrix multiplication.
OpenAI could release GPT-5 in a few months: Report. OpenAI could release GPT-5, the next-generation of its groundbreaking large language model, in a few months, according to a new report.
Beijing court’s ruling that AI-generated content can be covered by copyright eschews US stand, with far-reaching implications on tech’s use. The Beijing Internet Court ruled that an AI-generated image in an intellectual property dispute was an artwork protected by copyright laws. That decision is expected to have far-reaching implications for future AI copyright disputes, which could eventually benefit Chinese Big Tech companies.
Japan’s premier AI lab launches its first model. Sakana AI develops cutting-edge models for Japanese language, vision, and image generation. It introduced evolutionary model merging, a way to evolve foundation models without the need for costly retraining. The merged models and a description of the process are now available.
Cohere’s Command-R Enterprise Model Coming to ai.nvidia.com. The RAG-optimized Command-R model from Cohere, which is intended to help enterprises transition to large-scale production, will soon be available in the freshly released NVIDIA API catalog.
Biden-Harris Administration Announces Deal with Intel for AI Chips. Biden-Harris Administration Announces Preliminary Terms with Intel to Support Investment in U.S. Semiconductor Technology Leadership and Create Tens of Thousands of Jobs
Apple’s AI ambitions could include Google or OpenAI. The iPhone-maker is in ‘active’ talks to bring Gemini to the iPhone and has also considered using ChatGPT.
World’s first major act to regulate AI passed by European lawmakers. The European Union’s parliament on Wednesday approved the world’s first major set of regulatory ground rules to govern the mediatized artificial intelligence at the forefront of tech investment. Born in 2021, the EU AI Act divides the technology into categories of risk, ranging from “unacceptable” — which would see the technology banned — to high, medium and low hazard.

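As background for the Triton Puzzles entry above, here is the canonical vector-add kernel in Triton's Python DSL, roughly the level the first puzzles start from. The block size of 1024 is an arbitrary illustrative choice.

```python
import torch
import triton
import triton.language as tl

# Each program instance handles one BLOCK-sized slice of the vectors.
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n                      # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)          # one program per block of 1024
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out

# Usage (on a CUDA device): x = torch.randn(5000, device="cuda"); add(x, x)
```
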
Resources

Link description
tlm - Local CLI Copilot, powered by CodeLLaMa. tlm is your CLI companion which requires nothing except your workstation. It uses the most efficient and powerful CodeLLaMa in your local environment to provide you with the best possible command line suggestions.
Multi-node LLM Training on AMD GPUs. This blog post describes the whole stack Lamini uses to train models on AMD GPUs, including schedulers, model training software, and more.
clarity-upscaler. A state-of-the-art image upscaling tool.
musiclang_predict. Music Lang is an API and set of models that generate music.
Optimizing Technical Docs for LLMs. Capa.ai provides guidance on how to organize LLM documentation, including how to include troubleshooting FAQs, self-contained code snippets, segmentation into sub-products, and community forum creation.
lamini/earnings-calls-qa. This dataset contains transcripts of earning calls for various companies, along with questions and answers related to the companies' financial performance and other relevant topics.
Knowledge Conflicts for LLMs: A Survey. a summary of the prevalent problem of knowledge conflict that arises while working with LLMs; the survey article divides these conflicts into three categories: intra-memory, inter-context, and context-memory conflict. It also offers insights into the sources of these conflicts and possible solutions.
Enhancing RAG-based application accuracy by constructing and leveraging knowledge graphs. A practical guide to constructing and retrieving information from knowledge graphs in RAG applications with Neo4j and LangChain. A minimal sketch of graph-backed retrieval appears after this list.
How to Evaluate Your RAG System? Retrieval Augmented Generation (RAG) is a powerful technique that enhances output quality by retrieving relevant context from an external vector database. However, building and evaluating a RAG system can be challenging, especially when it comes to measuring performance. In this post, we'll explore the most effective metrics for each stage of your RAG pipeline and how to use them to evaluate your whole system. A sketch of two common retrieval metrics appears after this list.
Anthropic Prompt Library. Although Claude 3 has been widely used, these models use a somewhat different prompting technique. Anthropic has compiled a list of user prompts that are effective for a wide range of assignments and subjects.
Pretraining 16 language models on different tokenizers. One peculiarity of contemporary language modeling is that the tokenizer is trained separately, before the model itself. A second peculiarity is that, at large scales, vocabulary size doesn't appear to matter all that much.
LLM4Decompile. Reverse Engineering: Decompiling Binary Code with Large Language Models
Under The Hood: How OpenAI's Sora Model Works. In this blog post, we dive into some of the technical details behind Sora. We also talk about our current thinking around the implications of these video models. Finally, we discuss our thoughts around the compute used for training models like Sora and present projections for how that training compute compares to inference, which has meaningful indications for estimated future GPU demand.
Quiet-STaR. Quiet-STaR is a reasoning framework that improves language models' ability to produce accurate outputs by having them generate internal rationales for each token. Code has been released along with a model trained to reason eight steps ahead per token.
MoE-Adapters4CL. Continual learning can empower vision-language models to continuously acquire new knowledge, without the need for access to the entire historical dataset. Through extensive experiments across various settings, our proposed method consistently outperforms previous state-of-the-art approaches while concurrently reducing parameter training burdens by 60%.
LlamaGym. Fine-tune LLM agents with online reinforcement learning
Stylized image binning algorithm. This is a tutorial on using a JavaScript binning method to build an image-processing application that produces pixel-art-style output, with customizable interactive web features like sliders. By averaging pixel brightness within bins, the binning technique transforms photos into stylized, pixelated artwork, driven by parameters like bin size and spacing. The approach involves efficiently optimizing looping structures and modifying pixel data on HTML canvas elements. A Python translation of the core idea appears after this list.
TorchTune. TorchTune is a native-PyTorch library for easily authoring, fine-tuning and experimenting with LLMs.
MVFA-AD. Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images

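For the knowledge-graph entry above, here is a minimal sketch of storing and retrieving facts with the official neo4j Python driver, for readers who want the core idea without LangChain. The URI, credentials, schema, and Cypher are illustrative assumptions, not the guide's exact code.

```python
from neo4j import GraphDatabase

# Connection details are placeholders for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_fact(tx, subject: str, relation: str, obj: str):
    # MERGE creates nodes/edges only if they do not already exist.
    tx.run(
        "MERGE (a:Entity {name: $s}) "
        "MERGE (b:Entity {name: $o}) "
        "MERGE (a)-[:REL {type: $r}]->(b)",
        s=subject, r=relation, o=obj,
    )

def neighbors(tx, entity: str) -> list[str]:
    result = tx.run(
        "MATCH (a:Entity {name: $e})-[r:REL]->(b) "
        "RETURN a.name + ' ' + r.type + ' ' + b.name AS fact",
        e=entity,
    )
    return [record["fact"] for record in result]

with driver.session() as session:
    session.execute_write(add_fact, "Neo4j", "is_a", "graph database")
    context = session.execute_read(neighbors, "Neo4j")
    print(context)  # feed these facts into the LLM prompt as grounding context
```
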
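For the RAG evaluation entry above, here is a small sketch of two standard retrieval metrics, recall@k and MRR (mean reciprocal rank). The document ids and relevance judgments are made-up examples.

```python
# `retrieved` is a ranked list of document ids returned for a query;
# `relevant` is the set of ids judged relevant for that query.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank   # reciprocal rank of the first relevant hit
    return 0.0

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, k=3))  # 0.5: only d1 is in the top-3
print(mrr(retrieved, relevant))               # 0.5: first hit at rank 2
```
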
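For the binning tutorial above (which uses JavaScript and an HTML canvas), here is a hedged Python translation of the core averaging idea; `bin_size` and `spacing` correspond to the tunable parameters the tutorial mentions.

```python
import numpy as np
from PIL import Image

def binned(img: Image.Image, bin_size: int = 12, spacing: int = 2) -> Image.Image:
    arr = np.asarray(img.convert("RGB"), dtype=np.float32)
    h, w, _ = arr.shape
    out = np.zeros_like(arr)            # unfilled gaps stay black
    step = bin_size + spacing
    for y in range(0, h - bin_size + 1, step):
        for x in range(0, w - bin_size + 1, step):
            block = arr[y:y + bin_size, x:x + bin_size]
            # Paint each bin with its average color for the pixel-art effect.
            out[y:y + bin_size, x:x + bin_size] = block.mean(axis=(0, 1))
    return Image.fromarray(out.astype(np.uint8))

if __name__ == "__main__":
    # File paths are placeholders.
    binned(Image.open("input.jpg")).save("pixelated.png")
```
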
Perspectives

Link description
What I learned from looking at 900 most popular open source AI tools. Examining the GitHub stars of well-known AI repositories uncovers some fascinating patterns. The majority of open-source AI tools appear to be geared toward apps and infrastructure.
LLM inference speed of light. This article explores the theoretical "speed of light" limit for transformer-based language model inference and emphasizes the significance of memory bandwidth over computational power: the primary constraint on inference speed is how fast data can be read from memory, not how fast calculations can be performed, which makes bandwidth the key quantity for optimizing and understanding performance. A back-of-the-envelope calculation appears after this list.
AI is bad/good actually. This article's author suggests eschewing the nebulous good/bad continuum and instead using terms like "harmful," "helpful," "capable," and "incapable" to frame AI conversations. For them, AI is capable yet potentially harmful because of unresolved problems like bias amplification and copyright infringement. Using these more precise terms, the author asks readers to articulate their own opinions on AI.
Captain's log: the irreducible weirdness of prompting AIs. A wealth of free AI and machine learning tools can be found on the new companion website, More Useful Things. These resources highlight the amusing and useful ways in which AI-generated prompts, such as creative scenarios, can surpass human-crafted ones in tasks like solving mathematical puzzles. For more consistent prompting outcomes, the experiment emphasizes the value of adding context, few-shot learning, and chain-of-thought strategies. Though organized prompting is still an evolving art with considerable potential benefits, prompting as a talent may become less important as AI models advance and get better at inferring user intent.
AI Prompt Engineering Is Dead, Long live AI prompt engineering. According to recent studies, as AI and machine learning models get better at optimizing their own prompts, human prompt engineers might become outdated. Prompts produced by algorithms can be strange but powerful; they exceed those created by humans and significantly cut down on optimization time. Despite the potential of automatically adjusted prompts, experts predict that the need for occupations related to prompts will change rather than vanish, maybe taking the form of new positions like LLMOps (Large Language Model Operations).
The Road to Biology 2.0 Will Pass Through Black-Box Data. This year marks perhaps the zenith of expectations for AI-based breakthroughs in biology, transforming it into an engineering discipline that is programmable, predictable, and replicable. Drawing insights from AI breakthroughs in perception, natural language, and protein structure prediction, we endeavor to pinpoint the characteristics of biological problems that are most conducive to being solved by AI techniques. Subsequently, we delineate three conceptual generations of bio AI approaches in the biotech industry and contend that the most significant future breakthrough will arise from the transition away from traditional “white-box” data, understandable by humans, to novel high-throughput, low-cost AI-specific “black-box” data modalities developed in tandem with appropriate computational methods.
"AI, no ads please": 4 words to wipe out $1tn. AI poses a huge threat to ad-based platforms by slashing how many ads we see
OpenAI’s “Own Goal”. And why it is becoming increasingly difficult to take them at their word
What if it isn't happening, AGI is not coming? No matter what appears to be happening, we always have to consider what if it isn't. What If LLMs fail to turn into AGIs? Has our quest for intelligence simply unveiled our demonstrable lack thereof? Will trillions of dollars turn unpredictable hallucination machines into reliable universal productivity tools that can do anything?
How OpenAI’s text-to-video tool Sora could change science – and society. OpenAI’s debut of its impressive Sora text-to-video tool has raised important questions.
Chatbot AI makes racist judgements on the basis of dialect. Some large language models harbor hidden biases that cannot be removed using standard methods.
Could AI-designed proteins be weaponized? Scientists lay out safety guidelines. AI tools that can come up with protein structures at the push of a button should be used safely and ethically, say researchers in the field.
Three reasons why AI doesn’t model human language. Artificial intelligence (AI) is being used to develop large language models (LLMs) with considerable success. But they should not be seen as being models of how human language works and is acquired.
So … you’ve been hacked. Research institutions are under siege from cybercriminals and other digital assailants. How do you make sure you don’t let them in?
8 Google Employees Invented Modern AI. Here’s the Inside Story. They met by chance, got hooked on an idea, and wrote the “Transformers” paper—the most consequential tech breakthrough in recent history.
Using LLMs to Generate Fuzz Generators. Claude and other LLMs are capable of producing effective fuzzers for code parsing, automating a task that has historically required a great deal of human labor. Since fuzzing is stochastic, LLMs seem to be a good fit for generating fuzzers, even though they are usually not precise enough for static analysis. To find and exploit code vulnerabilities, a hybrid approach that combines targeted fuzzing and LLM-driven static analysis may be promising.
First Impressions of Early-Access GPT-4 Fine-Tuning. A few weeks ago we finally got access to the GPT-4 fine-tuning API (in limited early access), and were super excited to check out how well it works. We’d been a user of OpenAI’s fine-tuned models since fine-tuning the original GPT-3 Davinci model first became available.
AI and the Future of Work. High Mensa exam scores for Anthropic's most recent AI, Claude, indicate that self-improving AI is not far off and presents both prospects and existential concerns. As seen at Klarna, where a customer support AI replaced 700 workers, machine learning is already eliminating jobs. This suggests that automation is becoming more and more common. Recent layoffs at Duolingo as a result of AI's translation capabilities highlight this change and the increasing influence of AI on the nature of work in the future.
Two years later, deep learning is still faced with the same fundamental challenges. Gary Marcus revisits his forecasts two years after writing a pessimistic AI paper, and he maintains his original mistrust. Even with breakthroughs like GPT-4, basic problems like true understanding and reliable AI are still unsolved. Marcus draws the conclusion that multidisciplinary cooperation is essential to achieving AGI and that increasing data and processing capacity alone won't be enough.
From 0 to 10 million users in four years. In just four years, the AI-powered writing tool Copy.ai has amassed an amazing 10 million users.

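A back-of-the-envelope version of the "speed of light" argument from the inference entry above: at batch size 1, every generated token must stream all of the weights from memory once, so bandwidth alone bounds throughput. The hardware numbers below are rough assumptions, and KV-cache traffic is ignored.

```python
# tokens/sec <= memory bandwidth / bytes read per token
params = 7e9            # e.g. a 7B-parameter model
bytes_per_param = 2     # fp16/bf16 weights
bandwidth = 1.0e12      # ~1 TB/s, roughly an A100-class GPU (assumption)

model_bytes = params * bytes_per_param          # 14 GB of weights
max_tokens_per_sec = bandwidth / model_bytes
print(f"{max_tokens_per_sec:.0f} tokens/s upper bound")  # ~71 tokens/s
```

No amount of extra compute raises this ceiling; only smaller weights (quantization), batching, or faster memory do.
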
meme-of-the-week

Back to index

ML news: Week 11 - 17 March

Research

Link description
Yi: Open Foundation Models by 01.AI. The Yi models have long been among the most capable open language models. The team has published a paper that offers significant new insight into how they collect data and train their models.
From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models. This research uses translation to extend safety measures to languages for which direct data is not available, taking on the task of minimizing harmful content in AI across many languages.
Plum: Prompt Learning using Metaheuristic. This research presents metaheuristics, a broad class of more than 100 discrete optimization techniques, as a powerful tool for enhancing prompt learning in large language models.
ViewFusion: Towards Multi-View Consistency via Interpolated Denoising. A new technique called ViewFusion aims to improve how diffusion models generate images from novel viewpoints while maintaining consistency across views.
Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap. reveals that there is a reasoning gap between the current models and the proposed functional benchmarks for evaluating the reasoning abilities of LLMs, ranging from 58.35% to 80.31%. However, the authors also note that these gaps can be closed with more advanced prompting techniques.
Can Large Language Models Reason and Plan? The subject of reasoning and planning for LLMs is covered in a recent position paper. The following is an overview of the author's findings: "In summary, I don't have any strong evidence from anything I've read, checked, or done to suggest that LLMs engage in typical reasoning or planning. Instead, they use web-scale training to perform a type of universal approximate retrieval, which is sometimes confused for reasoning abilities, as I have explained."
KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents. we introduce KnowAgent, a novel approach designed to enhance the planning capabilities of LLMs by incorporating explicit action knowledge. Specifically, KnowAgent employs an action knowledge base and a knowledgeable self-learning strategy to constrain the action path during planning, enabling more reasonable trajectory synthesis, and thereby enhancing the planning performance of language agents.
Stealing Stable Diffusion Prior for Robust Monocular Depth Estimation. The new Stealing Stable Diffusion (SSD) method improves monocular depth estimation performance in challenging settings such as low-light or rainy conditions.
VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models. Using the strengths of text-to-image models, VideoElevator presents a plug-and-play method that improves text-to-video diffusion models. By dividing the enhancement process into two parts, refining temporal motion and improving spatial quality, it produces videos with better frame quality and text alignment.
Face2Diffusion for Fast and Editable Face Personalization. Face2Diffusion is an approach for fast and editable face personalization with diffusion models, aiming to insert a given face into generated images while preserving identity and keeping the outputs editable.
Stealing Part of a Production Language Model. By querying their public APIs, you can recover parts of closed language models, such as the embedding projection layer. A modest budget of less than $2,000 suffices.
Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling. A DNA sequence model built on Mamba, the Transformer rival. For a small model, it is remarkably powerful and efficient.
V3D: Video Diffusion Models are Effective 3D Generators. In order to improve 3D object production, this research presents a revolutionary method that creates detailed, high-quality objects from a single photograph.
A generalist AI agent for 3D virtual environments. We present new research on a Scalable Instructable Multiworld Agent (SIMA) that can follow natural-language instructions to carry out tasks in a variety of video game settings
SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces. By concentrating on linear memory consumption, this study overcomes the memory limitations of conventional attention-based diffusion models and presents a novel method for producing videos using state-space models (SSMs). As tested with the UCF101 and MineRL Navigate datasets, SSMs allow the generation of lengthier video sequences with competitive quality.
SemCity: Semantic Scene Generation with Triplane Diffusion. SemCity transforms 3D scene production by emphasizing real-world outdoor environments—a problem that is sometimes disregarded because of how difficult and sparse outdoor data may be.
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM. This study demonstrates how to train several models and combine them into a single Mixture-of-Experts model.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. Evaluating language models trained to code is difficult. Most evaluations use OpenAI's HumanEval, but some open models appear to overfit this benchmark. LiveCodeBench measures coding performance while reducing contamination issues.
Evil Geniuses: Delving into the Safety of LLM-based Agents. 'Evil Geniuses' is a virtual squad that researchers utilized in a recent study to examine the safety of LLMs. They discovered that these AI agents are less resistant to malevolent attacks, give more nuanced answers, and make it more difficult to identify improper responses.
ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions. In this work, a novel backbone architecture called ViT-CoMer is presented, which improves on Vision Transformers (ViT) for dense prediction tasks without requiring pre-training.
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training. Apple just released a multimodal model and discussed in detail how they trained it.

News

Link description
OpenAI announces new members to board of directors. Dr. Sue Desmond-Hellmann, Nicole Seligman, Fidji Simo join; Sam Altman rejoins board
So long and thanks for all the pixels: Nvidia reportedly retiring the GTX brand for good. Nvidia has stopped producing GPUs based on its Turing architecture. The last of them included the likes of the GTX 1660, 1650, and 1630 series of GPUs. Once remaining stocks sell, they'll be gone and with them, the "GTX" brand itself, leaving all Nvidia gaming graphics cards as "RTX" models.
Google’s upcoming Tensor G4 Chip set to rival Snapdragon 8 Gen 4 and Apple A18 Pro. Let’s say you’re a smartphone manufacturer aiming to develop a new model. You have two options: partner with an established chipmaker like Qualcomm or MediaTek or follow the path of Apple by designing your own custom chipset. Google has taken a similar approach, developing its in-house Tensor processors. Recent information suggests the Pixel 9 will feature the Tensor G4 chipset, promising improved heat and power management for an enhanced user experience.
Microsoft may debut its first 'AI PCs' later this month. A report suggests an OLED Surface Pro 10 and Surface Laptop 6 are imminent.
Looks like we may now know which OpenAI execs flagged concerns about Sam Altman before his ouster. Two OpenAI execs raised concerns about Sam Altman before his ouster, The New York Times reported. The outlet reported that the company's chief technology officer, Mira Murati, played a key role. Altman returned as CEO in days, leaving many unanswered questions about what happened.
Cloudflare announces Firewall for AI. Today, Cloudflare is announcing the development of a Firewall for AI, a protection layer that can be deployed in front of Large Language Models (LLMs) to identify abuses before they reach the models.
Google announces they are tackling spammy, low-quality content on Search. We’re making algorithmic enhancements to our core ranking systems to ensure we surface the most helpful information on the web and reduce unoriginal content in search results. We’re updating our spam policies to keep the lowest-quality content out of Search, like expired websites repurposed as spam repositories by new owners and obituary spam.
This week, xAI will open-source Grok. Official tweet of Elon Musk
Covariant is building ChatGPT for robots. The UC Berkeley spinout says its new AI platform can help robots think more like people. Covariant this week announced the launch of RFM-1 (Robotics Foundation Model 1).
AI solves huge problem holding back fusion power. Princeton researchers have trained an AI to predict and prevent a common problem arising during nuclear fusion reactions — and they think it might be able to solve other problems, too.
Midjourney bans all Stability AI employees over alleged data scraping. Midjourney blamed a near 24-hour service outage on ‘botnet-like activity’ from two accounts linked to the Stable Diffusion creator.
Microsoft compares The New York Times’ claims against OpenAI to Hollywood’s early fight against VCR. Microsoft is helping OpenAI fight back against claims of copyright infringement by The New York Times. The news outlet’s lawsuit, filed in December, seeks to hold Microsoft and OpenAI accountable for billions of dollars in damages. In a court filing on Monday, Microsoft accuses the publisher of “unsubstantiated” claims that the use of OpenAI’s technology is harming its business.
Introducing Devin, the first AI software engineer. Devin, a new system from Cognition, receives a 14% on the difficult SWE-Bench benchmark, which evaluates AI's capacity for writing code. GPT-4 received a 1.7% score. This model demonstrates excellent contextual learning skills.
Building Meta’s GenAI Infrastructure. The Llama 3 training infrastructure is described in this Meta blog article. It covers networking, storage, Pytorch, NCCL, and many enhancements. This will prepare the way for Meta's H100s to go online throughout the course of the remaining months of this year.
Physical Intelligence Raises $70M to Build AI-Powered Robots for Any Application. Pi differentiates itself by aiming to create software that can be applied across a wide range of robotics hardware.
Researchers create AI worms that can spread from one system to another. Worms could potentially steal data and deploy malware. Now, in a demonstration of the risks of connected, autonomous AI ecosystems, a group of researchers has created one of what they claim is the first generative AI worms—which can spread from one system to another, potentially stealing data or deploying malware in the process.
Perplexity brings Yelp data to its chatbot. Perplexity’s responses can source multiple Yelp reviews for that cafe you were considering, along with location data and other information.
Gemini now lets you tune and modify responses with a prompt. Google is launching “a more precise way for you to tune Gemini’s responses” on the web app. When selecting (by highlighting) a part of Gemini’s response to your prompt, a pencil/sparkle icon appears to “Modify selected text.” This opens a box with Regenerate, Shorter, Longer, and Remove options, as well as an open text field.
Microsoft’s neural voice tool for people with speech disabilities arrives later this year. At the Microsoft Ability summit today, the company is continuing to raise awareness about inclusive design.
Together AI $106M round of funding. we’ve raised $106M in a new round of financing led by Salesforce Ventures with participation from Coatue, and existing investors.
Autonomous Vehicle Startup Applied Intuition Hits $6B Valuation After $250M Series E. Autonomous vehicle software developer Applied Intuition locked up a $250 million Series E valuing the company at $6 billion, a 67% uptick in value from its previous round. The deal comes even as venture funding for autonomous vehicle-related startups has been in decline in recent years.
OpenAI CTO Says It’s Releasing Sora This Year. But now, OpenAI chief technology officer Mira Murati told the Wall Street Journal that the company will publicly release Sora "later this year."
Google now wants to limit the AI-powered search spam it helped create. Ranking update targets sites "created for search engines instead of people."
OpenAI Partners With Le Monde And Prisa Media. We have partnered with international news organizations Le Monde and Prisa Media to bring French and Spanish news content to ChatGPT.
World’s first major act to regulate AI passed by European lawmakers. The European Union’s parliament on Wednesday approved the world’s first major set of regulatory ground rules to govern the mediatized artificial intelligence at the forefront of tech investment.
Figure 01 can now have full conversations with people. Figure's robots can now hold in-depth discussions with humans thanks to the integration of OpenAI's technology. While Figure's neural networks provide quick, low-level dexterous robot operations, OpenAI's models offer high-level visual and linguistic intelligence. This X post includes a video of a human conversing with a Figure robot, teaching it how to complete tasks, explaining the rationale behind the tasks, and providing a self-evaluation of the activities' effectiveness.
Claude 3 Is The Most Human AI Yet. Claude 3, Anthropic's latest AI model, is distinguished by its "warmth," which makes it a reliable collaborator on creative writing assignments. Claude 3 is said to feel more human and lifelike, pairing delightful deep contemplation with sound reasoning. Though technological benchmarks have not fully captured this subtlety, Claude 3 may transform our relationship with AI in creative processes.
From Wait Times to Real-Time: Assort Health Secures $3.5 Million to Scale First Generative AI for Healthcare Call Centers. Solution Erases Long Phone Holds for Patients, Supports Overwhelmed Medical Front Desk Workers and Improves Patient Access to Physicians

Resources

Link description
DeepSpeed-FP6: The Power of FP6-Centric Serving for Large Language Models. A recent upgrade to Microsoft's robust DeepSpeed training package lets models run with just six bits per parameter. This can speed up inference by a factor of more than two. A quick memory calculation appears after this list.
You can now train a 70b language model at home. a fully open source system that, for the first time, can efficiently train a 70b large language model on a regular desktop computer with two or more standard gaming GPUs (RTX 3090 or 4090). This system, which combines FSDP and QLoRA, is the result of a collaboration between Answer.AI, Tim Dettmers (U Washington), and Hugging Face's Titus von Koeller and Sourab Mangrulkar. A sketch of the QLoRA half of the recipe appears after this list.
Retrieval-Augmented Generation for AI-Generated Content: A Survey. gives a summary of RAG's application in several generative settings, such as code, images, and audio, and includes a taxonomy of RAG enhancements along with citations to key works.
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models. Based on public technical reports and reverse engineering, this paper presents a comprehensive review of the model's background, related technologies, applications, remaining challenges, and future directions of text-to-video AI models.
SaulLM-7B: A pioneering Large Language Model for Law. With 7 billion parameters, SaulLM-7B is the first LLM designed explicitly for legal text comprehension and generation. Leveraging the Mistral 7B architecture as its foundation, SaulLM-7B is trained on an English legal corpus of over 30 billion tokens.
A Practical Guide to RAG Pipeline Evaluation (Part 1: Retrieval). Retrieval is a critical and complex subsystem of RAG pipelines. After all, the LLM output is only as good as the information you provide it, unless your app relies solely on the LLM's training data. The core of measuring retrieval is assessing whether each retrieved result is relevant for a given query.
C4AI Command-R. C4AI Command-R is a research release of a 35 billion parameter highly performant generative model. Command-R is a large language model with open weights optimized for a variety of use cases including reasoning, summarization, and question-answering. Command-R has the capability for multilingual generation evaluated in 10 languages and highly performant RAG capabilities.
Artificial Intelligence Controller Interface (AICI). The Artificial Intelligence Controller Interface (AICI) lets you build Controllers that constrain and direct the output of a Large Language Model (LLM) in real-time. Controllers are flexible programs capable of implementing constrained decoding, dynamic editing of prompts and generated text, and coordinating execution across multiple, parallel generations. A concept sketch of constrained decoding appears after this list.
US Public Domain Books (English). This dataset contains more than 650,000 English books (~ 61 billion words) presumed to be in the public domain in the US which were digitized by the Internet Archive and cataloged as part of the Open Library project.
transformer-debugger. Transformer Debugger (TDB) is a tool developed by OpenAI's Superalignment team with the goal of supporting investigations into specific behaviors of small language models. The tool combines automated interpretability techniques with sparse autoencoders.
VideoMamba. VideoMamba is a technology that effectively manages global dependencies and local redundancy to tackle the challenges of video interpretation.
FastV. FastV is a plug-and-play inference acceleration method for large vision language models relying on visual tokens. It could reach a 45% theoretical FLOP reduction without harming the performance through pruning redundant visual tokens in deep layers.
Maximizing training throughput using PyTorch FSDP. Teams from IBM and Meta have together achieved 57% MFU (model FLOPs utilization) by training powerful models in parallel on large A100 and H100 clusters.
MoAI. MoAI is a new large language and vision model that integrates auxiliary visual data from specific computer vision tasks to improve upon existing models.
superopenai: logging and caching superpowers for the openai sdk. superopenai is a minimal convenience library for logging and caching LLM requests and responses for visibility and rapid iteration during development.
TripoSR. TripoSR, a state-of-the-art open-source model for fast feedforward 3D reconstruction from a single image, collaboratively developed by Tripo AI and Stability AI.
Exploring Alternative UX Patterns for GenAI Interfaces. In the rapidly evolving landscape of GenAI interfaces, it is crucial to venture beyond the established norms. The current dominance of Quick Actions and Multi-Turn engagement patterns in these interfaces, while effective in many cases, should not limit our imagination or hinder the potential for innovation.
rerankers. Rerankers are an important part of any retrieval architecture, but they're also often more obscure than other parts of the pipeline. rerankers seeks to address this problem by providing a simple API for all popular rerankers, no matter the architecture. A sketch of what a reranking stage does appears after this list.
skyvern. Skyvern automates browser-based workflows using LLMs and computer vision. It provides a simple API endpoint to fully automate manual workflows, replacing brittle or unreliable automation solutions.
Licensing AI Means Licensing the Whole Economy. Because artificial intelligence is a process that is essential to many different economic uses, it is not possible to regulate it like a physical thing.
Enhancing RAG-based application accuracy by constructing and leveraging knowledge graphs. A practical guide to constructing and retrieving information from knowledge graphs in RAG applications with Neo4j and LangChain
pricing sheet with all popular token-based pricing providers and the top-performing models. Pricing and comparison between different LLMs.

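For the DeepSpeed-FP6 entry above, here is quick arithmetic on why six-bit weights speed up memory-bound serving; the parameter count is an illustrative example.

```python
# Weight bytes that must stream through memory per generated token.
params = 70e9
fp16_gb = params * 16 / 8 / 1e9   # 140.0 GB of fp16 weights
fp6_gb = params * 6 / 8 / 1e9     # 52.5 GB at six bits per parameter
print(f"{fp16_gb:.1f} GB vs {fp6_gb:.1f} GB "
      f"({fp16_gb / fp6_gb:.2f}x less weight traffic)")  # ~2.67x
```

Since decoding is typically bandwidth-bound, moving roughly 2.7x fewer weight bytes per token is consistent with the "more than 2x" inference speedup claimed.
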
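For the 70b-at-home entry above, here is a hedged sketch of the QLoRA half of the recipe using the Hugging Face transformers and peft libraries: load the base model in 4-bit NF4 and train low-rank adapters on top. The model name and hyperparameters are illustrative, and the FSDP sharding across two GPUs requires additional launcher configuration not shown here.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the frozen base weights to 4-bit NF4; compute runs in bf16.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",   # illustrative; any causal LM works
    quantization_config=bnb,
    device_map="auto",
)

# Attach small trainable low-rank adapters to the attention projections.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapters are trainable
```
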
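For the AICI entry above, here is a concept-level sketch of constrained decoding, the kind of control an AICI controller applies at each step; this is not AICI's actual API, just logit masking with a fixed allowlist.

```python
import torch

# A controller masks the logits at each step so only tokens permitted
# by some rule can be sampled. Here the "rule" is a fixed allowlist.
def constrained_step(logits: torch.Tensor, allowed_ids: list[int]) -> int:
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_ids] = 0.0                 # keep only allowed tokens
    probs = torch.softmax(logits + mask, dim=-1)
    return int(torch.multinomial(probs, 1))

vocab = 50_000
logits = torch.randn(vocab)
digits = list(range(10))                    # pretend ids 0-9 encode "0"-"9"
print(constrained_step(logits, digits))     # always emits a digit id
```

A real controller recomputes the allowed set every step (e.g. from a grammar or JSON schema), which is what makes dynamic constrained decoding possible.
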
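For the rerankers entry above, here is a sketch of what a reranking stage does, written directly against a sentence-transformers cross-encoder (one of the backends such libraries wrap). The checkpoint name is a common public model, used here illustratively.

```python
from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, document) pair jointly, which is
# slower than embedding search but much more accurate for the final ordering.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how do I reset my password?"
docs = [
    "To reset your password, open Settings and choose 'Security'.",
    "Our office is closed on public holidays.",
    "Password resets require access to your registered email.",
]

scores = model.predict([(query, d) for d in docs])
reranked = [d for _, d in sorted(zip(scores, docs), reverse=True)]
print(reranked[0])   # the most relevant document comes first
```
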
Perspectives

Link description
Winning Strategies for Applied AI Companies. Key Success Factors after reviewing over 70 companies that have raised at least $7M
AI startups require new strategies: This time it’s actually different. The typical dynamics between startups and incumbents do not apply in AI as they did in previous technology revolutions like mobile and the Internet. Ignore this at your peril.
The GPT-4 barrier has finally been broken. Four weeks ago, GPT-4 remained the undisputed champion: consistently at the top of every key benchmark, but more importantly the clear winner in terms of “vibes”. Today that barrier has finally been smashed. We have four new models, all released to the public in the last four weeks, that are benchmarking near or even above GPT-4.
Embrace AI to break down barriers in publishing for people who aren’t fluent in English. E. M. Wolkovich describes having a paper rejected because of an unfounded accusation that ChatGPT was used to write it. We think that both the rejection and the bias against the use of artificial intelligence (AI) in scientific writing are misguided.
Why scientists trust AI too much — and what to do about it. Some researchers see superhuman qualities in artificial intelligence. All scientists need to be alert to the risks this creates.
The Future of Poetry. Questions about whether poems were authored by humans or artificial intelligence (AI) were given to 38 AI experts and 39 English experts. First prize went to The Human, followed by Bard, ChatGPT-4, and Claude in that order, for both writing quality and the ability to deceive respondents into thinking that the poetry was written by a human. The fact that English specialists were far better at identifying which poems were composed by AI suggests that they should be involved more in the development of upcoming AI systems.
Barack Obama on AI, free speech, and the future of the internet. The former president joined me on Decoder to discuss AI regulation, the First Amendment, and of course, what apps he has on his home screen.
Top AIs still fail IQ tests - When asked to read image-based questions. According to recent testing, sophisticated AI models such as ChatGPT-4 and Google's "Gemini Advanced" do poorly on visual IQ tests, receiving lower-than-average scores. Although ChatGPT-4 exhibits mediocre pattern recognition abilities, it misidentifies objects visually and makes logical mistakes, indicating a considerable difference in comparison to human intellect. These results suggest that the development of universally intelligent AI systems may still be some way off.
The Top 100 Gen AI Consumer Apps. Over 40% of the top web products are new, having entered the top 50 in the last six months, according to Andreessen Horowitz's most recent consumer analysis on the top 100 Gen AI consumer apps.
This Nvidia Cofounder Could Have Been Worth $70 Billion. Instead, He Lives Off The Grid. If Curtis Priem, Nvidia’s first CTO, had held onto all his stock, he’d be the 16th richest person in America. Instead, he sold out years ago and gave most of his fortune to his alma mater Rensselaer Polytechnic Institute.
How to thrive in a crowded enterprise AI market. At a Lightspeed event, Arvind Jain, CEO of Glean, spoke on the difficulties and solutions facing corporate AI startups. He emphasized the need to provide genuine business value, being tenacious in hiring, and placing a higher priority on product quality than speed and cost. Jain also emphasized how privacy and security issues have slowed down the deployment of generative AI tools in businesses. Glean wants to become a widely used workplace AI platform that completely transforms how people work by becoming firmly integrated into organizational operations.
As AI tools get smarter, they’re growing more covertly racist, experts find. ChatGPT and Gemini discriminate against those who speak African American Vernacular English, report shows

meme-of-the-week

Back to index

ML news: Week 4 - 10 March

Research

Link description
HyperAttention: Long-context Attention in Near-Linear Time. HyperAttention approximates attention in near-linear time, making very long contexts tractable. It is widely speculated, though unconfirmed, to be an ingredient behind Gemini's 1 million+ token context window.
Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning. This study attempts a theoretical explanation for the success of muP hyperparameter transfer. The authors argue that the largest eigenvalue of the training loss Hessian is independent of the network's width and depth.
WebArena: A Realistic Web Environment for Building Autonomous Agents. The community is excited by the potential of agents to handle a range of digital tasks, yet even the most advanced general-purpose models struggle on jobs that people complete more than 70% of the time. It is becoming evident that these tasks may require purpose-trained models.
Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models. Latent space smoothness in text-to-image diffusion models is a problem that is addressed by a novel method called Smooth Diffusion. With this technique, even little changes in input will result in a steady and progressive alteration of the visuals.
Rethinking Inductive Biases for Surface Normal Estimation. A technique called DSINE significantly enhances monocular surface normal estimation, which finds use in various computer graphics fields.
CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition. CricaVPR presents a method that exploits the correlations between multiple images, even ones taken under different conditions, to improve visual place recognition.
Empowering Large Language Model Agents through Action Learning. investigates open-action learning for language agents using an iterative learning strategy that uses Python functions to create and improve actions; on each iteration, the proposed framework (LearnAct) modifies and updates available actions based on execution feedback, expanding the action space and improving action effectiveness; the LearnAct framework was tested on Robotic planning and AlfWorld environments, showing 32% improvement in agent performance in AlfWorld when compared to ReAct+Reflexion.
PlanGPT: Enhancing Urban Planning with Tailored Language Model and Efficient Retrieval. demonstrates how to use LLMs to integrate several approaches, such as retrieval augmentation, fine-tuning, tool utilization, and more; while the suggested framework is used in the context of urban and spatial planning, many of the insights and useful advice are applicable to other fields as well.
Evo: Long-context modeling from molecular to genome scale. Introducing Evo, a long-context biological foundation model based on the StripedHyena architecture that generalizes across the fundamental languages of biology: DNA, RNA, and proteins. Evo is capable of both prediction tasks and generative design, from molecular to whole genome scale (over 650k tokens in length). Evo is trained at a nucleotide (byte) resolution, on a large corpus of prokaryotic genomic sequences covering 2.7 million whole genomes.
Resonance RoPE: Improving Context Length Generalization of Large Language Models. To assist LLMs in comprehending and producing text in longer sequences than they were first trained on, researchers have created a new method dubbed Resonance RoPE. By using less processing power, our approach outperforms the current Rotary Position Embedding (RoPE) technique and improves model performance on lengthy texts.
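For background, here is a minimal sketch of the standard RoPE that Resonance RoPE refines. It uses the half-split pairing convention, one of several equivalent layouts; treat it as an illustration rather than the paper's implementation.

```python
import torch

def rope(x, base=10000.0):
    """Apply standard Rotary Position Embedding to x of shape (seq_len, dim).

    Each channel pair (x1_i, x2_i) is rotated by angle pos * theta_i, where
    theta_i = base^(-2i/dim), so relative position becomes a phase difference
    in the attention dot product."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) * 2 / dim)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(128, 64)  # (positions, head_dim)
q_rot = rope(q)           # position information is now encoded
```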
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World. The All-Seeing Project V2 introduces the ASMv2 model, which blends text generation, object localization, and understanding the connections between objects in images.
GPQA: A Graduate-Level Google-Proof Q&A Benchmark. A formidable task is offered by a new dataset named GPQA, which has 448 difficult multiple-choice questions covering physics, chemistry, and biology. Even domain specialists struggle, scoring only about 65% accuracy, while non-experts reach just 34%. Advanced AI systems such as GPT-4 achieve only 39% accuracy. The goal of this dataset is to support techniques for monitoring AI performance on challenging scientific problems.
SURE: SUrvey REcipes for building reliable and robust deep networks. SURE is a revolutionary strategy that integrates multiple approaches to increase the accuracy of deep neural network uncertainty predictions, particularly for image classification applications.
Stable Diffusion 3: Research Paper. Stable Diffusion 3 outperforms state-of-the-art text-to-image generation systems such as DALL·E 3, Midjourney v6, and Ideogram v1 in typography and prompt adherence, based on human preference evaluations. Our new Multimodal Diffusion Transformer (MMDiT) architecture uses separate sets of weights for image and language representations, which improves text understanding and spelling capabilities compared to previous versions of SD3.
Researchy Questions: A Dataset of Multi-Perspective, Decompositional Questions for LLM Web Agents. These days, language models are quite good at responding to queries. As a result, the majority of benchmarks in use today are saturated. 'Researchy' questions are a new breed of open-ended questions that call for several steps to complete. The source of this specific dataset is search engine queries. It includes instances where GPT-4 had trouble responding to questions.
UniCtrl: Improving the Spatiotemporal Consistency of Text-to-Video Diffusion Models via Training-Free Unified Attention Control. A novel method for improving motion quality and semantic coherence in films produced by text-to-video models is presented by UniCtrl. Employing motion injection and cross-frame self-attention approaches enhances video coherence and realism without requiring further training.
VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT. With natural language queries, VTG-GPT provides a revolutionary GPT-based technique that can precisely identify particular video segments without the need for fine-tuning or training.
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training. With the same performance as OpenAI's original CLIP model, MobileCLIP runs seven times faster. It can be used for a variety of language and vision tasks on-device.
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures. Vision-RWKV provides an effective solution for high-resolution image processing by modifying the RWKV architecture from NLP for use in vision challenges.
Design2Code: How Far Are We From Automating Front-End Engineering? Turning pictures of a design into code is hard. This study proposes an 18B model as a baseline, and its evaluations imply that we are nearly there for simple designs; GPT-4V-generated code is sometimes preferred over human-written code.
MathScale: Scaling Instruction Tuning for Mathematical Reasoning. Researchers generated two million synthetic math problems. After training a 7B model on this data, they found it performed well compared with the most advanced large language models.
Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos. The KEPP system offers a fresh method for organizing and carrying out difficult jobs. The approach, which makes use of a probabilistic knowledge network, enables the model to arrange activities logically to accomplish a goal.
KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents. KnowAgent presents an innovative method for enhancing the planning abilities of large language models through the incorporation of explicit action knowledge. The method guides LLMs along more rational planning trajectories, which improves their performance on challenging tasks.
tinyBenchmarks: evaluating LLMs with fewer examples. In this paper, we investigate strategies to reduce the number of evaluations needed to assess the performance of an LLM on several key benchmarks. This work shows that you can reliably evaluate language model performance with as few as 100 examples from popular benchmarks.
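The paper selects its small evaluation subsets carefully (they are not random draws), but even naive subsampling illustrates why roughly 100 examples can give a usable estimate. A toy sketch, with `examples` and `predict` as placeholders for a benchmark and a model:

```python
import math
import random

def evaluate_on_subset(examples, predict, n=100, seed=0):
    """Estimate accuracy from a small random subset of a benchmark.

    `examples` is a list of (question, answer) pairs; `predict` is any
    model callable. Both are placeholders for your own benchmark/model."""
    subset = random.Random(seed).sample(examples, n)
    acc = sum(predict(q) == a for q, a in subset) / n
    # 95% normal-approximation confidence interval on the estimate
    err = 1.96 * math.sqrt(acc * (1 - acc) / n)
    return acc, err
```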
3D Diffusion Policy. DP3 presents a novel method for imitation learning that effectively teaches robots difficult abilities by fusing diffusion strategies with 3D visual data.
Co-LLM: Learning to Decode Collaboratively with Multiple Language Models. Using an innovative approach, multiple large language models can collaborate by taking turns producing text token by token. This strategy lets models apply their distinct strengths and areas of expertise to a variety of tasks, including following instructions, answering domain-specific questions, and solving reasoning problems.

News

Link description
AI-generated images of Trump with Black voters being spread by supporters. No evidence to tie fake images, including one created by Florida radio host, to Trump campaign, BBC Panorama investigation finds
Elon Musk sues OpenAI over AI threat. OpenAI is not so open now, Musk claims, arguing that the company's artificial general intelligence technology is being developed as closed source under Microsoft.
OpenAI wants to make a walking, talking humanoid robot smarter. Figure’s founder Brett Adcock says a new partnership with OpenAI could help its robots hold conversations and learn from their mistakes over time.
MagicLab’s humanoid can toast marshmallows, fold clothes, and dance. Miniature high-torque servo actuators combined with sensitive multi-dimensional pressure sensors enabled the team to create an exceptionally dexterous hand for its MagicBot humanoid.
Amazon to spend $1 billion on startups that combine AI with robots. Amazon’s $1 billion industrial innovation fund is to step up investments in companies that combine artificial intelligence and robotics, as the e-commerce giant seeks to drive efficiencies across its logistics network.
Claude 3 released. Anthropic has trained three new Claude 3 family models, the best of which surpasses the benchmark scores GPT-4 has publicly reported. It is a multimodal model that excels at visual tasks. Significantly, this release brings a major improvement in Claude's coding skills.
ChatGPT can read its answers out loud. OpenAI’s new Read Aloud feature for ChatGPT could come in handy when users are on the go by reading its responses in one of five voice options out loud to users. It is now available on both the web version of ChatGPT and the iOS and Android ChatGPT apps.
Adobe reveals a GenAI tool for music. Adobe unveiled Project Music GenAI Control, a platform that can generate audio from text descriptions (e.g. “happy dance,” “sad jazz”) or a reference melody and let users customize the results within the same workflow.
OpenAI fires back at Elon Musk in legal fight over breach of contract claims. ChatGPT maker releases emails in support of claim businessman backed plan to create for-profit unit
OpenAI and Elon Musk. In response to Elon Musk's complaint, OpenAI provided screenshots of emails between Elon Musk, Greg Brockman, Sam Altman, and Ilya Sutskever, as well as their version of events. According to the receipts, Musk thought there was little hope for OpenAI to succeed and agreed that some models should be closed-source.
Perplexity AI Reportedly Raising Additional Money At Significantly Higher Valuation Cap Than $520M. Perplexity AI, a rising star in the field of artificial intelligence, is reportedly in discussions to secure additional funding at a valuation significantly higher than its previous round.
Le Chat. Using its Mistral models, Mistral AI has introduced 'le Chat Mistral,' a new multilingual conversational assistant with an enterprise edition for companies.
Neuralink brain chip: advance sparks safety and secrecy concerns. Elon Musk announced this week that his company’s brain implant has allowed a person to move a computer mouse with their mind.
Ex-Google engineer arrested for alleged theft of AI secrets for Chinese firms. Linwei Ding, facing four counts of theft of trade secrets, accused of transferring confidential information to his personal account
Mistral x Snowflake. Snowflake, the Data Cloud company, and Mistral AI, one of Europe’s leading providers of AI solutions, today announced a global partnership to bring Mistral AI’s most powerful language models directly to Snowflake customers in the Data Cloud.
Moondream 2 small vision language model. Moondream is a tiny language model built on SigLIP and Phi-2. The benchmark performance has been much enhanced in this second edition, which is licensed for commercial use. It is perfect for describing visuals and operating on low-end computing hardware.
Driverless startup Waymo to test self-driving vehicles with no human driver in Austin. Autonomous vehicle company Waymo will begin testing driverless cars, with no human behind the wheel, in Austin, starting Wednesday.
Google brings Stack Overflow’s knowledge base to Gemini for Google Cloud. Developer Q&A site Stack Overflow is launching a new program today that will give AI companies access to its knowledge base through a new API, aptly named OverflowAPI.
Brave’s Leo AI assistant is now available to Android users. Brave is launching its AI-powered assistant, Leo, to all Android users. The assistant allows users to ask questions, translate pages, summarize pages, create content, and more. The Android launch comes a few months after Brave first launched Leo on desktop. Brave says Leo will be available on iOS devices in the coming weeks.
Inflection-2.5. Inflection has introduced a new model to power Pi, its personal assistant. The model achieves remarkable reasoning scores on benchmarks, performing within 94% of GPT-4 while, Inflection claims, requiring only 40% of the training compute. The post also offers an intriguing statistic: a typical conversation with Pi lasts 33 minutes.
Cohere and Accenture Collaborate to Accelerate Enterprise AI Adoption. Cohere and Accenture are working together to bring Cohere's embedding technology to over 9,000 enterprise clients.
Microsoft’s Mistral deal beefs up Azure without spurning OpenAI. Microsoft investing in Mistral puts the focus on its Azure model offerings.

Resources

Link description
2.4x faster Gemma + 58% less VRAM. You can now finetune Gemma 7b 2.43x faster than HF + Flash Attention 2 with 57.5% less VRAM use. When compared to vanilla HF, Unsloth is 2.53x faster and uses 70% less VRAM.
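A minimal sketch of what the setup looks like, following Unsloth's README at the time; the checkpoint name, sequence length, and LoRA hyperparameters are illustrative, and the API may have changed since.

```python
from unsloth import FastLanguageModel

# Load a pre-quantized 4-bit Gemma checkpoint with Unsloth's fast kernels.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-7b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
# Attach LoRA adapters; only these small matrices are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
)
# `model` can now be handed to a standard TRL/transformers trainer.
```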
DUSt3R. With the help of this project, you may create 3D representations in GLB form by taking a few photos of a site and reconstructing it for usage in 3D applications.
Datasets for Large Language Models: A Comprehensive Survey. an extensive (more than 180 pages) review and analysis of LLM datasets.
Large Language Models(LLMs) on Tabular Data: Prediction, Generation, and Understanding -- A Survey. an overview of LLMs for tabular data jobs that includes important methods, measurements, datasets, models, and optimization strategies; it also discusses unmet issues and offers suggestions for future lines of inquiry.
Using Claude 3 Opus for video summarization. Andrej Karpathy posed a challenge: turn one of his recent long videos into a blog article. Claude 3, aided by some pre-processed data, completed the job in a single long-context pass. The end product is an excellent and engaging blog post.
Dual-domain strip attention for image restoration. A new technique that greatly enhances image restoration tasks is the dual-domain strip attention mechanism.
Open-Sora-Plan. This project aims to reproduce Sora (OpenAI's text-to-video model) with limited resources. The authors hope the whole open-source community will contribute to the project.
ML system design: 300 case studies to learn from. We put together a database of 300 case studies from 80+ companies that share practical ML use cases and learnings from designing ML systems.
orca-math-word-problems-200k . This dataset contains ~200K grade school math word problems. All the answers in this dataset are generated using Azure GPT4-Turbo. Please refer to Orca-Math: Unlocking the Potential of SLMs in Grade School Math for details about the dataset construction.
mlx-swift-examples. Apple created the MLX framework, which is used to train AI models on Macs. This repository demonstrates how to use Swift for model training on mobile devices; an MNIST classifier can even be trained directly on an iPhone.
Text Clustering. A free and open-source text clustering tool that makes it simple and rapid to embed, cluster, and semantically label clusters. On 100k samples, the full pipeline runs in 10 minutes.
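The tool wraps a pipeline of this general shape. The sketch below uses sentence-transformers and scikit-learn purely for illustration; it is not the project's own API, and the final LLM-labeling step is only described in a comment.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

texts = [
    "pandas dataframe groupby example",
    "merge two dataframes on a key",
    "sourdough starter feeding schedule",
    "how long to proof a baguette",
]
# 1) Embed the documents.
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(
    texts, normalize_embeddings=True
)
# 2) Cluster the embeddings.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for text, cluster in zip(texts, labels):
    print(cluster, text)
# 3) A labeling step would prompt an LLM with samples from each cluster
#    to produce a human-readable semantic label.
```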
EasyLM. Large language models (LLMs) made easy, EasyLM is a one-stop solution for pre-training, finetuning, evaluating and serving LLMs in JAX/Flax. EasyLM can scale up LLM training to hundreds of TPU/GPU accelerators by leveraging JAX's pjit functionality.
You can now train a 70b language model at home. Today, we’re releasing Answer.AI’s first project: a fully open-source system that, for the first time, can efficiently train a 70b large language model on a regular desktop computer with two or more standard gaming GPUs (RTX 3090 or 4090). This system, which combines FSDP and QLoRA, is the result of a collaboration between Answer.AI, Tim Dettmers (U Washington), and Hugging Face’s Titus von Koeller and Sourab Mangrulkar.
Training Models at Scale. The goal of this tutorial is to provide a comprehensive overview of techniques and strategies used for scaling deep learning models and to provide a hands-on guide to implement these strategies from scratch in JAX with Flax using shard_map.
Genstruct 7B. Genstruct 7B is an instruction-generation model, designed to create valid instructions given a raw text corpus. This enables the creation of new, partially synthetic instruction finetuning datasets from any raw-text corpus.
Fructose. Fructose is a Python package that creates a dependable, strongly typed interface around an LLM call.
Efficient Multi-Head Attention Implementations. Different implementations of the widely used multi-head attention module in contemporary LLMs vary in speed by over ten times. This notebook lists a handful and compares their performance; a minimal flavor of such a comparison is sketched below.
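The two implementations below are mathematically equivalent, yet the fused kernel behind PyTorch's `scaled_dot_product_attention` is typically several times faster on GPU than the naive version that materializes the full score matrix.

```python
import torch
import torch.nn.functional as F

def manual_attention(q, k, v):
    """Naive attention: softmax(QK^T / sqrt(d)) V, materializing the scores."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(8, 16, 128, 64)  # (batch, heads, seq, head_dim)
out_manual = manual_attention(q, k, v)
out_fused = F.scaled_dot_product_attention(q, k, v)  # fused, memory-efficient
print(torch.allclose(out_manual, out_fused, atol=1e-5))  # True
```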
US regulators investigate whether OpenAI investors were misled, say reports. Internal communications from CEO Sam Altman reportedly under scrutiny in SEC inquiry
Microsoft introduces Copilot AI chatbot for finance workers in Excel and Outlook. Microsoft is launching a Copilot for Finance, which it said will be able to perform a handful of common role-specific actions in Excel and Outlook.

Perspectives

Link description
On the Societal Impact of Open Foundation Models. a position paper that centers on open foundation models and discusses their advantages, disadvantages, and effects; it also suggests a framework for risk analysis and clarifies why, in certain situations, the marginal risk of these models is low. Finally, it provides a more sober evaluation of the open foundation models' effects on society.
Towards Long Context RAG. The amazing one-million-token context window that Google's Gemini 1.5 Pro has brought to the AI community has sparked a debate regarding the future viability of retrieval-augmented generation (RAG).
Aggregator’s AI Risk. The impact of the Internet, especially through Aggregators like Google and Meta, is comparable to that of the printing press on the spread of knowledge and the establishment of nation-states. However, the rise of generative AI puts the Aggregator model to the test by offering unique answers that embody particular worldviews. This could undermine the universal appeal of Aggregator economics and points to a needed shift toward personalized AI if Aggregators are to preserve their dominance.
Is Synthetic Data the Key to AGI? The caliber of training data has a major impact on how effective large language models are. By 2027, projections indicate that there will be a shortage of high-quality data. A possible answer to this problem is synthetic data generation, which could change internet business models and emphasize the significance of fair data access and antitrust laws.
AI Research Internship Search as a CS PhD Student. Tips and thoughts from my relatively successful summer research internship hunt during third-year Computer Science PhD study.
How AI Could Disrupt Hollywood. New platforms and tools may allow a person to create a feature-length film from their living room. But can they really compete with the studios?
Training great LLMs entirely from ground zero in the wilderness as a startup. Reka's founder Yi Tay, well known for his commentary on GPU clusters, detailed the experience of building very powerful language models outside of Google in a blog post. The primary obstacles stemmed from hardware instability and cluster issues; software maturity was also a challenge.
Claude 3 Is The Most Human AI Yet. Anthropic's Claude 3, a large language model similar to GPT-4, is notable not so much for its cost-effectiveness or benchmark test results as for its distinctly human-like, creative, and naturalistic interaction quality. This represents a major breakthrough in AI's capacity to collaborate imaginatively with writers.
Licensing AI Means Licensing the Whole Economy. AI is a broad set of statistical approaches, and it would be impractical to control its use across all organizations, so regulating AI like a tangible commodity is a mistake. Given AI's imminent economic ubiquity, targeted regulation of particular misuses, akin to current strategies for programming or email abuse, would be more effective.
Is ChatGPT making scientists hyper-productive? The highs and lows of using AI. Large language models are transforming scientific writing and publishing. However, the productivity boost that these tools bring could have a downside.
Artificial intelligence and illusions of understanding in scientific research. Why are AI tools so attractive and what are the risks of implementing them across the research pipeline? Here we develop a taxonomy of scientists’ visions for AI, observing that their appeal comes from promises to improve productivity and objectivity by overcoming human shortcomings.
AI will likely increase energy use and accelerate climate misinformation – report. Claims that artificial intelligence will help solve the climate crisis are misguided, warns a coalition of environmental groups
We Need Self-Driving Cars. Anyone rooting against self-driving cars is cheering for tens of thousands of deaths, year after year. We shouldn’t be burning self-driving cars in the streets. We should be celebrating…
Subprime Intelligence. Significant problems in OpenAI's Sora demonstrate the limitations of generative AI's comprehension. The technology presents both practical obstacles and revolutionary possibilities, as seen by its high computing needs and potential impact on the creative industry.
Sora, Groq, and Virtual Reality. A few years ago, Facebook's drive into the metaverse looked misguided, and the idea of the metaverse appeared like fiction from Ernest Cline's novel. Things feel different now. Groq's deterministic circuits streamline machine-learning algorithms for quicker processing, while Sora creates intricate video situations. The combination of these developments brings us one step closer to real-time video simulation and full-fledged virtual reality.
AI Is Like Water. For GenAI companies to have a competitive advantage, technology alone is no longer sufficient. This means that since the basic product is virtually the same, GenAI and bottled water are comparable. The primary differentiators need to originate from elements like distribution, user experience, perceived customer value, branding, and marketing.

meme-of-the-week

Back to index

ML news: Week 26 February - 3 March

Research

Link description
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs. The RL algorithm REINFORCE is straightforward, well known, and simple to understand, but training it stably is a challenge; PPO is generally far more reliable and performant. Gemini reportedly uses REINFORCE, while GPT-4 presumably uses PPO. A minimal REINFORCE step is sketched below.
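In an RLHF setting the logits would come from the language model and the rewards from a reward model; the values here are toy stand-ins.

```python
import torch

logits = torch.randn(4, 10, requires_grad=True)  # policy outputs (batch, actions)
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()                          # sampled actions/completions
rewards = torch.tensor([1.0, -0.5, 0.2, 0.8])    # e.g. reward-model scores

# REINFORCE: maximize E[(R - baseline) * log pi(a)]. The baseline only
# reduces the variance that makes vanilla REINFORCE unstable to train.
baseline = rewards.mean()
loss = -((rewards - baseline) * dist.log_prob(actions)).mean()
loss.backward()                                  # policy-gradient estimate
```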
AlphaFold Meets Flow Matching for Generating Protein Ensembles. The protein's post-folding state can be predicted using AlphaFold. Adding invertible flow matching allows you to significantly increase modeling capability throughout the whole protein landscape.
Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models. Researchers have created a new technique that focuses on "expert-level sparsification," which minimizes model size without sacrificing performance, to make LLMs more effective and user-friendly. For Mixture-of-Experts LLMs, which are strong but typically too large to manage simply, this is very helpful.
Towards Generalizable Hand-Object Interaction Denoising via Denoising Diffusion. A novel method called GeneOH Diffusion enhances models' comprehension of and ability to manipulate objects with their hands. The goal of this technique is to improve the naturalness of these interactions by fixing mistakes in hand gestures and object relationships.
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis. Sora aside, Snap Research has developed a video generation model that is 3 times faster to run than the prior state of the art.
OpenCodeInterpreter. By training on a synthetic multi-turn dataset and utilizing human feedback, a model built on CodeLlama and DeepSeek Coder was able to achieve 85%+ on the HumanEval programming benchmark.
INSTRUCTIR: A Benchmark for Instruction Following of Information Retrieval Models. A new benchmark called INSTRUCTIR aims to improve search engines' ability to infer users' intentions. INSTRUCTIR assesses how well search engines can obey user instructions and adjust to different and evolving search needs, in contrast to existing approaches that primarily concentrate on the query itself.
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases. Meta's 350M-parameter language model shows strong reasoning performance, approaching Llama 7B accuracy on tasks that involve calling API functions. Although the model is not yet available, the design choices explored for fixed-parameter-budget models are worth studying.
ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models. A new multilingual benchmark called ConceptMath is used to assess LLMs' arithmetic proficiency in both Chinese and English. It's special because it deconstructs arithmetic problems into discrete ideas, enabling a more thorough evaluation of an AI's mathematical prowess and shortcomings.
Generate What You Prefer: Reshaping Sequential Recommendation via Guided Diffusion. DreamRec proposes a novel 'learning-to-generate' approach to sequential recommendation: rather than the conventional method of identifying user preferences from a mixture of positive and negative items, it generates an 'oracle' item representing the optimal next choice for the user.
FlowMDM: Seamless Human Motion Composition with Blended Positional Encodings. A novel model called FlowMDM uses text descriptions to create lengthy, continuous sequences of human movements. This groundbreaking diffusion-based model excels in accuracy and realism on important datasets by using Blended Positional Encodings to create realistic motion without the need for additional denoising stages.
VSP-LLM (Visual Speech Processing incorporated with LLMs). We propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM), to maximize the context modeling ability by bringing the overwhelming power of LLMs. Specifically, VSP-LLM is designed to perform multi-tasks of visual speech recognition and translation, where the given instructions control the type of task.
Repetition Improves Language Model Embeddings. We present echo embeddings, an embedding strategy designed to address an architectural limitation of autoregressive models: that token embeddings cannot contain information from tokens that appear later in the input. Echo embeddings resolve this issue by repeating the input twice in the input to the embedding model. Our method has strong performance on MTEB and is compatible with many other methods for improving embedding models.
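The core trick, sketched for a Hugging Face causal LM. The prompt template is paraphrased from the paper, and aligning the second occurrence by token count is a rough approximation, so treat this as an illustration only.

```python
def echo_embed(model, tokenizer, text):
    """Embed `text` by feeding it twice and pooling the second occurrence,
    whose token states can attend to the full input."""
    prompt = f"Rewrite the passage: {text} Rewritten passage: {text}"
    inputs = tokenizer(prompt, return_tensors="pt")
    hidden = model(**inputs, output_hidden_states=True).hidden_states[-1][0]
    n = len(tokenizer(text, add_special_tokens=False)["input_ids"])
    return hidden[-n:].mean(dim=0)  # mean-pool the echoed occurrence
```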
Range-Agnostic Multi-View Depth Estimation With Keyframe Selection. Multi-View 3D reconstruction techniques process a set of source views and a reference view to yield an estimated depth map for the latter.
ChatMusician: Understanding and Generating Music Intrinsically with LLM. Comprehending music usually requires adding a modality-specific encoder to a language model, which is unstable and costly. This study demonstrated that tokenizing music into ABC notation significantly boosts music knowledge without affecting basic language proficiency.
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs. Bytedance has produced a system called MegaScale that can be used to train massively parallel large language models. It succeeded in training a 175B LLM on 12,288 GPUs with 55.2% Model FLOP utilization (MFU), which is extremely impressive. Bytedance plans to open source some aspects of the codebase.
ListT5: Listwise Reranking with Fusion-in-Decoder Improves Zero-shot Retrieval. ListT5 presents a novel reranking technique that not only increases information retrieval precision but also provides a workable solution to the issues that earlier listwise rerankers encountered.
MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT. Our primary contribution is the introduction of an accurate and fully transparent open-source 0.5 billion (0.5B) parameter SLM, named MobiLlama, catering to the specific needs of resource-constrained computing with an emphasis on enhanced performance with reduced resource demands.
Accurate LoRA-Finetuning Quantization of LLMs via Information Retention. A novel method called IR-QLoRA improves the accuracy of quantized large language models, making them more suitable for use on low-resource devices.
Video as the New Language for Real-World Decision Making. An impressive piece of research that presents video as a possible improvement over current methods for AI to communicate with humans. It demonstrates the use of video models as environment simulators, planners, agents, and computation engines.
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. A parameter in the majority of language models is represented by 16 bits or more. This produces strong models that may be costly to operate. This study suggests a technique where each parameter is in {-1, 0, 1} and requires 1.58 bits. Performance is precisely matched by this approach up to 3B parameters. Models and codes are not yet available.
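The paper describes an absmean quantizer that maps weights to {-1, 0, 1}; a sketch of that step is below. Training wraps this in a straight-through estimator, which is omitted here.

```python
import torch

def absmean_ternary(w, eps=1e-8):
    """Quantize a weight tensor to {-1, 0, 1} with a per-tensor scale."""
    scale = w.abs().mean().clamp(min=eps)   # absmean scaling factor
    w_q = (w / scale).round().clamp(-1, 1)  # round-and-clip to ternary values
    return w_q, scale                       # forward pass uses w_q * scale

w = torch.randn(4, 4)
w_q, scale = absmean_ternary(w)
print(w_q)  # entries are only -1., 0., or 1.
```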
Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models. Enhancing multi-modality foundation models such as GPT-4V in low-level visual perception tasks is the main goal of this research. The extensive study collected comments on 18,973 photos from 58,000 people and produced the Q-Pathway dataset for brightness, color, and clarity analysis.
Graph Diffusion Policy Optimization. GDPO uses reinforcement learning, via a policy-gradient method adapted to the graph diffusion setting, to optimize graph diffusion models for arbitrary, even non-differentiable, reward signals.
HiGPT: Heterogeneous Graph Language Model. A method for learning across many heterogeneous graphs without requiring fine-tuning is called HiGPT. It excels at adapting to different data distributions thanks to its integration with a unique graph tokenizer and a large corpus of graph commands.
PromptMM: Multi-Modal Knowledge Distillation for Recommendation with Prompt-Tuning. PromptMM uses Multi-modal Knowledge Distillation to enhance recommendation systems on sites like Amazon and TikTok. In order to avoid overfitting, it eliminates errors in user preferences and streamlines systems by extracting key characteristics from different kinds of content (textual, audio, or visual).
Genie: Generative Interactive Environments. We introduce Genie, a foundation world model trained from Internet videos that can generate an endless variety of playable (action-controllable) worlds from synthetic images, photographs, and even sketches.
UniVS: Unified and Universal Video Segmentation with Prompts as Queries. With a unique prompt-based methodology, UniVS is a unified architecture for video segmentation that addresses the difficulties of diverse segmentation jobs. UniVS removes the requirement for heuristic inter-frame matching by utilizing prompt characteristics as queries and providing a target-wise prompt cross-attention layer. This allows UniVS to adapt to various video segmentation settings with ease.
Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis. With a deep semantic knowledge of pictures, the Coarse-to-Fine Latent Diffusion (CFLD) method avoids overfitting and offers a novel Pose-Guided Person Image Synthesis technique that overcomes the drawbacks of previous models.
Evaluating Quantized Large Language Models. Large language models like OPT and LLaMA2 can be rendered more compute- and memory-efficient through the use of post-training quantization.
Representing 3D sparse map points and lines for camera relocalization. With minimal memory and processing power, this study presents a novel method for 3D mapping and localization that processes both point and line information using a lightweight neural network, greatly improving pose accuracy.
Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving. Drive-WM can produce high-quality multiview films to forecast future events, allowing self-driving cars to make more intelligent and safe driving choices.
Do Large Language Models Latently Perform Multi-Hop Reasoning? This study investigates whether LLMs can perform latent multi-hop reasoning, akin to human thought processes. Crafting prompts like "The mother of the singer of 'Superstition' is", the researchers probe how LLMs navigate complex queries and find evidence that models can indeed perform multi-hop reasoning, often relying on a bridge entity such as Stevie Wonder to connect disparate pieces of information. The findings highlight both the strengths and limitations of LLMs in this regard.

News

Link description
Microsoft reportedly makes AI server gear to cut Nvidia dependence. Microsoft is creating its own AI server hardware to intensify actions to decrease its dependency on Nvidia, according to a source familiar with the matter speaking to The Information.
‘Embarrassing and wrong’: Google admits it lost control of image-generating AI. Google has apologized (or come very close to apologizing) for another embarrassing AI blunder this week, an image-generating model that injected diversity into pictures with a farcical disregard for historical context. While the underlying issue is perfectly understandable, Google blames the model for “becoming” oversensitive.
Is OpenAI the next challenger trying to take on Google Search? The Information says OpenAI is working on web search (partially powered by Bing) that would more directly compete with Google. It’s unclear if it would be standalone, or a part of ChatGPT.
Transformer Circuits Thread - Updates - February 2024. The research experts at Anthropic have been developing a circuits-based approach to understanding deep neural networks. These circuits seek to pinpoint model components that are employed in particular capabilities. Every month, the research team publishes an update on the experiments they conducted and the results.
A new tool targets voter fraud in Georgia – but is it skirting the law? A tech company supported by Trump’s former lawyer is injecting chaos into the state’s vote-counting process
Democratic political operative admits he commissioned robocall of AI Biden. Steve Kramer said ‘easy-to-use technology’ enabled him to send automated call while New Orleans magician says he was paid $150 to make it
Mistral Large. Mistral Large is our new cutting-edge text generation model. It reaches top-tier reasoning capabilities. It can be used for complex multilingual reasoning tasks, including text understanding, transformation, and code generation. Mistral Large achieves strong results on commonly used benchmarks, making it the world's second-ranked model generally available through an API (next to GPT-4)
Scale AI to set the Pentagon’s path for testing and evaluating large language models . The company will create a comprehensive T&E framework for generative AI within the Defense Department.
DatologyAI is building tech to automatically curate AI training datasets. Morcos’ company, DatologyAI, builds tooling to automatically curate datasets like those used to train OpenAI’s ChatGPT, Google’s Gemini, and other, similar GenAI models. The platform can identify which data is most important depending on a model’s application (e.g. writing emails), Morcos claims, in addition to ways the dataset can be augmented with additional data and how it should be batched, or divided into more manageable chunks, during model training.
Bay Bridge: A supercomputer built for startups. With flexible short-term renting options, San Francisco Compute Company is now providing the lowest-cost H100 training clusters in the world to customers who require intensive computing for AI model training but do not want to commit to long-term agreements. Its first cluster, Angel Island, is operational at the moment, and Bay Bridge will follow shortly. The unique business strategy of SF Compute places a premium on cost and accessibility for AI entrepreneurs without requiring long-term commitments.
mlabonne/AlphaMonarch-7B. AlphaMonarch-7B is a new DPO merge that retains all the reasoning abilities of the very best merges and significantly improves its conversational abilities. Kind of the best of both worlds in a 7B model.
LazyAxolotl. This notebook allows you to fine-tune your LLMs using Axolotl and RunPod.
Apple’s electric car project is dead. After a decade of work, the company is reportedly giving up on its ambitious effort to create an autonomous electric car.
Expressive Whole-Body Control for Humanoid Robots. UCSD researchers trained robust, socially inclined, expressive policies for humanoid robots. Their videos of unchoreographed dancing on grass are quite amazing.
Meta plans launch of new AI language model Llama 3 in July, The Information reports. Meta Platforms (META.O) is planning to release the newest version of its artificial-intelligence large language model Llama 3 in July, which would give better responses to contentious questions posed by users, The Information reported on Wednesday.
Tim Cook Says Apple Will 'Break New Ground' in Generative AI. Cook said that the company will "break new ground" in generative AI in 2024. "We believe it will unlock transformative opportunities for our users," said Cook.
Elon Musk sues OpenAI accusing it of putting profit before humanity. Lawsuit says chief executive Sam Altman’s deal with Microsoft has broken organization’s mission
Figure raises $675M at $2.6B valuation. In order to continue developing humanoid robots, Figure, a robotics startup, has secured $675 million from a number of significant investors, including OpenAI.

Resources

Link description
Pearl - A Production-ready Reinforcement Learning AI Agent Library. Pearl is a new production-ready Reinforcement Learning AI agent library open-sourced by the Applied Reinforcement Learning team at Meta. Pearl enables the development of Reinforcement Learning AI agents.
Large Language Models for Data Annotation: A Survey. This is a curated list of papers about LLMs for annotation.
Automotive Object Detection with Spiking Neural Networks (SNNs). Spiking Neural Networks offer a novel and effective approach to object detection for autonomous cars, attaining high performance with up to 85% less energy.
Berkeley function calling leaderboard. Function calling is when a language model carries out commands by invoking synthesized functions, which requires the model to produce properly formed arguments for those functions. This leaderboard evaluates model performance on function-calling tasks; an illustrative request is sketched below.
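A minimal function-calling request in the OpenAI tools style that such leaderboards exercise; the `get_weather` name and schema are invented for this sketch.

```python
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
response = client.chat.completions.create(
    model="gpt-4-0125-preview",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
# The model is judged on whether this call and its arguments
# (here {"city": "Paris"}) are well-formed and correct.
print(response.choices[0].message.tool_calls)
```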
FuseChat. FuseChat is a novel approach to combining the advantages of multiple large language models into a single, more capable model without incurring the cost of training from scratch.
ShieldLM . ShieldLM is a bilingual (Chinese and English) safety detector that mainly aims to help detect safety issues in LLMs' generations. It aligns with general human safety standards, supports fine-grained customizable detection rules, and provides explanations for its decisions.
Enable decision-making based on LLM-based simulations. An open-source project called Simulatrex is dedicated to GABM or generative agent-based modeling. Large language models are used to provide more accurate simulations.
Training-Free Long-Context Scaling of Large Language Models. Dual chunk attention is a training-free and effective method for extending the context window of large language models (LLMs) to more than 8x times their original pre-training length. We refer to the Llama-based model with dual chunk attention as ChunkLlama.
DPO to encourage descriptiveness. A minimal code setup with TRL to tune a model to be more descriptive.
Shape suffixes for ML coding. A coding style used at Character AI significantly improves the readability of tensor shapes: each variable name carries a suffix spelling out its dimensions, as sketched below.
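The convention in a few lines of illustrative PyTorch (B = batch, L = length, D = model dim, V = vocab); because every name carries its shape, each operation documents itself.

```python
import torch

B, L, D, V = 2, 16, 64, 1000
tokens_BL = torch.randint(0, V, (B, L))
embed_VD = torch.randn(V, D)
x_BLD = embed_VD[tokens_BL]        # gather: (B, L) indices into (V, D)
logits_BLV = x_BLD @ embed_VD.T    # (B, L, D) @ (D, V) -> (B, L, V)
```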
Getting started with MAX Developer Edition. To drastically reduce complexity and accelerate AI implementations, Modular developed the MAX toolset. It is currently accessible.
Bonito. Bonito is an open-source model for conditional task generation: the task of converting unannotated text into task-specific training datasets for instruction tuning. This repo is a lightweight library for Bonito to easily create synthetic datasets built on top of the Hugging Face transformers and vllm libraries.
Awesome-LLMs-for-Video-Understanding. A selection of helpful resources for understanding videos with large language models can be found in this repository.
Mist text to speech. A new text-to-speech model called Mist, from Rime, has strong conversational capabilities. Unlike earlier models, it can incorporate "ums" and realistic pauses.
Add your own Ollama models. Guidelines for contributing your own models to the Ollama repository for public usage.
2x speed up HF inference with static KV Cache. Increased inference speed can unlock new use cases. This code proposes a method to accelerate Hugging Face inference for Llama models using a static KV cache; a sketch of the approach follows.
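A sketch assuming a recent transformers release that supports the static cache setting; the Llama checkpoint is just an example and is gated on the Hub.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
# A fixed-size KV cache keeps tensor shapes static, so torch.compile can
# specialize the decoding step into fast kernels.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tok("The theory of relativity states", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```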

Perspectives

Link description
Sam Altman Wants $7 Trillion. In order to meet the fast-rising costs of developing generative AI models such as GPT, Sam Altman has proposed a $7 trillion budget, indicating an exponential increase in resources required for further iterations. This goal highlights a critical juncture in the development of AI, striking a balance between the quickening pace of scientific improvement and its wider effects on safety and societal preparedness.
Ten AI Insights from Databricks, Anyscale, and Microsoft. This article features interviews with founders of AI-forward firms, including their perspectives on the emergence of artificial general intelligence (AGI), how to approach LLMs, and basic strategies for entrepreneurs integrating AI into their products.
What the EU’s tough AI law means for research and ChatGPT. The EU AI Act is the world’s first major legislation on artificial intelligence and strictly regulates general-purpose models.
Online images amplify gender bias. We find that gender bias is consistently more prevalent in images than text for both female- and male-typed categories. We also show that the documented underrepresentation of women online is substantially worse in images than in text, public opinion, and US census data.
distilabel. AI Feedback (AIF) framework for building datasets with and for LLMs.
StarCoder2. StarCoder2-15B model is a 15B parameter model trained on 600+ programming languages from The Stack v2, with opt-out requests excluded.
The paradox of diffusion distillation. Diffusion models decompose complex problems, such as image generation, into many smaller ones, such as removing a small amount of noise from an image. Single-step diffusion generation has received a lot of attention; however, it appears to miss the mark. This article examines the diffusion distillation conundrum and lists the various avenues of inquiry that might be pursued.

meme-of-the-week

Back to index

ML news: Week 19 - 25 February

Research

Link description
Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning. Deciding which examples to use when aligning language models with preference data is frequently difficult. This paper proposes an unexpectedly strong baseline: pick the 1,000 examples with the longest responses, as sketched below.
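The baseline really is this simple; a sketch with an illustrative `response` field name:

```python
def longest_1k(dataset):
    """Keep the 1,000 examples with the longest responses."""
    return sorted(dataset, key=lambda ex: len(ex["response"]), reverse=True)[:1000]

data = [{"instruction": "...", "response": "x" * n} for n in range(5000)]
subset = longest_1k(data)
assert len(subset) == 1000 and len(subset[0]["response"]) == 4999
```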
Extreme Video Compression with Pre-trained Diffusion Models. As diffusion models get more adept at synthesizing pictures and videos, their extensive "knowledge" of the world can be put to other uses. This study achieved compression at an astounding 0.02 bits per pixel. The key was to track perceptual similarity along the way and send a fresh frame of the original video whenever quality degraded.
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset. To train open-source Large Language Models in math that equal the performance of closed-source models, researchers have developed a new dataset called OpenMathInstruct-1. With 1.8 million problem-solution pairings, this innovation paves the way for more competitive and approachable AI systems for math teaching.
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. Quantizing the KV cache lets a Transformer consume less memory at inference time; quantization is the process of reducing floating-point precision with minimal quality loss.
Pushing the Limits of Zero-shot End-to-End Speech Translation. ZeroSwot is a novel approach to speech translation (ST) that addresses data scarcity and the differences between the text and speech modalities. It can plug into a multilingual translation model by using special strategies to train a speech encoder from speech recognition data alone.
Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE). A novel technique called SpLiCE decomposes CLIP's dense representations into sparse, human-interpretable combinations of concepts.
TDViT: Temporal Dilated Video Transformer for Dense Video Tasks. A novel Temporal Dilated Video Transformer (TDViT) has been created to enhance the analysis of tasks involving dense videos, like object detection in videos frame by frame.
Generative Representational Instruction Tuning. A model that creates embeddings and text has been trained and released by the Contextual team. It performs noticeably better than a single specialized model. With embedding as the output modality, the model offers an intriguing interpretation of the multi-modal trend.
LoRA+: Efficient Low-Rank Adaptation of Large Models. To improve on the current Low-Rank Adaptation (LoRA) technique for fine-tuning big models, this work introduces LoRA+. By applying different learning rates to the two LoRA adapter matrices, LoRA+ achieves improved performance and faster fine-tuning without raising processing loads.
GaussianObject: Just Taking Four Images to Get A High-Quality 3D Object with Gaussian Splatting. We propose GaussianObject, a framework to represent and render the 3D object with Gaussian splatting, that achieves high rendering quality with only 4 input images.
MVDiffusion++: A Dense High-resolution Multi-view Diffusion Model for Single to Sparse-view 3D Object Reconstruction. This paper presents a neural architecture MVDiffusion++ for 3D object reconstruction that synthesizes dense and high-resolution views of an object given one or a few images without camera poses.
ChatterBox: Multi-round Multimodal Referring and Grounding. A vision-language model called ChatterBox performs exceptionally well in multimodal dialogues, particularly in the recently defined job of multimodal multi-round referring and grounding.
Large language models streamline automated machine learning for clinical studies. A knowledge gap persists between machine learning developers and clinicians. Here, the authors show that the Advanced Data Analysis extension of ChatGPT could bridge this gap and simplify complex data analyses, making them more accessible to clinicians.
Extracting accurate materials data from research papers with conversational language models and prompt engineering. Efficient data extraction from research papers accelerates science and engineering. Here, the authors develop an automated approach that uses conversational large language models to achieve high precision and recall in extracting materials data.
GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis. GradSafe is a novel technique that can identify dangerous prompts in large language models without requiring extensive training. By examining the gradients of certain parameters, it identifies dangerous prompts more accurately than existing approaches.
Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition. A novel technique called Class-Aware Mask-guided (CAM) feature refinement improves text recognition in challenging environments.
Object Recognition as Next Token Prediction. an innovative approach to object recognition that makes use of a language decoder. With this approach, text tokens are predicted from picture embeddings by using a customized non-causal attention mask. It makes it possible to sample many labels in parallel effectively.
TIER: Text and Image Encoder-based Regression for AIGC Image Quality Assessment. To evaluate the quality of the generated images, TIER makes use of both written prompts and the images that result from them.
Large Language Models for Data Annotation: A Survey. a taxonomy of techniques that use LLMs for data annotation; covers three aspects: LLM-based data annotation, evaluating LLM-generated annotations, and learning using LLM-generated annotations. an overview and a decent list of references that utilize LLMs for data annotation.
Generative Representational Instruction Tuning. provides new state-of-the-art on MTEB and the unification is reported to speed up RAG by 60% for long documents. It does this by using generative representational instruction tuning, in which an LLM is trained to perform both generative and embedding tasks and designed to distinguish between them via the instructions.
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs. demonstrates that a more straightforward version of REINFORCE performs better than both PPO and recently suggested alternatives like DPO and RAFT; all in all, it demonstrates that online RL optimization may be advantageous and inexpensive. Moreover, it demonstrates that many of the components of PPO are superfluous in an RLHF setting.
In Search of Needles in an 11M Haystack: Recurrent Memory Finds What LLMs Miss. finds that both GPT-4 and RAG performance significantly depend on the first 25% of the input, suggesting that there is room for improved context processing mechanisms. It also reports that recurrent memory augmentation of transformer models achieves superior performance on documents of up to 10 million tokens. This study investigates the capability of transformer-based models in extremely long context processing.
When is Tree Search Useful for LLM Planning? It Depends on the Discriminator. examines how LLMs solve multi-step problems using a framework consisting of a generator, a discriminator, and a planning technique (such as tree search or iterative correction); finds that planning approaches require discriminators with at least 90% accuracy, which existing LLMs do not exhibit; also shows that although tree search performs well, it is at least 10-20 times slower and therefore unsuitable for real-world applications.
Chain-of-Thought Reasoning Without Prompting. claims to significantly improve a model's reasoning capabilities over greedy decoding across reasoning benchmarks and suggests a chain-of-thought (CoT) decoding method to elicit reasoning capabilities from pre-trained LLMs without explicit prompting. It also finds that the presence of CoT in the decoding path increases the model's confidence in its final answer.
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement. a set of free and open-source tools for writing, running, and improving code iteratively; includes a 68K multi-turn interaction dataset; combines human input with code execution for dynamic code refinement; achieves excellent results on benchmarks such as HumanEval and EvalPlus.

News

Link description
Anthropic takes steps to prevent election misinformation. Called Prompt Shield, the technology, which relies on a combination of AI detection models and rules, shows a pop-up if a U.S.-based user of Claude, Anthropic’s chatbot, asks for voting information. The pop-up offers to redirect the user to TurboVote, a resource from the nonpartisan organization Democracy Works, where they can find up-to-date, accurate voting information.
OpenAI's next AI product could be after your job (again). OpenAI is said to be developing AI agents that automate even more complex tasks, though their launch timeline remains unknown. One AI agent is said to take over the customer’s device to perform tasks like transferring data from a document to a spreadsheet, filling out expense reports, and entering them into accounting software. The other AI agent is said to perform more research-oriented, web-based tasks, such as creating itineraries and booking flight tickets.
Our next-generation model: Gemini 1.5. In fact, we’re ready to introduce the next generation: Gemini 1.5. It shows dramatic improvements across several dimensions and 1.5 Pro achieves comparable quality to 1.0 Ultra while using less computing.
OpenAI on track to hit $2bn revenue milestone as growth rockets. Thanks in large part to ChatGPT's enormous success, OpenAI has reached an annual revenue run rate of over $2 billion, making it one of the fastest-growing tech companies.
Sam Altman wants Washington's backing for his $7 trillion AI chip venture. The OpenAI CEO is working to secure US government approval for the project as it risks raising national security and antitrust concerns, Bloomberg reported.
‘Gemini Business’ and ‘Gemini Enterprise’ plans for Google Workspace are coming. The upcoming changelog — as spotted by Testing Catalog and Dylan Roussel on X/Twitter today — reveals the existence of “Gemini Business” and “Gemini Enterprise” plans. This will give “Google Workspace customers access to one of Google’s most capable Al models, 1.0 Ultra in Gemini, and enterprise-grade data protections.”
OpenAI Reaches $80 Billion Valuation In Venture Firm Deal, Report Says. OpenAI inked a deal with venture capital firm Thrive Capital that boosted its valuation to $80 billion or more, the New York Times reported, a nearly threefold increase in value from just nine months ago.
Magic raises $117m to continue code generation models. We've raised $117M to build an AI software engineer.
SoftBank Founder Masayoshi Son Aims to Raise $100 Billion for New Chip Venture, "Izanagi". Masayoshi Son, the visionary founder of SoftBank Group Corp., has set his sights on revolutionizing the semiconductor industry with the launch of Izanagi, a groundbreaking chip venture backed by a staggering $100 billion investment.
Scribe $25M Series B. To further its AI-driven platform, Scribe has secured a Series B fundraising round led by Redpoint Ventures. This round aims to speed up the generation of visual step-by-step tutorials and enable knowledge exchange between enterprises.
Amazon AGI Team Say Their AI Is Showing “Emergent Abilities”. "Big Adaptive Streamable TTS with Emergent Abilities" (BASE TTS), a language model created by Amazon AGI researchers, exhibits "state-of-the-art naturalness" in conversational text and demonstrates language skills that it wasn't explicitly trained on.
Gemma: Introducing new state-of-the-art open models. We’re releasing model weights in two sizes: Gemma 2B and Gemma 7B. Each size is released with pre-trained and instruction-tuned variants. Ready-to-use Colab and Kaggle notebooks, alongside integration with popular tools such as Hugging Face, MaxText, NVIDIA NeMo, and TensorRT-LLM, make it easy to get started with Gemma.
Reddit has a new AI training deal to sell user content. Over a decade of valuable user content is now for sale as Reddit preps to go public.
Apple Developing AI Tool to Help Developers Write Code for Apps. Apple is working on an updated version of Xcode that will include an AI tool for generating code, reports Bloomberg. The AI tool will be similar to GitHub Copilot from Microsoft, which can generate code based on natural language requests and convert code from one programming language to another.
Stable Diffusion 3. Announcing Stable Diffusion 3 in early preview, our most capable text-to-image model with greatly improved performance in multi-subject prompts, image quality, and spelling abilities.
How Bret Taylor’s new company is rethinking customer experience in the age of AI. The two founders fundamentally see AI agents as a new technology category, providing an entirely new way for customers to interact with brands to improve their overall experience.
Introducing Phind-70B – closing the code quality gap with GPT-4 Turbo while running 4x faster. We're excited to announce Phind-70B, our largest and most performant model to date. Running at up to 80 tokens per second, Phind-70B delivers high-quality answers on technical topics without leaving users enough time to brew a cup of coffee while they wait. Phind-70B scores 82.3% on HumanEval, beating the latest GPT-4 Turbo (gpt-4-0125-preview) score of 81.1% in our evaluation.
Marqo Raises $12.5 Million to Help Businesses Build Generative AI Applications. Marqo has raised $12.5 million in a Series A funding round to advance the adoption of its search platform that helps businesses build generative artificial intelligence (AI) applications that are more relevant and up to date.

Resources

Link description
minbpe. Minimal, clean code for the (byte-level) Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization. The BPE algorithm is "byte-level" because it runs on UTF-8 encoded strings (a toy training loop in the same spirit is sketched after this list).
GPTScript. GPTScript is a new scripting language for automating your interactions with a Large Language Model (LLM), primarily OpenAI's models. The ultimate goal is to create a fully natural-language-based programming experience. The syntax of GPTScript is largely natural language, making it very easy to learn and use.
QWEN. We open-source our Qwen series, now including Qwen, the base language models, namely Qwen-1.8B, Qwen-7B, Qwen-14B, and Qwen-72B, as well as Qwen-Chat, the chat models, namely Qwen-1.8B-Chat, Qwen-7B-Chat, Qwen-14B-Chat, and Qwen-72B-Chat.
Sora Reference Papers. A collection of all papers referenced in OpenAI's "Video generation models as world simulators".
repeng. Control vectors are a low-cost way to steer the output of generative models. Compared to LoRA, they are cheaper to train yet can still be quite powerful, and this library makes them simple to build and apply (see the sketch after this list).
OpenRLHF. This is a Ray-based implementation of RLHF for Mistral and other Llama-style models. Several PPO stabilizing techniques are included to enhance performance.
3D Diffuser Actor: Policy Diffusion with 3D Scene Representations. To enhance robot manipulation, the 3D Diffuser Actor combines 3D scene representations with diffusion policies, helping robots better understand and interact with their surroundings.
How to jointly tune learning rate and weight decay for AdamW. AdamW is often considered a method that decouples weight decay and learning rate. In this blog post, we show that this is not true for the specific way AdamW is implemented in PyTorch. We also show how to adapt the tuning strategy to fix this: when doubling the learning rate, the weight decay should be halved (a short sketch follows this list).
OpenLLMetry-JS. OpenLLMetry-JS is a set of extensions built on top of OpenTelemetry that gives you complete observability over your LLM application. Because it uses OpenTelemetry under the hood, it can be connected to your existing observability solutions - Datadog, Honeycomb, and others.
List of GPU clusters for rent. A list of entire clusters that can be rented on an hourly basis.
Mamba: The Hard Way. A detailed description of how Mamba works.
New benchmark for large language models. A collection of nearly 100 tests the author extracted from their actual conversation history with various LLMs.
BoCoEL. Bayesian Optimization as a Coverage Tool for Evaluating LLMs. Accurate evaluation (benchmarking) is 10 times faster with just a few lines of modular code.
FiT: Flexible Vision Transformer for Diffusion Model. This repo contains PyTorch model definitions, pre-trained weights, and sampling code for our flexible vision transformer (FiT). FiT is a diffusion transformer-based model that can generate images at unrestricted resolutions and aspect ratios.
RobustVLM. This study presents a novel technique to defend multi-modal models like OpenFlamingo and LLaVA against visual adversarial attacks. By fine-tuning the CLIP vision encoder in an unsupervised way, the authors protect these models against manipulative image attacks, improving their reliability and security in practical applications without requiring full model retraining.
HELM Instruct: A Multidimensional Instruction Following Evaluation Framework with Absolute Ratings. The Stanford language modeling group released the popular Holistic Evaluation of Language Models (HELM) benchmark. They have now created HELM-Instruct, a version for instruction following. It is absolute, open-ended, and multidimensional.
LoRA Land: Fine-Tuned Open-Source LLMs that Outperform GPT-4. We’re excited to release LoRA Land, a collection of 25 fine-tuned Mistral-7b models that consistently outperform base models by 70% and GPT-4 by 4-15%, depending on the task. This collection of specialized fine-tuned models–all trained with the same base model–offers a blueprint for teams seeking to efficiently and cost-effectively deploy highly performant AI systems.
Multimodal LLM’s Ability to Understand Visual Data. A new tool called ChartX is designed to assess how well multi-modal large language models (MLLMs) can understand and make sense of visual charts.
A Critical Evaluation of AI Feedback for Aligning Language Models. This repository questions the efficacy of combining reinforcement learning from AI feedback with supervised fine-tuning: simply performing supervised fine-tuning on data from a more capable model, such as GPT-4, can outperform the more involved two-step pipeline.
MMCSG Dataset. The MMCSG (Multi-Modal Conversations in Smart Glasses) dataset comprises two-sided conversations recorded using Aria glasses, featuring multi-modal data such as multi-channel audio, video, accelerometer, and gyroscope measurements. This dataset is suitable for research in areas like automatic speech recognition, activity detection, and speaker diarization.
MultiLoRA inference server. The Lorax inference server lets many LoRAs be hot-swapped onto a single base model, supporting a large variety of fine-tuned variants with a significant reduction in RAM use.
GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations. GTBench is a language-driven environment evaluating the strategic reasoning limitations of LLMs through game-theoretic tasks. GTBench is built on top of OpenSpiel and supports 10 widely recognized games.
CrewAI. A library called CrewAI is available for creating and managing AI agents that make use of Replit and LangChain. It offers an easy-to-integrate modular setup comprising tasks, agents, crews, and tools for a variety of applications. LangSmith improves performance insights into non-deterministic LLM calls while streamlining the debugging process.
gemma.cpp. gemma.cpp is a lightweight, standalone C++ inference engine for the Gemma foundation models from Google.
MMedLM. The official codes for "Towards Building Multilingual Language Model for Medicine".
LLM Evaluation Metrics for Labeled Data. How to measure the performance of LLM applications with ground truth data.
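As a companion to the minbpe entry above, here is a toy byte-level BPE training loop. It mirrors the algorithm, not minbpe's actual API; the corpus and the number of merges are purely illustrative.

```python
from collections import Counter

def get_pair_counts(ids):
    # count occurrences of each adjacent pair of token ids
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    # replace every occurrence of `pair` with the new token id
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "low lower lowest"          # toy corpus
ids = list(text.encode("utf-8"))   # byte-level: start from raw UTF-8 bytes
merges = {}
for new_id in range(256, 256 + 10):       # 10 merges, just for illustration
    counts = get_pair_counts(ids)
    if not counts:
        break
    pair = counts.most_common(1)[0][0]    # most frequent adjacent pair
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id

print(merges)  # learned merge rules, e.g. {(108, 111): 256, ...}
```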
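For the repeng entry above, here is a minimal sketch of the control-vector idea rather than repeng's actual API: derive a steering vector from the difference of mean hidden states on one contrastive prompt pair, then add a scaled copy of it to a block's output during generation. The model name, layer index, scale, and single-pair derivation are all assumptions made for brevity; repeng fits vectors over many paired prompts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
layer = 6  # which block to steer; a knob to tune (assumption)

def mean_hidden(prompt):
    # mean hidden state at the chosen layer, averaged over the prompt's tokens
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        return model(**ids).hidden_states[layer].mean(dim=1).squeeze(0)

# a single contrastive pair; repeng derives vectors from many such pairs
control = mean_hidden("I am extremely happy.") - mean_hidden("I am extremely sad.")

def steer(module, inputs, output):
    # add a scaled copy of the steering vector to this block's output
    return (output[0] + 4.0 * control,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(steer)
ids = tok("Today I feel", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0]))
handle.remove()  # restore the unmodified model
```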
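Finally, for the AdamW tuning entry above: in PyTorch's implementation the per-step decay applied to each parameter is lr * weight_decay * param, so the effective regularization is governed by the product lr * weight_decay. The sketch below keeps that product fixed while sweeping the learning rate; the base values are illustrative.

```python
import torch

base_lr, base_wd = 1e-3, 0.1   # illustrative starting point
target = base_lr * base_wd     # hold lr * weight_decay constant while tuning

for lr in [5e-4, 1e-3, 2e-3]:
    wd = target / lr           # doubling lr halves weight decay, and vice versa
    model = torch.nn.Linear(10, 10)
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
    print(f"lr={lr:.0e}, weight_decay={wd:.3f}, effective decay lr*wd={lr * wd:.1e}")
```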

Perspectives

Link description
The data revolution in venture capital. Investors, data scientists, and tool builders leading the data-driven future of venture capital.
The Three C's: Creativity, Collaboration, and Communication. The way we communicate, collaborate, and complete creative projects has changed significantly since the invention of computing. With AI, we are beginning to witness the start of another major shift, and we underestimate how significant it will be. Businesses that integrate AI into their products from the start will have a significant edge over those who bolt it onto existing products later.
Inside OpenAI with Logan Kilpatrick (head of developer relations). Have you ever wondered how OpenAI develops and innovates so quickly? In this podcast, Logan Kilpatrick, OpenAI's head of developer relations, discusses the company's decision-making structure for product launches, its high agency and urgency, and its distinct culture.
Mind-reading devices are revealing the brain’s secrets. Implants and other technologies that decode neural activity can restore people’s abilities to move and speak — and help researchers understand how the brain works.
Generative AI’s environmental costs are soaring — and mostly secret. First-of-its-kind US bill would address the environmental costs of the technology, but there’s a long way to go.
Strategies for an Accelerating Future. Recent AI advances, such as Google's Gemini offering a context window of over a million tokens and Groq's hardware delivering near-instantaneous responses from GPT-3.5-class models, mark a significant step forward in practical AI applications and highlight the pressing need for leaders to understand and adapt to the rapidly changing AI landscape.
How to lose at Generative AI! Despite the excitement around it, generative AI is likely to disappoint most startups, since it favors established players with data advantages, entrenched workflows, and the capacity to integrate AI without major system changes. Even with venture capital flooding the space, a difficult road lies ahead for startups hoping to make a significant impact in generative AI. By concentrating on prompt engineering and UX improvements at the workflow layer, these startups are essentially preparing the market for incumbents, who can readily adopt and integrate the same AI innovations into their dominant platforms.
Stockholm declaration on AI ethics: why others should sign. The use of artificial intelligence (AI) in science has the potential to do both harm and good. As a step towards preventing the harm, we have prepared the Stockholm Declaration on AI for Science.
This is why the idea that AI will just augment jobs, never replace them, is a lie! AI will automate labor in certain areas. The response thus far has been divided: would increased efficiency allow more human workers to accomplish the same duties, or would the same duties simply be done by fewer workers?
