How AI is helping to explore the brain, and robots have learned to understand a three-dimensional world: top 10 AI studies for March 2025

Hi, Habr! Here’s a look at ten artificial intelligence (AI) studies that I found particularly memorable last month: the multimodal Qwen2.5-Omni, automated AI labs, new approaches to medical simulation, and brain neuroimaging. I have tried to summarize everything in short and simple words.
If you want to keep up to date with the latest research in artificial intelligence, use Dataist AI – a free bot that reviews the latest scientific publications on a daily basis.
And also subscribe to my Telegram feed, where I share insights from the AI industry, tips for implementing AI in business and developing AI startups, and comment on the most important news. Let’s go!
1. Qwen2.5-Omni
Large Language Models (LLMs) already know how to solve textual problems, write code, and translate documents, but humans navigate beyond text – we see the world, hear sounds, simultaneously perceive speech, video, images, and even talk back. Creating an AI that integrates all of these modalities in real time is a huge challenge. We need to synchronize audio and video, be able to respond quickly and still maintain quality across tasks.

So the developers of Qwen2.5-Omni proposed an architecture of Thinker-Talker. “Thinker” (the “brain” module) processes any input – audio, video, images, text – and “Talker” (the “speech” module) generates a voice response using a separate decoder. This, by design, prevents “cross-contamination”, where text can interfere with audio output and vice versa.

In order for all these types of data to be aligned correctly in time and space, TMRoPE (Time-aligned Multimodal RoPE) was invented – a kind of 3D positional encoding mechanism: audio content is split into short 40-ms segments, and video tokens are assigned dynamic timestamps. This allows the model to neatly “stitch” audio and video together. Block-wise streaming is then applied to keep latency to a minimum: you want a live dialog, after all!
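To make the time alignment more concrete, here is a minimal sketch of how audio and video tokens could be placed on a shared time axis and interleaved. This is my own simplification under stated assumptions (fixed 40-ms audio windows, per-frame video timestamps), not the authors’ code: real TMRoPE is a full 3D rotary positional encoding over time, height, and width.

```python
# A minimal, hypothetical illustration of a shared time axis for audio and video tokens.
# Not the authors' implementation: real TMRoPE is a 3D rotary encoding (time, height, width);
# here we only show how the two streams can be merged by time.

AUDIO_CHUNK_MS = 40  # each audio token covers a fixed 40-ms window

def audio_time_ids(num_audio_tokens: int) -> list[int]:
    """Assign each audio token the start time (in ms) of its 40-ms window."""
    return [i * AUDIO_CHUNK_MS for i in range(num_audio_tokens)]

def video_time_ids(frame_timestamps_ms: list[float]) -> list[int]:
    """Video tokens get dynamic timestamps taken from the actual frame times."""
    return [int(t) for t in frame_timestamps_ms]

def interleave_by_time(audio_tokens, video_tokens, frame_timestamps_ms):
    """Merge both streams into one sequence ordered by the shared time axis."""
    tagged = (
        list(zip(audio_time_ids(len(audio_tokens)), audio_tokens))
        + list(zip(video_time_ids(frame_timestamps_ms), video_tokens))
    )
    tagged.sort(key=lambda pair: pair[0])  # stable sort keeps within-stream order
    return [token for _, token in tagged]

# Example: 1 second of audio (25 x 40-ms tokens) and 4 video frames at ~12.5 fps
sequence = interleave_by_time(
    audio_tokens=[f"a{i}" for i in range(25)],
    video_tokens=[f"v{i}" for i in range(4)],
    frame_timestamps_ms=[0, 80, 160, 240],
)
```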

- As a result, the model achieves 88.7% on key benchmarks like GSM8K (arithmetic), which is close to the “purely” text-based Qwen2.5-7B;
- On speech recognition (ASR) tasks it records a very low WER (~1.8%) – the level of the best highly specialized audio models;
- Speech-to-text translation (S2TT) also improves: the model seems to “sense” the synchronization rather than just mindlessly mapping words;
- Speech generation was measured with a separate metric (WER after generation + NMOS): in the zero-shot setting the error was brought down to 6.54%, and the model that underwent additional RL tuning sounds almost “human-like” (NMOS ~4.5);
- When working with images and video there is high accuracy on VQA benchmarks (TextVQA, DocVQA) and on understanding video dynamics; on the specially built OmniBench, Qwen2.5-Omni’s results were state-of-the-art.
The all-in-one, real-time model is a great foundation for voice assistants, video surveillance systems with recognition and voice-over, interactive robots that not only “read” the world, but also speak. Of course, it requires gigantic power and tons of data (the authors claim they used about 800 billion tokens from images and videos, 300 billion audio and another 100 billion audio-video). But, if things get going, we’ll get truly human-like multimodal systems.
There are difficulties, of course: really high computational costs, interference between modalities (although the architecture tries to minimize it), and ethical aspects – such a universal model could potentially be used, say, in total surveillance systems. But the technological benefit is already obvious: we are one step closer to a full-fledged AI that sees, hears, and speaks simultaneously.
Qwen2.5-Omni Technical Report article
The model on HuggingFace
2. MedAgentSim
When we train and test medical LLMs, we usually take static sets: “here is the patient information”, “model, say the diagnosis”. But a real doctor first questions the patient, prescribes tests, specifies something else – and only then makes a diagnosis. The researchers want to model just such a dynamic conversation.
- MedAgentSim is an open-source multi-agent simulation. There is a “doctor agent”, a “patient agent”, and a “measurement agent” (which provides data such as MRIs and cardiograms);
- In the conversational phase, the doctor actively questions the patient and at any moment can ask the “measurement agent” for, say, an X-ray result – just like in a real clinic (a minimal sketch of this loop follows the list);
- The system then memorizes successful cases in Medical Records and Experience Records buffers. In later sessions, the model can peek into these “past dialogs”, which produces a self-improvement effect.

- The resulting approach was tested on the NEJM, MedQA, and MIMIC-IV sets, comparing it to the baseline Multi-Agent Clinic approach. Accuracy increased significantly: on MIMIC-IV, for example, from 42.7% to 79.5%, which looks very impressive;
- Further analysis showed that the greatest contribution to the accuracy gain comes from the full bundle: taking measurements into account + memory + reasoning chain + the multi-agent doctor ensemble. The model genuinely got better at clarifying symptoms step by step.
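For intuition, here is a hypothetical sketch of the consultation loop described above: a doctor agent questions a patient agent, can order tests from a measurement agent, and successful cases land in a memory buffer. The agent interfaces (`doctor.next_action`, `patient.answer`, `measurement.run`, `memory.retrieve`/`memory.store`) are assumptions for illustration, not MedAgentSim’s actual API.

```python
# A hypothetical sketch of the consultation loop in the spirit of MedAgentSim
# (assumed agent interfaces, not the project's actual code).

from dataclasses import dataclass

@dataclass
class Action:
    kind: str            # "question", "order_test", or "diagnose"
    text: str = ""       # question text or final diagnosis
    test_name: str = ""  # which test to order, e.g. "chest_xray"

def run_consultation(doctor, patient, measurement, memory, max_turns=10):
    """doctor/patient/measurement/memory are assumed LLM-agent wrappers."""
    transcript, diagnosis = [], None
    similar_cases = memory.retrieve(patient.chief_complaint)  # Experience Records lookup
    for _ in range(max_turns):
        action = doctor.next_action(transcript, similar_cases)
        if action.kind == "question":
            transcript.append(("doctor", action.text))
            transcript.append(("patient", patient.answer(action.text)))
        elif action.kind == "order_test":
            transcript.append(("test", measurement.run(action.test_name)))
        elif action.kind == "diagnose":
            diagnosis = action.text
            break
    if diagnosis is not None and diagnosis == patient.ground_truth:
        memory.store(transcript, diagnosis)  # only successful cases enrich the buffer
    return diagnosis
```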

This brings the simulated clinic much closer to reality and can form the basis for far more robust clinical AI that accounts for the steps of diagnosis rather than just emitting a pre-predicted answer. But we need to keep the risks in mind: the entire study is still a simulation and cannot be applied directly to live patients without regulation. There are also ethical issues, because any mistake here concerns people’s health.
However, if developed properly, such medical simulators will become the basis for future physician-support systems with greater adaptability and the ability to conduct long dialogues, prescribing tests step by step.
Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions
Code on GitHub
3. CodeScientist
There are automated scientific discovery (ASD) systems that can, for example, search for new proteins or perform hyperparameter optimization. But they often work in a narrow subject area, and the huge volume of artifacts they produce (auto-generated papers, experiment code) is evaluated only superficially. The researchers wanted a universal system that generates ideas, writes code for experiments, tests them, and writes up the results itself – all as autonomously as possible.

- CodeScientist draws ideas from two sources: scientific publications and a library of ready-made solution templates. A genetic search algorithm combines and recombines them, generating dozens of candidate ideas (a toy sketch of this step follows the list);
- Experts select the 50 most interesting ones; for each idea the system generates a plan and the necessary pieces of code. Then everything is run: the average cost of an experiment is ~$4.23 and the average time is ~131 minutes;
- The output is a LaTeX report describing the results. If errors pile up somewhere, the system corrects the code;
- In total there were 50 ideas with 5 runs each: 250 runs. In the end, 19 ideas yielded interesting findings, and 6 passed peer review (i.e. genuinely look valid and new);
- Overall, about 41% of experiments succeeded, 32% hit the debugging limit, and 18% ran out of time, so the system is still fairly unpredictable;
- Among the reported “discoveries”: a low correlation between an LLM’s stated confidence and its accuracy, methods of step-by-step environment generation (which improve simulations), and LLMs’ difficulties with combinatorial optimization problems.
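Here is a toy sketch of the “genetic” idea-combination step mentioned in the first bullet. The structure (pairing paper insights with code templates, scoring, crossover) follows the description above, but all names and the scoring function are my own assumptions, not CodeScientist’s implementation.

```python
# A toy illustration of genetic idea combination (my simplification, not CodeScientist's code):
# candidate ideas pair insights from papers with ready-made code templates, then the best
# candidates are kept and recombined, generation after generation.

import random

def propose(papers, templates):
    """Seed the population: each idea pairs one paper insight with one solution template."""
    return [{"insight": p, "template": t, "score": 0.0} for p in papers for t in templates]

def crossover(a, b):
    """Combine two ideas: take the insight from one and the template from the other."""
    return {"insight": a["insight"], "template": b["template"], "score": 0.0}

def evolve(population, score_fn, generations=3, keep=10):
    """score_fn is assumed to be an LLM judge rating an idea's novelty and feasibility."""
    for _ in range(generations):
        for idea in population:
            idea["score"] = score_fn(idea)
        population.sort(key=lambda i: i["score"], reverse=True)
        survivors = population[:keep]
        children = [crossover(random.choice(survivors), random.choice(survivors))
                    for _ in range(keep)]
        population = survivors + children
    return population[:keep]

# usage sketch: shortlisted = evolve(propose(papers, templates), score_fn=llm_judge)
```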
This work shows that ASD can go beyond narrow problems and produce genuinely new research, saving scientists’ time. But there are risks: experts are needed to screen out hallucinations, and the process is still expensive (~250 runs are not free). Still, CodeScientist is an important step toward AI doing turnkey research.
CODESCIENTIST: End-to-End Semi-Automated Scientific Discovery with Code-based Experimentation
Code on GitHub
4. AgentRxiv
Autonomous AI agents don’t usually “share” their findings; each learns on its own, sometimes from scratch. But in a real scientific environment, scientists publish preprints on arXiv and bioRxiv, and this accelerates discovery. The authors of this study want to give agents “their own arXiv.”

- The authors have created the AgentRxiv platform, where “labs” (themselves AI agents) post preprints and share results (a hypothetical sketch of such a shared store follows below);
- They test it on the task of improving accuracy on MATH-500: if an agent sees someone else’s work, accuracy rises from 70.2% to 78.2%, and parallel experiments raise the bar to 79.8%;
- The upside is that agents can revisit methods someone else has already published and refine them, accumulating shared experience.
The top part of the illustration shows the three stages of a lab (literature review, experimentation, report writing), with people collaborating with AI agents and specialized tools to automate tasks and produce high-quality scientific results. Below, Lab 1 requests and receives articles from other labs, while Lab 2 uploads its results.
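Below is a hypothetical sketch of what an AgentRxiv-style shared preprint store could look like: labs upload results and search for others’ findings before starting new work. The class and method names here are assumptions for illustration; the real platform’s API may differ.

```python
# A hypothetical sketch of a shared "AgentRxiv"-style preprint store (not the project's API):
# each lab can upload a result and search for other labs' findings before starting new work.

from dataclasses import dataclass, field

@dataclass
class Preprint:
    lab: str
    title: str
    method_summary: str
    score: float  # e.g. accuracy reported on MATH-500

@dataclass
class AgentRxiv:
    papers: list[Preprint] = field(default_factory=list)

    def upload(self, paper: Preprint) -> None:
        self.papers.append(paper)

    def search(self, keyword: str) -> list[Preprint]:
        """Naive keyword search; a real system would use embeddings and ranking."""
        hits = [p for p in self.papers
                if keyword.lower() in (p.title + " " + p.method_summary).lower()]
        return sorted(hits, key=lambda p: p.score, reverse=True)

server = AgentRxiv()
server.upload(Preprint("lab-1", "Reflective prompting for MATH-500",
                       "Adds a self-check step before the final answer", 0.782))
best_prior_work = server.search("MATH-500")  # lab-2 builds on lab-1's best method
```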
Instead of isolated AI systems, we get a shared research space. This accelerates progress many times over, improving accuracy on tasks like MATH-500, GPQA, MedQA, and gives “live” knowledge sharing. True, parallel mode is more expensive (cost increase from ~$92 to ~$280) and can lead to duplication of effort. But the idea of “collaborative” science for AI looks very promising.
AgentRxiv: Towards Collaborative Autonomous Research
Code on GitHub
5. Open Deep Search
While Google or ChatGPT provide search plus a short answer, these systems are closed to researchers and opaque. And open solutions, alas, fall short on quality. Something is needed that is modular and still delivers top results.

The ODS (Open Deep Search) framework consists of two parts:
- Open Search Tool: it can expand and reformulate a query, process results (chunking + re-ranking), and prioritize reliable sources (Wikipedia, arXiv); a minimal sketch of this retrieval step follows the list;
- Open Reasoning Agent, in two variants:
  - ODS-v1 (ReAct + Chain-of-Thought) is the classic variant;
  - ODS-v2 (Chain-of-Code + CodeAct) can generate and execute code and call different tools.

- In tests (FRAMES, SimpleQA), the ODS variants paired with the DeepSeek-R1 model outperformed Perplexity AI and came close to, or even outperformed, GPT-4o Search Preview on complex tasks;
- The framework has also learned to save web requests (especially in v2) when it realizes it has already gathered enough results.
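As a minimal sketch of the retrieval side (query expansion, chunking, re-ranking), here is a toy pipeline under simple assumptions: the re-ranker is plain lexical overlap rather than the neural re-ranker ODS actually uses, and `fetch` is a placeholder for whatever web-search backend you plug in.

```python
# A minimal sketch of the retrieval steps described above (query expansion, chunking,
# re-ranking). Function names and the toy scoring are assumptions, not ODS's actual API.

def expand_query(query: str) -> list[str]:
    """Reformulate the user query into several variants (an LLM would do this in ODS)."""
    return [query, f"{query} site:en.wikipedia.org", f"{query} arxiv"]

def chunk(text: str, size: int = 500) -> list[str]:
    """Split a retrieved page into fixed-size passages."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def rerank(query: str, passages: list[str], top_k: int = 5) -> list[str]:
    """Toy lexical-overlap re-ranker; the real system would use a neural re-ranker."""
    def overlap(p: str) -> int:
        return len(set(query.lower().split()) & set(p.lower().split()))
    return sorted(passages, key=overlap, reverse=True)[:top_k]

def open_search_tool(query: str, fetch) -> list[str]:
    """fetch(q) is assumed to return a list of raw page texts for query q."""
    passages = []
    for q in expand_query(query):
        for page in fetch(q):
            passages.extend(chunk(page))
    return rerank(query, passages)
```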
Any company can take ODS, plug in its own LLM (open source or GPT-4o, whatever) and get a search assistant on par with top commercial systems, with no hardwiring to Google or Perplexity. Of course, you still need to store indexes and optimize for real queries. But ODS shows that open-source solutions can catch up with the proprietary giants.
Link to a more detailed review
Open Deep Search: Democratizing Search with Open-source Reasoning Agents
Code on GitHub
6. Tracing the Thoughts of LLMs
Anthropic researchers have presented an approach to studying the inner workings of large language models (using the Claude 3.5 Haiku model as an example). The goal of the project is to create methods similar to neuroscience tools in order to understand how the model “thinks”, plans, and reasons when generating text.
Here are the authors’ findings:
- Multilingualism and universal thinking: the model does not keep separate internal mechanisms for different languages, but uses a common “language of thought” that allows knowledge transfer between languages. (Figure: English, French, and Chinese share common features, indicating a degree of conceptual universality.)
- Planning ahead: experiments with poetry writing showed that the model selects rhyming words in advance, planning several words ahead, rather than simply generating text one word at a time. By the way, I previously reviewed another article on Emergent Response Planning in LLMs on exactly this topic.
- Mental arithmetic: instead of simple memorization or the standard addition algorithm, Claude uses multiple parallel computation paths, combining approximate calculations with exact checks. (Figure: complex parallel processes in Claude’s thinking during mental computation.)
- Invalid reasoning: sometimes the model generates logical-sounding but untrue chains of reasoning in order to agree with the user or give a convincing but false answer. (Figure: to complete the answer, Claude follows several reasoning steps in sequence, first identifying which state Dallas is in and then naming the capital of that state.)
- Hallucinations: the model’s default behavior is to decline to answer, but sometimes an internal misfire of this mechanism leads to the generation of plausible but unreliable information. (Figure, left: Claude answers a question about the famous basketball player Michael Jordan because the “known answer” concept suppresses the standard refusal; right: Claude refuses to answer a question about an unknown person, Michael Batkin.)
- Restriction circumvention (jailbreaks): the researchers found that grammatical consistency can lead to bypassing the model’s defense mechanisms, causing it to produce unwanted responses.
Despite significant advances, the current approach is limited by the complexity and time-consuming nature of the analysis, requiring improvements for application to longer and more complex problems. Such methods are important for the development of robust, transparent, and controllable AI agents and may be useful in other scientific fields such as medicine and biology.
Tracing the thoughts of a large language model
7. Play2Prompt
Often an LLM needs to call an external tool via an API, but the documentation is incomplete or there are no examples. The model may pass wrong parameters or hallucinate, and a non-programmer user does not know all the subtleties themselves. Researchers from MIT and IBM have proposed the PLAY2PROMPT approach.

- The PLAY2PROMPT method lets the model “play” with the tool: it tries different calls, sees the errors, and improves its own examples (a sketch of this loop follows the results below);
- Each call is then evaluated for quality and complexity;
- After that, the best examples are generated, and the documentation is automatically refined.

- Bottom line: accuracy gains for LLaMA on BFCL of up to +5-7%; GPT-3.5 and GPT-4o also show a noticeable boost;
- The gains are especially large for REST APIs and parallel calls (+12-17% there);
- Even with 50% of the parameter descriptions removed, the model almost fully recovered the original performance.
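The core “play” loop can be sketched roughly as follows. This is a hypothetical simplification: `propose_call` and `refine_doc` stand in for LLM-driven steps, and the scoring of examples by quality and complexity is omitted.

```python
# A hypothetical sketch of the "play" loop (not the authors' code): the model tries candidate
# tool calls, keeps those that succeed as few-shot examples, and uses error messages to
# refine the tool's incomplete documentation.

def play_with_tool(tool, propose_call, refine_doc, doc: str, rounds: int = 20):
    examples, error_log = [], []
    for _ in range(rounds):
        call = propose_call(doc, examples, error_log)   # LLM proposes arguments to try
        try:
            result = tool(**call)
            examples.append({"args": call, "result": result})  # successful call -> demo example
        except Exception as err:                               # failures become learning signal
            error_log.append({"args": call, "error": str(err)})
    improved_doc = refine_doc(doc, examples, error_log)        # LLM rewrites the documentation
    return examples, improved_doc
```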
It is now possible to integrate a new tool without manually writing examples: the model itself will “learn” how to call the right APIs. This makes it easier to deploy LLMs in a real production environment. Of course, there can still be problems in multi-tool scenarios, but the approach clearly makes life easier.
PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play
8. Chain-of-Tools
Chain-of-Thought is great for LLM reasoning, but real-world tasks often require calling calculators, search engines, KBQA systems, and so on. Fine-tuning hard-wires the model to already known tools, and in-context learning (as in HuggingGPT) is sometimes cumbersome. Something universal is needed. Chinese researchers have proposed the Chain-of-Tools (CoTools) method:
- The base model is left “untouched” (its weights are frozen), and three modules are added (a simplified sketch follows the list):
  - Tool Judge: decides whether a tool should be called right now by looking at the hidden state of the current token;
  - Tool Retriever: selects from the tool pool the one that fits best (via contrastive embeddings of tool descriptions);
  - Tool Calling: builds a prompt with the parameters.

- The method gives an accuracy gain: for example, ~93.8% on the KBQA benchmark KAMEL and 43.6% on STQuestions;
- CoTools scales to 999+ tools;
- Analysis showed that individual hidden-state dimensions are indeed “responsible” for the semantics of the invocation.
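Here is a simplified PyTorch sketch of the Tool Judge and Tool Retriever ideas: small heads that read the frozen model’s hidden state for the current token. The dimensions and interfaces are assumptions for illustration, not the paper’s code.

```python
# A simplified sketch of the CoTools idea (assumed interfaces, not the paper's code):
# a frozen LM produces hidden states; a small "judge" head decides whether to call a tool
# at the current token, and a retriever picks the best tool by embedding similarity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToolJudge(nn.Module):
    """Binary head over the frozen LM's hidden state: call a tool now or not."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.head(hidden_state))  # probability of invoking a tool

class ToolRetriever(nn.Module):
    """Projects the hidden state into the space of tool-description embeddings."""
    def __init__(self, hidden_size: int, embed_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, embed_size)

    def forward(self, hidden_state, tool_embeddings):
        query = F.normalize(self.proj(hidden_state), dim=-1)
        keys = F.normalize(tool_embeddings, dim=-1)
        return (query @ keys.T).argmax(dim=-1)  # index of the best-matching tool

# Toy usage with random tensors standing in for real model outputs
hidden = torch.randn(1, 4096)    # hidden state of the current token
tools = torch.randn(1000, 768)   # contrastive embeddings of 1000 tool descriptions
judge, retriever = ToolJudge(4096), ToolRetriever(4096, 768)
if judge(hidden).item() > 0.5:
    tool_id = retriever(hidden, tools).item()
```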

The LLM can dynamically connect new services without retraining the whole model. This is a step toward universal assistants that can perform a ton of operations just by receiving a textual description of how to work with a tool – great for multitasking applications. The risk is that there may be rough edges on real, large datasets, and the quality of the tool descriptions is critical.
Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models
Code on GitHub
9. Gemini Robotics
We have super-powerful multimodal models (like Gemini 2.0) that are good in the digital realm. But how do we bring them into the physical world to control real robots (a robot arm, camera-based perception, a dynamic scene)? You need to understand 3D object locations, trajectories, and grasp points, and do it all quickly.
- Google launched the Gemini Robotics project, which extends the base Gemini 2.0 model with a Gemini Robotics-ER component to improve spatial reasoning, grasp prediction, and so on;
- The researchers developed a dedicated ERQA benchmark (400 questions) that tests “three-dimensional reasoning”: state estimation, movement planning, multi-view tasks;
- They tested zero-shot and few-shot settings. For example, passing a banana between hands in simulation: the baseline scores 27%, while Gemini Robotics-ER scores 53% (and 65+% with a few demonstrations). On real hardware, performance increases substantially as well;
- Complex scenarios (packing a lunch box, folding origami) reached 79-100% success after training on 2-5k examples;
- There are safety measures: constitutional pre-training leads to 96% of dangerous queries being rejected.

Robots are becoming much more intelligent and versatile: the same model can handle a wide range of tasks, from simple grasps to long procedures. Adaptation to new platforms (Franka, Apollo) also showed a ~60-63% success rate. Of course, it is not perfect yet – complex tasks require precision so as not to damage objects or humans. But Gemini Robotics has very impressive prospects: teaching a robot a new operation can require just showing it a few examples.

Gemini Robotics: Bringing AI into the Physical World
10. End-to-End Deep Learning for Structural Brain Imaging: A Unified Framework
Brain research through image analysis is essential, but traditional approaches involve many steps (brain extraction, registration, segmentation, network construction, classification), each requiring separate models and manual adjustments. This leads to error accumulation and high costs.
UniBrain offers a solution that combines all analysis steps into a single model (an end-to-end approach). Minimal labeling is used: only extraction and classification masks plus a single labeled template. This saves significant resources and time.

Key components of the model (a schematic sketch of the full pipeline follows the list):
- Extraction: a 3D U-Net for accurate brain extraction;
- Registration: a CNN and a spatial transformation layer (STL) to align the image with the template;
- Segmentation and parcellation: a one-shot approach with mask transfer back to the original space;
- Network construction: a multilayer perceptron (MLP) builds the brain connectivity matrix;
- Classification: a graph convolutional network (GCN) diagnoses conditions (e.g., ADHD).
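Schematically, chaining the five stages end to end could look like the sketch below. The sub-modules are assumed interfaces (the paper’s actual architectures are a 3D U-Net, a CNN with a spatial transformation layer, a one-shot mask-transfer segmenter, an MLP, and a GCN); this illustrates the single-model pipeline idea, not the authors’ implementation.

```python
# A schematic sketch of an end-to-end pipeline in the spirit of UniBrain (assumed module
# interfaces, not the authors' code): all five stages are chained in one model, so there
# are no separately tuned, hand-adjusted steps between them.

import torch
import torch.nn as nn

class UniBrainLikePipeline(nn.Module):
    def __init__(self, extractor, registrator, segmenter, connectome_mlp, classifier_gcn):
        super().__init__()
        # extractor: 3D U-Net; registrator: CNN + spatial transformation layer;
        # segmenter: one-shot mask transfer; connectome_mlp: MLP; classifier_gcn: GCN
        self.extractor = extractor
        self.registrator = registrator
        self.segmenter = segmenter
        self.connectome_mlp = connectome_mlp
        self.classifier_gcn = classifier_gcn

    def forward(self, volume: torch.Tensor, template: torch.Tensor):
        brain = self.extractor(volume)                     # skull-stripped brain volume
        aligned, warp = self.registrator(brain, template)  # align to the labeled template
        regions = self.segmenter(aligned, warp)            # parcellation back in native space
        graph = self.connectome_mlp(regions)               # brain connectivity matrix
        return self.classifier_gcn(graph)                  # e.g. ADHD vs. control logits
```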
As a result:
- High accuracy on all tasks (Dice, a measure of similarity between two sets: 0.970 for extraction, 0.942 for registration, 0.652 for segmentation);
- Efficient classification (AUC-ROC of 0.712), outperforming traditional methods;
- High processing speed (approximately 0.22 seconds per image), significantly faster than conventional methods.

By fully integrating all analysis steps, the model significantly reduces error accumulation and lowers the dependence on extensive manual annotation, which considerably speeds up image processing. UniBrain has the potential to accelerate and improve the accuracy of neurodiagnosis – an important step toward the effective application of neuroimaging in medicine.
End-to-End Deep Learning for Structural Brain Imaging: A Unified Framework
All of these studies point to one trend: AI is becoming increasingly comprehensive, able to operate in multiple environments (digital, physical, scientific), often without manual intervention. Yes, technical barriers (processing power, need for markup) and socio-ethical barriers (security, privacy, misuse) remain. But progress is in sight: AI systems are getting better at interacting with the real world and solving applied problems.
Well, let’s see where this research takes us in the coming months. We will probably see AI interacting even more closely with tools and robots, not just in labs but also in factories, medicine, and the home. Cautiously but optimistically, we await new breakthroughs!
***
This is the kind of exciting research that came out in March. Don’t forget to subscribe to my Telegram feed and use Dataist AI to stay up to date with the latest reviews of AI research papers. Let’s stay ahead in the world of technology together!