How AI is helping to explore the brain, and robots have learned to understand a three-dimensional world: top 10 AI studies for March 2025

Hi, Habr! Here’s a look at ten artificial intelligence (AI) studies that were particularly memorable to me last month: the multimodal Qwen2.5-Omni, automated AI labs, new approaches to medical simulation, and brain neuroimaging. I have tried to summarize everything briefly and in simple terms.

If you want to keep up to date with the latest research in artificial intelligence, use the Dataist AI – a free bot that reviews the latest scientific publications on a daily basis.

And also subscribe to my Telegram feed, where I share insights from the AI industry, tips on implementing AI in business and building AI startups, and commentary on the most important news. Let’s go!

1. Qwen2.5-Omni

Large Language Models (LLMs) can already solve text problems, write code, and translate documents, but humans operate beyond text – we see the world, hear sounds, perceive speech, video, and images simultaneously, and even talk back. Creating an AI that integrates all of these modalities in real time is a huge challenge: audio and video have to be synchronized, responses have to be fast, and quality has to hold up across tasks.

"Qwen2.5-Omni

Qwen2.5-Omni is a single model that can process different types of data (text, audio, image, video) and provide real-time responses in text or speech.

So the developers of Qwen2.5-Omni proposed a Thinker-Talker architecture. The “Thinker” (the “brain” module) processes any input – audio, video, images, text – and the “Talker” (the “speech” module) generates a voice response using a separate decoder. By design, this prevents “cross-contamination”, where text output could interfere with audio output and vice versa.

"Qwen2.5-Omni

Qwen2.5-Omni uses a Thinker-Talker architecture: the Thinker is responsible for text generation, and the Talker, receiving high-level representations from the Thinker, generates streaming speech tokens.

For all these types of data to be combined correctly in time and space, TMRoPE (Time-aligned Multimodal RoPE) was invented – a kind of 3D positional-encoding mechanism: audio is split into short 40 ms segments, and video tokens are assigned dynamic timestamps. This lets the model neatly “stitch” audio and video together. Block-wise streaming is then applied to keep latency to a minimum – you want a live dialog, after all!
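
To make the time-alignment idea concrete, here is a minimal illustrative sketch (my own, not the Qwen2.5-Omni implementation) of how audio tokens covering fixed 40 ms windows and video tokens carrying per-frame timestamps can be merged onto one shared temporal axis; the height/width axes of the full 3D TMRoPE are omitted.

```python
# Illustrative sketch of TMRoPE's time axis, not the Qwen2.5-Omni code:
# audio tokens get positions spaced 40 ms apart, video patch tokens inherit
# their frame's timestamp, and the two streams are merged in temporal order.

AUDIO_WINDOW_MS = 40  # each audio token represents a 40 ms segment

def audio_positions(num_audio_tokens: int) -> list[float]:
    """Temporal position (in ms) of each audio token."""
    return [i * AUDIO_WINDOW_MS for i in range(num_audio_tokens)]

def video_positions(frame_times_ms: list[float], tokens_per_frame: int) -> list[float]:
    """Every patch token of a frame inherits that frame's timestamp."""
    return [t for t in frame_times_ms for _ in range(tokens_per_frame)]

def interleave_by_time(audio_ms: list[float], video_ms: list[float]) -> list[tuple[str, float]]:
    """Merge the two streams in temporal order as (modality, time) pairs."""
    tagged = [("audio", t) for t in audio_ms] + [("video", t) for t in video_ms]
    return sorted(tagged, key=lambda pair: pair[1])

# Example: 1 second of audio (25 tokens) and a 2-frame video clip with 4 patches per frame
merged = interleave_by_time(audio_positions(25), video_positions([0.0, 500.0], tokens_per_frame=4))
print(merged[:6])
```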

"Time-aligned

Time-aligned Multimodal RoPE (TMRoPE)
  • As a result, the model achieves 88.7% on key benchmarks like GSM8K (arithmetic), which is close to the purely text-based Qwen2.5-7B;

  • On speech recognition (ASR) tasks the model records a very low WER (~1.8%) – the level of the best highly specialized audio models;

  • Speech-to-text translation (S2TT) also provides improvements: the model seems to “sense” synchronization, rather than just mindlessly mapping;

  • Speech generation: a separate metric was used (WER of the generated speech + NMOS), and there the zero-shot error was brought down to 6.54%. The model that underwent additional RL tuning sounds almost “human-like” (NMOS ~4.5);

  • When working with images and video – high accuracy on VQA (TextVQA, DocVQA) and in understanding video dynamics. On the purpose-built OmniBench, Qwen2.5-Omni’s results were state of the art.

The all-in-one, real-time model is a great foundation for voice assistants, video surveillance systems with recognition and voice-over, interactive robots that not only “read” the world, but also speak. Of course, it requires gigantic power and tons of data (the authors claim they used about 800 billion tokens from images and videos, 300 billion audio and another 100 billion audio-video). But, if things get going, we’ll get truly human-like multimodal systems.

There are difficulties, of course: really high computational costs, interference between modalities (although the architecture tries to minimize it), and ethical aspects – such a universal model could potentially be used, say, in total-surveillance systems. But the technological benefit is already obvious: we are one step closer to a full-fledged AI that sees, hears, and speaks simultaneously.

"📄"Qwen2.5-Omni Technical Report article

"💾"The model on HuggingFace

2. MedAgentSim

When we train and test medical LLMs, we usually take static sets: “here is the patient information”, “model, say the diagnosis”. But a real doctor first questions the patient, prescribes tests, specifies something else – and only then makes a diagnosis. The researchers want to model just such a dynamic conversation.

  • MedAgentSim is an open source multi-agent simulation. There’s a “doctor agent”, a “patient agent”, and a “measurement agent” (which gives you data like MRIs, cardiograms).

  • In the conversational phase, the doctor actively questions the patient and at any moment can ask the “measurement agent” to give, for example, the result of an X-ray. Just like in a real clinic.

  • The system then memorizes successful cases: Medical Records and Experience Records buffers. In repeated sessions, the model can peek into these “past dialogs”, which has a self-improvement effect.

The physician-agent gathers clinical information from the patient through a series of dialogues. The process involves the physician starting his or her day, the patient finding the physician, having a conversation, performing physical exams, consulting multiple agents, and the patient demonstrating symptoms.
  • The resulting approach was tested on the NEJM, MedQA, and MIMIC-IV sets, comparing it to the baseline Multi-Agent Clinic approach. Accuracy rates increased significantly: on MIMIC-IV, for example, from 42.7% to 79.5%, which seems very impressive;

  • A series of experiments showed that the greatest contribution to the accuracy gain comes from the combination: measurement agent + memory + reasoning chain + multi-agent doctor ensemble.

The model actually got better at specifying symptoms in stages.

"В

In the conversation phase, the agent-doctor and patient exchange information, the doctor schedules the right tests. Once sufficient data is collected, the Experience Replay phase begins: the system analyzes past cases, extracts examples, and the medical team collaboratively makes a decision using chain reasoning and voting.

This brings the simulated clinic much closer to reality and can form the basis of a much more robust clinical AI – one that accounts for the steps of diagnosis rather than just producing a pre-predicted answer. But we need to keep the risks in mind: the entire study is still a simulation, and it cannot be applied directly to live patients without regulation. Plus there are ethical issues, because any mistake here concerns people’s health.
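
As an illustration of the doctor / patient / measurement loop with an experience buffer, here is a minimal skeleton (my own sketch, not the MedAgentSim code); the `ask_llm` helper is an assumed placeholder for any chat-completion call.

```python
# Illustrative skeleton of a dynamic clinical dialog: the doctor agent questions
# the patient, may order tests from the measurement agent, and stores successful
# dialogs in an experience buffer that later cases can peek into.

from dataclasses import dataclass, field

def ask_llm(role: str, prompt: str) -> str:
    """Placeholder for an LLM call playing the given role (doctor/patient/measurement)."""
    raise NotImplementedError("plug in your own model call")

@dataclass
class MedicalSim:
    max_turns: int = 10
    experience: list[str] = field(default_factory=list)  # past successful dialogs ("Experience Records")

    def run_case(self, patient_profile: str) -> str:
        history: list[str] = []
        retrieved = "\n".join(self.experience[-3:])  # crude lookup of past cases
        for _ in range(self.max_turns):
            step = ask_llm("doctor", f"Past cases:\n{retrieved}\nDialog so far:\n{history}\n"
                                     "Ask a question, order a test, or give a diagnosis.")
            if step.startswith("DIAGNOSIS:"):          # doctor is confident enough
                self.experience.append("\n".join(history + [step]))
                return step
            if step.startswith("ORDER:"):              # doctor requests a measurement (e.g. an X-ray)
                reply = ask_llm("measurement", f"{patient_profile}\n{step}")
            else:                                      # ordinary question to the patient
                reply = ask_llm("patient", f"{patient_profile}\nDoctor asks: {step}")
            history += [step, reply]
        return ask_llm("doctor", f"Dialog:\n{history}\nGive your best diagnosis.")
```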

However, if developed properly, such medical simulators could become the basis of future physician-support systems with greater adaptability and the ability to conduct long dialogues, prescribing tests step by step.

"📄"Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions

"💾"Code on GitHub

3. CodeScientist

There are automated scientific discovery (ASD) systems that can, for example, search for new proteins or perform hyperparameter optimization. But they often work in a narrow subject area, and the huge volumes of artifacts they produce (auto-generated papers, experiment code) are evaluated only superficially. The researchers wanted a universal system that generates ideas, writes code for experiments, tests them, and writes up the results itself – all as autonomously as possible.

"Обзор

Overview of CodeScientist milestones: ideation, planning, creating and running experiments, reporting and meta-analysis of all experiments
  • CodeScientist takes ideas from two sources: scientific publications and a library of ready-made templates for solutions. A genetic search algorithm collects and combines, generating dozens of ideas.

  • Experts select the 50 most interesting ones; for each idea the system generates a plan and the necessary pieces of code. Then everything is run: the average cost of an experiment is ~$4.23 and the average run time is ~131 minutes.

  • The output is a report in LaTeX that describes the results. If there are a lot of errors somewhere, the system corrects the code.

  • A total of 50 ideas, 5 runs each: 250 runs. In the end, 19 ideas yielded interesting findings, and 6 passed peer review (i.e., they genuinely look valid and novel).

  • In total, about 41% of experiments were successful, 32% hit the debug limit, and 18% ran out of time – so the system is still fairly unpredictable (a sketch of such a run loop follows this list).

  • Among the “discoveries” mentioned: a low correlation between an LLM’s stated confidence and its accuracy, methods of step-by-step environment generation (which improves simulations), and the difficulty LLMs have with combinatorial optimization problems.
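
As promised above, here is a rough sketch of how a run can end up in the “successful”, “debug limit”, or “timed out” bucket: the experiment code is executed, repaired on failure up to a fixed number of attempts, and cut off by a time budget. The constants and the `repair_code` helper are assumptions for illustration, not the authors’ pipeline.

```python
# Minimal sketch of an experiment runner with a debug limit and a time budget.
import subprocess
import time

DEBUG_LIMIT = 5           # assumed number of automatic repair attempts
TIME_BUDGET_S = 131 * 60  # roughly the average run time reported in the paper

def repair_code(code: str, stderr: str) -> str:
    """Placeholder for the LLM-based code-repair step."""
    raise NotImplementedError("plug in an LLM call here")

def run_experiment(code: str) -> str:
    start = time.monotonic()
    for _ in range(DEBUG_LIMIT + 1):
        remaining = TIME_BUDGET_S - (time.monotonic() - start)
        if remaining <= 0:
            return "timed_out"                     # runs that exhaust the time budget
        try:
            proc = subprocess.run(["python", "-c", code],
                                  capture_output=True, text=True, timeout=remaining)
        except subprocess.TimeoutExpired:
            return "timed_out"
        if proc.returncode == 0:
            return "successful"                    # the run produced a result
        code = repair_code(code, proc.stderr)      # ask the LLM to fix the error and retry
    return "hit_debug_limit"                       # too many failed repair attempts
```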

This work shows that ASD transcends the boundaries of narrow problems and can generate generally new research, saving scientists’ time. But there are risks: you need experts to screen out hallucinations, and the process is still expensive (~250 runs are not free). Still, CodeScientist is an important step toward AI being able to do turnkey research.

"📄"CODESCIENTIST: End-to-End Semi-Automated Scientific Discovery with Code-based Experimentation

"💾"Code on GitHub

4. AgentRxiv

Autonomous AI agents don’t usually “share” their findings; each learns on its own, sometimes from scratch. But in a real scientific environment, scientists publish preprints on arXiv and bioRxiv, and this accelerates discovery. The authors of the study want to give agents “their own arXiv.”

Autonomous research labs work together on a common scientific problem. A human sets the overall direction, and agents independently conduct research and publish results on a shared AgentRxiv server
  • The authors have created the AgentRxiv platform where “labs” (also AI agents) post preprints and share results;

  • They test this on the task of improving the accuracy of MATH-500: if an agent sees someone else’s work, the accuracy increases from 70.2% to 78.2%. Parallel experiments raise the bar to 79.8%;

  • The upside is that agents can revisit methods that someone else has already published and refine them, building up shared experience.

    "В

    The top part of the illustration shows the three stages of the lab (literature review, experimentation, report writing). People collaborate with AI agents and specialized tools to automate tasks and produce high-quality scientific results
    "Лаборатория

    Lab 1 requests and receives articles from other labs, and Lab 2 uploads its results.

Instead of isolated AI systems, we get a shared research space. This accelerates progress many times over, improving accuracy on tasks like MATH-500, GPQA, MedQA, and gives “live” knowledge sharing. True, parallel mode is more expensive (cost increase from ~$92 to ~$280) and can lead to duplication of effort. But the idea of “collaborative” science for AI looks very promising.
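
As a toy illustration of the shared-preprint idea (not the actual AgentRxiv server or its API), here is a sketch of a store where agent “labs” publish results and query each other’s work – essentially the mechanism that lifts accuracy in the experiments above.

```python
# Toy sketch of an "arXiv for agents": labs publish preprints to a shared store
# and retrieve the best related work before starting their own experiments.

from dataclasses import dataclass, field

@dataclass
class Preprint:
    lab: str
    title: str
    abstract: str
    score: float  # e.g. accuracy the method achieved on MATH-500

@dataclass
class AgentRxiv:
    papers: list[Preprint] = field(default_factory=list)

    def publish(self, paper: Preprint) -> None:
        self.papers.append(paper)

    def search(self, keyword: str, top_k: int = 3) -> list[Preprint]:
        hits = [p for p in self.papers if keyword.lower() in (p.title + " " + p.abstract).lower()]
        return sorted(hits, key=lambda p: p.score, reverse=True)[:top_k]

server = AgentRxiv()
server.publish(Preprint("lab-1", "Reflective CoT prompting", "A reasoning strategy ...", 0.702))
related = server.search("reasoning")  # lab-2 builds on lab-1's published method
```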

"📄"AgentRxiv: Towards Collaborative Autonomous Research

"💾"Code on GitHub

5. Open Deep Search

While Google or ChatGPT provide search plus a short answer, these systems are closed to researchers and opaque. Open solutions, alas, fall short on quality. What is needed is something modular that still delivers top-tier results.

"Компоненты

Open Deep Search components

The ODS (Open Deep Search) framework consists of two parts:

  • Open Search Tool: it can expand and reformulate a query, process results (chunking + re-ranking), prioritize reliable sources (Wikipedia, ArXiv);

  • Open Reasoning Agent:

    • ODS-v1 (ReAct + Chain-of-Thought) is a classic variant.

    • ODS-v2 (Chain-of-Code + CodeAct) – can generate and execute code, calls to different tools.

  • In tests (FRAMES, SimpleQA), the ODS variants paired with the DeepSeek-R1 model outperformed Perplexity AI and came close to, or even surpassed, GPT-4o Search Preview on complex tasks;

  • The framework has learned to save web requests (especially in v2) if it realizes it has already gotten enough results.

Any company can take ODS, plug in their LLM (or open source, or GPT-4o, whatever) and get a search assistant on par with top commercial systems. There’s no hardwiring to Google or Perplexity. Of course, you still need to store indexes, optimize for real queries. But ODS shows that open source solutions can catch up with proprietary giants.
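
To make the Open Search Tool stages outlined above concrete, here is a minimal sketch of the pipeline (query reformulation, retrieval, chunking, re-ranking with a boost for trusted sources). The function bodies are placeholders under my own assumptions, not the ODS implementation.

```python
# Sketch of a search pipeline: reformulate the query, fetch results, split them
# into chunks, and re-rank the chunks with a boost for trusted domains.

TRUSTED = ("wikipedia.org", "arxiv.org")

def reformulate(query: str) -> list[str]:
    """Expand the user query into several search queries (an LLM call in ODS)."""
    return [query]  # placeholder

def web_search(query: str) -> list[dict]:
    """Return raw results as {"url": ..., "text": ...} dicts (a search API in ODS)."""
    return []       # placeholder

def chunk(text: str, size: int = 500) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def score(chunk_text: str, query: str, url: str) -> float:
    relevance = sum(word in chunk_text.lower() for word in query.lower().split())  # crude relevance
    return relevance * (2.0 if any(domain in url for domain in TRUSTED) else 1.0)  # trust boost

def open_search_tool(query: str, top_k: int = 5) -> list[str]:
    candidates = []
    for q in reformulate(query):
        for result in web_search(q):
            for piece in chunk(result["text"]):
                candidates.append((score(piece, query, result["url"]), piece))
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [piece for _, piece in candidates[:top_k]]
```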

"🔗"Link to a more detailed review

"📄"Open Deep Search: Democratizing Search with Open-source Reasoning Agents

"💾" Code on GitHub

6. Tracing the Thoughts of LLMs

Anthropic researchers have presented an approach to learning the inner workings of large language models (using the Claude 3.5 Haiku model as an example). The goal of the project is to create methods similar to neuroscience tools to understand how the model “thinks”, plans and reasons when generating text.

Here are the authors’ findings:

  1. Multilingualism and Universal Thinking: The model does not separate internal mechanisms into different languages, but uses a common “language of thought” that allows knowledge transfer between languages.

    "В

    English, French and Chinese share common features indicating some conceptual universality
  2. Planning ahead: Experiments with poetry writing showed that the model selects rhyming words in advance, planning several words ahead, rather than simply generating text one word at a time. By the way, I reviewed another article on exactly this topic – Emergent Response Planning in LLMs.

  3. Mental Arithmetic: Instead of a simple memorization or standard addition algorithm, Claude uses multiple parallel paths of computation, combining approximate calculations and exact checks.

    "Сложные

    Complex parallel processes in Claude’s thinking during mental arithmetic
  4. Invalid reasoning: Sometimes the model generates logical but untrue chains of reasoning to agree with the user or to give a convincing but false answer.

    "Чтобы

    To complete the answer, Claude follows several reasoning steps in sequence: first identifying which state Dallas is in and then naming the capital of that state
  5. Hallucinations: By default the model tends to refuse to answer when it is unsure, but sometimes the internal “known answer” mechanism misfires, leading to the generation of plausible but unreliable information.

    "Слева:

    Left: Claude answers a question about the famous basketball player Michael Jordan because the concept of “known answer” prevents a standard refusal. Right: Claude refuses to answer a question about an unknown person (Michael Batkin)
  6. Restriction circumvention (jailbreaks): Researchers have found that grammatical consistency can lead to bypassing a model’s defense mechanisms, causing it to produce unwanted responses.

Despite significant advances, the current approach is limited by the complexity and time-consuming nature of the analysis, requiring improvements for application to longer and more complex problems. Such methods are important for the development of robust, transparent, and controllable AI agents and may be useful in other scientific fields such as medicine and biology.

"📄" Tracing the thoughts of a large language model

7. Play2Prompt

Often an LLM needs to call an external tool via an API whose documentation is incomplete, or which has no usage examples. The model may pass wrong parameters or hallucinate, and a non-programmer user does not know all the subtleties themselves. Researchers from MIT and IBM have proposed the PLAY2PROMPT approach.

"Фреймворк

The PLAY2PROMPT framework uses iterative beam search to find and improve examples of tool use, taking the tool’s outputs and errors into account
  • The PLAY2PROMPT method lets the model “play” with the tool: it tries different calls, sees errors, and improves its own examples;

  • Each call is then evaluated for quality and complexity;

  • After that, the best examples are generated, plus the documentation is automatically refined.

"На

The example shows the beam-search trajectory with the highest solution score on the validation set. At each step, new documentation variants are explored based on error feedback.
  • Bottom line: accuracy gains for LLaMA on BFCL up to +5-7%, GPT-3.5 and GPT-4o also show a noticeable boost;

  • The gains are especially noticeable for REST APIs and parallel calls (+12-17% there);

  • Even when 50% of the parameter descriptions were removed, the model managed to almost fully recover the target performance.

It is now possible to integrate the new tool without manually writing examples: the model itself will “learn” how to pull the right APIs. This makes it easier to implement LLM in a real prod environment. Of course, there can still be problems in multi-tool scenarios, but the approach clearly makes life easier.
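
Here is a simplified sketch of that “tool play” loop (my own illustration, not the PLAY2PROMPT code): the model proposes example calls, executes them, keeps the ones that work, and uses error messages to patch the documentation. The `propose_call` and `refine` helpers are assumed LLM-backed placeholders.

```python
# Sketch of zero-shot "tool play": propose candidate calls, execute them, keep
# working demonstrations, and refine the documentation from error feedback.

def propose_call(doc: str, feedback: str) -> dict:
    """Placeholder: an LLM suggests a call as {"args": (...), "kwargs": {...}}."""
    raise NotImplementedError

def refine(doc: str, error: str) -> str:
    """Placeholder: an LLM patches the incomplete documentation using the error."""
    raise NotImplementedError

def play_with_tool(tool, doc: str, rounds: int = 5, beam: int = 3):
    good_examples, feedback = [], ""
    for _ in range(rounds):
        candidates = [propose_call(doc, feedback) for _ in range(beam)]
        for call in candidates:
            try:
                result = tool(*call.get("args", ()), **call.get("kwargs", {}))
                good_examples.append((call, result))   # keep working demonstrations
            except Exception as err:                   # toy example: catch any tool error
                feedback = str(err)
                doc = refine(doc, feedback)            # improve the docs from the error message
    return good_examples, doc
```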

"📄" PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play

8. Chain-of-Tools

Chain-of-Thought is great for LLM reasoning, but real-world tasks often require calling calculators, search engines, KBQA systems, and so on. Fine-tuning hard-wires the model to already-known tools, and in-context learning (HuggingGPT) is sometimes cumbersome. Something universal is needed. Chinese researchers have proposed the Chain-of-Tools (CoTools) method:

  • The base model stays “untouched” (its weights are frozen). Modules are added on top:

    • Tool Judge: decides if a tool should be called now by looking at the hidden state of the current token;

    • Tool Retriever: selects from the tool pool the one that best fits (via contrastive embeddings of descriptions);

    • Tool Calling: builds a prompt with the call parameters.

"CoTools

CoTools decides whether to call the tool each time a new response token needs to be generated. A response token refers to text that has already been generated by the base model.
  • The method gives an accuracy gain: e.g. KBQA KAMEL is ~93.8%, STQuestions is 43.6%;

  • CoTools scales to 999+ tools.

  • Analysis has shown that individual hidden state dimensions are indeed “responsible” for the semantics of the invocation.

"Идеальная

The ideal tool invocation procedure. For example, for the input query “What will the weather be like at my destination tomorrow?”

The LLM dynamically plugs in new services without retraining the whole model. It’s a step towards universal assistants that can perform a ton of operations just by receiving a textual description of how to work with a tool. Great for multitasking applications. The risk is that there may be rough edges on real, large datasets, and the quality of the tool descriptions is critical.
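
For intuition, here is a conceptual sketch (my own assumptions, not the CoTools code) of the two key pieces: a small “judge” head that reads the frozen model’s hidden state and decides whether to call a tool, and a retriever that picks the tool whose description embedding is closest to the query.

```python
# Conceptual sketch: a linear probe on the frozen LLM's hidden state decides
# whether to call a tool; cosine similarity over description embeddings picks which one.

import torch
import torch.nn.functional as F

class ToolJudge(torch.nn.Module):
    """Binary head over the frozen LLM's hidden state: call a tool now or not."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.probe = torch.nn.Linear(hidden_size, 1)

    def forward(self, hidden_state: torch.Tensor) -> bool:
        return torch.sigmoid(self.probe(hidden_state)).item() > 0.5

def retrieve_tool(query_emb: torch.Tensor, tool_embs: torch.Tensor, names: list[str]) -> str:
    """Pick the tool whose (contrastively trained) description embedding is closest."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), tool_embs)
    return names[int(sims.argmax())]

# Toy usage with random tensors standing in for real hidden states / embeddings
hidden = torch.randn(4096)
judge = ToolJudge(hidden_size=4096)
if judge(hidden):
    tool_name = retrieve_tool(torch.randn(256), torch.randn(3, 256),
                              ["calculator", "search", "kbqa"])
```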

"📄"Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models

"💾"Code on GitHub

9. Gemini Robotics

We have super-powerful multimodal models (like Gemini 2.0) that do well in the digital realm. But how do we bring them into the physical world to control real robots (a robot arm, camera control, a dynamic scene)? You need to understand 3D object locations, trajectories, and grasp points, and do it all quickly.

  • Google launched the Gemini Robotics project, which extends the base Gemini 2.0 model with a Gemini Robotics-ER component that improves spatial reasoning, grasp prediction, and so on.

  • Researchers specifically developed the ERQA benchmark (400 questions) where they test “three-dimensional reasoning”: state estimation, movement planning, multiview tasks;

  • Zero-shot and few-shot settings were tested. For example, passing a banana between hands in simulation: the baseline scores 27% and Gemini Robotics-ER 53% (and up to 65+% with several demonstrations). On real hardware, performance increases substantially as well.

  • Complex scenarios (packing a lunch set, folding origami) achieved 79-100% success after training on 2-5k examples;

  • There are safety measures: constitutional pre-training – 96% of dangerous queries are rejected.

"Gemini

Gemini 2.0 already has the ability to understand semantic safety and handle long contexts, and specialized learning allows it to perform a variety of tasks, generate dexterous and reactive movements, and quickly adapt to new embodiments and use advanced visual-spatial reasoning to make decisions

Robots are becoming much more intelligent and versatile: the same model can handle a wide range of tasks, from simple grasps to long procedures. Adaptation to new platforms (Franka, Apollo) has also shown a ~60-63% success rate. Of course, it’s not perfect yet – complex tasks require precision so as not to damage objects or humans. But Gemini Robotics has very impressive prospects: a robot can now be taught a new operation quickly, just by showing it a few examples.

"Gemini

Gemini 2.0 does well at detecting objects and points in 2D, at using 2D pointing for grasps and trajectories, and at finding corresponding points and detecting objects in 3D.

"📄"Gemini Robotics: Bringing AI into the Physical World

10. End-to-End Deep Learning for Structural Brain Imaging: A Unified Framework

Brain research through image analysis is essential, but traditional approaches involve many steps (brain extraction, registration, segmentation, network construction, classification), each requiring separate models and manual adjustments. This leads to error accumulation and high costs.

UniBrain offers a unified solution by combining all analysis steps into a single model (an end-to-end approach). Minimal labeling is used: only extraction masks, classification labels, and a single labeled template. This saves significant resources and time.

"Проблема

The end-to-end learning problem for brain imaging tasks is training a model to simultaneously perform extraction, registration, segmentation, parcellation, network generation, and classification tasks

Key components of the model:

  • Extraction: 3D U-Net for accurate brain extraction;

  • Registration: CNN and spatial transformation layer (STL) to align the image with the template;

  • Segmentation and Parcellation: one-shot approach with mask transfer back to the original space;

  • Network construction: a multilayer perceptron (MLP) forms the brain connectivity matrix;

  • Classification: graph convolutional network (GCN) diagnoses conditions (e.g., ADHD).

As a result:

  • High accuracy on all tasks (Dice, a measure of similarity between two sets, for extraction – 0.970, registration – 0.942, segmentation – 0.652);

  • Efficient classification (AUC-ROC – 0.712), outperforming traditional methods;

  • High processing speed (approximately 0.22 seconds per image), significantly faster than conventional methods.

By fully integrating all analysis steps, the model significantly reduces error accumulation and its dependence on extensive manual labeling, which considerably speeds up image processing. UniBrain has the potential to accelerate and improve the accuracy of neurodiagnosis – an important step towards the effective application of neuroimaging in medicine.
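
To show how the listed stages chain together end-to-end, here is a schematic PyTorch sketch of the data flow only; every module body is a simplified stand-in (the real UniBrain uses a 3D U-Net, a spatial transformation layer for registration, and a GCN classifier), so treat it as an illustration rather than the actual architecture.

```python
# Schematic data flow: extraction -> registration -> (segmentation/parcellation)
# -> connectivity construction -> classification, in one differentiable pipeline.

import torch
import torch.nn as nn

class UniBrainSketch(nn.Module):
    def __init__(self, n_regions: int = 90, n_classes: int = 2):
        super().__init__()
        self.n_regions = n_regions
        self.extract = nn.Conv3d(1, 1, kernel_size=3, padding=1)     # stand-in for the 3D U-Net extractor
        self.register = nn.Conv3d(2, 3, kernel_size=3, padding=1)    # predicts a deformation field (fed to an STL)
        self.connect = nn.Sequential(nn.Linear(n_regions, n_regions), nn.ReLU())  # MLP for connectivity features
        self.classify = nn.Linear(n_regions * n_regions, n_classes)  # stand-in for the GCN diagnosis head

    def forward(self, volume: torch.Tensor, template: torch.Tensor) -> torch.Tensor:
        brain = volume * torch.sigmoid(self.extract(volume))              # extraction (soft brain mask)
        flow = self.register(torch.cat([brain, template], dim=1))         # registration to the labeled template
        _ = flow  # in the real model, template labels are warped back for segmentation/parcellation
        region_signal = brain.flatten(2).mean(-1).repeat(1, self.n_regions)  # placeholder per-region signal
        feats = self.connect(region_signal)                               # region-level features
        conn_matrix = feats.unsqueeze(2) * feats.unsqueeze(1)             # (batch, regions, regions) connectivity
        return self.classify(conn_matrix.flatten(1))                      # diagnosis logits (e.g. ADHD vs. control)
```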

"📄" End-to-End Deep Learning for Structural Brain Imaging: A Unified Framework

All of these studies point to one trend: AI is becoming increasingly comprehensive, able to operate in multiple environments (digital, physical, scientific), often without manual intervention. Yes, technical barriers (processing power, need for markup) and socio-ethical barriers (security, privacy, misuse) remain. But progress is in sight: AI systems are getting better at interacting with the real world and solving applied problems.

Well, let’s see where this research takes us in the next few months. We’ll probably see AI interacting even more closely with tools and robots, not just in labs, but also in factories, medicine, and the home. Cautiously but optimistically we are waiting for new breakthroughs!

***

This is the kind of exciting research that came out in March. Don’t forget to subscribe to my Telegram feed and use Dataist AI to stay up to date with the latest reviews of AI research papers. Let’s stay ahead in the world of technology together!

