My Research Statement

4 minute read

Published: March 02, 2026

This is my research statement.

Research Statement

In my junior year, I began doing research not out of grand ambition but from a simple desire to challenge myself beyond coursework. While exploring different directions, I encountered research on embodied intelligence: systems designed to interact directly with the physical world. Embodied agents must perceive, act, and adapt under real-world constraints. I was drawn to this line of work not merely because it was fashionable, but because it felt meaningful. Promoting interaction between intelligent systems and the real world seemed to be a concrete step toward more genuine forms of intelligence.

Although I do not claim that embodied AI is the only path toward Artificial General Intelligence, I believe it represents a necessary exploration. This curiosity led me to actively contact a professor working in this area and begin my research journey.

As I became more involved in research, I gradually encountered challenges that shaped my thinking. While reproducing prior work and implementing baselines, I noticed that experimental settings were often inconsistent across papers, making fair comparison difficult. The accompanying code was sometimes hard to read and extend. These experiences were not merely technical frustrations; they led me to reflect on what it means for research to be meaningful.

I began to realize that performance on a benchmark, by itself, does not necessarily translate to real-world utility (especially in the field of embodied AI). If a system cannot be reliably reproduced, extended, or deployed, its contribution remains fragile. This realization strengthened my appreciation for rigorous engineering, transparent evaluation, and research that prioritizes robustness over "benchmark SOTAs".

Motivated by my interest in embodied intelligence, I began my first research project under my supervisor’s guidance, building on prior work in video-based modeling for embodied tasks. I was particularly drawn to this direction because it combined powerful foundation models with real-world interaction scenarios, a setting where perception, temporal reasoning, and action intersect.

However, as I engaged more deeply with the implementation and evaluation, I began to notice some important limitations. For instance, models that processed full video sequences sometimes performed no better than those that relied solely on the final frame. This observation led me to question whether the proposed temporal modeling was genuinely contributing to understanding, or merely adding architectural complexity without deeper mechanistic gains. Similarly, I realized that the method predicted trajectories only in a purely 2D setting, which may not fully capture the structured physical nature of real-world interaction.

These experiences did not immediately provide clear answers. I began to reflect on a broader issue: strong benchmark performance does not necessarily imply robust generalization or transferable understanding. When models require fine-tuning for every new environment, their apparent intelligence may reflect distribution fitting rather than principled understanding and prediction. This realization did not discourage me; rather, it clarified the kind of questions I wish to explore moving forward.

I also participated in a collaborative project focused on feature upsampling built upon vision foundation models such as DINO and CLIP. The goal was to improve the resolution of backbone features for downstream tasks.

Through this experience, I became more aware of the tension between feature representation quality and computational efficiency. While high-capacity upsampling methods could visually reconstruct detailed structures, many of them were too computationally expensive for real-world deployment. Our approach emphasized lightweight design and practical efficiency, achieving competitive performance under strict resource constraints. This project reinforced my belief that research should not only pursue theoretical or visual appeal, but also consider scalability and deployability from the outset.

In summary, my current research interests center on how intelligent systems can move beyond surface-level prediction toward structured, causal understanding. I am particularly interested in how agents can form internal representations that capture underlying physical mechanisms, rather than merely fitting observed distributions.

At the same time, I remain committed to grounding these questions within practical engineering constraints. For me, genuine progress lies not only in benchmark improvements, but also in building systems that are interpretable and adaptable. I see my research journey as an ongoing process of refining these questions and developing the technical depth required to pursue them rigorously.


Zichen Zhao (赵梓辰)
