Hongyang Du 杜泓洋
Master's Student

Hi, I am Hongyang Du — an incoming Sc.M. student in Computer Science at Brown University. I recently completed my undergraduate studies in Computer Science and Mathematics at the University of Maryland, College Park. My academic interests include Vision Language Models, 3D Vision, Embodied AI, and the occasional quirky math problem.

Outside of research, I am an amateur powerlifter 💪 and a pro drummer 🥁. Musically, I am a bipolar jazz cat and prog metalhead — think Chick Corea meets Animals as Leaders. I also live with a cat named Dingzhen (丁真) 🐱, who keeps me grounded.


Education
  • Brown University
    Department of Computer Science
    Master's Student
    Aug. 2025 - present
  • University of Maryland, College Park
    B.S. in Computer Science and Mathematics
    Aug. 2021 - May 2025
Honors & Awards
  • Robert Ma Scholarship Recipient
    2024
  • Break Through Tech Scholarship Recipient
    2021
News
2025
Becoming a Brunonian: Starting the Sc.M. Computer Science Program at Brown University in Fall 2025
Sep 02
2024
🐱 found me in the wild!
Jun 14
Joining iFLYTEK as a Machine Learning Engineer Intern in the AI + Agriculture group this summer.
May 30
Selected Publications
VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations for Synthetic Videos

Zongxia Li*, Xiyang Wu*, Yubin Qin, Hongyang Du, Guangyao Shi, Dinesh Manocha, Tianyi Zhou, Jordan Lee Boyd-Graber (* equal contribution)

arXiv 2025

Synthetic video generation with foundation models has gained attention for its realism and wide applications. While these models produce high-quality frames, they often fail to respect common sense and physical laws, resulting in abnormal content. Existing metrics like VideoScore emphasize general quality but ignore such violations and lack interpretability. A more insightful approach is using multi-modal large language models (MLLMs) as interpretable evaluators, as seen in FactScore. Yet, MLLMs' ability to detect abnormalities in synthetic videos remains underexplored. To address this, we introduce VideoHallu, a benchmark featuring synthetic videos from models like Veo2, Sora, and Kling, paired with expert-designed QA tasks solvable via human-level reasoning across various categories. We assess several SoTA MLLMs, including GPT-4o, Gemini-2.5-Pro, Qwen-2.5-VL, and newer models like Video-R1 and VideoChat-R1. Despite strong real-world performance on MVBench and MovieChat, these models still hallucinate on basic commonsense and physics tasks in synthetic settings, underscoring the challenge of hallucination. We further fine-tune SoTA MLLMs using Group Relative Policy Optimization (GRPO) on real and synthetic commonsense/physics data. Results show notable accuracy gains, especially with counterexample integration, advancing MLLMs' reasoning capabilities.
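
The abstract mentions fine-tuning with Group Relative Policy Optimization (GRPO). As a rough intuition (not the paper's actual training code), GRPO scores each sampled answer against the other answers drawn for the same prompt instead of against a learned value baseline; a minimal sketch of that group-relative advantage step, with made-up rewards, might look like:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each response's reward by its group's mean and std.

    `rewards` has shape (num_prompts, group_size): one row of sampled
    responses per prompt. GRPO-style training uses this per-group
    normalization in place of a value-function baseline.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled answers each, with 0/1-style QA rewards.
print(group_relative_advantages([[1.0, 0.0, 1.0, 0.0],
                                 [0.2, 0.9, 0.4, 0.5]]))
```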

A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges

Zongxia Li*, Xiyang Wu*, Hongyang Du, Huy Nghiem, Guangyao Shi (* equal contribution)

CVPR TMM-OpenWorld 2025 Oral

Multimodal Vision Language Models (VLMs) have emerged as a transformative technology at the intersection of computer vision and natural language processing, enabling machines to perceive and reason about the world through both visual and textual modalities. For example, models such as CLIP, Claude, and GPT-4V demonstrate strong reasoning and understanding abilities on visual and textual data and beat classical single modality vision models on zero-shot classification. Despite their rapid advancements in research and growing popularity in applications, a comprehensive survey of existing studies on VLMs is notably lacking, particularly for researchers aiming to leverage VLMs in their specific domains. To this end, we provide a systematic overview of VLMs in the following aspects: model information of the major VLMs developed over the past five years (2019-2024); the main architectures and training methods of these VLMs; summary and categorization of the popular benchmarks and evaluation metrics of VLMs; the applications of VLMs including embodied agents, robotics, and video generation; the challenges and issues faced by current VLMs such as hallucination, fairness, and safety.
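
To make the zero-shot classification claim above concrete, here is a minimal sketch using the Hugging Face `transformers` CLIP interface; the checkpoint name and the local image path are assumptions for illustration and are not part of the survey:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint and image path; replace with your own.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a drum kit"]

# CLIP scores the image against each text prompt; softmax gives class probabilities.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```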

Ipelets for the Convex Polygonal Geometry

Nithin Parepally, Ainesh Chatterjee, Auguste Gezalyan, Hongyang Du, Sukrit Mangla, Kenny Wu, Sarah Hwang, David Mount

40th International Symposium on Computational Geometry (SoCG) 2024

There are many structures, both classical and modern, involving convex polygonal geometries whose deeper understanding would be facilitated through interactive visualizations. The Ipe extensible drawing editor, developed by Otfried Cheong, is a widely used software system for generating geometric figures. One of its features is the capability to extend its functionality through programs called Ipelets. In this media submission, we showcase a collection of new Ipelets that construct a variety of geometric objects based on polygonal geometries. These include Macbeath regions, metric balls in the forward and reverse Funk distance, metric balls in the Hilbert metric, polar bodies, the minimum enclosing ball of a point set, and minimum spanning trees in both the Funk and Hilbert metrics. We also include a number of utilities on convex polygons, including union, intersection, subtraction, and Minkowski sum (previously implemented as a CGAL Ipelet).
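
To give a flavor of the convex-polygon utilities listed above, here is a small standalone Python sketch (not the Ipelet code itself, which runs inside Ipe) of the Minkowski sum of two convex polygons, using the fact that for convex inputs it equals the convex hull of all pairwise vertex sums; SciPy is assumed:

```python
import numpy as np
from scipy.spatial import ConvexHull

def minkowski_sum(P, Q):
    """Minkowski sum of two convex polygons given as (n, 2) vertex arrays.

    For convex P and Q, P + Q is the convex hull of all pairwise vertex
    sums, so we enumerate the sums and take their hull.
    """
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    sums = (P[:, None, :] + Q[None, :, :]).reshape(-1, 2)
    hull = ConvexHull(sums)
    return sums[hull.vertices]  # hull vertices in counterclockwise order

# Example: a unit square plus a small triangle.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
triangle = [(0, 0), (0.5, 0), (0, 0.5)]
print(minkowski_sum(square, triangle))
```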
