I am a fourth-year Ph.D. student at the Laboratory of Image and Video Engineering, The University of Texas at Austin, advised by Prof. Alan Bovik. I collaborate with the Meta Video Infrastructure Team on Video Engineering and Perceptual Quality Optimization.
My current research advances generative modeling for vision and Multimodal Large Language Models (MLLMs), with a focus on few-step generation frameworks such as MeanFlows and on visual grounding. Building on my prior work in representation learning and perceptual quality assessment (IQA/VQA), I design and train generative and multimodal models that are semantically faithful and perceptually high-fidelity.
Investigating MeanFlows as a framework for efficient few-step generation, training generative models that synthesize high-fidelity samples in a handful of sampling steps.
Studying visual grounding techniques for Multimodal Large Language Models, improving how they localize and reason over visual content to produce semantically faithful, perceptually high-fidelity outputs.
Built a generative framework for high-resolution 4K image synthesis, scaling state-of-the-art pretrained diffusion models to 4096×4096 with structural refinement methods that suppress synthesis artifacts and maximize perceptual quality and spatial consistency.
Optimized drone-to-controller video streaming using adaptive streaming protocols and Video Quality Assessment (VQA), curating an empirical dataset and designing multi-tier frame-rate switching driven by visual features and real-time drone telemetry.
Engineered an AI-powered conversational interface using Natural Language Understanding (NLU) for the Oracle Journeys application, integrating it with Oracle HCM Cloud via REST APIs and optimizing intent classification and entity recognition.
Tested restricting the Representation Alignment (REPA) loss to high-noise timesteps on SiT-XL/2 flow-matching models, establishing that the effective axis for relaxing alignment is the training iteration, not the diffusion timestep.
Benchmarked seven CFG-family guidance strategies across SDXL, PixArt-Σ, SD3, and FLUX on T2I-CompBench++, revealing that aggregate FID/IS scores conceal compositional hallucinations exposed only by per-prompt distributional evaluation.
Benchmarked diffusion, flow-matching, consistency, and flow-map-matching models for 4K image synthesis, quantifying the perceptual-quality and inference-efficiency trade-offs between knowledge-distilled and principled few-step techniques.