Title: [Seminar] Bringing AGI to the Visual World: Advancing Multimodal Understanding and Generation (Meta)
Author: School of Advanced Computing (첨단컴퓨팅학부)
Date: 2025.05.21
Last modified: 2025.05.21
Category: Seminar



Date & Time: Thursday, June 5, 2025, 11 AM

Venue: Baekyang Nuri (백양누리) Grand Ballroom A (open to all Yonsei members; no pre-registration required)


Speaker: Ce Liu (AI Research Scientist Director at Meta GenAI, IEEE Fellow)


Title: Bringing AGI to the Visual World: Advancing Multimodal Understanding and Generation  


Abstract: 

In this talk, I will demonstrate how we can bring Artificial General Intelligence (AGI) to the visual world, enabling it to "see" and "simulate" that world. Built on a highly efficient mixture-of-experts (MoE) architecture, Llama 4 showcases powerful language modeling capabilities and significantly expands its potential through advanced multimodal functionalities, particularly in complex tasks such as detailed image question answering, document interpretation, and intricate chart and graph analysis. Additionally, our model achieves best-in-class input and output grounding capabilities, effectively aligning user prompts with relevant visual concepts and anchoring model responses to specific regions within images. This results in more precise visual question answering, allowing the LLM to better understand user intent and accurately localize objects of interest. We discuss how Llama 4 establishes new benchmarks, achieving scores of 73.4 on MMMU and 73.7 on MathVista via Maverick, while maintaining efficiency and accessibility, thus reinforcing Meta's commitment to open-source AI innovation.
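For readers unfamiliar with the mixture-of-experts idea mentioned above, the sketch below shows generic top-k expert routing: a learned router scores experts per token and mixes only the k highest-scoring expert MLPs. This is an illustrative sketch of the general technique only, not Llama 4's actual architecture; the class name, expert count, and shapes are all invented for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer (illustrative, not Llama 4's)."""

    def __init__(self, dim: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts)  # per-token expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        scores = self.router(x)                      # (tokens, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)   # keep the k best experts
        weights = F.softmax(weights, dim=-1)         # normalize their mixture
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e             # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

x = torch.randn(10, 64)
print(TopKMoE(64)(x).shape)  # torch.Size([10, 64])
```

The efficiency claim follows from this structure: each token activates only k of the experts, so compute per token stays far below that of a dense model with the same total parameter count.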


We also introduce Movie Gen, a family of foundation models capable of generating high-quality, 1080p HD videos with synchronized audio, instruction-based video editing, and personalized videos derived from user images. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video length of 16 seconds at 16 frames per second. Movie Gen achieves state-of-the-art performance across various tasks, including text-to-video synthesis, video personalization, video editing, and both video-to-audio and text-to-audio generation. We present key technical innovations and simplifications in architecture, training methodologies, data curation, and inference optimizations designed to accelerate progress in large-scale media generation research.
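As a quick sanity check on how the stated figures relate (treating "73K" as roughly 73,000 tokens, an assumption since the abstract does not give the exact count or tokenization):

\[
16\,\mathrm{s} \times 16\,\mathrm{fps} = 256\ \text{frames}, \qquad
\frac{73{,}000\ \text{tokens}}{256\ \text{frames}} \approx 285\ \text{tokens per frame}.
\]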


Bio: 

Ce Liu is an AI Research Scientist Director at Meta GenAI, overseeing multimedia research in large language models. Before joining Meta, he was a Partner Research Manager at Microsoft GenAI and the Chief Architect of Computer Vision at Microsoft Azure AI. Prior to that, he worked at Google Research for seven years, pioneering research in image and video generation and helping launch creative Google products. Ce received his BE and ME degrees from Tsinghua University in 1999 and 2002, respectively, and earned his PhD from MIT EECS in 2009. He has served as area chair for leading computer vision and machine learning conferences, was Program Co-Chair of CVPR 2020, and is the Lead General Chair of CVPR 2025. Ce received the PAMI Young Researcher Award in 2016, and best (student) paper awards at NeurIPS (2005) and CVPR (2009, 2019). He is an IEEE Fellow.