Monday, September 23, 2024, 3:30pm to 5:00pm
About this Event
Please join the GW Co-Design of Trustworthy AI in Systems (DTAIS) program for this "advanced intro" seminar!
The growing capability of AI systems has motivated their use in increasingly complex tasks. At the same time, AIs frequently make subtle mistakes that are difficult for humans to identify. This makes it challenging to obtain reliable human supervision of AIs, including capability assessment, evaluation of individual predictions, and effective delegation. It seems inevitable that we will need to enlist AIs to help us supervise AIs. In this talk, Professor Feng will start with a high-level overview of the technical alignment and AI safety research landscape, then discuss challenges in using AIs to supervise themselves. The discussion will cover three specific projects showing how self-evaluation can be biased and how methods intended to assist human evaluation can have the opposite effect. The first project studies language models' self-recognition capability and how it can bias self-evaluation; the second studies iterative self-refinement and how it can cause LMs to diverge further from human preferences; the third studies how RLHF from flawed human feedback can make models more deceptive.
Attendance is virtual and the content is suitable for TAI experts and newcomers alike.