How language models reason, and how we keep them aligned with people.
I build methods to probe, evaluate, and steer AI models toward trustworthy, human-aligned behavior, and in turn use those models to study people.
Keeping models aligned with people
I measure where language models drift from human values and judgments (under persuasive pressure, in constrained decision-making, and across cultures) and design interventions that close the gap. This includes evaluating the robustness of models' stated beliefs and making AI–human alignment explainable rather than a black box.
Looking inside the chain of thought
I open up how models reason step by step — tracing moral-framework trajectories across intermediate steps, probing the internal representations that encode them, and using lightweight activation steering to shape how those frameworks are integrated. I also build harnesses and network-of-thought structures that make long-horizon reasoning more reliable.
Auditing and mitigating harm
I audit the biases and harms that surface in models and on online platforms — cognitive and social bias, the (mis)detection of AI-generated text, and implicit hate speech — and build benchmarks and methods to mitigate them, from explanation generation to characterizing harmful online communities.
LLMs as instruments for studying society
I use large language models and network science as instruments to study collective behavior — simulating public opinion, modeling the dynamics of emerging platforms, and predicting scientific collaboration — connecting model behavior back to real social systems.