How language models reason, and how we keep them aligned with people.

I build methods to probe, evaluate, and steer AI models toward trustworthy, human-aligned behavior, and I use those same models, in turn, to study people.

Themes

What I'm working on

01 Human–AI alignment

Keeping models aligned with people

I measure where language models drift from human values and judgments (under persuasive pressure, in constrained decision-making, and across cultures) and design interventions that close the gap. This includes evaluating how robust models' stated beliefs are and making human–AI alignment explainable rather than a black box.

Vulnerability of LLMs' Stated Belief → XChoice & more →

02 Reasoning & interpretability

Looking inside the chain of thought

I open up how models reason step by step — tracing moral-framework trajectories across intermediate steps, probing the internal representations that encode them, and using lightweight activation steering to shape how those frameworks are integrated. I also build harnesses and network-of-thought structures that make long-horizon reasoning more reliable.

Moral Reasoning Trajectories → Network-of-Thought, ReFlect →

03 Bias, safety & online harm

Auditing and mitigating harm

I audit the biases and harms that surface in models and on online platforms — cognitive and social bias, the (mis)detection of AI-generated text, and implicit hate speech — and build benchmarks and methods to mitigate them, from explanation generation to characterizing harmful online communities.

Implicit Hate Speech & ChatGPT → CogBias, Dif, YouNICon →

04 Computational social science

LLMs as instruments for studying society

I use large language models and network science as instruments to study collective behavior — simulating public opinion, modeling the dynamics of emerging platforms, and predicting scientific collaboration — connecting model behavior back to real social systems.

DeepSeek opinion simulation → The Rise of Bluesky →