Ph.D. Candidate · Informatics · Indiana University
Aligning humans
and AI
I study human–AI alignment — probing and mitigating the biases and limitations of large language models so they can be implemented more faithfully, with reliable interpretability. My work spans natural language processing, large language models, and computational social science.
About
I am Fan Huang, a fourth-year Informatics Ph.D. candidate in Complex Networks and Systems at Indiana University Bloomington, advised by Prof. Jisun An and Prof. Haewoon Kwak. I also work with Prof. Yong-Yeol Ahn and Prof. Filippo Menczer on research projects.
My current projects investigate the human–AI alignment problems and challenges that can be used to mitigate LLMs' biases and limitations and lead to a better implementation of AI models — working across Natural Language Processing, Large Language Models, and Computational Social Science.
Education
- Ph.D. in Informatics Indiana University Bloomington
- Ph.D. in Computer Science Singapore Management University
- M.S. in Information Systems Nanyang Technological University
- B.S. in Computer Science & Technology Central South University
Latest Publications
All 19 →- Simulating Hate Speech Cascades with Multi-LLM AgentsPreprint · 2026
- ReFlect: A Harness System for Long-Horizon LLM ReasoningPreprint · 2026
- CogBias: Measuring & Mitigating Cognitive Bias in LLMsPreprint · 2026
Featured project
All projects →A single-cell transcriptomics visualization pipeline over ~453k cells — turning massive datasets into legible maps.
Research Directions
Overview →- Keeping models aligned with peopleHuman–AI alignment
- Looking inside the chain of thoughtReasoning & interpretability
- Auditing and mitigating harmBias, safety & online harm
- LLMs as instruments for studying societyComputational social science
Featured Publications
All publications →-
Vulnerability of LLMs' Stated Belief? LLMs Belief Resistance Check Through Strategic Persuasive Conversation InterventionsFeatured
Are LLMs persuaded out of their beliefs? Using the Source–Message–Channel–Receiver (SMCR) framework across six mainstream LLMs and three domains (facts, medical QA, social bias), we measure how stable each model's stated beliefs are under persuasive pressure. The smallest model flips on 82.5% of attempts at the first turn — and, counterintuitively, asking models to verbalize confidence makes them more vulnerable.
-
Understanding Moral Reasoning Trajectories in Large Language Models: Toward Probing-Based ExplainabilityFeatured
We introduce moral reasoning trajectories — sequences of ethical-framework invocations across intermediate reasoning steps — and analyze their dynamics across six models and three benchmarks. Reasoning switches frameworks 55–58% of the time; linear probes localize framework-specific encoding to model-specific layers, and lightweight activation steering modulates how models integrate them.
-
Is ChatGPT better than Human Annotators? Potential and Limitations of ChatGPT in Explaining Implicit Hate SpeechFeatured
Can ChatGPT explain why a post is implicitly hateful? On the LatentHatred dataset, ChatGPT reaches 80% agreement with human labels, and a user study shows its natural-language explanations can shift lay perceptions of hatefulness — revealing both the promise and the risks of LLM-generated explanations.
-
ChatGPT Rates Like Human: Towards a Better Alignment of Text Explanation Quality AssessmentsFeatured
Can ChatGPT judge the quality of natural-language explanations the way people do? Across three NLE datasets and 900 human annotations, ChatGPT aligns well with humans on coarse-grained scales, and paired comparisons plus dynamic prompting further improve agreement — charting where LLM-based evaluation can stand in for human raters.
-
Chain of Explanation: New Prompting Method to Generate Higher Quality Natural Language Explanation for Implicit Hate SpeechFeatured
A prompting method — Chain of Explanation (CoE) — that generates natural-language explanations for why a post is implicitly hateful, guided by heuristic words and the target group. On LatentHatred, CoE sharply improves explanation quality (BLEU 44.0 → 62.3), outperforming baseline generation across models and metrics.