ICLR 2026

PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits

Loka Li*¹, Wong Yu Kang*¹, Minghao Fu¹,³, Guangyi Chen¹,², Zhenhao Chen¹, Gongxu Luo¹, Yuewen Sun¹,², Salman Khan¹,⁴, Peter Spirtes², Kun Zhang¹,²

¹MBZUAI ²Carnegie Mellon University ³UC San Diego ⁴Australian National University

* Equal contribution


Overview

PersonaX is built to answer a simple question: can we study human behavior traits at scale without relying on self-reports or invasive measurements? We link LLM-inferred behavior traits to visual and biographical signals, enabling cross-modal analysis, causal discovery, and reproducible empirical study.

The result is a two-dataset suite that is large enough for population-level insights, but structured enough to support principled causal analysis.

End-to-end pipeline for constructing PersonaX.

Core Contributions

Two multimodal datasets: CelebPersona (9,444 public figures) and AthlePersona (4,181 professional athletes across seven major leagues).
LLM-inferred Big Five trait descriptions and scores, paired with facial embeddings and structured biographical metadata.
Two-level analysis: structured independence testing and causal representation learning with identifiability guarantees.
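As a rough illustration of the first analysis level, the sketch below runs a permutation test for independence between a binary attribute and a trait score. The choice of test, the function name, and the toy data are assumptions made for illustration; the page does not state which independence tests PersonaX uses.

```python
import random
from statistics import mean


def permutation_independence_test(groups, scores, n_permutations=2000, seed=0):
    """Permutation test for independence between a binary attribute and a score.

    Returns (observed_statistic, p_value), where the statistic is the absolute
    difference in mean score between the two groups.
    """
    if len(groups) != len(scores):
        raise ValueError("groups and scores must have the same length")

    def abs_mean_diff(labels):
        a = [s for g, s in zip(labels, scores) if g == 1]
        b = [s for g, s in zip(labels, scores) if g == 0]
        return abs(mean(a) - mean(b))

    observed = abs_mean_diff(groups)
    rng = random.Random(seed)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling the labels breaks any real link to the scores
        # while keeping both group sizes fixed.
        rng.shuffle(shuffled)
        if abs_mean_diff(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    p_value = (hits + 1) / (n_permutations + 1)
    return observed, p_value


# Toy data (hypothetical): a binary attribute and a trait score per person.
attribute = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
trait = [4.1, 4.3, 3.9, 4.4, 4.0, 4.2, 2.9, 3.1, 3.0, 2.8, 3.2, 3.0]
stat, p = permutation_independence_test(attribute, trait)
```

A small p-value suggests the attribute and the score are dependent in the sample. It says nothing about causal direction, which is what the second analysis level is meant to address.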

Datasets & Access

PersonaX is released on Hugging Face as two datasets with unified schema and documentation.

Dataset	Rows	Modalities
CelebPersona	9,444	Face embeddings, curated facial attributes, biographical metadata, LLM-inferred traits
AthlePersona	4,181	Face embeddings, league metadata, biographical attributes, LLM-inferred traits

Browse the full dataset collection: huggingface.co/Persona-X/datasets
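A minimal loading sketch, assuming the Hugging Face `datasets` library is installed. The repository IDs (`Persona-X/CelebPersona`, `Persona-X/AthlePersona`) and the `train` split are guesses based on the organization name in the URL above, not confirmed identifiers; check the collection page for the actual names before use.

```python
# Dataset names taken from this page; the matching repository IDs are assumed.
KNOWN_DATASETS = ("CelebPersona", "AthlePersona")


def personax_repo_id(dataset_name, org="Persona-X"):
    """Build a hypothetical Hub repository ID such as 'Persona-X/CelebPersona'."""
    if dataset_name not in KNOWN_DATASETS:
        raise ValueError(f"unknown dataset: {dataset_name!r}")
    return f"{org}/{dataset_name}"


def load_personax(dataset_name, split="train"):
    """Download one PersonaX dataset from the Hub (requires network access)."""
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset

    return load_dataset(personax_repo_id(dataset_name), split=split)
```

Calling `load_personax("CelebPersona")` would then return a dataset object, provided the assumed repository ID and split exist.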

Method: From Data to Causal Structure

We define a consistent data processing pipeline that connects public sources to LLM-inferred trait summaries while preserving privacy. The pipeline has two branches: the structured branch supports classical statistical independence testing, and the unstructured branch supports multimodal causal representation learning.

Trait inference with calibrated prompts and multiple LLMs.
Aggregation into Big Five scores and embeddings.
Privacy-first release: no raw images or raw trait text.
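The aggregation step can be sketched as follows, assuming each LLM returns one numeric score per Big Five trait and that scores are combined by a simple mean. The averaging rule, the score scale, and the model names are illustrative assumptions; the page does not specify the actual aggregation procedure.

```python
from statistics import mean, pstdev

BIG_FIVE = (
    "openness",
    "conscientiousness",
    "extraversion",
    "agreeableness",
    "neuroticism",
)


def aggregate_trait_scores(per_model_scores):
    """Average each Big Five trait across models and report the spread."""
    if not per_model_scores:
        raise ValueError("need scores from at least one model")
    aggregated = {}
    for trait in BIG_FIVE:
        values = [scores[trait] for scores in per_model_scores.values()]
        aggregated[trait] = {"mean": mean(values), "spread": pstdev(values)}
    return aggregated


# Hypothetical scores from three models for one person.
example = {
    "model_a": {"openness": 4, "conscientiousness": 3, "extraversion": 5,
                "agreeableness": 2, "neuroticism": 1},
    "model_b": {"openness": 5, "conscientiousness": 3, "extraversion": 4,
                "agreeableness": 2, "neuroticism": 2},
    "model_c": {"openness": 3, "conscientiousness": 3, "extraversion": 3,
                "agreeableness": 2, "neuroticism": 3},
}
summary = aggregate_trait_scores(example)
```

Keeping the spread next to the mean makes it easy to flag people on whom the models disagree before the scores feed into downstream analysis.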

Processing pipeline for multimodal trait inference.

AthlePersona is curated from official league data across NBA, NFL, NHL, ATP, PGA, EPL, and Bundesliga, linking biographical metadata, physical attributes, and LLM-inferred traits.

CelebPersona is built on CelebA identities linked to Wikidata, enabling rich biographical context and stable facial attributes.

Results Highlights

We benchmark prompt designs across models and datasets to quantify the stability of inferred traits. Consistent prompting yields reliable trait scores that are suitable for downstream analysis.

Evaluation of LLM consistency for prompt design: prompt format affects trait-score stability.
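One simple way to quantify this kind of stability is the standard deviation of a trait score over repeated runs of the same prompt format. The sketch below uses that measure with made-up prompt names and scores; it is an assumed proxy, not the consistency metric reported in the paper.

```python
from statistics import pstdev


def score_instability(runs_by_prompt):
    """Map each prompt format to the std. dev. of its repeated trait scores."""
    return {prompt: pstdev(scores) for prompt, scores in runs_by_prompt.items()}


# Hypothetical repeated scores for one person and one trait.
runs = {
    "free_text": [2, 4, 3, 5, 1],
    "rubric": [3, 3, 4, 3, 3],
}
instability = score_instability(runs)
# The format with the smallest spread is the most stable under this measure.
most_stable = min(instability, key=instability.get)
```

Lower values mean the format produces more repeatable scores. Comparing formats this way requires the same people and traits in every condition.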

Estimated causal graph for CelebPersona.

Estimated causal graph for AthlePersona.

The learned graphs reveal interpretable cross-modal pathways between visual, biographical, and LLM-inferred trait variables, supporting structured reasoning about behavior signals across populations.

Ethics & Usage

PersonaX is built from publicly available sources that carry non-commercial, academic-use constraints. No raw images or raw trait text are released; only embeddings and structured attributes are provided. Accordingly, use is restricted to non-commercial research and excludes high-stakes applications such as employment, insurance, or lending.

BibTeX

@article{li2025personax,
  title   = {PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits},
  author  = {Li, Loka and Kang, Wong Yu and Fu, Minghao and Chen, Guangyi and Chen, Zhenhao and Luo, Gongxu and Sun, Yuewen and Khan, Salman and Spirtes, Peter and Zhang, Kun},
  journal = {arXiv preprint arXiv:2509.11362},
  year    = {2025}
}