Personality Psychology meets Natural Language Processing

TakeLab, FER, University of Zagreb

About us

We are a team of computer scientists who want to be psychologists and psychologists who want to be computer scientists.

With this project, we aim to lay the groundwork for a truly interdisciplinary perspective on computational personality research by developing datasets and models for personality prediction and analysis based on online textual interactions. The overarching goal of the project is to bring the two communities closer together and ultimately increase their capacity to carry out relevant and valid research using computational text analysis methods, contributing to both research fields. To this end, the project will focus on three research objectives:

Development of datasets adequate for text-based personality research.
Development of comprehensive NLP models for personality prediction and analysis.
Investigation of research questions that relate personality psychology to language use, by means of confirmatory and exploratory studies that leverage the developed datasets and computational models.

We are part of the Text Analysis and Knowledge Engineering Lab at the Faculty of Electrical Engineering and Computing, University of Zagreb.

Projects

PSY.TXT project (IP-2020-02-8671)

This project focuses on computational models for personality analysis and prediction from text. Although NLP and personality psychology have a high potential for synergy, the two fields have different goals and values and thus far remain largely disconnected from each other.

Publications

Matej Gjurković, Iva Vukojević, Jan Šnajder

December 2021 Future Generation Computer Systems

SIMPA: Statement-to-Item Matching Personality Assessment from text

Automated text-based personality assessment (ATBPA) methods can analyze large amounts of text data and identify nuanced linguistic personality cues. However, current approaches lack the interpretability, explainability, and validity offered by standard questionnaire instruments. To address these weaknesses, we propose an approach that combines questionnaire-based and text-based approaches to personality assessment. Our Statement-to-Item Matching Personality Assessment (SIMPA) framework uses natural language processing methods to detect self-referencing descriptions of personality in a target’s text and utilizes these descriptions for personality assessment. The core of the framework is the notion of a trait-constrained semantic similarity between the target’s freely expressed statements and questionnaire items. The conceptual basis is provided by the realistic accuracy model (RAM), which describes the process of accurate personality judgments and which we extend with a feedback loop mechanism to improve the accuracy of judgments. We present a simple proof-of-concept implementation of SIMPA for ATBPA on the social media site Reddit. We show how the framework can be used directly for unsupervised estimation of a target’s Big 5 scores and indirectly to produce features for a supervised ATBPA model, demonstrating state-of-the-art results for the personality prediction task on Reddit.
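
The statement-to-item matching idea described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the questionnaire items are hypothetical, the threshold is arbitrary, and plain bag-of-words cosine similarity stands in for whatever semantic similarity model the paper actually uses. It only shows the general shape of matching a target's statements against trait-keyed items.

```python
import math
import re
from collections import Counter

# Hypothetical items as (trait, keying, text); not taken from a real instrument.
ITEMS = [
    ("extraversion", +1, "I am the life of the party"),
    ("extraversion", -1, "I am quiet around strangers"),
    ("conscientiousness", +1, "I am always prepared"),
]


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def match_statements(statements, items=ITEMS, threshold=0.5):
    """Average the signed similarity of statement-item matches per trait.

    A statement counts as evidence for a trait only if it is similar
    enough to one of that trait's items; the item's keying sets the sign.
    """
    evidence = {}
    for statement in statements:
        s_vec = bag_of_words(statement)
        for trait, keying, text in items:
            sim = cosine(s_vec, bag_of_words(text))
            if sim >= threshold:
                evidence.setdefault(trait, []).append(keying * sim)
    return {trait: sum(v) / len(v) for trait, v in evidence.items()}


scores = match_statements(
    ["I am usually quiet around strangers", "The weather is nice"]
)
print(scores)
```

In this toy run only the first statement resembles an item closely enough to pass the threshold, and because that item is negatively keyed, the resulting extraversion evidence is negative.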

Matej Gjurković, Mladen Karan, Iva Vukojević, Mihaela Bošnjak, Jan Šnajder

June 2021 Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media, NAACL 2021

PANDORA Talks: Personality and Demographics on Reddit

Personality and demographics are important variables in social sciences, while in NLP they can aid in interpretability and removal of societal biases. However, datasets with both personality and demographic labels are scarce. To address this, we present PANDORA, the first large-scale dataset of Reddit comments labeled with three personality models (including the well-established Big 5 model) and demographics (age, gender, and location) for more than 10k users. We showcase the usefulness of this dataset on three experiments, where we leverage the more readily available data from other personality models to predict the Big 5 traits, analyze gender classification biases arising from psycho-demographic variables, and carry out a confirmatory and exploratory analysis based on psychological theories. Finally, we present benchmark prediction models for all personality and demographic variables.

Matej Gjurković, Jan Šnajder

June 2018 Proceedings of the Second Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media, NAACL 2018

Reddit: A Gold Mine for Personality Prediction

Automated personality prediction from social media is gaining increasing attention in natural language processing and social sciences communities. However, due to high labeling costs and privacy issues, the few publicly available datasets are of limited size and low topic diversity. We address this problem by introducing a large-scale dataset derived from Reddit, a source so far overlooked for personality prediction. The dataset is labeled with Myers-Briggs Type Indicators (MBTI) and comes with a rich set of features for more than 9k users. We carry out a preliminary feature analysis, revealing marked differences between the MBTI dimensions and poles. Furthermore, we use the dataset to train and evaluate benchmark personality prediction models, achieving macro F1-scores between 67% and 82% on the individual dimensions and 82% accuracy for exact or one-off accurate type prediction. These results are encouraging and comparable with the reliability of standardized tests.
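
For readers unfamiliar with the two metrics named in the abstract, here is a minimal sketch of how they could be computed. It assumes that "one-off" means the predicted four-letter type differs from the true type in at most one dimension; that reading, the function names, and the toy labels are illustrative assumptions and not the paper's evaluation code.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, l) for l in labels) / len(labels)


def one_off_accuracy(true_types, pred_types):
    """Share of predictions that differ from the true type in at most one letter."""
    hits = sum(
        sum(a != b for a, b in zip(t, p)) <= 1
        for t, p in zip(true_types, pred_types)
    )
    return hits / len(true_types)


# Toy labels for a single dimension (Introversion vs. Extraversion).
print(macro_f1(["I", "I", "I", "E"], ["I", "I", "E", "E"]))
# Toy full-type predictions: the first is one letter off, the second is four off.
print(one_off_accuracy(["INTJ", "ENFP"], ["INTP", "ISTJ"]))
```

Macro averaging gives both poles of a dimension equal weight, which matters when one pole is much more common than the other in the data.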

All publications

Matej Gjurković, Iva Vukojević, Jan Šnajder (2021). SIMPA: Statement-to-Item Matching Personality Assessment from text. Future Generation Computer Systems.

Matej Gjurković, Mladen Karan, Iva Vukojević, Mihaela Bošnjak, Jan Šnajder (2021). PANDORA Talks: Personality and Demographics on Reddit. Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media, NAACL 2021.

Ivan Sekulic, Matej Gjurković, Jan Šnajder (2018). Not Just Depressed: Bipolar Disorder Prediction on Reddit. Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, EMNLP 2018.

Matej Gjurković, Jan Šnajder (2018). Reddit: A Gold Mine for Personality Prediction. Proceedings of the Second Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media, NAACL 2018.

People

Researchers

Irina Masnikosa

Linguist on duty

Iva Vukojević

Psychologist on duty

Ivan Crnomarković

TO-DO list master

Jan Šnajder

Martial arts guru

Josip Jukić

Cleaning enthusiast

Matej Gjurković

Mess generator

Mihaela Bošnjak

Photoshop guru

Mladen Karan

Morning bird

Sara Bakić

Debater
