bert dataset machine-learning natural-language-processing nlp

Dark patterns in e-commerce:
a dataset and its baseline evaluations

Yuki Yada, Jiaying Feng, Tsuneo Matsumoto, Nao Fukushima, Fuyuko Kido, Hayato Yamana

This repository provides the text-based dataset and experimental code for the paper: Dark patterns in e-commerce: a dataset and its baseline evaluations

Accepted at IEEE BigData 2022, 5th Workshop on Big Data for CyberSecurity (BigCyber-2022)

Overview

Dark patterns are user interface designs on online services that are designed to make users behave in ways they do not intend.

In this work, we construct a text-based dataset for dark pattern automatic detection on e-commerce sites and its detection using machine learning as a baseline.

For baseline detection, we applied following two methods:

Classical NLP Methods ( Bag-of-Words + Classical Machine Learning Model )
Transformer-based pre-trained language models ( e.g. BERT )

For more information, please check our Paper.

Project Structure

dataset/: dataset.tsv in this directory is the dataset for text-based dark patterns automatic detection.
darkpattern-auto-detection-classical/: Experimental code of baseline evaluation using classical NLP methods.
darkpattern-auto-detection-deeplearning/: Experimental code of baseline evaluation using transformer-based pre-trained language models.
scraping/: Code for collecting non-dark pattern texts in the dataset.

Dataset

dataset/dataset.tsv is a text-based dataset for dark pattern automatic detection (TSV Format). Dark pattern texts were obtained from Mathur et al.’s study in 2019, which consists of 1,818 dark pattern texts from shopping sites. Then, we collect non-dark pattern texts on e-commerce sites by accessing and segmenting the sites targeted by the Mathur et al.'s study.

Scraping code for non-dark pattern texts is on scraping/. That is implemented using Typscript (Javascript) and Puppeteer.

Baseline Evaluation

We conduct experiment of dark pattern auto detection using the dataset. The code is on darkpattern-auto-detection-classical/ (Classical NLP Methods) and darkpattern-auto-detection-deeplearning/ (Transformer-based pre-trained language models).

Experimental Result of Classical NLP Methods

Model	Accuracy	AUC	F1 score	Precision	Recall
Logistic Regression	0.961	0.989	0.960	0.981	0.940
SVM	0.954	0.987	0.952	0.986	0.922
Random Forest	0.958	0.989	0.957	0.984	0.932
Gradient Boosting	0.962	0.989	0.961	0.976	0.947

Experimental Result of Transformer-based pre-trained language models

Model	Accuracy	AUC	F1 score	Precision	Recall
$\text{BERT}_{base}$	0.972	0.993	0.971	0.982	0.961
$\text{BERT}_{large}$	0.965	0.993	0.965	0.973	0.957
$\text{RoBERTa}_{base}$	0.966	0.993	0.966	0.979	0.954
$\text{RoBERTa}_{large}$	$\mathbf{0.975}$	$\mathbf{0.995}$	$\mathbf{0.975}$	$\mathbf{0.984}$	$\mathbf{0.967}$
$\text{ALBERT}_{base}$	0.959	0.991	0.959	0.972	0.946
$\text{ALBERT}_{large}$	0.965	0.986	0.965	0.973	0.957
$\text{XLNet}_{base}$	0.966	0.992	0.966	0.975	0.958
$\text{XLNet}_{large}$	0.942	0.988	0.940	0.968	0.914

Acknowledgements

This project is based on the Mathur et al.’s study and its dataset. We thank their authors for making the source code publically available.

Changes from Mathur et al.’s Dataset

We extract text data from "Pattern String" column in dark-patterns.csv.
We ignored missing and duplicate text data.

About

[IEEE BigData 2022, 5th Workshop on Big Data for CyberSecurity (BigCyber-2022)] Dark patterns in e-commerce: a dataset and its baseline evaluations

https://arxiv.org/abs/2211.06543

bert dataset machine-learning natural-language-processing nlp

Apache License 2.0

Languages

Language:Python 58.4%Language:TypeScript 36.0%Language:Shell 4.7%Language:JavaScript 0.9%