shreyashankar / streams

STREAMS: A Benchmark of Naturalistic Streaming Data for Online Continual Learning

Jeopardy

shreyashankar opened this issue

https://www.kaggle.com/tunguz/200000-jeopardy-questions
Interesting dataset because real Jeopardy contestants have used domain-switching as a strategy to outmaneuver other players.
Does represent something of a natural (although admittedly niche) distribution of trivia questions, since it was not specifically created for an NLP task.
Data needs to be cleaned up a little though (remove HTML tags around some questions)
X: question (string)
Y: answer (string)
Domains:
Category
Value ($200, $400, …, $2000)
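A minimal sketch of the cleanup and domain grouping described above, using only the Python standard library. The row keys ("question", "answer", "category") and the helper names are hypothetical and not taken from the Kaggle file; the regex-based tag stripping is a rough assumption that should be enough for simple anchor tags but is not a full HTML parser.

```python
import html
import re
from collections import defaultdict

# Matches anything that looks like an HTML tag, e.g. <a href="..."> or </a>.
_TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text):
    """Remove HTML tags, decode entities, and collapse whitespace."""
    no_tags = _TAG_RE.sub(" ", text)
    return " ".join(html.unescape(no_tags).split())


def parse_value(value):
    """Turn a string like '$2,000' into 2000; return None if no digits are present."""
    digits = re.sub(r"[^0-9]", "", value or "")
    return int(digits) if digits else None


def group_by_domain(rows, key):
    """Group cleaned (question, answer) pairs by a domain column (hypothetical key names)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append((strip_html(row["question"]), row["answer"]))
    return dict(groups)
```

Grouping by category or by parsed value gives one bucket of (X, Y) pairs per domain, which can then be ordered into a stream.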
Domain Shifts:
Covariate Shift: P(answer|question) doesn’t change but P(question) does
Higher value questions tend to be more difficult
Different categories have different questions
Label Shift: P(question|answer) doesn’t change but P(answer) does
N/A here (since P(question|answer) changes as well)
Concept Shift:
Many of the questions have factual answers that change over time (e.g., who is the US president)
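One rough way to sanity-check the covariate shift claim (P(question) differs across categories or values) is to compare word distributions of questions from two domains. The sketch below uses total variation distance between unigram distributions; this is an illustrative proxy chosen here, not a method proposed in the issue, and the function names are hypothetical.

```python
from collections import Counter


def unigram_dist(texts):
    """Empirical distribution over lowercased whitespace tokens."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = same, 1 = disjoint)."""
    vocab = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in vocab)
```

A distance near 0 between two domains would suggest little shift in question wording; a distance near 1 would suggest the domains share almost no vocabulary.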