social-media-prediction / TPIC2017

TPIC: A social media dataset for temporal popularity prediction

Home Page: https://social-media-prediction.github.io/TPIC2017/

TPIC2017 - Temporal Popularity Image Collection

Bo Wu, Chinese Academy of Sciences, Microsoft Research Asia

Temporal Popularity Image Collection (TPIC) is a large-scale social media popularity prediction dataset. It includes 680K social media posts with images from anonymized users of Flickr.com, and their photo-sharing records span 3 years. TPIC is also a multi-faceted social media dataset, consisting of photo images, user profiles, and photo metadata. We provide rescaled and normalized popularity scores based on the view count of each online post. To protect the privacy of users and their sharing behaviors, we anonymized user and post identifiers and converted post timestamps to time segments with integer indexes.

Download Data

File Format

Each row of data has a unique photo id (pid) along with a user id (uid). All the CSV files listed above have a data header that describes the meaning of each column.

USER_META.txt

Each row of this file contains the picture id, user id, comment count, has people, title length, description length, tag count, average view, group count, and average member count:

pid uid commentcount haspeople titlelen deslen tagcount avgview groupcount avgmembercount  
...  
304582	50@N31	0	0	15	0	14	199.32	1188	6601
304592	142@N94	0	0	11	9	0	615.61	67	21637
... 

The data is collected from Flickr; all user ids and photo ids are anonymized.
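As a rough illustration of the row layout above, the following sketch parses USER_META.txt rows using only the Python standard library. The names `UserMeta`, `parse_user_meta_line`, and `read_user_meta` are hypothetical helpers, not part of the dataset. The column types (integers for counts and lengths, floats for the two averages) and the whitespace-separated layout are assumptions inferred from the sample rows.

```python
from typing import Iterator, NamedTuple


class UserMeta(NamedTuple):
    """One row of USER_META.txt (column types are assumed from the samples)."""
    pid: int
    uid: str
    commentcount: int
    haspeople: int
    titlelen: int
    deslen: int
    tagcount: int
    avgview: float
    groupcount: int
    avgmembercount: float


def parse_user_meta_line(line: str) -> UserMeta:
    """Parse one whitespace-separated data row into a UserMeta record."""
    f = line.split()
    if len(f) != 10:
        raise ValueError(f"expected 10 columns, got {len(f)}: {line!r}")
    return UserMeta(
        int(f[0]), f[1], int(f[2]), int(f[3]), int(f[4]),
        int(f[5]), int(f[6]), float(f[7]), int(f[8]), float(f[9]),
    )


def read_user_meta(path: str) -> Iterator[UserMeta]:
    """Yield records from a file, skipping blank lines and the header row."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            fields = line.split()
            if not fields or fields[0] == "pid":
                continue
            yield parse_user_meta_line(line)


if __name__ == "__main__":
    sample = "304582\t50@N31\t0\t0\t15\t0\t14\t199.32\t1188\t6601"
    print(parse_user_meta_line(sample))
```

Splitting on any whitespace keeps the parser indifferent to whether the columns are separated by tabs or spaces.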

PHOTO_URL.txt

The file contains the photo URL corresponding to each photo id and user id:

pid uid url
...
9624	25@N92	https://www.flickr.com/photos/7626362@N07/1251837061
665085	275@N38	https://www.flickr.com/photos/7690920@N06/863366976
...

TIME_FLAG.txt

To make temporal information usable while protecting user privacy, we provide the year, month, day, and hour index for each photo and its user:

pid uid year month day hour_index
...
311862	11@N30	2007	3	16	4
311863	89@N59	2007	3	16	4
...
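Since the time fields are plain integers, one way to recover a per-user posting order is to sort on the tuple (year, month, day, hour_index). The sketch below does this; `build_user_timelines` is a hypothetical helper name, and the input is assumed to be data rows only (no header, no elided lines). The ordering is approximate, because the hour index is a coarse segment and the source does not state which calendar day a post in a midnight-crossing segment is assigned to.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

TimeKey = Tuple[int, int, int, int]  # (year, month, day, hour_index)


def build_user_timelines(rows: Iterable[str]) -> Dict[str, List[Tuple[TimeKey, int]]]:
    """Group TIME_FLAG.txt rows by uid and sort each user's posts by time key."""
    timelines: Dict[str, List[Tuple[TimeKey, int]]] = defaultdict(list)
    for line in rows:
        pid, uid, year, month, day, hour_index = line.split()
        key = (int(year), int(month), int(day), int(hour_index))
        timelines[uid].append((key, int(pid)))
    for posts in timelines.values():
        posts.sort()  # tuples compare lexicographically: time key first, then pid
    return dict(timelines)


if __name__ == "__main__":
    sample_rows = [
        "311862\t11@N30\t2007\t3\t16\t4",
        "311863\t89@N59\t2007\t3\t16\t4",
    ]
    for uid, posts in build_user_timelines(sample_rows).items():
        print(uid, posts)
```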

The hour index is defined as follows:

  • Hour Index

  • 0: 2am-6am

  • 1: 6am-10am

  • 2: 10am-2pm

  • 3: 2pm-6pm

  • 4: 6pm-10pm

  • 5: 10pm-2am
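The mapping above can be expressed as a short function from a 24-hour clock hour to the hour index. This is a minimal sketch: `hour_to_index` is a hypothetical name, and it assumes each segment includes its start hour and excludes its end hour (so 6am falls in segment 1), which the list above does not state explicitly.

```python
def hour_to_index(hour: int) -> int:
    """Map an hour of day (0-23) to the six 4-hour segments listed above.

    Assumes half-open segments [start, end); segment 0 starts at 2am, so the
    hour is shifted back by 2 (wrapping around midnight) before dividing by 4.
    """
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in 0..23, got {hour}")
    return ((hour - 2) % 24) // 4


if __name__ == "__main__":
    for h in (0, 2, 6, 10, 14, 18, 22):
        print(h, hour_to_index(h))
```

The modulo step is what places midnight and 1am in segment 5 together with 10pm and 11pm.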

LABEL.txt

The label file contains the popularity score (log-views) for each picture id and its associated user id:

pid uid logview
...
9624	25@N92	3.2
665085	275@N38	2.3
...
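Because every file is keyed by pid and uid, labels can be matched to rows of the other files with a simple dictionary join. The sketch below pairs LABEL.txt rows with PHOTO_URL.txt rows; `index_by_key` and `join_labels_with_urls` are hypothetical helper names, and the code assumes whitespace-separated rows with the key in the first two columns.

```python
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str]  # (pid, uid)


def index_by_key(lines: Iterable[str]) -> Dict[Key, str]:
    """Map (pid, uid) to the third column, skipping headers and short lines."""
    table: Dict[Key, str] = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or fields[0] == "pid":
            continue
        table[(fields[0], fields[1])] = fields[2]
    return table


def join_labels_with_urls(
    label_lines: Iterable[str], url_lines: Iterable[str]
) -> List[Tuple[str, str, float, str]]:
    """Inner-join labels and URLs on (pid, uid), keeping label order."""
    urls = index_by_key(url_lines)
    joined = []
    for (pid, uid), logview in index_by_key(label_lines).items():
        url = urls.get((pid, uid))
        if url is not None:
            joined.append((pid, uid, float(logview), url))
    return joined


if __name__ == "__main__":
    labels = ["pid uid logview", "9624\t25@N92\t3.2", "665085\t275@N38\t2.3"]
    urls = [
        "pid uid url",
        "9624\t25@N92\thttps://www.flickr.com/photos/7626362@N07/1251837061",
        "665085\t275@N38\thttps://www.flickr.com/photos/7690920@N06/863366976",
    ]
    for record in join_labels_with_urls(labels, urls):
        print(record)
```

Rows whose key is missing from the URL file are dropped here; a left join that keeps them with an empty URL would be an equally reasonable choice.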

Citation

@inproceedings{Wu2017DTCN,
  title={Sequential Prediction of Social Media Popularity with Deep Temporal Context Networks},
  author={Wu, Bo and Cheng, Wen-Huang and Zhang, Yongdong and Huang, Qiushi and Li, Jintao and Mei, Tao},
  booktitle={IJCAI},
  year={2017}
  }

Related Publications

  1. Bo Wu, Tao Mei, Wen-Huang Cheng, and Yongdong Zhang. Unfolding Temporal Dynamics: Predicting Social Media Popularity Using Multi-scale Temporal Decomposition. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI'16). AAAI Press, 272-278, 12-17 February, 2016, Phoenix, USA.

  2. Bo Wu, Wen-Huang Cheng, Yongdong Zhang, and Tao Mei. Time Matters: Multi-scale Temporalization of Social Media Popularity. In Proceedings of the 2017 ACM on Multimedia Conference (ACM MM '17). ACM, New York, NY, USA, 1336-1344.

  3. Bo Wu, Wen-Huang Cheng, Peiye Liu, Bei Liu, Zhaoyang Zeng, and Jiebo Luo. SMP Challenge: An Overview of Social Media Prediction Challenge 2019. In Proceedings of the 27th ACM International Conference on Multimedia (ACM MM), 2019.