Can you release the year each paper was published and organize it in the format where papers published earlier come first, similar to the OGB format?

Question

yichuan520030910320 opened this issue 10 months ago · comments

Is your feature request related to a problem? Please describe.
We all know that GNNs perform better on inductive data compared to MLPs. One thing I appreciate about OGB is that they choose nodes with higher degrees (for instance, products with higher purchase volumes or papers published earlier) as the training set. This makes sense as earlier published papers are more likely to have labels. Could IGB also be organized in this way, marking earlier published articles as the training set? I believe this wouldn't be too difficult.

Describe the solution you'd like

Please release the publication year of the papers (each node) used in the construction of IGB. Set the train node to earlier published articles, which should be easy to implement.

Vikram Sharma · Answer 1 · Thu Oct 05 2023 12:48:15 GMT+0800 (China Standard Time)

Please describe in detail why this is relevant to IGB, especially when most nodes are labeled. Even with IGBH-full, compared to OGB, IGBH offers significantly higher label proportions. In any case, the consequences of organizing by year require more justification than just a comparison claiming it is better. The ask also deviates from the primary purpose of IGB, i.e., presenting raw data as it is for better model and system development (this is critical for practical purposes).

yichuan~ · Answer 2 · Fri Oct 06 2023 01:07:33 GMT+0800 (China Standard Time)

Indeed, in the industrial realm, the general consensus is that GNNs are employed to handle historical data to forecast subsequent information. Furthermore, settings like those in ogbn-arxiv and ogbn-products resonate more with real-world data distribution patterns, where the more popular nodes are designated as train nodes, while other nodes are deemed temporarily unlabelled. While your provision of a higher label rate undoubtedly benefits the advancement of open-source GNN datasets, I believe that offering the publication year of the papers and providing an option for users to partition the train, val, and test sets based on these years would broaden IGB's user base, making it more in line with practical scenarios.

Vikram Sharma · Answer 3 · Fri Oct 06 2023 01:15:24 GMT+0800 (China Standard Time)

Thanks for the explanation. I agree that IGB should offer "an option to partition the dataset based on publication date/year". This aligns in spirit with the IGB goals - providing flexibility to the user to perform an ablation study to understand the impact of data distribution on downstream GNN tasks. Having said this, IGB will not change the default distribution of the raw data but provide a file with paper-id to date mapping so that the user can perform this task as needed.

@akhatua2 if we have raw data on year distribution for each paper-id then we should release them to ensure the above use case is handled. Releasing this data also can enable new use cases like temporal GNNs.

yichuan~ · Answer 4 · Fri Oct 06 2023 01:44:01 GMT+0800 (China Standard Time)

Yes, I believe this idea is great. If you have the publication dates of the papers, it would be beneficial to just release them for the community, because I believe many people will have this use case. Thanks so much. Great work!

Arpandeep Khatua · Answer 5 · Fri Oct 06 2023 12:12:58 GMT+0800 (China Standard Time)

Hey you should be able to download the paper year data right now using this command:

wget https://igb-public.s3.us-east-2.amazonaws.com/raw_data/paper_id_year_data.npy

The way to read this file is:

import numpy as np
paper_id_year_data = np.load('gnndataset/dataset_generation/paper_id_year_data.npy', allow_pickle=True).tolist()

The data is stored in the form of {paper_id: year}. These paper ids are mapped for each dataset in the paper_id_index_mapping.npy file. We will process this and release this along with the next update of the dataset soon.
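For anyone who wants the year-based split discussed above, here is a minimal sketch of one way to build it from a {paper_id: year} dict. It is not part of IGB: the function name `temporal_split`, the split fractions, and the toy data are all hypothetical, and it assumes the paper ids are available as a sequence ordered by node index (the actual layout of paper_id_index_mapping.npy may differ, so adapt that step accordingly).

```python
import numpy as np

def temporal_split(paper_id_year, paper_ids_by_index,
                   train_frac=0.6, val_frac=0.2, missing_year=-1):
    """Split node indices so that earlier-published papers come first.

    paper_id_year:      dict {paper_id: year}, as described in the thread.
    paper_ids_by_index: sequence where position i holds the paper id of
                        node i (ASSUMED layout, not confirmed by IGB).
    Nodes without a known year are left out of every split.
    """
    years = np.array(
        [paper_id_year.get(int(pid), missing_year) for pid in paper_ids_by_index],
        dtype=np.int64,
    )
    # Keep only nodes whose publication year is known.
    known = np.flatnonzero(years != missing_year)
    # Stable sort keeps the original node order among papers of the same year.
    order = known[np.argsort(years[known], kind="stable")]

    n_train = int(train_frac * len(order))
    n_val = int(val_frac * len(order))
    train_idx = order[:n_train]
    val_idx = order[n_train:n_train + n_val]
    test_idx = order[n_train + n_val:]
    return train_idx, val_idx, test_idx

# Toy data standing in for the real files (hypothetical ids and years).
paper_id_year = {101: 2015, 102: 2009, 103: 2020, 104: 2012, 105: 2018}
paper_ids_by_index = np.array([101, 102, 103, 104, 105, 106])  # 106 has no year

train_idx, val_idx, test_idx = temporal_split(paper_id_year, paper_ids_by_index)
print(train_idx, val_idx, test_idx)
```

With the real data, `paper_id_year` would be the dict loaded from paper_id_year_data.npy. The earliest papers end up in the train split and the most recent in the test split, which mirrors the OGB-style setup requested in the question.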

yichuan~ · Answer 6 · Fri Oct 06 2023 12:21:51 GMT+0800 (China Standard Time)

cool thanks!!