IllinoisGraphBenchmark / IGB-Datasets

Largest real-world open-source graph dataset - Work done under IBM-Illinois Discovery Accelerator Institute and Amazon Research Awards and in collaboration with NVIDIA Research.

Home Page: https://arxiv.org/abs/2302.13522

Can you release the year each paper was published and organize it in the format where papers published earlier come first, similar to the OGB format?

yichuan520030910320 opened this issue

Is your feature request related to a problem? Please describe.
We all know that GNNs perform better on inductive data compared to MLPs. One thing I appreciate about OGB is that they choose nodes with higher degrees (for instance, products with higher purchase volumes or papers published earlier) as the training set. This makes sense as earlier published papers are more likely to have labels. Could IGB also be organized in this way, marking earlier published articles as the training set? I believe this wouldn't be too difficult.

Describe the solution you'd like

Please release the publication year of each paper (node) used in the construction of IGB, and assign earlier-published articles to the training set, which should be easy to implement.


Indeed, in the industrial realm, the general consensus is that GNNs are employed to handle historical data to forecast subsequent information. Furthermore, settings like those in ogbn-arxiv and ogbn-products resonate more with real-world data distribution patterns, where the more popular nodes are designated as train nodes, while other nodes are deemed temporarily unlabelled. While your provision of a higher label rate undoubtedly benefits the advancement of open-source GNN datasets, I believe that offering the publication year of the papers and providing an option for users to partition the train, val, and test sets based on these years would broaden IGB's user base and bring it more in line with practical scenarios.

Thanks for the explanation. I agree that IGB should offer "an option to partition the dataset based on publication date/year". This aligns in spirit with the IGB goals - providing flexibility to the user to perform an ablation study to understand the impact of data distribution on downstream GNN tasks. Having said this, IGB will not change the default distribution of the raw data but provide a file with paper-id to date mapping so that the user can perform this task as needed.

@akhatua2 if we have raw data on year distribution for each paper-id then we should release them to ensure the above use case is handled. Releasing this data also can enable new use cases like temporal GNNs.

Yes, I believe this idea is great. If you have the publication dates of the papers, it would be beneficial to just release them for the community, because I believe many people will have this use case. Thanks so much. Great work!

Hey you should be able to download the paper year data right now using this command:

wget https://igb-public.s3.us-east-2.amazonaws.com/raw_data/paper_id_year_data.npy

The way to read this file is:

import numpy as np
paper_id_year_data = np.load('gnndataset/dataset_generation/paper_id_year_data.npy', allow_pickle=True).tolist()

The data is stored in the form of {paper_id: year}. These paper ids are mapped for each dataset in the paper_id_index_mapping.npy file. We will process this and release this along with the next update of the dataset soon.
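For anyone who wants a year-based split before the processed release, here is a minimal sketch of how the {paper_id: year} dict could be partitioned so that earlier papers land in the training set. The function name split_by_year, the cutoff years, and the toy dict are all made up for illustration and are not part of IGB. It also assumes every value in the dict is an integer year. Mapping the resulting paper ids to node indices still has to go through paper_id_index_mapping.npy, which is not covered here.

```python
import numpy as np


def split_by_year(paper_id_year, train_until, val_until):
    """Split paper ids into train/val/test by publication year.

    train: year <= train_until
    val:   train_until < year <= val_until
    test:  year > val_until
    Hypothetical helper, not part of the IGB codebase.
    """
    ids = np.array(list(paper_id_year.keys()))
    years = np.array(list(paper_id_year.values()))

    # Sort so that papers published earlier come first.
    order = np.argsort(years, kind="stable")
    ids, years = ids[order], years[order]

    train = ids[years <= train_until]
    val = ids[(years > train_until) & (years <= val_until)]
    test = ids[years > val_until]
    return train, val, test


# Toy stand-in for the dict loaded from paper_id_year_data.npy (assumed data).
toy = {101: 2019, 102: 2015, 103: 2018, 104: 2021, 105: 2017}
train_ids, val_ids, test_ids = split_by_year(toy, train_until=2017, val_until=2018)
print(train_ids.tolist(), val_ids.tolist(), test_ids.tolist())
```

With the real file, the toy dict would be replaced by paper_id_year_data, and the two cutoffs chosen to give whatever train/val/test proportions the experiment needs.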

cool thanks!!