uta-smile / RetroXpert


Will the information leakage issue affect the performance of the EGAT model?

iamxpy opened this issue · comments

Could you please explain in more detail how the information leakage issue affects the EGAT model?

I am testing the code provided by the authors of G2Gs, and surprisingly, it turns out that replacing the original USPTO dataset with the CSV files under the folder RetroXpert/data/USPTO50K/canonicalized_csv/ causes the performance of the R-GCN (the model used in the first step of their method) to decrease. I am not completely sure about this result because I might have messed something up, so I am curious whether you also found that the information leakage issue affects graph neural networks. Thanks in advance!

commented

Hi, this information leakage will not influence EGAT.

I am not surprised that G2Gs performs badly on the canonicalized data. Their first-step accuracy is too good to be true!

If you have the G2Gs code, could you look into it and check what atom features their implementation uses? Their ICML paper does not mention which atom features are used. My guess is that they use the atom index in the features, which results in the information leak.

My guess is that they use the atom index in the features, which results in the information leak.

What is the atom index? Do you mean the atomic number or the mapping numbers?

The code used in G2Gs to extract node features:

import numpy as np
# `mol` below is an RDKit molecule (rdkit.Chem.Mol)


def construct_discrete_node_features(mol, max_size):
    # Node features: atom type (16 dim), #hydrogens (5 dim), #neighbors (7 dim),
    # total valence (6 dim), is_aromatic (1 dim), is_in_ring (1 dim); 36 dim in total.
    # Returns an array of shape (max_size, 36).
    if mol is None:
        raise ValueError('mol is None')
    N = mol.GetNumAtoms()

    node_features = np.zeros((max_size, 36), dtype=np.float32)

    atom_list = [5, 6, 7, 8, 9, 12, 14, 15, 16, 17, 29, 30, 34, 35, 50, 53]
    num_hydrogen_list = [0, 1, 2, 3, 4]
    num_neighbor_list = [0, 1, 2, 3, 4, 5, 6]
    num_valence_list = [1, 2, 3, 4, 5, 6]

    for atom in mol.GetAtoms():
        atom_id = atom.GetIdx()

        # One-hot atom type, indexed by atomic number.
        atomic_num = atom.GetAtomicNum()
        assert atomic_num in atom_list
        index_atom_type = atom_list.index(atomic_num)
        node_features[atom_id, index_atom_type] = 1.0

        # One-hot total number of hydrogens.
        num_hydrogen = atom.GetTotalNumHs()
        assert num_hydrogen in num_hydrogen_list
        node_features[atom_id, len(atom_list) + num_hydrogen] = 1.0

        # One-hot degree (number of neighbors, including hydrogens).
        num_neighbor = atom.GetTotalDegree()
        assert num_neighbor in num_neighbor_list
        node_features[atom_id, len(atom_list) + len(num_hydrogen_list) + num_neighbor] = 1.0

        # One-hot total valence.
        num_valence = atom.GetTotalValence()
        assert num_valence in num_valence_list
        node_features[atom_id, len(atom_list) + len(num_hydrogen_list) + len(num_neighbor_list) + num_valence - 1] = 1.0

        # Aromaticity and ring-membership flags.
        if atom.GetIsAromatic():
            node_features[atom_id, -2] = 1.0
        if atom.IsInRing():
            node_features[atom_id, -1] = 1.0
    return node_features  # (max_size, 36)
commented

Well, my guess was wrong. It may be related to their R-GCN. Usually a GNN does not have this problem, so I am not sure why their R-GCN does.

Thanks, I will check the code again.

G2Gs concatenates the atom embeddings before predicting the edit score. In our own experiments, we found that the concatenation order played a major role in the performance difference between the canonicalized and non-canonicalized versions, which is why we also replaced the concatenation with an order-invariant function.
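
A minimal sketch of the difference (hypothetical code, not taken from G2Gs or this repository; BondScorer and its MLP are purely illustrative):

import torch
import torch.nn as nn

class BondScorer(nn.Module):
    # Toy bond-edit scorer contrasting order-dependent vs. order-invariant
    # combination of the two endpoint atom embeddings h_u and h_v.
    def __init__(self, dim, invariant=True):
        super().__init__()
        self.invariant = invariant
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, h_u, h_v):
        if self.invariant:
            # Symmetrized: score(u, v) == score(v, u), so renumbering the atoms
            # (e.g. canonicalizing the SMILES) cannot change the prediction.
            return 0.5 * (self.mlp(torch.cat([h_u, h_v], dim=-1))
                          + self.mlp(torch.cat([h_v, h_u], dim=-1)))
        # Plain concatenation: the score depends on which atom comes first,
        # so an atom order derived from the mapping numbers can leak information.
        return self.mlp(torch.cat([h_u, h_v], dim=-1))

h_u, h_v = torch.randn(8), torch.randn(8)
print(BondScorer(8, invariant=False)(h_u, h_v))  # changes if h_u and h_v are swapped
print(BondScorer(8, invariant=True)(h_u, h_v))   # identical if h_u and h_v are swapped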

commented

That is interesting! Thanks for the info.

Hi, I still cannot understand how the data leakage influences the overall performance.
If you are only predicting which bond to break and which atom to pick to generate the reactant, why does the atom numbering matter if you are not using the atom map number as part of the atom features? Or is it related to the concatenation-order issue?
Thank you.

commented

@shuan4638
Hi, this is a good question! The data leakage influences the second stage of our method; the first stage is not affected. The second-stage model is based on the Transformer, the atom map numbers determine the atom order, and atom order matters for a Transformer model. For more details, please refer to the "Problem in our implementation" section of the README.
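
A small RDKit illustration of that point (toy molecule, not from the dataset): the same product gives the Transformer a different token sequence depending on whether the SMILES is written in the mapping-derived atom order or in RDKit's canonical order.

from rdkit import Chem

# Methyl acetate, written in an atom order that follows the mapping numbers.
mol = Chem.MolFromSmiles('[CH3:1][C:2](=[O:3])[O:4][CH3:5]')
for atom in mol.GetAtoms():
    atom.SetAtomMapNum(0)  # drop the maps; the atom order is still the mapped one

print(Chem.MolToSmiles(mol, canonical=False))  # mapping-derived order, e.g. 'CC(=O)OC'
print(Chem.MolToSmiles(mol))                   # RDKit canonical order, e.g. 'COC(C)=O'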

Thanks!
I am also wondering why you don't set all the atom map numbers to zero using
[atom.SetAtomMapNum(0) for atom in mol.GetAtoms()]
so that you can get canonical SMILES input with no data leakage at all?

commented

Because we need the atom map numbers to help find the disconnection bonds.
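
For context, here is a minimal sketch (hypothetical code, not the repository's actual preprocessing; find_broken_bonds is an illustrative name) of how atom map numbers let you locate the disconnection bond by comparing mapped bonds in the product and the reactants:

from rdkit import Chem

def find_broken_bonds(product_smi, reactants_smi):
    # Return product bonds (as pairs of atom map numbers) that are absent
    # from the reactants, i.e. the disconnection bonds.
    def mapped_bonds(smi):
        mol = Chem.MolFromSmiles(smi)
        bonds = set()
        for b in mol.GetBonds():
            a1 = b.GetBeginAtom().GetAtomMapNum()
            a2 = b.GetEndAtom().GetAtomMapNum()
            if a1 and a2:  # keep only bonds whose two atoms both carry a map number
                bonds.add(frozenset((a1, a2)))
        return bonds

    return mapped_bonds(product_smi) - mapped_bonds(reactants_smi)

# Toy esterification read in the retro direction: the ester C-O bond
# between map numbers 2 and 4 is the disconnection site.
product = '[CH3:1][C:2](=[O:3])[O:4][CH3:5]'
reactants = '[CH3:1][C:2](=[O:3])[OH:6].[OH:4][CH3:5]'
print(find_broken_bonds(product, reactants))  # {frozenset({2, 4})}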

I get it. Thank you very much!