uta-smile / RetroXpert


Will the information leakage issue affect the performance of the EGAT model?

iamxpy opened this issue · comments

Could you please explain in more detail how the information leakage issue affects the EGAT model?

I am testing the code provided by the authors of G2Gs, and surprisingly, it turns out that replacing the original USPTO dataset with the CSV files under the folder RetroXpert/data/USPTO50K/canonicalized_csv/ causes the performance of the R-GCN (the model used in the first step of their method) to decrease. I am not completely sure about this result because I might have messed something up, so I am curious whether you also found that the information leakage issue affects graph neural networks. Thanks in advance!

commented

Hi, this information leakage will not influence EGAT.

I am not surprised that G2Gs performs badly on the canonicalized data. Their first-step accuracy is too good to be true!

If you have the G2Gs code, could you look into it and check what atom features their implementation uses? Their ICML paper does not mention which atom features are used. My guess is that they use the atom index in the features, which results in the information leak.

My guess is that they use the atom index in the features, which results in the information leak.

What is the atom index? Do you mean the atomic number or the mapping numbers?

The code used in G2Gs to extract node features:

import numpy as np
# `mol` below is an RDKit molecule (rdkit.Chem.Mol)


def construct_discrete_node_features(mol, max_size):
    # Node features: atom type (16 dim), #hydrogens (5 dim), #neighbors (7 dim),
    # total valence (6 dim), is_aromatic (1 dim), is_in_ring (1 dim); 36 dim in total.
    # Returns an array of shape (max_size, 36).
    if mol is None:
        raise ValueError('mol is None')
    N = mol.GetNumAtoms()

    node_features = np.zeros((max_size, 36), dtype=np.float32)

    atom_list = [5, 6, 7, 8, 9, 12, 14, 15, 16, 17, 29, 30, 34, 35, 50, 53]
    num_hydrogen_list = [0, 1, 2, 3, 4]
    num_neighbor_list = [0, 1, 2, 3, 4, 5, 6]
    num_valence_list = [1, 2, 3, 4, 5, 6]

    for atom in mol.GetAtoms():
        atom_id = atom.GetIdx()

        # One-hot atom type, indexed by atomic number.
        atomic_num = atom.GetAtomicNum()
        assert atomic_num in atom_list
        index_atom_type = atom_list.index(atomic_num)
        node_features[atom_id, index_atom_type] = 1.0

        # One-hot total number of hydrogens.
        num_hydrogen = atom.GetTotalNumHs()
        assert num_hydrogen in num_hydrogen_list
        node_features[atom_id, len(atom_list) + num_hydrogen] = 1.0

        # One-hot degree (number of neighbors, including hydrogens).
        num_neighbor = atom.GetTotalDegree()
        assert num_neighbor in num_neighbor_list
        node_features[atom_id, len(atom_list) + len(num_hydrogen_list) + num_neighbor] = 1.0

        # One-hot total valence.
        num_valence = atom.GetTotalValence()
        assert num_valence in num_valence_list
        node_features[atom_id, len(atom_list) + len(num_hydrogen_list) + len(num_neighbor_list) + num_valence - 1] = 1.0

        # Aromaticity and ring-membership flags.
        if atom.GetIsAromatic():
            node_features[atom_id, -2] = 1.0
        if atom.IsInRing():
            node_features[atom_id, -1] = 1.0
    return node_features  # (max_size, 36)
commented

Well, my guess was wrong. It may be related to their R-GCN. Usually a GNN does not have this problem, so I am not sure why their R-GCN does.

Thanks, I will check the code again.

G2Gs concatenates the atom embeddings before predicting the edit score. In our own experiments, we found that the concatenation order played a major role in the performance difference between the canonicalized and non-canonicalized versions, which is why we also replaced the concatenation with an order-invariant function.
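
A minimal sketch of the difference (hypothetical code, not taken from G2Gs or this repository; BondScorer and its MLP are purely illustrative):

import torch
import torch.nn as nn

class BondScorer(nn.Module):
    # Toy bond-edit scorer contrasting order-dependent vs. order-invariant
    # combination of the two endpoint atom embeddings h_u and h_v.
    def __init__(self, dim, invariant=True):
        super().__init__()
        self.invariant = invariant
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, h_u, h_v):
        if self.invariant:
            # Symmetrized: score(u, v) == score(v, u), so renumbering the atoms
            # (e.g. canonicalizing the SMILES) cannot change the prediction.
            return 0.5 * (self.mlp(torch.cat([h_u, h_v], dim=-1))
                          + self.mlp(torch.cat([h_v, h_u], dim=-1)))
        # Plain concatenation: the score depends on which atom comes first,
        # so an atom order derived from the mapping numbers can leak information.
        return self.mlp(torch.cat([h_u, h_v], dim=-1))

h_u, h_v = torch.randn(8), torch.randn(8)
print(BondScorer(8, invariant=False)(h_u, h_v))  # changes if h_u and h_v are swapped
print(BondScorer(8, invariant=True)(h_u, h_v))   # identical if h_u and h_v are swapped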

commented

That is interesting! Thanks for the info.

Hi, I still cannot understand how the data leakage influences the overall performance.
If you are only predicting which bond to break and which atom to pick to generate the reactant, why does the atom numbering matter if you are not using the atom map number as part of the atom features? Or is it related to the concatenation-order issue?
Thank you.

commented

@shuan4638
Hi, this is a good question! The data leakage influences the second stage of our method; the first stage is not affected. The second-stage model is based on the Transformer, the atom map numbers determine the atom order, and atom order matters for a Transformer model. For more details, please refer to the "Problem in our implementation" section of the README.
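
A small RDKit illustration of that point (toy molecule, not from the dataset): the same product gives the Transformer a different token sequence depending on whether the SMILES is written in the mapping-derived atom order or in RDKit's canonical order.

from rdkit import Chem

# Methyl acetate, written in an atom order that follows the mapping numbers.
mol = Chem.MolFromSmiles('[CH3:1][C:2](=[O:3])[O:4][CH3:5]')
for atom in mol.GetAtoms():
    atom.SetAtomMapNum(0)  # drop the maps; the atom order is still the mapped one

print(Chem.MolToSmiles(mol, canonical=False))  # mapping-derived order, e.g. 'CC(=O)OC'
print(Chem.MolToSmiles(mol))                   # RDKit canonical order, e.g. 'COC(C)=O'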

Thanks!
I am also wondering why you don't set all the atom map numbers to zero using
[atom.SetAtomMapNum(0) for atom in mol.GetAtoms()]
so that you can get canonical SMILES input with no data leakage at all?

commented

Because we need the atom map numbers to help find the disconnection bonds.
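
For context, here is a minimal sketch (hypothetical code, not the repository's actual preprocessing; find_broken_bonds is an illustrative name) of how atom map numbers let you locate the disconnection bond by comparing mapped bonds in the product and the reactants:

from rdkit import Chem

def find_broken_bonds(product_smi, reactants_smi):
    # Return product bonds (as pairs of atom map numbers) that are absent
    # from the reactants, i.e. the disconnection bonds.
    def mapped_bonds(smi):
        mol = Chem.MolFromSmiles(smi)
        bonds = set()
        for b in mol.GetBonds():
            a1 = b.GetBeginAtom().GetAtomMapNum()
            a2 = b.GetEndAtom().GetAtomMapNum()
            if a1 and a2:  # keep only bonds whose two atoms both carry a map number
                bonds.add(frozenset((a1, a2)))
        return bonds

    return mapped_bonds(product_smi) - mapped_bonds(reactants_smi)

# Toy esterification read in the retro direction: the ester C-O bond
# between map numbers 2 and 4 is the disconnection site.
product = '[CH3:1][C:2](=[O:3])[O:4][CH3:5]'
reactants = '[CH3:1][C:2](=[O:3])[OH:6].[OH:4][CH3:5]'
print(find_broken_bonds(product, reactants))  # {frozenset({2, 4})}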

I get it. Thank you very much!