Change in column names changes the outcome of HillClimbSearch
wakidal opened this issue · comments
Subject of the issue
HillClimbSearch() returns different outputs for the same dataset when only the column names differ: one copy has English column names and the other has names containing Japanese characters.
Why does this happen, and what is the logic behind it?
Is there any way to control it?
Environment
- Python 3.10.2
- pgmpy 0.1.19
- Windows
import numpy as np
import pandas as pd
from pgmpy.estimators import BicScore, HillClimbSearch
# creating test data
data = pd.DataFrame(np.random.randint(0, 4, size=(5000, 6)), columns=['X1', 'X2', 'X3','X4','X5','X6'])
data['X1'] = data["X2"] + data["X3"]
data["X4"] = data["X5"] + data["X6"]
# rename the columns to names that contain Japanese characters
data2 = data.rename(columns={'X1':'いち_X1','X2':'に_X2','X3':'さん_X3','X4':'よん_X4','X5':'ご_X5','X6':'ろく_X6'})
def do_HCS(x):
    HC1 = HillClimbSearch(x)
    network = HC1.estimate(scoring_method=BicScore(x))
    return network.edges()
do_HCS(data)
#[('X2', 'X1'), ('X3', 'X1'), ('X5', 'X4'), ('X6', 'X4')]
do_HCS(data2)
#OutEdgeView([('に_X2', 'いち_X1'), ('さん_X3', 'いち_X1'), ('よん_X4', 'ご_X5'), ('ろく_X6', 'ご_X5'), ('ろく_X6', 'よん_X4')])
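To see exactly where the two reported results differ, the renamed edges can be mapped back to the original column names and compared as sets. The sketch below is only an illustration: `normalize_edges` is a hypothetical helper, not part of pgmpy, and the edge lists are copied by hand from the two outputs above.

```python
# Hypothetical helper (not a pgmpy function): translate edge endpoints back to
# the original column names so that two edge lists can be compared as sets.
def normalize_edges(edges, rename_map):
    inverse = {new: old for old, new in rename_map.items()}
    return {(inverse.get(u, u), inverse.get(v, v)) for u, v in edges}

# Same mapping as the one passed to data.rename(...) in the report.
rename_map = {
    'X1': 'いち_X1', 'X2': 'に_X2', 'X3': 'さん_X3',
    'X4': 'よん_X4', 'X5': 'ご_X5', 'X6': 'ろく_X6',
}

# Edge lists copied from the two reported outputs.
edges_en = [('X2', 'X1'), ('X3', 'X1'), ('X5', 'X4'), ('X6', 'X4')]
edges_ja = [('に_X2', 'いち_X1'), ('さん_X3', 'いち_X1'),
            ('よん_X4', 'ご_X5'), ('ろく_X6', 'ご_X5'), ('ろく_X6', 'よん_X4')]

base = normalize_edges(edges_en, {})
renamed = normalize_edges(edges_ja, rename_map)

only_in_en = base - renamed   # edges found only with English names
only_in_ja = renamed - base   # edges found only with Japanese names

print(sorted(only_in_en))
print(sorted(only_in_ja))
```

In the reported outputs, the X2/X3 → X1 part is identical; the difference is confined to the X4, X5, X6 group, where one edge is reversed and one extra edge appears.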
@wakidal Thanks for reporting this, but I am not able to reproduce the issue. On my machine, both datasets give the same result.
In [16]: do_HCS(data)
...:
0%| | 4/1000000 [00:00<7:16:25, 38.19it/s]
Out[16]: OutEdgeView([('X2', 'X1'), ('X3', 'X1'), ('X5', 'X4'), ('X6', 'X4')])
In [17]: do_HCS(data2)
...:
0%| | 4/1000000 [00:00<7:31:00, 36.95it/s]
Out[17]: OutEdgeView([('に_X2', 'いち_X1'), ('さん_X3', 'いち_X1'), ('ご_X5', 'よん_X4'), ('ろく_X6', 'よん_X4')])
I can confirm this: I am also unable to reproduce the issue.
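One thing that may help with reproduction: the test data in the report is drawn with `np.random.randint` and no seed, so every run and every machine gets different data. A minimal sketch of a seeded variant follows; `make_data` and the seed value are assumptions added for illustration, not something from the original report.

```python
import numpy as np
import pandas as pd

# Hypothetical helper: builds the same kind of test data as in the report,
# but from a seeded generator so that repeated runs produce identical frames.
def make_data(seed=0, n_rows=5000):
    rng = np.random.default_rng(seed)
    cols = ['X1', 'X2', 'X3', 'X4', 'X5', 'X6']
    data = pd.DataFrame(rng.integers(0, 4, size=(n_rows, 6)), columns=cols)
    # Same deterministic dependencies as in the report.
    data['X1'] = data['X2'] + data['X3']
    data['X4'] = data['X5'] + data['X6']
    return data

a = make_data(seed=0)
b = make_data(seed=0)
print(a.equals(b))  # identical data on every call with the same seed
```

With a fixed seed, anyone running `do_HCS` on the English and Japanese versions is at least starting from the same values, which removes one source of variation when comparing results.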