pgmpy / pgmpy

Python Library for learning (Structure and Parameter), inference (Probabilistic and Causal), and simulations in Bayesian Networks.

Home Page: https://pgmpy.org/

Error in DBN fit given data, "unexpected states"

Paschas opened this issue

Subject of the issue

I created a DBN given a specific structure of nodes and edges. When I try to fit the model with my data,
I get the following error: Data contains unexpected states for variable: "node_x"
I have confirmed that this error is a result of the following:

  • My (binary) data do not exhibit all of their possible states [-1, 1] at the first time point

Your environment

  • pgmpy 0.1.19
  • Python 3.9.13
  • PyCharm

Steps to reproduce

  • Create binary data in which at least one column contains only a single value.
    In the example below, columns (1_0) and (2_0) each exhibit only one of their two possible states.
Data example (columns are (node, time) pairs, rows are samples):

|   | (1_0) | (2_0) | (3_0) | (4_0) | (5_0) | (6_0) | (7_0) | ... | (1_150) | (1_150) |
|---|-------|-------|-------|-------|-------|-------|-------|-----|---------|---------|
| 1 | 1     | -1    | 1     | -1    | 1     | -1    | 1     | ... | -1      | 1       |
| 2 | 1     | -1    | -1    | 1     | -1    | 1     | -1    | ... | 1       | 1       |
| 3 | 1     | -1    | 1     | -1    | 1     | -1    | 1     | ... | -1      | 1       |
from pgmpy.models import DynamicBayesianNetwork as DBN

dbn = DBN(given_structure)
dbn.fit(example_data)
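For completeness, here is a self-contained sketch that should reproduce the error on pgmpy 0.1.19, assuming DBN.fit expects (node, time_slice) tuple columns as used later in this thread; the node names A and B are hypothetical:

import pandas as pd
from pgmpy.models import DynamicBayesianNetwork as DBN

# Column ('A', 0) only ever shows state 1, while ('A', 1) also contains -1.
data = pd.DataFrame({
    ('A', 0): [1, 1, 1, 1],
    ('B', 0): [-1, 1, -1, 1],
    ('A', 1): [1, -1, 1, -1],
    ('B', 1): [1, -1, 1, 1],
})

dbn = DBN([(('A', 0), ('A', 1)), (('B', 0), ('B', 1))])
dbn.fit(data)  # expected to raise: Data contains unexpected states for variable ...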

Expected behaviour

It should be possible to fit the model on data that do not exhibit all of their possible states at the first time point.

Actual behaviour

error: Data contains unexpected states for variable: "node_x"

Possible Solution

I observed that BN.fit has a state_names argument and tried to pass it to DBN.fit.
Unfortunately, I ran into many collateral errors.
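For reference, the state_names argument does work for a plain (non-dynamic) network; a minimal sketch using the binary [-1, 1] states from above (DBN.fit does not currently expose this argument):

import pandas as pd
from pgmpy.models import BayesianModel

data = pd.DataFrame({'A': [1, 1, 1, -1], 'B': [-1, 1, -1, 1]})
model = BayesianModel([('A', 'B')])

# Declare every possible state up front, so the estimator does not have to
# infer the state space from the observed values alone.
model.fit(data, state_names={'A': [-1, 1], 'B': [-1, 1]})
print(model.get_cpds('B'))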

I'm having a similar issue.

I can run the test data just fine, and it fits states between 0 and 10. But when I try this with my own data, which is also int64, I get the same error.

Any thoughts?

@BKAmos dbn.fit() infers the states of each node from the first time point of your data, i.e. from the values in the (NodeX_0) column of your data matrix.
If a later column of the same NodeX contains a value that was not present at t=0, you get the "unexpected states" error.

For a state value x:
    x ∈ (NodeX_t) for some t > 0  AND  x ∉ (NodeX_0)  ⇒  "unexpected states" error
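A quick way to check which nodes will trip this, assuming the DataFrame uses (node, time) tuple columns as elsewhere in this thread (find_unexpected_states is a hypothetical helper, not part of pgmpy):

def find_unexpected_states(data):
    # For each node, collect the states seen at t=0 and at t>0,
    # then report any state that never appears at t=0.
    nodes = {col[0] for col in data.columns}
    for node in nodes:
        t0_states = set(data[(node, 0)].unique())
        later_states = set()
        for col in data.columns:
            if col[0] == node and col[1] > 0:
                later_states |= set(data[col].unique())
        missing = later_states - t0_states
        if missing:
            print(node, 'has states', sorted(missing), 'that never appear at t=0')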

How many nodes do you use in your DBN? Do they all have the same states?

Got it. Thanks for that explanation and for digging into the code. That really doesn't lend itself to many nodes and shallow data (batches), which is what I have.

In my DBN I have 84 nodes and 11 time points. Each node has 4 possible states due to the 4 batch replicates. It works if I limit the states to 0 or 1, but if I increase the number of states, say to between 0 and 10, I get the errors mentioned.

@BKAmos Would it be possible to share a reproducible code example so that I can have a look at the bug?

Hello @ankurankan !

Thanks for the software. Also, apologies for taking so long to get back to you.

So, I figured it out. It had to do with the way I was binning the data. If a state is observed at time_n but not at time_0, it throws the error. So if we want to include many states, we need to make sure that each state is represented for each node at time_0. At least that's how I understand it.
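One way to make this less likely, sketched under the assumption that the columns are (node, time) tuples: compute a single set of bin edges per node from its values pooled across all time points, rather than binning each column separately, so that a bin label means the same thing at every time point (bin_with_shared_edges is a hypothetical helper, not part of pgmpy):

import numpy as np
import pandas as pd

def bin_with_shared_edges(df, node_names, n_bins=3):
    # Cut every time column of a node with the same quantile-based edges,
    # computed from that node's values pooled across all time points.
    out = pd.DataFrame(index=df.index)
    for node in node_names:
        cols = [c for c in df.columns if c[0] == node]
        pooled = df[cols].to_numpy().ravel()
        edges = np.unique(np.quantile(pooled, np.linspace(0, 1, n_bins + 1)))
        labels = list(range(len(edges) - 1))
        for c in cols:
            out[c] = pd.cut(df[c], bins=edges, labels=labels, include_lowest=True)
    return out

This still cannot guarantee that every state shows up at t=0, but it avoids the situation where per-column bins assign different labels to the same underlying values at different time points.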

Below is the code that I am running. Currently it generates 3 states (0,1,2) based on values across the 4 samples.

from pgmpy.models import DynamicBayesianNetwork as DBN
import networkx as nx
from pgmpy.models import BayesianModel
from pgmpy.inference import DBNInference, VariableElimination
from pgmpy.estimators import ExhaustiveSearch, K2Score, MmhcEstimator, ParameterEstimator, HillClimbSearch, ExpectationMaximization
from itertools import product
from pgmpy.metrics import correlation_score
import numpy as np
import pandas as pd
import multiprocessing as mp
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
def binningZeroOne(data, columnHeadersClean, listOfLabels):

    discretizedDF1 = pd.DataFrame()
    bins1 = []

    for i in range(len(columnHeadersClean)):

        # Use the column's own sorted values as the bin edges.
        bins = data[columnHeadersClean[i]]
        sortedBins = bins.sort_values()

        # sortedBins = sortedBins.iloc[1:]
        # sortedBins += 1

        print(sortedBins)

        # Discretize the column into the given labels.
        discretizedDF1[columnHeadersClean[i]] = pd.cut(
            x=data[columnHeadersClean[i]], bins=sortedBins, precision=30,
            duplicates='drop', labels=listOfLabels, include_lowest=True)

        # Same cut, but retbins=True also returns the bin edges used.
        binRange = pd.cut(
            x=data[columnHeadersClean[i]], bins=sortedBins, precision=30,
            duplicates='drop', labels=listOfLabels, include_lowest=True,
            retbins=True)

        bins1.append(binRange)

    return discretizedDF1, bins1
if __name__ == '__main__':

    ### import the data
    data_raw = pd.read_csv('test_FilteredData.csv')
    col_list = data_raw.columns.tolist()
    sample_list = data_raw.Sample.values.tolist()
    scaler = MinMaxScaler()
    
    ### Scale data and format appropriately
    noSample = data_raw.drop(['Sample'], axis=1)
    data_scaled = scaler.fit_transform(noSample.to_numpy())
    data_scaled = pd.DataFrame(data_scaled)
    data_scaled_100 = data_scaled.select_dtypes(exclude=['object', 'datetime'])


    # for i in data_raw.columns:
    #     if 'DPI' in i:
    #         data_raw[i] = clr(data_raw[i])
    #     else:
    #         continue

    # for i in data_raw.columns:
    #     if 'DPI' in i:
    #         data_raw[i] = data_raw[data_raw[i] != '-inf']
    
    #     data_scaled_100[i] = pd.cut(x = data_scaled_100[i], bins = 2, labels = [0,1], precision=30)
    data_scaled_100.insert(0, 'Sample', sample_list)
    data_scaled_100.columns = col_list

    ### Additional data formatting
    data = data_scaled_100.set_index('Sample').T
    data['Day'] = data.index.str.split('_').str[1]
    data['Rep'] = data.index.str.split('_').str[2]
    data = data.sort_values(by='Day')



    # %% reformat data to list consecutive samples in the same row
    isolate_names = data.columns[data.columns.str.contains('Lj')]
    dbn_data_names = list('('+isolate_names+',0)')+list('('+isolate_names+',1)')
    dbn_samples = data.index[data['Day'] != max(data['Day'])]
    dbn_df = pd.DataFrame(index=dbn_samples,columns=dbn_data_names)
    t0_data = data.loc[data['Day'] != max(data['Day']),data.columns[data.columns.str.contains('Lj')]]
    t1_data = data.loc[data['Day'] != min(data['Day']),data.columns[data.columns.str.contains('Lj')]]
    dbn_df.loc[:,dbn_df.columns.str.contains(',0')] = t0_data.values
    dbn_df.loc[:,dbn_df.columns.str.contains(',1')] = t1_data.values



    #%% list all backwards-in-time connections to exclude
    blacklist = list(product('('+isolate_names+',1)','('+isolate_names+',0)'))
    blacklist1 = list(product('('+isolate_names+',1)','('+isolate_names+',1)'))
    
    blacklist2 = blacklist + blacklist1
    
    # %% dbn structure
    hcs = HillClimbSearch(dbn_df)
    dbn_edges = hcs.estimate(black_list=blacklist2).edges


    #%% Node modification for DBN creation
    dbn_edges_correct = []
    for i in dbn_edges:
        node0 = i[0].split(',')
        node1 = i[1].split(',')
        print(node0)
        print(node1)
        isolate0 = node0[0]
        time0 = node0[1]
        isolate1 = node1[0]
        time1 = node1[1]
        isolate0 = isolate0[1:]
        time0 = time0[:-1]
        isolate1 = isolate1[1:]
        time1 = time1[:-1]
        dbn_edges_correct.append(((str(isolate0), int(time0)), (str(isolate1), int(time1))))
        # print(type(test[1]))

    #%% Modify the original data to work within the DBN framework
    
    ### Gather list of isolates
    isolates = data_raw.iloc[:,0]

    # ### Get days, batches, and sample distribution (sample dist was originally an idea for binning)
    days1 = []
    batches1 = []
    # data_distrubution = []
    for cols in data_raw.columns:
        if cols.startswith('DPI'):
            components = cols.split("_")
            batches = components[2]
            days = components[1]
            # print(data[cols].describe())
            # data_distrubution.append(data_raw[cols].describe())
            
            days1.append(days)
            # print(days)
            batches1.append(batches)
           
        else:
            pass


    columnHeadersClean = data_raw.columns[1:]

    ### Discretize the data
    # initialBinning = binninglabels(data_raw, columnHeadersClean, 10, [0,1,2,3,4,5,6,7,8,9])
    
    # dDFnoNan = initialBinning.fillna(0)

    timePoints = set(days1)
    batches = set(batches1)
    batches = list(batches)

    ### For 2 t step BN structure prediction    
    colNames = []
    for i in isolates:
        for j in range(len(timePoints)):
            colNames.append(str(i) +'_'+str(j))

    ### Create a list of DFs by batch
    batchDFs = []
    for i in range(len(batches)):
        batchDFs.append(data_raw.filter(regex=batches[i]))

    ### Flatten the batches into isolate by time arrays
    flattenedDFs = []
    for i in batchDFs:
        test7 = i.to_numpy().flatten()
        test8 = np.float64(test7)
        flattenedDFs.append(test8)

    ### Transform the data frame so that it's in batches by isolate by time
    dbnFrame = pd.DataFrame(np.column_stack(flattenedDFs))
    dbnTransformed = dbnFrame.T

    dbnTransformed.columns = colNames

    colNames.sort(key = lambda x: x.split('_')[1])

    dbnTransformed1 = dbnTransformed[colNames]

    newColNames = []

    for r in dbnTransformed1.columns:
        split = r.split('_')
        newColNames.append((str(split[0]), int(split[1])))

    
    
    dbnTransformed1.columns = newColNames
    final_decomp, binRanges = binningZeroOne(dbnTransformed1, newColNames, [0,1,2])

    isolate_time = dbnTransformed1.columns
    startRange1 = []
    midRange1 = []
    endRange1 = []

    for i in binRanges:
        startRange = i[1][0]
        midRange = i[1][1]
        endRange = i[1][2]
        startRange1.append(startRange)
        midRange1.append(midRange)
        endRange1.append(endRange)

    isolateRange = []
    isolateRangeTime = []

    for i in isolate_time:
        isolateRange.append(i[0])
        isolateRangeTime.append(i[1])

    data_tuples = list(zip(isolateRange,isolateRangeTime,startRange1,midRange1,endRange1))
    df1 = pd.DataFrame(data_tuples, columns=['Isolate', 'Time', 'startRange', 'midRange', 'endRange'])
    # dbn_edges_correct.pop()
    ### Fit the model with the edges and data
    model1 = DBN()
    # model1.add_nodes_from(['LjN209', 'LjR1', 'LjR10', 'LjR104'])
    
    # print(model1.nodes)
    model1.add_edges_from(dbn_edges_correct)
    # print(len(model1.edges()))
    model1.fit(final_decomp)
    model1.initialize_initial_state()

From there you can simulate the model. In its current state I don't think it's the most appropriate, but at least I figured out the state issue, from what I can tell.

test_FilteredData.csv

@Paschas Thanks for sharing the code. And yes, currently the DBN implementation makes the assumption of it being a 2-TBN and hence requires all the nodes to be the same at each time point.
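For reference, a minimal sketch of that 2-TBN convention: nodes are (name, time_slice) tuples, only slices 0 and 1 are represented, and the learned slice-0-to-1 CPDs are reused for later transitions (the node names X and Y are hypothetical):

from pgmpy.models import DynamicBayesianNetwork as DBN

dbn = DBN()
dbn.add_edges_from([
    (('X', 0), ('Y', 0)),  # intra-slice edge at t=0
    (('X', 0), ('X', 1)),  # temporal edge X_t -> X_{t+1}
    (('Y', 0), ('Y', 1)),
])

# pgmpy keeps the two slices consistent, replicating intra-slice structure,
# so both slices contain the same set of nodes.
print(dbn.nodes())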

Just to make sure that I understand correctly: all possible states of a node need to be represented at t_0? If there are states present in nodes at later time points that are not present at t_0, that's when the error occurs? How does this make sense for modeling items that are low in presence early in time but abundant later?