Runtime error: ZeroDivisionError: division by zero

Question


serend1p1ty opened this issue 6 years ago · comments

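Hello, I am using alipy to compare the performance of MMC and Random on the scene dataset (a multi-label dataset). For some reason it always fails with ZeroDivisionError: division by zero. I have read the alipy documentation several times, and I believe my code is correct.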

import numpy as np
import copy
from alipy import ToolBox
from alipy.index import get_Xy_in_multilabel
from alipy.query_strategy.multi_label import QueryMultiLabelAUDI, QueryMultiLabelMMC, \
                        QueryMultiLabelAdaptive, QueryMultiLabelRandom, LabelRankingModel

import arff
dataset = arff.load(open('scene.arff', 'r'))
data = np.array(dataset['data'])
X = data[:, :294].astype('float64')
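# Convert to float first so that the 0 -> -1 replacement also works
# if the nominal label columns were loaded as strings
mult_y = data[:, 294:].astype('float64')
mult_y[mult_y == 0] = -1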

alibox = ToolBox(X=X, y=mult_y, query_type='PartLabels')
alibox.split_AL(test_ratio=0.2, initial_label_rate=0.05, all_class=False)


def main_loop(alibox, round, strategy):
    train_idx, test_idx, label_ind, unlab_ind = alibox.get_split(round)
    # Get intermediate results saver for one fold experiment
    saver = alibox.get_stateio(round)
    # base model
    model = LabelRankingModel()

    # A simple stopping criterion to specify the query budget.
    while len(label_ind) <= 1500:
        # query and update
        import ipdb; ipdb.set_trace(context=7)
        select_labs = strategy.select(label_ind, unlab_ind)
        # use cost to record the amount of queried instance-label pairs
        if len(select_labs[0]) == 1:
            cost = mult_y.shape[1]
        else:
            cost = len(select_labs)
        label_ind.update(select_labs)
        unlab_ind.difference_update(select_labs)

        # train/test
        X_tr, y_tr, _ = get_Xy_in_multilabel(label_ind, X=X, y=mult_y, unknown_element=0)
        model.fit(X=X_tr, y=y_tr)
        pres, pred = model.predict(X[test_idx])
        perf = alibox.calc_performance_metric(y_true=mult_y[test_idx], y_pred=pred, performance_metric='hamming_loss')

        # save
        st = alibox.State(select_index=select_labs, performance=perf, cost=cost)
        saver.add_state(st)

    return copy.deepcopy(saver)


audi_result = []
random_result = []
mmc_result = []
adaptive_result = []

for round in range(5):
    # init strategies
    # audi = QueryMultiLabelAUDI(X, mult_y)
    mmc = QueryMultiLabelMMC(X, mult_y)
    # adaptive = QueryMultiLabelAdaptive(X, mult_y)
    random = QueryMultiLabelRandom()

    # audi_result.append(main_loop(alibox, round, strategy=audi))
    mmc_result.append(main_loop(alibox, round, strategy=mmc))
    # adaptive_result.append(main_loop(alibox, round, strategy=adaptive))
    random_result.append(main_loop(alibox, round, strategy=random))

analyser = alibox.get_experiment_analyser(x_axis='cost')
# analyser.add_method(method_name='AUDI', method_results=audi_result)
analyser.add_method(method_name='RANDOM', method_results=random_result)
analyser.add_method(method_name='MMC', method_results=mmc_result)
# analyser.add_method(method_name='Adaptive', method_results=adaptive_result)
analyser.plot_learning_curves()

Debugging shows that the problem is in

# train/test
X_tr, y_tr, _ = get_Xy_in_multilabel(label_ind, X=X, y=mult_y, unknown_element=0)
model.fit(X=X_tr, y=y_tr)
pres, pred = model.predict(X[test_idx])
perf = alibox.calc_performance_metric(y_true=mult_y[test_idx], y_pred=pred, performance_metric='hamming_loss')

and more precisely in

pres, pred = model.predict(X[test_idx])

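this statement.
I am not familiar with the LabelRank algorithm; I only know roughly that it predicts unlabeled data by ranking the labels. I think the problem is in the implementation of LabelRank, because I found nothing wrong in my own code while debugging.
scene dataset download link
I read the arff file with liac-arff, and I am sure the data is loaded correctly.
liac-arff can be installed with: pip install liac-arff
Running the code above reproduces the problem.
Could you please help me? I have spent a long time on this problem without solving it.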

Zhengjia Li · Answer 1 · Thu Mar 21 2019 20:54:07 GMT+0800 (China Standard Time)

Before that statement runs, everything works fine: the MMC algorithm runs normally and returns a query index: (296, )

Tang · Answer 2 · Thu Mar 21 2019 21:20:43 GMT+0800 (China Standard Time)

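Hello, this error is indeed caused by LabelRank. We mention it in the documentation, but perhaps not prominently enough. We will describe this model in more detail later.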

Note that, the model is scaling sensitive. If you find the optimization procedure is not converge, please try to scale you feature matrix.

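You can try normalizing your feature matrix. Alternatively, if your training data does not contain samples with partially unknown labels, you can consider using another classification model that supports multi-label data.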

Zhengjia Li · Answer 3 · Thu Mar 21 2019 22:02:53 GMT+0800 (China Standard Time)

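It was indeed the missing normalization. Because the raw data is already in the range 0-1, I had not normalized it before. I did not expect this model to be so sensitive to normalization.
One more small question: with the same code as above, I changed the metric from hamming_loss to accuracy_score. Why is accuracy_score always 0 in every iteration? Not a single prediction is correct, which seems far too extreme.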

Tang · Answer 4 · Thu Mar 21 2019 22:14:08 GMT+0800 (China Standard Time)

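We have noticed this problem before: on some datasets this model does fail.
It is also possible that your initial label rate is too small (initial_label_rate=0.05). The model is optimized with SGD and needs a certain amount of training samples before its results become reasonably stable.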

If the LabelRank model does not perform well on your dataset, you can also try other multi-label models from sklearn as the base classifier.
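
A possible contributing factor to the accuracy of exactly 0, offered as an assumption rather than something confirmed in this thread: for multi-label targets, accuracy is commonly computed as subset accuracy (this is how sklearn's accuracy_score treats multi-label input), so a sample only counts as correct when every one of its labels is predicted correctly. Hamming loss instead counts individual label mistakes. The sketch below uses made-up toy arrays and plain NumPy; the helper names are illustrative and not part of alipy.

```python
import numpy as np

# Toy ground truth and predictions in the {-1, 1} encoding used in this thread
y_true = np.array([[ 1, -1,  1, -1],
                   [-1,  1, -1, -1],
                   [ 1,  1, -1, -1]])
y_pred = np.array([[ 1, -1, -1, -1],
                   [-1,  1,  1, -1],
                   [ 1, -1, -1, -1]])


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label vector is predicted exactly."""
    return float(np.mean(np.all(y_true == y_pred, axis=1)))


def hamming_loss(y_true, y_pred):
    """Fraction of individual instance-label pairs that are predicted wrongly."""
    return float(np.mean(y_true != y_pred))


print(subset_accuracy(y_true, y_pred))  # 0.0: every row has one wrong label
print(hamming_loss(y_true, y_pred))     # 0.25: 3 wrong pairs out of 12
```

Here 9 of the 12 instance-label pairs are correct, yet the subset accuracy is 0 because no row matches completely.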

Zhengjia Li · Answer 5 · Thu Mar 21 2019 23:42:29 GMT+0800 (China Standard Time)

OK, thank you.

Zhengjia Li · Answer 6 · Fri Mar 22 2019 12:32:02 GMT+0800 (China Standard Time)

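@tangypnuaa Sorry to bother you again. Following your advice, I tried to replace LabelRank with another multi-label classification algorithm from sklearn.
However, after searching, I found that sklearn has no multi-label classification algorithm that uses an instance-label pair as its basic training unit, so for now LabelRank is my only option.
As I said before, LabelRank does not perform well in my program: its accuracy on the test set is 0!
Following your hint, I should gradually increase the number of iterations, but training takes so long that each attempt is too costly. So I would like to know which dataset you used when testing the MMC, Adaptive, AUDI and QUIRE algorithms with the LabelRank model, and how you set the minimum number of iterations. I want to find a dataset that is not too large, together with a minimum number of iterations, that reproduces the relative performance of these algorithms, i.e. QUIRE > AUDI > Adaptive > MMC.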

Zhengjia Li · Answer 7 · Fri Mar 22 2019 13:54:21 GMT+0800 (China Standard Time)

I really cannot figure out how to get this example code to run on other datasets:

import copy

import numpy as np
from sklearn.datasets import load_iris, make_multilabel_classification
from sklearn.preprocessing import OneHotEncoder

from alipy import ToolBox
from alipy.index import get_Xy_in_multilabel
from alipy.query_strategy.multi_label import *

X, y = load_iris(return_X_y=True)
mlb = OneHotEncoder()
mult_y = mlb.fit_transform(y.reshape((-1, 1)))
mult_y = np.asarray(mult_y.todense())

# Or generate a dataset with any sizes
# X, mult_y = make_multilabel_classification(n_samples=5000, n_features=20, n_classes=5, length=5)

# Since we are using the label ranking model, the label 0 means unknown. we need to
# set the 0 entries to -1 which means irrelevant.
mult_y[mult_y == 0] = -1

alibox = ToolBox(X=X, y=mult_y, query_type='PartLabels')
alibox.split_AL(test_ratio=0.2, initial_label_rate=0.05, all_class=False)


def main_loop(alibox, round, strategy):
    train_idx, test_idx, label_ind, unlab_ind = alibox.get_split(round)
    # Get intermediate results saver for one fold experiment
    saver = alibox.get_stateio(round)
    # base model
    model = LabelRankingModel()

    # A simple stopping criterion to specify the query budget.
    while len(label_ind) <= 120:
        # query and update
        select_labs = strategy.select(label_ind, unlab_ind)
        # use cost to record the amount of queried instance-label pairs
        if len(select_labs[0]) == 1:
            cost = mult_y.shape[1]
        else:
            cost = len(select_labs)
        label_ind.update(select_labs)
        unlab_ind.difference_update(select_labs)

        # train/test
        X_tr, y_tr, _ = get_Xy_in_multilabel(label_ind, X=X, y=mult_y, unknown_element=0)
        model.fit(X=X_tr, y=y_tr)
        pres, pred = model.predict(X[test_idx])
        perf = alibox.calc_performance_metric(y_true=mult_y[test_idx], y_pred=pred, performance_metric='hamming_loss')

        # save
        st = alibox.State(select_index=select_labs, performance=perf, cost=cost)
        saver.add_state(st)

    return copy.deepcopy(saver)


audi_result = []
quire_result = []
random_result = []
mmc_result = []
adaptive_result = []

for round in range(5):
    # init strategies
    audi = QueryMultiLabelAUDI(X, mult_y)
    quire = QueryMultiLabelQUIRE(X, mult_y, kernel='rbf')
    mmc = QueryMultiLabelMMC(X, mult_y)
    adaptive = QueryMultiLabelAdaptive(X, mult_y)
    random = QueryMultiLabelRandom()

    audi_result.append(main_loop(alibox, round, strategy=audi))
    quire_result.append(main_loop(alibox, round, strategy=quire))
    mmc_result.append(main_loop(alibox, round, strategy=mmc))
    adaptive_result.append(main_loop(alibox, round, strategy=adaptive))
    random_result.append(main_loop(alibox, round, strategy=random))

analyser = alibox.get_experiment_analyser(x_axis='cost')
analyser.add_method(method_name='AUDI', method_results=audi_result)
analyser.add_method(method_name='QUIRE', method_results=quire_result)
analyser.add_method(method_name='RANDOM', method_results=random_result)
analyser.add_method(method_name='MMC', method_results=mmc_result)
analyser.add_method(method_name='Adaptive', method_results=adaptive_result)
analyser.plot_learning_curves()

When I use other datasets, I keep getting strange errors, for example

/home/ppnman/.local/lib/python3.5/site-packages/numpy/linalg/linalg.py:2481: RuntimeWarning: overflow encountered in reduce
  return sqrt(add.reduce(s, axis=axis, keepdims=keepdims))

or

/home/ppnman/.local/lib/python3.5/site-packages/numpy/linalg/linalg.py:2480: RuntimeWarning: overflow encountered in multiply
  s = (x.conj() * x).real

Tang · Answer 8 · Fri Mar 22 2019 18:28:47 GMT+0800 (China Standard Time)

Hello, RuntimeWarning: overflow encountered in multiply appears because the optimization of the LabelRanking model did not converge, so the parameters tend towards infinity.
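
To illustrate how a gradient-based optimizer can diverge on large-scale features and produce exactly this kind of overflow, here is a minimal sketch. It uses plain batch gradient descent on a least-squares problem with synthetic data; it is a generic illustration under those assumptions, not the optimizer inside LabelRankingModel.

```python
import numpy as np


def gradient_descent(X, y, lr=0.01, n_steps=200):
    """Plain batch gradient descent on mean squared error for a linear model."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w


rng = np.random.default_rng(0)
X_raw = rng.uniform(-100.0, 100.0, size=(200, 3))  # features with a large scale
y = X_raw @ np.array([0.5, -1.0, 2.0])

# Same data with every column divided by its maximum absolute value
X_scaled = X_raw / np.abs(X_raw).max(axis=0)

# Silence the overflow warnings that the diverging run would otherwise print
with np.errstate(over="ignore", invalid="ignore"):
    w_raw = gradient_descent(X_raw, y)
w_scaled = gradient_descent(X_scaled, y)

print(np.isfinite(w_raw).all())     # False: the weights blew up to inf/nan
print(np.isfinite(w_scaled).all())  # True: the same step size is now stable
```

With the same step size, the only difference between the two runs is the feature scale, which is why rescaling the features is worth trying first.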

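We implemented this model based on one of the authors' earliest versions of the code, so some details may not be handled well. We will carefully review this part of the code. In addition, please try preprocessing the features; different preprocessing methods can have a fairly large effect on performance. The authors' approach is to normalize the data.

The Matlab code is as follows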

function n_data=normalize(data)

fprintf('normalizing data\n');
%n_data=zeros(size(data));
[n,d]=size(data);
n_data=sparse(n,d);
for i=1:d
    [row,col,v]=find(data);
    n_data(row(col==i),i)=data(row(col==i),i)/max(abs(v(col==i)));
end

end
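
For readers working in Python, the Matlab function above can be sketched in NumPy as follows. This is a dense version written for illustration (the Matlab code operates on a sparse matrix); like the Matlab code, it divides each column by its largest absolute value and leaves all-zero columns as zeros. sklearn's MaxAbsScaler performs the same kind of per-feature scaling and may be more convenient in practice.

```python
import numpy as np


def normalize(data):
    """Divide every column by its largest absolute value (dense NumPy version)."""
    data = np.asarray(data, dtype=float)
    col_max = np.abs(data).max(axis=0)
    # All-zero columns stay zero: divide them by 1 instead of by 0
    col_max[col_max == 0] = 1.0
    return data / col_max


X_demo = np.array([[2.0, -10.0, 0.0],
                   [4.0,   5.0, 0.0]])
print(normalize(X_demo).tolist())  # [[0.5, -1.0, 0.0], [1.0, 0.5, 0.0]]
```

The normalized matrix would then be passed to ToolBox and to the query strategies in place of the raw X.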

Tang · Answer 9 · Fri Mar 22 2019 18:54:06 GMT+0800 (China Standard Time)

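Micro-F1 would be a more reasonable performance metric. We tested the enron dataset, preprocessing the features with normalize and using 20% of the data as the initial labeled set to train the model, and the results were fairly reasonable.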
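
As a reference for what micro-F1 measures, here is a minimal NumPy sketch for labels encoded as {-1, 1}. The function name and the choice to return 0.0 when there are no positives are assumptions made for this example; sklearn's f1_score with average='micro' is the usual library route.

```python
import numpy as np


def micro_f1(y_true, y_pred, positive=1):
    """Micro-averaged F1: pool TP/FP/FN over all instance-label pairs."""
    true_pos = np.asarray(y_true) == positive
    pred_pos = np.asarray(y_pred) == positive
    tp = np.sum(true_pos & pred_pos)
    fp = np.sum(~true_pos & pred_pos)
    fn = np.sum(true_pos & ~pred_pos)
    denom = 2 * tp + fp + fn
    # Assumption: report 0.0 when there are no positive labels or predictions
    return float(2 * tp / denom) if denom > 0 else 0.0


y_true = np.array([[1, -1, 1], [-1, 1, -1]])
y_pred = np.array([[1, -1, -1], [-1, 1, 1]])
print(micro_f1(y_true, y_pred))  # 2*2 / (2*2 + 1 + 1) = 0.666...
```

Unlike subset accuracy, this score gives partial credit for every correctly predicted positive label.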

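Also, because each query labels only one instance-label pair, the performance curve oscillates heavily if you plot a point after every query. Please try more query rounds (the authors' stopping criterion is 20000 rounds) and plot a point only every n queries.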

Finally, regarding the long training time, you can try reducing the number of training epochs: model.fit(X, y, n_repeat=10)
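
The advice to plot a point only every n queries can be sketched as a small post-processing step. thin_curve is a hypothetical helper written for this example and is not part of alipy; it assumes the cumulative costs and the matching performance values have already been collected as plain lists.

```python
def thin_curve(costs, scores, every=100):
    """Keep every `every`-th point of a learning curve, plus the final point."""
    if len(costs) != len(scores):
        raise ValueError("costs and scores must have the same length")
    idx = list(range(every - 1, len(costs), every))
    # Always keep the last point so the curve ends at the full budget
    if costs and (not idx or idx[-1] != len(costs) - 1):
        idx.append(len(costs) - 1)
    return [costs[i] for i in idx], [scores[i] for i in idx]


costs = list(range(1, 11))          # cumulative queries 1..10
scores = [0.1 * c for c in costs]   # made-up performance values
thin_costs, thin_scores = thin_curve(costs, scores, every=4)
print(thin_costs)  # [4, 8, 10]
```

An alternative with the same effect is to evaluate the model only every n queries inside the loop, which also saves prediction time.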

Zhengjia Li · Answer 10 · Sat Mar 23 2019 16:48:00 GMT+0800 (China Standard Time)

OK, I will try it as you suggested. Thank you.

Chen Youyou · Answer 11 · Sun Sep 13 2020 10:17:37 GMT+0800 (China Standard Time)

Something got divided by zero, that's all.