marcotcr / checklist

Beyond Accuracy: Behavioral Testing of NLP models with CheckList

In the Jupyter viewer, INV and DIR tests show the two sentences' predictions in reverse order.

ahzz1207 opened this issue

- In INV and DIR tests, every example consists of two sentences: the original and the changed one. If I use Suite.summary() the output is fine, but when I use Suite.Summarizer() to show the results in Jupyter, the predicted label and confidence of the two sentences are shown in reverse order for every INV and DIR test.
  • For example, for the original sentence "I'm the guy." the model predicts 1; after add_typos I get the changed sentence "I'm eth guy." and the model predicts 0.

    With Suite.summary() I get the correct result, e.g.:
    Example fails: 1 (0.8) I'm the guy. 0 (0.9) I'm eth guy.
    But with Suite.Summarizer() in Jupyter I may get:
    I'm the guy. → I'm eth guy. Pred: 0 (0.9) → 1 (0.8)
    I can't find where this bug happens, so please help me debug it, thanks! (A minimal sketch of this kind of test is included below for reference.)
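
For reference, a minimal sketch of the kind of INV test involved is below. The dummy model, its hard-coded probabilities, and the example sentences are placeholders, not the reporter's actual classifier or data:

    import numpy as np
    from checklist.perturb import Perturb
    from checklist.test_types import INV
    from checklist.pred_wrapper import PredictorWrapper

    # Placeholder binary classifier: always returns the same softmax.
    def dummy_predict_proba(sentences):
        return np.array([[0.2, 0.8]] * len(sentences))

    data = ["I'm the guy.", "This was a great flight."]

    # add_typos produces a changed sentence for each original one; an INV test
    # expects the prediction to stay the same across the original/changed pair.
    t = Perturb.perturb(data, Perturb.add_typos)
    test = INV(**t, name='add typos', capability='Robustness')

    test.run(PredictorWrapper.wrap_softmax(dummy_predict_proba))
    test.summary(n=3)  # text summary; the Jupyter viewer is the part affected here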

Hi! Could you please give me a more complete example test, so I can take a closer look? e.g. what do you see if you print test.data and test.conf

Hi, I just guessed you are Chinese from your username, so I'll post an example for you to take a look at.

Example: I removed the punctuation of the sentence for an INV test; since it's an NER model, test.conf is None.
In the Jupyter HTML it is shown like this (the sentence is Chinese news text saying that India's foreign exchange reserves are currently US$25.52 billion):
报道 说 , 印度 目前 外汇 储备 为 255.2亿→ 2552亿 美元 。 Pred: ['印度' '2552亿美元']→['印度' '255.2亿美元']
print(test.data):
[['报道说,印度目前外汇储备为255.2亿美元。', '报道说印度目前外汇储备为2552亿美元']]

Another issue is that if the sentence and prediction of a test case are too long, Jupyter does not show the complete prediction in the test case's box in the HTML, because the prediction box cannot wrap lines automatically, unlike the sentence. So I need to change the Jupyter width to 200% to show the whole content.

Thanks for catching both bugs!
I've fixed the display in 8a0f05c, and now the display should also automatically wrap lines for the prediction (in 7997666).
Both should work if you reinstall from the repo; we will push an updated version to pip later.
Please feel free to re-open the issue if you still run into problems!

(And yes, I'm Chinese :P)

Thank you for doing that. Can I just copy some folders like checklist/viewer and checklist/visual_interface and then reinstall the project? I have modified a lot of files in the project.
By the way, I have seen your photo on a WeChat official account (公众号), haha.

Yeah that works, just copy-n-replace /checklist/viewer/static/, and then install locally again: pip install -e . :)
On a side note, in case checklist has future updates, you probably want to consider forking the repo, so later you can just fetch our updates!
(Well, at least that photo is with a cute doggy lol)

Hi, I copy-n-replaced /checklist/viewer/static/ just like you said, and then ran pip install -e . in the project. But I get the output below, and I don't know whether it reinstalled successfully:

Installing collected packages: checklist
Attempting uninstall: checklist
Found existing installation: checklist 0.0.4
Can't uninstall 'checklist'. No files were found to uninstall.
Running setup.py develop for checklist
Successfully installed checklist

I couldn't really guess what's going on, but hopefully this stackoverflow thread can help?

Another hack to try, essentially forcing pip to upgrade the package:

  1. bump the version in setup.py (here)
  2. pip install --upgrade -e .
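
For step 1, the only change needed is the version string in setup.py; a simplified sketch (the fields other than version are illustrative, not the project's real setup.py):

    # setup.py (simplified)
    from setuptools import setup, find_packages

    setup(
        name='checklist',
        version='0.0.5',  # bumped from 0.0.4 so pip treats the local checkout as newer
        packages=find_packages(),
    )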

Thanks for the reply. I didn't do anything else, but when I tested in Jupyter I found the bug was gone!

Hi :)
I have a similar problem, but related to the MFT test type. In the visual summary all is well, but in the textual summary the "Example fails" section reports that phrases with negative sentiment have confidence 1.0 for the negative label, while in the visual table the example is reported as misclassified as 2, i.e. positive (Expect: 0, Pred: 2).

The strange thing is that I only get this error when I test a model other than the released ones (amazon, google, etc.). Are there particular format rules to follow besides saving the tests_n500 predictions in a txt file with the format "pred - prob for 0 - prob for 1 - prob for 2"?

Sentiment-laden words in context
Test cases: 8658
Test cases run: 500
Fails (rate): 208 (41.6%)

Example fails:
1.0 0.0 0.0 This was a creepy aircraft.
1.0 0.0 0.0 That cabin crew is creepy.
1.0 0.0 0.0 This food is lame.

Hm, this is odd. Can you provide us with a small example?
The default prediction file format is indeed the prediction followed by the softmax (no matter how many labels there are).
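
For anyone else hitting this later, here is a small sketch of writing predictions in that format; the file name and the probabilities below are made up:

    import numpy as np

    # One line per example: predicted label first, then the softmax over all labels.
    softmax = np.array([[0.80, 0.15, 0.05],
                        [0.02, 0.08, 0.90]])   # shape (n_examples, n_labels)
    preds = softmax.argmax(axis=1)             # the prediction should match the softmax

    with open('predictions.txt', 'w') as f:    # hypothetical output path
        for p, probs in zip(preds, softmax):
            f.write('%d %s\n' % (p, ' '.join('%f' % x for x in probs)))
    # e.g. "0 0.800000 0.150000 0.050000"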

Yes, of course, and thank you.
These are the first 10 lines of the txt file in the predictions folder. The only thing that differs from the roberta/amazon/* files is that the probabilities for Negative and Positive have 5 decimal places instead of 6, but that is not a real difference (right?)

2 0.99984 0.000000 0.00016
2 0.99190 0.000000 0.00810
2 0.99999 0.000000 0.00001
2 0.98929 0.000000 0.01071
2 0.99970 0.000000 0.00030
2 0.99999 0.000000 0.00001
2 0.99999 0.000000 0.00001
2 0.99848 0.000000 0.00152
2 0.99999 0.000000 0.00001
2 0.99998 0.000000 0.00002

I'm reporting the first lines of the textual summary too: the problem persists only when negative labels are involved.

Vocabulary

single positive words
Test cases: 34
Fails (rate): 0 (0.0%)

single negative words
Test cases: 35
Fails (rate): 32 (91.4%)

Example fails:
1.0 0.0 0.0 regretted
1.0 0.0 0.0 dislike
0.4 0.0 0.6 horrible

single neutral words
Test cases: 13
Fails (rate): 13 (100.0%)

Example fails:
1.0 0.0 0.0 international
1.0 0.0 0.0 Israeli
1.0 0.0 0.0 see

Sentiment-laden words in context
Test cases: 8658
Test cases run: 500
Fails (rate): 208 (41.6%)

Example fails:
1.0 0.0 0.0 This airline is dreadful.
1.0 0.0 0.0 That service is terrible.
1.0 0.0 0.0 The cabin crew is terrible.

neutral words in context
Test cases: 1716
Test cases run: 500
Fails (rate): 500 (100.0%)

Example fails:
1.0 0.0 0.0 This is an American crew.
1.0 0.0 0.0 That customer service was Indian.
1.0 0.0 0.0 This was an Indian aircraft.

Sorry for the long delay in responding. The format you are using is pred_and_softmax, where the first column is the prediction and the next columns have the prediction probabilities. You'll note that your predictions do not match your probabilities: the first ten lines in your file have prediction=2 (positive), even though the probability of negative is 99%.
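
A quick way to check a prediction file for this kind of mismatch, assuming the pred_and_softmax layout above ('predictions.txt' is a placeholder path):

    import numpy as np

    rows = np.loadtxt('predictions.txt', ndmin=2)  # column 0: prediction, columns 1..n: softmax
    preds = rows[:, 0].astype(int)
    probs = rows[:, 1:]
    bad = np.where(preds != probs.argmax(axis=1))[0]
    print('%d of %d lines have a prediction that does not match the argmax of the softmax'
          % (len(bad), len(rows)))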