abenkhadra / gosat

SMT solver for the theory of floating-point arithmetic

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

potentially unsound answers

florianschanda opened this issue · comments

Hi,

We've found the following answers of gosat to be probably incorrect (sat that should be unsat and vice versa).

The filenames below come from my bag of benchmarks. https://github.com/florianschanda/smtlib_schanda I've tried my best to make sure the various status are correct by mistakes are of course possible.

nyxbrain/crafted-edge-cases/RNE:IEEE_equality_is_almost_zero_difference_5_unsat.smt2
nyxbrain/crafted-edge-cases/RNE:the_answer_is_NaN_5_unsat.smt2
nyxbrain/crafted-edge-cases/RTN:IEEE_equality_is_almost_zero_difference_5_unsat.smt2
nyxbrain/crafted-edge-cases/RTN:the_answer_is_NaN_5_unsat.smt2
nyxbrain/crafted-edge-cases/RTP:IEEE_equality_is_almost_zero_difference_5_unsat.smt2
nyxbrain/crafted-edge-cases/RTP:the_answer_is_NaN_5_unsat.smt2
nyxbrain/crafted-edge-cases/RTZ:IEEE_equality_is_almost_zero_difference_5_unsat.smt2
nyxbrain/crafted-edge-cases/RTZ:the_answer_is_NaN_5_unsat.smt2
random/fp.cast/cast_rna_20033.smt2
random/fp.cast/cast_rna_20193.smt2
random/fp.cast/cast_rna_20201.smt2
random/fp.cast/cast_rna_20202.smt2
random/fp.cast/cast_rna_20204.smt2
random/fp.cast/cast_rna_20208.smt2
random/fp.cast/cast_rne_19904.smt2
random/fp.cast/cast_rne_19908.smt2
random/fp.cast/cast_rne_19916.smt2
random/fp.cast/cast_rne_19929.smt2
random/fp.cast/cast_rne_19930.smt2
random/fp.cast/cast_rtn_20753.smt2
random/fp.cast/cast_rtn_20760.smt2
random/fp.cast/cast_rtn_20761.smt2
random/fp.cast/cast_rtn_20763.smt2
random/fp.cast/cast_rtn_20775.smt2
random/fp.cast/cast_rtz_20828.smt2
random/fp.cast/cast_rtz_20879.smt2
random/fp.cast/cast_rtz_20966.smt2
random/fp.cast/cast_rtz_21033.smt2
random/fp.cast/cast_rtz_21034.smt2
random/fp.cast/cast_rtz_21036.smt2
random/fp.cast/cast_rtz_21037.smt2
random/fp.cast/cast_rtz_21040.smt2
random/fp.cast/cast_rtz_21054.smt2
wintersteiger/mul/mul-has-no-other-solution-10044.smt2
wintersteiger/mul/mul-has-no-other-solution-13484.smt2
wintersteiger/mul/mul-has-no-other-solution-15092.smt2
wintersteiger/mul/mul-has-no-other-solution-15226.smt2
wintersteiger/mul/mul-has-no-other-solution-15435.smt2
wintersteiger/mul/mul-has-no-other-solution-15552.smt2
wintersteiger/mul/mul-has-no-other-solution-19889.smt2
wintersteiger/mul/mul-has-no-other-solution-3202.smt2
wintersteiger/mul/mul-has-no-other-solution-3804.smt2
wintersteiger/mul/mul-has-no-other-solution-7814.smt2
wintersteiger/mul/mul-has-no-other-solution-8691.smt2

Also, for some of these the platform of the computer I am on seems to matter.

wintersteiger/mul/mul-has-no-other-solution-15884.smt2 for example returns SAT on my AMD computer but UNKNOWN on my Intel computer.

I am using the same binary (which I compiled on my intel). Both computers are running the same Debian. Below the relevant extracts from /proc/cpuinfo
model name : Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz
model name : AMD Opteron(tm) Processor 6328

Some of the issues could be resolved by not using -march=native that the CMakeLists hard-codes. However the bulk of the issued described above remain.

I can confirm that all smt examples you provided can lead to unsound results. The following bug fixes have been applied:

  • Bug fix in how we treated NaNs in the distance function. The distance between any number and a NaN should by > 0. Currently, it is set to 1024.

  • Bug fix in how negation (not operator) propagated to sub-expressions.

  • Bug fix in handling logical equality and inequality. Earlier, we did not make a distinction between logical equality and FP equality which is wrong. Two separate distance functions were added to support this distinction.

Also, the examples exposed two current limitations which might be addressed in the future:

  • goSAT does not honor rounding mode spec and keeps the default mode of the underlaying machine. For now, we warn users about unsupported rounding modes. Actually, LLVM does not support rounding modes directly in the IR yet. There are discussions about this issue, but still nothing in the upstream. Note that the C++ facility fesetround can be useful for setting the mode per formula (before calling the optimizer). What we need, however, is a method of finer granularity to set the mode per sub-expression.

  • We natively support FP64 (double) only, which is the input accepted by nlopt. FP32 is partially supported by casting it to double at the interface with nlopt. FP16 and FP128 are not supported. However, you can still specify variables of size FP16 or FP128 which will internally be treated as FP64.

Additionally, the changes have passed the regression tests on the griggio benchmarks. Hence, the results of our paper are still reproducible. However, the changes are significant and may cause bugs on other benchmarks :)

Finally, I do not think the -march=native plays a role here. The answers for a few benchmarks were unstable, with repeated runs, since they are very close to zero. Something in the order of 10^-17.
Playing with the XTOL parameter might help. However, I have not looked much into tuning nlopt parameters.

Thanks for experimenting with goSAT!

Thank you for the fast fixes :) I am currently re-running our benchmark suite and will let you know if anything new pops up.

I think it would be good that for anything you see that you don't support (like different rounding modes in the same expression) to instantly emit "unsupported" (see page 47 of the smtlib standard) or give the "unknown" response to a check-sat query.

Treating Float16 or Float128 as Float64 will not be sound, I would again encourage you to emit "unknown" at that point.

I've also now compiled without -march and with -march=opteron-sse3 and both work. -march=native really does not if you build on a recent Intel box and then use the binary on an AMD machine. Many executions terminate with "illegal instruction" and others just give incorrect results.

Results are different now, we've found 4970 unsound answers, but they are not the same ones as before. Below a reduced list, but it should cover everything.

crafted/QF_FP/why3_review_01.smt2
nyxbrain/crafted-edge-cases/RNE:gt_to_lt_unsat.smt2
nyxbrain/crafted-edge-cases/RTN:gt_to_lt_unsat.smt2
nyxbrain/crafted-edge-cases/RTP:gt_to_lt_unsat.smt2
nyxbrain/crafted-edge-cases/RTZ:gt_to_lt_unsat.smt2
random/fp.cast/cast_rne_19862.smt2
random/fp.div/div_rne_00703.smt2
random/fp.div/div_rne_00758.smt2
random/fp.div/div_rne_00852.smt2
random/fp.div/div_rne_00920.smt2
random/fp.div/div_rne_00925.smt2
random/fp.div/div_rne_00939.smt2
random/fp.div/div_rne_01008.smt2
random/fp.div/div_rne_01093.smt2
random/fp.div/div_rne_01107.smt2
random/fp.div/div_rne_01125.smt2
random/fp.div/div_rne_01156.smt2
random/fp.div/div_rne_01181.smt2
random/fp.div/div_rne_01209.smt2
random/fp.mul/mul_rne_08016.smt2
random/fp.mul/mul_rne_08042.smt2
random/fp.mul/mul_rne_08061.smt2
random/fp.mul/mul_rne_08081.smt2
random/fp.mul/mul_rne_08114.smt2
random/fp.mul/mul_rne_08137.smt2
random/fp.mul/mul_rne_08292.smt2
random/fp.mul/mul_rne_08324.smt2
random/fp.mul/mul_rne_08326.smt2
random/fp.mul/mul_rne_08341.smt2
random/fp.mul/mul_rne_08345.smt2
random/fp.mul/mul_rne_08395.smt2
random/fp.sub/sub_rne_11182.smt2
random/fp.sub/sub_rne_11531.smt2
wintersteiger/add/add-has-no-other-solution-10020.smt2
wintersteiger/add/add-has-no-other-solution-1003.smt2
wintersteiger/div/div-has-no-other-solution-10317.smt2
wintersteiger/div/div-has-no-other-solution-11528.smt2
wintersteiger/eq/eq-has-no-other-solution-10005.smt2
wintersteiger/eq/eq-has-no-other-solution-10015.smt2
wintersteiger/gt/gt-has-no-other-solution-10.smt2
wintersteiger/gt/gt-has-no-other-solution-10013.smt2
wintersteiger/lt/lt-has-no-other-solution-10001.smt2
wintersteiger/lt/lt-has-no-other-solution-10003.smt2
wintersteiger/mul/mul-has-no-other-solution-10112.smt2
wintersteiger/mul/mul-has-no-other-solution-10506.smt2
wintersteiger/sub/sub-has-no-other-solution-1.smt2
wintersteiger/sub/sub-has-no-other-solution-10068.smt2

I added a few usability features which includes reporting "unsupported" upon detecting FP16 or FP128 variables. More importantly, it turns out that the parser of z3 provides Z3_OP_EQ (logical equality) in places where we expect Z3_OP_FPA_EQ (FP equality). This causes a bug since the treatment of the two equality types was split, for good, in the recent updates.

For example, consider the benchmark wintersteiger/eq/eq-has-no-other-solution-10015.smt2 which is partially reproduced below:

(set-logic QF_FP)
(set-info :status unsat)
(define-sort FPN () (_ FloatingPoint 11 53))
(declare-fun x () FPN)
(declare-fun y () FPN)
(assert (= x (fp #b0 #b01000101001 #b0000011101011111100011111011000000011101101101000010)))
(assert (= y (fp #b0 #b11000011110 #b0110111100000001000011010011110101010001011010111100)))
(assert  (not (= (fp.eq x y) false)))
(check-sat)

the parser gives Z3_OP_FPA_EQ for the comparison between x and y which is expected. However, it gives Z3_OP_EQ for the comparison between x and its corresponding constant despite the fact that the comparison is between FP (not booleans).

I've implemented a workaround where for operation Z3_OP_EQ, we look ahead into the type of the arguments. Based on that, we decide what code should be generated.

The Z3 parser is correct, the comparison is SMTLIB =, not FP EQ.

I believe your workaround is only going to add unsoundness when benchmarks depend on subtle difference between the two. For these specific constants they are equivalent. But they are not equivalent if either of the constant is 0 or NaN.

In C terms you might want to translate (fp.eq x y) as x == y, and (= x y) as *(int*)(&x) == *(int*)(&y).

I should further note that the SMTLIB floating point standard makes this distinction, and several benchmarks depend on this.

Sorry for the misunderstanding. I believe that z3 is doing the correct thing, however, as far as the code generator is concerned, it was unexpected.

I'm interested in supporting a reasonable subset of QF_FP in a sound or, for practical reasons, may be "soundy" way. Along the way, we can learn about the limits of the stochastic search approach. So far, goSAT passes the soundness test on the griggio benchmarks in addition to the examples provided.