symflower / eval-dev-quality

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.

Home Page: https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/

Coverage for Java is tracked per line, while coverage for Go is tracked per range

bauersimon opened this issue

Consider equivalent implementations in both languages, each with tests for which gotestsum and Maven respectively report full 100% coverage:

go

package light

func validDate(day int, month int, year int) bool {
	monthDays := []int{31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31}

	if year < 1583 {
		return false
	}
	if month < 1 || month > 12 {
		return false
	}
	if day < 1 {
		return false
	}
	if month == 2 {
		if (year%400) != 0 && (year%4) == 0 {
			if day > 29 {
				return false
			}
		} else {
			if day > 28 {
				return false
			}
		}
	} else {
		if day > monthDays[month-1] {
			return false
		}
	}

	return true
}

result of symflower test:

[
  {
    "FileRange": "light/validateDate.go:12:2-light/validateDate.go:14:3",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  },
  {
    "FileRange": "light/validateDate.go:15:2-light/validateDate.go:19:5",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  },
  {
    "FileRange": "light/validateDate.go:20:9-light/validateDate.go:23:5",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  },
  {
    "FileRange": "light/validateDate.go:25:8-light/validateDate.go:28:4",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  },
  {
    "FileRange": "light/validateDate.go:31:2-light/validateDate.go:31:13",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  },
  {
    "FileRange": "light/validateDate.go:3:51-light/validateDate.go:8:3",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  },
  {
    "FileRange": "light/validateDate.go:9:2-light/validateDate.go:11:3",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  }
]

coverage objects (all entries with count > 0): 7

java

package com.eval;

class ValidDate {
	static boolean validDate(int day, int month, int year) {
		int[] monthDays = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};

		if (year < 1583) {
			return false;
		}
		if (month < 1 || month > 12) {
			return false;
		}
		if (day < 1) {
			return false;
		}
		if (month == 2) {
			if ((year % 400) != 0 && (year % 4) == 0) {
				if (day > 29) {
					return false;
				}
			} else {
				if (day > 28) {
					return false;
				}
			}
		} else {
			if (day > monthDays[month-1]) {
				return false;
			}
		}

		return true;
	}
}

result of symflower test:

[
  {
    "FileRange": "com/eval/ValidDate.java:10:1-com/eval/ValidDate.java:10:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 8
  },
  {
    "FileRange": "com/eval/ValidDate.java:10:1-com/eval/ValidDate.java:10:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 12
  },
  {
    "FileRange": "com/eval/ValidDate.java:11:1-com/eval/ValidDate.java:11:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 0
  },
  {
    "FileRange": "com/eval/ValidDate.java:11:1-com/eval/ValidDate.java:11:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 2
  },
  {
    "FileRange": "com/eval/ValidDate.java:13:1-com/eval/ValidDate.java:13:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 7
  },
  {
    "FileRange": "com/eval/ValidDate.java:13:1-com/eval/ValidDate.java:13:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 9
  },
  {
    "FileRange": "com/eval/ValidDate.java:14:1-com/eval/ValidDate.java:14:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 0
  },
  {
    "FileRange": "com/eval/ValidDate.java:14:1-com/eval/ValidDate.java:14:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  },
  {
    "FileRange": "com/eval/ValidDate.java:16:1-com/eval/ValidDate.java:16:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 2
  },
  {
    "FileRange": "com/eval/ValidDate.java:16:1-com/eval/ValidDate.java:16:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 12
  },
  {
    "FileRange": "com/eval/ValidDate.java:17:1-com/eval/ValidDate.java:17:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 3
  },
  {
    "FileRange": "com/eval/ValidDate.java:17:1-com/eval/ValidDate.java:17:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 7
  },
  {
    "FileRange": "com/eval/ValidDate.java:18:1-com/eval/ValidDate.java:18:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 1
  },
  {
    "FileRange": "com/eval/ValidDate.java:18:1-com/eval/ValidDate.java:18:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 3
  },
  {
    "FileRange": "com/eval/ValidDate.java:19:1-com/eval/ValidDate.java:19:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 0
  },
  {
    "FileRange": "com/eval/ValidDate.java:19:1-com/eval/ValidDate.java:19:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  },
  {
    "FileRange": "com/eval/ValidDate.java:22:1-com/eval/ValidDate.java:22:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 2
  },
  {
    "FileRange": "com/eval/ValidDate.java:22:1-com/eval/ValidDate.java:22:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 4
  },
  {
    "FileRange": "com/eval/ValidDate.java:23:1-com/eval/ValidDate.java:23:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 0
  },
  {
    "FileRange": "com/eval/ValidDate.java:23:1-com/eval/ValidDate.java:23:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  },
  {
    "FileRange": "com/eval/ValidDate.java:27:1-com/eval/ValidDate.java:27:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 1
  },
  {
    "FileRange": "com/eval/ValidDate.java:27:1-com/eval/ValidDate.java:27:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 3
  },
  {
    "FileRange": "com/eval/ValidDate.java:28:1-com/eval/ValidDate.java:28:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 0
  },
  {
    "FileRange": "com/eval/ValidDate.java:28:1-com/eval/ValidDate.java:28:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  },
  {
    "FileRange": "com/eval/ValidDate.java:32:1-com/eval/ValidDate.java:32:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 0
  },
  {
    "FileRange": "com/eval/ValidDate.java:32:1-com/eval/ValidDate.java:32:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 4
  },
  {
    "FileRange": "com/eval/ValidDate.java:4:1-com/eval/ValidDate.java:4:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 0
  },
  {
    "FileRange": "com/eval/ValidDate.java:4:1-com/eval/ValidDate.java:4:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 11
  },
  {
    "FileRange": "com/eval/ValidDate.java:5:1-com/eval/ValidDate.java:5:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 0
  },
  {
    "FileRange": "com/eval/ValidDate.java:5:1-com/eval/ValidDate.java:5:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 11
  },
  {
    "FileRange": "com/eval/ValidDate.java:7:1-com/eval/ValidDate.java:7:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 10
  },
  {
    "FileRange": "com/eval/ValidDate.java:7:1-com/eval/ValidDate.java:7:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 12
  },
  {
    "FileRange": "com/eval/ValidDate.java:8:1-com/eval/ValidDate.java:8:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 0
  },
  {
    "FileRange": "com/eval/ValidDate.java:8:1-com/eval/ValidDate.java:8:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  }
]

coverage objects (all entries with count > 0): 24

Even worse, the Go coverage report of symflower test is also wrong. The actual result of gotestsum is:

mode: set
light/validateDate.go:3.51,6.17 2 1
light/validateDate.go:6.17,8.3 1 1
light/validateDate.go:9.2,9.29 1 1
light/validateDate.go:9.29,11.3 1 1
light/validateDate.go:12.2,12.13 1 1
light/validateDate.go:12.13,14.3 1 1
light/validateDate.go:15.2,15.16 1 1
light/validateDate.go:15.16,16.39 1 1
light/validateDate.go:16.39,17.16 1 1
light/validateDate.go:17.16,19.5 1 1
light/validateDate.go:20.9,21.16 1 1
light/validateDate.go:21.16,23.5 1 1
light/validateDate.go:25.8,26.31 1 1
light/validateDate.go:26.31,28.4 1 1
light/validateDate.go:31.2,31.13 1 1

This means that the block spanning lines 3-6 covers two statements:

  • monthDays := []int{31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31}
  • year < 1583

But in the JSON report of symflower test, that same range has a coverage count of only 1.

Internally, we just increment the coverage by one for every entry with count > 0, so that is wrong as well.

Also, the JSON output of symflower test does not contain all the statements from the actual Go coverage report 🤔

In theory, there are 16 statements in both implementations, so the correct result must be a coverage of 16 for both of them, not 7 and not 24.

Java:

  • should not count line 4 because that is the method signature (it is even marked type="method" in the clover.xml); change this in symflower test
  • since coverage is reported per line, count each line that has an entry with count > 0 only once; change this in eval-dev-quality
  • this will arrive at 16 ✔️
  • REMARK: clover counts the total number of times a statement was covered

Go:

  • figure out why we are missing ranges compared to the output of gotestsum, BUG!
  • TODO: figure out how to arrive at 16 with Go coverage (maybe we change how count is interpreted completely and make it the number of statements instead of the number of executions; I think that is our only chance)
  • REMARK: plain go test only records whether a statement was covered at all. This could be changed to the Java behavior using covermode=count, but since Go counts ranges, it is then no longer clear which distinct statements were executed