weggli-rs / weggli

weggli is a fast and robust semantic search tool for C and C++ codebases. It is designed to help security researchers identify interesting functionality in large codebases.

Feature request: JSON/YML multi-pattern input and result output

LaurensBrinker opened this issue

I'm quite new to Weggli, but as far as I can tell it currently does not support reading patterns from an input file or writing results to an output file, and each pattern check requires a separate execution of Weggli.

For context: I've been playing around with Semgrep, which lets you specify a YAML file containing multiple patterns to check and can write the findings to a .json file for easy parsing. Keen to hear thoughts, but it would be nice if Weggli could support something like this:

Provide a patterns.yml file containing multiple patterns like this:

  - id: double-free
    metadata:
      references:
        - https://cwe.mitre.org/data/definitions/415
        - https://github.com/struct/mms
        - https://www.sei.cmu.edu/downloads/sei-cert-c-coding-standard-2016-v01.pdf
        - https://docs.microsoft.com/en-us/cpp/sanitizers/asan-error-examples
        - https://dustri.org/b/playing-with-weggli.html
      confidence: MEDIUM
    message: >-
      The software calls free() twice on the same memory address,
      potentially leading to modification of unexpected memory locations.
    severity: ERROR
    languages:
      - c
      - cpp
    pattern: "{free($a); NOT: goto _; NOT: break; NOT: continue; NOT: $a = _; free($a);}" 
    extra_args:
      - "--unique"
  
  - id: uninit-pointers
   .....

Run something like:

  weggli --input /path/to/patterns.yml --output /path/to/results.json /path/to/codebase

Weggli would then run all patterns against the specified codebase (if possible) and generate, for example, a JSON output file that looks something like this:

{
  "errors": [],
  "results": [{
      "id": "double-free",
      "start": { "col": 10, "line": 42, "offset": 701 },
      "end": { "col": 25, "line": 42, "offset": 716 },
      "extra": {
        "fingerprint": "79965871385669e43",
        "is_ignored": false,
        "lines": "  ... 
                int alloc_and_free2()
                {
                    char *ptr = (char *)malloc(MEMSIZE);
                    free(ptr);
                    ptr = NULL;
                    free(ptr);
                }
                ....",
        "message": "The software calls free() twice on the same memory address, potentially leading to modification of unexpected memory locations.",
        "metadata": {
          "confidence": "MEDIUM",
          "references": [
              "https://cwe.mitre.org/data/definitions/415",
              "https://github.com/struct/mms",
              "https://www.sei.cmu.edu/downloads/sei-cert-c-coding-standard-2016-v01.pdf",
              "https://docs.microsoft.com/en-us/cpp/sanitizers/asan-error-examples",
              "https://dustri.org/b/playing-with-weggli.html"
          ]
        },
        "metavars": {},
        "severity": "ERROR"
      },
      "path": "test-data/sample_inputs/c-and-cpp/double-free.c"
   }]
}

Again, I know that Weggli doesn't support this kind of behavior atm and that it has to be run once per pattern (afaik, specifying additional patterns with -p is an "AND" rather than an "OR"). I just wanted to see whether this is something that has already been considered.
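
Until something like this exists, the one-run-per-pattern workflow can be scripted from the outside. Below is a minimal sketch in Python. It assumes the rules are stored as a JSON list (a hypothetical patterns.json with the same fields as the YAML above) so that only the standard library is needed. The function names, the raw_output field, and treating a non-zero exit status as an error are all assumptions for illustration, not Weggli behavior. The only Weggli usage relied on is the documented form weggli [OPTIONS] <PATTERN> <PATH>.

```python
import json
import subprocess


def build_command(rule, codebase):
    """Build the argv for one run: weggli [extra_args] <pattern> <path>."""
    return ["weggli", *rule.get("extra_args", []), rule["pattern"], codebase]


def run_rules(rules, codebase, runner=subprocess.run):
    """Run every rule as its own weggli invocation and collect the raw output.

    `runner` is injectable so the logic can be exercised without weggli installed.
    """
    report = {"errors": [], "results": []}
    for rule in rules:
        proc = runner(build_command(rule, codebase), capture_output=True, text=True)
        # Assumption: a non-zero exit status means the run itself failed.
        if proc.returncode != 0:
            report["errors"].append({"id": rule["id"], "stderr": proc.stderr})
            continue
        # weggli prints matches as plain text, so the text is stored unparsed.
        if proc.stdout.strip():
            report["results"].append({
                "id": rule["id"],
                "message": rule.get("message", ""),
                "severity": rule.get("severity", ""),
                "raw_output": proc.stdout,
            })
    return report


def main(rules_path, codebase, output_path):
    """Load rules from a JSON file, run them all, and write a JSON report."""
    with open(rules_path) as fh:
        rules = json.load(fh)
    with open(output_path, "w") as fh:
        json.dump(run_rules(rules, codebase), fh, indent=2)
```

Calling main("patterns.json", "/path/to/codebase", "results.json") would write one report covering all rules. Unlike the proposed output above, this sketch cannot provide per-match line and column positions, because it only captures the text that weggli prints.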

I added this feature in my fork; maybe you want to try it.
weggli-enhance