slackhq / tree-sitter-hack

Hack grammar for tree-sitter

tree-sitter-hack

At Slack, proactively securing our systems is a top priority. One way we achieve this is by automating vulnerability detection with static code analysis. Although tools abound for scanning most programming languages, our codebase is overwhelmingly written in Hack, a language not widely used outside of Slack. Rather than building our own tool from scratch, we are extending an open source static analysis tool, Semgrep, to support Hack. But how do we teach Semgrep the Hack programming language?

Like human languages, programming languages have a structure known as a grammar. Grammar rules are used to create a parser, which converts source code into a concrete syntax tree (CST), a structural representation of the code. Tree-sitter is a fast and robust parsing library: it generates a parser from our Hack grammar rules, and that parser produces the CST. This CST has many use cases, such as syntax highlighting, code folding, and linting. Most importantly, Semgrep uses this CST to understand Hack on a semantic level. Combined with Semgrep rules, this semantic understanding makes it possible to detect vulnerabilities in source code. The following diagram shows this process.

[Diagram: tree-sitter-hack use in Semgrep]

In summary, we use tree-sitter-hack to teach Semgrep the Hack programming language.

Installation

$ git clone https://github.com/slackhq/tree-sitter-hack
$ cd tree-sitter-hack
$ npm install

Usage

$ printf '%s\n' 'function main(): void {' '  print "wyd, world\n";' '}' > script.hack
$ npx tree-sitter generate
$ npx tree-sitter parse script.hack
(script [0, 0] - [3, 0]
  (function_declaration [0, 0] - [2, 1]
    name: (identifier [0, 9] - [0, 13])
    (parameters [0, 13] - [0, 15])
    return_type: (primitive_type [0, 17] - [0, 21])
    body: (compound_statement [0, 22] - [2, 1]
      (expression_statement [1, 2] - [1, 23]
        (print_expression [1, 2] - [1, 22]
          (string [1, 8] - [1, 22]))))))
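When a file does not parse cleanly, the S-expression printed by tree-sitter parse contains ERROR or MISSING nodes. As a rough illustration, the sketch below counts those nodes in parse output read from stdin. The function name count_error_nodes is hypothetical and is not part of this repo.

```shell
#!/bin/sh
# Sketch only: count ERROR and MISSING nodes in tree-sitter parse output.
# count_error_nodes is a hypothetical helper, not a script from this repo.

count_error_nodes() {
  # gsub returns the number of matches on each line; replacing with "&"
  # leaves the text unchanged, so this only counts.
  awk '{ n += gsub(/\((ERROR|MISSING)/, "&") } END { print n + 0 }'
}
```

One way to use it would be: npx tree-sitter parse script.hack | count_error_nodes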

Testing

$ npx tree-sitter generate
$ bin/test-corpus

Scripts

bin/generate-parser

Wrapper around tree-sitter generate that skips parser generation if grammar.js hasn't changed since last run.
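The skip-if-unchanged idea can be sketched with a checksum stamp file. This is not the actual bin/generate-parser script: the function name generate_if_changed and the stamp-file argument are assumptions made for illustration, and the real script may detect changes differently.

```shell
#!/bin/sh
# Sketch only: rerun a generator command when the grammar file has changed.
# generate_if_changed and the stamp file are hypothetical names.

generate_if_changed() {
  grammar=$1   # grammar file to watch, e.g. grammar.js
  stamp=$2     # file that stores the checksum from the last successful run
  shift 2      # remaining arguments: the generator command to run

  # cksum is POSIX; reading from stdin keeps the file name out of the output.
  current=$(cksum < "$grammar") || return 1

  if [ -f "$stamp" ] && [ "$current" = "$(cat "$stamp")" ]; then
    echo "grammar unchanged, skipping generation"
    return 0
  fi

  # Run the generator and record the checksum only if it succeeds.
  "$@" || return 1
  printf '%s\n' "$current" > "$stamp"
}
```

A call might look like: generate_if_changed grammar.js .grammar.cksum npx tree-sitter generate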

bin/generate-corpus

Unlike most other Tree-sitter projects, we break out test cases into separate files (see test/cases). This makes it easier for editors to syntax highlight the test cases, and individual files are easier to navigate than the corpus.txt files used by Tree-sitter.

We use bin/generate-corpus to generate test/corpus/case1.txt from the individual test/cases files so we can still use tree-sitter test.
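To illustrate how individual case files could be stitched into the Tree-sitter corpus format, here is a minimal sketch. It assumes each case is a pair of files, name.hack for the source and name.exp for the expected tree. That layout and the function name build_corpus are assumptions for this example, and the real bin/generate-corpus may work differently.

```shell
#!/bin/sh
# Sketch only: build a Tree-sitter corpus file from per-case files.
# Assumed (hypothetical) layout: <name>.hack holds the source and
# <name>.exp holds the expected S-expression.

build_corpus() {
  cases_dir=$1
  for src in "$cases_dir"/*.hack; do
    [ -f "$src" ] || continue     # glob matched nothing
    exp=${src%.hack}.exp
    [ -f "$exp" ] || continue     # skip cases without an expected tree
    name=$(basename "$src" .hack)

    # Corpus entry: header with the test name, the source, a --- separator,
    # then the expected tree.
    printf '==================\n%s\n==================\n\n' "$name"
    cat "$src"
    printf '\n---\n\n'
    cat "$exp"
    printf '\n'
  done
}
```

The output could then be redirected into the corpus file, for example: build_corpus test/cases > test/corpus/case1.txt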

bin/test-corpus

Runs bin/generate-corpus and bin/generate-parser, then runs tree-sitter test.

bin/test-dir

Runs bin/ts-errors recursively on every file with a .hack or .php extension in the given directory.

$ ./bin/test-dir hhvm/hphp/hack/test
examples/hhvm/hphp/hack/test/error_formatting_highlighted/zero_width_syntax_err.php
(3,11)-(3,18) extends
examples/hhvm/hphp/hack/test/autocomplete/not_shape_key_string.php
(3,1)-(6,1) function foo(): string {\n  return "AUTO332\n}\n
(4,10)-(6,1) "AUTO332\n}\n
...
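The recursive scan can be sketched as a find loop that hands each matching file to a checker command. This is not the repo's bin/test-dir: scan_dir is a hypothetical name, and the checker is passed in as an argument because the behavior of bin/ts-errors is not reproduced here.

```shell
#!/bin/sh
# Sketch only: run a checker on every .hack or .php file under a directory.
# scan_dir is a hypothetical helper; the checker command is supplied by the
# caller and is invoked with one file path appended.
# Paths containing newlines are not handled.

scan_dir() {
  dir=$1
  shift   # remaining arguments: the checker command

  find "$dir" -type f \( -name '*.hack' -o -name '*.php' \) | sort |
  while IFS= read -r file; do
    # Only the checker's output matters here; its exit status is ignored.
    report=$("$@" "$file") || true
    if [ -n "$report" ]; then
      printf '%s\n%s\n' "$file" "$report"
    fi
  done
}
```

For example, scan_dir some/dir grep -n ERROR would list each matching file followed by its lines that contain ERROR.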

bin/test-dir-quiet

A quieter version of bin/test-dir that only outputs failing files.

Contributing

If you're interested in contributing, please see the guide.

Note

npm does not allow package names that contain the word "hack", which is why the package name does not match the repo name. The registry rejects such names with this message:

Unfortunately, the word "hack" triggers our spam detection and can't be used in package names. We recommend choosing other keywords that highlight your package's functionality.

References

There's no published official Hacklang language spec, so we have to make do.

License: MIT License

