philippeitis / nlp_specifier

Formal verification for natural language software documentation

Introduction

Formal verification allows users to verify properties of program execution - for instance, they can verify properties of a function's output, or that a program does not crash. This is useful for a variety of purposes - from formally verifying the behaviour of cryptography libraries to ensuring that applications used in sensitive medical settings do not crash.

NLPSpecifier is an application that produces formal specifications from software documentation. These formal specifications can then be used by Prusti to verify the behaviour of your source code. In particular, NLPSpecifier can parse documentation embedded in Rust documentation pages (produced through cargo doc, or downloaded via rustup) and in Rust source code.
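
As a rough illustration (hand-written here, not output produced by the tool), a documentation sentence such as "Returns the larger of the two arguments" could map to Prusti annotations along these lines, using the requires/ensures attributes from prusti_contracts:

use prusti_contracts::*;

// Illustrative sketch only: the function and its specification are
// hand-written to show the kind of annotation Prusti consumes.
// "Returns the larger of the two arguments."
#[ensures(result >= a && result >= b)]
#[ensures(result == a || result == b)]
fn max(a: i32, b: i32) -> i32 {
    if a >= b { a } else { b }
}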

Installation

Python 3.6+ and Rust should already be installed for the parsing steps. To verify the output specifications, Prusti should also be installed. Dependencies for individual components of the system are specified below, or are otherwise installed using the provided setup.sh scripts. Note that all .sh files and commands provided are specific to Linux.

Also note that, to avoid contaminating your Python installation, it is best to use venv. The main functionality of this application is available through src/doc_parser/. To build the doc_parser executable for your system, use

cd ./src/doc_parser/ && cargo build --release; cd ../..

Usage

Once installation is complete, this project can be used through doc_parser. To run the program, use ./doc_parser (built to src/doc_parser/target/release/doc_parser), or cargo run --release from that directory. Run the following command to see a list of all available commands.

foo@bar:~$ ./doc_parser -h
doc_parser

USAGE:
    doc_parser <SUBCOMMAND>

FLAGS:
    -h, --help       Print help information
    -V, --version    Print version information

SUBCOMMANDS:
    end-to-end    Demonstrates entire pipeline from start to end on provided file, writing
                  output to terminal
    help          Print this message or the help of the given subcommand(s)
    render        Visualization of various components in the system's pipeline
    specify       Creates specifications for a variety of sources

To see help for a specific subcommand, do the following:

foo@bar:~$ ./doc_parser end-to-end --help
doc_parser-end-to-end 

Demonstrates entire pipeline from start to end on provided file, writing output to terminal

USAGE:
    doc_parser end-to-end [OPTIONS]

FLAGS:
    -h, --help       Print help information
    -V, --version    Print version information

OPTIONS:
    -p, --path <PATH>    Source file to specify [default: ../../data/test3.rs]
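
For example, to run the full pipeline on a particular source file (using the --path option shown above):

foo@bar:~$ ./doc_parser end-to-end --path ../../data/test3.rs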

NLP Processing

The NLP parsing code tokenizes text, assigns part-of-speech tags to tokens, and detects named entities using spaCy. To set up the dependencies, run src/setup.sh:

cd ./src/ && sudo chmod +x ./setup.sh && ./setup.sh && cd ..

If you also need the SRL and NER capabilities, run jml_nlp/setup.sh:

cd ./jml_nlp/ && sudo chmod +x ./setup.sh && ./setup.sh && cd ..

Then, to launch the base services, which include a web interface and a REST API used by doc_parser for NLP:

sudo docker-compose --env-file nlp.env up -d --build
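
To confirm that the containers came up, the standard docker-compose status command can be used:

sudo docker-compose ps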

Add --profile jml_nlp to also launch the SRL and NER services:

sudo docker-compose --env-file nlp.env --profile jml_nlp up -d --build

Alternatively, it is possible to run the base services directly on your computer, without using Docker.

Python Implementation

The Python implementation is available at https://github.com/philippeitis/nlp_specifier/tree/62d4d51a30c173f65daaf631b7acca0ffbf572a3

WordNet POS tags reference

This link provides a useful reference for the POS tags generated by spaCy: http://erwinkomen.ruhosting.nl/eng/2014_Longdale-Labels.htm
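
For example, using that tag set, a documentation sentence like "Returns the first element." would ideally be tagged Returns/VBZ the/DT first/JJ element/NN ./. - though, as noted in the TODOs below, sentence-initial verbs are sometimes mis-tagged as nouns.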

Named-entity Recognition and Semantic Role Labelling

Requirements

To use NER and SRL analysis for documentation, Docker and Docker Compose must be installed. Additionally, downloading the relevant models requires Git and Git LFS. All other dependencies are set up by jml_nlp/setup.sh:

cd ./jml_nlp/ && sudo chmod +x ./setup.sh && ./setup.sh && cd ..

After running this script, the SRL service will be available at 127.0.0.8:701, and the NER service at 127.0.0.8:702. src/nlp/ner.py provides functions for annotating text using these services, and the Tokenizer class in src/nlp/tokenizer.py transforms these annotations into a format that can be rendered by spaCy's displaCy tool.

The NER and SRL models are sourced from "Combining formal and machine learning techniques for the generation of JML specifications".

Major TODOs:

  • Build server interface for NLP

  • Tree formatting

  • Convert codegen to Rust build.rs (to ensure that we don't introduce inconsistencies when parsing grammar)

  • Build synonym sets for this problem domain

  • Train model which correctly tags VB cases (currently tagged as "NN")

  • Build pseudo-compiler to iteratively resolve specification

  • Detect fn purity

  • Build tool to clean up resulting specifications

  • Detect vacuously true specifications

  • Quality of specification?

  • Desugar for loops?
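
For the last item, a Rust for loop desugars roughly as follows (a simplified sketch of the shape a specification pass would need to reason about, not the exact compiler desugaring):

// `for x in v { /* body */ }` is roughly equivalent to:
let mut iter = IntoIterator::into_iter(v);
loop {
    match iter.next() {
        Some(x) => { /* body */ }
        None => break,
    }
}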

Plan

  1. Straight up search using current methods, and plug in symbols
  • No slotting at all (but methods / struct fields should be attached as .x, while lone functions are wrapping)
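
A hypothetical sketch of the attachment rule described above (the sentences and symbols are invented for illustration):

// Method or struct field: attached to the receiver with `.x`
//   "the capacity of self"      ->  self.capacity()
// Lone function: wraps its argument
//   "the length of the slice"   ->  len(slice)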

Examples of unaccepted sentences (maybe these should be accepted?)

  • Replaces first N matches of a pattern with another string
  • ::ffff:a.b.c.d becomes a.b.c.d
  • The elements are passed in opposite order from their order in the slice, so if same_bucket(a, b) returns true, a is moved at the end of the slice
  • Returns the capacity this OsString can hold without reallocating
  • Note that the capacity of self does not change
  • Returns true if the associated Once was poisoned prior to the invocation of the closure passed to Once::call_once_force()

Examples of unverifiable specifications

  • replace_with does not need to be the same length as range
  • This method is equivalent to CString::new except that no runtime assertion is made that v contains no 0 bytes, and it requires an actual byte vector, not anything that can be converted to one with Into
  • See send for notes about guarantees of whether the receiver has received the data or not if this function is successful

Options

  • Find matches which do not include the sentence (this approach does not handle interior skipwords)
  • Preprocess sentence before tree - detect and remove common fillers (already done in part with SPACE)
  • Process fragments, have grammar for interior fragments (e.g. "otherwise x is true")
  • Allow chaining (requires Context)
  • Add error messages when parsing (e.g. "You typed 'self.capacity is unchanged', missing 'must' or 'will' to specify that this is a pre/post condition")

  • Allow using "but"
