# Lingua Franca
Welcome to the artifact for the ESEC/FSE'19 paper "Why Aren’t Regular Expressions a Lingua Franca? An Empirical Study on the Re-use and Portability of Regular Expressions", by J.C. Davis, L.G. Michael IV, C.A. Coghlan, F. Servant, and D. Lee, all of Virginia Tech.
This paper describes our study into regex portability practices and problems. In this empirical work, we:
- surveyed 158 professional software developers about their regex beliefs and re-use practices
- extracted regular expression-like entities from Stack Overflow and RegExLib to understand re-use practices
- extracted regular expressions from about 200,000 software projects written in 8 programming languages
- analyzed these production regular expressions for portability problems: syntactic, semantic, and performance
## Artifact
Our artifact includes the following:
Item | Description | Corresponding content in the paper | Scientific interest | Relation to prior work |
---|---|---|---|---|
Internet Sources collectors | Tools to extract regexes from Internet Sources | Section 6.2.1 | ||
Internet Sources corpus | Entities that look like regexes across Stack Overflow and RegExLib | Section 6.2 | First snapshot of regexes in Internet forums | No prior work has examined the regexes from these Internet sources. Our analysis was in the spirit of work on more general code re-use from Stack Overflow to GitHub. |
Regex extractors | Tools to statically extract regexes from software written in 8 programming languages | Section 5 | Adds 6 programming languages to the tools from our FSE'18 paper | |
Regex corpus | A polyglot regex corpus of 537,806 unique regexes extracted from 193,524 projects written in 8 programming languages | Collection: Section 5, esp. Table 1. Experiments: Section 7 | This is the largest and most diverse regex corpus ever collected. It should be useful for future regex analysis purposes, e.g. in testing a visualization or input generation tool. | Our FSE'18 paper included a corpus of about 400,000 regexes extracted from about 670,000 npm and pypi modules (See Table 1 in that paper, and that artifact). This new corpus covers 6 more programming languages. |
Regex analyses: Semantic | Drivers for 5 input generators | Section 7.1 | Collects, improves, and unifies existing input generators | |
Regex analyses: Performance | Drivers for 3 super-linear regex detectors | Section 7.2 | Extends existing super-linear regex detectors to partial-match semantics | Builds on the tooling from our FSE'18 paper |
In addition to this directory's `README.md`, each sub-tree comes with one or more READMEs describing its contents.
## Installation

### By hand

To install, execute the script `./configure.sh` on an Ubuntu 16.04 machine with root privileges.
This will obtain and install the various dependencies (e.g. OS packages, REDOS detectors) and compile all analysis tools.
The final line of this script is `echo "Configuration complete. I hope everything works!"`.
If you see this printed to the console, great!
Otherwise...alas.
### Container

**Not yet done!**
(However, see `containerized/Dockerfile` -- this may work?)
To facilitate replication, we have published a containerized version of this project on hub.docker.com. The container is based on an Ubuntu 16.04 image so it is fairly large.
For example, you might run:

```shell
docker pull jamiedavis/davismichaelcoghlanservantlee-fse19-regexartifact
docker run -ti jamiedavis/davismichaelcoghlanservantlee-fse19-regexartifact
> vim .env
# Set ECOSYSTEM_REGEXP_PROJECT_ROOT=/davis-fse19-artifact/LinguaFranca-FSE19
> . .env
> # Proceed to use our tools, see some examples below
```
## Use

### Environment variables

Export the following environment variables to ensure the tools know how to find each other:

- `ECOSYSTEM_REGEXP_PROJECT_ROOT`
- `VULN_REGEX_DETECTOR_ROOT` (dependency; set it to `$ECOSYSTEM_REGEXP_PROJECT_ROOT/analysis/performance/vuln-regex-detector`)

See `.env` for examples.
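For example, in a shell you might export them like this. The repository path here is a placeholder; point it at wherever you cloned the artifact:

```shell
# Placeholder path: point this at your clone of the repository
export ECOSYSTEM_REGEXP_PROJECT_ROOT="$HOME/LinguaFranca-FSE19"
# Dependency path given in this README
export VULN_REGEX_DETECTOR_ROOT="$ECOSYSTEM_REGEXP_PROJECT_ROOT/analysis/performance/vuln-regex-detector"
```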
### Analysis phases

Our analyses work on a set of regexes. You can use the tail of the full corpus to see how things go:

```shell
tail -10 $ECOSYSTEM_REGEXP_PROJECT_ROOT/data/production-regexes/uniq-regexes-8.json > 10-regexes.json
```
#### Syntax

```shell
$ECOSYSTEM_REGEXP_PROJECT_ROOT/bin/test-for-syntax-portability.py --regex-file 10-regexes.json --out-file 10-syntax.json 2>10-syntax.log
```

This should run quickly.
If you examine the tail of `10-syntax.log`, you'll see output like this:
```
11/06/2019 02:05:13 woody/22034: Generating a quick summary of regex syntax support
11/06/2019 02:05:13 woody/22034: Number of supporting languages  Number of regexes
11/06/2019 02:05:13 woody/22034:                              7                  1
11/06/2019 02:05:13 woody/22034:                              8                  9
11/06/2019 02:05:13 woody/22034:
11/06/2019 02:05:13 woody/22034: Language    Number of supported regexes
11/06/2019 02:05:13 woody/22034: javascript  10
11/06/2019 02:05:13 woody/22034: rust         9
11/06/2019 02:05:13 woody/22034: php         10
11/06/2019 02:05:13 woody/22034: python      10
11/06/2019 02:05:13 woody/22034: ruby        10
11/06/2019 02:05:13 woody/22034: perl        10
11/06/2019 02:05:13 woody/22034: java        10
11/06/2019 02:05:13 woody/22034: go          10
11/06/2019 02:05:13 woody/22034:
```
Apparently one regex was unsupported in Rust, while the other 9 regexes were supported in all 8 languages.
If you examine `10-syntax.json`, you'll see enhanced libLF.Regex objects -- they now have the `supportedLangs` member populated.
The row whose pattern contains the string "avatar" lists 7 of the languages but not Rust, because Rust does not accept an escaped forward slash (`\/`) as a valid construct.
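The summary table in the log can also be recomputed directly from the NDJSON output. Below is a minimal sketch, assuming each line of `10-syntax.json` is a JSON object with a `supportedLangs` list (the field name comes from this README; the exact libLF serialization may differ):

```python
import json
from collections import Counter

def summarize(ndjson_lines):
    """Tally supportedLangs across serialized libLF.Regex objects.

    The field name "supportedLangs" is inferred from this README;
    the actual libLF serialization may differ.
    """
    per_lang = Counter()   # language -> number of regexes it supports
    per_count = Counter()  # number of supporting languages -> number of regexes
    for line in ndjson_lines:
        langs = json.loads(line).get("supportedLangs", [])
        per_lang.update(langs)
        per_count[len(langs)] += 1
    return per_lang, per_count

# Demo on hand-made lines; in practice, pass open("10-syntax.json") instead
sample = [
    '{"pattern": "a+", "supportedLangs": ["javascript", "rust"]}',
    '{"pattern": "b/c", "supportedLangs": ["javascript"]}',
]
per_lang, per_count = summarize(sample)
print(per_lang["javascript"], per_count[2])  # 2 1
```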
#### Semantics

The semantics test should be run on the result of the syntax test, since it needs to know the `supportedLangs` of the libLF.Regex objects.

```shell
$ECOSYSTEM_REGEXP_PROJECT_ROOT/bin/test-for-semantic-portability.py --regex-file 10-syntax.json --out-file 10-semantic.json 2>10-semantic.log
```
This may take a few minutes.
Once it's done, you can look at the tail of `10-semantic.log`.
These 10 regexes are fairly dull from a semantic perspective:

```
0 (0.00%) of the 10 completed regexes had at least one witness for different behavior
```
If you examine `10-semantic.json`, you'll see that the libLF.Regex objects have been enhanced in a different way:

- They have the `nUniqueInputsTested` member set to the number of inputs that were attempted for each regex
- They have the `semanticDifferenceWitnesses` member set, though since none were found all of those lists are empty
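The "at least one witness" count in the log can be reproduced by streaming the NDJSON output and testing the `semanticDifferenceWitnesses` field. This is a sketch; the field name is taken from this README, and the demo lines are hand-made rather than real corpus entries:

```python
import json

def count_witnesses(ndjson_lines):
    """Return (regexes with >= 1 semantic-difference witness, total regexes)."""
    n_with_witness = n_total = 0
    for line in ndjson_lines:
        n_total += 1
        if json.loads(line).get("semanticDifferenceWitnesses"):
            n_with_witness += 1
    return n_with_witness, n_total

# Hand-made demo lines; in practice, pass open("10-semantic.json") instead.
demo = [
    '{"pattern": "a+", "semanticDifferenceWitnesses": []}',
    '{"pattern": "\\\\d+", "semanticDifferenceWitnesses": [{"input": "42"}]}',
]
print(count_witnesses(demo))  # (1, 2)
```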
For demonstration purposes, we have prepared a regex file that has semantic difference witnesses (cf. the final row of Table 4).

```shell
$ECOSYSTEM_REGEXP_PROJECT_ROOT/bin/test-for-semantic-portability.py --regex-file demo/semantic-difference-witness-regex.json --out-file demo-semantic.json 2>demo-semantic.log
```

The log file now ends more enticingly:

```
1 (100.00%) of the 1 completed regexes had at least one witness for different behavior
```

If you examine `demo-semantic.json`, you'll see the inputs that triggered semantic differences, with a breakdown of the distinct behaviors observed and the languages that evinced each behavior.
#### Performance

Run a performance analysis on the `10-regexes.json` file like this:

```shell
$ECOSYSTEM_REGEXP_PROJECT_ROOT/bin/test-for-SL-behavior.py --regex-file 10-regexes.json --out-file 10-performance.json --sl-timeout 10 --power-pumps 100000 2>10-performance.log
```

This takes a few minutes parallelized across my 8-core desktop. If you're in a hurry, you can run it on a 1-regex file instead of the 10-regex file we've been using.
Once complete, take a look at the end of `10-performance.log`. It says:
```
11/06/2019 01:56:43 woody/20094: Successfully performed SLRegexAnalysis on 10 regexes, 0 exceptions
11/06/2019 01:56:43 woody/20094: 1 of 10 successful analyses timed out in some language
11/06/2019 01:56:43 woody/20094: 1 of the regexes had different performance in different languages
11/06/2019 01:56:43 woody/20094: 0 of the regexes had different performance in the languages they actually appeared in
```
If you examine `10-performance.json`, you should see enhanced libLF.Regex objects.
According to the log, one of these exhibited super-linear behavior.
If you search the `10-performance.json` file for the string `100000": true`, you will see that a regex pattern beginning `proxy.*fooo` timed out on an input of 100,000 pumps in the following programming languages:
- javascript
- php
- python
- ruby
- java
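To see super-linear matching first-hand without the harness, you can time a textbook exponential-backtracking regex in Python. The pattern `(a+)+$` is purely illustrative; it is not one of the corpus regexes discussed above:

```python
import re
import time

# (a+)+$ is a classic super-linear regex under backtracking engines: on
# "aaa...a!" the nested quantifiers force exponentially many match attempts.
# (Illustrative only -- not the proxy.*fooo regex from the corpus.)
pattern = re.compile(r'(a+)+$')

timings = []
for n in (14, 18, 22):
    attack = 'a' * n + '!'  # the trailing '!' guarantees a mismatch
    start = time.perf_counter()
    assert pattern.search(attack) is None
    timings.append(time.perf_counter() - start)

# Each +4 characters multiplies the work by roughly 16x
print(timings[-1] > timings[0])  # True
```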
## Directory structure
File or Directory/ | Description |
---|---|
README.md | You're in it |
PAPER.pdf | Non-anonymized manuscript we submitted for review (not camera-ready) |
LICENSE | Terms of software release |
STATUS | Claims of artifact quality |
INSTALL | "Install instructions" |
----------------------- | ----------------------------------------------------------------- |
containerized/ | Dockerfile for building container |
configure.sh | One-stop-shop for configuration |
----------------------- | ----------------------------------------------------------------- |
data/ | Corpuses (internet and production) and tools to reproduce them |
analysis/ | Experimental analyses (syntax, semantic, performance) |
full-analysis/ | Run each analysis step on a regex |
lib/ | Python libraries -- utility routines, serializers and parsers for types expressed in JSON |
bin/ | Symlinks to the tools scattered throughout the tree, easing access from analysis scripts |
Each directory contains its own README with additional details.
## Style and file formats

### Style

Most of the scripts in this repository are written in Python. They write status updates to STDERR and write their output to an NDJSON-formatted `--out-file` of serialized libLF objects.
If you write a script that depends on other scripts in the repo, require the invoker to define `ECOSYSTEM_REGEXP_PROJECT_ROOT`.
This environment variable should name the location of your clone of this repository.
### File formats

This project uses JSON to describe research data. Files named `*.json` are generally NDJSON-formatted: they contain one JSON object per line.
Why giant flat files?
- Makes it easy to do a line-by-line streaming analysis on the objects in the file, even if the file is large.
- Makes it easy to divide work amongst the nodes in a compute cluster.
- Makes it easy to share data with other researchers.
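As a concrete example of the second point, `split` alone is enough to partition an NDJSON file into independently analyzable chunks. The tiny `demo.json` below is hand-made; the same command works on `uniq-regexes-8.json`:

```shell
# Every line is a self-contained JSON object, so a line-count split yields
# valid NDJSON chunks (part-aa, part-ab, ...) that can be processed in parallel.
printf '%s\n' '{"pattern":"a"}' '{"pattern":"b"}' '{"pattern":"c"}' > demo.json
split -l 2 demo.json part-
wc -l part-aa part-ab
```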
## Contact
Contact J.C. Davis at davisjam@vt.edu with any questions.