# Lingua Franca
Welcome to the artifact for the ESEC/FSE'19 paper "Why Aren’t Regular Expressions a Lingua Franca? An Empirical Study on the Re-use and Portability of Regular Expressions", by J.C. Davis, L.G. Michael IV, C.A. Coghlan, F. Servant, and D. Lee, all of Virginia Tech.
This paper describes our study into regex portability practices and problems. In this empirical work, we:
- surveyed 158 professional software developers about their regex beliefs and re-use practices
- extracted regular expression-like entities from Stack Overflow and RegExLib to understand re-use practices
- extracted regular expressions from about 200,000 software projects written in 8 programming languages
- analyzed these production regular expressions for portability problems: syntactic, semantic, and performance
## Artifact
Our artifact includes the following:
Item | Description | Corresponding content in the paper | Scientific interest | Relation to prior work |
---|---|---|---|---|
Internet Sources collectors | Tools to extract regexes from Internet Sources | Section 6.2.1 | ||
Internet Sources corpus | Entities that look like regexes across Stack Overflow and RegExLib | Section 6.2 | First snapshot of regexes in Internet forums | No prior work has examined the regexes from these Internet sources. Our analysis was in the spirit of work on more general code re-use from Stack Overflow to GitHub. |
Regex extractors | Tools to statically extract regexes from software written in 8 programming languages | Section 5 | Adds 6 programming languages to the tools from our FSE'18 paper | |
Regex corpus | A polyglot regex corpus of 537,806 unique regexes extracted from 193,524 projects written in 8 programming languages | Collection: Section 5, esp. Table 1. Experiments: Section 7 | This is the largest and most diverse regex corpus ever collected. It should be useful for future regex analysis purposes, e.g. in testing a visualization or input generation tool. | Our FSE'18 paper included a corpus of about 400,000 regexes extracted from about 670,000 npm and pypi modules (See Table 1 in that paper, and that artifact). This new corpus covers 6 more programming languages. |
Regex analyses: Semantic | Drivers for 5 input generators | Section 7.1 | Collects, improves, and unifies existing input generators | |
Regex analyses: Performance | Drivers for 3 super-linear regex detectors | Section 7.2 | Extends existing super-linear regex detectors to partial-match semantics | Builds on the tooling from our FSE'18 paper |
In addition to this directory's `README.md`, each sub-tree comes with one or more READMEs describing its contents.
## Installation

### By hand

To install, execute the script `./configure.sh` on an Ubuntu 16.04 machine with root privileges.
This will obtain and install the various dependencies (e.g. OS packages, REDOS detectors) and compile all analysis tools.
The final line of this script is `echo "Configuration complete. I hope everything works!"`.
If you see this printed to the console, great!
Otherwise...alas.
### Container

**Not yet done!**
(However, see `containerized/Dockerfile` -- this may work?)
To facilitate replication, we have published a containerized version of this project on hub.docker.com. The container is based on an Ubuntu 16.04 image so it is fairly large.
For example, you might run:

```shell
docker pull jamiedavis/davismichaelcoghlanservantlee-fse19-regexartifact
docker run -ti jamiedavis/davismichaelcoghlanservantlee-fse19-regexartifact
> vim .env
# Set ECOSYSTEM_REGEXP_PROJECT_ROOT=/davis-fse19-artifact/LinguaFranca-FSE19
> . .env
> # Proceed to use our tools, see some examples below
```
## Use

### Environment variables

Export the following environment variables to ensure the tools know how to find each other:

- `ECOSYSTEM_REGEXP_PROJECT_ROOT`
- `VULN_REGEX_DETECTOR_ROOT` (dependency; set it to `$ECOSYSTEM_REGEXP_PROJECT_ROOT/analysis/performance/vuln-regex-detector`)

See `.env` for examples.
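For example, in a shell you might export them like this. The repository path here is a placeholder; point it at wherever you cloned the artifact:

```shell
# Placeholder path: point this at your clone of the repository
export ECOSYSTEM_REGEXP_PROJECT_ROOT="$HOME/LinguaFranca-FSE19"
# Dependency path given in this README
export VULN_REGEX_DETECTOR_ROOT="$ECOSYSTEM_REGEXP_PROJECT_ROOT/analysis/performance/vuln-regex-detector"
```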
### Analysis phases

Our analyses work on a set of regexes. You can use the tail of the full corpus to see how things go:

```shell
tail -10 $ECOSYSTEM_REGEXP_PROJECT_ROOT/data/production-regexes/uniq-regexes-8.json > 10-regexes.json
```
#### Syntax

```shell
$ECOSYSTEM_REGEXP_PROJECT_ROOT/bin/test-for-syntax-portability.py --regex-file 10-regexes.json --out-file 10-syntax.json 2>10-syntax.log
```

This should run quickly.
If you examine the tail of `10-syntax.log`, you'll see output like this:
```
11/06/2019 02:05:13 woody/22034: Generating a quick summary of regex syntax support
11/06/2019 02:05:13 woody/22034: Number of supporting languages  Number of regexes
11/06/2019 02:05:13 woody/22034:                              7                  1
11/06/2019 02:05:13 woody/22034:                              8                  9
11/06/2019 02:05:13 woody/22034:
11/06/2019 02:05:13 woody/22034: Language    Number of supported regexes
11/06/2019 02:05:13 woody/22034: javascript  10
11/06/2019 02:05:13 woody/22034: rust         9
11/06/2019 02:05:13 woody/22034: php         10
11/06/2019 02:05:13 woody/22034: python      10
11/06/2019 02:05:13 woody/22034: ruby        10
11/06/2019 02:05:13 woody/22034: perl        10
11/06/2019 02:05:13 woody/22034: java        10
11/06/2019 02:05:13 woody/22034: go          10
11/06/2019 02:05:13 woody/22034:
```
Apparently one regex was unsupported in Rust, while the other 9 regexes were supported in all 8 languages.
If you examine `10-syntax.json`, you'll see enhanced libLF.Regex objects -- they now have the `supportedLangs` member populated.
The row whose pattern contains the string "avatar" lists 7 of the languages but not Rust, because Rust does not accept an escaped forward slash (`\/`) as a valid construct.
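The summary table in the log can also be recomputed directly from the NDJSON output. Below is a minimal sketch, assuming each line of `10-syntax.json` is a JSON object with a `supportedLangs` list (the field name comes from this README; the exact libLF serialization may differ):

```python
import json
from collections import Counter

def summarize(ndjson_lines):
    """Tally supportedLangs across serialized libLF.Regex objects.

    The field name "supportedLangs" is inferred from this README;
    the actual libLF serialization may differ.
    """
    per_lang = Counter()   # language -> number of regexes it supports
    per_count = Counter()  # number of supporting languages -> number of regexes
    for line in ndjson_lines:
        langs = json.loads(line).get("supportedLangs", [])
        per_lang.update(langs)
        per_count[len(langs)] += 1
    return per_lang, per_count

# Demo on hand-made lines; in practice, pass open("10-syntax.json") instead
sample = [
    '{"pattern": "a+", "supportedLangs": ["javascript", "rust"]}',
    '{"pattern": "b/c", "supportedLangs": ["javascript"]}',
]
per_lang, per_count = summarize(sample)
print(per_lang["javascript"], per_count[2])  # 2 1
```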
#### Semantics

The semantics test should be run on the result of the syntax test, since it needs to know the `supportedLangs` of the libLF.Regex objects.

```shell
$ECOSYSTEM_REGEXP_PROJECT_ROOT/bin/test-for-semantic-portability.py --regex-file 10-syntax.json --out-file 10-semantic.json 2>10-semantic.log
```
This may take a few minutes.
Once it's done, you can look at the tail of `10-semantic.log`.
These 10 regexes are fairly dull from a semantic perspective:

```
0 (0.00%) of the 10 completed regexes had at least one witness for different behavior
```
If you examine `10-semantic.json`, you'll see that the libLF.Regex objects have been enhanced in a different way:

- They have the `nUniqueInputsTested` member set to the number of inputs that were attempted for each regex
- They have the `semanticDifferenceWitnesses` member set, though since none were found all of those lists are empty
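The "at least one witness" count in the log can be reproduced by streaming the NDJSON output and testing the `semanticDifferenceWitnesses` field. This is a sketch; the field name is taken from this README, and the demo lines are hand-made rather than real corpus entries:

```python
import json

def count_witnesses(ndjson_lines):
    """Return (regexes with >= 1 semantic-difference witness, total regexes)."""
    n_with_witness = n_total = 0
    for line in ndjson_lines:
        n_total += 1
        if json.loads(line).get("semanticDifferenceWitnesses"):
            n_with_witness += 1
    return n_with_witness, n_total

# Hand-made demo lines; in practice, pass open("10-semantic.json") instead.
demo = [
    '{"pattern": "a+", "semanticDifferenceWitnesses": []}',
    '{"pattern": "\\\\d+", "semanticDifferenceWitnesses": [{"input": "42"}]}',
]
print(count_witnesses(demo))  # (1, 2)
```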
For demonstration purposes, we have prepared a regex file that has semantic difference witnesses (cf. the final row of Table 4).

```shell
$ECOSYSTEM_REGEXP_PROJECT_ROOT/bin/test-for-semantic-portability.py --regex-file demo/semantic-difference-witness-regex.json --out-file demo-semantic.json 2>demo-semantic.log
```

The log file now ends more enticingly:

```
1 (100.00%) of the 1 completed regexes had at least one witness for different behavior
```

If you examine `demo-semantic.json`, you'll see the inputs that triggered semantic differences, with a breakdown of the distinct behaviors observed and the languages that evinced each behavior.
#### Performance

Run a performance analysis on the `10-regexes.json` file like this:

```shell
$ECOSYSTEM_REGEXP_PROJECT_ROOT/bin/test-for-SL-behavior.py --regex-file 10-regexes.json --out-file 10-performance.json --sl-timeout 10 --power-pumps 100000 2>10-performance.log
```

This takes a few minutes parallelized across my 8-core desktop. If you're in a hurry, you can run it on a 1-regex file instead of the 10-regex file we've been using.
Once complete, take a look at the end of `10-performance.log`. It says:
```
11/06/2019 01:56:43 woody/20094: Successfully performed SLRegexAnalysis on 10 regexes, 0 exceptions
11/06/2019 01:56:43 woody/20094: 1 of 10 successful analyses timed out in some language
11/06/2019 01:56:43 woody/20094: 1 of the regexes had different performance in different languages
11/06/2019 01:56:43 woody/20094: 0 of the regexes had different performance in the languages they actually appeared in
```
If you examine `10-performance.json`, you should see enhanced libLF.Regex objects.
According to the log, one of these exhibited super-linear behavior.
If you search the `10-performance.json` file for the string `100000": true`, you will see that a regex pattern beginning `proxy.*fooo` timed out on an input of 100,000 pumps in the following programming languages:
- javascript
- php
- python
- ruby
- java
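To see super-linear matching first-hand without the harness, you can time a textbook exponential-backtracking regex in Python. The pattern `(a+)+$` is purely illustrative; it is not one of the corpus regexes discussed above:

```python
import re
import time

# (a+)+$ is a classic super-linear regex under backtracking engines: on
# "aaa...a!" the nested quantifiers force exponentially many match attempts.
# (Illustrative only -- not the proxy.*fooo regex from the corpus.)
pattern = re.compile(r'(a+)+$')

timings = []
for n in (14, 18, 22):
    attack = 'a' * n + '!'  # the trailing '!' guarantees a mismatch
    start = time.perf_counter()
    assert pattern.search(attack) is None
    timings.append(time.perf_counter() - start)

# Each +4 characters multiplies the work by roughly 16x
print(timings[-1] > timings[0])  # True
```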
## Directory structure
File or Directory/ | Description |
---|---|
README.md | You're in it |
PAPER.pdf | Non-anonymized manuscript we submitted for review (not camera-ready) |
LICENSE | Terms of software release |
STATUS | Claims of artifact quality |
INSTALL | "Install instructions" |
----------------------- | ----------------------------------------------------------------- |
containerized/ | Dockerfile for building container |
configure.sh | One-stop-shop for configuration |
----------------------- | ----------------------------------------------------------------- |
data/ | Corpuses (internet and production) and tools to reproduce them |
analysis/ | Experimental analyses (syntax, semantic, performance) |
full-analysis/ | Run each analysis step on a regex |
lib/ | Python libraries -- utility routines, serializers and parsers for types expressed in JSON |
bin/ | Symlinks to the tools scattered throughout the tree, easing access from analysis scripts |
Each directory contains its own README with additional details.
## Style and file formats

### Style

Most of the scripts in this repository are written in Python. They write status updates to STDERR and write their output to an NDJSON-formatted `--out-file` of serialized libLF objects.
If you write a script that depends on other scripts in the repo, require the invoker to define `ECOSYSTEM_REGEXP_PROJECT_ROOT`.
This environment variable should name the location of your clone of this repository.
### File formats

This project uses JSON to describe research data. Files named `*.json` are generally NDJSON-formatted: they contain one JSON object per line.
Why giant flat files?
- Makes it easy to do a line-by-line streaming analysis on the objects in the file, even if the file is large.
- Makes it easy to divide work amongst the nodes in a compute cluster.
- Makes it easy to share data with other researchers.
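As a concrete example of the second point, `split` alone is enough to partition an NDJSON file into independently analyzable chunks. The tiny `demo.json` below is hand-made; the same command works on `uniq-regexes-8.json`:

```shell
# Every line is a self-contained JSON object, so a line-count split yields
# valid NDJSON chunks (part-aa, part-ab, ...) that can be processed in parallel.
printf '%s\n' '{"pattern":"a"}' '{"pattern":"b"}' '{"pattern":"c"}' > demo.json
split -l 2 demo.json part-
wc -l part-aa part-ab
```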
## Contact
Contact J.C. Davis at davisjam@vt.edu with any questions.