`srcdupchk`: Program to find duplication between two source trees

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

I originally wrote this program in 2003, when the tech press was full of allegations that the Linux kernel contained proprietary Unix code. At the time, IBM was being sued by one of the historical incarnations of SCO.

srcdupchk takes two trees of source code and looks for lines of code that are identical between them. It has some heuristics tuned for C and C++ code to reduce false positives, but it should work fairly well on any line-oriented input.

This code was originally written in 24 hours in 2003. I've updated it to build on modern Linux systems and tweaked the output a little. All the tests still pass, but proceed with appropriate caution: This code is old enough to drive.

How to Install It

This code works on Linux, and it requires a C++ development environment. On Ubuntu, you can set things up using:

sudo apt-get install build-essential

You should be able to install this program as follows:

git clone https://github.com/emk/srcdupchk.git
cd srcdupchk
./configure
make
sudo make install

This may or may not work on other Unix variants, including macOS or the Windows Subsystem for Linux.

What It Does

srcdupchk [--show-shreds] [--shred-size N] TREE1 TREE2

srcdupchk analyzes two source trees—call them TREE1 and TREE2—for matches, and outputs any source ranges in TREE2 which can also be found in TREE1.

srcdupchk uses the "shred" algorithm proposed by an anonymous Inquirer reader, but does a little preprocessing to account for typical differences in coding style, including indentation, whitespace, brace placement and choice of comment characters. It also ignores any CVS directories and any lines starting with '#include' (which tend to contain large numbers of spurious duplicates).

See Egan Orion's article on the Wayback Machine.

My main goal in writing srcdupchk is to make it easy for proprietary software vendors to notify maintainers of free software about any contributors who are misusing copyrighted code. In my experience, free software developers are well-informed about copyrights, and deeply wish to respect copyright law. But if you don't notify them of any problems, they can't help you, and they can't bar dishonest contributors.

Besides tracking down potential copyright violations in free software so you can report them, there are a few other interesting and legitimate uses of srcdupchk:

- You can find out if any Linux or GNU code has been copied into your company's proprietary software, and remove it. If you're improperly using GPL'd code, you can contact the GPL Compliance Lab at the Free Software Foundation (http://www.fsf.org/). I've worked with them before; they're exceptionally helpful and committed to resolving problems discreetly.
- You can compare various free projects with each other to see how much code they share. It would be fun to look at 4.4 BSD Lite, FreeBSD, NetBSD, OpenBSD, Linux and the HURD, and see how much overlap there is between the various kernels.

Legal Issues

If you run srcdupchk on proprietary code, it will reveal certain information about that code. Do not publish the results unless you have permission to do so.

An example

I deleted a few autogenerated files from srcdupchk, and compared it to GNU textutils (the origin of the md5.h and md5.c files in this distribution). I found the following matches:

// Output from 'srcdupchk-bin textutils-2.1 srcdupchk'.
srcdupchk/COPYING:1:340               // Verbatim copy of GPL.
srcdupchk/DcHeaders.h:6:18            // GPL license notice.
srcdupchk/DirWalker.cpp:6:18          // GPL license notice.
srcdupchk/DirWalker.h:6:18            // GPL license notice.
srcdupchk/DupChecker.cpp:6:18         // GPL license notice.
srcdupchk/FileInfo.cpp:6:18           // GPL license notice.
srcdupchk/FileInfo.h:6:18             // GPL license notice.
srcdupchk/HashCode.cpp:6:18           // GPL license notice.
srcdupchk/HashCode.h:6:18             // GPL license notice.
srcdupchk/INSTALL:1:230               // Standard GNU file.
srcdupchk/SourceTreeChecker.cpp:6:18  // GPL license notice.
srcdupchk/SourceTreeChecker.h:6:18    // GPL license notice.
srcdupchk/SourceTreeInfo.cpp:6:18     // GPL license notice.
srcdupchk/SourceTreeInfo.h:6:18       // GPL license notice.
srcdupchk/UnitTest.cpp:6:18           // GPL license notice.
srcdupchk/UnitTest.h:6:18             // GPL license notice.
srcdupchk/config.h.in:3:19            // Output of autoheader program.
srcdupchk/md5.c:1:416                 // Taken from textutils.
srcdupchk/md5.h:1:161                 // Taken from textutils.
srcdupchk/test1/subdir/c.txt:4:7      // Coincidence: Similar test data!
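
Each entry in the listing above appears to have the form `path:first_line:last_line`. This README does not describe how srcdupchk turns individual matching shreds into such ranges. One plausible approach, sketched here purely as an assumption, is to merge the line ranges of overlapping or adjacent matching shreds. The names `LineRange`, `MergeRanges` and `FormatRange` are hypothetical and are not taken from the srcdupchk sources.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical type: an inclusive range of 1-based line numbers.
struct LineRange {
    std::size_t first;
    std::size_t last;
};

// Merge ranges that overlap or touch into maximal ranges.
// (Assumed approach; not necessarily what srcdupchk does.)
std::vector<LineRange> MergeRanges(std::vector<LineRange> ranges) {
    std::sort(ranges.begin(), ranges.end(),
              [](const LineRange& a, const LineRange& b) {
                  return a.first < b.first;
              });
    std::vector<LineRange> merged;
    for (const LineRange& r : ranges) {
        if (!merged.empty() && r.first <= merged.back().last + 1) {
            // Overlaps or directly follows the previous range: extend it.
            merged.back().last = std::max(merged.back().last, r.last);
        } else {
            merged.push_back(r);
        }
    }
    return merged;
}

// Format one range in the "path:first:last" style of the listing above.
std::string FormatRange(const std::string& path, const LineRange& r) {
    return path + ":" + std::to_string(r.first) + ":" + std::to_string(r.last);
}
```

With 5-line shreds matching at lines 1, 2 and 3, the three ranges 1-5, 2-6 and 3-7 would collapse into a single 1-7 entry under this scheme.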

Algorithm

Here's how we compute file shreds for a directory tree:

1. Walk the directory tree recursively, skipping ".", ".." and "CVS".
2. For each file in the directory tree, check whether it is binary by looking at the first 512 characters. If any of these characters are NUL, or more than 10% have the high bit set, assume the file is binary.
3. For each non-binary file:
   - Read in all the lines in the file.
   - Normalize lines by removing the characters " \t\r/*{}". This attempts to correct for differences in whitespace, commenting style and brace placement.
   - Discard empty lines and all lines beginning with "#include". The latter don't contain useful information, and tend to generate false positives for small shred sizes.
   - Group the remaining lines into overlapping N-line chunks and calculate a 128-bit cryptographic hash for each chunk.

About cryptographic hashes: With a 128-bit hash, you'd have to look at roughly 18,446,744,073,709,551,616 (2^64) chunks, on average, before two hashes matched accidentally. That is about 9 billion times the number of chunks in GM's infamous 2-billion-line codebase, probably the largest in the world. Modern operating systems typically contain between 1 million and 100 million chunks.
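
The large number quoted above is the birthday bound for a 128-bit hash. As a rough check, assuming hash values are uniformly distributed (an idealization) and treating a 2-billion-line codebase as about 2 billion chunks:

```latex
\begin{align*}
N &= 2^{128} && \text{possible hash values} \\
n_{\text{collision}} &\approx \sqrt{N} = 2^{64}
  = 18{,}446{,}744{,}073{,}709{,}551{,}616 \approx 1.8 \times 10^{19} \\
\frac{2^{64}}{2 \times 10^{9}} &\approx 9.2 \times 10^{9}
  && \text{(about 9 billion)} \\
P(\text{any collision among } n \text{ chunks}) &\approx \frac{n^{2}}{2N},
  \qquad n = 10^{8} \;\Rightarrow\;
  P \approx \frac{10^{16}}{6.8 \times 10^{38}} \approx 1.5 \times 10^{-23}
\end{align*}
```

So even at the upper end of the range given for modern operating systems (100 million chunks), the chance of a single accidental match is negligible under these assumptions.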

Performance

srcdupchk is a CPU and RAM hog, because I wrote it in C++ in under 24 hours and haven't bothered to optimize it. It can process the Linux kernel in about 3 minutes, using 140MB of RAM, on an 800MHz laptop.

The biggest CPU hotspot, according to gprof, is the FileInfo::Shredder function, which takes up 80% of the time (excluding the call to md5_buffer, which can't be avoided or optimized much further). Within that function, the biggest hotspots appear to be std::getline and Normalize, which each take around 30% of the CPU time.

The biggest use of memory is SourceTreeInfo::ShredMap, which contains one entry for each of several million shreds. It uses std::map, which is remarkably fast in my tests, but definitely uses more memory than a well-designed custom data structure.
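
To illustrate the kind of trade-off mentioned above, here is a hypothetical, more compact alternative to a `std::map`: a sorted vector of 128-bit digests searched with `std::binary_search`. `Digest` and `SortedShredSet` are invented for this sketch and are not part of srcdupchk. The sketch only answers "has this shred been seen?" and does not record which file a shred came from, so it is not a drop-in replacement for `SourceTreeInfo::ShredMap`.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical representation of a 128-bit hash value.
using Digest = std::array<std::uint8_t, 16>;

// Stores digests contiguously (16 bytes each) instead of one tree node per
// entry. Call Freeze() once after all Add() calls and before Contains().
class SortedShredSet {
public:
    void Add(const Digest& d) { digests_.push_back(d); }

    // Sort and deduplicate so that lookups can use binary search.
    void Freeze() {
        std::sort(digests_.begin(), digests_.end());
        digests_.erase(std::unique(digests_.begin(), digests_.end()),
                       digests_.end());
        digests_.shrink_to_fit();
    }

    // Only valid after Freeze().
    bool Contains(const Digest& d) const {
        return std::binary_search(digests_.begin(), digests_.end(), d);
    }

    std::size_t Size() const { return digests_.size(); }

private:
    std::vector<Digest> digests_;
};
```

The design choice here is to trade incremental insertion for density: all digests from TREE1 would be added first, frozen once, and then queried while scanning TREE2.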

Contributing

Please feel free to submit pull requests; I'll try to get back to you within a week or two. If you want to add new features, please use the existing coding style, check carefully for bugs, and write test cases before writing code.

srcdupchk's internals are fairly reusable; feel free to adapt the code to solve other problems.
