Clone Doctor: Software Clone Detection and Reporting
A Tool that Aids the Tracking and Removal of Duplicate Code to Reduce Maintenance Cost
The cost of maintaining a software system is directly proportional to the size of its source code base. Large software systems typically contain 10-25% duplicated code. This redundancy is caused by the common programming practice of replicating code fragments (typically by copy and paste, i.e., "cloning") and then customizing them to handle new demands on an application. An IT organization consequently spends a corresponding share of its budget redundantly maintaining this code; a bug in one code fragment is also a bug in all of its hidden clones. If redundant code is removed, the savings can be significant, because it typically costs at least $1.00 per source line per year to maintain application software in running order. If the redundant code is tracked, changes made to one version can be reliably installed in the others.
The CloneDR, built with the DMS/Software Reengineering Toolkit, identifies not only exact but also near-miss duplicates in software systems, and can be used on a wide variety of languages. It will find clones even where the formatting, the variable names, and even some code snippets differ. It detects clones in code and/or data declarations.
CloneDR Features
- Compares files exhaustively across whole systems
- Available for many different languages and dialects
- Parameterized to identify exact matches or near misses within a specifiable range of difference
- Clones can be filtered by threshold size
- Identifies parameterization schemes for matching near-miss clones
- Uses compiler technology rather than string matching, finding clones irrespective of changed comments, white space, line breaks, and case changes. Proven to minimize false positives.
- Detects clones in procedural code and/or data declarations
- Supports analysis of thousands of files/millions of lines of code, utilizing parallel computing with multiple processors
- Runs under Windows on a uniprocessor, and makes extensive use of symmetric multiprocessor (SMP) capability when it is available
- Eclipse integration for IBM Enterprise languages.
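To illustrate why comparing program structure finds clones that string matching misses, here is a minimal Python sketch. It is not the CloneDR algorithm: it uses Python's standard `ast` module as a stand-in for "compiler technology", it only catches copies that differ in comments, layout, identifier names, and literal values, and it makes no attempt at the near-miss matching or parameterization described above. The names `shape`, `find_clones`, and `SAMPLE` are hypothetical, chosen for this example.

```python
import ast
from collections import defaultdict


def shape(node):
    """Describe the tree structure of `node`, ignoring names and literal values."""
    if isinstance(node, (ast.Name, ast.arg)):
        return "ID"        # every identifier looks the same
    if isinstance(node, ast.Constant):
        return "CONST"     # every literal looks the same
    children = ", ".join(shape(child) for child in ast.iter_child_nodes(node))
    return f"{type(node).__name__}({children})"


def find_clones(source):
    """Group the functions in `source` whose normalized structure is identical."""
    groups = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            groups[shape(node)].append(node.name)
    return [names for names in groups.values() if len(names) > 1]


# Hypothetical input: the second function is a renamed, reformatted copy of the first.
SAMPLE = '''
def sort_prices(prices):
    # classic bubble sort
    for i in range(len(prices)):
        for j in range(len(prices) - 1 - i):
            if prices[j] > prices[j + 1]:
                prices[j], prices[j + 1] = prices[j + 1], prices[j]
    return prices

def order_names(names):
    for a in range(len(names)):
        for b in range(len(names) - 1 - a):
            if names[b] > names[b + 1]:   # reformatted, renamed copy
                names[b], names[b + 1] = (
                    names[b + 1], names[b])
    return names

def total(values):
    return sum(values)
'''

print(find_clones(SAMPLE))  # [['sort_prices', 'order_names']]
```

A line-by-line text comparison would report these two functions as different on almost every line. The parser has already discarded the comments and layout, so the structural comparison sees them as the same code.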
A System Undergoing Clone Detection and Removal
In this example, we have a large system (say, a million lines with hundreds or thousands of files) with a block of code implementing Bubble-sort, that has been cloned, modified, and reformatted several times, spread across that huge code base. A Bubble-sort is a performance bug waiting to be encountered: it performs poorly when the number of items to sort is anything other than small.
Eventually some user of the system encounters slowness in sorting and a bug report is filed. Joe Programmer is assigned to fix it, finds one of the instances, determines that the sorting algorithm is a poor choice, fixes it and declares victory. The improved system goes out to the field and gets reinstalled. Up to this point, a significant expense has been to find and repair the problem in the field. Fixing bugs in the field typically costs organizations close to a thousand dollars.
However, the user eventually discovers the performance problem again because the other clones were not fixed. Consequence: the users' time (and there may be many users) is wasted more than once, and the bug is repaired more than once, wasting significant resources. In addition, the users may conclude that the development organization is doing a poor job, because bugs don't get fixed.
The problem here is that Joe, while he may know that the software has clones, has no way of knowing where they are, or whether the code on which he is working is cloned. With systems being typically 20% cloned, Joe has a 20% chance of making this mistake for every fix he makes.
What is needed is a tool to either tell Joe about the clones, so that he can competently find and fix all of them, or to remove the clones, so that Joe can fix just the one instance.
In our example, the CloneDR will identify the 3 regions as being clones, and will produce a covering abstraction (detected code skeleton) that shows precisely what they have in common. Developers can also use the detected abstraction to remove the clones.
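As a rough illustration of what a covering abstraction looks like, the Python sketch below shows three hypothetical near-miss Bubble-sort clones and a hand-written abstraction that captures their shared skeleton, with the one point of difference turned into a parameter. All function names are invented for this example, and the abstraction is written by hand here; it is not actual CloneDR output.

```python
# Three hypothetical near-miss clones, as they might be scattered across a code base.
def sort_prices(prices):
    for i in range(len(prices)):
        for j in range(len(prices) - 1 - i):
            if prices[j] > prices[j + 1]:
                prices[j], prices[j + 1] = prices[j + 1], prices[j]
    return prices


def sort_scores_descending(scores):
    for a in range(len(scores)):
        for b in range(len(scores) - 1 - a):
            if scores[b] < scores[b + 1]:
                scores[b], scores[b + 1] = scores[b + 1], scores[b]
    return scores


def sort_staff_by_name(staff):
    for m in range(len(staff)):
        for n in range(len(staff) - 1 - m):
            if staff[n]["name"] > staff[n + 1]["name"]:
                staff[n], staff[n + 1] = staff[n + 1], staff[n]
    return staff


# Covering abstraction: the common skeleton, with the differing comparison
# passed in as the parameter `out_of_order`.
def bubble_sort(items, out_of_order):
    for i in range(len(items)):
        for j in range(len(items) - 1 - i):
            if out_of_order(items[j], items[j + 1]):
                items[j], items[j + 1] = items[j + 1], items[j]
    return items


# Each clone can then be replaced by a call to the single abstraction.
print(bubble_sort([3, 1, 2], lambda x, y: x > y))  # [1, 2, 3]
print(bubble_sort([3, 1, 2], lambda x, y: x < y))  # [3, 2, 1]
```

Once the clones are replaced by calls to one routine, the slow algorithm needs to be fixed in only one place.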
You can calculate your annual savings if you do careful clone management.
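A back-of-the-envelope version of that calculation, using only figures quoted on this page (a million-line system, 10-25% duplicated code, and at least $1.00 per source line per year), might look like the Python sketch below. The function name `annual_clone_cost` is hypothetical. The result should be read as a rough upper bound on the savings, since removing clones still leaves one copy of each fragment to maintain.

```python
def annual_clone_cost(total_lines, clone_fraction, cost_per_line_per_year=1.00):
    """Estimated yearly cost of maintaining the duplicated lines, in dollars."""
    return total_lines * clone_fraction * cost_per_line_per_year


# A million-line system at the low and high ends of the typical duplication range.
for fraction in (0.10, 0.25):
    cost = annual_clone_cost(1_000_000, fraction)
    print(f"{fraction:.0%} cloned: ${cost:,.0f} per year")
```

Substituting your own line count, measured clone percentage, and maintenance cost per line gives an estimate for your system.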
Remarkably, clone detection can also be used to improve test coverage.
Testimonial and Experience
Salion, Inc. has used the CloneDR on their 250,000 line Java application. By inspecting the CloneDR reports and doing only partial clone removal, by hand, they eliminated 27,000 lines of code. This gave Salion's software an impossible growth curve: its size actually went down. (Does your code base ever get smaller?)
A paper on Salion's experience using the CloneDR to help them find abstractions in code can be found here (PDF: 142Kb). The Software Engineering Institute discusses how Salion does product line development using CloneDR as a key driver.
How CloneDR Works
Available copy-and-paste detection tools work in a variety of ways. You can learn more about how CloneDR works and how it compares to other detectors.
Available for the Following Languages
- Ada (83 and 95)
- COBOL (IBM Enterprise, VSII, AS400, ANSI 1985). See a sample summary HTML page for a source code base of 77,000 lines (58%! redundant) and a sample page for a typical reported clone. Notice also the Eclipse graphical user interface for the CloneDR.
- C (GCC2-GCC4, VisualC6, ISO9899c1990)
- C++ (GCC3/4/5 C++11 and C++14, Visual C++ 6.0, Visual Studio 2005, ISO14882c1998)
- C# (Versions 2.0, 3.0, 4.0, v5 and v6). See a sample on the open source package NHibernate code base of 321,000 lines (12% redundant).
- ECMAScript ("JavaScript"). See a sample on Google's Closure library (8.2% redundant even though it would be expected to be squeaky clean to minimize load times).
- EGL (v6 and VAGen 4.5)
- Java (1.3, 1.4, 1.5, 1.6, Java 7 and 8). See a sample on a portion of Eclipse source code base of 1M+ lines (9.5% redundant).
- PHP 4.0 and PHP 5.0. See a report for OSS Joomla framework of 178K SLOC (10.8% redundant).
- Python 2.6 and 3.0. See a sample report for a 100K SLOC of OSS Python code (11.5% redundant).
- Natural. See a sample on a source code base of 635,000 SLOC (27% redundant).
- Visual Basic (VB.net, VB6, VBScript)
- Fortran 90
- IEVER 2.6 Logix5000
- Object Pascal
- PLSQL
- Transact
- XML
Evaluation Versions of CloneDR
Evaluation versions of the CloneDR clone analysis tool for Java, C#, C++, COBOL, JavaScript and PHP are available for download. (There are evaluation versions for a variety of other languages as above; check at the download link, or inquire). The analysis tool will process large source systems to determine the percentage of clones, and will display a number of sample clones.
If you want an evaluation CloneDR for another language listed above, inquire at sales@semanticdesigns.com.
Other languages
Semantic Designs can build a CloneDR for almost any language. Contact us for your special case.