Code Cop: static analysis

Showing posts with label static analysis. Show all posts

7 January 2020

Visualising Architecture: GraphML Charting Module Dependencies

When working with existing code, analysing module dependencies helps me understand the bigger picture. For well defined modules like Eclipse OSGi or Maven there are tools to show all modules and their inter dependencies, e.g. the Eclipse Dependency Visualization tool or the Maven Module Dependency Graph Creator. For legacy technologies like COBOL and NATURAL - and probably some more platforms - we struggle due to the lack of such tools. Then the best approach is to extract the necessary information and put it into a modern, well known format. I wrote about such an approach in the past, e.g. loading references from a CSV into Neo4j for easy navigation and visualization. Today I want to sketch another idea I have used since many years: Using GraphML to chart module dependencies. For example, the following graph shows the (filtered) modules of a NATURAL application together with their dependencies in an organic layout.

Some modules of a NATURAL application in an organic layout, 2017

Some modules of a NATURAL application in an organic layout, 2017

Process
GraphML is an XML-based file format for graphs. [...] It uses an XML-based syntax and supports the entire range of possible graph structure constellations including directed, undirected, mixed graphs, hyper graphs, and application-specific attributes. That should be more than enough to visualize any module dependencies. In fact I just need nodes and edges. My process to use GraphML is always the same:

Start with defining or extracting the modules and their dependencies.
Create a little script which loads the extracted data and creates a raw GraphML XML file.
Use an existing graph editor to tweak and layout the diagram.
Export the the diagram as PDF or as image.

Get the raw data
Extracting the module information is highly programming language specific. For Java I used my JavaClass class file parser. For NATURAL I used a CSV of all references which was provided by my client. Usually regular expressions work well to parse import, using or include statements. For example to analyse plain C, scan for #include statements (as done in Schmidrules for Application Architecture. Unfortunately there is not much documentation available about Schmidrules, which I will fix in the future.)

Create the XML
The next task is to create the XML. I use a little Ruby script to serializes a graph of nodes into GraphML. The used Node class needs a name and a list of dependencies, which are Nodes again. save stores the graph as GraphML.

require 'rexml/document'

class GraphmlSerializer
  include REXML

  # Save the _graph_ to _filename_ .
  def save(filename, graph)
    File.open(filename + '.graphml', 'w') do |f|
      doc = graph_to_xml(graph)
      doc.write(out_string = '<?xml version="1.0" encoding="UTF-8" standalone="no"?>')
      f.print out_string
    end
  end

  # Return an XML document of the GraphML serialized _graph_ .
  def graph_to_xml(graph)
    doc = create_xml_doc
    container = add_graph_element(doc)
    graph.to_a.each { |node| add_node_as_xml(container, node) }
    doc
  end

  private

  def create_xml_doc
    REXML::Document.new
  end

  def add_graph_element(doc)
    root = doc.add_element('graphml',
      'xmlns' => 'http://graphml.graphdrawing.org/xmlns',
      'xmlns:y' => 'http://www.yworks.com/xml/graphml',
      'xmlns:xsi' => 'http://www.w3.org/2001/XMLSchema-instance',
      'xsi:schemaLocation' => 'http://graphml.graphdrawing.org/xmlns http://www.yworks.com/xml/schema/graphml/1.1/ygraphml.xsd')
    root.add_element('key', 'id' => 'n1',
                     'for' => 'node',
                     'yfiles.type' => 'nodegraphics')
    root.add_element('key', 'id' => 'e1',
                     'for' => 'edge',
                     'yfiles.type' => 'edgegraphics')

    root.add_element('graph', 'edgedefault' => 'directed')
  end

  public

  # Add the _node_ as XML to the _container_ .
  def add_node_as_xml(container, node)
    add_node_element(container, node)

    node.dependencies.each do |dep|
      add_edge_element(container, node, dep)
    end
  end

  private

  def add_node_element(container, node)
    elem = container.add_element('node', 'id' => node.name)
    shape_node = elem.add_element('data', 'key' => 'n1').
      add_element('y:ShapeNode')
    shape_node.
      add_element('y:NodeLabel').
      add_text(node.to_s)
  end

  def add_edge_element(container, node, dep)
    edge = container.add_element('edge')
    edge.add_attribute('id', node.name + '.' + dep.name)
    edge.add_attribute('source', node.name)
    edge.add_attribute('target', dep.name)
  end

end

Layout the diagram
I do not try to layout the graph when creating it. Existing tools do a much better job. I use the yEd Graph Editor. yEd is a Java application. Download and uncompress the zipped archive. To get the final diagram

I load the GraphML file into the editor. (If the graph is huge, yEd needs more memory. yEd is just an executable Jar - Java Archive - it can get more heap on startup using java -XX:+UseG1GC -Xmx800m -jar ./yed-3.16.2.1/yed.jar.)
Then I select all nodes and apply menu commands Tools/Fit Node to Label. This is because the size of the nodes does not match the size of the node's names.
Finally I apply the menu commands Layout/Hierarchical or maybe Layout/Organic/Smart. In the end it needs some trial and error to find the proper layout.

Conclusion
This approach is very flexible. The nodes of the graph can be anything, a "module" can be a file, a folder or a bunch of folders. In the example shown above, the nodes where NATURAL modules (aka files), i.e. functions, subroutines, programs or includes. In another example shipped with JavaClass the nodes are components, i.e. source folders similar to Maven multi module projects. The diagram below shows the components of a product family of large Eclipse RCP applications with their dependencies in a hierarchical layout. Pretty crazy, isn't it. ;-)

Components of an Eclipse RCP application in hierarchical layout, 2012

Components of an Eclipse RCP application in hierarchical layout, 2012

30 August 2019

Visualising Architecture: Neo4j vs. Module Dependencies

Last year I wrote about some work I did for a client using NATURAL (an application development and deployment environment using a proprietary language), for example using NATstyle, adding custom rules and creating custom reports. Today I will share some things I used to visualise the architecture. Usually I want to get the bigger picture of the architecture before I change it.

Dependencies are killing us
Let's start with some context: This is Banking with some serious legacy: Groups of "solutions" are bundled together as "domains". Each solution contains 5.000 to 10.000 modules (files), which are either top level applications (executable modules) or subroutines (internal modules). Some modules call code of other solutions. There are some system "libraries" which bundle commonly used modules similar to solutions. A recent cross check lists more than 160.000 calls crossing solution boundaries. Nobody knows which modules outside of one's own solution are calling in and changes are difficult because all APIs are potentially public. As usual - dependencies are killing us.

Graph Database
To get an idea what was going on, I wanted to visualize the dependencies. If the data could be converted and imported into standard tools, things would be easier. But there were way too many data points. I needed a database, a graph database, which should be able to deal with hundreds of thousand of nodes, i.e. the modules, and their edges, i.e. the directed dependencies (call or include).

Extract, Transform, Load (ETL)
While ETL is a concept from data warehousing, we exactly needed to "copy data from one or more sources into a destination system which represented the data differently from the source(s) or in a different context than the source(s)." The first step was to extract the relevant information, i.e. the call site modules, destination modules together with more architectural information like "solution" and "domain". I got this data as CSV from a system administrator. The data needed to be transformed into a format which could be easily loaded. Use your favourite scripting language or some sed&awk-fu. I used a little Ruby script,

ZipFile.new("CrossReference.zip").
  read('CrossReference.csv').
  split(/\n/).
  map { |line| line.chomp }.
  map { |csv_line| csv_line.split(/;\s*/, 9) }.
  map { |values| values[0..7] }. # drop irrelevant columns
  map { |values| values.join(',') }. # use default field terminator ,
  each { |line| puts line }

to uncompress the file, drop irrelevant data and replace the column separator.

Loading into Neo4j
Neo4j is a well known Graph Platform with a large community. I had never used it and this was the perfect excuse to start playing with it ;-) It took me around three hours to understand the basics and load the data into a prototype. It was easier than I thought. I followed Neo4j's Tutorial on Importing Relational Data. With some warning: I had no idea how to use Neo4j. Likely I used it wrongly and this is not a good example.

CREATE CONSTRAINT ON (m:Module) ASSERT m.name IS UNIQUE;
CREATE INDEX ON :Module(solution);
CREATE INDEX ON :Module(domain);

// left column
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///references.csv" AS row
MERGE (ms:Module {name:row.source_module})
ON CREATE SET ms.domain = row.source_domain, ms.solution = row.source_solution

// right column
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///references.csv" AS row
MERGE (mt:Module {name:row.target_module})
ON CREATE SET mt.domain = row.target_domain, mt.solution = row.target_solution

// relation
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///references.csv" AS row
MATCH (ms:Module {name:row.source_module})
MATCH (mt:Module {name:row.target_module})
MERGE (ms)-[r:CALLS]->(mt)
ON CREATE SET r.count = toInt(1);

This was my Cypher script. Cypher is Neo4j's declarative query language used for querying and updating of the graph. I loaded the list of references and created all source modules and then all target modules. Then I loaded the references again adding the relation between source and target modules. Sure, loading the huge CSV three times was wasteful, but it did the job. Remember, I had no idea what I was doing ;-)

Querying and Visualising
Cypher is a query language. For example, which modules had most cross solution dependencies? The query MATCH (mf:Module)-[c:CALLS]->() RETURN mf.name, count(distinct c) as outgoing ORDER BY outgoing DESC LIMIT 25 returned

YYI19N00 36
YXI19N00 36
YRWBAN01 34
XGHLEP10 34
YWI19N00 32
XCNBMK40 31

and so on. (By the way, I just loved the names. Modules in NATURAL can only be named using eight characters. Such fun, isn't it.) Now it got interesting. Usually visualisation is a main issue, with Neo4j it was a no-brainer. The Neo4J Browser comes out of the box, runs Cypher queries and displays the results neatly. For example, here are the 36 (external) dependencies of module YYI19N00:

Called by module YYI19N00

As I said, the whole thing was a prototype. It got me started. For an in-depth analysis I would need to traverse the graph interactively or scripted like a Jupyter notebook. In addition, there are several visual tools for Neo4J to make sense of and see how the data is connected - exactly what I would want to know about the dependencies.

15 August 2018

Creating your own NATstyle rules

Last month I showed how to use NATstyle from the command line. NATstyle is the utility to define and check the coding standard of your NATURAL program. Today I want to explain how to customize and create your own rules beyond what is explained in the manual. I used NaturalONE 8.3.5.0.242 CE (November 2016). Likely there are more options and rules available in newer versions of NaturalONE and NATstyle.

Basic Configuration
NATstyle comes packaged inside NaturalONE, the Eclipse-based IDE for NATURAL. As expected NATstyle can be configured in Eclipse preferences. The configuration is saved as NATstyle.xml which is used when you run NATstyle from the right click popup menu. We will need to modify NATstyle.xml later, so let's have a look at it:

<?xml version="1.0" encoding="utf-8"?>
<naturalStyleCheck version="1.0"
                   xmlns="http://softwareag.com/natstyle/rules"
                   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                   xsi:schemaLocation="http://softwareag.com/natstyle/rules checks.xsd">
  <checks type="source">
    <check class="CheckLineLength" name="Line length" severity="warning">
      <property name="max" value="72" />
      <property name="exclude" value="D3" />
    </check>
    <!-- more checks of type source -->
  </checks>
  <!-- more checks of other types -->
</naturalStyleCheck>

(In the source, the default configuration file is res/checks.xml together with its XML schema res/checks.xsd.)

Existing Rules
The existing rules are described in NaturalONE help topic Overview of NATstyle Rules, Error Messages and Solutions. Version 8.3 has 42 rules. These are only a few compared to PMD or SonarQube, which has more than 1000 rules available for Java. Here are some examples what NATstyle can do:

Source Checks: e.g. limit line length, find tab characters, find empty lines, limit the number of source lines and check a regular expressions for single source lines or whole source file.
Source Header Checks: e.g. force header or check file naming convention.
Parser Checks: e.g. find unused local variables, warn if local variable shadows view, find TODO comments, calculate Cyclomatic and NPath complexity, force NATdoc (documentation) tags and check function, subroutine and class names against regular expressions.
Error (Message File) Checks: e.g. check error messages file name.
Resource (File) Checks: e.g. check resource file name.
Library (Folder) Checks: e.g. library folder conventions, find special folders, force group folders and warn on missing NATdoc library documentation.

Same rule multiple times configured differently
Some rules like Source/Regular expression for single source lines only allow a single regular expression to be configured. Using alternation, e.g. a|b|c, in the expression is a way to overcome that, but the expression gets complicated quickly. Another way is to duplicate the <check> element in the NATstyle.xml configuration. Assume we do not only forbid PRINT statements, we also do not allow reduction to zero. (These rules do not make any sense, they are just here to explain the idea.) The relevant part of NATstyle.xml looks like

<checks type="source">
  <check class="CheckRegExLine"
         name="Regular expression for single source lines" ... >
    <property name="regex" value="PRINT '.*" />
  </check>
  <check class="CheckRegExLine"
         name="Regular expression for single source lines" ... >
    <property name="regex" value="REDUCE .* TO 0" />
  </check>
</checks>

While it is impossible to configure these rules in the NaturalONE preferences, it might be possible to run NATstyle with these modified settings. I did not verify that. I execute NATstyle from the command line passing in the configuration file name using the -c flag. (In the source, see the full configuration res/sameRuleTwice.xml and the script to to run the rules bin/natstyle_sameRuleTwice.bat from the command line.)

Defining your own rules
There is no documented way to create new rules for NATstyle. All rules' classes are defined inside the NATstyle plugin. The configuration XML contains a class attribute, which is a short name, e.g. CheckRegExLine. Its implementation is located in the package com.softwareag.naturalone.natural.natstyle.check.src.source where source is the group of the rules defined in the type attribute of the <checks> element. I experimented a lot and did not find a way to load rules from other packages than com.softwareag.naturalone.natural.natstyle. All rules must be defined inside this name space, which is possible.

Source Rules
While I cannot see the actual code of NATstyle rules, Java classes expose their public methods and parent class. I did see the names of the rule classes in the configuration and guessed and experimented with the API a lot. My experience with other static analysis tools, e.g. PMD and Pylint and the good method names of NATstyle code helped me doing so. A basic Source rule looks like that:

package com.softwareag.naturalone.natural.natstyle.check.src.source; // 1.

import com.softwareag.naturalone.natural.natstyle.NATstyleCheckerSourceImpl;
// other imports ...

public class FindFooSourceRule
  extends NATstyleCheckerSourceImpl { // 2.

  private Matcher name;

  @Override
  public void initParameterList() {
    name = Pattern.compile("FOO").matcher(""); // 3.
  }

  @Override
  public String run() { // 4.
    StringBuffer xmlOutput = new StringBuffer();

    String[] lines = this.getSourcelines(); // 5.
    for (int line = 0; line < lines.length; i++) {
      name.reset(lines[line]);
      if (name.find()) {
        setError(xmlOutput, line, "Message"); // 6.
      }
    }

    return xmlOutput.toString(); // 7.
  }
}

The marked lines are important:

Because it is a Source rule, it must be in exactly this package - see the paragraph above.
Source rules extend NATstyleCheckerSourceImpl which provides the lines of the NATURAL source file - see line 6. It has more methods, which have reasonable names, use the code completion.
You initialise parameters in initParameterList. I did not figure out how to make the rules configurable from the XML configuration, which will probably happen in here, too.
The run method is executed for each NATURAL file.
NATstyleCheckerSourceImpl provides the lines of the file in getSourcelines. You can iterate the lines and check them.
If there is a problem, call setError. Now setError is a bit weird, because it writes an XML element for the violation report XML (e.g. NATstyleResult.xml) into a StringBuffer.
In the end the return the XML String of all found violations.

Finally the rule is configured with

<checks type="source">
  <check class="FindFooSourceRule"
         name="Find FOO"
         severity="warning" />
</checks>

(In the example repository, there is a working Source rule src/com/softwareag/naturalone/natural/natstyle/check/src/source/FindInv02.java together with its configuration res/customSource.xml.)

Parser Rules
Now it is getting more interesting. There are 18 rules of this type, which is a good start, but we need moar! Parser rules look similar to Source rules:

package com.softwareag.naturalone.natural.natstyle.check.src.parser; // 1.

import com.softwareag.naturalone.natural.natstyle.NATstyleCheckerParserImpl;
// other imports ...

public class SomeParserRule
  extends NATstyleCheckerParserImpl { // 2.

  @Override
  public void initParameterList() {
  }

  @Override
  public String run() {
    StringBuffer xmlOutput = new StringBuffer();

    // create visitor
    getNaturalParser().getNaturalASTRoot().accept(visitor); // 3.
    // collect errors from visitor into xmlOutput

    return xmlOutput.toString();
  }
}

where

Like Source rules, Parser rules must be defined under the package ...natstyle.check.src.parser.
Parser rules extend NATstyleCheckerParserImpl.
The NATURAL parser traverses the AST of the NATURAL code. Similar to other tools, NATstyle uses a visitor, the INaturalASTVisitor. The visitor is called for each node in the AST tree. This is similar to PMD.

Using the Parser

The visitor must implement INaturalASTVisitor in package com.softwareag.naturalone.natural.parser.ast.internal. This interface defines 48 visit methods for the different sub types of INaturalASTNode, e.g. array indices, comments, operands, system function references like LOOP or TRIM, and so on. Still there are never enough node types as the AST does not convey much information about the code, most statements end up as INaturalASTTokenNode. For example the NATURAL lines

* print with leading blanks
PRINT 3X 'Hello'

which are a line comment and a print statement, result in the AST snippet

+ TOKEN: * print with leading blanks
+ TOKEN: PRINT
+ TOKEN: 3X
+ OPERAND
  + SIMPLE_CONSTANT_REFERENCE
    + TOKEN: 'Hello'

Now PRINT is a statement and could be recognised as one and 'Hello' is a string. This makes defining custom rules possible but pretty hard. To help me understand the AST I created a visitor which dumps the tree as XML file, similar to PMD's designer: src/com/softwareag/naturalone/natural/natstyle/check/src/parser/DumpAstAsXml.java.

Conclusion
With this information you should be able to get started defining your own NATstyle rules. There is always so much more we could and should check automatically.

3 July 2018

Using NATstyle from the commandline

Last year a new client contracted me to help them make their development process, their development environment and tooling "state of the art" (aka up do date). From the coding perspective that included - among other things - using a decent IDE, writing unit tests, having some Continuous Delivery pipeline in place and using static code analysis. I thought that it should not be too difficult, right? It was, but that is not the goal of this article. Today I want to describe the static analysis features and how to use it.

Earth Sapphire Progress Picture 30 july-2011

NATURAL
The code base was NATURAL. NATURAL is an application development and deployment environment using a proprietary language maintained by Software AG. It was created in 1979. André describes it as Cobol's ugly, brain-damaged, BABBLING IN ALL-CAPS - but regrettably still healthy and strong - cousin.. ;-) Well, it is a procedural language structured using modules, subroutines and functions. It feels as old as it is. For example, module names are limited to eight (8!) characters and there is no way to structure the modules further. An enterprise application might contain 5.000 to 10.000 modules. To me NATURAL feels much like early Pascal: units (modules), subroutines and functions together with the eight characters DOS name limit. I really liked Pascal back then and I do not feel that bad about NATURAL. Probably also because I do not have to use it on a daily basis.

NaturalONE and NATstyle
NaturalONE is the Eclipse-based tool for NATURAL by Software AG. It looks pretty good, similar to what you would expect from Eclipse. It contains the usual features like Outline, Search, Debug and something called NATstyle. Now NATstyle - which probably on purpose sounds similar to the well known Checkstyle - is an utility to define and check the coding standard in your programs. It is used inside NaturalONE, see a NATstyle Demo by Software AG. (For more information refer to the Eclipse/NaturalONE help topic Checking Natural Code with NATstyle.)

Invoking NATstyle from outside NaturalONE
I like static code analysis and consider it a mandatory component of every delivery pipeline. As I said above, I wanted to have some static code analysis and was wondering if I would be able to run NATstyle in the pipeline? I did not find any information on the topic and was forced to experiment. I used NaturalONE 8.3.5.0.242 CE (November 2016) and figured out several ways, which were hardly documented, and might be subject to change. Likely there are other options available in newer versions of NaturalONE and NATstyle.

Invoking NATstyle from command line
NATstyle is just an Eclipse plugin inside NaturalONE. In my edition it is the folder C:\Natural-CE\Designer\eclipse\plugins\com.softwareag.naturalone.natural.natstyle_8.3.5.0000-0242. With the proper class path it is possible to invoke NATstyle's main method, which prints help on its usage:

Usage: com.softwareag.naturalone.natural.natstyle.NATstyle [-options]

where options include:
  -projectpath <directory> Specify where to find the Natural project.
  -rootfolder              Library root folder support enabled.
  -c <file>                Specify the configuration file.
  -o <file>                Specify the output file.
  -sourcefiles <srcList>   Specify list of source files to be loaded.
                           separated with ;
  -libraries <libList>     Specify list of libraries to be loaded.
                           separated with ;
  -exclude <libList>       Specify a list of libraries to exclude.
                           separated with ;
  -p <directory>           Specify where to find additional packages.
  -help                    Display command line options and exit.

Make sure the following classes can be found in your class path:

com.softwareag.naturalone.natural.auxiliary
com.softwareag.naturalone.natural.common
com.softwareag.naturalone.natural.parser

Nice, that looks like the developers from Software AG prepared NATstyle to be called stand-alone in my pipeline. +1. All the necessary classes can be found in the following plugins of Eclipse/NaturalONE:

Folder com.softwareag.naturalone.natural.natstyle_8.3.5.0000-0242
Jar com.softwareag.naturalone.natural.auxiliary_*.jar
Folder com.softwareag.naturalone.natural.common_8.3.5.0000-0242
Jar com.softwareag.naturalone.natural.parser_*.jar
Jar org.eclipse.equinox.common_*.jar

These are all bundles defined by Eclipse/OSGi. The list of dependencies of these bundles also contains the following Jars

org.eclipse.swt.win32.*.jar
org.eclipse.ui.console_*.jar

which (in my experience) are not needed to use NATstyle. NATstyle does not use any other OSGi features and we can set the needed class path by hand. Here is an example how to do that in the Windows shell when NaturalONE is installed:

set N1=C:\Natural-CE\Designer\eclipse\plugins
set CLASSPATH=^
%N1%\com.softwareag.naturalone.natural.natstyle_8.3.5.0000-0242;^
%N1%\com.softwareag.naturalone.natural.auxiliary_8.3.5.0000-0242.jar;^
%N1%\com.softwareag.naturalone.natural.common_8.3.5.0000-0242;^
%N1%\com.softwareag.naturalone.natural.parser_8.3.5.0000-0242.jar;^
%N1%\org.eclipse.equinox.common_3.6.200.v20130402-1505.jar

java com.softwareag.naturalone.natural.natstyle.NATstyle %*

Download the source and see bin/run_natstyle for the script I used to run NATstyle directly. Copying the needed folders and Jars to another machine works well. I recommend to create a Jar from the folders before copying them around. bin/_create_jar shows how to do that. Make sure not to include META-INF\*.RSA or META-INF\*.SF into the new Jars because the checksums do not match. bin/copy_jars does the actual copying into the lib folder.

Invoking NATstyle from Ant
NaturalONE supports Ant for some automation tasks. There is no specific NATstyle Ant task, but calling it from Ant is straight forward. If the path <path id="natstyle.classpath"> is set up to contain all the Folders and Jars, NATstyle is executed via Ant's java task:

<java taskname="natstyle"
      classname="com.softwareag.naturalone.natural.natstyle.NATstyle"
      dir="${basedir}" fork="true" maxmemory="128m"
      failonerror="true"
      classpathref="natstyle.classpath">
  <arg value="-projectpath" />
  <arg value="../NatUnit/NatUnit_L4N" />
  ...
</java>

A new JVM needs to be forked because NATstyle calls System.exit() at the end. All parameters are the same as for the command-line invocation and are passed in via the <arg /> element. In the source, see ant/ant_NatStyle.xml for the complete Ant script.

Reporting
NATstyle writes an XML report of all files if worked and found rule violations. The structure is

<NATstyle>
  <file type='...' location='...'>
    <error severity='...' message='...' />
    ...
  </file>
  ...
</NATstyle>

Inside the folder of NATstyle (i.e. com.softwareag.naturalone.natural.natstyle_8.3.5.0000-0242) there is a subfolder NATstyle. It contains sample files. Using the provided NATstyleSimple.xsl the report XML (e.g. NATstyleResult.xml) can be transformed into a human-readable format (e.g. NATstyleSimple.html). When using NATstyle stand-alone, you might want to copy this folder as well. XSLT transformation task is part of Ant and converts the XML to HTML.

<xslt basedir="${basedir}"
      destdir="${basedir}/report"
      style="${basedir}/NATstyle/NATstyleSimple.xsl">
  <include name="NATstyleResult_*.xml" />
</xslt>

See ant/ant_NatStyleResultToHtml.xml for the complete Ant build script to convert the NATstyle result XML into a readable HTML report. Using your own stylesheet (.xsl) allows you to customize the report. See how to convert the report into more complex formats.

Next time I will show you how to write your own NATstyle rules.

18 February 2018

Checking Python for Compliance with Object Calisthenics

To use one of my Object Orientation workshops for a different client, I needed its whole setup in Python. (To summarise the workshop: I used Jeff Bay's Object Calisthenics as constraint and static code analysis to check the code for compliance. This helped participants follow the rules.)

I found that Pylint already contained some checkers I could use out of the box to check Python code:

checkers.refactoring with setting max-nested-blocks=1 for rule #1 (Use only one level of indentation per method.)
checkers.design_analysis with setting max-branches=1 also for rule #1, but stronger, which I like more.
checkers.design_analysis with setting max-attributes=2 for rule #7 (Don't use any classes with more than two instance variables.)

Diving deeper into Pylint's custom checkers, I created some of my own to enforce rules #2, #4, #6, #7, #8 and #9. For rule #4 (One Dot per Line) I used the approach of PHP_CodeSniffer and looked for nested expressions, allowing only one dot per statement, ignoring self references.

Dynamic Nature
Due to the dynamic nature of Python there is a grey area when working with types. For first-class collections (rule #8), only certain uses of collections are identified at the moment. I was not able to check for primitives at object boundaries at all, so rule #3 is not verified.

Figuring out how to write complex checkers for Pylint was difficult at first, but also a lot of fun. Like PMD and most static analysis tools I know, Pylint works with the Abstract Syntax Tree of the code and checkers are built using the Visitor design pattern. The main problem was figuring out which types of nodes exist, then the implementation of the rules was straight forward.

Python Project Setup
The setup is similar to the Java project. I created a repository for the LCD Numbers under lcd-numbers-object-calisthenics-python-setup. To test your setup there is some code in the project and ./run_pylint will show its violations. Pylint is configured using the objectcalisthenics.pylintrc file in the project directory. The individual checkers are in ./pylint/checkers.

11 January 2018

Compliance with Object Calisthenics

During my work as Code Cop I run many workshops. Sometimes I use constraints to make exercises more focused or more intense. Some constraints, like the Brutal Coding Constraints, are composite or aggregate constraints, which means that they are a combination of several simpler or low level constraints. (Here simple does not mean that they are easy to follow, rather that they focus on a single thing.) Today I want to discuss Object Calisthenics.

Object Calisthenics
Jeff Bay's Object Calisthenics is an aggregate constraint combined of the following nine rules:

Use only one level of indentation per method.
Don't use the else keyword.
Wrap all primitives and strings (in public API).
Use only one dot per line.
Don't abbreviate (long names).
Keep all entities small.
Don't use any classes with more than two instance variables.
Use first-class collections.
Don't use any getters/setters/properties.

If you are not familiar with these rules, I recommend you read Jeff Bay's original essay published in the The ThoughtWorks Anthology in 2008. William Durand's post is an exact copy of that essay, so no need to buy the book for that. Several people have followed up and discussed their interpretation and experience with Object Calisthenics, e.g. Mark Needham, Jeff Pace, Vasiliki Vockin and Juan Antonio.

Object Calisthenics is an exercise in object orientation. How is that? One of the core concepts of OOP is Abstraction: A class should capture one and only one key abstraction. Obviously primitive and built-in types lack abstraction. Wrapping primitives (rule #3) and wrapping collections (rule #8) drive the code towards more abstractions. Small entities (rule #6) help to keep our abstractions focused.

Further objects are defined by what they do, not what they contain. All data must be hidden within its class. This is Encapsulation. Rule #9, No Properties, forces you to stay away from accessing the fields from outside of the class.

Next to Abstraction and Encapsulation, these nine rules help Loose Coupling and High Cohesion. Loose Coupling is achieved by minimizing the number of messages sent between a class and its collaborator. Rule #4, One Dot Per Line, reduces the coupling introduced by a single line. This rule is misleading, because what Jeff really meant was the Law Of Demeter. The law is not about counting dots per line, it is about dependencies: "Only talk to your immediate friends." Sometimes even a single dot in a line will violate the law.

High Cohesion means that related data and behaviour should be in one place. This means that most of the methods defined on a class should use most of the data members most of the time. Rule #5, Don't Abbreviate, addresses this: When a name of a field or method gets long and we wish to shorten it, obviously the enclosing scope does not provide enough context for the name, which means that the element is not cohesive with the other elements of the class. We need another class to provide the missing context. Next to naming, small entities (rule #6) have a higher probability of being cohesive because there are less fields and methods. Limiting the number of instance variables (rule #7) also keeps cohesion high.

The remaining rules #1 and #2, One Level Of Indentation and No else aim to make the code simpler by avoiding nested code constructs. After all who wants to be a PHP Street Fighter. ;-)

Checking Code for Compliance with Object Calisthenics
When facilitating coding exercises with composite constraints, I noticed how easy it is to overlook certain violations. We are used to conditionals or dereferencing pointers that we might not notice them when reading code. Some rules like the Law Of Demeter or a maximum size of classes need a detailed inspection of the code to verify. To check Java code for compliance with Object Calisthenics I use PMD. PMD contains several rules we can use:

Rule java/coupling.xml/LawOfDemeter for rule #4.

Rule #6 can be checked with NcssTypeCount. A NCSS count of 30 is usually around 50 lines of code.

<rule ref="rulesets/java/codesize.xml/NcssTypeCount">
    <properties>
        <property name="minimum" value="30" />
    </properties>
</rule>

And there is TooManyFields for rule #7.

<rule ref="rulesets/java/codesize.xml/TooManyFields">
    <properties>
        <property name="maxfields" value="2" />
    </properties>
</rule>

I work a lot with PMD and have created custom rules in the past. I added rules for Object Calisthenics. At the moment, my Custom PMD Rules contain a rule set file object-calisthenics.xml with these rules:

java/constraints.xml/NoElseKeyword is very simple. All else keywords are flagged by the XPath expression //IfStatement[@Else='true'].
java/codecop.xml/FirstClassCollections looks for fields of known collection types and then checks the number of fields.
java/codecop.xml/OneLevelOfIntention
java/constraints.xml/NoGetterAndSetter needs a more elaborate XPath expression. It is checking MethodDeclarator and its inner Block/ BlockStatement/ Statement/ StatementExpression/ Expression/ PrimaryExpressions.
java/codecop.xml/PrimitiveObsession is implemented in code. It checks PMD's ASTConstructorDeclaration and ASTMethodDeclaration for primitive parameters and return types.

For the nitty-gritty details of all the rules have a look at the rules defined in codecop.xml and constraints.xml.

Interpretation of Rules: Indentation
When I read Jeff Bay's original essay, the rules were clear. At least I thought so. Verifying them automatically showed some areas where different interpretations are possible. Different people see Object Calisthenics in different ways. In comparison, the Object Calisthenics rules for PHP_CodeSniffer implement One Level Of Indentation by allowing a nesting of one. For example there can be conditionals and there can be loops, but no conditional inside of a loop. So the code is either formatted at method level or indented one level deep. My PMD rule is more strict: Either there is no indentation - no conditional, no loop - or everything is indented once: for example, if there is a loop, than the whole method body must be inside this loop. This does not allow more than one conditional or loop per method. My rule follows Jeff's idea that each method does exactly one thing. Of course, I like my strict version, while my friend Aki Salmi said that I went to far as it is more like Zero Level Of Indentation. Probably he is right and I will recreate this rule and keep the Zero Level Of Indentation for the (upcoming) Brutal version of Object Calisthenics. ;-)

Wrap All Primitives
There is no PHP_CodeSniffer rule for that, as Tomas Votruba considers it "too strict, vague or annoying". Indeed, this rule is very annoying if you use primitives all the way and your only data structure is an associative array or hash map. All containers like java.util.List, Set or Map are considered primitive as well. Samir Talwar said that every type that was not written by yourself is primitive because it is not from your domain. This prohibits the direct usage of Files and URLs to name a few, but let's not go there. (Read more about the issue of primitives in one of my older posts.)

My rule allows primitive values in constructors as well as getters to implement the classic Value Object pattern. (The rule's implementation is simplistic and it is possible to cheat by passing primitives to constructors. And the getters will be flagged by rule #9, so no use for them in Object Calisthenics anyway.)

I agree with Tomas that this rule is too strict, because there is no point in wrapping primitive payloads, e.g. strings that are only displayed to the user and not acted on by the system. These will be false positives. There are certain methods with primitives in their signatures like equals and hashCode that are required by Java. Further we might have plain numbers in our domain or we use indexing of some sort, both will be false positives, too.

One Dot Per Line
As I said before, I use PMD's LawOfDemeter to verify rule #4. The law allows sending messages to objects that are

the immediate parts of this or
the arguments of the current method or
objects created inside the current method or
objects in global variables.

I did not look at PMD's source code to check the implementation of this rule - but it complains a lot. For me this is the most difficult rule of all nine rules. (I code according to #1, #3, #5 and #6 and can easily adapt to strictly follow #2, #7, #8 and #9.) Although it complains a lot, I found every violation correct. I learned much about Law Of Demeter by checking my code for violations. For example, calling methods on an element of an array is a violation. The indexed array access is similar to a pointer access. (In Ruby this is obvious because Array defines a method def [](index).) Another interesting fact is that (at least in PMD) the law flags calling methods on enums. The enum instances are not created locally, so we cannot send them messages. On the other hand, an enum is a global variable, so maybe it should be allowed to call methods on it.

The PHP_CodeSniffer rule follows the rule's name and checks that there is only one dot per line. This creates better code, because train wrecks will be split into explaining variables which make debugging easier. Also Tomas is checking for known fluent interfaces. Fluent interfaces - by definition - look like they are violating the Law Of Demeter. As long as the fluent interface returns the same instance, as for example basic builders do, there is no violation. When following a more relaxed version of the law, the Class Version of Law Of Demeter, than different implementations of the same type are still possible. The Java Stream API, where many calls return a new Stream instance of a different class - or the same class with a different generic type - is likely to violate the law. It does not matter. Fluent interfaces are designed to improve readability of code. Law Of Demeter violations in fluent interfaces are false positives.

Don't Abbreviate
I found it difficult to check for abbreviations, so rule #5 is not enforced. I thought of implementing this rule using a dictionary, but that is prone to false positives as the dictionary cannot contain all terms from all domains we create software for. The PHP_CodeSniffer rules check for names shorter than three characters and allow certain exceptions like id. This is a good start but is not catching all abbreviations, especially as the need to abbreviate arises from long names. Another option would be to analyse the name for its camel case patterns, requiring all names to contain lowercase characters between the uppercase ones. This would flag acronyms like ID or URL but no real abbreviations like usr or loc.

Small Entities
Small is relative. Different people use different limits depending on programming language. Jeff Bay's 50 lines work well for Java. Rafael Dohms proposes to use 200 lines for PHP. PHP_CodeSniffer checks function length and number of methods per class, too. Fabian Schwarz-Fritz limits packages to ten classes. All these additional rules follow Jeff Bay's original idea and I will add them to the rule set in the future.

Two Instance Variables
Allowing only two instance variables seems arbitrary - why not have three or five. Some people have changed the rules to allow five fields. I do not see how the choice of language makes a difference. Two is the smallest number that allows composition of object trees.

In PHP_CodeSniffer there is no rule for this because the number depends on the "individual domain of each project". When an entity or value object consists of three or more equal parts, the rule will flag the code but there is no problem. For example, a class BoundingBox might contain four fields top, left, bottom, right. Depending on the values, introducing a new wrapper class Coordinate to reduce these fields to topLeft and bottomRight might make sense.

No Properties
My PMD rule finds methods that return an instance field (a getter) or update it (a setter). PHP_CodeSniffer checks for methods using the typical naming conventions. It further forbids the usage of public fields, which is a great idea. As we wrapped all primitives (rule #3) and we have no getters, we can never check their values. So how do we create state based tests? Mark Needham has discussed "whether we should implement equals and hashCode methods on objects just so that we can test their equality. My general feeling is that this is fine although it has been pointed out to me that doing this is actually adding production code just for a test and should be avoided unless we need to put the object into a HashMap or HashSet."

From what I have seen, most object oriented developers struggle with that constraint. Getters and setters are very ingrained. In fact some people have dropped that constraint from Object Calisthenics. There are several ways to live without accessors. Samir Talwar has written why avoiding Getters, Setters and Properties is such a powerful mind shift.

Java Project Setup
I created a repository containing the starting point for the LCD Numbers Kata:

lcd-numbers-object-calisthenics-java-setup.

Both are Apache Maven projects. The projects are set up to check the code using the Maven PMD Plugin on each test execution. Here is the relevant snippet from the pom.xml:

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-pmd-plugin</artifactId>
      <version>3.7</version>
      <configuration>
        <printFailingErrors>true</printFailingErrors>
        <linkXRef>false</linkXRef>
        <typeResolution>true</typeResolution>
        <targetJdk>1.8</targetJdk>
        <sourceEncoding>${encoding}</sourceEncoding>
        <includeTests>true</includeTests>
        <rulesets>
          <ruleset>/rulesets/java/object-calisthenics.xml</ruleset>
        </rulesets>
      </configuration>
      <executions>
        <execution>
          <phase>test</phase>
          <goals>
            <goal>check</goal>
          </goals>
        </execution>
      </executions>
      <dependencies>
        <dependency>
          <groupId>org.codecop</groupId>
          <artifactId>pmd-rules</artifactId>
          <version>1.2.3</version>
        </dependency>
      </dependencies>
    </plugin>
  </plugins>
</build>

You can add this snippet to any Maven project and enjoy Object Calisthenics. The Jar file of pmd-rules is available in my personal Maven repository.

To test your setup there is sample code in both projects and mvnw test will show two violations:

[INFO] PMD Failure: SampleClass.java:2 Rule:TooManyFields Priority:3 Too many fields.
[INFO] PMD Failure: SampleClass:9 Rule:NoElseKeyword Priority:3 No else keyword.

It is possible to check the rules alone with mvnw pmd:check. (Using the Maven Shell the time to run the checks is reduced by 50%.) There are two run_pmd scripts, one for Windows (.bat) and one for Linux (.sh).

Object Calisthenics Retrospective

Limitations of Checking Code
Obviously code analysis cannot find everything. On the other hand - as discussed earlier - some violations will be false positives, e.g. when using the Stream API. You can use // NOPMD comments and @SuppressWarnings("PMD") annotations to suppress false positives. I recommend using exact suppressions, e.g. @SuppressWarnings("PMD.TooManyFields") to skip violations because other violations at the same line will still be found. Use your good judgement. The goal of Object Calisthenics is to follow all nine rules, not to suppress them.

Learnings
Object Calisthenics is a great exercise. I used it all of my workshops on Object Oriented Programming and in several exercises I did myself. The verification of the rules helped me and the participants to follow the constraints and made the exercise more strict. (If people were stuck I sometimes recommended to ignore one or another PMD violations, at least for some time.) People liked it and had insights into object orientation: It is definitely a "different" and "challenging way to code". "It is good to have small classes. Now that I have many classes, I see more structure." You should give it a try, too. Jeff Bay even recommends to run an exercise or prototype of at least 1000 lines for at least 20 hours.

The question if Object Calisthenics is applicable to real working systems remains. While it is excellent for exercise, it might be too strict to be used in production. On the other hand, in his final note, Jeff Bay talks about a system of 100,000 lines of code written in this style, where the "people working on it feel that its development is so much less tiresome when embracing these rules".

3 December 2017

PMD Check and Report in same build

I am working together with senior developer and (coding) architect Elisabeth Blümelhuber to set up a full featured continuous delivery process for the team. The team's projects use Java and are built with Maven.

Using PMD for Static Code Analysis
After using Jenkins for some time to run the tests, package and deploy the products, it was time to make it even more useful: Add static code analysis. As a first step Elisabeth added a PMD report of a small set of important rules to the Maven parent of all projects. PMD creates a pmd.xml in the target folder which is picked up by Jenkins' PMD Plugin. Jenkins displays the found violations and tracks changes over time, showing a basic trend graph. (While SonarQube would be more powerful, we decided to stay with Jenkins because the team was already "listening" to it.)

Breaking the Build on Critical Violations
I like breaking the build on critical violations to ensure the developers' attention. It is vital, though, to achieve the acceptance of the team members when changing their development process. We thus started with a custom, minimal set of rules (in src/config/pmd_mandatory.xml) that would break the build. The smaller the initial rule set is the better. In the beginning of adding static code analysis to the build process, it is not about the code but getting the team aboard - we can always add more rules later. The first rule set might even contain a single rule, e.g. EmptyCatchBlock. Empty Catch blocks are a well known problem when analysing defects and usually developers agree with the severity of having them in the code and accept breaking the build for that. On the other hand, breaking the build on minor or formatting issues is not recommended in the beginning.

Here is the snippet of our pom.xml that breaks the build:

<build>
  ...
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-pmd-plugin</artifactId>
      <configuration>
        <failOnViolation>true</failOnViolation>
        <printFailingErrors>true</printFailingErrors>
        <rulesets>
          <ruleset>.../pmd_mandatory.xml</ruleset>
        </rulesets>
        ... other settings
      </configuration>
      <executions>
        <execution>
          <id>pmd-break</id>
          <phase>prepare-package</phase>
          <goals>
            <goal>check</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>

This is more or less taken directly from the PMD Plugin documentation. After running the tests, PMD checks the code.

Keeping a Report of Major Violations
We wanted to keep the report Elisabeth had established previously. We tried to add another <execution> element for that. As executions can have their own <configuration> we thought that this would work, but it did not. PMD just ignored the second configuration. (Maybe this is a general Maven issue. For example the Maven Failsafe Plugin is a copy of the Surefire plugin to allow both plugins to have different configurations.)

The PMD plugin offers a report for the Maven site which is configured independently. As a workaround for the above problem, we used the site report to check the rules listed in src/config/pmd_report.xml. The PMD report invocation created the needed target/pmd.xml as well as a readable target/site/pmd.html.

<reporting>
  <plugins>
    ... other plugins
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-pmd-plugin</artifactId>
      <configuration>
        <rulesets>
          <ruleset>.../pmd_report.xml</ruleset>
        </rulesets>
        ... other settings
      </configuration>
    </plugin>
  </plugins>
</reporting>

Skipping Maven Standard Reports
Unfortunately mvn site also created other reports which we did not need and which slowed down the build. Maven standard reports can be selected using the Maven Project Info Reports Plugin. It is possible to set its <reportSet> empty, not creating any reports:

<reporting>
  <plugins>
    ... other plugins
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-project-info-reports-plugin</artifactId>
      <version>2.9</version>
      <reportSets>
        <reportSet>
          <reports>
            <!-- empty - no reports -->
          </reports>
        </reportSet>
      </reportSets>
    </plugin>
  </plugins>
</reporting>

Now it did not create the standard reports. It only generated target/site/project-reports.html with a link to the pmd.html and no other HTML reports. Win.

Skipping CPD Report
By default, the PMD plugin invokes PMD and CPD. CPD is checking for duplicate code - and is very useful - but we did not want to use it right now. As I said before, we wanted to start small. All plugins have goals which are explained in the documentation. Obviously the Maven report invokes PMD plugin's goals pmd:pmd and pmd:cpd. How do we tell a report which goals to invoke? That was the hardest problem to solve because we could not find any documentation on that. It turned out that each reporting plugin can be configured with <reportSets> similar to the Maven Project Info Reports Plugin:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-pmd-plugin</artifactId>
  <configuration>
    ... same as above
  </configuration>
  <reportSets>
    <reportSet>
      <reports>
        <report>pmd</report>
      </reports>
    </reportSet>
  </reportSets>
</plugin>

Putting Everything Together
We execute the build with

mvn clean verify site

If there is a violation of the mandatory rules, the build breaks and Maven stops. Otherwise site generates the PMD report. If there are no violations at all, Maven does not create a pmd.html. There is always a pmd.xml, so Jenkins is always happy.

(The complete project (compressed as zip) is here.)

2 October 2014

Visualising Architecture: Maven Dependency Graph

Last year, during my CodeCopTour, I pair programmed with Alexander Dinauer. During a break he mentioned that he just created a Maven Module Dependency Graph Creator. How awesome is that. The Graph Creator is a Gradle project that allows you to visually analyse the dependencies of a multi-module Maven project. First it sounds a bit weird - a Gradle project to analyse Maven modules. It generates a graph in Graphviz DOT language and an html report using Viz.js. You can use the dot tool which comes with Graphviz to transform the resulting graph file into JPG, SVG etc.

In a regular project there is not much use for the graph creator tool, but in super large projects there definitely is. I like to use it when I run code reviews for my clients, which involve codebases around 1 million lines of code or more. Such projects can contain up to two hundred Maven modules. Maven prohibits cyclic dependencies, still projects of such size often show crazy internal structure. Here is a (down scaled) example from one of my clients:

Maven dependencies in large project

Looks pretty wild, doesn't it? The graph is the first step in my analysis. I need to load it into a graph editor and cluster modules into their architectural components. Then outliers and architectural anomalies are more explicit.

12 March 2013

Complexity Slope

GeeCON is a great conference with many good presentations. Last year, a talk by Keith Braithwaite on Measuring the Effect of TDD on Design particularly piqued my interest. He talked about Cyclomatic Complexity and that code with tests is more strongly biased towards simpler methods than code without tests. Keith started this interesting research in 2006 and you can read everything about it in a series of blog posts about complexity and test-first. Unfortunately there is no recording of the session Keith gave at GeeCON, but he gave a similar talk at QCon 2008. I will not repeat his research here, so you should read the series of posts or at least watch the recording to continue.

Measure It!
Keith wrote a tool called Measure to harvest the distribution of Cyclomatic Complexity throughout a code base. It is based on Checkstyle and analyses Java source code. It worked out of the box, so I gave it a try.

Some Numbers
My current project, a large RCP application, has a complexity slope as bad as 1,75. It has really bad code, there are many large and complex methods and no tests at all. (There were no tests when I joined the project last year. Now with me on the project, the number of tests is growing slowly but steadily ;-) I am wondering if the used framework has an impact in the complexity slope as well. RCP is known for its complex dependency structure by its overuse of the singleton patten.

Another project that I worked on some years ago was a large web application and it was probably as bad as the current one. After five years of heavy refactoring and retrofitting with unit tests up to 60% code coverage, it scored almost two (1,99). It seems to be a special case. The application was never test-first and as such should score below two. The refactoring activities did not target complexity but were guided by simple metrics and layering rules.

Usage
If you want to see the complexity slope of your projects download measure-0.3.1.zip and unpack it. It contains everything you need to run Measure. There is a bash for Linux and a I added a batch for Windows. If you run one of them they will print a short help message. I ran it against my little BDD testing framework and it showed me a nice score of almost three (2,94). This was expected as I had used strict TDD while developing it. After my latest refactoring where I reduced the complexity of the story parser considerably, it even scored beyond three (3,12).

Complexity Slope of BaDaDam Testing-Framework

What about Ruby?
Keith's Measure supported Java by using Checkstyle under its hood. In fact it did little more than parse the Checkstyle complexity report with a threshold of zero, so reporting all complexity values. I used Saikuro, a Ruby Cyclomatic Complexity Analyser to implement the same for Ruby. First I created a Saikuro Rake task and set Saikuro's complexity threshold to zero,

state_filter = Filter.new(0)
  state_formater = Saikuro::StateHTMLComplexityFormater.new(STDOUT, state_filter)
  ...
  idx_states, idx_tokens = Saikuro.analyze(@files, state_formater, nil, @output_dir)
  write_cyclo_index(idx_states, @output_dir)

Then I ran Saikuro to generate the complexity report into some folder,

rake -f ruby\saikuro_task.rb dir=some_folder

added a Saikuro parser to Measure,

measure -k Saikuro -l some_folder

and wrote some glue code to bring all these things together,

measure_ruby -t JavaClass -d E:\Develop\Ruby\JavaClass\lib

Note that the scripts are only available for Windows, but the Rake task and Java code work on all platforms. Also note that Saikuro messes up RDOC, so both cannot be in the same Rake file.

19 December 2011

JavaClass 0.4 released

Much has happened since I created JavaClass almost three years ago. I started with version 0.0.1 because much was missing in there. Since then I managed a small release once a year. Version two added class names and references. Last year I added classpath abstractions and moved the project from Rubyforge to Google Code (and then in the future to Bitbucket and eventually to GitHub). This year I finished version 0.0.4. Now the Java class file parser supports everything except fields and methods. Adding them would be no big deal, I just had no need to do it till now. This fourth release feels more complete and should rather be named version 0.4 than 0.0.4. So the next version will be 0.5.

New in JavaClass 0.4
I added support for Java 5 Enums and Annotations to JavaClass::ClassFile::JavaClassHeader and write code to retrieve all interfaces implemented by a class. Further I created Maven- and Eclipse-aware classpath abstractions. If you point JavaClass to the folder of a whole Eclipse workspace, you get a JavaClass::Classpath::CompositeClasspath containing all projects found in that workspace. Finally I fixed some defects, e.g. a character to byte conversion problem found in Ruby 1.9 and a printf inconsistency between Windows and Linux platforms. See the history of JavaClass 0.4 for the complete list of changes.

RDOC documentation was improved and several examples have been added. In fact these examples are real world scripts that I use to analyse code bases. They are just formatted nicely and commented in very detail.

Find Unused Classes
For example let us look for unused classes. By "unused" I mean classes that have no incoming reference inside a code base. Such classes might still be referenced by Java Reflection, but more likely are just dead code. I have seen legacy code bases where more than 10% of all classes were unused and their deletion was a welcomed reduction of the maintenance burden. A list of classes that can (potentially) be deleted is a valuable asset for the next clean-up cycle. Still, each class on the list needs to be double checked because classes with main methods, Servlets and other framework classes might show up on the list but still be used.

require 'javaclass/dsl/mixin'
require 'javaclass/classpath/tracking_classpath'

location = 'C:\Eclipse\workspace'
package = 'com.biz.app'

# 1) create the (tracking) composite classpath of the given workspace.
cp = workspace(location)

# Create a filter to limit all operations to the classes of our application.
filter = Proc.new { |clazz| clazz.same_or_subpackage_of?(package) }

# 2) load all classes of the application, this can take several minutes.
classes = cp.values(&filter)

# 3) for all classes, mark all referenced types as accessed.
cp.reset_access
classes.map { |clazz| clazz.imported_types }.flatten.
  find_all(&filter).
  each { |classname| cp.mark_accessed(classname) }

# 4) also mark classes referenced from config files.
scan_config_for_3rd_party_class_names(location).
  find_all(&filter).
  each { |classname| cp.mark_accessed(classname) }

# 5) find non accessed classes.
unused_classes = classes.find_all { |clazz| cp.accessed(clazz) == 0 }

The above example uses EclipseClasspath and TrackingClasspath under the hood. TrackingClasspath adds functionality to record the number of usages of each class. The classes that are not used by any imports (step 3) or any hard coded configurations (step 4) are potentially unused.

Java Class Literals in Ruby
Inspired by date literals in Ruby I thought about Java class name literals in Ruby. This feels a bit crazy so it is not enabled by default. After including JavaClass::Dsl::JavaNameFactory you can write

clazz = java.lang.String    # => "java.lang.String"

Note the absence of any quotes around the right hand side of the above assignment. The returned class name is not just a plain String, but also a JavaQualifiedName with some extra methods to work on Java names, like

clazz.to_java_file          # => "java/lang/String.java"
clazz.full_name             # => java.lang.String"
clazz.package               # => "java.lang"
clazz.simple_name           # => "String"

Adding these rich strings caused me a lot of trouble, but that is another story.

15 November 2011

Visualising Architecture: Eclipse Plug-in Dependencies

In my job at the Blue Company I am working on a product family of large Eclipse RCP applications. Each application consists of a number of OSGi plug-ins which are the main building block of Eclipse RCP. There are dedicated as well as shared plug-ins. I am new to the project and struggle to understand which applications depend on which plug-ins. I would like to see a component diagram, showing the dependencies between modules, i.e. on OSGi level.

There is an Eclipse's Dependency Visualization tool inside the Eclipse Incubator project: Plug-in org.eclipse.pde.visualization.dependency_1.0.0.20110531 which displays dependencies of plug-ins. This is exactly what I need. Here is the visualization of one of the products:

(The yellow and green bounding boxes were added manually and indicate parts of the architecture.) The tool's main use is to identify unresolved plug-ins, a common problem in large applications. While you find unresolved plug-ins by running the application, the tool helps even before the application is launched. For more information see this article on developerWorks about the use of the Dependency Visualization. The latest version of the plug-in, including its download for Eclipse 3.5.x, is available at the testdrivenguy's blog.

28 December 2010

Architecture Rules

Unfortunately checking Java code for compliance with a given software architecture is not done on every project, as far as I know only a few experts do it. This is not good because "if it's not checked, it's not there". Free tool support for enforcing architectural rules has been lacking since long but recently things have changed. Well, maybe not recently but recently I thought about it :-)

Macker
There have always been a few basic tools to check the references of classes. One of these tools is Macker. It's quite old and not actively maintained any more, but I've been using it for years and it worked great for me. It's small and simple and yet immensely powerful. I love it and use it whenever possible. Everybody should use it. And don't turn it off!

A Shameless Plug
Macker and some other tools that can be used to enforce architecture rules are described in the fourth part of my 'Code Cop' series, published in the German magazine iX last summer: Automatisierte Architektur-Reviews (Automated Architecture Reviews) (iX 6/2010).

Architecture Reviews with Ant
The article describes several free tools (PMD, Macker and Classycle) and how to use them with Ant to verify different aspects of an architecture. Each of them has its advantages and disadvantages and when used together they are useful. The examples of the Ant integration and the PMD/Macker/Classycle architecture rules as given in the article should help you getting started with your own checks.

Macker and Maven
As I said above, Macker comes with proper Ant support, but Maven integration has been lacking. Last year I wanted to use Macker on an Maven project but was disappointed by the immaturity of the MackerMavenPlugin available on Codehaus. So I had to enhance it a bit. I submitted some patches which got accepted but still the MOJO wouldn't get promoted out of its sandbox state. Fortunately I keep an unofficial release (0.9) in my own Maven repository. So finally proper Maven integration of Macker is available.

Other Tools
I guess there are other tools available to be used with Maven. For example there is Architecture Rules with its maven-architecture-rules-plugin which uses JDepend under the hood. It looks promising but I haven't used it in production yet. And I hear that Sonar 2.4 is able to check architecture rules as well but I didn't try it till now.

(List of all my publications with abstracts.)

7 May 2010

Custom PMD Rules

Last month I submitted my fourth article from the 'Code Cop' series. After Daily Build, Daily Code Analysis and Automated Testing I wrote about checking architecture rules. One approach described there used PMD. In case you don't know what PMD is, it's a static code analysis tool for Java. It was first developed in 2002 and is updated regularly. It checks the abstract syntax tree (AST) of Java code. (PMD is a great tool. You should definitely use it!)

PMD vs. AST (I just love TLAs)
PMD comes with many predefined rules. Some of them are real life-savers, e.g. EmptyCatchBlock. It's possible to define your own rules. The easiest way to do this is to use XPath expressions to match patterns in the AST. For example, the following trivial class

package org.codecop.myapp.db;

import java.sql.Connection;

public class DBSearch {
   ...
}

is represented inside PMD by the following AST

+ PackageDeclaration
|    + Name:org.codecop.myapp.db
+ ImportDeclaration
|    + Name:java.sql.Connection
+ TypeDeclaration
     + ClassOrInterfaceDeclaration:DBSearch
          + ClassOrInterfaceBody
               + ...

An XPath expression to find imports from the java.sql package looks like

//ImportDeclaration[starts-with(Name/@Image, 'java.sql.')]

It is quite simple (at least when you are familiar with XPath ;-) For further details read the PMD XPath Rule Tutorial.

I've always been enthusiastic about static code analysis and PMD in particular and have been using it successfully since 2004. (For example see my very first presentation on static code analysis with PMD.) I love it; It's a great tool. I have created several custom rules to enforce consistent coding conventions and to prevent bad coding practices. Today I am going to share some of these rules with you in this post.

First Rule
One common bug is public boolean equals(MyClass o) instead of public boolean equals(Object o) which shows that the developer is trying (and failing) to override the equals(Object) method of Object in MyClass. It will work correctly when invoked directly with a MyClass instance but will fail otherwise, e.g. when used inside collections. Such suspiciously close (but still different) equals methods are matched by

//ClassDeclaration//MethodDeclarator
  [@Image = 'equals']
  [count(FormalParameters/*)=1]
  [not ( FormalParameters//Type/Name[
           @Image='Object' or @Image='java.lang.Object' ] ) ]

This rule explained in plain English matches all the method declarations that are named equals, and have one parameter which type is neither Object nor java.lang.Object. (Object is mentioned twice because PMD analyses the source and therefore can't know about simple and full qualified class names.) Later, Tom included this SuspiciousEqualsMethodName rule into PMD's default set of rules.

Copy and Paste
The easiest way to define your own rule is to take an existing one and tweak it. Like SuspiciousEqualsMethodName was derived from SuspiciousHashcodeMethodName, the next JumbledIterator is quite similar to JumbledIncrementer. (JumbledIterator was created by Richard Beitelmair, one of my colleagues who appointed me "Code Cop" and presented me with my first Code Cop T-shirt.) So what's wrong with the following line?

for (Iterator it1 = iterator(); it2.hasNext(); ) { ... }

Most likely it2 should be it1, shouldn't it. Richard created the following rule to pick up on these kinds of errors:

//ForStatement[
  ( ForInit//ClassOrInterfaceType/@Image='Iterator' or
    ForInit//ClassOrInterfaceType/@Image='Enumeration'
  ) and
  ( ends-with(Expression//Name/@Image, '.hasNext') or
    ends-with(Expression//Name/@Image, '.hasMoreElements')
  ) and not (
    starts-with(Expression//Name/@Image,
      concat(ForInit//VariableDeclaratorId/@Image, '.'))
  )
]

A Real Environment
After the last two sections we are warmed up and finally ready for some real stuff. On several occasions I have met developers who wondered why their code Long.getLong(stringContainingANumber) would not work. Well it worked, but it did not parse the String as they expected. This is because the Long.getLong() is a shortcut to access System.getProperty(). What they really wanted was Long.parseLong(). Here is the UnintendedEnvUsage rule:

//PrimaryExpression/PrimaryPrefix/Name[
  @Image='Boolean.getBoolean' or
  @Image='Integer.getInteger' or
  @Image='Long.getLong'
]

Care for Your Tests
PMD provides several rules to check JUnit tests. One of my favourite rules is JUnitTestsShouldIncludeAssert which avoids (the common) tests that do not assert anything. (Such tests just make sure that no Exception is thrown during their execution. This is fair enough but why bother to write them and not add some assert statements to make sure the code is behaving correctly.) Unfortunately, one "quick fix" for that problem is to add assertTrue(true). Rule UnnecessaryBooleanAssertion will protect your tests from such abominations.

A mistake I keep finding in JUnit 3.x tests is not calling super in test fixtures. The framework methods setUp() and tearDown() of class TestCase must always call super.setUp() and super.tearDown(). This is similar to constructor chaining to enable the proper preparation and cleaning up of resources. Whereas Findbugs defines a rule for that, PMD does not. So here is a simple XPath for JunitSetupDoesNotCallSuper rule:

//MethodDeclarator[
  ( @Image='setUp' and count(FormalParameters/*)=0 and
    count(../Block//PrimaryPrefix[@Image='setUp'])=0
  ) or
  ( @Image='tearDown' and count(FormalParameters/*)=0 and
    count(../Block//PrimaryPrefix[@Image='tearDown'])=0
  )
]

Obviously this expression catches more than is necessary: If you have a setUp() method outside of test cases this will also be flagged. Also it does not check the order of invocations, i.e. super.setUp() must be the first and super.tearDown() the last statement. For Spring's AbstractSingleSpringContextTests the methods onSetUp() and onTearDown() would have to be checked instead. So it's far from perfect, but it has still found a lot of bugs for me.

That's all for now. There are more rules that I could share with you, but this blog entry is already far too long. I've set up a repository pmd-rules containing the source code of the rules described here (and many more).

27 July 2008

Viewing Dependencies with Eclipse

Dependencies between classes are somehow double faced. Classes need other classes to do their work because code has to be broken down into small chunks to be easier to understand. On the other hand, growing quadratic with the number of classes (n nodes can define up to n*(n-1) directed edges in a graph), they can be a major source of code complexity: Tight coupling, Feature Envy and cyclic dependencies are only some of the problems caused by relations between classes "going wild".

I try to avoid (too many) usages of other classes in my source. But it's not that easy. Consider a class in some legacy code base that shows signs of Feature Envy for classes in another package. Should I move it there? To be sure I need to know all classes my class is dependent upon? I mean all classes, so checking imports does not help. Even if I forbid wildcard imports (what I do), that are still not all classes. Classes in the same package are rarely shown by tools. Even my favourite Code Analysis Plugin (CAP) for Eclipse does not show them.

So here is my little, crappy, amateur Java Dependency View Plugin for Eclipse. Unpack it and put the jar into Eclipse's plugins directory. Restart Eclipse and open the Dependency View. open Dependency View in Eclipse with Open View

open Dependency View in Eclipse with Open View

(The small blue icon is my old coding-logo, a bit skewed.) In the view you'll see all kinds of outgoing references: inner classes, classes in the same package, other classes, generic types and sometimes - a bug in the view.

Dependency View in Eclipse showing all outgoing references of java.util.ArrayList

The plugin was developed with Eclipse 3.2 and tested in 3.3. Source code is included in the jar.

A nice feature, that I was not able to implement would have been reverse search, i.e. a list of all classes depending on the current selected one. If you happen to know how to to that, i.e. how to search for all classes that refer the current class, drop me a note.