How to add a new language module with CPD support.

Adding support for a CPD language

CPD works generically on the tokens produced by a Tokenizer. To add support for a new language, the crucial piece is writing a tokenizer that splits the source file into the tokens specific to your language. Thankfully, you can use a stock Antlr or JavaCC grammar to generate a lexer for you. If you cannot use a lexer generator, for instance because you are wrapping a lexer from another library, it is still relatively easy to implement the Tokenizer interface directly.
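
To illustrate why the tokenizer is the crucial piece, here is a minimal, self-contained sketch of token-based duplicate detection. It is not CPD's actual algorithm and does not use any PMD API: the class name, the regex-based toy lexer, and the `longestCommonRun` helper are all assumptions made for illustration. It only shows that, once sources are reduced to tokens, differences in whitespace and layout no longer hide duplicated code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; CPD's real matching is more elaborate.
public class TokenDuplicationSketch {

    // Toy lexer rule: an identifier, an integer literal, or any single non-space character.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z_][A-Za-z_0-9]*|[0-9]+|\\S");

    /** Splits source text into token images; whitespace is discarded. */
    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    /** Length of the longest run of consecutive equal tokens shared by a and b. */
    static int longestCommonRun(List<String> a, List<String> b) {
        int best = 0;
        int[] prev = new int[b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            int[] cur = new int[b.size() + 1];
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    // Extend the run that ended at the previous token pair.
                    cur[j] = prev[j - 1] + 1;
                    best = Math.max(best, cur[j]);
                }
            }
            prev = cur;
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> first = tokenize("x = a + b;\ny = x * 2;");
        List<String> second = tokenize("x=a+b;   y=x*2;   z = 0;");
        // Both snippets share the same 12 leading tokens despite different spacing.
        System.out.println(longestCommonRun(first, second)); // prints 12
    }
}
```

A real tokenizer for your language plays the role of `tokenize` here; the comparison of token sequences is handled by CPD itself.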

Use the following guide to set up a new language module that supports CPD.

  1. Create a new Maven module for your language. You can take the Golang module as an example.
    • Make sure to add your new module to the parent pom as a <module> entry, so that it is built alongside the other languages.
    • Also add your new module to the dependencies list in “pmd-languages-deps/pom.xml”, so that the new language is automatically available in the binary distribution (pmd-dist).
  2. Implement a Tokenizer.
    • For Antlr grammars, you can take the grammar from antlr/grammars-v4 and place it in a subdirectory of src/main/antlr4 that matches the package name of the language. You then need to call the appropriate ant wrapper to generate the lexer from the grammar. To do so, edit pom.xml (e.g. like the Golang module does). Once that is done, mvn generate-sources should generate the lexer sources for you.

      You can now implement a tokenizer, for instance by extending AntlrTokenizer. The following reproduces the Go implementation:

      ```java
      // mind the package convention if you are going to make a PR
      package net.sourceforge.pmd.lang.go.cpd;

      public class GoTokenizer extends AntlrTokenizer {

          @Override
          protected Lexer getLexerForSource(CharStream charStream) {
              return new GolangLexer(charStream);
          }
      }
      ```
    
    • For JavaCC grammars, place your grammar in etc/grammar and edit the pom.xml like the Python implementation does. You can then subclass JavaCCTokenizer instead of AntlrTokenizer.
    • For any other scenario just implement the interface however you can. Look at the Scala or Apex module for existing implementations.
  3. Create a Language implementation, and make it implement CpdCapableLanguage. If your language only supports CPD, then you can subclass CpdOnlyLanguageModuleBase to get going:

     // mind the package convention if you are going to make a PR
     package net.sourceforge.pmd.lang.go;
    
     public class GoLanguageModule extends CpdOnlyLanguageModuleBase {
            
         // A public noarg constructor is required.
         public GoLanguageModule() {
             super(LanguageMetadata.withId("go").name("Go").extensions("go"));
         }
    
         @Override
         public Tokenizer createCpdTokenizer(LanguagePropertyBundle bundle) {
             // This method should return an instance of the tokenizer you created.
             return new GoTokenizer();
         }
     } 
    

    To make PMD find the language module at runtime, write the fully-qualified name of your language class into the file src/main/resources/META-INF/services/net.sourceforge.pmd.lang.Language.

    At this point, the new language module should be available in CPD and usable like any other language.

  4. Update the test that asserts the list of supported languages by updating the SUPPORTED_LANGUAGES constant in BinaryDistributionIT.

  5. Add some tests for your tokenizer by following the section below.

  6. Finish up your new language module by adding a page to the documentation. Create a new markdown file <langId>.md in docs/pages/pmd/languages/. This file should have the following frontmatter:

    ---
    title: <Language Name>
    permalink: pmd_languages_<langId>.html
    last_updated: <Month> <Year> (<PMD Version>)
    tags: [languages, CpdCapableLanguage]
    ---
    

    On this page, language specifics can be documented, e.g. when the language was first supported by PMD. There is also the following Jekyll include, which creates a summary box for the language:

       
    {% include language_info.html name='<Language Name>' id='<langId>' implementation='<langId>::lang.<langId>.<langId>LanguageModule' supports_cpd=true %}
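
As a sketch of the two pom.xml changes from step 1: the artifact ID pmd-mylang is a hypothetical placeholder for your module, the existing entries of both files are omitted, and the explicit version element is an assumption that may be unnecessary if the version is managed elsewhere in the build.

```xml
<!-- parent pom.xml: build the module alongside the other languages -->
<modules>
    <module>pmd-mylang</module>
</modules>

<!-- pmd-languages-deps/pom.xml: ship the module with pmd-dist -->
<dependencies>
    <dependency>
        <groupId>net.sourceforge.pmd</groupId>
        <artifactId>pmd-mylang</artifactId>
        <version>${project.version}</version>
    </dependency>
</dependencies>
```

Similarly, for the Go module shown in step 3, the service file src/main/resources/META-INF/services/net.sourceforge.pmd.lang.Language would contain the single line:

```
net.sourceforge.pmd.lang.go.GoLanguageModule
```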
       
    

Declaring tokenizer options

To make the tokenizer configurable, first define some property descriptors using PropertyFactory. Look at Tokenizer for some predefined ones which you can reuse (prefer reusing property descriptors if you can). You need to override newPropertyBundle and call definePropertyDescriptor to register the descriptors. After that you can access the values of the properties from the parameter of createCpdTokenizer.

To implement simple token filtering, you can use BaseTokenFilter as a base class, or another base class in net.sourceforge.pmd.cpd.impl. Take a look at the Kotlin token filter implementation, or the Java one.
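
The idea behind token filtering can be sketched in plain Java. The example below does not use BaseTokenFilter or any other PMD class, and its class and method names are hypothetical; it only shows the kind of logic a filter typically carries, here dropping every `import ... ;` statement from a token stream so that import lists are not reported as duplicates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; a real filter would build on
// BaseTokenFilter or another class in net.sourceforge.pmd.cpd.impl.
public class ImportSkippingFilterSketch {

    /** Returns the tokens with every `import ... ;` statement removed. */
    static List<String> dropImports(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        boolean inImport = false;
        for (String token : tokens) {
            if (inImport) {
                // Discard tokens up to and including the statement terminator.
                if (token.equals(";")) {
                    inImport = false;
                }
            } else if (token.equals("import")) {
                inImport = true;
            } else {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens =
                List.of("import", "a", ".", "B", ";", "class", "C", "{", "}");
        System.out.println(dropImports(tokens)); // prints [class, C, {, }]
    }
}
```

The state kept between tokens (here a single boolean) is what distinguishes a filter from a simple per-token predicate.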

Testing your implementation

Add a Maven dependency on pmd-lang-test (scope test) in your pom.xml. This contains utilities to test your tokenizer.
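
A minimal sketch of that dependency entry follows. It assumes the version of pmd-lang-test is inherited from the parent pom's dependency management; if it is not, an explicit version element would be needed.

```xml
<dependency>
    <groupId>net.sourceforge.pmd</groupId>
    <artifactId>pmd-lang-test</artifactId>
    <!-- only needed on the test classpath -->
    <scope>test</scope>
</dependency>
```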

Create a test class extending CpdTextComparisonTest. To add tests, write regular JUnit @Test-annotated methods that call the method doTest with the base name of the test file (without its extension).

For example, for the Dart language:

package net.sourceforge.pmd.lang.dart.cpd;

import org.junit.jupiter.api.Test;

import net.sourceforge.pmd.cpd.test.CpdTextComparisonTest;

public class DartTokenizerTest extends CpdTextComparisonTest {

    /**********************************
      Implementation of the superclass
    ***********************************/


    public DartTokenizerTest() {
        super("dart", ".dart"); // the ID of the language, then the file extension used by test files
    }

    @Override
    protected String getResourcePrefix() {
        // "testdata" is the default value, you don't need to override.
        // This specifies that you should place the test files in
        // src/test/resources/net/sourceforge/pmd/lang/dart/cpd/testdata
        return "testdata";
    }

    /**************
      Test methods
    ***************/


    @Test  // don't forget the JUnit annotation
    public void testLiterals() {
        // This will look for a file named literals.dart
        // in the directory identified by getResourcePrefix,
        // tokenize it, then compare the result against a baseline
        // literals.txt file in the same directory

        // If the baseline file does not exist, it is created automatically
        doTest("literals");
    }

}