Adding support for a CPD language
CPD works generically on the tokens produced by a Tokenizer.
To add support for a new language, the crucial piece is writing a tokenizer that
splits the source file into the tokens specific to your language. Thankfully you
can use a stock Antlr grammar or JavaCC
grammar to generate a lexer for you. If you cannot use a lexer generator, for
instance because you are wrapping a lexer from another library, it is still relatively
easy to implement the Tokenizer interface.
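To make the idea of a token stream concrete, here is a minimal, self-contained sketch of a hand-written tokenizer in plain Java. It does not use PMD's Tokenizer interface: the class TokenizerSketch and its methods are hypothetical names chosen for illustration, and the lexical rules (identifiers or numbers, plus single-character symbols) are an assumption rather than those of any real language.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: hypothetical names, NOT PMD's Tokenizer interface.
final class TokenizerSketch {

    /** A token image together with the 1-based line it appears on. */
    static final class Token {
        final String image;
        final int line;

        Token(String image, int line) {
            this.image = image;
            this.line = line;
        }
    }

    /** Splits the source into words (letters, digits, '_') and single-character symbols. */
    static List<Token> tokenize(String source) {
        List<Token> tokens = new ArrayList<>();
        int line = 1;
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (c == '\n') {
                line++; // track line numbers so matches can be reported
                i++;
            } else if (Character.isWhitespace(c)) {
                i++; // whitespace never becomes a token
            } else if (isWordChar(c)) {
                int start = i;
                while (i < source.length() && isWordChar(source.charAt(i))) {
                    i++;
                }
                tokens.add(new Token(source.substring(start, i), line));
            } else {
                tokens.add(new Token(String.valueOf(c), line));
                i++;
            }
        }
        return tokens;
    }

    /** Returns only the token images, in order. */
    static List<String> images(String source) {
        List<String> images = new ArrayList<>();
        for (Token token : tokenize(source)) {
            images.add(token.image);
        }
        return images;
    }

    private static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }
}
```

Because whitespace never becomes a token in this sketch, differently formatted copies of the same code yield the same sequence of images, which is the property a token-based duplicate detector relies on. A generated Antlr or JavaCC lexer gives you the same kind of stream with far less effort.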
Use the following guide to set up a new language module that supports CPD.
- Create a new Maven module for your language. You can take the Golang module as an example.
  - Make sure to add your new module to the parent pom as a <module> entry, so that it is built alongside the other languages.
  - Also add your new module to the dependencies list in “pmd-languages-deps/pom.xml”, so that the new language is automatically available in the binary distribution (pmd-dist).
- Implement a Tokenizer.
  - For Antlr grammars you can take the grammar from antlr/grammars-v4 and place it in src/main/antlr4 followed by the package name of the language. You then need to call the appropriate ant wrapper to generate the lexer from the grammar. To do so, edit pom.xml (e.g. like the Golang module). Once that is done, mvn generate-sources should generate the lexer sources for you.

    You can now implement a tokenizer, for instance by extending AntlrTokenizer. The following reproduces the Go implementation:

    ```java
    // mind the package convention if you are going to make a PR
    package net.sourceforge.pmd.lang.go.cpd;

    public class GoTokenizer extends AntlrTokenizer {

        @Override
        protected Lexer getLexerForSource(CharStream charStream) {
            return new GolangLexer(charStream);
        }
    }
    ```
  - For JavaCC grammars, place your grammar in etc/grammar and edit the pom.xml like the Python implementation does. You can then subclass JavaCCTokenizer instead of AntlrTokenizer.
  - For any other scenario just implement the interface however you can. Look at the Scala or Apex module for existing implementations.
- Create a Language implementation, and make it implement CpdCapableLanguage. If your language only supports CPD, then you can subclass CpdOnlyLanguageModuleBase to get going:

      // mind the package convention if you are going to make a PR
      package net.sourceforge.pmd.lang.go;

      public class GoLanguageModule extends CpdOnlyLanguageModuleBase {

          // A public noarg constructor is required.
          public GoLanguageModule() {
              super(LanguageMetadata.withId("go").name("Go").extensions("go"));
          }

          @Override
          public Tokenizer createCpdTokenizer(LanguagePropertyBundle bundle) {
              // This method should return an instance of the tokenizer you created.
              return new GoTokenizer();
          }
      }

  To make PMD find the language module at runtime, write the fully-qualified name of your language class into the file src/main/resources/META-INF/services/net.sourceforge.pmd.lang.Language.

  At this point the new language module should be available in CPD and usable like any other language.
- Update the test that asserts the list of supported languages by updating the SUPPORTED_LANGUAGES constant in BinaryDistributionIT.
- Add some tests for your tokenizer by following the section below.
Declaring tokenizer options
To make the tokenizer configurable, first define some property descriptors using PropertyFactory. Look at Tokenizer for some predefined ones which you can reuse (prefer reusing property descriptors if you can). In your Language implementation, override newPropertyBundle and call definePropertyDescriptor to register the descriptors. After that you can read the property values from the LanguagePropertyBundle parameter of createCpdTokenizer.
To implement simple token filtering, you can use BaseTokenFilter as a base class, or another base class in net.sourceforge.pmd.cpd.impl. Take a look at the Kotlin token filter implementation, or the Java one.
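As a rough illustration of what token filtering means, the following self-contained sketch drops some tokens before they would reach duplicate detection. It is not PMD's BaseTokenFilter: the class TokenFilterSketch is hypothetical, tokens are modelled as plain strings, and the rules (treat any token starting with "//" as a comment, and skip everything between a CPD-OFF comment and the next CPD-ON comment) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a hypothetical stand-in, NOT PMD's BaseTokenFilter.
final class TokenFilterSketch {

    /**
     * Returns the tokens that survive filtering.
     * Assumption: a comment is any token image starting with "//".
     * Comments are always dropped; a comment containing "CPD-OFF" suppresses
     * all following tokens until a comment containing "CPD-ON" is seen.
     */
    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        boolean suppressed = false;
        for (String token : tokens) {
            if (token.startsWith("//")) {
                if (token.contains("CPD-OFF")) {
                    suppressed = true;
                } else if (token.contains("CPD-ON")) {
                    suppressed = false;
                }
                continue; // comments themselves are never emitted
            }
            if (!suppressed) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

The filter is a single pass with one boolean of state, which is why a small base class is usually enough for this kind of logic. A real filter would work on the lexer's token objects rather than strings.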
Testing your implementation
Add a Maven dependency on pmd-lang-test (scope test) in your pom.xml. This contains utilities to test your tokenizer.
Create a test class extending from CpdTextComparisonTest. To add tests, you need to write regular JUnit @Test-annotated methods, and call the method doTest with the name of the test file.
For example, for the Dart language:

    package net.sourceforge.pmd.lang.dart.cpd;

    // Imports assume JUnit 5 and the pmd-lang-test artifact.
    import org.junit.jupiter.api.Test;

    import net.sourceforge.pmd.cpd.test.CpdTextComparisonTest;

    public class DartTokenizerTest extends CpdTextComparisonTest {

        /**********************************
         Implementation of the superclass
         ***********************************/

        public DartTokenizerTest() {
            super("dart", ".dart"); // the ID of the language, then the file extension used by test files
        }

        @Override
        protected String getResourcePrefix() {
            // "testdata" is the default value, you don't need to override.
            // This specifies that you should place the test files in
            // src/test/resources/net/sourceforge/pmd/lang/dart/cpd/testdata
            return "testdata";
        }

        /**************
         Test methods
         ***************/

        @Test // don't forget the JUnit annotation
        public void testLiterals() {
            // This will look for a file named literals.dart
            // in the directory identified by getResourcePrefix,
            // tokenize it, then compare the result against a baseline
            // literals.txt file in the same directory.
            // If the baseline file does not exist, it is created automatically.
            doTest("literals");
        }
    }
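
The baseline mechanism described in the comments above can be pictured with a small self-contained sketch. This is not the code of CpdTextComparisonTest: BaselineSketch and matchesBaseline are hypothetical names, and returning false when the baseline had to be created is a design choice of this example, made so that a freshly generated baseline gets reviewed rather than silently accepted.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustration only: hypothetical helper, NOT the code of CpdTextComparisonTest.
final class BaselineSketch {

    /**
     * Compares the actual tokenizer dump against a baseline file.
     * If the baseline is missing, it is created from the actual text and
     * false is returned (assumption of this sketch). Otherwise the result
     * says whether the stored text matches exactly.
     */
    static boolean matchesBaseline(Path baseline, String actual) {
        try {
            if (Files.notExists(baseline)) {
                Files.write(baseline, actual.getBytes(StandardCharsets.UTF_8));
                return false;
            }
            String expected = new String(Files.readAllBytes(baseline), StandardCharsets.UTF_8);
            return expected.equals(actual);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In practice this means the first run of a new test writes the baseline, you inspect and commit that file, and later runs fail whenever the tokenizer output drifts from it.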