Created extra CSV output format 'csv_with_linecount_per_file' which outputs the correct line count per file.

Some of the tokenizers ignore comments, so the line count of a duplication can differ per file. Take for example the following files:

FileA.java:
    1: public class FileA {
    2:     public String Foo() {
    3:         return "Foo";
    4:     }
    5: }

FileB.java:
    1: public class FileB {
    2:     public String Foo() {
    3:         // This is a comment
    4:         return "Foo";
    5:     }
    6: }

When comments are ignored and not tokenized, the duplication consists of the following tokens:
    '{', 'public', 'String', 'Foo', '(', ')', '{', 'return', 'Foo', ';', '}', '}'

For 'FileA.java' the duplication is 5 lines long: it starts at line 1 and ends at line 5.
For 'FileB.java' the duplication is 6 lines long: it starts at line 1 and ends at line 6.

Note that this is just one example, because for most tokenizers comments and whitespace are not significant. For example, the following file contains the same duplication all on one line:

FileC.java:
    1: public class FileC { public String Foo() { return "Foo"; } }

For us the correct line count per file is important, because we highlight the duplications in an annotated source view and show the percentage of duplicated code each file contains.

The current output formats contain only one line count per duplication and file set. For the above example CPD would output the following:

    Found a 4 line (12 tokens) duplication in the following files:
    Starting at line 1 of FileA.java
    Starting at line 1 of FileB.java

For FileB.java this is not correct and would lead to an incorrect percentage of duplicated code (66% (4 of 6 lines) instead of the correct 83% (5 of 6 lines)).

To fix the problem, I created an extra output format 'csv_with_linecount_per_file' which outputs the correct line count per file. The format contains the following:

    tokens,occurrences
    <nr of tokens>,<nr of occurrences>(,<begin line>,<line count>,<file name>)+

For the above example the output would be:

    tokens,occurrences
    12,2,1,4,FileA.java,1,5,FileB.java
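
For consumers of the new format, a minimal sketch of how one record could be read back is given here. It is not part of this commit: the class name `CsvWithLinecountPerFileExample`, the `Occurrence` type and the `parseRecord` helper are hypothetical names chosen for illustration, and the sketch assumes file names contain no commas and are not quoted.

```java
import java.util.ArrayList;
import java.util.List;

public class CsvWithLinecountPerFileExample {

    /** One occurrence of a duplication (hypothetical helper type, not a PMD class). */
    public static final class Occurrence {
        public final int beginLine;
        public final int lineCount;
        public final String fileName;

        Occurrence(int beginLine, int lineCount, String fileName) {
            this.beginLine = beginLine;
            this.lineCount = lineCount;
            this.fileName = fileName;
        }
    }

    /**
     * Parses one data record of the form
     * tokens,occurrences(,beginLine,lineCount,fileName)+
     * Assumption: file names contain no commas and no quoting is used.
     */
    public static List<Occurrence> parseRecord(String record) {
        String[] fields = record.split(",");
        if (fields.length < 2) {
            throw new IllegalArgumentException("Record too short: " + record);
        }
        int occurrences = Integer.parseInt(fields[1]);
        // Two leading fields, then three fields per occurrence.
        if (fields.length != 2 + 3 * occurrences) {
            throw new IllegalArgumentException("Unexpected field count: " + record);
        }
        List<Occurrence> result = new ArrayList<>();
        for (int i = 0; i < occurrences; i++) {
            int base = 2 + 3 * i;
            result.add(new Occurrence(
                    Integer.parseInt(fields[base]),
                    Integer.parseInt(fields[base + 1]),
                    fields[base + 2]));
        }
        return result;
    }

    public static void main(String[] args) {
        // Data record taken from the example in the commit message.
        for (Occurrence o : parseRecord("12,2,1,4,FileA.java,1,5,FileB.java")) {
            System.out.println(o.fileName + ": " + o.lineCount
                    + " lines starting at line " + o.beginLine);
        }
    }
}
```

Because each occurrence carries its own line count, a consumer can compute the duplicated percentage per file instead of reusing a single count for the whole file set.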
2015-01-21 13:55:49 +01:00
parent 5ae01d2d3e
commit b1769846e5
14 changed files with 311 additions and 93 deletions
--- a/pmd-java/src/test/java/net/sourceforge/pmd/cpd/MatchAlgorithmTest.java
+++ b/pmd-java/src/test/java/net/sourceforge/pmd/cpd/MatchAlgorithmTest.java
@@ -55,15 +55,18 @@ public class MatchAlgorithmTest {
        Match match = matches.next();
        assertFalse(matches.hasNext());

-        Iterator<TokenEntry> marks = match.iterator();
-        TokenEntry mark1 = marks.next();
-        TokenEntry mark2 = marks.next();
+        Iterator<Mark> marks = match.iterator();
+        Mark mark1 = marks.next();
+        Mark mark2 = marks.next();
        assertFalse(marks.hasNext());

        assertEquals(3, mark1.getBeginLine());
+        assertEquals("Foo.java", mark1.getFilename());
+        assertEquals(LINE_3, mark1.getSourceCodeSlice());
+
        assertEquals(4, mark2.getBeginLine());
-        assertTrue("Foo.java" == mark1.getTokenSrcID() && "Foo.java" == mark2.getTokenSrcID());
-        assertEquals(LINE_3, match.getSourceCodeSlice());
+        assertEquals("Foo.java", mark2.getFilename());
+        assertEquals(LINE_4, mark2.getSourceCodeSlice());
    }

    @Test
@@ -84,7 +87,7 @@ public class MatchAlgorithmTest {
        Match match = matches.next();
        assertFalse(matches.hasNext());

-        Iterator<TokenEntry> marks = match.iterator();
+        Iterator<Mark> marks = match.iterator();
        marks.next();
        marks.next();
        marks.next();