We are developing a plagiarism detection framework. In it, I have to highlight the possibly plagiarized phrases in the document. The document is first preprocessed with stop word removal, stemming, and number removal, so highlighting becomes difficult once only the preprocessed tokens are available. As an example:
Original text: "Extreme programming is one approach of agile software development which emphasizes on frequent releases in short development cycles which are called time boxes. This result in reducing the costs spend for changes, by having multiple short development cycles, rather than one long one. Extreme programming includes pair-wise programming (for code review, unit testing). Also it avoids implementing features which are not included in the current time box, so the schedule creep can be minimized. "
Phrase I want to highlight: Extreme programming includes pair-wise programming
Preprocessed tokens: Extrem program pair-wise program
Is there any way I can highlight the preprocessed tokens in the original document?
Thanks
You'd better use JTextPane or JEditorPane, instead of JTextArea.
A text area is a "plain" text component, which means that although it can display text in any font, all of the text is in the same font.
So JTextArea is not a convenient component for text formatting.
In contrast, with JTextPane or JEditorPane it's quite easy to change the style (highlight) of any part of the loaded text.
See How to Use Editor Panes and Text Panes for details.
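As a rough illustration of the styled-text route, here is a minimal sketch that gives every exact occurrence of a phrase a background color in a StyledDocument. The class and method names (StyledHighlight, highlighted) are made up for this example, and it only does exact string matching.

```java
import java.awt.Color;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.StyleConstants;
import javax.swing.text.StyledDocument;

// Hypothetical helper, not part of Swing: builds a styled document in which
// every exact occurrence of a phrase has the given background color.
final class StyledHighlight {

    static StyledDocument highlighted(String text, String phrase, Color color) {
        StyledDocument doc = new DefaultStyledDocument();
        try {
            doc.insertString(0, text, null);
        } catch (BadLocationException e) {
            // Offset 0 is always valid for an empty document.
            throw new IllegalStateException(e);
        }

        SimpleAttributeSet attrs = new SimpleAttributeSet();
        StyleConstants.setBackground(attrs, color);

        if (!phrase.isEmpty()) {
            int pos = text.indexOf(phrase);
            while (pos >= 0) {
                // replace = false: keep any attributes the range already has
                doc.setCharacterAttributes(pos, phrase.length(), attrs, false);
                pos = text.indexOf(phrase, pos + phrase.length());
            }
        }
        return doc;
    }
}
```

The resulting document can be displayed with new JTextPane(doc). Other styles (bold, foreground color, and so on) can be set on the same attribute set through StyleConstants.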
Update:
The following code highlights the desired part of your text. It's not exactly what you want: it simply finds the exact phrase in the text. It uses a Highlighter, which works with any JTextComponent, including JTextArea.
But I hope that if you apply your algorithms, you can easily modify it to fit your needs.
import javax.swing.*;
import javax.swing.text.*;
import java.awt.*;

public class LineHighlightPainter {

    private final String revisedText = "Extreme programming is one approach "
            + "of agile software development which emphasizes on frequent"
            + " releases in short development cycles which are called "
            + "time boxes. This result in reducing the costs spend for "
            + "changes, by having multiple short development cycles, "
            + "rather than one long one. Extreme programming includes "
            + "pair-wise programming (for code review, unit testing). "
            + "Also it avoids implementing features which are not included "
            + "in the current time box, so the schedule creep can be minimized. ";

    private final String token = "Extreme programming includes pair-wise programming";

    // An instance of the private subclass of the default highlight painter
    private final Highlighter.HighlightPainter myHighlightPainter =
            new MyHighlightPainter(Color.red);

    public static void main(String[] args) {
        // Build the GUI on the event dispatch thread
        SwingUtilities.invokeLater(new Runnable() {
            public void run() {
                new LineHighlightPainter().createAndShowGUI();
            }
        });
    }

    public void createAndShowGUI() {
        JFrame frame = new JFrame("LineHighlightPainter demo");
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);

        JTextArea area = new JTextArea(9, 45);
        area.setLineWrap(true);
        area.setWrapStyleWord(true);
        area.setText(revisedText);

        // Highlight part of the text in the JTextArea based on token
        highlight(area, token);

        frame.getContentPane().add(new JScrollPane(area), BorderLayout.CENTER);
        frame.pack();
        frame.setVisible(true);
    }

    // Creates highlights around all occurrences of pattern in textComp
    public void highlight(JTextComponent textComp, String pattern) {
        // First remove all old highlights
        removeHighlights(textComp);

        // An empty pattern matches at every position and would never advance
        if (pattern == null || pattern.isEmpty()) {
            return;
        }

        try {
            Highlighter hilite = textComp.getHighlighter();
            Document doc = textComp.getDocument();
            String text = doc.getText(0, doc.getLength());
            int pos = 0;

            // Search for pattern
            while ((pos = text.indexOf(pattern, pos)) >= 0) {
                // Apply the private painter around this occurrence
                hilite.addHighlight(pos, pos + pattern.length(), myHighlightPainter);
                pos += pattern.length();
            }
        } catch (BadLocationException e) {
            // Not expected: all offsets come from the document's own text
            e.printStackTrace();
        }
    }

    // Removes only our private highlights
    public void removeHighlights(JTextComponent textComp) {
        Highlighter hilite = textComp.getHighlighter();
        Highlighter.Highlight[] hilites = hilite.getHighlights();

        for (int i = 0; i < hilites.length; i++) {
            if (hilites[i].getPainter() instanceof MyHighlightPainter) {
                hilite.removeHighlight(hilites[i]);
            }
        }
    }

    // A private subclass of the default highlight painter
    private static class MyHighlightPainter
            extends DefaultHighlighter.DefaultHighlightPainter {
        public MyHighlightPainter(Color color) {
            super(color);
        }
    }
}
This example is based on Highlighting Words in a JTextComponent.
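To get from the preprocessed tokens back to positions in the original text, one option is to record each token's character offsets while preprocessing and then match on the stems. The sketch below assumes a simple regex tokenizer, and the stop word list and stemmer in main are toy stand-ins for your real preprocessing; OffsetTokenizer, Token, tokenize and findPhrase are names invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: keeps the original offsets of every preprocessed token.
final class OffsetTokenizer {

    // A stem plus the span [start, end) of the word it came from.
    static final class Token {
        final String stem;
        final int start;
        final int end;

        Token(String stem, int start, int end) {
            this.stem = stem;
            this.start = start;
            this.end = end;
        }
    }

    // Words are runs of letters and hyphens; digits never match,
    // which doubles as the number removal step (an assumption).
    private static final Pattern WORD = Pattern.compile("\\p{L}[\\p{L}-]*");

    static List<Token> tokenize(String text, Set<String> stopWords,
                                UnaryOperator<String> stemmer) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            String word = m.group();
            if (stopWords.contains(word.toLowerCase(Locale.ROOT))) {
                continue; // stop words are dropped, but offsets of the rest survive
            }
            tokens.add(new Token(stemmer.apply(word), m.start(), m.end()));
        }
        return tokens;
    }

    // Returns {start, end} in the original text for the first run of tokens
    // whose stems equal the given stems, or null if there is none.
    static int[] findPhrase(List<Token> tokens, List<String> stems) {
        if (stems.isEmpty()) {
            return null;
        }
        for (int i = 0; i + stems.size() <= tokens.size(); i++) {
            boolean match = true;
            for (int j = 0; j < stems.size() && match; j++) {
                match = tokens.get(i + j).stem.equals(stems.get(j));
            }
            if (match) {
                return new int[] {tokens.get(i).start,
                                  tokens.get(i + stems.size() - 1).end};
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String text = "rather than one long one. Extreme programming includes "
                + "pair-wise programming (for code review, unit testing).";
        // Toy stop words and stemmer, for illustration only.
        Set<String> stopWords = new HashSet<>(Arrays.asList("includes", "for", "than"));
        UnaryOperator<String> toyStemmer = w ->
                w.endsWith("ming") ? w.substring(0, w.length() - 4)
                        : w.equals("Extreme") ? "Extrem" : w;

        List<Token> tokens = tokenize(text, stopWords, toyStemmer);
        int[] span = findPhrase(tokens,
                Arrays.asList("Extrem", "program", "pair-wise", "program"));
        if (span != null) {
            System.out.println(text.substring(span[0], span[1]));
        }
    }
}
```

The returned pair can be passed straight to hilite.addHighlight(span[0], span[1], myHighlightPainter) in the example above, so the highlight also covers the stop words that sit between the matched tokens.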
From a technical point of view, you can either choose or develop a markup language and add annotations or tags to the original document, or create a second file that records all potential plagiarisms.
With markup, your text could look like this:
[...] rather than one long one. <plag ref="1234">Extreme programming
includes pair-wise programming</plag> (for code review, unit testing). [...]
(with ref referring to some metadata record that describes the original)
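A minimal sketch of inserting such a tag, assuming you already know the character offsets of the suspicious span in the original text. PlagMarkup and wrap are hypothetical names, the plag element is the ad-hoc markup from the example above, and no escaping of the text or the ref value is attempted.

```java
// Hypothetical helper: wraps one span of the original text in a <plag> tag.
final class PlagMarkup {

    static String wrap(String text, int start, int end, String ref) {
        if (start < 0 || end > text.length() || start > end) {
            throw new IllegalArgumentException(
                    "span [" + start + ", " + end + ") is outside the text");
        }
        return text.substring(0, start)
                + "<plag ref=\"" + ref + "\">"
                + text.substring(start, end)
                + "</plag>"
                + text.substring(end);
    }
}
```

When several spans have to be tagged, applying them from the last span to the first keeps the earlier offsets valid, because each insertion only shifts the text after it.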
You could use java.text.AttributedString to annotate the preprocessed tokens in the original document, then apply TextAttributes to the relevant ones (which would take effect in the original document).
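A small sketch of that idea, assuming the span offsets in the original text are already known. It marks the span with TextAttribute.UNDERLINE; TextAttribute.BACKGROUND with a java.awt.Color value would work the same way for a colored highlight. AttributedHighlight, mark and firstMarkedRun are names made up for this example.

```java
import java.awt.font.TextAttribute;
import java.text.AttributedCharacterIterator;
import java.text.AttributedString;

// Hypothetical helper around java.text.AttributedString.
final class AttributedHighlight {

    // Marks the span [start, end) of the original text as underlined.
    static AttributedString mark(String text, int start, int end) {
        AttributedString attributed = new AttributedString(text);
        attributed.addAttribute(TextAttribute.UNDERLINE,
                TextAttribute.UNDERLINE_ON, start, end);
        return attributed;
    }

    // Returns {start, end} of the first underlined run, or null if none exists.
    static int[] firstMarkedRun(AttributedString attributed) {
        AttributedCharacterIterator it = attributed.getIterator();
        while (it.getIndex() < it.getEndIndex()) {
            int limit = it.getRunLimit(TextAttribute.UNDERLINE);
            if (it.getAttribute(TextAttribute.UNDERLINE) != null) {
                return new int[] {it.getIndex(), limit};
            }
            it.setIndex(limit); // jump to the start of the next run
        }
        return null;
    }
}
```

The iterator from getIterator() can be handed to Graphics2D.drawString or a java.awt.font.TextLayout for rendering, which is where the attributes actually take visual effect.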