I am reading a boo开发者_StackOverflow社区k and deleting a number of words from it. My problem is that the process takes long time, and i want to make its performance better(Less time), example :
Vector<String> pages = new Vector<String>(); // Contains about 1500 page, each page has about 1000 words.
Vector<String> wordsToDelete = new Vector<String>(); // Contains about 50000 words.
for( String page: pages ) {
String pageInLowCase = page.toLowerCase();
for( String wordToDelete: wordsToDelete ) {
if( pageInLowCase.contains( wordToDelete ) )
page = page.replaceAll( "(?i)\\b" + wordToDelete + "\\b" , "" );
}
// Do some staff with the final page that does not take much time.
}
This code takes around 3 minutes to execute. If i skipped the loop of replaceAll(...) i can save more than 2 minutes. So is there a way to do the same loop with a faster performance ?
Yes, you can process page in a different way. The basic idea is following
for (String word : page) {
if (!forbiddenWords.contains(word)) {
pageResult.append(word);
}
}
Here forbiddenWords
is a set.
Also, for (String word : page)
is a shorthand for parsing page into a list of words. Don't forget to add whitespaces to result too (I skip that for clarity).
The complexity of processing one page in original version was ~ 50000*1000, while now it's only ~1000. (checking if word is in HashSet
takes constant time)
edit
Since I wanted to divert myself from work for ten minutes, here's the code :)
String text = "This is a bad word, and this is very bad, terrible word.";
Set<String> forbiddenWords = new HashSet<String>(Arrays.asList("bad", "terrible"));
text += "|"; // mark end of text
boolean readingWord = false;
StringBuilder currentWord = new StringBuilder();
StringBuilder result = new StringBuilder();
for (int pos = 0; pos < text.length(); ++pos) {
char c = text.charAt(pos);
if (readingWord) {
if (Character.isLetter(c)) {
currentWord.append(c);
} else {
// finished reading a word
readingWord = false;
if (!forbiddenWords.contains(currentWord.toString().toLowerCase())) {
result.append(currentWord);
}
result.append(c);
}
} else {
if (Character.isLetter(c)) {
// start reading a new word
readingWord = true;
currentWord.setLength(0);
currentWord.append(c);
} else {
// append punctuation marks and spaces to result immediately
result.append(c);
}
}
}
result.setLength(result.length() - 1); // remove end of text mark
System.out.println(result);
For a start, you can get rid of the contains(..)
check. It adds unneeded overhead. And it will return true sometimes, when this is not the case. For example it will return true
for "not", even if there is only "knot" on the page.
Another thing - replace Vector
with ArrayList
.
And as Konrad indicated in his comment - you are not changing the vectors. String
is immutable, so you are not changing the objects. You'd have to use set(..)
(and maintain an iteration index).
The problem is you have a double for loop. These are generally poor performance and equate to x*y performance. Also since strings cannot be changed every single time you call toLowerCase and then replaceAll you are creating a new string. So you are creating x*y number of strings containing a whole page for each word in your list. This can be avoided using the MULTI_LINE and CASE_INSENSITIVE options in a regex.
You can reduce it to one loop and use regex to replace all the words at once.
StringBuffer buffer = new StringBuffer();
for (String word : wordsToDelete) {
if (buffer.length() != 0) {
buffer.append("|");
}
buffer.append("(\\b");
buffer.append(word);
buffer.append("\\b)");
}
Pattern pattern = Pattern.compile(buffer.toString(), Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
List<String> newPageList = new ArrayList<String>();
for (String page : pages) {
Matcher matcher = pattern.matcher(page);
String newPage = matcher.replaceAll("");
newPageList.add(newPage);
}
Assuming the pages are independent, and if you have multiple cores around, and you've got a lot of pages to process, this loop could be parallelized, also:
final ArrayList<String> pages = ...;
final Set<String> wordsToDelete = ...;
final ExecutorService pageFrobber = Executors.newFixedThreadPool(8); //pick suitable size
final List<Callable<String>> toFrobPages = new ArrayList<Callable<String>>(pages.size());
for( final String page: pages ) {
toFrobPages.add(new Callable<String>() {
String call() {
return page.toLowerCase().replaceAll( "(?i)\\b" + wordToDelete + "\\b" , "" );
}
});
}
final List<Future<String>> frobbedPages = pageFrobber.executeAll(toFrobPages);
// the above will block until all pages are processed
// frobbedPages will contain a set of Future<String> which can be converted to strings
// by calling get()
Use java.lang.StringBuilder
- it's created specially for modified text.
StringBuilder builder = new StringBuilder(page);
for (String word: wordsToDelete) {
int position = 0;
int newpos = 0;
while ((newpos = builder.indexOf(word, position))>=0) {
builder.delete(position, position+word.length());
position = newpos;
}
}
It's just the idea - it doesn't check for word boundaries
精彩评论