To find similar words (strings) in two files

开发者 https://www.devze.com 2023-03-27 21:55 Source: web
I have to compare word 1 in file 1 with word 2 in file 2, and so on. If word 1 (file 1) equals word 2 (file 2), file 3 should be the output showing True and False. Below is my code, but I am stuck: it runs without any error yet produces no output. I am a beginner in Java.

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Scanner;

public class test2 {

    private static ArrayList<String> load(String f1) throws FileNotFoundException {
        Scanner reader = new Scanner(new File(f1));
        ArrayList<String> out = new ArrayList<String>();
        while (reader.hasNext()) {
            String temp = reader.nextLine();
            String[] sts = temp.split(" ");
            for (int i = 0; i < sts.length; i++) {
                if (sts[i].equals("") && sts[i].equals(" ") && sts[i].equals("\n")) {
                    out.add(sts[i]);
                }
            }
        }
        return out;
    }

    private static void write(ArrayList<String> out, String fname) throws IOException {
        FileWriter writer = new FileWriter(new File("out_test2.txt"));
        for (int i = 0; i < out.size(); i++) {
            writer.write(out.get(i) + "\n");
        }
        writer.close();
    }

    public static void main(String[] args) throws IOException {
        ArrayList<String> file1;
        ArrayList<String> file2;
        ArrayList<String> out = new ArrayList<String>();
        file1 = load("IbanDict.txt");
        file2 = load("AFF_outVal.txt");

        for (int i = 0; i < file1.size(); i++) {
            String word1 = file1.get(i);
            for (int z = 0; z < file2.size(); z++) {
                if (word1.equalsIgnoreCase(file2.get(z))) {
                    boolean already = false;
                    for (int q = 0; q < out.size(); q++) {
                        if (out.get(q).equalsIgnoreCase(file1.get(i))) {
                            already = true;
                        }
                    }
                    if (already == false) {
                        out.add(file1.get(i));
                    }
                }
            }
        }
        write(out, "out_test2.txt");
    }

}


Here is my suggestion for your problem:

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {

  private static final Pattern WORD_PATTERN = Pattern.compile("[\\w']+");

  private static Map<String, Integer> load(final String f1) throws FileNotFoundException {
    Scanner reader = new Scanner(new File(f1));
    Map<String, Integer> out = new HashMap<String, Integer>();
    while (reader.hasNext()) {
      String tempLine = reader.nextLine();
      if (tempLine != null && tempLine.trim().length() > 0) {
        Matcher matcher = WORD_PATTERN.matcher(tempLine);
        while (matcher.find()) {
          out.put(matcher.group().toLowerCase(), 0);
        }
      }
    }

    return out;
  }

  private static void write(final Map<String, Integer> out, final String fname) throws IOException {
    FileWriter writer = new FileWriter(new File(fname));
    for (Map.Entry<String, Integer> word : out.entrySet()) {
      if (word.getValue() == 1) {
        writer.write(word.getKey() + "\n");
      }
    }
    writer.close();
  }

  public static void main(final String[] args) throws IOException {
    Map<String, Integer> file1 = load("file1.txt");
    Map<String, Integer> file2 = load("file2.txt");

    // Single pass over file1; each containsKey call on file2 is a hash lookup, so no nested loop is needed
    for (Map.Entry<String, Integer> file1Word : file1.entrySet()) {
      if (file2.containsKey(file1Word.getKey())) {
        file1.put(file1Word.getKey(), 1);
        file2.put(file1Word.getKey(), 1);
      }
    }

    write(file1, "test1.txt");
    write(file2, "test2.txt");
  }

}


Firstly, Scanner will tokenise the input for you: by default, next() returns whitespace-separated tokens. There is no need to read in a line and tokenise it with the String.split method.

Secondly, it looks like you have a logic error here:

for (int i = 0; i < sts.length; i++) {
    if (sts[i].equals("") && sts[i].equals(" ")
            && sts[i].equals("\n"))
       out.add(sts[i]);
}

(assuming I understand what you're trying to do) it should be:

for (int i = 0; i < sts.length; i++) {
    // Keep the token only if it is none of "", " " or "\n"
    if (!(sts[i].equals("") || sts[i].equals(" ")
            || sts[i].equals("\n")))
       out.add(sts[i]);
}

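A string can never equal "", " " and "\n" all at once, so your original condition is always false and nothing is ever added to the list. This is why you are not seeing any output.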
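
A small sketch of the difference, assuming the goal is simply to drop empty tokens (the class and method names are invented for illustration). Note that String.split(" ") produces an empty string wherever two spaces are adjacent, so those are the tokens worth filtering out.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenFilter {

    // The condition from the question: a string cannot equal "", " " and "\n"
    // at the same time, so this returns false for every input.
    public static boolean originalCondition(String s) {
        return s.equals("") && s.equals(" ") && s.equals("\n");
    }

    // Keeps every non-empty token. split(" ") never returns " " itself,
    // and Scanner.nextLine() has already stripped the line break.
    public static List<String> nonEmptyTokens(String line) {
        List<String> out = new ArrayList<String>();
        for (String token : line.split(" ")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(originalCondition(""));    // false
        System.out.println(nonEmptyTokens("a  b c")); // [a, b, c]
    }
}
```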

Note: This way of matching is error-prone and far from optimal: every lookup is a linear scan, so comparing the two files takes time proportional to the product of their sizes. You might have more success with a specialised text-processing language like awk or Python (assuming you're not bound to Java). If you're stuck with Java, an alternative implementation might be to extend the FilterReader/FilterWriter classes.


There are a few issues I see. One is the redundant splitting on spaces that wulfgar.pro pointed out.

Another issue is that Scanner tokens keep their punctuation, so "happy" from file1 ("I am happy and sad") will not match the token "happy." from file2 ("You are happy.").
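
To make the punctuation issue concrete, here is a minimal sketch that tokenises the same sentence both ways. It reads from a String instead of a File so it runs on its own; the class name, method names and sample sentence are assumptions for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PunctuationDemo {

    private static final Pattern WORD = Pattern.compile("[\\w']+");

    // Tokens as Scanner sees them: split on whitespace only.
    public static List<String> scannerTokens(String text) {
        List<String> out = new ArrayList<String>();
        Scanner reader = new Scanner(text);
        while (reader.hasNext()) {
            out.add(reader.next());
        }
        reader.close();
        return out;
    }

    // Tokens as the regex sees them: runs of word characters and apostrophes.
    public static List<String> regexTokens(String text) {
        List<String> out = new ArrayList<String>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            out.add(matcher.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scannerTokens("You are happy.")); // [You, are, happy.]
        System.out.println(regexTokens("You are happy."));   // [You, are, happy]
    }
}
```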

I also changed it to use Sets, since you don't seem to be worried about how many times a word matches. Then use for-each loops to iterate (you are using generics, so you should be able to do for-each loops as well).

So I rewrote the while-loop in the load method:

// Needs java.util.HashSet, java.util.Set, java.util.regex.Matcher and
// java.util.regex.Pattern in addition to the imports in the question.

// Matches words (runs of word characters and apostrophes), which leaves punctuation out
private static final Pattern WORD_PATTERN = Pattern.compile("[\\w']+");

private static Set<String> load(String f1) throws FileNotFoundException {
    Scanner reader = new Scanner(new File(f1));
    Set<String> out = new HashSet<String>();
    // hasNextLine() is the check that pairs with nextLine()
    while (reader.hasNextLine()) {
        String tempLine = reader.nextLine();
        // An empty line simply produces no matches
        Matcher matcher = WORD_PATTERN.matcher(tempLine);
        while (matcher.find()) {
            out.add(matcher.group());
        }
    }
    reader.close();
    return out;
}

The for-loop in the main method can then be simplified to:

public static void main(String[] args) throws IOException {
    Set<String> out = new HashSet<String>();
    Set<String> file1 = load("IbanDict.txt");
    Set<String> file2 = load("AFF_outVal.txt");

    for (String word1 : file1) {
        for (String word2 : file2) {
            if (word1.equalsIgnoreCase(word2)) {
                boolean already = false;
                for (String outStr : out) {
                    if (outStr.equalsIgnoreCase(word1)) {
                        already = true;
                    }
                }
                if (!already) {
                    out.add(word1);
                }
            }
        }
    }
    write(out, "out_test2.txt");
}
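
Since both inputs are already Sets, the nested loops could in principle be replaced by a set intersection. The sketch below is one way to do that, not part of the original answer. It assumes words are normalised to lower case first, so that retainAll behaves like a case-insensitive match (this can differ from equalsIgnoreCase for a few unusual Unicode characters). The class and method names are illustrative.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

public class SetIntersection {

    // Copies the words into a set, lower-casing each one.
    public static Set<String> lowerCased(Iterable<String> words) {
        Set<String> out = new HashSet<String>();
        for (String word : words) {
            out.add(word.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Returns the words present in both collections, ignoring case.
    public static Set<String> common(Iterable<String> a, Iterable<String> b) {
        // TreeSet only so that the printed order is stable
        Set<String> result = new TreeSet<String>(lowerCased(a));
        result.retainAll(lowerCased(b));
        return result;
    }

    public static void main(String[] args) {
        Set<String> out = common(
                Arrays.asList("I", "am", "Happy", "and", "sad"),
                Arrays.asList("you", "are", "happy", "SAD"));
        System.out.println(out); // [happy, sad]
    }
}
```

The result can be passed to the same write method, because a Set is also an Iterable.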

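And change the write method to iterate over any Iterable, and use the line.separator system property so that line endings are OS-independent:

private static void write(Iterable<String> out, String fname) throws IOException {
    OutputStreamWriter writer = new FileWriter(new File(fname));
    for (String s : out) {
        // Platform line ending (File.separator would be the path separator instead)
        writer.write(s + System.getProperty("line.separator"));
    }
    writer.close();
}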


So basically you want to check if a word from file 2 also exists in file 1. If so print true, if not print false.

The easiest way is probably to make a searchable dataset of all the words in file 1. For each word in file 2 you can then check whether the dataset contains it.

The code below never adds anything. It splits each line into the array sts and then checks whether a token is empty AND a space AND a newline. If so, it adds the token to an ArrayList. A token can never be all of those things at once, so nothing is ever added to out.

Scanner reader = new Scanner(new File(f1));
ArrayList<String> out = new ArrayList<String>();
while (reader.hasNext()) {
  String temp = reader.nextLine();    
  String[] sts = temp.split(" ");
  for (int i = 0; i < sts.length; i++) {
    if (sts[i].equals("") && sts[i].equals(" ") && sts[i].equals("\n")) {
      out.add(sts[i]);
    }
  }
}

Modify your loop to collect all the words by iterating over the tokens in your Scanner and adding them to the ArrayList:

while (reader.hasNext()) {
 out.add(reader.next());
}

Now that you have an ArrayList of all the words in your dictionary, you can start checking.

To see if a word from file 2 is contained in the dictionary you can simply call

dictionary.contains(file2.get(i))

contains calls the equals method on the Strings in the ArrayList until it finds a match, so the comparison is case-sensitive.
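
Because contains relies on equals, "Happy" in the dictionary will not match "happy" in file 2, and on an ArrayList every call scans the list from the start. One possible alternative, sketched here under the assumption that case should be ignored, is to keep lower-cased words in a HashSet. The DictionaryLookup class is hypothetical and not part of the code above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class DictionaryLookup {

    private final Set<String> words = new HashSet<String>();

    // Stores every dictionary word in lower case.
    public DictionaryLookup(List<String> dictionary) {
        for (String word : dictionary) {
            words.add(word.toLowerCase(Locale.ROOT));
        }
    }

    // Hash lookup on the lower-cased word instead of a scan over a list.
    public boolean contains(String word) {
        return words.contains(word.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("Happy", "sad");
        System.out.println(list.contains("happy"));                       // false
        System.out.println(new DictionaryLookup(list).contains("happy")); // true
    }
}
```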

Now, if you want to do it line by line, you should not make two datasets. Your dictionary should be a dataset, but for file 2 it is easier to just use the Scanner object.

Read each line from the Scanner. Make sure you use hasNextLine() instead of hasNext() here, since hasNextLine() is the check you need for this iteration.

line = reader.nextLine();

For each token in the line, check whether it has a match in the list, and write true or false followed by a space:

String[] splitLine = line.split(" ");
for (String token : splitLine) {
  writer.write(dictionary.contains(token) + " ");
}

While checking each line you can write a line to your output file so that the line numbers match.

Your final code will look something like this:

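import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class Test {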

  private static List<String> loadDictionary(String fileName) throws FileNotFoundException {
    Scanner reader = new Scanner(new File(fileName));
    List<String> out = new ArrayList<String>();
    while (reader.hasNext()) {
      out.add(reader.next());
    }
    reader.close();
    return out;
  }

  public static void main(String[] args) throws IOException {
    List<String> dictionary;
    dictionary = loadDictionary("IbanDict.txt");

    Scanner reader = new Scanner(new File("AFF_outVal.txt"));
    OutputStreamWriter writer = new FileWriter(new File("out_test2.txt"));

    while(reader.hasNextLine()){
      String line = reader.nextLine();
      String[] tokens = line.split(" ");
      for(String token: tokens){
        writer.write(dictionary.contains(token)+" ");
      }
      writer.write(System.getProperty("line.separator"));
    }
    writer.close();
    reader.close();
  }
}
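
To make the output format concrete, here is a small sketch of the per-line check that works on in-memory strings instead of files, so it runs without IbanDict.txt or AFF_outVal.txt. The class name, method name and sample words are invented for this example. Unlike the loop above, it joins the results with single spaces rather than writing a trailing space.

```java
import java.util.Arrays;
import java.util.List;

public class LineCheck {

    // Builds one output line: "true" or "false" for each space-separated token.
    public static String check(List<String> dictionary, String line) {
        StringBuilder sb = new StringBuilder();
        for (String token : line.split(" ")) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(dictionary.contains(token));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("apple", "pear");
        System.out.println(check(dictionary, "apple plum pear")); // true false true
    }
}
```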