
Algorithm to detect how many words typed, also multi sentence support (Java)

Source: 开发者 (https://www.devze.com), 2022-12-28 13:28, reposted from the web

Problem:

I have to design an algorithm, which does the following for me:

Say that I have a line (e.g.)

alert tcp 192.168.1.1 (caret is currently here)

The algorithm should process this line, and return a value of 4.

I coded something for it. I know it's sloppy, but it partly works.

private int counter = 0;
    public void determineRuleActionRegion(String str, int index) {
        if (str.length() == 0 || str.indexOf(" ") == -1) {
            triggerSuggestionList(1);
            return;
        }

        //remove duplicate space, spaces in front and back before searching
        int num = str.trim().replaceAll(" +", " ").indexOf(" ", index);
        //Check for occurrences of spaces, recursively
        if (num == -1) { //if there is no space
            //no need to check if it's 0 times it will assign to 1
            triggerSuggestionList(counter + 1);
            counter = 0;
            return; //set to rule action
        } else { //there is a space
            counter++;
            determineRuleActionRegion(str, num + 1);
        }

    } //end of determineactionRegion()

So basically I search for spaces to determine the region (the number of words typed). However, I want the region to change as soon as the user presses the space bar (<space character>).

How can I achieve that with the current code?

Or better yet, how would you suggest doing it the correct way? I'm looking into BreakIterator for this case...

In addition, I believe my algorithm won't work for multiple sentences. How should I address that problem as well?

--

The String str comes from the JTextPane, via textPane.getText(0, pos + 1);.

Thanks in advance. Do let me know if my question is still not specific enough.

--

More examples:

alert tcp $EXTERNAL_NET any -> $HOME_NET 22 <caret>

return -1 (maximum of the typed text is 7 words)

alert tcp 192.168.1.1 any<caret> 

return 4 (as it is still at the 4th arg)

alert tcp<caret>

return 2 (as it is still at 2nd arg)

alert tcp <caret>

return 3

alert tcp $EXTERNAL_NET any -> <caret>

return 6

It is similar to shell commands, as shown above. I don't think it differs much; I just want to know how many arguments have been typed. Thanks.

--

Pseudocode

Get whole paragraph from textpane
  if more than 1 line -> process the last line
      count how many arguments typed and return appropriate number
  else
    process current line
      count how many arguments typed and return appropriate number
End


This uses String.split; I think this is what you want.

    String[] texts = {
        "alert tcp $EXTERNAL_NET any -> $HOME_NET 22 ",
        "alert tcp 192.168.1.1 any",
        "alert tcp",
        "alert tcp ",
        "alert tcp $EXTERNAL_NET any -> ",
        "multine\ntest\ntest  1   2   3",
    };

    for (String text : texts) {
        String[] lines = text.split("\r?\n|\r");
        String lastLine = lines[lines.length - 1];

        String[] tokens = lastLine.split("\\s+", -1);
        for (String token : tokens) {
            System.out.print("[" + token + "]");
        }

        int pos = (tokens.length <= 7) ? tokens.length : -1;
        System.out.println(" = " + pos);
    }

This produces the following output:

[alert][tcp][$EXTERNAL_NET][any][->][$HOME_NET][22][] = -1
[alert][tcp][192.168.1.1][any] = 4
[alert][tcp] = 2
[alert][tcp][] = 3
[alert][tcp][$EXTERNAL_NET][any][->][] = 6
[test][1][2][3] = 4


The code provided by polygenelubricants and helios works, to a certain extent. It addresses the problem I stated, but not for multiple lines. helios's code is more straightforward.

However, neither answer handles pressing Enter in the JTextPane: the old count is still returned instead of 1, because split() returns the text as one sentence instead of two (called without a limit argument, split() drops trailing empty strings).

E.g. alert tcp <enter is pressed> should return 1, since it is a new sentence, but both algorithms returned 2. Also, if I select everything and delete it, both algorithms throw a NullPointerException, as there is no string to split.

I added one line, and it solved the problems mentioned above:

public void determineRuleActionRegion(String str) {
    //remove repetitive spaces and concat $ for new line indicator
    str = str.trim().replaceAll(" +", " ") + "$";
    String[] lines = str.split("\r?\n|\r");
    String lastLine = lines[lines.length - 1];
    String[] tokens = lastLine.split("\\s+", -1);
    int pos = (tokens.length <= 7) ? tokens.length : -1;
    triggerSuggestionList(pos);
    System.out.println("Current pos: " + pos);
    return;
} //end of determineactionRegion()

With that, when split() parses str, the "$" creates another line, which is always the last line, so the count goes back to one. Also, there is no NullPointerException, as the "$" is always there.

However, without the help of polygenelubricants and helios, I don't think I will be able to figure it out so soon. Thanks guys!

EDIT: Okay... apparently split("\r?\n|\r", -1) works the same. The question is, should I accept polygenelubricants's answer or my own? Hmm.

2nd EDIT: One bad thing about concatenating "$" to the end of str: lastLine.endsWith(" ") then returns false. So the complete solution has to use split("\r?\n|\r", -1) together with lastLine.endsWith(" ").
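
Putting the two edits together, a minimal sketch of the approach could look like the following. The class name RegionCounter, the method countRegion and the constant MAX_ARGS are made up for illustration; the limit of 7 is taken from the examples in the question. Leading whitespace is stripped on the assumption that indentation should not count as an argument. Because both split() calls use a limit of -1, a trailing space or a trailing newline already produces an empty last token, so this sketch does not need a separate endsWith(" ") check.

```java
class RegionCounter {
    // Assumption: at most 7 arguments, as in the question's examples.
    static final int MAX_ARGS = 7;

    // Returns the 1-based region the caret is in, or -1 past MAX_ARGS.
    static int countRegion(String text) {
        if (text == null) {
            return 1; // nothing typed yet: first region
        }
        // Limit -1 keeps trailing empty strings, so a trailing newline
        // yields an empty last line instead of being dropped.
        String[] lines = text.split("\r?\n|\r", -1);
        String lastLine = lines[lines.length - 1];

        // Strip leading whitespace so indentation does not create
        // an empty first token.
        String[] tokens = lastLine.replaceAll("^\\s+", "").split("\\s+", -1);
        return tokens.length <= MAX_ARGS ? tokens.length : -1;
    }

    public static void main(String[] args) {
        System.out.println(countRegion("alert tcp 192.168.1.1 any")); // 4
        System.out.println(countRegion("alert tcp "));                // 3
        System.out.println(countRegion("alert tcp \n"));              // 1
        System.out.println(countRegion(""));                          // 1
    }
}
```

An empty string still splits into a single empty token, so deleting all the text falls back to region 1 rather than failing on an empty array.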


What about this: get last line, count what's between spaces...

String text = ...
String[] lines = text.split("\n"); // or \r\n depending on how you get the string
String lastLine = lines[lines.length-1];
StringTokenizer tokenizer = new StringTokenizer(lastLine, " ");
// note that StringTokenizer skips empty tokens, i.e. whatever is between two consecutive spaces
int count = 0;
while (tokenizer.hasMoreTokens()) {
  tokenizer.nextToken();
  count++;
}
return count;

Edit: you could also check for a trailing space (lastLine.endsWith(" ")) to tell whether a new word is being started. It's a basic approach for you to build on :)
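
As a rough illustration of that trailing-space check, here is one way the tokenizer version might be extended. The class and method names (TokenizerCounter, countWords) are hypothetical, and treating an empty line as region 1 is an assumption based on the question's requirement that a fresh line start at 1.

```java
import java.util.StringTokenizer;

class TokenizerCounter {
    // Counts the words on the last line; a trailing space (or an empty
    // line) means the user is about to start the next word.
    static int countWords(String text) {
        // Limit -1 keeps a trailing empty line after a final newline.
        String[] lines = text.split("\n", -1);
        String lastLine = lines[lines.length - 1];

        // StringTokenizer skips empty tokens, so repeated spaces are harmless.
        StringTokenizer tokenizer = new StringTokenizer(lastLine, " ");
        int count = tokenizer.countTokens();

        // Assumption: an empty line is region 1, and a trailing space
        // moves the caret on to the next region.
        if (count == 0 || lastLine.endsWith(" ")) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("alert tcp"));    // 2
        System.out.println(countWords("alert tcp "));   // 3
        System.out.println(countWords("alert tcp \n")); // 1
    }
}
```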


Is the sample line representative? Is this an editor for some rule-based language (ACLs)?

How about going for a full information extraction/named entity recognition solution, one that can recognize entities (keywords, IP addresses, etc.)? You don't have to write everything from scratch; there are existing tools and libraries.

UPDATE: Here's a piece of Snort code that I believe does the parsing:

Function ParseRule()
if (*args == '(') {
   // "Preprocessor Rule detected"

} else {
    /* proto ip port dir ip port r*/
    toks = mSplit(args, " \t", 7, &num_toks, '\\');

    /* A rule might not have rule options */
    if (num_toks < 6) {
        ParseError("Bad rule in rules file: %s", args);
    }
..
 }
 otn = ParseRuleOptions(sc, rtn, roptions, rule_type, protocol);
..

mSplit is defined in mstring.c, a function to split a string into tokens.

In your case, ParseRuleOptions should return one for the whole string inside the brackets, I guess.

UPDATE 2: By the way, is your first example correct, given that in Snort you can add options to rules? For example, this is a valid rule in the middle of being written (options section not completed):

alert tcp any any -> 192.168.1.0/24 111 (content:"|00 01 86 a5|"; <caret>

In some cases you can have either 6 or 7 'words', so your algorithm should have a bit more knowledge, right?
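
One possible way to give the counter that extra knowledge, sketched in Java to match the question: treat everything from the first opening parenthesis onwards as a single "options" region. This is only an assumption about how the editor might want to behave, not how Snort itself numbers anything; the names OptionsAwareCounter and region are hypothetical.

```java
class OptionsAwareCounter {
    // Returns the 1-based region for a single rule line. Everything after
    // the first '(' is treated as one region (assumed behaviour).
    static int region(String line) {
        int paren = line.indexOf('(');
        if (paren < 0) {
            // No options section yet: count whitespace-separated tokens,
            // keeping a trailing empty token for a trailing space.
            return line.replaceAll("^\\s+", "").split("\\s+", -1).length;
        }
        // Header is the part before '('; the options section counts as one more.
        String header = line.substring(0, paren).trim();
        int headerTokens = header.isEmpty() ? 0 : header.split("\\s+").length;
        return headerTokens + 1;
    }

    public static void main(String[] args) {
        String rule = "alert tcp any any -> 192.168.1.0/24 111 (content:\"|00 01 86 a5|\"; ";
        System.out.println(region(rule));         // 8: seven header words plus options
        System.out.println(region("alert tcp ")); // 3
    }
}
```

With this shape, spaces typed inside the parentheses no longer advance the region, which is the case the plain word count gets wrong.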
