C# Removing separator characters from quoted strings

Developer https://www.devze.com 2023-01-25 13:47 Source: Internet

I'm writing a program that has to remove separator characters from quoted strings in text files.

For example:

"Hello, my name is world"

Has to be:

"Hello my name is world"

This sounds quite easy at first (I thought it would be), but you need to detect when the quote starts, when the quote ends, then search that specific string for separator characters. How?

I've experimented with some regexes but I just keep getting myself confused!

Any ideas? Even just something to get the ball rolling, I'm just completely stumped.


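// Requires: using System.Text.RegularExpressions;
// textToSearch is the input text. Note that Match returns only the first quoted string.
string pattern = "\"([^\"]+)\"";
string value = Regex.Match(textToSearch, pattern).Value;

string[] removalCharacters = {",", ";"}; // or any other characters
foreach (string character in removalCharacters)
{
    value = value.Replace(character, "");
}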


Why not try doing it with LINQ?

var x = @" this is a great whatever ""Hello, my name is world"" and all that";

var result = string.Join(@"""", x.Split('"')
    .Select((val, index) => index % 2 == 1 ? val.Replace(",", "") : val)
    .ToArray());


You can use a regex with a look-ahead. The pattern would be: "\"(?=[^\"]+,)[^\"]+\""

The \" matches the opening double-quote. The look-ahead (?=[^\"]+,) will try to match a comma within the quoted text. Next we match the rest of the string as long as it's not a double-quote [^\"]+, then we match the closing double-quote \".

Using Regex.Replace allows for a compact approach to altering the result and removing the unwanted commas.

string input = "\"Hello, my name, is world\"";
string pattern = "\"(?=[^\"]+,)[^\"]+\"";
string result = Regex.Replace(input, pattern, m => m.Value.Replace(",", ""));
Console.WriteLine(result);
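
If more than one separator character has to go, the same Regex.Replace idea can be extended with a second regex for the separators. This is only a minimal sketch: the class name QuotedSeparatorRegex, the method name Strip, and the separator set [,;] are made up for illustration, and it assumes quoted strings never contain escaped quotes.

```csharp
using System.Text.RegularExpressions;

public static class QuotedSeparatorRegex
{
    // Matches one double-quoted run that has no embedded quotes: "..."
    private static readonly Regex Quoted = new Regex("\"[^\"]*\"");

    // Hypothetical separator set; change it to the characters you need to strip.
    private static readonly Regex Separators = new Regex("[,;]");

    public static string Strip(string input)
    {
        // The evaluator runs once per quoted string; text outside quotes is left alone.
        return Quoted.Replace(input, m => Separators.Replace(m.Value, ""));
    }
}
```

Calling QuotedSeparatorRegex.Strip on a line of text should leave commas outside the quotes in place and remove only the ones inside.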


What you want to write is called a "lexer" (or alternatively a "tokenizer"), that reads the input character by character and breaks it up into tokens. That's generally how parsing in a compiler works (as a first step). A lexer will break text up into a stream of tokens (string literal, identifier, "(", etc). The parser then takes those tokens, and uses them to produce a parse tree.

In your case, you only need a lexer. You will have 2 types of tokens "quoted strings", and "everything else".

You then just need to write code to break the input up into tokens. By default something is an "everything else" token. A string token starts when you see a ", and ends when you see the next ". If you are reading source code you may have to deal with things like \" or "" as special cases.

Once you have done that, then you can just iterate over the tokens and do whatever processing you need on the "string" tokens.
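
As a rough illustration of the two-token idea, the scan can be collapsed into a single pass that tracks whether it is currently inside a "quoted string" token. This is a sketch, not a full lexer: QuoteScanner and RemoveSeparatorsInQuotes are hypothetical names, and it assumes quotes are never escaped (no \" or "" handling).

```csharp
using System;
using System.Text;

public static class QuoteScanner
{
    // Copies input to output one character at a time, dropping any
    // separator character that appears between a pair of double quotes.
    public static string RemoveSeparatorsInQuotes(string input, params char[] separators)
    {
        var output = new StringBuilder(input.Length);
        bool insideQuotes = false;

        foreach (char c in input)
        {
            if (c == '"')
            {
                // A quote starts or ends a "quoted string" token.
                insideQuotes = !insideQuotes;
            }
            else if (insideQuotes && Array.IndexOf(separators, c) >= 0)
            {
                // Separator inside a quoted string: skip it.
                continue;
            }
            output.Append(c);
        }
        return output.ToString();
    }
}
```

For example, QuoteScanner.RemoveSeparatorsInQuotes(text, ',', ';') would strip commas and semicolons inside quotes and keep everything else as it was.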


I've had to do something similar in an application I use to translate flat files. This is the approach I took: (just a copy/paste from my application)

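// FieldDelimiter is presumably a char member defined elsewhere in the class (not shown).
// Every field is assumed to be wrapped in double quotes: the two Remove
// calls strip the leading and trailing quote before trimming.
protected virtual string[] delimitCVSBuffer(string inputBuffer) {
    List<string> output       = new List<string>();
    bool insideQuotes         = false;
    StringBuilder fieldBuffer = new StringBuilder();
    foreach (char c in inputBuffer) {
        if (c == FieldDelimiter && !insideQuotes) {
            // Delimiter outside quotes: the current field is complete.
            output.Add(fieldBuffer.Remove(0, 1).Remove(fieldBuffer.Length - 1, 1).ToString().Trim());
            fieldBuffer.Clear();
            continue;
        } else if (c == '\"')
            insideQuotes = !insideQuotes;
        fieldBuffer.Append(c);
    }
    // Flush the last field, which has no trailing delimiter.
    output.Add(fieldBuffer.Remove(0, 1).Remove(fieldBuffer.Length - 1, 1).ToString().Trim());
    return output.ToArray();
}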


So I guess you have some long text with a lot of quotes inside? I would make a method that does something like this:

  1. Run through the string until you encounter the first "
  2. Then take the substring up till the next ", and do a str.Replace(",","") and also replace any other characters that you want to replace.
  3. Then go without replacing until you encounter the next " and continue until the end.
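
The three steps above could be sketched roughly like this with IndexOf. The names QuoteWalker and RemoveCommasInQuotes are invented for the example, and it assumes quotes are not escaped; an unbalanced trailing quote is simply left untouched.

```csharp
using System.Text;

public static class QuoteWalker
{
    public static string RemoveCommasInQuotes(string text)
    {
        var result = new StringBuilder(text.Length);
        int position = 0;

        while (position < text.Length)
        {
            // Step 1: find the next opening quote.
            int open = text.IndexOf('"', position);
            if (open < 0) break;

            // Step 2: find the closing quote that goes with it.
            int close = text.IndexOf('"', open + 1);
            if (close < 0) break; // unbalanced quote: leave the rest as it is

            // Copy the text before the quote unchanged, then the cleaned quoted part.
            result.Append(text, position, open - position);
            result.Append(text.Substring(open, close - open + 1).Replace(",", ""));

            // Step 3: carry on after the closing quote.
            position = close + 1;
        }

        // Whatever is left contains no complete quoted string.
        result.Append(text.Substring(position));
        return result.ToString();
    }
}
```

Other separator characters can be handled by chaining more Replace calls on the quoted substring.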

EDIT

I just got a better idea. What about this:

  string mycompletestring = "This is a string\"containing, a quote\"and some more text";
  string[] splitstring = mycompletestring.Split('"');
  for (int i = 1; i < splitstring.Length; i += 2) {
    splitstring[i] = splitstring[i].Replace(",", "");
  }
  StringBuilder builder = new StringBuilder();
  foreach (string s in splitstring) {
    builder.Append(s + '"');
  }
  mycompletestring = builder.ToString().Substring(0, builder.ToString().Length - 1);

I think there should be a better way of combining the string into one with a " between them at the end, but I don't know any better ones, so feel free to suggest a good method here :)


Ok, this is a bit wacky, but it works.

So first off you split your string up into parts, based on the " character:

string msg = "this string should have a comma here,\"but, there should be no comma in this bit\", and there should be a comma back at that and";

var parts = msg.Split('"');

Then you need to join the string back together on the " character, after removing each comma in every other part:

string result = string.Join("\"", RemoveCommaFromEveryOther(parts));

The removal function looks like this:

IEnumerable<string> RemoveCommaFromEveryOther(IEnumerable<string> parts)
{
    using (var partenum = parts.GetEnumerator())
    {
        bool replace = false;
        while (partenum.MoveNext())
        {
            if(replace)
            {
                yield return partenum.Current.Replace(",","");
                replace = false;
            }
            else
            {
                yield return partenum.Current;
                replace = true;
            }
        }
    }
}

This does require that you include a using directive for System.Collections.Generic.


There are many ways to do this. Look at the functions string.Split() and string.IndexOfAny().

You can use string.Split(new char[] {',',' '}, StringSplitOptions.RemoveEmptyEntries) to split the phrase into words, then use the StringBuilder class to put the words together.

Calling string.Replace("[char to remove goes here]", "") multiple times, once for each char you want to remove, will also work.

EDIT:

Call string.Split(new char[] {'\"'}, StringSplitOptions.RemoveEmptyEntries) to split the text at each quote ( " ), then call Replace on the parts that were between quotes, then put the strings together with StringBuilder.
