开发者

C++ replace multiple strings in a string in a single pass

开发者 https://www.devze.com 2023-01-19 20:01 出处:网络
Given the following string, \"Hi ~+ and ^*. Is ^* still flying around ~+?\" I want to replace all occurrences of \"~+\" and \"^*\" with \"Bobby\" and \"Danny\", so the string becomes:

Given the following string, "Hi ~+ and ^*. Is ^* still flying around ~+?"

I want to replace all occurrences of "~+" and "^*" with "Bobby" and "Danny", so the string becomes:

"Hi Bobby and Danny. Is Danny still fl开发者_如何学Cying around Bobby?"

I would prefer not to have to call Boost replace function twice to replace the occurrences of the two different values.


I managed to implement the required replacement function using Boost.Iostreams. Specifically, the method I used was a filtering stream using regular expression to match what to replace. I am not sure about the performance on gigabyte sized files. You will need to test it of course. Anyway, here's the code:

#include <boost/regex.hpp>
#include <boost/iostreams/filter/regex.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <iostream>

int main()
{
   using namespace boost::iostreams;

   regex_filter filter1(boost::regex("~\\+"), "Bobby");
   regex_filter filter2(boost::regex("\\^\\*"), "Danny");

   filtering_ostream out;
   out.push(filter1);
   out.push(filter2);
   out.push(std::cout);

   out << "Hi ~+ and ^*. Is ^* still flying around ~+?" << std::endl;

   // for file conversion, use this line instead:
   //out << std::cin.rdbuf();
}

The above prints "Hi Bobby and Danny. Is Danny still flying around Bobby?" when run, just like expected.

It would be interesting to see the performance results, if you decide to measure it.

Daniel

Edit: I just realized that regex_filter needs to read the entire character sequence into memory, making it pretty useless for gigabyte-sized inputs. Oh well...


I did notice it's been a year since this was active, but for what it's worth. I came across an article on CodeProject today that claims to solve this problem - maybe you can use ideas from there:

I can't vouch for its correctness, but might be worth taking a look at. :)

The implementation surely requires holding the entire string in memory, but you can easily work around that (as with any other implementation that performs the replacements) as long as you can split the input into blocks and guarantee that you never split at a position that is inside a symbol to be replaced. (One easy way to do that in your case is to split at a position where the next char isn't any of the chars used in a symbol.)

--

There is a reason beyond performance (though that is a sufficient reason in my book) to add a "ReplaceMultiple" method to one's string library: Simply doing the replace operation N times is NOT correct in general.

If the values that are substituted for the symbols are not constrained, values can end up being treated as symbols in subsequent replace operations. (There could be situations where you'd actually want this, but there are definitely cases where you don't. Using strange-looking symbols reduces the severity of the problem, but doesn't solve it, and "is ugly" because the strings to be formatted may be user-defineable - and so should not require exotic characters.)

However, I suspect there is a good reason why I can't easily find a general multi-replace implementation. A "ReplaceMultiple" operation simply isn't (obviously) well-defined in general.

To see this, consider what it might mean to "replace 'aa' with '!' and 'baa' with '?' in the string 'abaa'"? Is the result 'ab!' or 'a?' - or is such a replacement illegal?

One could require symbols to be "prefix-free", but in many cases that'd be unacceptable. Say I want to use this to format some template text. And say my template is for code. I want to replace "§table" with a database table name known only at runtime. It'd be annoying if I now couldn't use "§t" in the same template. The templated script could be something completely generic, and lo-and-behold, one day I encounter the client that actually made use of "§" in his table names... potentially making my template library rather less useful.

A perhaps better solution would be to use a recursive-descent parser instead of simply replacing literals. :)


Very late answer but none of answers so far give a solution.

With a bit of Boost Spirit Qi you can do this substitution in one pass, with extremely high efficiency.

#include <iostream>
#include <string>
#include <string_view>
#include <map>
#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/adapted.hpp>

namespace bsq = boost::spirit::qi;

using SUBSTITUTION_MAP = std::map<std::string, std::string>;//,std::string>;

template <typename InputIterator>
struct replace_grammar
    : bsq::grammar<InputIterator, std::string()>
{
    replace_grammar(const SUBSTITUTION_MAP& substitution_items)
        : replace_grammar::base_type(main_rule)
        {
            for(const auto& [key, value] : substitution_items) {
                replace_items.add(key,value);
            }
            
            main_rule = *( replace_items [( [](const auto &val,  auto& context) {
                auto& res = boost::fusion::at_c<0>(context.attributes);
                res += val;  })]
                |
                bsq::char_ 
                [( [](const auto &val,  auto& context) {
                    auto& res = boost::fusion::at_c<0>(context.attributes);
                    res += val;  })] );
        }

private :
    bsq::symbols<char, std::string> replace_items;
    bsq::rule<InputIterator, std::string()> main_rule;

};


std::string replace_items(std::string_view input, const SUBSTITUTION_MAP& substitution_items)
{
    std::string result;
    result.reserve(input.size());
  
    using iterator_type = std::string_view::const_iterator;
    
    const replace_grammar<iterator_type> p(substitution_items);

    if (!bsq::parse(input.begin(), input.end(), p, result))
        throw std::logic_error("should not happen");

    return result;

}

int main()
{
    std::cout << replace_items("Hi ~+ and ^*. Is ^* still flying around ~+?",{{"~+", "Bobby"} , { "^*", "Danny"}});
}

The qi::symbol is essentially doing the job you ask for , i.e searching the given keys and replace with the given values. https://www.boost.org/doc/libs/1_79_0/libs/spirit/doc/html/spirit/qi/reference/string/symbols.html

As said in the doc it builds behind the scene a Ternary Search Tree, which means that it is more efficient that searching n times the string for each key.


Boost string_algo does have a replace_all function. You could use that.


I suggest using the Boost Format library. Instead of ~+ and ^* you then use %1% and %2% and so on, a bit more systematically.

Example from the docs:

cout << boost::format("writing %1%,  x=%2% : %3%-th try") % "toto" % 40.23 % 50; 
     // prints "writing toto,  x=40.230 : 50-th try"

Cheers & hth.,

– Alf


I would suggest using std::map. So you have a set of replacements, so do:

std::map<std::string,std::string> replace;
replace["~+"]=Bobby;
replace["^*"]=Danny;

Then you could put the string into a vector of strings and check to see if each string occurs in the map and if it does replace it, you'd also need to take off any punctuation marks from the end. Or add those to the replacements. You could then do it in one loop. I'm not sure if this is really more efficient or useful than boost though.

0

精彩评论

暂无评论...
验证码 换一张
取 消