Robust way to hide bits of text during a find-and-replace?

Source: https://www.devze.com, 2023-04-02 18:24 (reposted from the web)
Let's say I have some text:

<hello> <world> <:how> are <you>

Now I want to HTML-encode it so that the <>s don't mess things up. But <:how> is special because it has a : in it, so I don't want to touch that.

I can replace it using a regex with something like {{how}}, then do the HTML encoding, and then replace it back.

But what if {{something}} already appears somewhere in the code? Then {{something}} would get converted to <:something> when it should have been left as is.

I've encountered this problem a few times in the past, and still haven't found a good way to approach it. Do people just choose something really obscure to replace with and hope that it doesn't exist elsewhere, or is there a proper way to do this?


Using regexps for HTML parsing is bad. But let's consider you just fix a small piece of your own code.

Consider this regexp: <(?!:). It matches any < that is not followed by a :. The lookahead consumes nothing, so only the < itself is part of the match and you can just use a replace string of &lt;.
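As a quick check outside an editor, here is a minimal sketch of the same lookahead in Python's re module. The variable names are only for illustration, and note that it encodes < only, exactly as the regexp above does, so > is left alone.

```python
import re

text = "<hello> <world> <:how> are <you>"

# Negative lookahead: match "<" only when the next character is not ":".
# The lookahead itself consumes nothing, so only "<" gets replaced.
encoded = re.sub(r"<(?!:)", "&lt;", text)

print(encoded)  # &lt;hello> &lt;world> <:how> are &lt;you>
```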

Find the 'use regexps' checkbox in your favorite text editor. (In vi, it's implicitly there, checked.) The expression above only works if your editor supports lookahead in its regexp syntax; most do now.

But your original approach is also workable. If it's impractical to enumerate several complex exclusion patterns in a regexp, you can temporarily replace these patterns with placeholder strings. Just make these strings fairly unique. I bet that if you replaced <: with LESS=THAN=AND=COLON, there's about zero chance that you'd clash with something, or forget what the string stands for. Yes, these temp strings are an eyesore, but that makes the chance that you forget to replace them back pretty slim.
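A rough sketch of that placeholder round trip in Python, under a few assumptions of mine: the special tags look like <:word>, the whole tag (not just the <: prefix) is hidden so its closing > survives too, and html.escape does the encoding. The function name is hypothetical. Instead of merely hoping the placeholder is unique, the sketch refuses input that already contains it.

```python
import html
import re

PLACEHOLDER = "LESS=THAN=AND=COLON"  # the temp string suggested above

def encode_keeping_special(text):
    """HTML-encode text but leave <:name> tags untouched (illustrative helper)."""
    # Fail loudly rather than silently corrupt input that already has the marker.
    if PLACEHOLDER in text:
        raise ValueError("placeholder already occurs in the input")
    # 1. Hide every <:name> tag between two placeholders.
    hidden = re.sub(r"<:(\w+)>", PLACEHOLDER + r"\g<1>" + PLACEHOLDER, text)
    # 2. HTML-encode everything; the placeholder has no special characters.
    escaped = html.escape(hidden)
    # 3. Put the hidden tags back.
    marker = re.escape(PLACEHOLDER)
    return re.sub(marker + r"(\w+?)" + marker, r"<:\g<1>>", escaped)

result = encode_keeping_special("<hello> <world> <:how> are <you>")
print(result)  # &lt;hello&gt; &lt;world&gt; <:how> are &lt;you&gt;
```

The explicit check is the difference from "choose something obscure and hope": a clash becomes an error you see instead of a wrong conversion you don't.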


You could implement an escaping mechanism based on characters that can't survive the encoding process. For instance, if you are HTML-encoding your input, you know that afterwards it won't contain any < or > characters, because they are replaced by HTML entities. So you can build your escape code out of < and >. If you're going to display the final text in a browser, you could use something like <!-- TOKEN --> as the escape code, since a comment doesn't affect rendering.

Your encoding process could look like this:

  • input string:
    • <hello> {{world}} <:how> are <you>
  • replace <xxx> with &lt;xxx&gt; where xxx doesn't start with :
    • &lt;hello&gt; {{world}} <:how> are &lt;you&gt;
  • replace <:xxx> with {{<!-- TOKEN -->xxx}}
    • &lt;hello&gt; {{world}} {{<!-- TOKEN -->how}} are &lt;you&gt;

Displayed in a browser, {{world}} and {{how}} would look the same, but the encoded text preserves the information needed for decoding. Indeed, the corresponding decoding process would be:

  • input string:
    • &lt;hello&gt; {{world}} {{<!-- TOKEN -->how}} are &lt;you&gt;
  • replace {{<!-- TOKEN -->xxx}} with <:xxx>
    • &lt;hello&gt; {{world}} <:how> are &lt;you&gt;
  • replace &lt;xxx&gt; with <xxx>
    • <hello> {{world}} <:how> are <you>

Like I said, since the characters you are basing your escape code on can't appear by themselves in the encoded text, having an input like {{<!-- TOKEN -->how}} wouldn't break the encoding/decoding process, because it would be encoded like {{&lt;!-- TOKEN --&gt;how}} and so reversed correctly.
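The two processes above can be sketched in Python roughly as follows. This is one possible reading, not a definitive implementation: encode and decode are hypothetical names, special tags are assumed to look like <:word>, and the text between tags goes through html.escape / html.unescape (which also handle &) rather than a hand-written <xxx> replacement.

```python
import html
import re

TOKEN = "<!-- TOKEN -->"  # escape code built from characters the encoder removes
SPECIAL = re.compile(r"<:(\w+)>")                                # assumed tag shape
ENCODED = re.compile(r"\{\{" + re.escape(TOKEN) + r"(\w+)\}\}")

def encode(text):
    """HTML-encode text, rewriting <:xxx> as {{<!-- TOKEN -->xxx}}."""
    parts, pos = [], 0
    for m in SPECIAL.finditer(text):
        # Ordinary text between special tags is HTML-encoded.
        parts.append(html.escape(text[pos:m.start()], quote=False))
        parts.append("{{" + TOKEN + m.group(1) + "}}")
        pos = m.end()
    parts.append(html.escape(text[pos:], quote=False))
    return "".join(parts)

def decode(encoded):
    """Reverse encode(): restore <:xxx> tags and un-encode everything else."""
    parts, pos = [], 0
    for m in ENCODED.finditer(encoded):
        parts.append(html.unescape(encoded[pos:m.start()]))
        parts.append("<:" + m.group(1) + ">")
        pos = m.end()
    parts.append(html.unescape(encoded[pos:]))
    return "".join(parts)

original = "<hello> {{world}} <:how> are <you>"
encoded = encode(original)
print(encoded)  # &lt;hello&gt; {{world}} {{<!-- TOKEN -->how}} are &lt;you&gt;

# Input that already contains the escape code is encoded, not mistaken for a tag.
tricky = "{{<!-- TOKEN -->how}}"
print(encode(tricky))  # {{&lt;!-- TOKEN --&gt;how}}
```

Because a raw < can only appear in the encoded text where the encoder put the token, decode(encode(s)) returns s for both inputs shown here.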
