Removing characters from a PHP String

Source: https://www.devze.com, 2022-12-08 00:18 (reposted from the web)
I'm accepting a string from a feed for display on the screen that may or may not include some rubbish I want to filter out. I don't want to filter normal symbols at all.

The values I want to remove look like this: �

It is only this that I want removed. Relevant technology is PHP.

Suggestions appreciated.


This is an encoding problem; you shouldn't try to clean out those bogus characters, but rather understand why you're receiving them scrambled.

Try to get your data as Unicode, or make an agreement with your feed provider that you both use the same encoding.


Thanks for the responses, guys. Unfortunately, those submitted had the following problems:

This is wrong for obvious reasons (it strips every symbol as well as the rubbish):

ereg_replace("[^A-Za-z0-9]", "", $string);

This:

s/[\u00FF-\uFFFF]//

which also uses the deprecated ereg form of regex, didn't work either when I converted it to preg, because the range was simply too large for the regex to handle. There are also holes in that range that would allow rubbish to seep through.

This suggestion:

This is an encoding problem; you shouldn't try to clean out those bogus characters, but rather understand why you're receiving them scrambled.

while valid, is no good because I don't have any control over how the data I receive is encoded. It comes from an external source. Sometimes there's garbage in there and sometimes there is not.

So, the solution I came up with was relatively dirty, but in the absence of something more robust I'm just accepting all standard letters, numbers and symbols and discarding the rest.

This does seem to work for now. The solution is as follows:

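// Turn the pound and euro signs into HTML entities so they survive the whitelist below.
$fixT = str_replace("£", "&pound;", $string);
$fixT = str_replace("€", "&euro;", $fixT);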
$fixT = preg_replace("/[^a-zA-Z0-9\s\.\/:!\[\]\*\+\-\|\<\>@#\$%\^&\(\)_=\';,'\?\\\{\}`~\"]/", "", $fixT);

If anyone has any better ideas I'm still keen to hear them. Cheers.
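One narrower alternative, offered only as a sketch: if the feed really does contain the literal Unicode replacement character U+FFFD (rather than raw bytes that the browser merely displays as one), that single character can be removed and everything else left alone. The function name stripReplacementChar is hypothetical, and the "\u{...}" string escape assumes PHP 7.0 or later.

```php
<?php
// Hypothetical helper: remove only U+FFFD (the replacement character).
// Assumption: $text is UTF-8, where U+FFFD is the byte sequence EF BF BD.
function stripReplacementChar(string $text): string
{
    // "\u{FFFD}" is expanded by PHP (7.0+) to the UTF-8 bytes of U+FFFD.
    return str_replace("\u{FFFD}", '', $text);
}

// The pound sign is untouched; only the replacement character is dropped.
echo stripReplacementChar("Price: \u{FFFD}\u{00A3}5"), "\n";
```

If the rubbish turns out to be invalid bytes rather than a real U+FFFD, this will not match anything, which is itself a useful diagnostic.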


You are looking for characters that are outside the range of glyphs that your font can display. You can find the maximum Unicode value that your font can display, and then create a regex that replaces anything above that value with an empty string. An example would be

s/[\u00FF-\uFFFF]//

This would strip anything at or above character 255 (U+00FF).
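In PHP's preg functions the \uXXXX notation is not available, so the same range would have to be written with \x{...} and the /u modifier. The following is a minimal sketch under the assumption that the input is valid UTF-8; the function name stripHighCodePoints is hypothetical.

```php
<?php
// Hypothetical helper: drop every code point from U+00FF to U+FFFF.
function stripHighCodePoints(string $text): ?string
{
    // The /u modifier makes PCRE treat both pattern and subject as UTF-8.
    // preg_replace() returns null when $text is not valid UTF-8.
    return preg_replace('/[\x{00FF}-\x{FFFF}]/u', '', $text);
}

// U+00E9 (e with acute) is below the range and is kept; U+20AC (euro) is removed.
var_dump(stripHighCodePoints("caf\u{00E9} \u{20AC}10"));
```

Note that this also removes the euro sign and leaves anything above U+FFFF in place, so the range would need adjusting for a real feed.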


That's going to be difficult for you to do, since you don't have a solid definition of what to filter and what to keep. Typically, characters that show up as empty squares are anything that the typeface you're using doesn't have a glyph for, so the definition of "stuff that shows up like this: �" is horribly inexact.

It would be much better for you to decide exactly which characters are valid (this is always a good approach anyway, with any kind of data cleanup) and discard everything that is not one of those. PHP's filter functions are one way to do this, depending on the level of complexity and robustness you require.


If you can't resolve the issue with the data from the feed and need to filter the information, then this may help:

PHP5's filter_input is very good for filtering input strings and allows a fair amount of flexibility:

filter_input(input_type, variable, filter, options) 

You can also filter all of your form data in one line if it requires the same filtering :)

There are some good examples and more information about it here:

http://www.w3schools.com/PHP/func_filter_input.asp

The PHP site has more information on the options here: Validation Filters
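For a string that is already in a variable (as feed data would be), filter_var is the counterpart of filter_input. As a rough sketch, the FILTER_UNSAFE_RAW filter combined with the strip flags removes every byte outside printable ASCII. The function name stripNonAscii is hypothetical, and this is blunter than what the question asks for, since it also removes £, € and line breaks.

```php
<?php
// Hypothetical helper built on filter_var().
function stripNonAscii(string $text): string
{
    // FILTER_UNSAFE_RAW leaves the string alone except for the flags:
    // FILTER_FLAG_STRIP_LOW removes bytes below 32 (tabs and newlines included),
    // FILTER_FLAG_STRIP_HIGH removes bytes above 127.
    return filter_var(
        $text,
        FILTER_UNSAFE_RAW,
        FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH
    );
}

// Both bytes of the UTF-8 pound sign are above 127, so it disappears.
echo stripNonAscii("Cost: \u{00A3}5"), "\n";
```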


Take a look at this question to get the value of each byte in your string. (This assumes that multibyte overloading is turned off.)

Once you have the bytes, you can use them to determine what these "rubbish" characters actually are. It's possible that they're a result of misinterpreting the encoding of the string, or displaying it in the wrong font, or something else. Post them here and people can help you further.
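A quick way to get at those byte values, sketched here with a hypothetical helper named hexBytes, is to hex-dump the string. bin2hex works on raw bytes, so the result does not depend on what encoding PHP or the browser assumes.

```php
<?php
// Hypothetical helper: show each byte of a string as two hex digits.
function hexBytes(string $text): string
{
    // bin2hex() yields two hex digits per byte; split them into pairs.
    return implode(' ', str_split(bin2hex($text), 2));
}

// A genuine U+FFFD encoded as UTF-8 shows up as three bytes.
echo hexBytes("\u{FFFD}"), "\n"; // ef bf bd
```

If the dump shows ef bf bd, the feed already contains the replacement character; a lone byte such as a3 instead suggests single-byte text (for example Latin-1) being read as UTF-8.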


Try this:

  • Download a sample from the feed manually.
  • Open it in Notepad++ or another advanced text editor (KATE on Linux is good for this).
  • Try changing the encoding and converting from one encoding to another.

If you find a setting that makes the characters display properly, then you'll need to either encode your site in that encoding, or convert it from that encoding to whatever you use on your site.
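Once the feed's real encoding is known, the conversion itself is short. The sketch below assumes the mbstring extension is loaded, that the site uses UTF-8, and that anything which is not valid UTF-8 is Windows-1252; that last guess is purely illustrative and should be replaced by whatever the editor experiment reveals. The function name toUtf8 is hypothetical.

```php
<?php
// Hypothetical helper: normalise feed text to UTF-8.
// Assumption: input that is not valid UTF-8 is in $fallback (Windows-1252 here).
function toUtf8(string $text, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave untouched
    }
    // Reinterpret the raw bytes as $fallback and re-encode them as UTF-8.
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

// "\xA3" is the pound sign in Windows-1252; it comes out as UTF-8.
echo toUtf8("\xA3" . "5"), "\n";
```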


Hello friends,

Try this regular expression to remove Unicode characters from the string:

/\\u[0-9a-fA-F]{4}/

Thanks, Chintu(prajapati.chintu.001@gmail.com)
