I'm working with strings and I wonder what the best way is to check whether a string contains only characters from a specified set:
@ ∆ SP 0 ¡ P ¿ p
£ _ ! 1 A Q a q
$ Φ " 2 B R b r
¥ Γ # 3 C S c s
è Λ ¤ 4 D T d t
é O % 5 E U e u
ù Π & 6 F V f v
ì Ψ ' 7 G W g w
ò Σ ( 8 H X h x
Ç Θ ) 9 I Y i y
LF Ξ * : J Z j z
Ø 1) + ; K Ä k ä
ø Æ , < L Ö l ö
CR æ q = M Ñ m ñ
Å ß . > N Ü n ü
å É / ? O § o à
I tried to do it with eregi and a regexp, but didn't succeed. Another way is to convert each char to its decimal value and check that it is smaller than 137, or to check each character with in_array(), which I find weak.
Does anyone have a better solution?
Thanks in advance.
I see you've already accepted another answer, but I want to explain why your attempts with regex weren't working. Hopefully it'll help you.
Firstly, I notice ereg in your tags for this question. Please note that PHP's ereg_ functions have been deprecated; you should only use the preg_ functions.
Now, if you want to use regex for this sort of thing, you would typically use a negated character class to define a list of characters you want to allow, and then look for anything else.
A character class is a list of characters enclosed in square brackets. You can negate a character class by adding a caret symbol (^) to the start of it. So if you wanted a string that contained only 'A', 'B' or 'C', and you wanted to get warned about strings which contained anything else, you could use something like this:
$result = preg_match("/[^ABC]/",$mystring);
Your example is basically the same (but with more characters to test, obviously), except for two points: firstly, you have characters in your list which are reserved characters in regex, and secondly, you are using non-ASCII characters.
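The regex reserved characters can be dealt with by escaping them with a leading backslash. You just need to know what characters are reserved. Looking at your list, I see ?, /, . and +. (Inside a character class most of these lose their special meaning, so strictly only the delimiter / has to be escaped here, but escaping the others is harmless.)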
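If you'd rather not track the reserved characters by hand, preg_quote() can do the escaping for you. Here is a minimal sketch, assuming ASCII input; the helper name buildDisallowedPattern is made up for illustration:

```php
<?php
// Hypothetical helper: builds a negated character class from a string of allowed characters.
function buildDisallowedPattern(string $allowed): string
{
    // preg_quote() escapes regex metacharacters; passing '/' also escapes the delimiter.
    $class = preg_quote($allowed, '/');
    return '/[^' . $class . ']/';
}

$pattern = buildDisallowedPattern('?/.+ABC');

// preg_match() returns 1 when a character outside the set is found, 0 otherwise.
var_dump(preg_match($pattern, 'A+B?C')); // nothing outside the set
var_dump(preg_match($pattern, 'A+B?D')); // 'D' is outside the set
```

Letting preg_quote() handle the escaping avoids having to remember which characters are special inside and outside a character class.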
The second point explains why you couldn't get it working with ereg, because the ereg functions don't support Unicode. Switch to using the preg functions instead, and you'll have more luck.
You still need to tell the regex engine that you're looking for Unicode characters. This is done by adding the u modifier to the end of the regex string.
So a shortened version of your query might look like this:
$result = preg_match("/[^èΛ¤4DTdt]/u",$mystring);
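It looks like you're including new lines in your list of characters. The multi-line modifier m won't help with that, since it only changes how the ^ and $ anchors behave; instead, add \r and \n to the character class itself, and keep the u modifier.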
For characters which can't be written (or indeed for any character, if it's easier), you can add escape sequences for their Unicode character codes. With the u modifier, use \x{FFFF} where FFFF is the hex Unicode reference for the character you want to match, e.g. \x{00E0} matches à. (PHP's preg functions do not accept the \uFFFF form.)
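Putting those pieces together, here is a rough sketch that uses PCRE's \x{...} escape syntax together with the u modifier. The function name and the shortened character class are only illustrative assumptions, not your full list:

```php
<?php
// Hypothetical helper: true when $subject contains only characters from the class below.
function containsOnlyAllowed(string $subject): bool
{
    // Single quotes keep the backslashes intact so PCRE sees the escapes itself.
    // \x{00E0} is à, \x{00E8} is è, \x{039B} is Λ; \r and \n cover CR and LF.
    $disallowed = '/[^A-Za-z0-9 \r\n\x{00E0}\x{00E8}\x{039B}]/u';

    // preg_match() returns 0 when no disallowed character is found.
    // It returns false for invalid UTF-8 input, which is treated as a failure here.
    return preg_match($disallowed, $subject) === 0;
}

var_dump(containsOnlyAllowed("abc \u{00E0}\n")); // only allowed characters
var_dump(containsOnlyAllowed("abc \u{00F1}"));   // ñ is not in this shortened class
```

The strict comparison with 0 matters: with the u modifier, preg_match() returns false rather than 0 when the subject is not valid UTF-8.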
I hope that gives you a better insight into regular expressions. I should add that I'm not saying that regex is necessarily the best solution to this question, nor necessarily the only solution. I have tried to make it perform optimally by using the negated character class (which means it'll fail as soon as it finds a non-matching character, and should prevent the kind of excessive backtracking which can cause regular expressions to be quite slow sometimes), so it should be reasonably performant, but I haven't tested it against other solutions.
I hope that helps.
If you're only concerned with single-byte charsets, you can do it with a string function:
$charset = 'abc';
$test = 'abcd';
$ofCharset = strlen($test) === strspn($test, $charset); # FALSE
Otherwise you must split your string into an array with one character per entry and then compare each entry against a character table, which could be a keyed array with the characters of the charset as keys.
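A minimal sketch of that multi-byte variant, assuming both strings are UTF-8 (the function name isOfCharset is hypothetical):

```php
<?php
// Hypothetical helper: checks a UTF-8 string against a UTF-8 list of allowed characters.
function isOfCharset(string $test, string $charset): bool
{
    // Split both strings into one-character entries; the u modifier makes the split UTF-8 aware.
    $allowedChars = preg_split('//u', $charset, -1, PREG_SPLIT_NO_EMPTY);
    $testChars    = preg_split('//u', $test, -1, PREG_SPLIT_NO_EMPTY);
    if ($allowedChars === false || $testChars === false) {
        return false; // one of the inputs was not valid UTF-8
    }

    // Character table keyed by character, so each lookup is a plain isset().
    $table = array_fill_keys($allowedChars, true);
    foreach ($testChars as $char) {
        if (!isset($table[$char])) {
            return false; // found a character outside the charset
        }
    }
    return true;
}

var_dump(isOfCharset('èdt', 'èΛ¤4DTdt'));  // every character is in the charset
var_dump(isOfCharset('èdtx', 'èΛ¤4DTdt')); // 'x' is not in the charset
```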
To keep the operation O(n) you could compute the ASCII value of each of your test characters and place them into a hash table like so:
$testChars[$ascii] = true;
Then just loop through the subject string's characters and test if the hash table value entry is set and equates to true. If you get false for any of the characters then it contains characters not in your test set.
This would be better than using in_array because testing if $testChars[$ascii] == true is a constant O(1) lookup.
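As a rough sketch of the whole loop, assuming single-byte strings (the function name containsOnlyBytes is made up; $testChars is the table described above):

```php
<?php
// Hypothetical helper: works byte by byte, so it only suits single-byte charsets.
function containsOnlyBytes(string $subject, string $allowed): bool
{
    // Build the hash table once: one entry per allowed byte value.
    $testChars = [];
    for ($i = 0, $n = strlen($allowed); $i < $n; $i++) {
        $testChars[ord($allowed[$i])] = true;
    }

    // One constant-time lookup per character of the subject string.
    for ($i = 0, $n = strlen($subject); $i < $n; $i++) {
        if (!isset($testChars[ord($subject[$i])])) {
            return false; // this byte is not in the test set
        }
    }
    return true;
}

var_dump(containsOnlyBytes('abcabc', 'abc')); // all bytes are in the set
var_dump(containsOnlyBytes('abcd', 'abc'));   // 'd' is not in the set
```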
Here's a great resource that might help you find your answer.
Advanced Regular Expression Tips and Techniques
If you're only trying to find out whether there are other characters, you could just str_replace the character set with "" and then get the strlen of the result. If it is 0, then only those characters are there; if it is greater than 0, then other characters exist.
Example:
$mystr = "macguffin";
$mycharset = array('m', 'a', 'c', 'g', 'u', 'f', 'i', 'n');
$tmpstr = str_replace($mycharset, "", $mystr);
if (!strlen($tmpstr)) {
    echo "only charset chars";
} else {
    echo "other chars";
}
would return
only charset chars
but
$mystr = "macguffin";
$mycharset = array('m', 'a', 'c');
$tmpstr = str_replace($mycharset, "", $mystr);
if (!strlen($tmpstr)) {
    echo "only charset chars";
} else {
    echo "other chars";
}
would return
other chars
HTH
I know this is an old question, but no one has mentioned strpbrk. I've never tried it with odd characters, but aside from that possibly being an issue, why wouldn't this work?
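One possible way to apply it: strpbrk() looks for any character from a given list, so to check for "only allowed characters" you would hand it the characters that are not allowed. This is only a sketch for single-byte strings, and the helper name is made up:

```php
<?php
// Hypothetical helper: single-byte only, since strpbrk() compares individual bytes.
function containsOnlyViaStrpbrk(string $subject, string $allowed): bool
{
    // Build the complement: every byte value that is NOT in the allowed set.
    $disallowed = '';
    for ($byte = 0; $byte < 256; $byte++) {
        $char = chr($byte);
        if (strpos($allowed, $char) === false) {
            $disallowed .= $char;
        }
    }

    // If every byte value is allowed, nothing can be rejected
    // (and strpbrk() does not accept an empty character list).
    if ($disallowed === '') {
        return true;
    }

    // strpbrk() returns false when none of the listed characters occur in $subject.
    return strpbrk($subject, $disallowed) === false;
}

var_dump(containsOnlyViaStrpbrk('macguffin', 'macgufin')); // only allowed characters
var_dump(containsOnlyViaStrpbrk('macguffin', 'mac'));      // contains other characters
```

For multi-byte characters this would not be reliable, because a byte of one character can match a byte of another.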