Let's say I have the following content:
Lorem Ipsum is simply dummy text of the printing and typesetting i开发者_如何转开发ndustry.
How do I search for dummy
or dummy text
in that string using C? Is there any easy way to do it or only with strong string manipulation? All I need is to search for it and return a Boolean with the result.
EDIT:
You guys created a big discussion around this topic and suggested a few algorithms and I don't mind that cause this might be useful for someone else, or even me in the future. But what I really wanted was the most easy way to do it, no matter the time/space complexity. That doesn't really matter for what I'm doing. So,strstr
easily and quickly fixed my problem. I really have to get me some standard C functions chet sheet.The standard library function for this is strstr:
char *strstr(const char *haystack, const char *needle);
It returns a pointer into the string where the match was found, or NULL if it wasn't - so if all you need is a boolean, just test the return value (if (strstr(...))
.
There's an extensive discussion of a large number of string searching algorithms at http://www-igm.univ-mlv.fr/~lecroq/string/, with illustrative C code and references.
There's a discussion in one set of comments about the costs of the algorithms. One of the points to bear in mind is that if you can amortize the cost of setup over many invocations of the search function, then the high-performance algorithms can give you enormous benefit. If you are going to be searching for different strings all the time, it is harder to win out.
I've got a version of the KMP (Knuth-Morris-Pratt) algorithm packaged for multiple reuse of the same search string. The header is:
/*
@(#)File: $RCSfile: kmp.h,v $
@(#)Version: $Revision: 1.4 $
@(#)Last changed: $Date: 2008/02/02 05:49:34 $
@(#)Purpose: Knuth-Morris-Pratt Search Algorithm
@(#)Author: J Leffler
@(#)Copyright: (C) JLSS 2005,2008
@(#)Product: :PRODUCT:
*/
#ifndef KMP_H
#define KMP_H
#include <stddef.h> /* size_t */
typedef struct kmp_control kmp_control;
/*
** To set up a search (to repeatedly look for the same search string in
** multiple scan strings), use kmp_setsearch(). To start a search on a
** new scan string, use kmp_settarget(). To find the next match of a
** given search string in a given target string, use kmp_search(). Note
** that kmp_setsearch() and kmp_settarget() do not copy the data in the
** source and target strings; the pointers must remain valid You can
** copy kmp_control structures for reuse if desired.
*/
typedef void *(*kmp_malloc)(size_t nbytes);
typedef void (*kmp_free)(void *data);
extern kmp_control *kmp_setsearch(const char *search, size_t schlen);
extern void kmp_settarget(kmp_control *ctrl, const char *target, size_t tgtlen);
extern const char *kmp_search(kmp_control *ctrl);
extern void kmp_release(kmp_control *ctrl);
extern void kmp_setalloc(kmp_malloc mem_alloc, kmp_free mem_free);
#endif /* KMP_H */
Being able to specify memory allocation functions is a tad unusual - but my code often works in an environment where memory allocation is not done via the standard malloc()
and so on, and you must be able to switch the memory allocator on demand. You can ignore the two typedefs and the corresponding function; the default settings are, of course, to use malloc()
and free()
.
The basic KMP algorithm code came from the site above - but was modified to allow me to set the search string once and then search multiple target strings, etc. Contact me (see my profile) for the source code. I have got a similar structure for Boyer-Moore code too (same original source), and also a case-insensitive Boyer-Moore code.
There's a good war story about strstr()
and performance in Kernighan and Pike's excellent book "The Practice of Programming".
I did some experimentation - using a copy of the King James Bible (4.8 MB) as the plain text, and memory mapping that. For many searches, the (MacOS X 10.6.2 / BSD) strstr()
was faster than either KMP or BM. When the strings grew long enough (12+ characters, approximately), then the BM algorithm finally outpaced strstr()
. The KMP algorithm always seemed to be much slower.
Morals?
- It is hard to out-pace a good library.
- KMP is much slower than BM on plausible English language strings.
And the infrastructure I put in place around the algorithms may be too heavy - but the alternative in the original code is a callback mechanism, which presents some problems for determining the context of matches.
You can use the strstr function if you want something simple and your strings aren't too long. If your strings are very long however, consider the KMP algorithm as it is a lot more efficient.
I don't really like the Wikipedia article, as the implementation there looks a bit weird to me (although it's probably correct), and it's also misleading about KMP's performance. I prefer the implementation and description given here and on other sites returned by a Google search for "KMP algorithm".
I've always liked Boyer-Moore, myself. It is O(n), but must be setup (i.e., two tables must be precomputed.) Thus it is good if a lot of text is to be searched, or the search strings are known in advance, thus making up for the cost of building the tables. It is also best for 8-bit ASCII.
[http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm]
(BTW, is there a Unicode flavor of strstr()?)
I would use strstr (also here).
I am not about the use of word "partial" in the question. The argument ("dummy" or "dummy text") has to be fully matched, right?
精彩评论