I have a ~700 MB binary file (non-text data); what I would like to do is search for a specific pattern of bytes that occurs in random locations throughout the file. e.g. 0x? 0x? 0x55 0x? 0x? 0x55 0x? 0x? 0x55 0x? 0x? 0x55
and so on for 50 or so bytes in sequence. The pattern I'd be searching for is a sequence of two arbitrary bytes followed by 0x55, repeating; in other words, 0x55 occurs as every third byte.
That is, search for tables stored in the file with 0x55 being the delimiter, and then save the data contained in the tables or otherwise manipulate it.
Would the best option be simply to go through the file one byte at a time, looking ahead two bytes to see whether the value is 0x55, and, if it is, looking ahead again and again to confirm that a table exists at that location?
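Something along these lines is what I have in mind (a rough sketch; the file name, the read-everything-at-once approach, and the repetition threshold are just placeholders):

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Candidate check: 0x55 must appear as every third byte for at least
    // `reps` repetitions starting at buf[pos].
    static bool looks_like_table(const std::vector<unsigned char>& buf,
                                 std::size_t pos, int reps)
    {
        for (int i = 0; i < reps; ++i) {
            std::size_t p = pos + 3 * static_cast<std::size_t>(i) + 2; // expected 0x55
            if (p >= buf.size() || buf[p] != 0x55)
                return false;
        }
        return true;
    }

    int main()
    {
        std::FILE* f = std::fopen("data.bin", "rb");
        if (!f) return 1;
        std::fseek(f, 0, SEEK_END);
        long size = std::ftell(f);
        std::fseek(f, 0, SEEK_SET);
        std::vector<unsigned char> buf(static_cast<std::size_t>(size));
        if (std::fread(buf.data(), 1, buf.size(), f) != buf.size()) return 1;
        std::fclose(f);

        for (std::size_t i = 0; i + 2 < buf.size(); ++i)
            if (looks_like_table(buf, i, 16))   // 16 * 3 = 48 bytes, ~"50 or so"
                std::printf("possible table at offset %zu\n", i);
        return 0;
    }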
Load the whole thing? fseek? Buffer chunks, searching those one byte at a time?
What would be the best way of looking through this large file, and finding the pattern, using C or C++?
This sounds like a great job for a regular expression matcher or a deterministic finite automaton. These are high-power tools designed to do just what you're asking, and if you have them at your disposal you shouldn't have much trouble doing this sort of search. In C++, consider looking into the Boost.Regex libraries, which should have all the functionality you need to knock this problem down.
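For instance, a rough sketch with Boost.Regex (the file name and the repetition count are assumptions; the inline (?s) flag lets . match any byte value, which matters for binary data):

    #include <boost/regex.hpp>
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <vector>

    int main()
    {
        // Assumes the whole file fits in memory.
        std::ifstream in("data.bin", std::ios::binary);
        std::vector<char> buf((std::istreambuf_iterator<char>(in)),
                              std::istreambuf_iterator<char>());

        // Sixteen repeats of "any two bytes, then 0x55" covers ~48 bytes.
        boost::regex pattern("(?s)(?:..\\x55){16}");

        boost::cregex_iterator it(buf.data(), buf.data() + buf.size(), pattern);
        boost::cregex_iterator end;
        for (; it != end; ++it)
            std::cout << "match at offset " << it->position() << '\n';
        return 0;
    }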
What ultimately worked for me was a hybrid between the Boyer-Moore-Horspool algorithm (suggested by Jerry Coffin) and my own algorithm based on the structure of the tables and the data being stored.
Basically, the BMH algorithm caught most of what I was looking for: the obvious stuff.
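BMH searches for a fixed byte string, so the needle has to be some fixed fragment that anchors a table; a textbook Horspool implementation (for reference only, not the code I actually ran) looks roughly like this:

    #include <cstddef>
    #include <cstring>

    // Boyer-Moore-Horspool: returns the first offset of `needle` in `hay`,
    // or (size_t)-1 if there is no match.
    std::size_t horspool(const unsigned char* hay, std::size_t hlen,
                         const unsigned char* needle, std::size_t nlen)
    {
        if (nlen == 0 || hlen < nlen)
            return static_cast<std::size_t>(-1);

        // Bad-character table: how far the window may slide, keyed by the
        // byte currently under the last position of the window.
        std::size_t shift[256];
        for (std::size_t i = 0; i < 256; ++i) shift[i] = nlen;
        for (std::size_t i = 0; i + 1 < nlen; ++i) shift[needle[i]] = nlen - 1 - i;

        for (std::size_t pos = 0; pos + nlen <= hlen;
             pos += shift[hay[pos + nlen - 1]])
            if (std::memcmp(hay + pos, needle, nlen) == 0)
                return pos;
        return static_cast<std::size_t>(-1);
    }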
But some tables turned out to have odd formatting, so I had to implement a semi-intelligent search that looked at the data following each 0x55 and figured out whether it was likely to be good data or just random junk.
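The plausibility test depended entirely on what the tables actually held, so any code here is only a hypothetical stand-in; this placeholder version, for example, accepts a candidate only if the two bytes between consecutive 0x55 delimiters are printable ASCII:

    #include <cctype>
    #include <cstddef>

    // Hypothetical filter: require `fields` consecutive well-formed fields,
    // each two printable bytes followed by the 0x55 delimiter. The real
    // heuristic was specific to the data and is not reproduced here.
    bool plausible_table(const unsigned char* p, std::size_t len, int fields)
    {
        for (int i = 0; i < fields; ++i) {
            std::size_t off = 3 * static_cast<std::size_t>(i);
            if (off + 2 >= len) return false;
            if (!std::isprint(p[off]) || !std::isprint(p[off + 1])) return false;
            if (p[off + 2] != 0x55) return false;
        }
        return true;
    }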
Oddly enough, I ended up implementing it in PHP rather than C++, dumping the results straight into a MySQL database for querying. The whole search took five minutes or less, and the results were largely good. I did end up with a lot of junk data, but it caught everything I needed it to and (as far as I'm aware) did not leave any good data behind.
Load the whole thing? fseek? Buffer chunks, searching those one byte at a time?
If you can load the whole thing into memory, you should probably use the memory mapping features provided by your platform. This way, the operating system can decide if it should keep large portions of the file in physical memory (i.e. the system has lots of free RAM at the moment), or if it should work only in smaller chunks.
Of course, this only works if the file fits into your address space; for a 700 MB file that is no problem on a 64-bit system, but it can be tight in a 32-bit process.
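A minimal POSIX sketch of that approach (the file name is a placeholder; on Windows the equivalents are CreateFileMapping/MapViewOfFile):

    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main()
    {
        int fd = open("data.bin", O_RDONLY);
        if (fd < 0) { std::perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { std::perror("fstat"); return 1; }

        // Map the file read-only; the kernel pages it in on demand and can
        // drop clean pages again under memory pressure.
        void* map = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (map == MAP_FAILED) { std::perror("mmap"); return 1; }

        const unsigned char* data = static_cast<const unsigned char*>(map);
        // ... scan data[0 .. st.st_size - 1] for the 0x55 pattern here ...
        (void)data;

        munmap(map, st.st_size);
        close(fd);
        return 0;
    }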