Looking for patterns in binary files

Developer https://www.devze.com 2023-02-15 00:10 Source: Internet

I'm working on a small project in C where I have to parse a binary file of undocumented file format. As I'm quite new to C I have two questions to some more experienced programmers.

The first seems to be an easy one. How do I extract all the strings from the binary file and put them into an array? Basically I am looking for a simple implementation of the strings program in C.

When I open the binary file in any text editor I get a lot of rubbish with some readable strings mixed in. I can extract these strings using the strings utility on the command line. Now I'd like to do something similar in C, like in the pseudocode below:

while (!EOF) {
     if (string found) {
          put it into array[i]
          i++
     }
}
return i;

The second problem is a little bit more complicated and is, I believe, the proper way of achieving the same thing. When I look at the file in a hex editor it's easy to notice some patterns. For example, before each string there is a byte of value 02 (0x02), followed by the length of the string and the string itself. For example, 02 18 52 4F 4F 54 4B 69 57 69 4B 61 4B 69 is a string, where 52 4F 4F 54 4B 69 57 69 4B 61 4B 69 is the string part.

Now the function I'm trying to create would work like this:

while(!EOF) {
     for(i=0; i<buffer_size; ++i) {
          if(buffer[i] hex value == 02) {
               int n = read the next byte;
               string = read the next n bytes as char;
               put string into array;
          }
     }
}

Thanks for any pointers. :)

The first seems to be an easy one. How do I extract all the strings from the binary file and put them into an array?

Decide which byte range counts as printable ASCII (typically 0x20 through 0x7E). Iterate over the file, checking whether each byte is printable and counting the length of each run of adjacent printable bytes. By default, strings treats runs of four or more printable characters as strings, so when you hit the next non-printable byte, check whether the run reached that minimum length; if it did, output the run. You need some bookkeeping to remember where the current run started and to flush a run that ends at the end of the file.
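
A minimal sketch of that loop, assuming the whole file has already been read into memory (for example with fread). The name extract_strings and its parameters are made up for illustration, and isprint in the default "C" locale stands in for the printable-ASCII check; the real strings utility has more options than this.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy every run of at least min_len printable
   bytes from buf[0..len) into out[] as NUL-terminated strings.
   Returns the number of strings stored (at most max_out).
   The caller is responsible for freeing each out[i]. */
size_t extract_strings(const unsigned char *buf, size_t len,
                       size_t min_len, char **out, size_t max_out)
{
    size_t count = 0;
    size_t start = 0;              /* where the current run began */

    if (min_len == 0)
        min_len = 1;               /* an empty run is never a string */

    for (size_t i = 0; i <= len; ++i) {
        /* Treat the end of the buffer like a non-printable byte so
           that a run ending at EOF is flushed too. */
        int printable = (i < len) && isprint(buf[i]);
        if (printable)
            continue;

        size_t run = i - start;
        if (run >= min_len && count < max_out) {
            char *s = malloc(run + 1);
            if (s == NULL)
                break;             /* out of memory: stop early */
            memcpy(s, buf + start, run);
            s[run] = '\0';
            out[count++] = s;
        }
        start = i + 1;             /* next run starts after this byte */
    }
    return count;
}
```

Calling extract_strings(data, size, 4, out, max) mimics the default minimum of four characters; once max_out strings have been stored, later runs are silently skipped, so size out[] generously or grow it dynamically.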

The second problem is a little bit more complicated and is, I believe, the proper way of achieving the same thing.

Your pseudocode is essentially correct. You can compare buffer[i] directly with an integer (e.g. 2). Reading the next byte is as simple as incrementing i. Make sure you don't overrun the buffer, and make sure the array you're reading the string into is big enough: if the length field is only one byte, the string is at most 255 bytes long, so a 256-byte buffer leaves room for a terminating NUL.
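
Here is a rough sketch of that scan with the bounds checks spelled out. It assumes the layout described in the question (a 0x02 marker, one length byte, then the text) and that the file is already in memory; parse_records and MAX_STR are names invented for this example.

```c
#include <stddef.h>
#include <string.h>

#define MAX_STR 255   /* largest value a one-byte length can hold */

/* Hypothetical helper: scan buf[0..len) for records of the form
   0x02, <length byte>, <length bytes of text>, and copy each text
   into out[], NUL-terminated. Returns the number of records stored. */
size_t parse_records(const unsigned char *buf, size_t len,
                     char out[][MAX_STR + 1], size_t max_out)
{
    size_t count = 0;
    size_t i = 0;

    while (i < len && count < max_out) {
        if (buf[i] != 0x02) {      /* not a marker, keep scanning */
            ++i;
            continue;
        }
        if (i + 1 >= len)          /* marker is the last byte */
            break;

        size_t n = buf[i + 1];     /* declared string length */
        if (n > len - (i + 2)) {   /* record would overrun the buffer */
            ++i;                   /* treat this 0x02 as ordinary data */
            continue;
        }
        memcpy(out[count], buf + i + 2, n);
        out[count][n] = '\0';
        ++count;
        i += 2 + n;                /* skip past the whole record */
    }
    return count;
}
```

Keep in mind that a 0x02 byte can also occur inside unrelated data, so in an undocumented format this may produce false positives; checking that the extracted bytes are printable is one way to filter them.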

I'm not sure your solution will work: what if a string is 350 characters long? A single length byte cannot hold a value above 255. Also, can numbers be part of a string, or do you consider them "rubbish"?

I think the safest way is:

Define what you consider a string and what you consider "rubbish". For instance, are ":,!?" part of a string or "rubbish"?
Define the minimum length for a run to count as a "readable" string.
Parse the file looking for every run of characters with length >= minimum. I know, it's boring, but I think it's the only safe way. Good luck!
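
One way to make those definitions explicit is a 256-entry lookup table, so the policy lives in one place. The sketch below is only an illustration: build_char_table and count_runs are invented names, letters and digits are assumed to always count, and min_len is assumed to be at least 1.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy: letters and digits always count as string
   characters; `extra` lists any punctuation that should count too. */
void build_char_table(bool table[256], const char *extra)
{
    for (int c = 0; c < 256; ++c)
        table[c] = isalnum(c) != 0;
    for (; *extra != '\0'; ++extra)
        table[(unsigned char)*extra] = true;
}

/* Count the runs in buf[0..len) that consist of accepted characters
   and are at least min_len long (min_len is assumed to be >= 1). */
size_t count_runs(const unsigned char *buf, size_t len,
                  const bool table[256], size_t min_len)
{
    size_t count = 0;
    size_t run = 0;                /* length of the current run */

    for (size_t i = 0; i < len; ++i) {
        if (table[buf[i]]) {
            ++run;
        } else {
            if (run >= min_len)
                ++count;
            run = 0;
        }
    }
    if (run >= min_len)            /* run that ends at end of buffer */
        ++count;
    return count;
}
```

Building one table with "" and another with " ,!?" and comparing the results on the same file is a quick way to see how much the definition of "rubbish" changes what you find.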