Is there a more or less reliable way to tell whether the data at some location in memory is the beginning of a processor instruction or some other data?
For example, E8 3F BD 6A 00 may be a call instruction (E8) with a relative offset of 0x6ABD3F, or it might be three bytes of data belonging to some other instruction, followed by push 0 (6A 00).
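To make the ambiguity concrete, here is a minimal Python sketch that reads those five bytes as a 32-bit call rel32. The load address 0x00401000 and the function name decode_call_rel32 are made up for illustration; the point is only that the bytes decode cleanly as a call even though the last two could equally be push 0.

```python
import struct

def decode_call_rel32(code: bytes, address: int):
    """Interpret 5 bytes at `address` as CALL rel32 (32-bit mode assumed).

    Returns the call target, or None if the first byte is not E8.
    """
    if len(code) < 5 or code[0] != 0xE8:
        return None
    # The 4-byte displacement is little-endian and signed.
    (rel,) = struct.unpack_from("<i", code, 1)
    # The displacement is relative to the address of the NEXT instruction.
    return (address + 5 + rel) & 0xFFFFFFFF

raw = bytes.fromhex("E83FBD6A00")
rel = struct.unpack_from("<i", raw, 1)[0]      # 0x6ABD3F
target = decode_call_rel32(raw, 0x00401000)    # hypothetical load address
tail_looks_like_push0 = raw[3:5] == b"\x6A\x00"
```

Nothing in the bytes themselves says which reading is right; that information has to come from outside (control flow, relocations, symbols).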
I know the question sounds silly and there is probably no simple way, but maybe the instruction set was designed with this problem in mind, and maybe some simple code examining ±100 bytes around the location can give an answer that is very likely correct.
I want to know this because I scan a program's code and replace all calls to some function with calls to my replacement. It has worked so far, but it's not impossible that at some point, as I increase the number of functions I'm replacing, some data will look exactly like a function call to that exact address and will be replaced, and this will cause the program to break in a most unexpected fashion. I want to reduce the probability of that.
If it is your code (or someone else's code that retains linking and debug info), the best way is to scan the symbol/relocation tables in the object file. Otherwise there's no reliable way to determine whether some byte is an instruction or data.
Possibly the most effective method to classify bytes is recursive disassembly, i.e. disassembling code from the entry point and from all jump destinations found. But this is not completely reliable, because it does not traverse jump tables (you can try to use some heuristics for this, but they are not completely reliable either).
A solution for your problem would be to patch the function being replaced itself: overwrite its beginning with a jump instruction to your function.
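As a rough illustration of recursive disassembly, here is a toy Python sketch. It only knows a handful of one-byte x86 opcodes (the OPCODES table is a deliberately tiny, hand-picked subset, not a real decoder), and all names are invented for the example. A real tool would need a full instruction decoder with prefixes, ModRM and so on.

```python
import struct

# Toy table: opcode -> (instruction length, control-flow kind).
# Hypothetical subset for illustration only.
OPCODES = {
    0x90: (1, "plain"),   # nop
    0x6A: (2, "plain"),   # push imm8
    0xC3: (1, "ret"),     # ret
    0x74: (2, "jcc8"),    # je rel8
    0xEB: (2, "jmp8"),    # jmp rel8
    0xE9: (5, "jmp32"),   # jmp rel32
    0xE8: (5, "call32"),  # call rel32
}

def recursive_disassemble(code: bytes, entry: int = 0) -> set:
    """Return offsets of instruction starts reachable from `entry`."""
    starts = set()
    work = [entry]
    while work:
        off = work.pop()
        while 0 <= off < len(code) and off not in starts:
            op = code[off]
            if op not in OPCODES:
                break  # unknown opcode: give up on this path
            length, kind = OPCODES[op]
            if off + length > len(code):
                break  # truncated instruction
            starts.add(off)
            nxt = off + length
            if kind in ("jmp8", "jcc8"):
                work.append(nxt + struct.unpack_from("<b", code, off + 1)[0])
            elif kind in ("jmp32", "call32"):
                work.append(nxt + struct.unpack_from("<i", code, off + 1)[0])
            if kind in ("ret", "jmp8", "jmp32"):
                break  # no fall-through after ret or unconditional jmp
            off = nxt
    return starts

# jmp +5 ; five data bytes that look like a call ; push 0 ; ret
code = bytes.fromhex("EB05E83FBD6A006A00C3")
reached = recursive_disassemble(code)
```

In this example the five bytes at offset 2 are never reached, so they are not reported as an instruction even though they start with E8. Targets reached only through a jump table or a function pointer would be missed in the same way, which is the limitation described above.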
Unfortunately, there is no 100% reliable way to distinguish code from data. From the CPU's point of view, code is code only when some jump opcode induces the processor into trying to execute the bytes as if they were code. You could try control flow analysis, beginning with the program entry point and following all possible execution paths, but this may fail in the presence of function pointers.
For your specific problem: I gather that you want to replace an existing function with a replacement of your own. I suggest that you patch the replaced function itself. I.e., instead of locating all calls to the foo() function and replacing them with a call to bar(), just replace the first bytes of foo() with a jump to bar() (a jmp, not a call: you do not want to mess with the stack). This is less satisfactory because of the double jump, but it is reliable.
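A minimal sketch of the patch bytes, assuming 32-bit code and the 5-byte jmp rel32 encoding (E9 followed by a little-endian displacement). The addresses and the helper name make_jmp_rel32 are hypothetical; actually writing the bytes into a process (page protection, other threads executing the function) is left out.

```python
import struct

def make_jmp_rel32(src: int, dst: int) -> bytes:
    """Encode a 32-bit JMP rel32 located at `src` that lands on `dst`."""
    # Displacement is measured from the end of the 5-byte instruction;
    # masking models 32-bit address wrap-around.
    rel = (dst - (src + 5)) & 0xFFFFFFFF
    return b"\xE9" + struct.pack("<I", rel)

def patch_function(image: bytearray, base: int, foo: int, bar: int) -> None:
    """Overwrite the first 5 bytes of foo (inside `image`) with jmp bar."""
    off = foo - base
    image[off:off + 5] = make_jmp_rel32(foo, bar)

# Hypothetical addresses for foo() and bar().
forward = make_jmp_rel32(0x00401000, 0x00402000)
backward = make_jmp_rel32(0x00402000, 0x00401000)

image = bytearray(b"\x90" * 16)          # stand-in for a code section at 0x401000
patch_function(image, 0x00401000, 0x00401004, 0x00402000)
```

Note that the five overwritten bytes may end in the middle of an instruction, which is harmless as long as nothing ever executes the rest of the original foo().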
It is impossible to distinguish data from instructions in general, and this is because of the von Neumann architecture. Analyzing the surrounding code is helpful, and disassembly tools do this. (If you can't use IDA Pro, which is commercial, use another disassembly tool.)
Plain code has a very specific entropy, so it's quite easy to distinguish it from most data. It's a probabilistic approach, but a large enough buffer of plain code can be recognized (especially compiler output, where you can also recognize patterns such as the beginning of a function).
Also, some opcodes are reserved for future use, and others are available only in kernel mode. By knowing them and knowing how to compute instruction lengths (you could try a routine written by Z0mbie for that), you can rule out byte sequences that cannot be valid user-mode code.
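For reference, a small Python sketch of the byte-level Shannon entropy such a heuristic would start from. The function name is made up, and no threshold is given here on purpose: what range counts as "code-like" is an assumption you would have to calibrate on known code and data from your own binaries.

```python
import math
from collections import Counter

def shannon_entropy(buf: bytes) -> float:
    """Shannon entropy of a byte buffer, in bits per byte (0.0 to 8.0)."""
    if not buf:
        return 0.0
    n = len(buf)
    return -sum((c / n) * math.log2(c / n) for c in Counter(buf).values())

constant = shannon_entropy(b"\x00" * 64)          # a single repeated byte value
uniform = shannon_entropy(bytes(range(256)))      # every byte value exactly once
```

Zero-filled padding sits at one extreme and compressed or encrypted data near the other; the claim above is that compiled code tends to fall in a recognizable band in between, given a large enough window.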
Thomas suggests the right idea. To implement it properly, you need to disassemble the first few instructions (the part you would overwrite with the JMP) and generate a simple trampoline function that executes them and then jumps to the rest of the original function.
There are libraries that do this for you. A well-known one is Detours, but it has somewhat awkward licensing conditions. A nice implementation of the same idea with a more permissive license is Mhook.
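A hedged sketch of how the two byte sequences fit together, again assuming 32-bit code. All addresses and names are hypothetical, and the caller is assumed to have already determined (with a length disassembler) a prologue of whole instructions at least 5 bytes long that contains nothing position-dependent such as a relative call or jump.

```python
import struct

def jmp_rel32(src: int, dst: int) -> bytes:
    """Encode a 32-bit JMP rel32 located at `src` that lands on `dst`."""
    return b"\xE9" + struct.pack("<I", (dst - (src + 5)) & 0xFFFFFFFF)

def build_hook(func: int, hook: int, tramp: int, prologue: bytes):
    """Return (patch, trampoline) byte strings.

    patch:      written over the start of the original function
    trampoline: written at `tramp`; runs the saved prologue, then resumes
                the original function just past the patched bytes
    """
    n = len(prologue)
    if n < 5:
        raise ValueError("need at least 5 bytes of whole instructions")
    # Pad with NOPs so no half-instruction is left behind the jump.
    patch = jmp_rel32(func, hook) + b"\x90" * (n - 5)
    trampoline = prologue + jmp_rel32(tramp + n, func + n)
    return patch, trampoline

# push ebp ; mov ebp, esp ; sub esp, 0x10  (6 bytes, a typical prologue)
prologue = bytes.fromhex("558BEC83EC10")
patch, trampoline = build_hook(0x00401000, 0x00500000, 0x00600000, prologue)
```

The replacement function can call the trampoline address whenever it wants the original behaviour. The libraries mentioned here also handle the parts this sketch skips, such as relocating position-dependent instructions and making the memory writable.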