How can I determine differences between different encodings/serializations/etc?

Developer https://www.devze.com 2023-01-26 22:33 Source: web
There are all kinds of decoders for data formats such as Base64, the ASP EventValidation object, XML serialization, and so on. Is there a simple test I can do to tell which one I'm looking at?

For example, I have a string here that is part of a CGI-based web form. It's obviously hex (the full size is 5 KB): 52616e646f6d49567ef61b360522ae5ae69064f0ecb664a831c4196dad319215013aa8d04726b5d54ed673dad2004726c35e66d8b19c5177a331b24988f3cf11871084f6cc9ff808baf5cdee83f031a56dc42b65ee5309f1f1

I have no idea what that is. Hex to ASCII gives me more nonsense like Ra_d__IVo6"Odd1_1/G&?sG&OfQw1I1_eS, and it's obviously not a Base64 string...

The question is basically: is there a method other than looking at the different types, trying each one, and guessing?

Edit: I think this string is encrypted data, based on the prepended 52616e646f6d4956, but my question isn't what the string is. Rather, it's how I can tell these things easily.


You can develop your own heuristic algorithm, similar to a virus scanner. It won't work 100% of the time, but it should improve over time. For example, you could take the string, note that it contains only characters from the hex alphabet, and flag it as possibly encrypted, zipped, or anything else that is commonly represented in hex.
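As a rough sketch of that first step, the Python below flags which text encodings a string could be, based only on its alphabet and length. The function names `candidate_encodings` and `printable_prefix` are made up for illustration, and the rules are deliberately crude.

```python
import re

HEX_RE = re.compile(r"[0-9a-fA-F]+")
BASE64_RE = re.compile(r"[A-Za-z0-9+/]+={0,2}")

def candidate_encodings(text):
    """Return the encodings whose alphabet and length rules the text satisfies."""
    candidates = []
    # Hex needs two digits per byte, so the length must be even.
    if len(text) % 2 == 0 and HEX_RE.fullmatch(text):
        candidates.append("hex")
    # Padded Base64 comes in groups of four characters.
    if len(text) % 4 == 0 and BASE64_RE.fullmatch(text):
        candidates.append("base64")
    return candidates

def printable_prefix(hex_text):
    """Decode hex and return the leading run of printable ASCII, if any."""
    out = []
    for byte in bytes.fromhex(hex_text):
        if not 32 <= byte < 127:
            break
        out.append(chr(byte))
    return "".join(out)

prefix = "52616e646f6d4956"  # the prefix quoted in the question
print(candidate_encodings(prefix))
print(printable_prefix(prefix))
```

The hex alphabet is a subset of the Base64 alphabet, so the prefix is flagged as both: the check narrows the field rather than deciding. The eight prefix bytes decode to the ASCII text "RandomIV", which fits the asker's suspicion that an encrypted payload follows.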

You could extend the heuristic to try N different encodings and perform word counts on each result. This could help narrow down the possible encodings, but even in the simple case of the standard English alphabet there's plenty of overlap across encoding tables, so you will certainly get false positives. Still, as long as the text stays within the overlapping characters, you should get readable content.
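A minimal sketch of the "try N encodings and count words" idea might look like this. The word list, the default encodings, and the name `score_decodings` are all assumptions chosen for the example; a real scorer would want a much larger dictionary.

```python
import re

# Tiny illustrative word list (an assumption, not a real dictionary).
COMMON_WORDS = {"the", "and", "is", "of", "to", "in", "that"}

def score_decodings(data, encodings=("utf-8", "utf-16-le", "utf-16-be", "latin-1", "cp1252")):
    """Decode data with each encoding and count common English words in the result."""
    results = []
    for enc in encodings:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes at all
        words = re.findall(r"[a-z]+", text.lower())
        hits = sum(1 for word in words if word in COMMON_WORDS)
        results.append((hits, enc))
    # Highest word count first.
    return sorted(results, reverse=True)

mystery = "the cat is in the hat".encode("utf-16-le")
print(score_decodings(mystery))
```

For plain ASCII input, `utf-8`, `latin-1`, and `cp1252` all score the same, which is exactly the overlap problem described above.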

As Marc pointed out, not all content is necessarily readable content. Pictures, zip files, and many other kinds of data will turn into pure nonsense when rendered through an encoding table. But even items such as these may contain consistent data that the heuristic can detect.
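One kind of "consistent data" in binary files is the fixed signature many formats start with. The sketch below checks a handful of well-known signatures. The table is far from complete, and `sniff_magic` is a name invented for this example.

```python
import gzip

# A few well-known file signatures ("magic numbers"); far from exhaustive.
MAGIC_NUMBERS = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"PK\x03\x04", "zip"),
    (b"\x1f\x8b", "gzip"),
    (b"%PDF-", "pdf"),
]

def sniff_magic(data):
    """Return a format name if data starts with a known signature, else None."""
    for signature, name in MAGIC_NUMBERS:
        if data.startswith(signature):
            return name
    return None

print(sniff_magic(gzip.compress(b"hello")))
print(sniff_magic(b"plain text"))
```

A match only says what the data claims to be. Running a real parser over it is still the way to confirm.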

This topic can get pretty involved. Just look at the TCP protocol. One doesn't just fire packets across the internet expecting some magical interpretation of data on the client side. There are pre-defined rules (protocols) to define the way and type of data to be transmitted between the client/server. So, to directly answer your question regarding "guessing", you cannot be certain of the data you will receive nor of your interpretation, but you certainly can develop an application that is smarter than a "guess".


In the general case that will be hard. Obviously looking for the right character range helps spot things like base-64, but beyond that you'd need a lot of per-type logic. Anything text-based could itself use any Unicode/code-page encoding, for example.

XML and JSON are probably relatively easy to infer (guess based on the start chars, then try running it through a parser/validator). Of course, HTML that isn't XHTML complicates matters.
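A sketch of that "guess from the first character, then validate with a parser" approach, using only the Python standard library. The function name and the returned labels are assumptions for the example.

```python
import json
import xml.etree.ElementTree as ET

def sniff_structured_text(text):
    """Guess JSON or XML from the first character, then confirm with a parser."""
    stripped = text.lstrip()
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            return None
    if stripped.startswith("<"):
        try:
            ET.fromstring(stripped)
            return "xml"
        except ET.ParseError:
            # Looks like markup but is not well-formed XML, e.g. ordinary HTML.
            return "markup (not well-formed XML)"
    return None

print(sniff_structured_text('{"a": 1}'))
print(sniff_structured_text("<a><b/></a>"))
print(sniff_structured_text("<p>unclosed<br>"))
```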

Binary forms are trickier and more numerous: could it be an image? Sound? Zip? Or a binary data format, protobuf perhaps? Or bespoke?

And what endianness are we in?
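To make the endianness question concrete, the sketch below reads the same four bytes in both byte orders. It then applies one purely hypothetical heuristic: if the format starts with a length field, pick the byte order in which that field matches the number of bytes that follow. Both function names are invented for illustration.

```python
import struct

def read_u32_both_ways(raw):
    """Interpret the first four bytes as an unsigned 32-bit integer in both byte orders."""
    big = struct.unpack_from(">I", raw)[0]
    little = struct.unpack_from("<I", raw)[0]
    return big, little

def plausible_length_prefix(raw):
    """Hypothetical heuristic: which byte order makes the first field equal the remaining length?"""
    big, little = read_u32_both_ways(raw)
    remaining = len(raw) - 4
    return [name for name, value in (("big", big), ("little", little)) if value == remaining]

print(read_u32_both_ways(bytes.fromhex("00000100")))        # (256, 65536)
print(plausible_length_prefix(bytes.fromhex("00000003") + b"abc"))
```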

Then: is the entire payload gzip? Deflate? Encrypted?
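For the compression question, one option is simply to attempt each decompression and see which succeeds. For encryption there is no such test, but byte entropy gives a hint. The sketch below shows both, with `try_decompress` and `shannon_entropy` as illustrative names. Note that high entropy cannot tell compressed data from encrypted data, and short samples give unreliable estimates.

```python
import gzip
import math
import zlib
from collections import Counter

def try_decompress(data):
    """Try gzip, zlib, and raw deflate in turn; return (format_name, payload)."""
    attempts = (
        ("gzip", lambda d: zlib.decompress(d, 16 + zlib.MAX_WBITS)),
        ("zlib", lambda d: zlib.decompress(d)),
        ("raw deflate", lambda d: zlib.decompress(d, -zlib.MAX_WBITS)),
    )
    for name, decompress in attempts:
        try:
            return name, decompress(data)
        except zlib.error:
            continue
    return None, data

def shannon_entropy(data):
    """Estimate entropy in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

name, payload = try_decompress(gzip.compress(b"hello world" * 10))
print(name, len(payload))
print(shannon_entropy(b"aaaa"), shannon_entropy(bytes(range(256))))
```

Values close to 8 bits per byte suggest compressed or encrypted content. Ordinary text and structured binary formats usually score noticeably lower.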

So yes, it can probably be done; Wireshark tries, for example. But it is a lot of work, with no magic shortcuts.
