What is the correct way to determine the type of a file returned by a web server?

I've always believed that the HTTP Content-Type should correctly identify the contents of a returned resource. I recently noticed a resource from google.com with a filename similar to /extern_chrome/799678fbd1a8a52d.js that was served with these HTTP headers:

HTTP/1.1 200 OK
Expires: Mon, 05 Sep 2011 00:00:00 GMT
Last-Modified: Mon, 07 Sep 2009 00:00:00 GMT
Content-Type: text/html; charset=UTF-8
Date: Tue, 07 Sep 2010 04:30:09 GMT
Server: gws
Cache-Control: private, x-gzip-ok=""
X-XSS-Protection: 1; mode=block
Content-Length: 19933

The content is not HTML, but is pure JavaScript. When I load the resource using a local proxy (Burp Suite), the proxy states that the MIME type is "script".

Is there an accepted method for determining what is returned from a web server? The Content-Type header seems to usually be correct. Extensions are also an indicator, but not always accurate. Is the only accurate method to examine the contents of the file? Is this what web browsers do to determine how to handle the content?

The browser knows it's JavaScript because it reached it via a <script src="..."> tag.

If you typed the URL of a .js file into your browser's address bar, then even if the server did return the correct Content-Type, your browser wouldn't treat the file as JavaScript to be executed. (Instead, you would probably either see the .js source code in your browser window, or be prompted to save it as a file, depending on your browser.)

Browsers don't do anything with JavaScript unless it's referenced by a <script> tag, plain and simple. No content-sniffing is required.

Is the only accurate method to examine the contents of the file?

It's the method browsers use to determine the file type, but it is by no means accurate. The fact that it isn't accurate is a security concern.

The only method available to the server to indicate the file type is via the Content-Type HTTP header. Unfortunately, in the past, not many servers set the correct value for this header. So browsers decided to play smart and tried to figure out the file type using their own proprietary algorithms.
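As a small illustration of reading that header, here is a minimal Python sketch that splits a Content-Type value into its media type and charset. The function name parse_content_type is made up for this example; it leans on the standard library's email.message.Message, which understands this header syntax. It only tells you what the server claims, not what the body actually contains.

```python
from email.message import Message


def parse_content_type(header_value):
    """Return (media_type, charset) as declared by a Content-Type header value.

    parse_content_type is a hypothetical helper name for this sketch.
    charset is None when the header does not declare one.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    media_type = msg.get_content_type()   # lowercased "type/subtype"
    charset = msg.get_param("charset")    # None if the parameter is absent
    return media_type, charset


# The header from the question: the server claims HTML even though the body is JavaScript.
print(parse_content_type("text/html; charset=UTF-8"))
```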

The "guesswork" done by browsers is called content-sniffing. The best resource to understand content-sniffing is the Browser Security Handbook. Another great resource is this paper, whose suggestions have now been incorporated into Google Chrome and IE8.

How do I determine the correct file type?

If you are just dealing with a small, known list of servers, simply ask them to set the right Content-Type header and use it. But if you are dealing with websites in the wild that you have no control over, you will likely have to develop some kind of content-sniffing algorithm.

For text files, such as JavaScript, CSS, and HTML, the browser will attempt to parse the file. If that parsing fails before anything can get parsed, then it is considered completely invalid. Otherwise, as much as possible is kept and used. For JavaScript, it probably needs to syntactically compile everything.

For binary files, such as Flash, PNG, JPEG, or WAVE files, browsers could use a library such as the magic library. The magic library determines the MIME type of a file from the content of the file, which is really the only part that's trustworthy.
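To make the idea concrete, here is a rough Python sketch of signature-based detection for the formats mentioned above. It is not the magic library itself, only a hand-rolled table of well-known leading bytes; the names SIGNATURES and sniff_mime are invented for this example, and "audio/wav" is one of several labels in common use for WAVE files.

```python
# Well-known leading bytes ("magic numbers") for a few binary formats.
# This table is an illustrative assumption, far smaller than what libmagic ships.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"FWS", "application/x-shockwave-flash"),  # uncompressed SWF
    (b"CWS", "application/x-shockwave-flash"),  # zlib-compressed SWF
]


def sniff_mime(data):
    """Guess a MIME type from the first bytes of data, or return None."""
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime
    # WAVE is a RIFF container: "RIFF", a 4-byte size, then "WAVE".
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "audio/wav"
    return None
```

A text resource such as the JavaScript file in the question has no signature, so sniff_mime returns None for it; that is where parsing-based checks come in.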

However, when you drag and drop a document into your browser, the browser's heuristic is simply to check the file extension. Really weak! A file attached to a POST could be an .exe, and you would think it is a .png because that's its current file extension...
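The mismatch described here can be detected by comparing what the extension claims with what the bytes say. The following Python sketch does this for PNG only; extension_matches_content is a hypothetical helper, and a real check would need one rule per supported type.

```python
import mimetypes

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def extension_matches_content(filename, data):
    """Compare the type implied by the extension with the file's leading bytes.

    Returns True or False for .png files, and None for any type this
    sketch has no rule for.
    """
    claimed, _encoding = mimetypes.guess_type(filename)  # extension-based guess
    if claimed == "image/png":
        return data.startswith(PNG_MAGIC)
    return None


# A Windows executable (it starts with "MZ") renamed to look like an image:
print(extension_matches_content("photo.png", b"MZ\x90\x00\x03\x00"))
```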

I have some code to test the MIME type of a file in JavaScript (after a drag and drop or Browse...):

https://sourceforge.net/p/snapcpp/code/ci/master/tree/snapwebsites/plugins/output/output.js

Search for MIME and you'll find the various functions doing the work. An example of usage is visible in the editor:

https://sourceforge.net/p/snapcpp/code/ci/master/tree/snapwebsites/plugins/editor/editor.js

There are extensions to the basic MIME types that can be found in the mimetype plugin.

It's all object-oriented code, so it may be a bit difficult to follow at first, and many of the calls are asynchronous.

Is there an accepted method for determining what is returned from a web server? The Content-Type header seems to usually be correct. Extensions are also an indicator, but not always accurate.

As far as I know, Apache uses file extensions. Assuming you trust your website administrator and end users cannot upload content, extensions are actually quite safe.

Is the only accurate method to examine the contents of the file?

Accurate and secure, yes. That being said, a server that makes use of a database system can save such metadata in the database and thus not have to re-check each time it handles the file. Further, once the type is detected, it can attempt a load to double check that the MIME type is correct. That can even happen in a backend process so you don't waste the client's time. (My server actually goes further and checks each file for viruses too, so even files it cannot load get checked in some way.)
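A minimal Python sketch of that caching idea follows, using SQLite from the standard library. The table layout, the choice to key entries by a SHA-256 hash of the content, and the names open_cache and cached_mime are all assumptions made for illustration, not a description of how any particular server does it.

```python
import hashlib
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a tiny cache mapping content hashes to MIME types."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS file_types ("
        "sha256 TEXT PRIMARY KEY, mime TEXT)"
    )
    return db


def cached_mime(db, data, detect):
    """Return the stored MIME type for data, calling detect(data) only on a miss."""
    digest = hashlib.sha256(data).hexdigest()
    row = db.execute(
        "SELECT mime FROM file_types WHERE sha256 = ?", (digest,)
    ).fetchone()
    if row is not None:
        return row[0]          # already inspected earlier, no re-check needed
    mime = detect(data)        # the expensive content inspection
    db.execute(
        "INSERT INTO file_types (sha256, mime) VALUES (?, ?)", (digest, mime)
    )
    db.commit()
    return mime
```

Here detect is whatever content check you trust (a signature test, a full load attempt, and so on); keying by hash means a changed file is inspected again automatically.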

Is this what web browsers do to determine how to handle the content?

As mentioned by Joe White, in most cases the browser expects a specific type of data from a file: a link for CSS expects CSS data; a script expects JavaScript, Ruby, ASP; an image or figure tag expects an image; etc.

So the browser can use a loader for that type of data and if the load fails it knows it was not of the right type. So the browser does not really need to detect the type per se. However, you have to trust that the loaders will properly fail when the data stream is invalid. This is why we have updates of the Flash player and way back had an update of the GIF library.

The detection of the type, as the magic library does, will only read a "few" bytes at the start of the file and determine a type based on that. This does not mean that the file is valid and can safely be loaded. The GIF bug meant that the file very much looked like a GIF image (it had the right signature) but at some point the buffers used in the library would overflow possibly creating a way to crash your browser and, hopefully for the hacker, take over your computer...
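The gap between "has the right signature" and "is a valid file" can be shown with a short Python sketch. It uses PNG rather than GIF because PNG chunks carry a CRC that is easy to verify with the standard library; the function names are invented for this example. Even the stricter check below only validates the first chunk, so it is still no substitute for a robust decoder.

```python
import struct
import zlib

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def has_png_signature(data):
    """Signature test only: what a magic-style check looks at."""
    return data.startswith(PNG_MAGIC)


def first_chunk_is_valid_ihdr(data):
    """Slightly deeper check: the first chunk must be a 13-byte IHDR with a good CRC."""
    if not has_png_signature(data) or len(data) < 8 + 8 + 13 + 4:
        return False
    # Each chunk is: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        return False
    body = data[16:29]
    (stored_crc,) = struct.unpack(">I", data[29:33])
    # The CRC covers the chunk type and the chunk data, not the length field.
    return zlib.crc32(chunk_type + body) == stored_crc


# Right signature, nonsense afterwards: passes the first test, fails the second.
fake = PNG_MAGIC + b"not really an image at all, just padding bytes"
print(has_png_signature(fake), first_chunk_is_valid_ihdr(fake))
```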