I have a C# component that will receive a file of one of the following types: .doc, .pdf, .xls, .rtf
These will be sent by the calling Siebel legacy app as a file stream.
So...
[LegacyApp] >> {Binary file stream} >> [Component]
The legacy app is a black box that can't be modified to tell the component what file type (doc, pdf, xls) it is sending. The component needs to read this binary stream and create a file on the filesystem with the right extension.
Any ideas?
Thanks for your time.
On Linux/Unix based systems you can use the file command, but I assume you want to do this manually yourself in code...
If all you have access to is the byte stream of the file, then you would need to handle each file type independently.
Most programs/components that do what you are asking about read the first few bytes and make a classification based on that. For example, GIF files start with one of the following: GIF87a or GIF89a.
Files of a given format usually share the same signature at the start of the file, or the same header format. This signature is referred to as a magic number, as described by me on this post.
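As a rough illustration of the first-bytes check described above, here is a minimal sketch in Python (the same comparisons port directly to C# by reading the first few bytes of the stream). The function name `sniff_container` is made up for this example. It assumes the PDF and RTF markers sit at offset 0, which is the usual case but not something every producer guarantees.

```python
from typing import Optional

# Leading-byte signatures for the formats named in the question.
PDF_MAGIC = b"%PDF-"
RTF_MAGIC = b"{\\rtf"
# Legacy .doc and .xls are both OLE2 compound files and share this signature.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_container(header: bytes) -> Optional[str]:
    """Return 'pdf', 'rtf', 'ole2', or None based on the leading bytes.

    `header` only needs to hold the first 8 bytes of the stream.
    """
    if header.startswith(PDF_MAGIC):
        return "pdf"
    if header.startswith(RTF_MAGIC):
        return "rtf"
    if header.startswith(OLE2_MAGIC):
        return "ole2"  # .doc vs .xls needs a further check
    return None
```

Note that this alone cannot tell .doc from .xls, since both use the same container signature.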
A good place to get started is to go to www.wotsit.org. It contains the file format specifications searchable by file type. You could look at the important file types that you want to handle and see if you can find some identifying factor in those file formats.
You could also search Google to try and find a library that does this classification, or look at the source code of the file command.
Yes, this is possible, as MS Office (97-2007 or thereabouts) files all start with the bytes D0 CF 11 E0 A1 B1 1A E1, and then there is a subtype marker at byte 512.
A reference for these is at: http://www.garykessler.net/library/file_sigs.html
This seems to be the best list around, with all sorts of file formats. It is the main reference on Wikipedia.
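A minimal sketch of the byte-512 subtype check mentioned above, in Python (the logic ports to C# unchanged). The function name `guess_office_extension` is hypothetical, and the two marker values are assumptions taken from my reading of the linked signature list, which also lists other Excel variants. Treat the result as a heuristic and verify the values against the list before relying on them.

```python
from typing import Optional

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Subtype markers expected at offset 512 (assumed values, see the note above).
WORD_MARKER = bytes.fromhex("ECA5C100")
EXCEL_MARKER = bytes.fromhex("0908100000060500")


def guess_office_extension(data: bytes) -> Optional[str]:
    """Guess '.doc' or '.xls' for an OLE2 file, or None if undecided.

    `data` must contain at least the first 520 bytes of the stream.
    """
    if not data.startswith(OLE2_MAGIC):
        return None
    subheader = data[512:520]
    if subheader.startswith(WORD_MARKER):
        return ".doc"
    if subheader.startswith(EXCEL_MARKER):
        return ".xls"
    return None  # OLE2 container, but the subtype was not recognised
```

When this returns None for an OLE2 file, the safe fallback is to refuse to guess rather than pick an extension arbitrarily.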
It doesn't give complete details on the new Office format, so this is from my own examples. DOCX files start with "PK" (as technically they are zip files) and then contain the string "word/_rels/document.xml.rels" while XLSX contain "xl/_rels/workbook.xml.rels".
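Since these files are zip archives, one way to apply the check above is to list the archive entries rather than search the raw bytes for the strings. A minimal Python sketch, assuming the whole stream has been buffered in memory. The function name `guess_ooxml_extension` is made up for this example, and the two entry names are the ones quoted above from my own sample files.

```python
import io
import zipfile
from typing import Optional


def guess_ooxml_extension(data: bytes) -> Optional[str]:
    """Guess '.docx' or '.xlsx' from the entries of a zip-based Office file."""
    if not data.startswith(b"PK"):
        return None
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return None  # starts with "PK" but is not a readable zip archive
    if "word/_rels/document.xml.rels" in names:
        return ".docx"
    if "xl/_rels/workbook.xml.rels" in names:
        return ".xlsx"
    return None  # some other kind of zip file
```

In C#, the same idea should work with any zip reader that can open an in-memory stream and enumerate entry names.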
You may be interested in this: http://en.wikipedia.org/wiki/Magic_number_(programming)
Most binary formats contain a magic number at their beginning. If you only have to recognize a certain set of formats, it should be easy to check the first few bytes of a new incoming file and guess the appropriate file extension correctly.
On Linux, there is a command called file. Given an arbitrary file, it attempts to determine what kind of file it is. For instance:
gzip compressed data, from Unix, last modified: Fri Jun 12 20:16:28 2009
HTML document text
vCalendar calendar file
RCS/CVS diff output text
Those are from a few random files lying around my home directory.
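If the component runs on a machine where the file command is installed, it can be called from code instead of reimplementing the detection. A minimal Python sketch, assuming the stream has already been saved to a temporary path. The function name `mime_type_via_file` is made up for this example, and it returns None when the command is missing or fails.

```python
import shutil
import subprocess
from typing import Optional


def mime_type_via_file(path: str) -> Optional[str]:
    """Ask the `file` command for the MIME type of `path`, if available."""
    if shutil.which("file") is None:
        return None  # `file` is not installed on this machine
    result = subprocess.run(
        ["file", "--brief", "--mime-type", path],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None
```

The caller would still need its own mapping from the reported MIME type to an extension such as .pdf or .doc.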
Yep. See file.
And please do not reinvent the wheel. It works just fine how it is.