I am looking at writing a program that can test files for corruption and/or damage. I would prefer to write the program in Java.
Now for the tricky part: is it possible to use Java to test for file corruption or damage across many different file types? I am mainly looking at checking .pdf, .html, and .txt files, but I fear that more file types could be added to the list soon. I honestly have no idea whether this is even possible to write. If Java cannot do this, is it possible to do it in C?
I think you are going to have to take it on a file-by-file basis. For example:
- text files - make sure that you can read the file using FileReader
- html - make sure it is a text file AND that the HTML file is valid
- pdf - use a PDF library to see if you can read the PDF and whether it is valid
But as alex has suggested, it doesn't matter whether you do this in Java. As long as you can read bytes, you can check them.
You also have to define corruption. If by corruption you mean checking that the disk blocks on the hard drive are correct, then you might need a lower-level programming language. If you mean checking that all the bytes represent correct data, then you can do this in any language.
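For the "all the bytes represent correct data" reading, here is a minimal sketch of a text-file check. It assumes the files are meant to be UTF-8, which the question does not state; the class and method names are made up for illustration. A plain FileReader silently replaces malformed bytes, so the sketch uses a decoder configured to report them instead.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: the UTF-8 assumption is ours, not the original poster's.
final class TextFileCheck {

    /** Returns true if every byte of the file decodes as well-formed UTF-8. */
    static boolean isValidUtf8(Path file) throws IOException {
        // Reads the whole file into memory; fine for a sketch, not for huge files.
        byte[] bytes = Files.readAllBytes(file);
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Fail on bad input instead of substituting a replacement character.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A file in another encoding (for example Windows-1252) may fail this check without being damaged, so the expected encoding has to be part of your definition of "valid".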
You will first need to define "corruption". If you can assume that a file is in good shape as long as you can open it, read its contents, confirm its file permissions, and confirm that it is not empty, that's doable in Java via the java.io API.
If your definition of a valid file includes more rules, such as HTML files needing to be well-formed XML and PDFs needing to be correct and complete, then your program will get more involved, depending on your requirements. For PDFs, you can use iText to read them and get their metadata:
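The basic definition above (can be opened, can be read, is not empty) might look roughly like this. It is only a sketch using java.nio.file alongside java.io; the class and method names are invented for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper illustrating the "open, read, permissions, non-empty" checks.
final class BasicFileCheck {

    /** Returns true if the path is a readable, non-empty regular file that can be read to the end. */
    static boolean looksReadable(Path file) {
        if (!Files.isRegularFile(file) || !Files.isReadable(file)) {
            return false;
        }
        try {
            if (Files.size(file) == 0) {
                return false;
            }
            // Read every byte; an I/O error partway through is treated as damage.
            try (InputStream in = Files.newInputStream(file)) {
                byte[] buffer = new byte[8192];
                while (in.read(buffer) != -1) {
                    // Discard the data; only the ability to read it matters here.
                }
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

This says nothing about whether the content makes sense for its type; it only confirms the file system can hand you all of the bytes.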
http://itextpdf.com/
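Before handing a file to a full PDF library, a cheap pre-check on the bytes can catch obvious truncation. The sketch below does not use iText at all; it only looks for the "%PDF-" header at the start and an "%%EOF" marker near the end. The 1024-byte tail window and the names are assumptions chosen for the example, and passing this check does not mean the PDF is valid.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical pre-check: looks for PDF start and end markers only.
final class PdfMarkerCheck {

    /** Returns true if the file starts with "%PDF-" and has "%%EOF" within its last 1024 bytes. */
    static boolean hasPdfMarkers(Path file) throws IOException {
        // Reads the whole file into memory; acceptable for a sketch.
        byte[] bytes = Files.readAllBytes(file);
        // ISO-8859-1 maps each byte to exactly one char, so offsets match byte positions.
        String text = new String(bytes, StandardCharsets.ISO_8859_1);
        boolean header = text.startsWith("%PDF-");
        int tailStart = Math.max(0, text.length() - 1024);
        boolean trailer = text.indexOf("%%EOF", tailStart) >= 0;
        return header && trailer;
    }
}
```

A file that fails is very likely truncated or not a PDF; a file that passes still needs a real parser to confirm its structure.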
Files can always be seen as a collection of bytes that Java can read. If you have an algorithm to check for corruption, nothing prevents you from implementing it in Java.
And using some good design patterns can make it easy to support different file types.
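One way to read "good design patterns" here is a strategy-style interface with one checker per file type, looked up by extension. This is only a sketch of that idea; FileChecker, CheckerRegistry, and the fallback rule are all hypothetical names and choices, not something the answer specifies.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical strategy interface: one implementation per file type.
interface FileChecker {
    boolean isIntact(Path file) throws IOException;
}

// Hypothetical registry mapping a file extension to its checker.
final class CheckerRegistry {
    private final Map<String, FileChecker> checkers = new HashMap<>();

    // Used when no checker is registered: readable and non-empty is all we can say.
    private final FileChecker fallback =
            file -> Files.isReadable(file) && Files.size(file) > 0;

    void register(String extension, FileChecker checker) {
        checkers.put(extension.toLowerCase(Locale.ROOT), checker);
    }

    boolean check(Path file) throws IOException {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String ext = dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
        return checkers.getOrDefault(ext, fallback).isIntact(file);
    }
}
```

Supporting a new file type then means writing one more FileChecker and registering it, for example `registry.register("pdf", myPdfChecker)`, without touching the code that walks the files.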
Acrobat has some fairly powerful repair capabilities, so it repairs and opens many broken files. The spec is also quite loosely interpreted (for example, TT fonts are supposed to be MAC encoded, but in practice WIN encoding works).