I am creating a binary editor for some very large binary files. One of the software requirements is that the editor cannot modify the original file, so the target file must be an edited copy of the original.
I want to design the editor so that the copy is made only once (it is a 20-minute process). I know that I can lock the file while it is being edited, but if the user exits the program they will have to sit through the whole 20-minute copy again, unless I can find a way to determine that they are still in their original editing session.
Is there some simple process you can think of by which I can allow the user to "register" the copied file somehow as an editable file, and when they are completed with all of their changes, "finalize" the file?
Ideally, such a process would allow me to detect whether or not the editable file or the transaction information has been tampered with in-between editing sessions (tampering or finalization would cause another copy to occur, if the file is edited again).
- Create and maintain a record (db?) of sessions in a centralized location.
- A session consists of something that uniquely identifies the user (username if you have it, otherwise IP or whatever else works) and a hash of the file's bytes. If hashing is too burdensome at this file size, you might fall back on the file's date and size.
- When the user closes out their editor, you update the session record with the above information and mark it as inactive.
- When the user reopens the editor, you should have access to your key information, i.e., username and the file info. If you find a session record, it's an inactive session that you can reactivate, otherwise, it's either been tampered with or is brand new.
Does that suit your needs?
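A minimal sketch of this registry in Python, assuming a JSON file as the central record; all names here (`SESSION_DB`, `register_session`, `reopen_session`) are illustrative, not an existing API:

```python
import hashlib
import json
import os
import time

SESSION_DB = "sessions.json"  # hypothetical centralized session record

def file_fingerprint(path, chunk=1 << 20):
    """Hash the file in chunks so huge files never load fully into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def load_sessions():
    if os.path.exists(SESSION_DB):
        with open(SESSION_DB) as f:
            return json.load(f)
    return {}

def save_sessions(sessions):
    with open(SESSION_DB, "w") as f:
        json.dump(sessions, f)

def register_session(user, path):
    """Record a new editing session keyed by the copied file's path."""
    sessions = load_sessions()
    sessions[path] = {"user": user,
                      "fingerprint": file_fingerprint(path),
                      "active": True,
                      "updated": time.time()}
    save_sessions(sessions)

def close_session(path):
    """On editor exit: refresh the fingerprint and mark the session inactive."""
    sessions = load_sessions()
    if path in sessions:
        sessions[path]["fingerprint"] = file_fingerprint(path)
        sessions[path]["active"] = False
        save_sessions(sessions)

def reopen_session(user, path):
    """True only if the same user returns and the file was not tampered with."""
    rec = load_sessions().get(path)
    return (rec is not None
            and rec["user"] == user
            and rec["fingerprint"] == file_fingerprint(path))
```

Any fingerprint mismatch on reopen (tampering, or a finalized file that was edited again) would trigger a fresh copy.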
I think you'll want to keep a log of actions taken by the user. In order to avoid writing to the copy of the source data, I would keep the log in a separate file. Store the user's edits with time stamp information.
When it comes time to commit the transaction, simply read down the list of changes in the log file and apply them, ordered by time stamp.
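A sketch of such a log, assuming one JSON entry per line with a timestamp, an offset, and the replacement bytes (the format and function names are my assumptions):

```python
import json
import time

def append_edit(log_path, offset, data):
    """Record one overwrite as a log line: timestamp, offset, hex-encoded bytes."""
    entry = {"ts": time.time(), "offset": offset, "data": data.hex()}
    with open(log_path, "a") as log:
        log.write(json.dumps(entry) + "\n")

def commit(log_path, target_path):
    """Replay every logged edit against the copied file, ordered by timestamp."""
    with open(log_path) as log:
        edits = [json.loads(line) for line in log]
    edits.sort(key=lambda e: e["ts"])  # stable sort keeps insertion order on ties
    with open(target_path, "r+b") as f:
        for e in edits:
            f.seek(e["offset"])
            f.write(bytes.fromhex(e["data"]))
```

The source copy is never touched until `commit`, which satisfies the copy-once requirement.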
When the user needs to read data from the file during the editing process, you'll have to read out the relevant portion of the source file into memory and apply the changes to that data from the log file.
This could really be the hardest part, depending on the binary file format. If you have the ability to somehow index the contents of the binary file, I would use that information in the edit log. That way, you can pull only the data you need from the log file, and you'll be able to determine which edits are applicable to that data.
If all you have is a big, formless blob, you'll have to keep the entire thing in memory and apply all of the changes every time you perform a read. There's room for optimization here, I think, but the whole thing is still really heinous. Without being able to limit the scope of the read, you have to assume that any edit could change any data at any time.
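The read path described above might look like this in Python, again assuming a line-per-edit JSON log with `ts`/`offset`/`data` fields (a hypothetical format): read the requested window from the source, then overlay only the logged edits that intersect it.

```python
import json

def read_with_edits(source_path, log_path, offset, length):
    """Read a window of the untouched source and overlay intersecting edits."""
    with open(source_path, "rb") as f:
        f.seek(offset)
        buf = bytearray(f.read(length))
    with open(log_path) as log:
        edits = sorted((json.loads(line) for line in log),
                       key=lambda e: e["ts"])  # oldest first, newest wins
    for e in edits:
        data = bytes.fromhex(e["data"])
        # Clip the edit to the requested window before copying bytes in.
        start = max(e["offset"], offset)
        end = min(e["offset"] + len(data), offset + length)
        if start < end:
            buf[start - offset:end - offset] = \
                data[start - e["offset"]:end - e["offset"]]
    return bytes(buf)
```

With an index over the log's offsets, the loop could skip edits that can never intersect the window instead of scanning them all.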
As to securing the edits, that's a tricky question. If you're running in an environment you trust, you can get away with keeping a secret and using it to authenticate the information. It's cumbersome, but you could hash the concatenation of the binary file, the edit log, and a secret known only to the application. (Without the secret, anyone could come by, modify the file, and insert a new hash.)
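A sketch of that idea using HMAC, which is the standard keyed-hash construction for exactly this (safer than hashing a raw concatenation with the secret); the secret literal here is a placeholder and, as noted below, hiding it on a user's machine is the hard part:

```python
import hashlib
import hmac

APP_SECRET = b"known-only-to-the-application"  # placeholder, NOT a real secret

def seal(paths):
    """Compute an HMAC over the contents of the given files, in order."""
    mac = hmac.new(APP_SECRET, digestmod=hashlib.sha256)
    for p in paths:
        with open(p, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                mac.update(block)
    return mac.hexdigest()

def verify(paths, expected):
    """Constant-time comparison; any modified byte invalidates the seal."""
    return hmac.compare_digest(seal(paths), expected)
```

You would store the seal next to the edit log and re-verify it when a session is reopened.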
If you're running on a machine local to the user (i.e., a desktop), keeping secrets can be really difficult, especially with managed code. This is a topic unto itself, and I don't have a good answer for you.
Can't you just have a field in the file, at a fixed offset from the start or end, where you put session information, or just a 'being edited' flag? It could include a reference to the current editing process (e.g., its pid). If the pid is our pid, it's our session. If it's not our pid, look at the process list: if a process with that pid exists, it's the legitimate editor; if not, we're seeing the result of a crash, so initiate crash recovery (if any). If the pid is 0, the file was cleanly finalized.
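A sketch of this scheme, assuming the last 8 bytes of the copy are reserved for a little-endian pid (the field size, placement, and function names are all assumptions; the liveness check shown is POSIX-style, and Windows would use OpenProcess instead):

```python
import os
import struct

FIELD_SIZE = 8  # trailing bytes reserved for the editor's pid; 0 = finalized

def write_session_pid(path, pid):
    with open(path, "r+b") as f:
        f.seek(-FIELD_SIZE, os.SEEK_END)
        f.write(struct.pack("<Q", pid))

def read_session_pid(path):
    with open(path, "rb") as f:
        f.seek(-FIELD_SIZE, os.SEEK_END)
        return struct.unpack("<Q", f.read(FIELD_SIZE))[0]

def pid_alive(pid):
    """Best-effort POSIX liveness check: signal 0 probes without killing."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True

def classify(path):
    pid = read_session_pid(path)
    if pid == 0:
        return "finalized"
    if pid == os.getpid():
        return "our session"
    return "other editor" if pid_alive(pid) else "crashed; recover"
```

One caveat: pids are recycled by the OS, so a stale pid can occasionally match an unrelated live process; pairing the pid with a process start time would make the check robust.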
Also: If the big file is available for reading, do you really need to copy it before editing?
If edits are rather small compared to the size of file, I'd record user actions as 'diffs' between the original file and the result. If the same spot is edited again and again, it may be useful to "join" the diffs somehow so that you don't apply too many layers of diffs. User's view of the file is, of course, with all diffs dynamically applied.
In the meantime you copy the file, and, once the editing session is over and the file is fully here, you apply all your diffs to the file. Depending on the nature of allowed edits, this may or may not be a time-consuming process, though. If editing sessions are longer than 20 minutes, the user may notice no wait time at all. You will lock the file for the time of diff application, which is presumably shorter than copy time.
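The "joining" step above could be as simple as keeping only the newest byte for each edited position and then collapsing the survivors into contiguous runs; a sketch (representation and names are my assumptions, with diffs as `(offset, bytes)` pairs, oldest first):

```python
def join_diffs(diffs):
    """Collapse overlapping (offset, bytes) edits so each byte position
    keeps only its newest value; return a list of contiguous runs."""
    latest = {}
    for off, data in diffs:  # oldest first, so later edits overwrite earlier
        for i, b in enumerate(data):
            latest[off + i] = b
    runs, run_start, run = [], None, bytearray()
    for pos in sorted(latest):
        if run_start is None:
            run_start = pos
        elif pos != run_start + len(run):  # gap: close the current run
            runs.append((run_start, bytes(run)))
            run_start, run = pos, bytearray()
        run.append(latest[pos])
    if run_start is not None:
        runs.append((run_start, bytes(run)))
    return runs

def apply_diffs(path, diffs):
    """Write the joined diffs into the (now fully copied) file."""
    with open(path, "r+b") as f:
        for off, data in join_diffs(diffs):
            f.seek(off)
            f.write(data)
```

Because the diffs are joined first, the final write pass touches each edited region once, which keeps the lock window after the copy finishes as short as possible.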
Since you are thinking about transactions and file system activity, it might be helpful to consider Transactional NTFS. This doesn't answer your question directly, but it might give you fresh insight into the possibilities. Since your question is tagged for C# and Windows, you might want to look at a .NET wrapper such as the one here: http://offroadcoder.com/CategoryView,category,Transactions.aspx. Scott Klueppel shows how to do transactional NTFS using the familiar .NET idiom of a TransactionScope. I did a quick test of what Scott has done and liked what I saw.