Best practices for versioning text data

开发者 (Developer) https://www.devze.com 2023-02-07 17:22 Source: Internet

What are the best practices for versioning data contained in several large (100MB+) CSV files?

Is SVN a good option?

Update: After thinking this over for a while, I feel it may be better to gzip/zip each CSV file and then add it to the repo. That way I'd avoid the headache of managing versions by hand without losing out on disk space. It's at least as good as, if not better than, managing the versions manually.
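For what it's worth, here is a minimal sketch of the "gzip it before adding it to the repo" idea in Python. The function name gzip_csv and the paths are made up for illustration. One detail worth knowing: gzip normally stores a timestamp (and the original file name) in its header, so compressing the same CSV twice gives different bytes and the repo sees a "change" every time. Passing mtime=0 and an empty filename avoids that.

```python
import gzip
import shutil
from pathlib import Path


def gzip_csv(src: Path, dest: Path) -> Path:
    """Compress src into dest as gzip with a reproducible header.

    gzip_csv is a hypothetical helper name, not part of any tool
    mentioned in this thread.
    """
    with open(src, "rb") as raw, open(dest, "wb") as out:
        # filename="" and mtime=0 keep the original name and the current
        # time out of the gzip header, so identical CSV content always
        # produces identical compressed bytes.
        with gzip.GzipFile(filename="", mode="wb", fileobj=out, mtime=0) as gz:
            # Stream in chunks so a 100MB+ file is never held in memory.
            shutil.copyfileobj(raw, gz)
    return dest
```

You would call it as gzip_csv(Path("data.csv"), Path("data.csv.gz")) and commit the .gz file. Keep in mind that a compressed file is opaque to the version control system, so diffs between versions are no longer readable.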

Still looking out for the perfect solution.

Also, a small note: versioning of the file contents is not a requirement. I don't need to know which words have changed within a file, so long as I am able to record a summary of changes or add a note to each version.
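If a note per version is all that's needed, a commit message in any version control system already covers it. Outside of a repository, something like the sketch below could do the same job: it appends one line per version to a log file, with a checksum so each note can be tied to the exact file contents. The function name record_version, the log format and the field names are my own assumptions, not an existing tool.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def record_version(data_file: Path, log_file: Path, note: str) -> dict:
    """Append one entry describing the current state of data_file to log_file.

    Hypothetical helper: the JSON-lines layout and field names are
    illustrative choices only.
    """
    digest = hashlib.sha256()
    with open(data_file, "rb") as fh:
        # Read in 1 MiB chunks so a 100MB+ file never has to fit in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)

    # The version number is the count of entries already logged, plus one.
    if log_file.exists():
        previous = log_file.read_text(encoding="utf-8").splitlines()
    else:
        previous = []

    entry = {
        "version": len(previous) + 1,
        "file": data_file.name,
        "sha256": digest.hexdigest(),
        "bytes": data_file.stat().st_size,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "note": note,
    }
    with open(log_file, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry
```

For example, record_version(Path("data.csv"), Path("versions.jsonl"), "Added Q4 rows") after each update gives a plain-text history of notes that can be read without any special tooling.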

This largely depends on how you intend to use these files.

SVN, like most other source control systems, gives you revision numbers that uniquely identify a specific version of a file. Every time you commit a new CSV, that commit gets its own revision number.

However...

Depending on usage, it might not be a good solution. Let's say you check in a CSV and this is SVN revision number 1234. Somebody then checks that file out, maybe sends it on to somebody else, and so on. The holder of the CSV cannot tell from the file itself which revision it is, and therefore cannot tell whether they are using the latest version.

Personally, I would put a version number in the filename, or add a row at the start or end of the CSV that contains a version number. However, these options depend on your usage as well.
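To make both options concrete, here is a rough Python sketch. The marker "#version" and the function names are hypothetical; any marker works as long as it can never be mistaken for a real header.

```python
import csv
from pathlib import Path

# Hypothetical marker for the version row; not a CSV standard.
VERSION_TAG = "#version"


def versioned_name(path: Path, version: str) -> Path:
    """Option 1: put the version in the file name, e.g. prices_v1234.csv."""
    return path.with_name(f"{path.stem}_v{version}{path.suffix}")


def write_versioned_csv(path: Path, version: str, header, rows) -> None:
    """Option 2: write the version as the very first row of the CSV."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow([VERSION_TAG, version])
        writer.writerow(header)
        writer.writerows(rows)


def read_version(path: Path) -> str:
    """Return the version from the first row without reading the rest."""
    with open(path, newline="", encoding="utf-8") as fh:
        first = next(csv.reader(fh), [])
    if len(first) != 2 or first[0] != VERSION_TAG:
        raise ValueError(f"{path} has no version row")
    return first[1]
```

The catch with the version row is that every consumer has to skip it: anything that expects the column header on line 1 will break. The filename approach avoids that, but the version is lost as soon as somebody renames the file.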

Food for thought...

EDIT: Additionally, there might be an issue with diffs. I am not certain whether SVN supports diffs on CSV, so every time you check in, within the bowels of SVN, it might completely replace the older file (keeping the old one for reference). That could rapidly use a lot of disk space.

SVN is terribly slow because it transfers all the data over the network. Try a local git or hg repository instead. It only needs file access, which should be much faster than the network. Both repo types also handle moved files, renames and merging much better. Additionally, git can use 'plugins' to support further file types, such as merging office documents (odf, doc, etc.).

In contrast to SVN, you only have one hidden repo dir containing the compressed repository. SVN has a .svn dir in every subdirectory, containing the last state of each file (and other stuff).

Some random numbers:

Assume the total size of all files (not repo info) in the repository is 100MB.

An SVN checkout would take 200 to 250MB, and older versions still have to be transferred from the SVN server.
A git or hg repo would take 150MB (assuming the files compress well), including all versions of the files.

This is what we've experienced with SVN and git. I use hg (Mercurial) only occasionally.

Regarding MrEyes's answer, I'd also suggest adding some version info to the CSV file or the file name. Git will detect the file rename along with the content changes.