
How bad an idea is it to let users upload and store files with national characters in the filename?

Developer https://www.devze.com 2023-01-28 17:11 Source: the web
Our CMS accepts files with national characters in their names and stores them on the server without a problem. But how bad is this approach in the long run? For example, is it possible to store files with filenames in Hebrew, Arabic, or any other language with a non-Latin alphabet? Is there a standard, established way to handle these?


A standard way is to generate unique names yourself and store the original file name somewhere else. Even if your underlying OS and file system allow arbitrary Unicode characters in file names, you typically don't want users to decide file names on your server. Doing so poses risks and can lead to problems, e.g. names that are too long or name collisions in the file system. Sites that generate their own names include Facebook, Flickr and many others.

GUID values are a good choice for generating the unique file names.
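As a rough illustration of the GUID suggestion, here is a minimal Python sketch. The function name `make_storage_name` is made up for this example, and using a random (version 4) UUID is an assumption; the answer does not say which GUID variant to use.

```python
import uuid


def make_storage_name() -> str:
    """Return a 32-character hexadecimal name taken from a random UUID.

    Hypothetical helper: the name contains only ASCII digits and the
    letters a-f, whatever the user originally called the file.
    """
    return uuid.uuid4().hex


print(make_storage_name())
```

The result has a fixed length and a fixed alphabet, which sidesteps both the long-name and the collision concerns mentioned above.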


Store the original filename in a database of some sort, in case you ever need to use it.

Then rename the file using a unique alphanumeric ID, keeping the original file extension.
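The two steps above could look roughly like this in Python. This is only a sketch: the `uploads` table, its column names and the `register_upload` function are invented for illustration, and an in-memory SQLite database stands in for whatever database the CMS really uses.

```python
import os
import sqlite3
import uuid


def register_upload(conn: sqlite3.Connection, original_name: str) -> str:
    """Record the original name and return the name to use on disk."""
    # Only the extension is taken from the user-supplied name.
    _, ext = os.path.splitext(original_name)
    stored_name = uuid.uuid4().hex + ext.lower()
    conn.execute(
        "INSERT INTO uploads (stored_name, original_name) VALUES (?, ?)",
        (stored_name, original_name),
    )
    conn.commit()
    return stored_name


# Hypothetical schema, created here so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uploads (stored_name TEXT PRIMARY KEY, original_name TEXT NOT NULL)"
)

stored = register_upload(conn, "árvíztűrő tükörfúrógép.mp3")
print(stored)  # 32 hex characters followed by ".mp3"
```

The original name lives in the database as ordinary Unicode text, so it can still be shown to the user or sent back as the download name.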

If you expect many files, create directories to group them. Using the year, month, day, hour and minute is usually enough. For example:

.../2010/12/02/10/28/1a2b3c4d5e.mp3
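A path like the one above can be built from the upload time. The sketch below assumes a POSIX-style server; the root directory `/srv/uploads` and the function name `dated_path` are placeholders, not something from the answer.

```python
from datetime import datetime
from pathlib import PurePosixPath


def dated_path(root: str, stored_name: str, when: datetime) -> PurePosixPath:
    """Build root/YYYY/MM/DD/HH/MM/stored_name from the upload time."""
    return PurePosixPath(root) / when.strftime("%Y/%m/%d/%H/%M") / stored_name


print(dated_path("/srv/uploads", "1a2b3c4d5e.mp3", datetime(2010, 12, 2, 10, 28)))
# prints /srv/uploads/2010/12/02/10/28/1a2b3c4d5e.mp3
```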

Yes, I've had experience with massive MP3 collections, which are notorious for being named in the language of the country where the song originates, and that can cause trouble in several places.


It's fine as long as you detect the charset of the filename from the request headers and use a consistent charset (such as UTF-8) internally.
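A minimal sketch of that idea in Python follows. It assumes the raw filename bytes and the charset declared by the request have already been extracted by your framework; `decode_filename` and `to_internal_bytes` are hypothetical names. Normalizing to NFC is an extra assumption of this sketch, added so the same visible name always yields the same bytes.

```python
import unicodedata
from typing import Optional


def decode_filename(raw: bytes, declared_charset: Optional[str]) -> str:
    """Decode filename bytes with the declared charset (UTF-8 if none)."""
    charset = declared_charset or "utf-8"
    try:
        text = raw.decode(charset)
    except (LookupError, UnicodeDecodeError) as exc:
        # Unknown charset or bytes that do not fit it: refuse to guess.
        raise ValueError(f"cannot decode filename as {charset!r}") from exc
    # Assumption of this sketch: store one canonical Unicode form.
    return unicodedata.normalize("NFC", text)


def to_internal_bytes(name: str) -> bytes:
    """Encode a decoded name in the single internal charset, UTF-8."""
    return name.encode("utf-8")


name = decode_filename("tükör.txt".encode("iso-8859-2"), "iso-8859-2")
print(name, to_internal_bytes(name))
```

Rejecting undecodable input, rather than silently replacing characters, makes a charset mismatch visible at upload time instead of later.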


On a Unix server, it's technically feasible and easy to accept any Unicode character in the filename and convert filenames to UTF-8 before saving them. However, there might be bugs in the conversion (in the HTML templating engine or web framework you are using, or in the user's web browser), so some users may complain that files they uploaded have disappeared, and the root cause may be buggy filename conversion. If all characters in the filename are non-Latin and you (as a software developer) don't speak that language, then good luck figuring out what happened to the file.
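One way to make such cases easier to investigate is to log filenames in an ASCII-only form that can be read without knowing the script. The sketch below uses hypothetical helper names, and treating U+FFFD as a sign of damage is only a heuristic: it catches some lossy conversions, not all of them.

```python
import unicodedata


def describe_filename(name: str) -> str:
    """Return an ASCII-only description of every character in the name."""
    parts = []
    for ch in name:
        label = unicodedata.name(ch, "UNNAMED")
        parts.append(f"U+{ord(ch):04X} {label}")
    return "; ".join(parts)


def looks_damaged(name: str) -> bool:
    """Heuristic: U+FFFD usually means an earlier conversion lost data."""
    return "\ufffd" in name


print(ascii("ű.txt"))          # prints '\u0171.txt'
print(describe_filename("ű"))  # prints U+0171 LATIN SMALL LETTER U WITH DOUBLE ACUTE
```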


It is an excellent idea. Being Hungarian, I'm pretty annoyed when I'm not allowed to use characters like áÉŰÖÜúÓÚŰÉÍí :)


A lot of software has bugs in handling such file names, especially on Windows.

Update: For example, I couldn't use the Android SDK (without creating a new user) because I had an é in my user name. I also ran into a similar problem with the Intel C++ compiler.

Software usually isn't tested properly with such file names. The Windows API still offers "ANSI"-encoded versions of its functions, and many developers don't seem to understand the problems these can cause. I also keep coming across web pages that mess up my name.
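To see why a single-byte "ANSI" encoding is a problem, here is a small Python sketch. It does not call the Windows API; it only uses Python's `cp1252` and `cp1250` codecs as stand-ins for an ANSI code page, and the real code page on a given machine depends on its locale settings. The function name `survives_code_page` is invented for this example.

```python
def survives_code_page(name: str, code_page: str = "cp1252") -> bool:
    """Return True if the name round-trips through the given code page."""
    try:
        return name.encode(code_page).decode(code_page) == name
    except UnicodeEncodeError:
        # At least one character has no representation in this code page.
        return False


print(survives_code_page("café.txt"))         # prints True
print(survives_code_page("ű.txt"))            # prints False
print(survives_code_page("ű.txt", "cp1250"))  # prints True
```

The same name works under one code page and fails under another, which is the kind of locale-dependent behaviour that goes unnoticed when software is only tested with ASCII names.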

I'm not saying you shouldn't allow such file names; in fact, in the 21st century I would expect to be able to use such characters everywhere. But be prepared to run into problems.
