A very similar question has also been asked here on SO in case you are interested, but as we will see the accepted answer of that question is not always the case (and it's never the case for my application use-pattern).
The performance determining code consists of FileStream constructor (to open a file) and a SHA1 hash (the .Net framework implementation). The code is pretty much C# version of what was asked in the question I've linked to above.
Case 1: The Application is started either for the first time or Nth time, but with different target file set. The application is now told to compute the hash values on the files that were never accessed before.
- ~50ms
- 80% FileStream constructor
- 18% hash computation
Case 2: Application is now fully terminated, and started again, asked to compute hash on the same files:
- ~8m开发者_C百科s
- 90% hash computation
- 8% FileStream constructor
Problem
My application is always in use Case 1. It will never be asked to re-compute a hash on a file that was already visited once.So my rate-determining step is FileStream Constructor! Is there anything I can do to speed up this use case?
Thank you.
P.S. Stats were gathered using JetBrains profiler.
... but with different target file set.
Key phrase, your app will not be able to take advantage of the file system cache. Like it did in the second measurement. The directory info can't come from RAM because it wasn't read yet, the OS always has to fall back to the disk drive and that is slow.
Only better hardware can speed it up. 50 msec is about the standard amount of time needed for a spindle drive, 20 msec is about as low as such drives can go. Reader head seek time is the hard mechanical limit. That's easy to beat today, SSD is widely available and reasonably affordable. The only problem with it is that when you got used to it then you never move back :)
The file system and or disk controller will cache recently accessed files / sectors.
The rate-determining step is reading the file, not constructing a FileStream
object, and it's completely normal that it will be significantly faster on the second run when data is in the cache.
Off track suggestion, but this is something that I have done a lot and got our analyses 30% - 70% faster:
Caching
Write another piece of code that will:
- iterate over all the files;
- compute the hash; and,
- store it in another index file.
Now, don't call a FileStream
constructor to compute the hash when your application starts. Instead, open the (expectedly much) smaller index file and read the precomputed hash off it.
Further, if these files are log etc. files which are freshly created every time before your application starts, add code in the file creator to also update the index file with the hash of the newly created file.
This way your application can always read the hash from the index file only.
I concur with @HansPassant's suggestion of using SSDs to make your disk reads faster. This answer and his answer are complimentary. You can implement both to maximize the performance.
As stated earlier, the file system has its own caching mechanism which perturbates your measurement.
However, the FileStream
constructor performs several tasks which, the first time are expensive and require accessing the file system (therefore something which might not be in the data cache). For explanatory reasons, you can take a look at the code, and see that the CompatibilitySwitches
classes is used to detect sub feature usage. Together with this class, Reflection is heavily used both directly (to access the current assembly) and indirectly (for CAS protected sections, security link demands). The Reflection engine has its own cache, and requires accessing the file system when its own cache is empty.
It feels a little bit odd that the two measurements are so different. We currently have something similar on our machines equipped with an antivirus software configured with realtime protection. In this case, the antivirus software is in the middle and the cache is hit or missed the first time depending the implementation of such software.
The antivirus software might decide to aggressively check certain image files, like PNGs, due to known decode vulnerabilities. Such checks introduce additional slowdown and accounts the time in the outermost .NET class, i.e. the FileStream
class.
Profiling using native symbols and/or with kernel debugging, should give you more insights.
Based on my experience, what you describe cannot be mitigated as there are multiple hidden layers out of our control. Depending on your usage, which is not perfectly clear to me right now, you might turn the application in a service, therefore you could serve all the subsequent requests faster. Alternative, you could batch multiple requests into one single call to achieve an amortized reduced cost.
You should try to use the native FILE_FLAG_SEQUENTIAL_SCAN
, you will have to pinvoke CreateFile
in order to get an handle and pass it to FileStream
精彩评论