Search MS Word binary file for specific content

开发者 https://www.devze.com 2023-02-15 06:41 Source: Internet
I have some .doc binary files stored in my database, and I would now like to search them all (without converting them to .doc) to see which ones contain the word "hello", for instance.

Is there any way to do this search in the binary file?


You could go down the route of using commercial tools. Aspose.Words can load a document from a stream and has all sorts of methods for finding text within the document.

If you have the stream from the DB, then your code would look like this:

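// Load the document directly from the stream read from the database.
Aspose.Words.Document doc = new Aspose.Words.Document(streamObjectFromDatabase);

// GetText() returns the document text; lower-case it so the match ignores case.
if (doc.GetText().ToLower().Contains("hello world"))
  MessageBox.Show("Hello World exists");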

Note: The benefit of this tool is that it does not require Word or its automation objects to be installed, and it can work with streams in memory.


Not without a lot of pain, as far as I can tell. According to Wikipedia, Microsoft has within the past few years finally released the .doc specification. So you could create a parser based on the spec if you have the time, assuming all of your documents are in the same version of the .doc format.
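Before committing to a parser, it may be worth checking which container format each stored blob actually uses, since files named .doc are not always in the legacy binary format. The sketch below is only a heuristic based on well-known file signatures; the function name `sniff_word_container` and its return labels are made up for illustration, and an "ole2" result only says the blob is a compound file, not which Word version wrote it.

```python
# Signature of the OLE2 / Compound File Binary container used by legacy .doc files
# (the same container is also used by .xls, .ppt and others).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# ZIP local file header, which is how .docx packages begin.
ZIP_MAGIC = b"PK\x03\x04"
# RTF files are sometimes saved with a .doc extension.
RTF_MAGIC = b"{\\rtf"


def sniff_word_container(blob: bytes) -> str:
    """Guess the container format of a stored document from its first bytes."""
    if blob.startswith(OLE2_MAGIC):
        return "ole2"
    if blob.startswith(ZIP_MAGIC):
        return "zip"
    if blob.lstrip().startswith(RTF_MAGIC):
        return "rtf"
    return "unknown"
```

Only blobs reported as "ole2" would be candidates for a parser written against the binary .doc specification; the others need a different approach.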

Of course, you could just search for the text you're looking for amid all the binary data, on the assumption that the actual text is stored as plain text. But even if that assumption were true, how could you be sure that the plain text you found was the actual document text, and not some of the document metadata that's also stored in plain text? And there's always the off chance that some unrelated binary data will happen to match your text pattern.
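For what it's worth, the brute-force search described above could be sketched roughly as follows. It assumes (and this is an assumption, not something the parser-free approach can verify) that text in a binary .doc is stored either as 8-bit Windows-1252 characters or as UTF-16LE, so it tries both encodings. The function name `naive_doc_contains` is hypothetical, and all of the caveats above still apply: a hit may come from metadata or from coincidental bytes, and a miss does not prove the word is absent.

```python
def naive_doc_contains(blob: bytes, needle: str) -> bool:
    """Heuristically test whether `needle` appears in the raw bytes of a .doc file.

    Case is ignored for ASCII letters only, because bytes.lower() folds
    nothing else.
    """
    if not needle:
        return False
    haystack = blob.lower()
    # Assumed encodings for text runs: 8-bit Windows-1252 or UTF-16LE.
    for encoding in ("cp1252", "utf-16-le"):
        try:
            pattern = needle.encode(encoding).lower()
        except UnicodeEncodeError:
            continue  # needle cannot be represented in this encoding
        if pattern in haystack:
            return True
    return False
```

A call such as `naive_doc_contains(blob_from_database, "hello")` could serve as a cheap pre-filter, with the documents it flags then checked by a real Word library or parser.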

If the Word libraries are available to you, I would go that route. If not, a homegrown parser may be your least bad option.
