
Word IFilter for docx parser error

Developer https://www.devze.com 2022-12-14 08:12 Source: Internet
.docx documents do not appear to be indexed.

I put a unique string in a .docx file, but the file is not returned when I search for "one".

For example, take the following text, where "one" ends a line and "and" starts the next:

"Here is the text for line one and here is the text for line two."

The IFilter extracts it as:

"Here is the text for line oneand here is the text for line two."

So when the IFilter parses the .docx, it drops the line break separator and ends up with the token "oneand" instead of the two words "one" and "and".

In other words, the Word IFilter for .docx concatenates the last word of a line with the first word of the next line.
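
To make the symptom concrete, here is a minimal sketch of a word-based lookup. It is only an illustration of why the merged token defeats the search, not how the Windows indexer is actually implemented; the class and method names (WordSearch, ContainsWord) are made up for this example.

```csharp
using System;
using System.Linq;

public static class WordSearch
{
    // Splits the text on whitespace and basic punctuation, then looks for an
    // exact (case-insensitive) token match, roughly as a word-based index would.
    public static bool ContainsWord(string text, string word)
    {
        char[] separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '"' };
        return text
            .Split(separators, StringSplitOptions.RemoveEmptyEntries)
            .Any(token => string.Equals(token, word, StringComparison.OrdinalIgnoreCase));
    }
}
```

With the original sentence, ContainsWord(text, "one") finds the word. With the extracted sentence, the only candidate token is "oneand", so the same lookup fails.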

Can anyone give some ideas of how to get around this issue?

Thanks in advance.

OK, I figured this one out. The 64-bit IFilter is not working correctly: it merges words that are separated by line breaks instead of carrying the breaks through. I used Ionic.Zip to open the .docx zip archive and parsed the relevant XML files with a slightly modified version of DocxToText. This works perfectly now.
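
The idea itself is small: a .docx is a zip archive, and the body text lives in an XML part inside it. As a rough sketch of just that step, the snippet below uses the built-in System.IO.Compression classes rather than Ionic.Zip, and assumes the main part sits at the usual "word/document.xml" path instead of looking it up in "[Content_Types].xml". The names DocxPart and Load are hypothetical.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;

public static class DocxPart
{
    // Loads one XML part (by default the main document) from a .docx stream.
    // Returns null if the part does not exist in the archive.
    // Assumption: the caller knows the part name; "word/document.xml" is only
    // the usual location, not a guaranteed one.
    public static XDocument Load(Stream docx, string partName = "word/document.xml")
    {
        using (var archive = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.GetEntry(partName);
            if (entry == null)
                return null;

            using (Stream part = entry.Open())
            {
                // Keep whitespace so that spaces inside text elements survive
                return XDocument.Load(part, LoadOptions.PreserveWhitespace);
            }
        }
    }
}
```

A caller would open the file with File.OpenRead(path) and pass the stream in.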

Here is the modified code, originally created by Jevgenij Pankov:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Ionic.Zip;
using System.IO;
using System.Xml;

public class DocxToText
{
    // Namespace of the "[Content_Types].xml" part
    private const string ContentTypeNamespace =
        @"http://schemas.openxmlformats.org/package/2006/content-types";

    // Namespace of the WordprocessingML elements (w:p, w:r, w:t, ...)
    private const string WordprocessingMlNamespace =
        @"http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Selects the Override entry that declares the main document part
    private const string DocumentXmlXPath =
        "/t:Types/t:Override[@ContentType=\"" +
        "application/vnd.openxmlformats-officedocument." +
        "wordprocessingml.document.main+xml\"]";

    private const string BodyXPath = "/w:document/w:body";

    private string docxFile = "";
    private string docxFileLocation = "";

    public DocxToText(string fileName)
    {
        docxFile = fileName;
    }

    #region ExtractText()

    /// <summary>
    /// Extracts text from the Docx file.
    /// </summary>
    /// <returns>Extracted text.</returns>
    public string ExtractText()
    {
        if (string.IsNullOrEmpty(docxFile))
            throw new Exception("Input file not specified.");

        // Usually it is "/word/document.xml"
        docxFileLocation = FindDocumentXmlLocation();

        if (string.IsNullOrEmpty(docxFileLocation))
            throw new Exception("It is not a valid Docx file.");

        return ReadDocumentXml();
    }

    #endregion

    #region FindDocumentXmlLocation()

    /// <summary>
    /// Gets location of the "document.xml" zip entry.
    /// </summary>
    /// <returns>Location of the "document.xml".</returns>
    private string FindDocumentXmlLocation()
    {
        using (ZipFile zip = new ZipFile(docxFile))
        {
            foreach (ZipEntry entry in zip)
            {
                // Find "[Content_Types].xml" zip entry
                if (string.Compare(entry.FileName, "[Content_Types].xml", true) == 0)
                {
                    XmlDocument xmlDoc = new XmlDocument();
                    using (var stream = new MemoryStream())
                    {
                        // Unpack the entry into memory, then parse it
                        entry.Extract(stream);
                        stream.Position = 0;

                        xmlDoc.PreserveWhitespace = true;
                        xmlDoc.Load(stream);
                    }

                    // Create an XmlNamespaceManager for resolving namespaces
                    XmlNamespaceManager nsmgr =
                        new XmlNamespaceManager(xmlDoc.NameTable);
                    nsmgr.AddNamespace("t", ContentTypeNamespace);

                    // Find location of "document.xml"
                    XmlNode node = xmlDoc.DocumentElement.SelectSingleNode(
                        DocumentXmlXPath, nsmgr);

                    if (node != null)
                    {
                        string location =
                            ((XmlElement)node).GetAttribute("PartName");
                        return location.TrimStart(new char[] { '/' });
                    }
                    break;
                }
            }
        }
        return null;
    }

    #endregion

    #region ReadDocumentXml()

    /// <summary>
    /// Reads "document.xml" zip entry.
    /// </summary>
    /// <returns>Text contained in the document.</returns>
    private string ReadDocumentXml()
    {
        StringBuilder sb = new StringBuilder();

        using (ZipFile zip = new ZipFile(docxFile))
        {
            foreach (ZipEntry entry in zip)
            {
                if (string.Compare(entry.FileName, docxFileLocation, true) == 0)
                {
                    XmlDocument xmlDoc = new XmlDocument();
                    using (var stream = new MemoryStream())
                    {
                        entry.Extract(stream);
                        stream.Position = 0;

                        xmlDoc.PreserveWhitespace = true;
                        xmlDoc.Load(stream);
                    }

                    XmlNamespaceManager nsmgr =
                        new XmlNamespaceManager(xmlDoc.NameTable);
                    nsmgr.AddNamespace("w", WordprocessingMlNamespace);

                    XmlNode node =
                        xmlDoc.DocumentElement.SelectSingleNode(BodyXPath, nsmgr);

                    if (node == null)
                        return string.Empty;

                    sb.Append(ReadNode(node));
                    break;
                }
            }
        }
        return sb.ToString();
    }

    #endregion

    #region ReadNode()

    /// <summary>
    /// Reads content of the node and its nested children.
    /// </summary>
    /// <param name="node">XmlNode.</param>
    /// <returns>Text contained in the node.</returns>
    private string ReadNode(XmlNode node)
    {
        if (node == null || node.NodeType != XmlNodeType.Element)
            return string.Empty;

        StringBuilder sb = new StringBuilder();
        foreach (XmlNode child in node.ChildNodes)
        {
            if (child.NodeType != XmlNodeType.Element) continue;

            switch (child.LocalName)
            {
                case "t":                           // Text
                    sb.Append(child.InnerText.TrimEnd());

                    // Re-add one trailing space when the run preserves whitespace
                    string space =
                        ((XmlElement)child).GetAttribute("xml:space");
                    if (!string.IsNullOrEmpty(space) &&
                        space == "preserve")
                        sb.Append(' ');

                    break;

                case "cr":                          // Carriage return
                case "br":                          // Break (line or page)
                    sb.Append(Environment.NewLine);
                    break;

                case "tab":                         // Tab
                    sb.Append("\t");
                    break;

                case "p":                           // Paragraph
                    sb.Append(ReadNode(child));
                    sb.Append(Environment.NewLine);
                    sb.Append(Environment.NewLine);
                    break;

                default:                            // Runs, tables, etc.: recurse
                    sb.Append(ReadNode(child));
                    break;
            }
        }
        return sb.ToString();
    }

    #endregion
}

Here is how to use this code:

DocxToText dtt = new DocxToText(filepath);
string docxText = dtt.ExtractText();

Placing the cursor in the middle of a word and saving the document splits the word across two XML elements, with a "_GoBack" bookmark in between. After parsing with this routine, a space ends up between the two fragments instead of the fragments being merged back into one string. The "_GoBack" case is easy enough to handle, but there are probably others as well, perhaps "Track Changes" and who knows what else.
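
One way to approach the split-word case, sketched here under assumptions rather than offered as a complete DOCX parser: within a paragraph, append the text of every w:t element exactly as stored (no trimming, no separator between runs) and emit breaks only for explicit tab and break elements. Bookmarks then contribute nothing and the fragments rejoin. The sketch uses System.Xml.Linq instead of the XmlDocument code above, the names RunJoiner, ParagraphText and Paragraphs are hypothetical, and it has not been checked against tracked changes, text boxes or other nested content.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class RunJoiner
{
    // WordprocessingML main namespace
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of one w:p element. Text from adjacent runs is
    // concatenated as-is, so a word split by a bookmark is rejoined.
    public static string ParagraphText(XElement paragraph)
    {
        var sb = new StringBuilder();
        foreach (XElement e in paragraph.Descendants())
        {
            // Only look at direct children of a run (w:r); this skips the
            // w:tab elements that define tab stops in paragraph properties.
            if (e.Parent == null || e.Parent.Name != W + "r")
                continue;

            if (e.Name == W + "t")
                sb.Append(e.Value);              // literal text, spaces included
            else if (e.Name == W + "tab")
                sb.Append('\t');
            else if (e.Name == W + "br" || e.Name == W + "cr")
                sb.Append('\n');
        }
        return sb.ToString();
    }

    // Returns one string per paragraph found in a document.xml payload.
    public static string[] Paragraphs(string documentXml)
    {
        XDocument doc = XDocument.Parse(documentXml, LoadOptions.PreserveWhitespace);
        return doc.Descendants(W + "p").Select(ParagraphText).ToArray();
    }
}
```

Because spaces are taken from the w:t content itself, words that Word stored in separate runs still come out separated, while a word that was merely cut in two by a bookmark comes out whole.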

Does a more detailed parsing algorithm exist for DOCX?



