Please advise me o开发者_开发技巧n how to approach this problem:
I have a sequential list of metadata in a document in MS Word. The basic idea is to create a Python algorithm to iterate over the information, retrieving just the name of the PROCESS, when is made a queue, from a database.
Example metadata:
Process: Process Walker (1965)
Exact reference: Walker Process Equipment., Inc. v. Food Machinery Corp.Link: http://caselaw.lp.findlaw.com/scripts/getcase.pl?court=US&vol=382&invol=
Type of procedure: Certiorari to the United States Court of Appeals for the Seventh Circuit. Parties: Walker Process Equipment, Inc.
Sector: Systems is ...
Start Date: October 12-13 Arguedas, 1965
Summary: Food Machinery Company has initiated a process to stop or slow the entry of competitors through the use of a patent obtained by fraud. The case concerned a patent on "knee action swing diffusers" used in aeration equipment for sewage treatment systems, and the question was whether "the maintenance and enforcement of a patent obtained by fraud before the patent office" may be a basis for antitrust punishment. Report of the evolution process: petitioner, in answer to respond...Importance: a) First case which established an analysis for the diagnosis of dispute…
There are about 200 pages containing the information above.
I have in mind the idea of implementing an algorithm in Python to be able to break this information sequence and try to store it in a web database (an open source application that I’m looking for) in order to allow for free consultations.
Check out AntiWord for converting the document to plaintext, then grep and sed to convert to a format you can pipe into your script.
Recent versions of Word allow you to save documents in XML format. This can either be done by explicitly "saving as" and choosing XML, or unzipping a .docx file and parsing its XML. The XML formats are documented online depending on the version of Word: 2003 Office XML or 2007/2010 Office Open XML.
Anything more powerful (e.g. requiring manipulation of the documents) requires interfacing with .NET (MS Open XML SDK or Aspose.Words).
精彩评论