
Using Python to extract images and text from a Word document

Developer https://www.devze.com 2023-03-13 08:48 Source: web

I would like to run a script on a folder full of Word documents that reads through each document and pulls out the images and their captions (the text right below each image). From the research I've done, I think pywin32 might be a viable solution. I know how to use pywin32 to find strings and pull them out, but I need help with the images part. How can I read through a docx file and have an event occur when an image is found? Thank you for any help! I am using Python 2.7.


A docx file is a zip archive, so you can unzip it to extract the images; embedded pictures are stored under word/media/ inside the archive.
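As a rough illustration of the unzip approach, here is a minimal sketch that uses only the standard library. The function name extract_images and the output directory are made up for this example, and it assumes the pictures are embedded in the file (stored under word/media/) rather than linked from elsewhere.

```python
import os
import zipfile


def extract_images(docx_path, out_dir):
    """Copy every file under word/media/ in a .docx into out_dir."""
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Embedded pictures are stored under word/media/ in the package.
            if name.startswith("word/media/") and not name.endswith("/"):
                if not os.path.isdir(out_dir):
                    os.makedirs(out_dir)
                target = os.path.join(out_dir, os.path.basename(name))
                with open(target, "wb") as handle:
                    handle.write(archive.read(name))
                saved.append(target)
    return saved
```

To process a whole folder, you could call extract_images once per .docx file (for instance from a loop over os.listdir), giving each document its own output directory so that image names such as image1.png do not overwrite each other.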


For some inspiration, see the post "How can I search a word in a Word 2007 .docx file?"


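You can use the Python module docx2txt to extract both the text and the images from docx files. For example, docx2txt.process("file.docx", "img_dir") returns the document text and writes the images into img_dir.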


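import docx  # provided by the python-docx package

document = docx.Document(filepath)  # filepath: path to a .docx file
# inline_shapes lists the pictures placed inline with the text
for shape in document.inline_shapes:
    print(shape.width, shape.height)  # dimensions in EMU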

Try this; it should work for pictures that are placed inline with the text.
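The snippet above reports image sizes but not the captions the question asks about. One possible way to pair each image with the text below it is to read the XML inside the docx directly, as in the sketch below. It is only a sketch under stated assumptions: the function name images_with_captions is invented for this example, the caption is assumed to be the paragraph immediately after the one holding the picture, and relationship targets are assumed to be relative paths such as media/image1.png. Pictures inside text boxes or other unusual layouts may need extra handling.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used inside a .docx package.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
A = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
R = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"
REL = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def images_with_captions(docx_path):
    """Return (media_path, caption) pairs in document order."""
    with zipfile.ZipFile(docx_path) as archive:
        body = ET.fromstring(archive.read("word/document.xml"))
        rels = ET.fromstring(archive.read("word/_rels/document.xml.rels"))

    # Map relationship ids (rId1, rId2, ...) to targets like "media/image1.png".
    targets = {}
    for rel in rels.iter(REL + "Relationship"):
        targets[rel.get("Id")] = rel.get("Target")

    paragraphs = list(body.iter(W + "p"))
    results = []
    for index, paragraph in enumerate(paragraphs):
        for blip in paragraph.iter(A + "blip"):
            target = targets.get(blip.get(R + "embed"))
            if target is None:
                continue  # linked picture or missing relationship
            # Assumption: the caption is the text of the next paragraph.
            caption = ""
            if index + 1 < len(paragraphs):
                following = paragraphs[index + 1]
                caption = "".join(t.text or "" for t in following.iter(W + "t"))
            # Assumption: targets are relative to the word/ folder.
            results.append(("word/" + target, caption))
    return results
```

Each returned media path names an entry inside the zip archive, so the image bytes can be read with zipfile.ZipFile(docx_path).read(media_path) and saved next to the caption.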
