How do I do this in Python (File Manipulation)?

Developer https://www.devze.com 2023-01-01 20:57 Source: Internet
I have a bunch of HTML files in an HTML folder. Those HTML files contain Unicode characters, which I dealt with by using filter(lambda x: x in string.printable, line). Now how do I write the changes back to the original file? What is the best way of doing it? Each HTML file is about 30 KB in size.

import os, string

for file in os.listdir("HTML/"):
    print file
    myfile = open('HTML/' + file)
    fileList = myfile.readlines()
    for line in fileList:
        #print line
        line = filter(lambda x: x in string.printable, line)
    myfile.close()


Use the fileinput module. It allows you to read and write to the same file in place:

import fileinput, os, string, sys

files = [os.path.join('HTML', filename) for filename in os.listdir('HTML/')]
for line in fileinput.input(files, inplace=True):
    # With inplace=True, anything written to stdout goes into the current file.
    line = ''.join(filter(lambda x: x in string.printable, line))
    sys.stdout.write(line)


At first I didn't understand what @~unutbu was getting at, but after reading the documentation for the fileinput module I found this, which I hadn't seen before (emphasis mine):

Optional in-place filtering: if the keyword argument inplace=1 is passed to fileinput.input() or to the FileInput constructor, the file is moved to a backup file and standard output is directed to the input file (if a file of the same name as the backup file already exists, it will be replaced silently). This makes it possible to write a filter that rewrites its input file in place. If the backup parameter is given (typically as backup='.<some extension>'), it specifies the extension for the backup file, and the backup file remains around; by default, the extension is '.bak' and it is deleted when the output file is closed. In-place filtering is disabled when standard input is read.

So I think his answer is best, and this explains why.
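To make the quoted behaviour concrete, here is a minimal sketch of the same in-place filter that also keeps a backup copy of each file. The function name strip_nonprintable_inplace and the '.orig' extension are illustrative choices, not part of the original answer, and the sketch assumes the files can be decoded with the platform's default text encoding.

```python
import fileinput
import string

# Set lookup is faster than scanning the string.printable string per character.
PRINTABLE = frozenset(string.printable)


def strip_nonprintable_inplace(paths, backup_ext=".orig"):
    """Rewrite each file in place, keeping only characters in string.printable.

    The function name and the default backup extension are hypothetical.
    """
    paths = list(paths)
    if not paths:
        # An empty file list would make fileinput fall back to sys.argv/stdin.
        return
    with fileinput.input(files=paths, inplace=True, backup=backup_ext) as stream:
        for line in stream:
            # While inplace=True is active, print() writes into the file
            # currently being rewritten, not to the terminal.
            print("".join(ch for ch in line if ch in PRINTABLE), end="")
```

It could be called as strip_nonprintable_inplace(os.path.join('HTML', name) for name in os.listdir('HTML')). Because backup is given, each original file stays next to the rewritten one with the extra extension, which is handy while checking that the filter removes only what was intended.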


This rewrites each file by hand: read all the lines, filter them, then reopen the file for writing. It should work on any operating system (see note 2).

import os, string

for file in os.listdir("HTML/"):
    print(file)
    myfile = open('HTML/' + file)
    fileList = myfile.readlines()
    for pos, line in enumerate(fileList):
        line = ''.join(filter(lambda x: x in string.printable, line))  # see note 1
        fileList[pos] = line
    myfile.close()
    myfile = open('HTML/' + file, "w")  # see note 2
    myfile.write("".join(fileList))  # readlines() keeps each line's newline
    myfile.close()

Note 1. Simply assigning to line does not change fileList. Variables really are labels (references) attached to objects: assigning to a label changes which object the label is attached to, not the object itself. The filter call creates a new string, which is then assigned to line; fileList still holds the original string, so the result has to be stored back with fileList[pos] = line.

Note 2. The "w" file mode empties the file on opening (it is the equivalent of the O_TRUNC flag when passed to the POSIX open() call). It is one of the standard modes, so it is available on every platform.
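The point of note 1 can be checked with a short experiment. This is only an illustrative sketch: the helper name clean and the sample strings are made up for the demonstration.

```python
import string


def clean(text):
    """Return text with every character outside string.printable removed."""
    return "".join(ch for ch in text if ch in string.printable)


lines = ["caf\u00e9\n", "plain\n"]

# Rebinding the loop variable only points the name `line` at a new string;
# the list keeps referring to the original strings.
for line in lines:
    line = clean(line)
after_rebinding = list(lines)

# Storing the result back through its index replaces the list entry itself.
for pos, line in enumerate(lines):
    lines[pos] = clean(line)
after_storing = list(lines)
```

After the first loop the list is unchanged; after the second loop the accented character is gone from the first entry, while the trailing newlines survive because string.printable includes whitespace.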
