I have information about 12340 cars. This info is stored sequentially in two different files:
- car_names.txt, which contains 开发者_运维百科one line for the name of each car
- car_descriptions.txt, which contains the descriptions of each car. So 40 lines for each one, where the 6th line reads @CAR_NAME
I would like to do in python: to add for each car in the car_descriptions.txt file the name of each car (which comes from the other file) in the 7th line (it is empty), just after @CAR_NAME
I thought about:
1) read 1st file and store car names in a matrix/list 2) start to read 2nd file and each time it finds the string @CAR_NAME, just write the name on the next line
But I wonder if there is a faster approach, so the program reads each time one line from each file and makes the modification.
Thanks
First, make a generator that retrieves the car name from a sequence. You could yield every 7th line; I've made mine yield whatever line follows the line that starts with @CAR_NAME
:
def car_names(seq):
yieldnext=False
for line in seq:
if yieldnext: yield line
yieldnext = line.startswith('@CAR_NAME')
Now you can use itertools.izip
to go through both sequences in parallel:
from itertools import izip
with open(r'c:\temp\cars.txt') as f1:
with open(r'c:\temp\car_names.txt') as f2:
for (c1, c2) in izip(f1, car_names(f2)):
print c1, c2
I'm not sure if I completely understand what you're trying to do, is something like this?
f1 = open ('car_names.txt')
f2 = open ('car_descriptions.txt')
for car_name in f1.readlines ():
for i in range (6): # echo the first 6 lines
print f2.readline ()
assert f2.readline() == '@CAR_NAME' # skip the 7th, but assert that it is @CAR_NAME
print car_name # print the real car name
for i in range (33): # print the remaining 33 of the original 40
print f2.readline ()
Reading car_names.txt
will save you a piddling amount of memory (really really tiny by today's standards;-) but it absolutely won't be any faster than slurping it down at one gulp (best case it will be exactly the same speed, probably even a little bit slower unless your underlying operating system and storage system do a great job at read-lookahead caching / buffering). So I suggest:
import fileinput
carnames = open('car_names.txt').readlines()
carnamit = iter(carnames)
skip = False
for line in fileinput.input(['car_descriptions.txt'], True, '.bak'):
if not skip:
print line,
if '@CAR_NAME' in line:
print next(carnamit),
skip = True
else:
skip = False
So measure the speed of this, and an alternative that does
carnamit = open('car_names.txt')
at the start instead of reading all lines over a list like my first version -- I bet that the first version (in as much as there's any measurable and repeatable difference) will prove to be faster.
BTW, the fileinput module of the standard library is documented here, and it's truly a convenient way to perform "virtual rewriting in-place" of text files (typically keeping the old version as a backup, just in case -- but even if the machine should crash in the middle of the operation the old version of the data will still be there, so in a sense the "rewriting" operates atomically with respect to machine crashes, a nice little touch;-).
for line1, line2 in zip(file(filename1), file(filename2)):
# do your thing
or similar
12340 is not any data (in sense that there are much bigger data to process on the market).
Even better approach would use build in sqlite module. If not use some simple format like CSV for example. This is a structure organized. If not use threads, you could process two files simultaneously.
I think this fits the question:
- it reads the description file one line at a time
- when it sees @CAR_NAME, it still emits it, but replaces the next line in the description file with the next line from the names file
def merge_car_descriptions(namefile, descrfile):
names = open(namefile,'r')
descr = open(descrfile,'r')
for d in descr:
if '@CAR_NAME' in d:
yield d + names.readline()
descr.next()
else:
yield d
if __name__=='__main__':
import sys
if len(sys.argv) != 3:
sys.exit("Syntax: %s car_names.txt car_descriptions.txt" % sys.argv[0])
for l in merge_car_descriptions(sys.argv[1], sys.argv[2]):
print l,
精彩评论