开发者

How do i extract specific lines of data from a huge Excel sheet using Python?

开发者 https://www.devze.com 2023-01-06 20:39 出处:网络
I need to get specific lines of data that have certain key words in them (na开发者_Go百科mes) and write them to another file. The starting file is a 1.5 GB Excel file. I can\'t just open it up and sav

I need to get specific lines of data that have certain key words in them (na开发者_Go百科mes) and write them to another file. The starting file is a 1.5 GB Excel file. I can't just open it up and save it as a different format. How should I handle this using python?


I'm the author and maintainer of xlrd. Please edit your question to provide answers to the following questions. [Such stuff in SO comments is VERY hard to read]

  1. How big is the file in MB? ["Huge" is not a useful answer]

  2. What software created the file?

  3. How much memory do you have on your computer?

  4. Exactly what happens when you try to open the file using Excel? Please explain "I can open it partially".

  5. Exactly what is the error message that you get when you try to open "C:\bigfile.xls" with your script using xlrd.open_workbook? Include the script that you ran, the full traceback, and the error message

  6. What operating system, what version of Python, what version of xlrd?

  7. Do you know how many worksheets there are in the file?


It sounds to me like you have a spreadsheet that was created using Excel 2007 and you have only Excel 2003.

Excel 2007 can create worksheets with 1,048,576 rows by 16,384 columns while Excel 2003 can only work with 65,536 rows by 256 columns. Hence the reason you can't open the entire worksheet in Excel.

If the workbook is just bigger in dimension then xlrd should work for reading the file, but if the file is actually bigger than the amount of memory you have in your computer (which I don't think is the case here since you can open the file with EditPad lite) then you would have to find an alternate method because xlrd reads the entire workbook into memory.

Assuming the first case:

import xlrd

wb_path = r'c:\bigfile.xls'
output_path = r'c:\output.txt'

wb = xlrd.open(wb_path)
ws = wb.sheets()[0]  # assuming you want to work with the first sheet in the workbook

with open(output_path, 'w') as output_file:
    for i in xrange(ws.nrows):
        row = [cell.value for cell in ws.row(i)]

        # ... replace the following if statement with your own conditions ...
        if row[0] == u'interesting':
            output_file.write('\t'.join(row) + '\r\n')

This will give you a tab-delimited output file that should open in Excel.

Edit:

Based on your answer to John Machin's question 5, make sure there is a file called 'bigfile.xls' located in the root of your C drive. If the file isn't there, change the wb_path to the correct location of the file you want to open.


I haven't used it, but xlrd looks like it does a good job reading Excel data.


Your problem is that you are using Excel 2003 .. You need to use a more recent version to be able to read this file. 2003 will not open files bigger than 1M rows.

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号