Wikipedia has an excellent description of the ZIP file format, but the "central directory" structure is confusing to me. Specifically this:
This ordering allows a ZIP file to be created in one pass, but it is usually decompressed by first reading the central directory at the end.
The problem is that even the trailing header for the central directory is variable length. How then, can someone get the start of the central directory to parse?
(Oh, and I did spend some time looking at APPNOTE.TXT in vain before comi开发者_如何学Cng here and asking :P)
My condolences, reading the wikipedia description gives me the very strong impression that you need to do a fair amount of guess + check work:
Hunt backwards from the end for the 0x06054b50 end-of-directory tag, look forward 16 bytes to find the offset for the start-of-directory tag 0x02014b50, and hope that is it. You could do some sanity checks like looking for the comment length and comment string tags after the end-of-directory tag, but it sure feels like Zip decoders work because people don't put funny characters into their zip comments, filenames, and so forth. Based entirely on the wikipedia page, anyhow.
I was implementing zip archive support some time ago, and I search last few kilobytes for a end of central directory signature (4 bytes). That works pretty good, until somebody will put 50kb text into comment (which is unlikely to happen. To be absolutely sure, you can search last 64kb + few bytes, since comment size is 16 bit). After that, I look up for zip64 end of central dir locator, that's easier since it has fixed structure.
Here is a solution I have just had to roll out incase anybody needs this. This involves grabbing the central directory.
In my case I did not want any of the compression features that are offered in any of the zip solutions. I just wanted to know about the contents. The following code will return a ZipArchive of a listing of every entry in the zip.
It also uses a minimum amount of file access and memory allocation.
TinyZip.cpp
#include "TinyZip.h"
#include <cstdio>
namespace TinyZip
{
#define VALID_ZIP_SIGNATURE 0x04034b50
#define CENTRAL_DIRECTORY_EOCD 0x06054b50 //signature
#define CENTRAL_DIRECTORY_ENTRY_SIGNATURE 0x02014b50
#define PTR_OFFS(type, mem, offs) *((type*)(mem + offs)) //SHOULD BE OK
typedef struct {
unsigned int signature : 32;
unsigned int number_of_disk : 16;
unsigned int disk_where_cd_starts : 16;
unsigned int number_of_cd_records : 16;
unsigned int total_number_of_cd_records : 16;
unsigned int size_of_cd : 32;
unsigned int offset_of_start : 32;
unsigned int comment_length : 16;
} ZipEOCD;
ZipArchive* ZipArchive::GetArchive(const char *filepath)
{
FILE *pFile = nullptr;
#ifdef WIN32
errno_t err;
if ((err = fopen_s(&pFile, filepath, "rb")) == 0)
#else
if ((pFile = fopen(filepath, "rb")) == NULL)
#endif
{
int fileSignature = 0;
//Seek to start and read zip header
fread(&fileSignature, sizeof(int), 1, pFile);
if (fileSignature != VALID_ZIP_SIGNATURE) return false;
//Grab the file size
long fileSize = 0;
long currPos = 0;
fseek(pFile, 0L, SEEK_END);
fileSize = ftell(pFile);
fseek(pFile, 0L, SEEK_SET);
//Step back the size of the ZipEOCD
//If it doesn't have any comments, should get an instant signature match
currPos = fileSize;
int signature = 0;
while (currPos > 0)
{
fseek(pFile, currPos, SEEK_SET);
fread(&signature, sizeof(int), 1, pFile);
if (signature == CENTRAL_DIRECTORY_EOCD)
{
break;
}
currPos -= sizeof(char); //step back one byte
}
if (currPos != 0)
{
ZipEOCD zipOECD;
fseek(pFile, currPos, SEEK_SET);
fread(&zipOECD, sizeof(ZipEOCD), 1, pFile);
long memBlockSize = fileSize - zipOECD.offset_of_start;
//Allocate zip archive of size
ZipArchive *pArchive = new ZipArchive(memBlockSize);
//Read in the whole central directory (also includes the ZipEOCD...)
fseek(pFile, zipOECD.offset_of_start, SEEK_SET);
fread((void*)pArchive->m_MemBlock, memBlockSize - 10, 1, pFile);
long currMemBlockPos = 0;
long currNullTerminatorPos = -1;
while (currMemBlockPos < memBlockSize)
{
int sig = PTR_OFFS(int, pArchive->m_MemBlock, currMemBlockPos);
if (sig != CENTRAL_DIRECTORY_ENTRY_SIGNATURE)
{
if (sig == CENTRAL_DIRECTORY_EOCD) return pArchive;
return nullptr; //something went wrong
}
if (currNullTerminatorPos > 0)
{
pArchive->m_MemBlock[currNullTerminatorPos] = '\0';
currNullTerminatorPos = -1;
}
const long offsToFilenameLen = 28;
const long offsToFieldLen = 30;
const long offsetToFilename = 46;
int filenameLength = PTR_OFFS(int, pArchive->m_MemBlock, currMemBlockPos + offsToFilenameLen);
int extraFieldLen = PTR_OFFS(int, pArchive->m_MemBlock, currMemBlockPos + offsToFieldLen);
const char *pFilepath = &pArchive->m_MemBlock[currMemBlockPos + offsetToFilename];
currNullTerminatorPos = (currMemBlockPos + offsetToFilename) + filenameLength;
pArchive->m_Entries.push_back(pFilepath);
currMemBlockPos += (offsetToFilename + filenameLength + extraFieldLen);
}
return pArchive;
}
}
return nullptr;
}
ZipArchive::ZipArchive(long size)
{
m_MemBlock = new char[size];
}
ZipArchive::~ZipArchive()
{
delete[] m_MemBlock;
}
const std::vector<const char*> &ZipArchive::GetEntries()
{
return m_Entries;
}
}
TinyZip.h
#ifndef __TinyZip__
#define __TinyZip__
#include <vector>
#include <string>
namespace TinyZip
{
class ZipArchive
{
public:
ZipArchive(long memBlockSize);
~ZipArchive();
static ZipArchive* GetArchive(const char *filepath);
const std::vector<const char*> &GetEntries();
private:
std::vector<const char*> m_Entries;
char *m_MemBlock;
};
}
#endif
Usage:
TinyZip::ZipArchive *pArchive = TinyZip::ZipArchive::GetArchive("Scripts_unencrypt.pak");
if (pArchive != nullptr)
{
const std::vector<const char*> entries = pArchive->GetEntries();
for (auto entry : entries)
{
//do stuff
}
}
In case someone out there is still struggling with this problem - have a look at the repository I hosted on GitHub containing my project that could answer your questions.
Zip file reader
Basically what it does is download the central directory
part of the .zip
file which resides in the end of the file.
Then it will read out every file and folder name with it's path from the bytes and print it out to console.
I have made comments about the more complicated steps in my source code.
The program can work only till about 4GB .zip files. After that you will have to do some changes to the VM size and maybe more.
Enjoy :)
I recently encountered a similar use-case and figured I would share my solution for posterity since this post helped send me in the right direction.
Using the Zip file central directory offsets detailed on Wikipedia here, we can take the following approach to parse the central directory and retrieve a list of the contained files:
STEPS:
- Find the end of the central directory record (EOCDR) by scanning the zip file in binary format for the EOCDR signature (
0x06054b50
), beginning at the end of the file (i.e. read the file in reverse usingstd::ios::ate
if using aifstream
) - Use the offset located in the EOCDR (
16
bytes from the EOCDR) to position the stream reader at the beginning of the central directory - Use the offset (
46
bytes from the CD start) to position the stream reader at the file name and track its position start point - Scan until either another central directory header is found (
0x02014b50
) or the EOCDR is found, and track the position - Reset the reader to the start of the file name and read until the end
- Position the reader over the next header, or terminate if the EOCDR is found
The key point here is that the EOCDR is uniquely identified by a signature (0x06054b50
) that occurs only one time. Using the 16
byte offset, we can position ourselves to the first occurrence of the central directory header (0x02014b50
). Each record will have the same 0x02014b50
header signature, so you just need to loop through occurrences of the header signatures until you hit the EOCDR ending signature (0x06054b50
) again.
SUMMARY:
If you want to see a working example of the above steps, you can check out my minimal implementation (ZipReader) on GitHub here. The implementation can be used like this:
ZipReader zr;
if (zr.SetInput("blah.zip") == ZipReaderStatus::S_FAIL)
std::cout << "set input error" << std::endl;
std::vector<std::string> entries;
if (zr.GetEntries(entries) == ZipReaderStatus::S_FAIL)
std::cout << "get entries error" << std::endl;
for (auto entry : entries)
std::cout << entry << std::endl;
精彩评论