开发者

how to retrieve the large file

开发者 https://www.devze.com 2023-01-14 07:29 出处:网络
I am working on an application wherein i need to compare 10^8 entries(alphanumeric entries). To retrieve the entries from file( file size is 1.5 GB) and then to compare th开发者_JAVA百科em, i need to

I am working on an application wherein i need to compare 10^8 entries(alphanumeric entries). To retrieve the entries from file( file size is 1.5 GB) and then to compare th开发者_JAVA百科em, i need to take less than 5 minutes of time. So, what would b the effective way to do that, since, only retrieving time is exceeding 5 min. And i need to work on file only. please suggest a way out. I m working on windows with 3GB RAM n 100Gb hard disk.


  • Read a part of the file, sort it, write it to a temporary file.
  • Merge-sort the resulting files.


Error handling and header includes are not included. You need to provide DataType and cmpfunc, samples are provided. You should be able to deduce the core workings from this snippet:

#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

typedef char DataType; // is this alphanumeric?
int cmpfunc(char const *left, char const *right)
{
    return *right - *left;
}

int main(int argc, char **argv)
{
    int fd = open(argv[1], O_RDWR|O_LARGEFILE);
    if (fd == -1)
        return 1;
    struct stat st;
    if (fstat(fd, &st) != 0)
        return 1;
    DataType *data = mmap(NULL, st.st_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    if (!data)
        return 1;
    qsort(data, st.st_size / sizeof(*data), cmpfunc);
    if (0 != msync(data, st.st_size, MS_SYNC))
        return 1;
    if (-1 == munmap(data, st.st_size))
        return 1;
    if (0 != close(fd))
        return 1;
    return 0;    
}

I can't imagine you can get much faster than this. Be sure you have enough virtual memory address space (1.5GB is pushing it but will probably just work on 32bit Linux, you'll be able to manage this on any 64bit OS). Note that this code is "limited" to working on a POSIX compliant system.

In terms of C and efficiency, this approach puts the entire operation in the hands of the OS, and the excellent qsort algorithm.


If retrieving time is exceeding 5 min it seems that you need to look at how you are reading this file. One thing that has caused bad performance for me is that a C implementation sometimes uses thread-safe I/O operations by default, and you can gain some speed by using thread-unsafe I/O.

What kind of computer will this be run on? Many computers nowadays have several gigabytes of memory, so perhaps it will work to just read it all into memory and then sort it there (with, for example, qsort)?

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号