How to create a large file on a VFAT partition efficiently in embedded Linux_问答_开发者

I'm trying to create a large empty file on a VFAT partition by using the `dd' command in an embedded linux box:

dd if=/dev/zero of=/mnt/flash/file bs=1M count=1 seek=1023

The intention was to skip the first 1023 blocks and write only 1 block at the end of the file, which should be very quick on a native EXT3 partition, and it indeed is. However, this operation turned out to be quite slow on a VFAT partition, along with the following message:

lowmem_shrink:: nr_to_scan=128, gfp_mask=d0, other_free=6971, min_adj=16
// ... more `lowmem_shrink' messages

Another attempt was to fopen() a file on the VFAT partition and then fseek() to the end to write the data, which has also proved slow, along with the same messages from th开发者_C百科e kernel.

So basically, is there a quick way to create the file on the VFAT partition (without traversing the first 1023 blocks)?

Thanks.

Why are VFAT "skipping" writes so slow ?

Unless the VFAT filesystem driver were made to "cheat" in this respect, creating large files on FAT-type filesystems will always take a long time. The driver, to comply with FAT specification, will have to allocate all data blocks and zero-initialize them, even if you "skip" the writes. That's because of the "cluster chaining" FAT does.

The reason for that behaviour is FAT's inability to support either:

UN*X-style "holes" in files (aka "sparse files")
that's what you're creating on ext3 with your testcase - a file with no data blocks allocated to the first 1GB-1MB of it, and a single 1MB chunk of actually committed, zero-initialized blocks) at the end.
NTFS-style "valid data length" information.
On NTFS, a file can have uninitialized blocks allocated to it, but the file's metadata will keep two size fields - one for the total size of the file, another for the number of bytes actually written to it (from the beginning of the file).

Without a specification supporting either technique, the filesystem would always have to allocate and zerofill all "intermediate" data blocks if you skip a range.

Also remember that on ext3, the technique you used does not actually allocate blocks to the file (apart from the last 1MB). If you require the blocks preallocated (not just the size of the file set large), you'll have to perform a full write there as well.

How could the VFAT driver be modified to deal with this ?

At the moment, the driver uses the Linux kernel function cont_write_begin() to start even an asynchronous write to a file; this function looks like:

/*
 * For moronic filesystems that do not allow holes in file.
 * We may have to extend the file.
 */
int cont_write_begin(struct file *file, struct address_space *mapping,
                    loff_t pos, unsigned len, unsigned flags,
                    struct page **pagep, void **fsdata,
                    get_block_t *get_block, loff_t *bytes)
{
    struct inode *inode = mapping->host;
    unsigned blocksize = 1 << inode->i_blkbits;
    unsigned zerofrom;
    int err;

    err = cont_expand_zero(file, mapping, pos, bytes);
    if (err)
            return err;

    zerofrom = *bytes & ~PAGE_CACHE_MASK;
    if (pos+len > *bytes && zerofrom & (blocksize-1)) {
            *bytes |= (blocksize-1);
            (*bytes)++;
    }

    return block_write_begin(mapping, pos, len, flags, pagep, get_block);
}

That is a simple strategy but also a pagecache trasher (your log messages are a consequence of the call to cont_expand_zero() which does all the work, and is not asynchronous). If the filesystem were to split the two operations - one task to do the "real" write, and another one to do the zero filling, it'd appear snappier.

The way this could be achieved while still using the default linux filesystem utility interfaces were by internally creating two "virtual" files - one for the to-be-zerofilled area, and another for the actually-to-be-written data. The real file's directory entry and FAT cluster chain would only be updated once the background task is actually complete, by linking its last cluster with the first one of the "zerofill file" and the last cluster of that one with the first one of the "actual write file". One would also want to go for a directio write to do the zerofilling, in order to avoid trashing the pagecache.

Note: While all this is technically possible for sure, the question is how worthwhile would it be to do such a change ? Who needs this operation all the time ? What would side effects be ?
The existing (simple) code is perfectly acceptable for smaller skipping writes, you won't really notice its presence if you create a 1MB file and write a single byte at the end. It'll bite you only if you go for filesizes on the order of the limits of what the FAT filesystem allows you to do.

Other options ...

In some situations, the task at hand involves two (or more) steps:

freshly format (e.g.) a SD card with FAT
put one or more big files onto it to "pre-fill" the card
(app-dependent, optional)
pre-populate the files, or
put a loopback filesystem image into them

One of the cases I've worked on we've folded the first two - i.e. modified mkdosfs to pre-allocate/ pre-create files when making the (FAT32) filesystem. That's pretty simple, when writing the FAT tables just create allocated cluster chains instead of clusters filled with the "free" marker. It's also got the advantage that the data blocks are guaranteed to be contiguous, in case your app benefits from this. And you can decide to make mkdosfs not clear the previous contents of the data blocks. If you know, for example, that one of your preparation steps involves writing the entire data anyway or doing ext3-in-file-on-FAT (pretty common thing - linux appliance, sd card for data exchange with windows app/gui), then there's no need to zero out anything / double-write (once with zeroes, once with whatever-else). If your usecase fits this (i.e. formatting the card is a useful / normal step of the "initialize it for use" process anyway) then try it out; a suitably-modified mkdosfs is part of TomTom's dosfsutils sources, see mkdosfs.c search for the -N command line option handling.

When talking about preallocation, as mentioned, there's also posix_fallocate(). Currently on Linux when using FAT, this will do essentially the same as a manual dd ..., i.e. wait for the zerofill. But the specification of the function doesn't mandate it being synchronous. The block allocation (FAT cluster chain generation) would have to be done synchronously, but the VFAT on-disk dirent size update and the data block zerofills could be backgrounded / delayed (i.e. either done at low-prio in background or only done if explicitly requested via fdsync() / sync() so that the app can e.g. alloc blocks, write the contents with non-zeroes itself ...). That's technique / design though; I'm not aware of anyone having done that kernel modification yet, if only for experimenting.