I have a fairly large BZ2 file with several text files in it. Is it possible to use Java to decompress certain files inside the BZ2 file and parse the data on the fly? Say a 300 MB BZ2 file contains 1 GB of text. Ideally, I'd like my Java program to read, say, 1 MB of the BZ2 file, decompress it on the fly, act on it, and keep reading the BZ2 file for more data. Is that possible?
Thanks
The commons-compress library from Apache is pretty good. Here's their examples page: http://commons.apache.org/proper/commons-compress/examples.html
Here's the Maven dependency (the latest version at the time of writing):
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-compress</artifactId>
    <version>1.10</version>
</dependency>
And here's my util method:
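import java.io.*;
import java.nio.charset.StandardCharsets;
import org.apache.commons.compress.compressors.*;

public static BufferedReader getBufferedReaderForCompressedFile(String fileIn) throws FileNotFoundException, CompressorException {
    FileInputStream fin = new FileInputStream(fileIn);
    // Buffering also provides the mark/reset support that format auto-detection needs
    BufferedInputStream bis = new BufferedInputStream(fin);
    // Detects the compression format (bzip2, gzip, ...) from the first bytes of the stream
    CompressorInputStream input = new CompressorStreamFactory().createCompressorInputStream(bis);
    // Assumes the text is UTF-8; change the charset if it is not
    return new BufferedReader(new InputStreamReader(input, StandardCharsets.UTF_8));
}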
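The reader returned by a method like this decompresses lazily, so it can be consumed line by line without loading the whole file. Here is a minimal sketch of such a consumer using only the standard library. The class name LineProcessor and the method processLines are hypothetical names made up for illustration, not part of commons-compress.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

// Hypothetical helper for illustration; not part of commons-compress.
class LineProcessor {
    /**
     * Passes each line of the reader to the handler, one at a time,
     * and returns the number of lines read. The reader is closed at the end.
     */
    static long processLines(BufferedReader reader, Consumer<String> handler) {
        long count = 0;
        // try-with-resources closes the reader and the streams it wraps
        try (BufferedReader in = reader) {
            String line;
            while ((line = in.readLine()) != null) {
                handler.accept(line); // act on the line, then move on
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }
}
```

It could be called as LineProcessor.processLines(getBufferedReaderForCompressedFile(path), line -> ...), where the lambda holds whatever parsing you need.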
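The Ant project contains a bzip2 library, which has an org.apache.tools.bzip2.CBZip2InputStream class. You can use this class to decompress the bzip2 file on the fly; it simply extends the standard Java InputStream class. Note that, according to its Javadoc, the constructor expects the caller to have already skipped the two-byte "BZ" magic header at the start of the file.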
You can use org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream from Apache commons-compress:
InputStream inputStream = new BZip2CompressorInputStream(new FileInputStream(xmlBz2File), true); // true = decompressConcatenated: keep reading when the file holds several concatenated bzip2 streams
and then org.apache.commons.compress.utils.IOUtils:
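long pos = 0; // total number of decompressed bytes read so far
int step = 1024 * 32;
byte[] buffer = new byte[step];
int actualLength;
// readFully fills the buffer from offset 0 and returns 0 only at end of stream
while ((actualLength = IOUtils.readFully(inputStream, buffer, 0, step)) > 0) {
    pos += actualLength;
    // Caution: a multi-byte UTF-8 character split across two chunks will not decode correctly
    String str = new String(buffer, 0, actualLength, StandardCharsets.UTF_8);
    // do whatever you want with str
}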
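One caveat with decoding fixed-size byte chunks is that a multi-byte UTF-8 character can straddle two chunks. A possible way around it is to let an InputStreamReader do the decoding and read characters in chunks instead of bytes. The sketch below uses only the standard library and accepts any InputStream, so a BZip2CompressorInputStream could be passed in. The class name ChunkedTextReader and the method readChunks are hypothetical, and the text is assumed to be UTF-8.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical helper for illustration; assumes UTF-8 encoded text.
class ChunkedTextReader {
    /**
     * Decodes the stream as UTF-8 and hands the text to the handler in
     * chunks of at most chunkChars characters. Returns the total number
     * of chars delivered. The stream is closed at the end.
     */
    static long readChunks(InputStream in, int chunkChars, Consumer<String> handler) {
        long total = 0;
        char[] buffer = new char[chunkChars];
        // The reader keeps incomplete byte sequences until the rest arrives,
        // so an encoded character is never cut in half at the byte level.
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int n;
            while ((n = reader.read(buffer, 0, buffer.length)) != -1) {
                if (n > 0) {
                    handler.accept(new String(buffer, 0, n));
                    total += n;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

A chunk can still end between the two halves of a surrogate pair (characters outside the Basic Multilingual Plane), so a handler that cares about that would need to carry the trailing char over to the next chunk.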
But it may be hard to deal with back pressure (the consumer may be faster than the producer, or vice versa), so I tried using Akka Streams with BZip2CompressorInputStream.
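If pulling in Akka Streams is too heavy, a simpler form of back pressure can be had with a bounded queue from java.util.concurrent: the producer blocks when the queue is full, and the consumer blocks when it is empty. This is a different technique from Akka Streams, shown only as a rough sketch. The class name BoundedPipeline, the method run, and the END sentinel are hypothetical, and the source is modelled as an Iterable of already decoded chunks.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Hypothetical sketch of back pressure with a bounded queue (not Akka Streams).
class BoundedPipeline {
    // End-of-input marker, compared by identity so it cannot collide with real data
    private static final String END = new String("END");

    /** Feeds chunks from source to handler through a queue of the given capacity. */
    static int run(Iterable<String> source, int capacity, Consumer<String> handler) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                try {
                    for (String chunk : source) {
                        queue.put(chunk); // blocks while the queue is full
                    }
                } finally {
                    queue.put(END); // signal completion even if the source fails
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int handled = 0;
        try {
            // take() blocks while the queue is empty
            for (String chunk = queue.take(); chunk != END; chunk = queue.take()) {
                handler.accept(chunk);
                handled++;
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while consuming", e);
        } finally {
            producer.interrupt(); // unblock the producer if the consumer stopped early
        }
        return handled;
    }
}
```

The capacity bounds how many decompressed chunks can sit in memory at once, so a slow consumer throttles the decompressing thread instead of letting data pile up.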