
Process files in Java EE [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.

Closed 5 years ago.

I have a system that is supposed to take large files containing documents, split them up into the individual documents, and create document objects to be persisted with JPA (or at least that is assumed in this question).

Each file contains anywhere from 1 document to 100 000 documents. The files come in various types:

  • Compressed
    • Zip
    • Tar + gzip
    • Gzip
  • Plain-text
  • XML
  • PDF

Now the biggest concern is that the specification forbids accessing local files, at least in the way that I'm used to.

I could save the files to a database table, but is that really a good way to do it? The files can be up to 2GB and accessing the files from the database would require that you download the whole file, either into memory or onto disk.

My first thought was to separate this process from the application server and use a more traditional approach, but I've been thinking about how to keep it on the application server for future purposes such as clustering etc.

My questions are basically

  1. Is there a standard way or a recommended way of dealing with this in Java EE?
  2. Is there an application server specific way around this?
  3. Can you justify breaking this process out of the application server? And how would you design the communications channel between these two separate systems?


I sketch here a few more propositions and consider the following concerns:

  • scalability (file size, clustering, etc.)
  • batch architecture (job recovery, error handling, monitoring, etc.)
  • compliance with J2EE

With JCA

JCA connectors belong to the Java EE stack and permit inbound/outbound connectivity from/to the EJB world. JDBC and JMS are usually implemented as JCA connectors. An inbound JCA connector can use threads (through the Work/WorkManager abstraction) and transactions. It can then forward any processing to a message-driven bean (MDB).

  • Write a JCA connector that polls for new files, then processes them and delegates further processing to a message-driven bean in a synchronous way.
  • The MDB can then persist the information in the database with JPA (a sketch follows this list).
  • The JCA connector has control over the transaction, and several MDB invocations can take place in the same transaction.
  • The file system is not transactional, so you will somehow need to figure out how to deal with errors such as faulty input files.
  • You can probably use streaming (InputStream) all along the pipeline.
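
As a rough illustration of the MDB side of this option, here is a minimal sketch. The DocumentListener interface and the pollDirectory activation property would be defined by your own resource adapter, and the Document entity belongs to your domain model, so every name here is an assumption rather than something Java EE ships with:

public interface DocumentListener {
    // hypothetical listener interface the custom resource adapter would invoke
    // for every document it extracts from an incoming file
    void onDocument(String sourceFile, String body);
}

import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

@MessageDriven(
    messageListenerInterface = DocumentListener.class,
    activationConfig = {
        // property names are defined by the resource adapter, not by the EJB spec
        @ActivationConfigProperty(propertyName = "pollDirectory",
                                  propertyValue = "/var/spool/incoming")
    })
public class DocumentImportBean implements DocumentListener {

    @PersistenceContext(unitName = "documentsPU")   // assumed persistence unit
    private EntityManager em;

    // Called synchronously by the connector, inside the connector-managed transaction,
    // so several onDocument() calls can share one transaction.
    public void onDocument(String sourceFile, String body) {
        Document doc = new Document();   // assumed JPA entity
        doc.setSourceFile(sourceFile);
        doc.setBody(body);
        em.persist(doc);
    }
}

The resource adapter itself (the part that polls, splits the file and drives the WorkManager) is where most of the effort of this option goes.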

With plain threads

We can achieve more or less the same as with JCA by using threads launched from a servlet context listener (or possibly an EJB Timer).

  • The thread polls for new files; if a file is found, it processes it and delegates further processing to a regular SLSB in a synchronous way (sketched below).
  • Threads in the web container have access to the UserTransaction and can control the transaction.
  • The EJB can be local so that the InputStream is passed by reference.
  • Deployment of the web module + EJB can be done with an EAR.
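
A minimal sketch of this variant, assuming a /var/spool/incoming directory, a 30-second polling interval and a local DocumentProcessor SLSB (all made-up names, not anything prescribed by the platform):

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import javax.ejb.EJB;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import javax.servlet.annotation.WebListener;

// On pre-Servlet-3.0 containers, register the listener in web.xml instead of @WebListener.
@WebListener
public class FilePollingListener implements ServletContextListener {

    @EJB
    private DocumentProcessor processor;   // assumed local stateless session bean

    private ScheduledExecutorService scheduler;

    public void contextInitialized(ServletContextEvent sce) {
        scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(this::pollOnce, 0, 30, TimeUnit.SECONDS);
    }

    public void contextDestroyed(ServletContextEvent sce) {
        scheduler.shutdownNow();
    }

    private void pollOnce() {
        File[] files = new File("/var/spool/incoming").listFiles();
        if (files == null) {
            return;
        }
        for (File file : files) {
            try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
                // Local call: the InputStream is passed by reference, no copy is made.
                processor.process(file.getName(), in);
                file.delete();   // naive "processed" marker; real code needs sturdier bookkeeping
            } catch (IOException e) {
                // Faulty input file: move it aside and log it instead of aborting the poll.
            }
        }
    }
}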

With JMS

To avoid the need for several concurrent polling threads and the problem of job acquisition/locking, the actual processing can be carried out asynchronously using JMS. JMS can also be interesting for splitting the processing into smaller tasks.

  • A periodic task polls for new files. If a file is found, a JMS message is queued (see the sketch after this list).
  • When the JMS message is delivered, the file is read and processed, and the information is persisted in the database with JPA.
  • If JMS processing fails, the application server may retry automatically or put the message in the dead-message queue.
  • Monitoring/error handling is more complicated.
  • You can probably use streaming.
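
A minimal sketch of the consuming side, assuming the poller puts the file path into a TextMessage on a jms/IncomingFilesQueue destination and that a DocumentProcessor SLSB does the parsing and JPA persistence:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

import javax.ejb.ActivationConfigProperty;
import javax.ejb.EJB;
import javax.ejb.MessageDriven;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.TextMessage;

@MessageDriven(activationConfig = {
    @ActivationConfigProperty(propertyName = "destinationType",
                              propertyValue = "javax.jms.Queue"),
    // the exact property name for the destination can vary by container
    @ActivationConfigProperty(propertyName = "destination",
                              propertyValue = "jms/IncomingFilesQueue")
})
public class IncomingFileMdb implements MessageListener {

    @EJB
    private DocumentProcessor processor;   // assumed SLSB doing the parsing + JPA persistence

    public void onMessage(Message message) {
        try {
            String path = ((TextMessage) message).getText();
            InputStream in = new BufferedInputStream(new FileInputStream(path));
            try {
                processor.process(path, in);
            } finally {
                in.close();
            }
        } catch (Exception e) {
            // Rethrowing marks the delivery as failed: the server retries and eventually
            // moves the message to the dead-message queue (behaviour is container-specific).
            throw new RuntimeException(e);
        }
    }
}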

With ESB

Many projects have emerged in the past years to deal with integration: JBI, ServiceMix, OpenESB, Mule, Spring Integration, Java CAPS, BPEL. Some are technologies, some are platforms, and there is some overlap between them. They all come with a wagonload of connectors to route, transform and orchestrate message flows. IMHO, the messages are supposed to be small pieces of information, and it may be hard to use these technologies to process your large data files. The Enterprise Integration Patterns website is an excellent resource for more information.

IMO, the approach that best fits the Java EE philosophy is JCA, but the effort to invest is relatively high. In your case, using plain threads that delegate further processing to SLSBs is maybe the easiest solution. The JMS approach (close to the proposition of P. Thivent) can be interesting if the processing pipeline gets more complicated. Using an ESB seems overkill to me.


Is there a standard way or a recommended way of dealing with this in Java EE?

I'd use a real integration layer (as in EAI) for this purpose, running as an external process. Integration tools (ETL, EAI, ESB) are specifically designed to deal with... integration and many of them provide everything required out of the box (simplified version: transport, connectors, transformation, routing, security).

Basically, when dealing with files, a file connector is used to monitor a directory for incoming files, which are then parsed/split into messages (optionally applying some transformations) and sent to an endpoint for business processing.

Have a look at Mule ESB for example (it has a File Connector, supports many transports, and can be run as a standalone process). Or maybe Spring Integration (coupled with Spring Batch?), which has File and JMS adapters too, but I don't have much experience with it so I can't really say much about it. Or, if you are rich, you could look at Tibco EMS, WebMethods, etc. Or build your own solution using some parsing library (e.g. jFFP or Flatworm).

Is there an application server specific way around this?

I'm not aware of anything like this.

Can you justify breaking this process out of the application server? And how would you design the communications channel between these two separate systems?

As I said, I'd use an external process for the file processing stuff (better suited) and send the content of the file as messages over JMS to the app server for the business processing (and thus benefit from Java EE features such as load balancing and transaction management).
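
A minimal sketch of what the sending side of such an external process could look like, assuming jms/ConnectionFactory and jms/DocumentsQueue are bound in the app server's JNDI tree and that splitDocuments() stands in for whatever splitting logic your file formats need:

import java.io.File;
import java.util.Collections;
import java.util.List;

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.naming.InitialContext;

public class FileSplitterClient {

    public static void main(String[] args) throws Exception {
        InitialContext jndi = new InitialContext();   // configured to point at the app server
        ConnectionFactory factory = (ConnectionFactory) jndi.lookup("jms/ConnectionFactory");
        Queue queue = (Queue) jndi.lookup("jms/DocumentsQueue");

        Connection connection = factory.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(queue);

            for (String document : splitDocuments(new File(args[0]))) {
                TextMessage msg = session.createTextMessage(document);
                msg.setStringProperty("sourceFile", args[0]);   // lets the consumer trace the origin
                producer.send(msg);
            }
        } finally {
            connection.close();
        }
    }

    // Placeholder: real code would stream the file and split it according to its type
    // (zip, tar+gzip, gzip, plain text, XML, PDF).
    private static List<String> splitDocuments(File file) {
        return Collections.emptyList();
    }
}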


accessing the files from the database would require that you download the whole file, either into memory or onto disk.

This is not entirely true. You are not forced to put the whole thing into an intermediate byte[] or so. You can just keep using streams. Get an InputStream of it using ResultSet#getBinaryStream() and immediately handle it the usual way, e.g. by writing to HttpServletResponse#getOutputStream(). The cost is only the buffer size, which you can define yourself.
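
For example, a minimal sketch of a download servlet that streams a BLOB column directly to the response; the stored_file table, its content column and the jdbc/FilesDS DataSource are assumptions:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import javax.annotation.Resource;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.sql.DataSource;

@WebServlet("/files/download")
public class FileDownloadServlet extends HttpServlet {

    @Resource(name = "jdbc/FilesDS")   // assumed container-managed DataSource
    private DataSource dataSource;

    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        long id = Long.parseLong(request.getParameter("id"));
        try (Connection connection = dataSource.getConnection();
             PreparedStatement statement = connection.prepareStatement(
                     "SELECT content FROM stored_file WHERE id = ?")) {
            statement.setLong(1, id);
            try (ResultSet rs = statement.executeQuery()) {
                if (!rs.next()) {
                    response.sendError(HttpServletResponse.SC_NOT_FOUND);
                    return;
                }
                try (InputStream in = rs.getBinaryStream("content")) {
                    OutputStream out = response.getOutputStream();
                    byte[] buffer = new byte[8192];   // only this buffer is held in memory
                    for (int read; (read = in.read(buffer)) != -1; ) {
                        out.write(buffer, 0, read);
                    }
                }
            }
        } catch (SQLException e) {
            throw new ServletException(e);
        }
    }
}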

Is there a standard way or a recommended way of dealing with this in Java EE?

Either the database or a fixed disk file system path with r/w access for the appserver. E.g. /var/webapp/files on the root disk.


I think the healthiest way to do it is to do without a Java application server.

Application servers like to manage resources (CPU, memory, threads) their own way. Performing long-running, I/O intensive batch processing is prone to distorting this kind of resource management.

I suggest using an external process to split up the files, with periodic tidying up to keep disk usage under control, and using the AS for read access via the file system, the way BalusC suggested.

I suppose concurrent access issues would be dealt with by the JPA layer -- which I admittedly don't know much about, but I think it also comes in a J2SE flavour.


The specification forbids accessing files using java.io. There are other legal ways to access files, e.g. via a DataSource/JDBC driver, or via a resource connector.

See p. 545 of "JSR 220: Enterprise JavaBeans™, Version 3.0: EJB Core Contracts and Requirements".


... using JDBC for file access. Could you please explain it in a bit more detail?

A file is a data store in the same way that a database is. It's a pretty good data store for serially accessed, unstructured, character data, and not so great when you want transaction safety, multi-user access, writable random-access, or structured binary data. In an enterprise system you tend to have at least one of the latter set of requirements nearly all of the time.

Although it's not strictly true to say "In an enterprise system there are no files" (because there are log files and nearly all databases use files at a low level) it's a pretty good design rule-of-thumb, because of all of the problems that data files cause in a high performance, multi-user, transaction-safe, read-write, enterprise system.

Unfortunately the business world is full of business data stored in files. You have to deal with them. Some files (e.g. Excel spreadsheets) have enough in common with a simple database that they can be worth accessing through a JDBC driver. I've never heard of anyone accessing plain text files through a JDBC driver, but you could - or you could use a more generic resource adapter instead (according to the EJB3 specification, JDBC is a resource manager API).
