What are the optimal strategies for analyzing a large number of websites or financial data sets and pulling out parametric data?
I'm classifying the following strategies as either "on-the-fly" or "deferred". Which is best?
1. On-the-fly: Process data on-the-fly and store parametric data into a database
2. Deferred: Store all the source data as ASCII into a file system and post process later, or with a processing-data-daemon
3. Deferred: Store all pages as a BLOB in a database to post-process later, or with a processing-data-daemon
Number 1 is simplest, especially if you only have a single server. Can #2 or #3 be more efficient with a single server, or do you only see the power with multiple servers?
Are there any Python projects that are already geared toward this kind of analysis?
Edit: by "best" I mean fastest execution, to prevent the user from waiting, with ease of programming as a secondary concern.
I'd use Celery, either on a single machine or on multiple machines, with the "on-the-fly" strategy. You can have an aggregation task that fetches the data and a processing task that analyzes it and stores it in a database. This is a highly scalable approach, and you can tune it according to your computing power.
The "on-the-fly" strategy is more efficient in a sense that you process your data in a single pass. The other two involve an extra step, re-retrieve the data from where you saved them and process them after that.
Of course, everything depends on the nature of your data and how you process it. If the processing phase is slower than the aggregation, the "on-the-fly" strategy will stall and wait for processing to complete. But again, you can configure Celery to work asynchronously and keep aggregating while some data is still unprocessed.
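For illustration, here is a minimal sketch of that two-task pattern with Celery. The broker URL and the `extract_parameters`/`save_to_db` helpers are placeholder assumptions, not part of the answer above; substitute your own parsing and storage code.

```python
# Minimal sketch of the aggregation + processing task pattern with Celery.
# Broker URL, extract_parameters() and save_to_db() are placeholders.
import requests
from celery import Celery, chain

app = Celery("analysis", broker="redis://localhost:6379/0")

@app.task
def aggregate(url):
    """Fetch the raw page/data for a single source."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return {"url": url, "body": response.text}

@app.task
def process(payload):
    """Parse the raw body, extract parametric data, and store it."""
    params = extract_parameters(payload["body"])  # your parsing logic
    save_to_db(payload["url"], params)            # your DB layer
    return params

# Queue a fetch-then-process chain for each source; workers on one or
# many machines pick the tasks up as capacity allows.
for url in ["https://example.com/page1", "https://example.com/page2"]:
    chain(aggregate.s(url), process.s()).delay()
```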
First: "fastest execution to prevent user from waiting" means some kind of deferred processing. Once you decide to defer the processing -- so the user doesn't see it -- the choice between flat-file and database is essentially irrelevant with respect to end-user-wait time.
Second: databases are slower than flat files for dumping raw data. Since you're going to use Celery and avoid end-user wait time anyway, however, the distinction between flat file and database becomes largely irrelevant.
> Store all the source data as ASCII into a file system and post process later, or with a processing-data-daemon
This is the fastest option. Use Celery tasks to load and process the flat files.
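A rough sketch of that deferred approach, assuming raw pages are dumped to a spool directory as plain-text files and a Celery worker processes them later. The spool path and the `parse_file`/`store_params` helpers are illustrative assumptions only.

```python
# Deferred strategy: fast dump to flat files, slow parsing handed to a worker.
# SPOOL_DIR, parse_file() and store_params() are placeholders.
import hashlib
import pathlib

import requests
from celery import Celery

SPOOL_DIR = pathlib.Path("/var/spool/rawdata")
app = Celery("postprocess", broker="redis://localhost:6379/0")

def dump_page(url):
    """Fast path seen by the user: just fetch and write the raw text to disk."""
    body = requests.get(url, timeout=30).text
    name = hashlib.sha1(url.encode()).hexdigest() + ".txt"
    path = SPOOL_DIR / name
    path.write_text(body, encoding="ascii", errors="replace")
    process_file.delay(str(path))  # hand the slow work to a worker

@app.task
def process_file(path):
    """Deferred path: parse the flat file and store the parametric data."""
    text = pathlib.Path(path).read_text(encoding="ascii", errors="replace")
    params = parse_file(text)   # your extraction logic
    store_params(path, params)  # your DB layer
```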