I am using an API (let's pretend it's Facebook) to gather data between two given dates. Because of API restrictions (like most APIs have), I can only grab so many results at a time, and therefore have to page my way through the results.
Here is my issue/question, though: is it better to
- get fewer results back and make more calls to the API, or
- get more results back and make fewer calls to the API?
I am running a 4GB cloud server instance.
The data I'm looking at is in XML format and contains about 20k entries, each with roughly another 20 tags inside it. Once completely pulled down, the data ends up being about 10MB. My problem is that while my server is hitting the API to gather this information, the CPU and memory spike to nearly 100%. I've tried retrieving 500 at a time, 1000 at a time, 5000 at a time. Is this something where I need to gather 20 at a time, or is there something else I should look at?
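For reference, my fetch loop looks roughly like this (the endpoint and parameter names are simplified stand-ins, not the real API):

```ruby
require 'net/http'
require 'uri'

# Hypothetical endpoint and parameter names; swap in the real API's.
BASE_URL  = 'https://api.example.com/entries'
PAGE_SIZE = 1000  # the batch size I've been experimenting with

def fetch_page(offset)
  uri = URI("#{BASE_URL}?limit=#{PAGE_SIZE}&offset=#{offset}")
  Net::HTTP.get(uri)  # raw XML for one page
end

def process(xml)
  # placeholder: parse and store this page
end

offset = 0
loop do
  xml = fetch_page(offset)
  break if xml.strip.empty?  # stop when the API has nothing left to return
  process(xml)
  offset += PAGE_SIZE
end
```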
I'm not sure what else to provide; if there is something else that would help, just let me know.
Updates based on answers
- I host with Storm on Demand, which runs perfectly for us and seems to be great hardware - https://www.stormondemand.com/cloud-server/
- I use Hpricot to parse the XML (which could probably be optimized; I'm no expert here)
- I do need all of the data; this service doesn't offer an export, only an API.
EDIT [to help people stumbling on this later]: I switched from Hpricot to Nokogiri, which is MUCH faster. Also, I was building an XML file in memory, which is apparently extremely memory-intensive and was a very time-consuming task. By fixing these two things I've cut this operation down from about 10 minutes to just over 1 minute.
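For anyone curious, the parsing side now looks roughly like this (the tag names here are made up; the real ones come from the API):

```ruby
require 'nokogiri'

# 'entry', 'id' and 'name' are placeholder tag names; use the real ones.
doc = Nokogiri::XML(File.read('page.xml'))

doc.xpath('//entry').each do |entry|
  id   = entry.at_xpath('id').text
  name = entry.at_xpath('name').text
  # save the record straight away instead of accumulating a new
  # XML document in memory (that was the expensive part for me)
end
```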
Here's a list of things to look at:
- optimize your code. Try profiling it and see if you can improve it. Most likely switching to a better parsing strategy (SAX instead of DOM) is possible; see the streaming sketch after this list.
- get better hardware/hosting. 4GB is just memory; most likely you are on a shared host/VM and CPU-limited.
- offload some CPU/memory-heavy operations to a faster service/application; for example, XML processing, data analysis, and file I/O can be done in C/C++.
- in a proper cloud environment you should be able to spawn more VMs and distribute your jobs/load accordingly. That will cost more, though, and require some kind of job manager.
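To illustrate the DOM vs SAX point: with Nokogiri you can stream the document through a SAX handler instead of loading the whole tree, which keeps memory roughly flat no matter how many entries a page contains. A minimal, simplified sketch, assuming made-up `<entry>`/child tag names:

```ruby
require 'nokogiri'

# Streams the XML instead of building a full DOM tree in memory.
# 'entry' is a placeholder tag name for one record.
class EntryHandler < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    @current = name
    @record  = {} if name == 'entry'
  end

  def characters(text)
    return unless @record && @current && @current != 'entry'
    (@record[@current] ||= '') << text
  end

  def end_element(name)
    if name == 'entry'
      # handle one record here (e.g. write it to the database), then drop it
      @record = nil
    end
    @current = nil
  end
end

Nokogiri::XML::SAX::Parser.new(EntryHandler.new).parse(File.open('page.xml'))
```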
The question you need to ask is: why are your CPU and memory spiking? 4GB is plenty for handling this data, so is your code optimized for this task? If not, what can you do about it?
Is your code already as optimized as it can get? Fair enough; the next step is to rewrite the hot spots as C extensions.
After optimizing your code, I'd suggest processing this data 'later', as in a delayed job. This way you aren't blocking on the entire dataset, which may strain your server.
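A minimal sketch of that idea, assuming the delayed_job gem (any background-job library with the same enqueue/perform pattern works similarly); `ImportPageJob`, `fetch_page` and `parse_and_store` are made-up names standing in for your own code:

```ruby
# Assumes the delayed_job gem; the class and helper names are hypothetical.
class ImportPageJob < Struct.new(:offset)
  def perform
    xml = fetch_page(offset)   # one page from the API
    parse_and_store(xml)       # Nokogiri parsing + persistence
  end
end

# Enqueue one job per page so the main process never blocks
# on the whole 20k-entry dataset.
(0...20_000).step(1000) do |offset|
  Delayed::Job.enqueue ImportPageJob.new(offset)
end
```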
You also mentioned you are running a cloud server, so I assume you have access to more virtual machines. You can process this data in parallel to reduce the stress on each machine.
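As a rough sketch of splitting the work (threads here, but the same partitioning applies to separate VMs, each taking a slice of the date range); `import_range` is a made-up helper for whatever fetches and parses one sub-range:

```ruby
require 'date'

# Hypothetical date range and worker count.
start_date = Date.new(2011, 1, 1)
end_date   = Date.new(2011, 3, 31)
workers    = 4

chunk = ((end_date - start_date).to_i / workers) + 1

threads = (0...workers).map do |i|
  from = start_date + i * chunk
  to   = [from + chunk - 1, end_date].min
  Thread.new { import_range(from, to) }  # import_range: your own fetch+parse code
end
threads.each(&:join)
```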