I'm new to web apps, so I'm not used to worrying about CPU limits, but it looks like I'm going to have a problem with this code. I read on Google's quotas page that I can use 6.5 CPU-hours per day and 15 CPU-minutes per minute.
Google said:
CPU time is reported in "seconds," which is equivalent to the number of CPU cycles that can be performed by a 1.2 GHz Intel x86 processor in that amount of time. The actual number of CPU cycles spent varies greatly depending on conditions internal to App Engine, so this number is adjusted for reporting purposes using this processor as a reference measurement.
And
            Per Day          Max Rate
CPU Time    6.5 CPU-hours    15 CPU-minutes/minute
What I want to know:
Is this script going over the limit?
(If yes) How can I make it not go over the limit?
I use the urllib library; should I use Google's URL Fetch API? Why?
Absolutely any other helpful comment.
What it does:
It scrapes (crawls) Project Free TV. I will only run it in full once, then replace it with a shorter, faster script.
from urllib import urlopen
import re

# Compile each pattern once, up front - recompiling inside the loops wastes CPU.
patFinderAlpha = re.compile('<td width="97%" nowrap="true" class="mnlcategorylist"><a href="(.*)">')
patFinderBeta = re.compile('<td width="97%" class="mnlcategorylist"><a href="(.*)">')
patFinderGamma = re.compile('<a href="(.*)" target="_blank" class="mnllinklist">')
patFinderGamma2 = re.compile('<meta name="keywords"content="(.*)">')
patFinderDelta = re.compile("<iframe id='hmovie' .* src='(.*)' .*></iframe>")

alphaUrl = 'http://www.free-tv-video-online.me/movies/'
alphaPage = urlopen(alphaUrl).read()
findPatAlpha = re.findall(patFinderAlpha, alphaPage)

for ai in range(len(findPatAlpha)):
    betaUrl = alphaUrl + findPatAlpha[ai] + '/'
    betaPage = urlopen(betaUrl).read()
    findPatBeta = re.findall(patFinderBeta, betaPage)
    for bi in range(len(findPatBeta)):
        gammaUrl = betaUrl + findPatBeta[bi]
        gammaPage = urlopen(gammaUrl).read()
        findPatGamma = re.findall(patFinderGamma, gammaPage)
        findPatGamma2 = re.findall(patFinderGamma2, gammaPage)
        for gi in range(len(findPatGamma)):
            deltaUrl = findPatGamma[gi]
            deltaPage = urlopen(deltaUrl).read()
            findPatDelta = re.findall(patFinderDelta, deltaPage)
            # PutData is defined elsewhere; it stores the scraped result.
            PutData(findPatGamma2[gi], findPatAlpha[ai], findPatDelta)
If I forgot anything please let me know.
Update:
This is roughly how many times each level will run, in case that helps in answering the question.

            per cycle    total
Alpha:      1            1
Beta:       16           16
Gamma:      ~250         ~4,000
Delta:      ~6           ~24,000

That works out to roughly 28,000 page fetches for the full run.
I don't like to optimize until I need to. First, just try it. It might just work. If you go over quota, shrug, come back tomorrow.
To split jobs into smaller parts, look at the Task Queue API. Maybe you can divide the workload into two queues, one that scrapes pages and one that processes them. You can put limits on the queues to control how aggressively they are run.
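A minimal sketch of that idea, assuming a /scrape handler and queue names that are my own invention, not something in your code:

# queue.yaml - caps how fast each queue is drained (the rates are just examples)
#
# queue:
# - name: scrape-queue
#   rate: 5/s
# - name: process-queue
#   rate: 10/s

from google.appengine.api import taskqueue

# Instead of fetching every page in one request, enqueue one task per
# category; each /scrape task fetches a single page, parses it, and
# enqueues follow-up tasks for the links it finds.
for link in findPatAlpha:
    taskqueue.add(queue_name='scrape-queue',
                  url='/scrape',
                  params={'url': alphaUrl + link + '/'})

The rate setting in queue.yaml is what lets you control how aggressively each queue is run.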
P.S. On Regex for HTML: Do what works. The academics will call you out on semantic correctness, but if it works for you, don't let that stop you.
I use the urllib library; should I use Google's URL Fetch API? Why?
urllib on App Engine production servers is the URL Fetch API: App Engine implements urllib on top of URL Fetch, so you get the same service either way and don't need to rewrite anything.
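If you do want to call URL Fetch directly, e.g. to set a longer per-request deadline, it looks like this (the 10-second deadline is an arbitrary example):

from google.appengine.api import urlfetch

result = urlfetch.fetch(alphaUrl, deadline=10)  # abort the fetch after 10 seconds
if result.status_code == 200:
    alphaPage = result.content                  # same bytes urlopen(...).read() would give you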
It's unlikely that this will go over the free limit, but it's impossible to say without seeing how big the list of URLs it needs to fetch is, and how big the resulting pages are. The only way to know for sure is to run it - and there's really no harm in doing that.
You're more likely to run into the limitations on individual request execution - 30 seconds for frontend requests, 10 minutes for backend requests like cron jobs - than run out of quota. To alleviate those issues, use the Task Queue API to split your job into many parts. As an additional benefit, they can run in parallel! You might also want to look into Asynchronous URLFetch - though it's probably not worth it if this is just a one-off script.
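For reference, asynchronous URLFetch looks roughly like this; gamma_urls and process() are stand-ins for whatever batch of URLs and parsing step a given task handles:

from google.appengine.api import urlfetch

# Kick off all the fetches first; they run in parallel.
rpcs = []
for url in gamma_urls:
    rpc = urlfetch.create_rpc(deadline=10)
    urlfetch.make_fetch_call(rpc, url)
    rpcs.append((url, rpc))

# Then collect the results as they complete.
for url, rpc in rpcs:
    result = rpc.get_result()   # blocks until this particular fetch is done
    if result.status_code == 200:
        process(url, result.content)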