I am working on a program that extracts real-time quotes for 900+ stocks from a website. I use HttpWebRequest to send an HTTP request to the site, then open a stream on the response and wrap it in a reader using the following code:
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Stream stream = response.GetResponseStream();
StreamReader reader = new StreamReader(stream);
The received HTML is large (5000+ lines), so it takes a long time to parse it and extract the price. For 900 files, parsing and extracting takes about 6 minutes. My boss isn't happy with that; he told me he wants the whole process done in TWO minutes.
I've identified that the part of the program that takes the most time is parsing and extracting. I've tried to optimize the code to make it faster; the following is what I have now after some optimization:
// skip lines at the top
for (int i = 0; i < 1500; ++i)
    reader.ReadLine();
// read the line that contains the price
string theLine = reader.ReadLine();
// ... extract the price from the line
Now it takes about 4 minutes to process all the files, which is still a significant gap from what my boss expects. So I am wondering: is there another way to further speed up the parsing and extracting so that everything finishes within 2 minutes?
I was doing HTML screen scraping for a while with stock quotes, but I found that Yahoo offers a great, simple web service that is much better than loading websites.
http://www.gummy-stuff.org/Yahoo-data.htm
With this service you can request up to 100 stock quotes in a single request, and it returns a CSV-formatted response with one line for every symbol. You can set which columns you want returned in the query string of the request. I built a small program that would query the service once a day for every stock in the stock market to get prices. It seemed to work well for me and was way faster than hitting websites for the data.
An example querystring would be http://finance.yahoo.com/d/quotes.csv?s=GE&f=nkqwxyr1l9t5p4
Which returns text of
"GENERAL ELEC CO",32.98,"Jun 26","21.30 - 32.98","NYSE",2.66,"Jul 25",28.55,"Jul 3","-0.21%"
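To show how the 100-symbols-per-request limit plays out for 900+ stocks, here is a minimal sketch that only builds the batched request URLs (it does not send them). The class and method names are made up for illustration, and joining symbols with '+' is an assumption about the query format, so check it against the service before relying on it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of any library.
public static class QuoteBatcher
{
    // Splits the symbols into groups of at most batchSize and builds
    // one quotes.csv URL per group.
    public static List<string> BuildUrls(IReadOnlyList<string> symbols, string fields, int batchSize = 100)
    {
        var urls = new List<string>();
        for (int i = 0; i < symbols.Count; i += batchSize)
        {
            var batch = symbols.Skip(i).Take(batchSize)
                               .Select(s => Uri.EscapeDataString(s));
            // Assumption: multiple symbols are separated by '+'.
            urls.Add("http://finance.yahoo.com/d/quotes.csv?s="
                     + string.Join("+", batch) + "&f=" + fields);
        }
        return urls;
    }
}
```

With 900 symbols this yields 9 URLs, so 9 small CSV downloads would replace 900 large HTML pages. The fields argument would be a column string like the nkqwxyr1l9t5p4 in the example above.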
for (int i = 0; i < 1500; ++i)
    reader.ReadLine();
This in particular is not good. ReadLine reads each whole line and stores it as a string that no one uses, which is extra work for the GC. Read byte by byte instead and watch for the CR LF bytes (0x0D 0x0A).
Then don't use StreamReader at all! It is heavy overhead; read from the stream directly.
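A rough sketch of that idea, counting line feeds on the raw stream so no string is allocated for the skipped lines. The class name is hypothetical, and the code assumes the page is UTF-8 with lines ending in LF or CR LF.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, not part of any library.
public static class RawLineReader
{
    // Discards bytes until `count` line feeds (0x0A) have been seen.
    // Returns false if the stream ends first.
    public static bool SkipLines(Stream stream, int count)
    {
        while (count > 0)
        {
            int b = stream.ReadByte();
            if (b < 0) return false;   // end of stream
            if (b == 0x0A) count--;    // one more line consumed
        }
        return true;
    }

    // Reads the next line, dropping any carriage return (0x0D),
    // and decodes it as UTF-8 (an assumption about the page encoding).
    public static string ReadLine(Stream stream)
    {
        var bytes = new MemoryStream();
        int b;
        while ((b = stream.ReadByte()) >= 0 && b != 0x0A)
        {
            if (b != 0x0D) bytes.WriteByte((byte)b);
        }
        return Encoding.UTF8.GetString(bytes.ToArray());
    }
}
```

Calling ReadByte directly on a network stream can be slow, so wrapping the response stream in a BufferedStream first is probably worthwhile. Whether this beats StreamReader in practice is something to measure rather than assume.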
Hard to see how this is possible, StreamReader is blindingly fast compared to HttpWebRequest. Some basic assumptions: say you are downloading 900 files with 5000 lines, 100 chars each in 6 minutes. That means you need to download 900 x 5000 x 100 = 450 Megabytes. In 6 minutes, that requires a bandwidth of 450E6 / 6 / 60 * 8 = 10 Mbps.
What do you have? 10 Mbps is about typical for high-speed Internet service, although you need a server that can sustain this. To get it down to 2 minutes, you'll need to upgrade your service to 30 Mbps. Your boss can fix that.
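The same back-of-the-envelope estimate as a small function, in case you want to plug in your real page sizes. The 100 bytes per line figure is the guess from above, not a measurement, and the names are made up for illustration.

```csharp
// Hypothetical helper, not part of any library.
public static class BandwidthEstimate
{
    // Bandwidth in megabits per second needed to download
    // files * linesPerFile * bytesPerLine bytes within `seconds`.
    public static double RequiredMbps(long files, long linesPerFile, long bytesPerLine, double seconds)
    {
        double totalBits = (double)files * linesPerFile * bytesPerLine * 8;
        return totalBits / seconds / 1e6;
    }
}
```

RequiredMbps(900, 5000, 100, 360) gives the 10 Mbps for 6 minutes, and RequiredMbps(900, 5000, 100, 120) gives the 30 Mbps for 2 minutes.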
About the speed improvement you saw: watch out for the cache.
If you really need real-time data fast, then you should subscribe to a data feed rather than scrape it off a site.
Alternatively, isn't there some token that you can search for to find the field/data pair(s) you need?
4 minutes sounds ridiculously long for reading in 900 files.
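The token search mentioned above could look something like this sketch, which finds a marker in the downloaded text and returns whatever sits between it and the next terminator. The class name and the markers in the usage note are invented; the real ones depend on the page's HTML.

```csharp
using System;

// Hypothetical helper, not part of any library.
public static class TokenExtractor
{
    // Returns the text between the first startToken and the next endToken,
    // or null if either token is missing.
    public static string ExtractAfter(string html, string startToken, string endToken)
    {
        int start = html.IndexOf(startToken, StringComparison.Ordinal);
        if (start < 0) return null;
        start += startToken.Length;
        int end = html.IndexOf(endToken, start, StringComparison.Ordinal);
        if (end < 0) return null;
        return html.Substring(start, end - start);
    }
}
```

For example, ExtractAfter(html, "id=\"price\">", "<") would pull 32.98 out of a cell like <td id="price">32.98</td>. Unlike skipping a fixed 1500 lines, this keeps working if the site adds or removes lines above the price.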