I want to be able to parse specific content from a website into a mySQL database. For example, on site http://allrecipes.com/Recipe/Fluffy-Pancakes-2/Detail.aspx I want to parse into my database (which has a table with columns RecipeName, Ingredients 1-10).
So basically my database will contain the name and all the ingredients for that recipe. There is no need to edit the content, simply parse them in as is (i.e. 3/4 cup milk) since i am using character in my database.
How exactly do I go about doing this? I was looking a pre-built parsers and it seems its tough to find one that's easy to use since I am fairly new to programming. Of course, I can manually enter values in but I开发者_开发知识库 want to parse them in.
Would it be possible to just parse this content and write a file that has a RecipieName, Ingredient string which I can then parse into my database? Or should I just do it directly into the database? I am unsure as to how to connect a database to a parser also directly, but I might be able to find some information online.
Basically, I am looking for help on how to exactly go about doing this since I am not very well versed in programming and this seems to be a lot more complicated than it might be.
I am using Java as my main language right now, although I can't say I am very good at it. But I should be able to understand the basic concepts.
Any suggestions on what parser to use or how to do this?
Thanks!
This is how I would do it in PHP. This is almost certainly NOT the most efficient way to do it, nor has it been debugged.
function parseHTML($rawHTML){
$startPosition = strpos($rawHTML,'<div class="ingredients"'); //Find the position of the beginning of the ingredients list, return the character number.
$endPosition = strpos($rawHTML,'</div>',$startPosition); //Find the position of the end of the ingredients list, begin searching from the beginning of the list (found in step 1)
$relevantPart = substr($rawHTML,$startPosition,$endPosition); //Isolate the ingredients list
$parsedString = strip_tags($relevantPart); //Strip the HTML tags off of the ingredients list
return $parsedString;
}
Still to be done: You say you have a mySQL database with 10 separate ingredients columns. This code outputs everything as one big string. You would have to change the strip_tags($relevantPart)
function to strip_tags($relevantPart,"<li>")
. That would let the <li>
tags through. Then, you would have to loop through every <li>
tag, performing a similar function to this. It shouldn't be too hard, but I don't feel comfortable writing it with no functioning PHP server.
精彩评论