Parsing HTML content into a MySQL database using a parser

Developer https://www.devze.com 2023-02-22 15:43 Source: the web
I want to be able to parse specific content from a website into a MySQL database. For example, from the page http://allrecipes.com/Recipe/Fluffy-Pancakes-2/Detail.aspx I want to parse the recipe into my database (which has a table with the columns RecipeName and Ingredients 1-10).

So basically my database will contain the name and all the ingredients for that recipe. There is no need to edit the content; I simply want to store each ingredient as is (e.g. 3/4 cup milk), since I am using character columns in my database.

How exactly do I go about doing this? I was looking at pre-built parsers, and it seems it's tough to find one that's easy to use, since I am fairly new to programming. Of course, I could enter the values manually, but I want to parse them in.

Would it be possible to just parse this content and write a file containing a RecipeName and an ingredient string, which I can then load into my database? Or should I write directly into the database? I am also unsure how to connect a parser directly to a database, but I might be able to find some information online.

Basically, I am looking for help on exactly how to go about doing this, since I am not very well versed in programming and this seems to be a lot more complicated than it should be.

I am using Java as my main language right now, although I can't say I am very good at it. But I should be able to understand the basic concepts.

Any suggestions on what parser to use or how to do this?

Thanks!


This is how I would do it in PHP. This is almost certainly NOT the most efficient way to do it, nor has it been debugged.

function parseHTML($rawHTML){
 $startPosition = strpos($rawHTML,'<div class="ingredients"'); //Find the position of the beginning of the ingredients list, return the character number.
 if ($startPosition === false) return '';                      //No ingredients list on this page
 $endPosition  = strpos($rawHTML,'</div>',$startPosition);     //Find the first closing </div> after the start of the list (found in step 1)
 if ($endPosition === false) return '';                        //The list is never closed, so give up
 $relevantPart = substr($rawHTML,$startPosition,$endPosition - $startPosition); //Isolate the ingredients list; substr() takes a length, not an end position
 $parsedString = strip_tags($relevantPart);                    //Strip the HTML tags off of the ingredients list
 return $parsedString;
}

Still to be done: you say you have a MySQL database with 10 separate ingredient columns, but this code outputs everything as one big string. You would have to change the strip_tags($relevantPart) call to strip_tags($relevantPart,"<li>"), which lets the <li> tags through. Then you would have to loop over every <li> element, using the same strpos/substr approach to pull out each item. It shouldn't be too hard, but I don't feel comfortable writing it without a functioning PHP server to test on.
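
Since the question mentions Java, here is a rough sketch of that remaining step in Java: it pulls each <li> out of the ingredients block and maps the items onto ten columns. Several things here are assumptions rather than facts about your setup: the class name RecipeParser, the table name recipes, and the column names Ingredient1 to Ingredient10 are made up for illustration, and the class="ingredients" marker is simply the one used in the PHP above, not something checked against the live page. String searching and regular expressions are brittle on real-world HTML, so a dedicated HTML parser library such as jsoup would likely be more robust.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Types;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RecipeParser {
    // Assumption: ten ingredient columns, as described in the question.
    static final int INGREDIENT_COLUMNS = 10;

    // Captures the contents of each <li>...</li>; DOTALL lets an item span several lines.
    private static final Pattern LIST_ITEM =
            Pattern.compile("<li\\b[^>]*>(.*?)</li>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the text of every <li> between the ingredients div and the first </div> after it. */
    static List<String> extractIngredients(String html) {
        List<String> ingredients = new ArrayList<>();
        int start = html.indexOf("<div class=\"ingredients\"");
        if (start < 0) {
            return ingredients; // no ingredients block on this page
        }
        int end = html.indexOf("</div>", start);
        String block = (end < 0) ? html.substring(start) : html.substring(start, end);
        Matcher m = LIST_ITEM.matcher(block);
        while (m.find()) {
            // Remove nested tags, collapse runs of whitespace, and trim the ends.
            String text = m.group(1).replaceAll("<[^>]+>", "").replaceAll("\\s+", " ").trim();
            if (!text.isEmpty()) {
                ingredients.add(text);
            }
        }
        return ingredients;
    }

    /** Pads with nulls (or drops extras) so the list has exactly INGREDIENT_COLUMNS entries. */
    static List<String> toColumns(List<String> ingredients) {
        List<String> columns = new ArrayList<>();
        for (int i = 0; i < INGREDIENT_COLUMNS; i++) {
            columns.add(i < ingredients.size() ? ingredients.get(i) : null);
        }
        return columns;
    }

    /** Inserts one row. Table and column names are hypothetical; adjust them to your schema. */
    static void insertRecipe(Connection conn, String name, List<String> ingredients)
            throws SQLException {
        StringBuilder sql = new StringBuilder("INSERT INTO recipes (RecipeName");
        for (int i = 1; i <= INGREDIENT_COLUMNS; i++) {
            sql.append(", Ingredient").append(i);
        }
        sql.append(") VALUES (?");
        for (int i = 0; i < INGREDIENT_COLUMNS; i++) {
            sql.append(", ?");
        }
        sql.append(")");
        try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
            ps.setString(1, name);
            List<String> columns = toColumns(ingredients);
            for (int i = 0; i < columns.size(); i++) {
                if (columns.get(i) == null) {
                    ps.setNull(i + 2, Types.VARCHAR); // unused ingredient column
                } else {
                    ps.setString(i + 2, columns.get(i));
                }
            }
            ps.executeUpdate();
        }
    }
}
```

To use it, you would download the page into a String, call RecipeParser.extractIngredients(html), and pass the result to insertRecipe together with a Connection, which JDBC normally hands out via DriverManager.getConnection with a URL of the form jdbc:mysql://host:port/database (this needs the MySQL Connector/J driver on the classpath). Writing straight to the database this way avoids the intermediate file, and the PreparedStatement placeholders keep ingredient text such as 3/4 cup milk from being interpreted as SQL.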
