PHP Simple HTML DOM or Python-BSoup: which one is the easier approach?

开发者 (devze.com) https://www.devze.com, 2023-03-06 09:30, source: the web
I am currently working on an approach to parse a site that contains data on foundations in Europe:

http://www.foundationfinder.ch/, which has a dataset of 790 foundations. All the data are free to use, with no copyright limitations on them.

The goal: I want to parse the data and save it locally, for easier retrieval and more convenient use. Perhaps it is possible to store it in Calc or, even better, in a MySQL database.

Question: What is the simplest way to parse HTML with Perl? Should I use LWP or Mechanize: which one is the easier approach?

Some friends told me to try Python with Beautiful Soup. So far I have thought about an approach with Perl LWP or Python Beautiful Soup; I cannot see any other approaches to parsing such a site. Okay, there is one more way: using PHP (and cURL).

Which approach is the best: Perl with LWP or Mechanize, or the Python one?

Besides the question of language: can anybody help me with the first steps and get me on track? I look forward to hearing from you.

Regards, zero


"All the data are free to use - with no limitations copyrights on it."

I wouldn't be so sure. They are going out of their way to obfuscate contact data so that "data cannot be stored in tables to produce mailing lists". The details on the foundations are not HTML; they are images. Additionally, they limit search results to a maximum of 100. If you understand German, you should read the "Daten Schutz" (data protection) section in Informationen.

If all you want is to link the names of foundations to the search criteria the site allows you to use, then see the other answers. If you do want to store the detailed information, then you will be violating the intent of the site and will need to consult a lawyer on whether their statements have legal merit. Additionally, you will need OCR to turn the images back into usable data.


My two cents: choose according to the language you know best. If I were you, I would use Python, which has a number of libraries and tools for this; it would be something like a couple of hours' work.

However, if you are good with Perl or PHP, you should choose one of those languages. Most scripting languages have libraries that can do the task.
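
As a rough idea of what the Python route could look like, here is a minimal sketch that pulls names and links out of an HTML listing and stores them in a local database. Everything specific in it is an assumption made for illustration: the sample markup (`<li class="foundation">`), the helper names `parse_foundations` and `save_rows`, and the table layout are hypothetical, since the real site's HTML is not shown in this thread. It also uses the standard library's `html.parser` rather than Beautiful Soup, and `sqlite3` as a stand-in for MySQL, purely so that it runs without installing anything.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical markup: the real site's HTML is not shown in this thread.
SAMPLE_HTML = """
<ul>
  <li class="foundation"><a href="/f/1">Example Foundation A</a></li>
  <li class="foundation"><a href="/f/2">Example Foundation B</a></li>
  <li class="other"><a href="/about">About</a></li>
</ul>
"""


class FoundationParser(HTMLParser):
    """Collect (name, href) pairs from links inside <li class="foundation">."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_item = False   # True while inside a matching <li>
        self._href = None       # href of the <a> currently being read
        self._text = []         # text fragments of the current <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "foundation":
            self._in_item = True
        elif tag == "a" and self._in_item:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link we are tracking.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.rows.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "li":
            self._in_item = False


def parse_foundations(html):
    """Return a list of (name, href) tuples found in the given HTML."""
    parser = FoundationParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def save_rows(conn, rows):
    """Store the rows in a simple two-column table (sqlite3 stand-in for MySQL)."""
    conn.execute("CREATE TABLE IF NOT EXISTS foundations (name TEXT, href TEXT)")
    conn.executemany("INSERT INTO foundations VALUES (?, ?)", rows)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    save_rows(conn, parse_foundations(SAMPLE_HTML))
    for row in conn.execute("SELECT name, href FROM foundations"):
        print(row)
```

With Beautiful Soup installed, the parser class would likely shrink to a few lines of tag selection, and a MySQL driver could replace `sqlite3` with a similar insert step. The overall shape (fetch, parse, insert) stays the same in Perl or PHP.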


Which are you better at, PHP or Python? There is surely more to this than a language comparison, but let's not get into that. One could argue for Perl, Python, or PHP, and each has its own advantages. In the end you will be the one coding it, so go with the one you know better.
