I often do PHP projects designed to scrape hierarchical data from web pages and save it to the DB (essentially, to structure the data; think of scraping government websites that do have the data but do not provide it in a structured way). Each time, I try to come up with an OOP design that would allow me to achieve the following:
- Easily replace current HTML parsing scripts with new ones, in case the original web page changes
- Allow easy extension of the data scraped and saved, as these projects are also meant for others to take and build on. My aim is to collect the "base" data, while others might decide to include something extra, change the way it is saved, etc.
So far I have yet to find the solution, but the closest I got is something like this:
I define an abstract class for data containers that would implement common tree-traversing functions:
abstract class DataContainer {
    protected $parent = NULL;
    protected $children = NULL;

    public function getParent() {
        return $this->parent;
    }

    public function getChildren() {
        return $this->children;
    }
}
And then I have the actual data containers. Imagine I am scraping data on participation in parliamentary sessions down to a "specific question in a sitting" level. I would have SessionContainer, SittingContainer and QuestionContainer, which would all extend DataContainer.
Each of the session, sitting and question data sets is scraped from a different URL. Leaving the mechanism of getting the URL content aside, let's just say I need scraper classes, which would take the containers and a DOMDocument for the actual parsing. So I would define a generic interface like this:
interface Scraper {
    public function scrapeData(DOMDocument $Dom, DataContainer $DataContainer);
}
Then, each of the session, sitting and question would have its own scraper, which implements the interface. But I'd also like to ensure that each scraper can only accept the container it is meant for. So it would look like:
class SessionScraper implements Scraper {
    public function scrapeData(DOMDocument $DOM, SessionContainer $DataContainer) {
    }
}
Finally, I would have a generic Factory class that also implements the Scraper interface and just distributes the scraping to the relevant scrapers. Like this:
public function scrapeData(DOMDocument $DOM, DataContainer $DataContainer) {
    // get the scraper class name from the configuration array
    $class = $this->config[get_class($DataContainer)];
    $scraper = new $class();
    $scraper->scrapeData($DOM, $DataContainer);
}
This is the class that would actually be called in the code. Very similarly, I could deal with saving to the DB: each data container could have its own DBSaver class, which would implement a DBSaver interface. Again, all the calls could be done via the Factory class, which would also implement the DBSaver interface.
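For illustration, the saving side might look roughly like this (just a sketch of how I imagine it; the DBSaver interface, the SessionSaver class and the PDO handle are made up here, not existing code):

interface DBSaver {
    public function saveData(PDO $DB, DataContainer $DataContainer);
}

class SessionSaver implements DBSaver {
    public function saveData(PDO $DB, DataContainer $DataContainer) {
        // insert the session row, then hand the child sittings over to their own savers
    }
}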
Everything would be perfect, but the problem is that classes implementing an interface must implement the exact signature of the interface. E.g. the method SessionScraper::scrapeData cannot accept only SessionContainer objects; it must accept all DataContainer objects. But it is not meant to!
Finally, the question:
- Is my design wrong and should I be structuring everything in a completely different way? (How?) Or:
- Is my design OK, and I just need to enforce types within the methods with instanceof and similar checks (roughly as sketched below) instead of enforcing them via type hinting?
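By "instanceof and similar checks" I mean roughly something like this inside each concrete scraper (a sketch of the idea, not code I have actually written):

class SessionScraper implements Scraper {
    public function scrapeData(DOMDocument $DOM, DataContainer $DataContainer) {
        if (!$DataContainer instanceof SessionContainer) {
            throw new InvalidArgumentException('SessionScraper can only handle SessionContainer objects');
        }
        // ... actual parsing of the session page ...
    }
}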
Thanks in advance for all the suggestions / criticisms. I am completely happy with somebody overturning this code on its head, if necessary!
One thing that immediately catches the eye is Container. The name is very generic; you might need something more dynamic. I think you have Data and you classify it, so it has a type.
So instead of hardcoding the exact class into the type hint, you should resolve this dynamically.
If each Container now had a type, the Scraper could signal whether or not it is applicable for that type of Container.
The concrete form of scraping is actually the strategy you use to parse a specific kind of data. Your container encapsulates this strategy, providing an interface to the normalized data.
You only need to add some logic/contract between Container and Scraper so that they can talk to each other. This contract can go into the interfaces of both. It would also allow you to have a Scraper that can deal with multiple types, if you want to stretch it.
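For example (a sketch only; I am assuming the type is a plain string, and method names like getType() and canScrape() are invented here):

abstract class DataContainer {
    // ... parent/children handling as before ...

    /** A string identifying the kind of data, e.g. 'session', 'sitting' or 'question'. */
    abstract public function getType();
}

interface Scraper {
    /** Whether this scraper knows how to handle a container of the given type. */
    public function canScrape($type);

    public function scrapeData(DOMDocument $DOM, DataContainer $DataContainer);
}

class SessionScraper implements Scraper {
    public function canScrape($type) {
        return $type === 'session'; // a scraper could also accept several types here
    }

    public function scrapeData(DOMDocument $DOM, DataContainer $DataContainer) {
        // safe to treat this as session data: the caller checked canScrape() first
    }
}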
For your Container, take a look into the SPL as well: by implementing some of its interfaces you get iterators (and recursive iterators) for your tree. This might be the generic structure you're referring to, and the SPL could boost the usability of your Container classes.
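A sketch of what that could give you, assuming the children are kept in an array and getChildren() returns it:

class ContainerIterator extends ArrayIterator implements RecursiveIterator {
    public function hasChildren() {
        return count($this->current()->getChildren()) > 0;
    }

    public function getChildren() {
        return new ContainerIterator($this->current()->getChildren());
    }
}

// walk a whole session tree (session, its sittings, their questions) in one loop
$iterator = new RecursiveIteratorIterator(
    new ContainerIterator(array($session)),
    RecursiveIteratorIterator::SELF_FIRST
);
foreach ($iterator as $container) {
    // ...
}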
You do not need to hardcode everything in OOP; you can keep things dynamic, and especially in PHP you normally resolve things at runtime.
This will also make it easier to replace Scrapers with new versions. As Scrapers now have a type by definition (as suggested above), you can resolve at runtime which concrete class should do the scraping, e.g. by dynamically loading it from a .php file in a nicely organized file-system structure.
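A crude sketch of that kind of runtime resolution (reusing the hypothetical getType() from above; the directory layout and naming convention are made up):

function loadScraperFor(DataContainer $container) {
    $class = ucfirst($container->getType()) . 'Scraper';   // 'session' -> 'SessionScraper'
    $file  = __DIR__ . '/scrapers/' . $class . '.php';     // e.g. scrapers/SessionScraper.php
    if (!is_file($file)) {
        throw new RuntimeException('No scraper found for type ' . $container->getType());
    }
    require_once $file;
    return new $class();
}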
Just my 2 cents.