I went through answers on similar topics here on SO but couldn't find a satisfying answer. Since I know this is a rather large topic, I will try to be more specific.
I want to write a program which processes files. The processing is nontrivial, so the best way is to split the different phases into standalone modules which would then be used as necessary (sometimes I will only be interested in the output of module A, sometimes I will need the output of five other modules, etc.). The thing is that I need the modules to cooperate, because the output of one might be the input of another. And I need it to be FAST. Moreover, I want to avoid doing certain processing more than once (if module A creates some data which then needs to be processed by modules B and C, I don't want to run module A twice to create the input for B and C).
The information the modules need to share would mostly be blocks of binary data and/or offsets into the processed files. The task of the main program would be quite simple - just parse arguments and run the required modules (and perhaps give some output, or should that be the task of the modules?).
I don't need the modules to be loaded at runtime. It's perfectly fine to have libs with a .h file and recompile the program every time there is a new module or some module is updated. The idea of modules is here mainly for code readability and maintainability, and to allow more people to work on different modules without the need for some predefined interface or whatever (on the other hand, some "guidelines" on how to write the modules would probably be required, I know that). We can assume that the file processing is a read-only operation; the original file is not changed.
Could someone point me in a good direction on how to do this in C++? Any advice is welcome (links, tutorials, PDF books...).
This looks very similar to a plugin architecture. I recommend starting with an (informal) data flow chart to identify:
- how these blocks process data
- what data needs to be transferred
- what results come back from one block to another (data / error codes / exceptions)
With this information you can start to build generic interfaces, which allow one block to bind to another at runtime. Then I would add a factory function to each module to request the real processing object from it. I don't recommend getting the processing objects directly out of the module interface; instead, return a factory object from which the processing objects can be retrieved. These processing objects are then used to build the entire processing chain.
An oversimplified outline would look like this:
struct Data;         // placeholder: the block of binary data / offsets being passed around
struct WhichDoIWant; // placeholder: identifies the kind of processor requested

struct Processor
{
    void doSomething(Data);
};

struct Module
{
    std::string name();
    Processor* getProcessor(WhichDoIWant);  // factory function
    void deleteProcessor(Processor*);       // hand the processor back to its module
};
Off the top of my head, these patterns are likely to appear:
- factory function: to get objects from modules
- composite && decorator: forming the processing chain
I am wondering if C++ is the right level to think about for this purpose. In my experience, it has always proven useful to have separate programs that are piped together, in the UNIX philosophy.
If your data is not overly large, there are many advantages to splitting. First, you gain the ability to test every phase of your processing independently: you run one program and redirect its output to a file, and you can easily check the result. Second, you take advantage of multi-core systems even if each of your programs is single-threaded, which makes them much easier to create and debug. You also take advantage of the operating system's synchronization through the pipes between your programs. Maybe some of your phases could even be covered by already existing utility programs?
Your final program would then provide the glue, gathering all of your utilities into a single program, piping data from one program to another (no more intermediate files at this point), and replicating it as required for all your computations.
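As a rough illustration of what that glue could look like: the sketch below runs one stand-alone phase and reads its output through a pipe. Note that popen/pclose are POSIX, not standard C++, and "./phase_a" is a made-up utility name.

#include <cstdio>
#include <iostream>
#include <string>

int main()
{
    // Run one stand-alone phase and read its output back through a pipe.
    FILE* pipe = popen("./phase_a input.bin", "r");
    if (!pipe)
        return 1;

    std::string output;
    char buffer[4096];
    std::size_t n;
    while ((n = std::fread(buffer, 1, sizeof(buffer), pipe)) > 0)
        output.append(buffer, n);

    pclose(pipe);

    // 'output' could now be handed to the next phase, e.g. via another pipe.
    std::cout << "phase_a produced " << output.size() << " bytes\n";
    return 0;
}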
This really seems quite trivial, so I suppose we are missing some requirements.
Use Memoization to avoid computing the result more than once. This should be done in the framework.
You could use some flowchart to determine how the information should pass from one module to another... but the simplest way is to have each module directly call those it depends upon. With memoization it does not cost much, since if a result has already been computed, you're fine.
Since you need to be able to launch just about any module, you need to give them IDs and register them somewhere, with a way to look them up at runtime. There are two ways to do this.
- Exemplar: You get the unique exemplar of this kind of module and execute it.
- Factory: You create a module of the kind requested, execute it and throw it away.
The downside of the Exemplar method is that if you execute the module twice, you will not be starting from a clean state but from whatever state the last (possibly failed) execution left it in. For memoization it might be seen as an advantage, but if it failed the result is not computed (urgh), so I would recommend against it.
So how do you ... ?
Let's begin with the factory.
#include <map>
#include <memory>
#include <string>

class Module;

// Base class for whatever a module produces
struct Result
{
    virtual ~Result() {}
    virtual void print() const = 0; // used by main() below
};

class Organizer
{
public:
    void AddModule(std::string id, const Module& module);
    void RemoveModule(const std::string& id);
    const Result* GetResult(const std::string& id) const;

private:
    typedef std::map< std::string, std::shared_ptr<const Module> > ModulesType;
    typedef std::map< std::string, std::shared_ptr<const Result> > ResultsType;

    ModulesType mModules;
    mutable ResultsType mResults; // Memoization cache
};
It's a very basic interface, really. However, since we want a new instance of the module each time we invoke the Organizer (to avoid reentrancy problems), we will need to work on our Module interface.
class Module
{
public:
    typedef std::unique_ptr<const Result> ResultPointer;

    virtual ~Module() {}               // it's a base class
    virtual Module* Clone() const = 0; // traditional cloning concept
    virtual ResultPointer Execute(const Organizer& organizer) = 0;
}; // class Module
And now, it's easy:
// Organizer implementation
const Result* Organizer::GetResult(const std::string& id) const
{
    // Memoized?
    ResultsType::const_iterator res = mResults.find(id);
    if (res != mResults.end()) return res->second.get();

    // Need to compute it: look the module up
    ModulesType::const_iterator mod = mModules.find(id);
    if (mod == mModules.end()) return 0;

    // Create a throw-away clone
    std::unique_ptr<Module> module(mod->second->Clone());

    // Compute
    std::shared_ptr<const Result> result(module->Execute(*this));
    if (!result.get()) return 0;

    // Store the result as part of the Memoization thingy
    mResults[id] = result;
    return result.get();
}
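AddModule is not spelled out above; one possible sketch (my assumption, not part of the original interface's definition) is that the Organizer stores its own clone, which is also why passing temporaries from main below works:

// Possible implementations (assumed, not given above)
void Organizer::AddModule(std::string id, const Module& module)
{
    // Keep a private clone, so the caller may pass a temporary
    mModules[id] = std::shared_ptr<const Module>(module.Clone());
}

void Organizer::RemoveModule(const std::string& id)
{
    mModules.erase(id);
    mResults.erase(id); // drop any memoized result along with its module
}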
And a simple Module/Result example:
#include <iostream>

struct FooResult: Result
{
    FooResult(int r): mResult(r) {}
    virtual void print() const { std::cout << "FooResult: " << mResult << "\n"; }
    int mResult;
};

struct FooModule: Module
{
    virtual FooModule* Clone() const { return new FooModule(*this); }
    virtual ResultPointer Execute(const Organizer& organizer)
    {
        // check that the file has the correct format first
        if (!organizer.GetResult("CheckModule")) return ResultPointer();
        return ResultPointer(new FooResult(42));
    }
};
And from main:
#include "project/organizer.h"
#include "project/foo.h"
#include "project/bar.h"
int main(int argc, char* argv[])
{
Organizer org;
org.AddModule("FooModule", FooModule());
org.AddModule("BarModule", BarModule());
for (int i = 1; i < argc; ++i)
{
const Result* result = org.GetResult(argv[i]);
if (result) result->print();
else std::cout << "Error while playing: " << argv[i] << "\n";
}
return 0;
}
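Assuming the resulting binary is called processor, an invocation like "processor FooModule BarModule" would then compute (and memoize) each named module's result and print it.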