I have a large dataset of complex documents with tree-like data structures (each document has its own data tree, which may differ from document to document). The backend is implemented using Django 1.3 and MySQL.
I need to:
- store these documents in scalable, fast storage
- filter documents by some predefined queries
- search the data within a limited subset of a document's fields
- extra feature: search documents using any possible query and extract any possible information from the data tree. This extra feature is a customer requirement, and it should be comfortable for non-technical users. At the moment we have an agreement that XPath is enough.
Note: documents do not need to change frequently; about 90% of operations will be reads.
Note: I rarely need all possible fields from the data tree. The data needed about 90% of the time amounts to roughly 10% of the whole tree. The only case where all data is needed is the extra feature described above, and in practice it isn't a popular feature of the system.
Note: the data tree that comes with each document is a representation of a custom format that cannot be changed. I can only pick the needed pieces of data from the tree and convert them to a readable form (as well as write them back in this custom format).
Currently, I use:
- MySQL to store the data tree for each document as XML
- some pre-picked data from the XML as extra columns in the same table to speed up searches
- all other actually needed fields are extracted from the XML on the fly during each query using MySQL's extractvalue() function, with more than 10 extractvalue() calls in each SQL query (see the sketch after this list)
- all searches and filterings are performed with XPath queries and extractvalue() against the stored XML
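To make the current approach concrete, here is a minimal sketch of the kind of query described above, issued from Django; the table name and XPath expressions are hypothetical, but the pattern of stacked extractvalue() calls is what gets slow, since each call re-parses the XML of every candidate row:

from django.db import connection

cursor = connection.cursor()
cursor.execute("""
    SELECT id,
           EXTRACTVALUE(xml, '/doc/meta/title')  AS title,
           EXTRACTVALUE(xml, '/doc/meta/author') AS author
    FROM documents
    WHERE EXTRACTVALUE(xml, '/doc/meta/status') = %s
""", ['published'])
rows = cursor.fetchall()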
The problem is that this temporary solution performs very poorly. On a 100k+ record dataset, a single query with ~10 extractvalue() calls takes minutes to execute.
The solutions I see at the moment:
1) Continue using the approach with pre-picked fields in extra columns. These fields are extracted once, when the document enters the system.
Pros:
- Use of proven technology (MySQL)
- Most searches will be performed against the pre-picked fields, avoiding the very slow extractvalue() function
- I have tested XPath searches with this approach on a 100k+ record dataset, and the performance of a single extractvalue() call seems acceptable (<1 s per query), in contrast to the 10+ simultaneous extractvalue() calls in the current temporary approach
Cons:
- Because each document can have its own data tree, as well as its own set of pre-picked fields, I would have to create a bunch of tables to store these field sets and join them depending on document type (a sketch of such a schema follows this list)
- Pre-picked fields can be lists, not only single values, so each list needs a separate table
- The pre-picked fields must be kept synchronized with the XML
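To illustrate those cons, here is a minimal Django 1.3-style sketch of what the schema tends to grow into; all model and field names are hypothetical:

from django.db import models

class Document(models.Model):
    doc_type = models.CharField(max_length=32)
    xml = models.TextField()              # full data tree, stored as XML

class InvoiceFields(models.Model):        # pre-picked scalars for one doc type
    document = models.OneToOneField(Document)
    customer = models.CharField(max_length=128, db_index=True)
    total = models.DecimalField(max_digits=12, decimal_places=2)

class DocumentTag(models.Model):          # a list-valued pre-picked field
    document = models.ForeignKey(Document)
    value = models.CharField(max_length=64, db_index=True)

Every new document type means another *Fields table, every list-valued field another table like DocumentTag, and every query a set of joins chosen by doc_type.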
2) Use MySQL for custom XPath searches (i.e. the extra feature) and MongoDB for all other searches. MongoDB stores all the needed pre-picked fields; MySQL stores just the XML.
Note: I don't think it is worth storing all of the system's data in MongoDB (user accounts, sessions, etc.). MongoDB just for documents will be enough.
Pros:
- 90% of the needed queries should be fast
- I could store any nested data with each document, even though the data trees differ from document to document, with no need to join a lot of tables (see the pymongo sketch after these lists)
- MongoDB seems to have very convenient tools for working with it from Python
Cons:
- Unproven technology (at least for me). I have no experience with MongoDB, but I have consulted some programmers who use it, and it looks promising
- MongoDB has no XPath-like features (and it seems it will not get them in the near future), so I would have to keep using MySQL for XPath (as in solution 1). As a developer, I want to avoid inventing a new custom query language, so XPath looks like a good compromise
- Thus, synchronization between MySQL and MongoDB would be required
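A minimal pymongo sketch of the split in solution 2 (the collection layout and field names are my assumptions, not a prescription); the MySQL primary key is kept in each MongoDB document so the two stores can be synchronized:

from pymongo import MongoClient

db = MongoClient().docstore               # assumes a local mongod

# one MongoDB document per MySQL row; 'mysql_id' links back to the XML
db.documents.insert_one({
    'mysql_id': 42,
    'doc_type': 'invoice',
    'fields': {                           # arbitrary nested pre-picked data
        'customer': 'ACME',
        'lines': [{'sku': 'X1', 'qty': 3}],
    },
})

# the 90% case: fast queries on pre-picked fields, no extractvalue()
for doc in db.documents.find({'fields.customer': 'ACME'}):
    print(doc['mysql_id'])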
My questions:
- Could there be hidden or unobvious problems with solution 2)?
- I am still worried about extractvalue() performance. From my customer's point of view, the XPath approach could be replaced with a similar one, but I don't know of any among the tools available for MySQL or MongoDB. Are there any XPath-like alternatives?
I will be grateful for any feedback, thoughts, and comments on the above.
So if I understand the question right, you want to:
- find a given node in a tree, given some path through a portion of the tree to that node, plus extra query expressions.
- then get back that node and everything below it.
With a materialized paths approach you can do the above. The main thing that needs tweaking: if there is a path "a..b..c..d..e" to a document and you want to find documents with a path "..b..c..d.." somewhere in the middle, how do you make that fast? If we were starting from the very top it would be easy, but here we aren't. It may make sense to use a combination approach where each document stores the node's materialized path plus an array of the node's ancestors, something like:
{ path : ",a,b,c,d,e,",
ancestor : ['a','b','c','d','e']
}
We could index on ancestor, which creates a multikey index. Then we would run a query like the following to find nodes on the path ",b,c,d," with some efficiency:
find( { path : /,b,c,d,/, ancestor : 'd', <more_query_expressions_optionally> } )
In the above, the index on ancestor would be used, so only docs having 'd' as an ancestor need be inspected. The following could be tried, which might be even better depending on how smart the query optimizer is:
find( { path : /,b,c,d,/, ancestor : { $all : ['a','d'] }, ... } )
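For completeness, here is the same index and query from Python with pymongo (a sketch; the collection name is assumed):

import re
from pymongo import MongoClient

nodes = MongoClient().docstore.nodes
nodes.create_index('ancestor')            # multikey index over the array

# nodes whose path contains ",b,c,d," and whose ancestors include 'a' and 'd'
cursor = nodes.find({
    'path': re.compile(',b,c,d,'),
    'ancestor': {'$all': ['a', 'd']},
})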
This is a very broad question, but some things I would consider are XML datastores like MarkLogic and eXist - they are very good at optimizing queries on tree-structured data.
You might also consider rolling your own on top of a basic search index like MySQL, or possibly Lucene/Solr if you want better full-text search capabilities (phrases, synonyms, near queries, etc.).
I would do something like index the text content of every named element and attribute (this is more or less the approach taken by the XML datastores I referred to above), use that to retrieve a candidate list of documents, and then evaluate the XPath expressions against the candidates to weed out false positives. This is not a small project, though.
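A sketch of that two-phase idea in Python with lxml (all names are hypothetical): a cheap index query narrows the candidate set, then the real XPath expression is evaluated to weed out the false positives:

from lxml import etree

def xpath_search(candidates, xpath_expr):
    # candidates: iterable of (doc_id, xml_string) from the index query
    for doc_id, xml_string in candidates:
        tree = etree.fromstring(xml_string)
        if tree.xpath(xpath_expr):        # non-empty result = true match
            yield doc_id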