I have a huge XML file, 2.8 GB in size. It is the Polish Wikipedia articles dump. The size of this file is very problematic for me. The task is to search this file for a large amount of data. All I have are the titles of the articles. I thought that I could sort the titles and use a single linear pass through the file. The idea is not bad, but the articles are not sorted alphabetically. They are sorted by ID, which I don't know a priori.
So my second thought was to make an index of that file: to store in another file (or database) lines in the following format: title;id;index
(maybe without the ID). In my other question I asked for help with that. The hypothesis was that if I had the index of the needed tag, I could use a simple Seek to move the cursor within the file without reading all the content, etc. For smaller files I think this could work fine, but on my computer (laptop, C2D processor, Win7, VS2008) I get the error that the application is not responding.
In my program, I read each line from the file and check whether it contains a tag that I need. I also count all the bytes I have read and save lines in the format mentioned above. While indexing, the program hangs. But by then the resulting index file is 36.2 MB and the last index is around 2,872,765,202 (bytes), while the whole XML file is 3,085,439,630 bytes long.
My third thought was to split the file into smaller pieces, to be precise into 26 pieces (there are 26 letters in the Latin alphabet), each containing only entries starting with the same letter; e.g. a.xml would hold all entries whose titles start with the letter "A". The final files would be tens of MB, at most around 200 MB I think. But there is the same problem with reading the whole file.
To read the file I used what is probably the fastest way: a StreamReader. I read somewhere that StreamReader and the XmlReader class from System.Xml are the fastest methods, with StreamReader even faster than XmlReader. It's obvious that I can't load this whole file into memory. I have only 3 GB of RAM installed, and Win7 takes about 800 MB-1 GB when fully loaded.
So I'm asking for help: what is the best thing to do? The point is that searching this XML file has to be fast, faster than downloading single Wikipedia pages in HTML format. I'm not even sure whether that is possible.
Maybe load all the needed content into a database? Maybe that would be faster? But I will still need to read the whole file at least once.
I'm not sure if there are limits on question length, but I will also include a sample of my indexing source code here.
while (reading)
{
    if (!reader.EndOfStream)
    {
        line = reader.ReadLine();
        fileIndex += enc.GetByteCount(line) + 2; //+2 - to cover the \r\n characters not included in line
        position = 0;
    }
    else
    {
        reading = false;
        continue;
    }

    if (currentArea == Area.nothing) //nothing interesting at the moment
    {
        //search for the position of the <title> tag
        position = MoveAfter("<title>", line, position); //searches until it finds the <title> tag
        if (position >= 0) currentArea = Area.title;
        else continue;
    }

    (...)

    if (currentArea == Area.text)
    {
        position = MoveAfter("<text", line, position);
        if (position >= 0)
        {
            long index = fileIndex;
            index -= line.Length;
            WriteIndex(currentTitle, currentId, index);
            currentArea = Area.nothing;
        }
        else continue;
    }
}
reader.Close();
reader.Dispose();
writer.Close();
}

private void WriteIndex(string title, string id, long index)
{
    writer.WriteLine(title + ";" + id + ";" + index.ToString());
}
Best Regards and Thanks in advance,
ventus
Edit: Link to this Wiki's dump http://download.wikimedia.org/plwiki/20100629/
Well... if you're going to search it, I would highly recommend you find a better way than dealing with the file itself. I suggest, as you mention, putting it into a well-normalized and indexed database and doing your searching there. Anything else you do will effectively duplicate exactly what a database does.
Doing so will take time, however. XmlTextReader is probably your best bet; it works one node at a time. LINQ to XML should also be fairly efficient, but I haven't tried it with a large file and so can't comment.
May I ask where this huge XML file came from? Perhaps there's a way to deal with the situation at the source, rather than having to process a 3 GB file.
Well, if it fits your requirements, I would first import this XML into an RDBMS like SQL Server and then query against that.
With the right indexes (full text indexes if you want to search through a lot of text), it should be pretty fast...
It would also cut a lot of the overhead that comes from having the XML libraries parse the file on every search...
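As a rough illustration of the search side, here is a minimal sketch, assuming the dump has already been imported into a hypothetical Articles table with Title and Body columns and a full-text index on Body (the connection string, table and column names are all assumptions):

using System;
using System.Data.SqlClient;

class FullTextSearchDemo
{
    static void Main()
    {
        // Hypothetical connection string, table and column names - adjust to your own import.
        const string connectionString = "Data Source=.;Initial Catalog=Wiki;Integrated Security=True";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "SELECT TOP 20 Title FROM Articles WHERE CONTAINS(Body, @term)", connection))
        {
            command.Parameters.AddWithValue("@term", "\"Brzeg\""); // quoted full-text search term
            connection.Open();

            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine(reader.GetString(0)); // titles of matching articles
            }
        }
    }
}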
I like the idea of creating an index - you get to keep your code super simple and you don't need any horrible dependencies like databases :)
So - create an index where you store the following:
[content to search]:[byte offset to the start of the xml node that contains the content]
To capture the byte offset, you'll need to create your own stream over the input file and create a reader from that. You'll query the position on every reader.Read(..). An example index record would be:
"Now is the winter of our discontent":554353
This means that the entry in the xml file that contains "Now is the winter of our discontent" is at the node starting at byte position 554,353. Note: I'd be tempted to encode the search portion of the index so that it can't collide with the separator that you use.
To use the index, you scan through it on disk (i.e. don't bother loading it into memory) looking for the appropriate record. Once found, you'll have the byte offset. Now create a new Stream over the .xml file, set its position to the byte offset, create a new reader, and read the document from that point.
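A minimal sketch of that lookup step, assuming a hypothetical index file of title;offset lines and a UTF-8 dump (the file names and index format here are illustrative, not taken from the question's code):

using System;
using System.IO;
using System.Text;

class IndexLookupDemo
{
    // Scans the index file line by line (never loading it all) for the title; returns its byte offset or -1.
    static long FindOffset(string indexPath, string title)
    {
        using (var indexReader = new StreamReader(indexPath))
        {
            string line;
            while ((line = indexReader.ReadLine()) != null)
            {
                string[] parts = line.Split(';');
                if (parts.Length >= 2 && parts[0] == title)
                    return long.Parse(parts[parts.Length - 1]);
            }
        }
        return -1;
    }

    static void Main()
    {
        long offset = FindOffset("index.txt", "Brzeg"); // hypothetical index file and title
        if (offset < 0) return;

        using (var stream = new FileStream("plwiki.xml", FileMode.Open, FileAccess.Read))
        {
            stream.Seek(offset, SeekOrigin.Begin);           // jump straight to the indexed node
            using (var reader = new StreamReader(stream, Encoding.UTF8))
            {
                string line;
                while ((line = reader.ReadLine()) != null)   // read forward until the page ends
                {
                    Console.WriteLine(line);
                    if (line.Contains("</page>")) break;
                }
            }
        }
    }
}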
You could store the file in CouchDB. I wrote a Python script to do it:
import couchdb
from lxml import etree

couch = couchdb.Server()
db = couch["wiki"]

infile = '/Users/johndotnet/Downloads/plwiki-20100629-pages-articles.xml'

context = etree.iterparse(source=infile, events=("end",),
                          tag='{http://www.mediawiki.org/xml/export-0.4/}page')

for event, elem in context:
    couchEle = {}
    for ele in elem.getchildren():
        if ele.tag == "{http://www.mediawiki.org/xml/export-0.4/}id":
            couchEle['id'] = ele.text
        if ele.tag == "{http://www.mediawiki.org/xml/export-0.4/}title":
            couchEle['title'] = ele.text
        if ele.tag == "{http://www.mediawiki.org/xml/export-0.4/}revision":
            for subEle in ele.getchildren():
                if subEle.tag == "{http://www.mediawiki.org/xml/export-0.4/}text":
                    couchEle['text'] = subEle.text
    db[couchEle['title']] = couchEle
    elem.clear()  # free the finished <page> element so memory stays bounded on a 3 GB file
This should import all the articles with id, title and text into CouchDB.
Now you can run a query like this:
code = '''
function(doc) {
    if (doc.title.indexOf("Brzeg") > -1) {
        emit(doc._id, doc);
    }
}
'''
results = db.query(code)
Hope it helps!
XmlReader will be fast, but you need to verify whether it is fast enough for your scenario. Suppose that we are looking for a value located in a node called Item:
using (var reader = XmlReader.Create("data.xml"))
{
    while (reader.Read())
    {
        if (reader.NodeType == XmlNodeType.Element && reader.Name == "Item")
        {
            string value = reader.ReadElementContentAsString();
            if (value == "ValueToFind")
            {
                // value found
                break;
            }
        }
    }
}
I would do this:
1) Break the XML into smaller files. For example, if the XML looks like the sample below, I would create one file per article node, with a name that matches the title attribute (a rough sketch of this step appears after step 3). If the title isn't unique, I would just number the files.
Since that is a lot of files, I would break them into subdirectories, each holding no more than 1,000 files.
<root>
    <article title="aaa"> ... </article>
    <article title="bbb"> ... </article>
    <article title="ccc"> ... </article>
</root>
2) Create an index table with the file names and the columns you want to search on.
3) As an option, you could store the XML fragments in the database instead of on disk. SQL Server's varchar(max) type is good for this.
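A rough sketch of step 1, assuming the simplified article/title layout shown above rather than the real MediaWiki dump format (element names, attribute names and file names are taken from that sample or invented here):

using System;
using System.IO;
using System.Xml;

class SplitDemo
{
    static void Main()
    {
        int count = 0;
        using (var reader = XmlReader.Create("dump.xml")) // hypothetical input file
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "article")
                {
                    string title = reader.GetAttribute("title") ?? count.ToString();
                    string fragment = reader.ReadOuterXml(); // consumes the whole <article> element

                    // Bucket the output into subdirectories of at most 1,000 files each.
                    string dir = Path.Combine("articles", (count / 1000).ToString());
                    Directory.CreateDirectory(dir);

                    // Replace characters that are not allowed in file names.
                    foreach (char c in Path.GetInvalidFileNameChars())
                        title = title.Replace(c, '_');

                    File.WriteAllText(Path.Combine(dir, title + ".xml"), fragment);
                    count++;
                }
                else
                {
                    reader.Read();
                }
            }
        }
        Console.WriteLine("Wrote {0} files.", count);
    }
}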
Dump it into a Solr index and use that to search it. You can run Solr as a standalone search engine, and use a simple bit of scripting to loop over the file and push every article into the index. Solr then gives you full-text search over whichever fields you decided to index...
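For example, a minimal sketch of that scripting step, posting one article to a local Solr instance through its XML update handler (the URL and the field names id/title/text are assumptions that must match your Solr schema):

using System;
using System.Net;
using System.Security;

class SolrPostDemo
{
    // Sends one document to Solr's XML update handler.
    static void AddToIndex(string id, string title, string text)
    {
        string doc = "<add><doc>"
                   + "<field name=\"id\">" + SecurityElement.Escape(id) + "</field>"
                   + "<field name=\"title\">" + SecurityElement.Escape(title) + "</field>"
                   + "<field name=\"text\">" + SecurityElement.Escape(text) + "</field>"
                   + "</doc></add>";
        Post(doc);
    }

    static void Post(string body)
    {
        using (var client = new WebClient())
        {
            client.Headers[HttpRequestHeader.ContentType] = "text/xml; charset=utf-8";
            client.UploadString("http://localhost:8983/solr/update", body); // assumed default Solr URL
        }
    }

    static void Main()
    {
        AddToIndex("1", "Brzeg", "Sample article text..."); // illustrative data only
        Post("<commit/>"); // make the newly added documents searchable
    }
}

Searching afterwards is then just an HTTP GET against /solr/select?q=title:Brzeg (again assuming that field name).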
The only way you're going to be able to search through this quickly is to store it in a database, like others have suggested. If a database is not an option, then it's going to take a long time, no doubt about it. What I would do is create a multithreaded application: worker threads that read in the data and push it into a string queue, with, say, five of them segmented across the whole file (one thread starts at the beginning, the second starts 1/5 of the way into the file, the third 2/5 of the way in, etc.). Meanwhile, another thread reads from the string queue and searches for whatever it is you're looking for, dequeuing items once it's done with them. This will take a while, but it shouldn't crash or eat up tons of memory.
If you find it is eating a lot of memory, then set a limit on the number of items the queue can hold and have the threads suspend until the queue size is below this threshold.
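A much simplified sketch of that producer/consumer idea, with one reader thread instead of five and .NET 4's BlockingCollection as the bounded queue (the file name and search term are placeholders):

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class ProducerConsumerSearch
{
    static void Main()
    {
        // Bounded queue: the reader blocks when 10,000 lines are waiting, so memory use stays flat.
        var queue = new BlockingCollection<string>(boundedCapacity: 10000);

        var reader = Task.Factory.StartNew(() =>
        {
            foreach (string line in File.ReadLines("plwiki.xml")) // placeholder file name
                queue.Add(line);
            queue.CompleteAdding(); // tell the consumer that no more lines are coming
        });

        var searcher = Task.Factory.StartNew(() =>
        {
            foreach (string line in queue.GetConsumingEnumerable())
                if (line.Contains("<title>Brzeg</title>")) // placeholder search term
                    Console.WriteLine("Found: " + line.Trim());
        });

        Task.WaitAll(reader, searcher);
    }
}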
You can use the XML data type in SQL Server, which supports up to 2 GB of XML data per value, and you can query the XML directly with that.
See: http://technet.microsoft.com/en-us/library/ms189887(v=sql.105).aspx
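As a hedged illustration only, assuming the pages had been loaded into a hypothetical Pages table with an xml column named Content holding simplified, un-namespaced <page> fragments:

using System;
using System.Data.SqlClient;

class XmlColumnQueryDemo
{
    static void Main()
    {
        // Hypothetical connection string, table, column and element names.
        const string connectionString = "Data Source=.;Initial Catalog=Wiki;Integrated Security=True";
        const string sql =
            "SELECT Content.value('(/page/title)[1]', 'nvarchar(400)') " +
            "FROM Pages " +
            "WHERE Content.exist('/page/title[. = \"Brzeg\"]') = 1";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
                while (reader.Read())
                    Console.WriteLine(reader.GetString(0)); // titles of matching pages
        }
    }
}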
Hope this helps!
I know this question/answer is old, but I have recently been working through this problem myself. Personally, I would use Json.NET (James Newton-King's library), which makes it as simple as deserializing the document results to C# objects.
Now, my documents (results) are only a few MB in size (averaging 5 MB at the moment), but I can see this growing with the Solr index. As it stands, I am getting fast results.
There is a discussion on CodePlex with reference to the performance.