How to import dbpedia into neo4j? [closed]

Developer https://www.devze.com 2023-04-10 05:43 Source: the web

Closed. This question needs to be more focused. It is not currently accepting answers.

Closed 3 years ago.

I need to import DBpedia into Neo4j. I downloaded the DBpedia dump from here: http://wiki.dbpedia.org/Downloads37 Any idea how to do this?

I am currently doing the same thing. The biggest problem I found is the indexing, so the best solution is to write a Java program that extracts the statements, together with MD5 hashes of their parts, into a triples file with one tab-separated line per statement, as follows: subjectHash \t predicateHash \t objectHash \t subject \t predicate \t object \n.

In a second file you will need to store the nodes (i.e. the subjects and objects of the statements), one per line: nodeHash \t nodeValue
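To make the two line formats concrete, here is a minimal sketch of how such lines could be produced. It is not the code from the repository; the class and method names (TripleLines, md5Hex, tripleLine, nodeLine) are made up for illustration, and it assumes the hashes are hex-encoded MD5 digests of the UTF-8 bytes of each term.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not taken from the DbPediaImport repository.
final class TripleLines {

    // Returns the MD5 digest of the value's UTF-8 bytes as 32 lowercase hex characters.
    static String md5Hex(String value) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(value.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a mandatory algorithm on every Java platform, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // subjectHash \t predicateHash \t objectHash \t subject \t predicate \t object \n
    // Assumes the terms themselves contain no raw tab or newline characters.
    static String tripleLine(String subject, String predicate, String object) {
        return String.join("\t",
                md5Hex(subject), md5Hex(predicate), md5Hex(object),
                subject, predicate, object) + "\n";
    }

    // nodeHash \t nodeValue \n
    static String nodeLine(String value) {
        return md5Hex(value) + "\t" + value + "\n";
    }

    public static void main(String[] args) {
        String s = "<http://dbpedia.org/resource/Berlin>";
        String p = "<http://dbpedia.org/ontology/country>";
        String o = "<http://dbpedia.org/resource/Germany>";
        System.out.print(tripleLine(s, p, o));
        System.out.print(nodeLine(s));
        System.out.print(nodeLine(o));
    }
}
```

Because every hash has the same fixed length, the files can later be sorted and de-duplicated on the hash columns alone, without comparing the long URIs.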

The code for this procedure can be downloaded from my github: https://github.com/eschleining/DbPediaImport.git

Compile it with mvn package; this creates a jar file in target that takes the gzipped DBpedia files as arguments. If you only have the bz2 files, you can convert them as follows: for i in *.bz2 ; do bzcat "$i" | gzip > "${i%.bz2}.gz"; done &

Now run: java -jar ConcurrentDataTableWriter-0.0.1-SNAPSHOT.jar yourdbpediaFolder/*.gz

Then sort the newly created files manually with the Linux sort utility, starting with the nodes file: gunzip -c nodes.gz | sort -k2 -u | gzip > nodes_unique.gz

And the triples file, ordered by subject hash, then object hash, then predicate hash: gunzip -c triples.gz | sort -k1,1 -k3,3 -k2,2 -u | gzip > triples_unique.gz
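The ordering this sort step appears to aim for (column 1, then column 3, then column 2, dropping lines whose three hash columns repeat) can be illustrated with a small in-memory sketch. This is only an illustration of the ordering, under the assumption that this is the intended key order; the class name TripleSortSketch is made up, and for the full DBpedia dump the external sort utility remains the practical choice because the data will not fit in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration of the sort order; not part of the repository.
final class TripleSortSketch {

    // Returns the i-th (0-based) tab-separated column of a line.
    static String col(String line, int i) {
        return line.split("\t", -1)[i];
    }

    // Compares by column 1, then column 3, then column 2 (0-based indices 0, 2, 1).
    static final Comparator<String> BY_COLUMNS_1_3_2 =
            Comparator.comparing((String l) -> col(l, 0))
                    .thenComparing(l -> col(l, 2))
                    .thenComparing(l -> col(l, 1));

    // Sorts the lines and keeps only the first line seen for each key combination.
    static List<String> sortUnique(List<String> lines) {
        TreeSet<String> unique = new TreeSet<>(BY_COLUMNS_1_3_2);
        unique.addAll(lines);
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "s2\tp1\to1\t...",
                "s1\tp2\to2\t...",
                "s1\tp1\to2\t...",
                "s1\tp2\to2\tduplicate key");
        for (String line : sortUnique(lines)) {
            System.out.println(line);
        }
    }
}
```

With the triples grouped by subject and then object, all statements between the same pair of nodes end up on adjacent lines.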

Now you can compile the batch inserter from my repo with Maven 3 (mvn package) and run it in the same directory as the nodes_unique.gz and triples_unique.gz files. It creates a Neo4j database directory named "DbpediaNe04J" (mind the typo: "0" instead of "o").

I found this to be the fastest way since it only looks up an index once for each subject/object pair in a triple.

Feel free to add datatype nodes as properties and so on. Currently, I have implemented each triple as a relationship between two nodes.
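If you want to try the property variant, a first step is telling literal objects apart from resource objects. The following is a rough sketch, assuming the object column still holds the raw N-Triples term (IRIs in angle brackets, literals starting with a double quote, blank nodes starting with _:). The class ObjectKind and its methods are hypothetical and not part of the repository.

```java
// Hypothetical helper for deciding how to store a triple's object.
final class ObjectKind {

    enum Kind { RESOURCE, LITERAL, BLANK_NODE, UNKNOWN }

    // Classifies a raw N-Triples term by its leading characters.
    static Kind classify(String term) {
        String t = term.trim();
        if (t.startsWith("<") && t.endsWith(">")) {
            return Kind.RESOURCE;   // candidate for a node plus relationship
        }
        if (t.startsWith("\"")) {
            return Kind.LITERAL;    // candidate for a property on the subject node
        }
        if (t.startsWith("_:")) {
            return Kind.BLANK_NODE;
        }
        return Kind.UNKNOWN;
    }

    // Returns the text between the opening quote and the last quote of a literal,
    // dropping any language tag or datatype suffix. Escape sequences are left as-is.
    static String lexicalForm(String literal) {
        String t = literal.trim();
        int end = t.lastIndexOf('"');
        if (!t.startsWith("\"") || end <= 0) {
            throw new IllegalArgumentException("Not a literal: " + literal);
        }
        return t.substring(1, end);
    }

    public static void main(String[] args) {
        System.out.println(classify("<http://dbpedia.org/resource/Berlin>"));
        System.out.println(classify("\"Berlin\"@de"));
        System.out.println(lexicalForm("\"42\"^^<http://www.w3.org/2001/XMLSchema#integer>"));
    }
}
```

Objects classified as LITERAL could then be written as properties of the subject node instead of becoming nodes of their own, while RESOURCE objects keep the relationship-between-two-nodes treatment described above.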