How to use Solr DataImportHandler with XML Files?

I'm researching using DataImportHandler to import my data files utilizing FileDataSource with FileListEntityProcessor and have a couple questions before I get started that I'm hoping you guys can assist with.

1) I would like to put a file on the local filesystem in the configured location and have Solr see and process the file without additional effort on my part. Is this doable in any way? From what I've seen, this is not supported and I must manually call a URL (e.g. http://foo/solr/dataimport?command=full-import). The manual, URL-based invocation method seems perfectly logical in a database-oriented world, where one might schedule an update to run regularly but in my case I have a couple identical indexes I load balance between and don't want to run the same hefty query multiple times in parallel. As such, I'm doing one query, writing the results to an XML file, pushing that file to each box, and then wanting that file processed. I'd like the process to be as automated as possible.

2) I would like any files processed by Solr to be deleted after they've been imported. I haven't seen any way to do this currently. I thought I might be able to subclass something, but FileListEntityProcessor, for example, doesn't seem to give any handles at the right time in the workflow to delete a file. Is there somewhere else I can look?

3) When reading the DIH documentation, I ran across this statement: "When delta-import command is executed, it reads the start time stored in conf/dataimport.properties. It uses that timestamp to run delta queries and after completion, updates the timestamp in conf/dataimport.properties." If it really does update the date to the completion date, what happens to any files added between the start and end dates? Are they lost?

4) For delta imports, I don't see mention of how processed files are ordered other than that it tries not to re-import files older than that mentioned in the conf/dataimport.properties file. In cases where order matters, does it order the files by name or creation date or ...?

The idea of Solr/Lucene is not to work as a database. It is an index. That means it indexes data that resides somewhere else, even though you can additionally store the data in Solr/Lucene for special features (highlighting, etc.). Therefore there is no out-of-the-box way to add single documents and delete those documents after importing. By the way, it's best practice to keep the original documents in a database, on a file system, etc. Presumably you do keep the original documents, just not on the Solr/Lucene server?

URL-based invocation method seems perfectly logical in a database-oriented world, where one might schedule an update to run regularly but in my case I have a couple identical indexes I load balance between and don't want to run the same hefty query multiple times in parallel.

You could define an operating-system scheduled job (a cron job) to start a delta import.
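A scheduled job only has to request the DIH URL. As a minimal sketch, assuming Java 11+ and a handler URL shaped like the one in the question, a small program run from cron might look like the following. The class name `DihTrigger` and whatever host names you pass in are placeholders, not part of Solr.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper: asks one or more DIH handlers to start an import.
final class DihTrigger {

    // Builds the request URI for a handler URL and a DIH command such as
    // "full-import" or "delta-import".
    static URI commandUri(String handlerUrl, String command) {
        return URI.create(handlerUrl + "?command=" + command);
    }

    // Sends the command with a plain GET and returns the HTTP status code.
    static int trigger(HttpClient client, String handlerUrl, String command)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(commandUri(handlerUrl, command))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: DihTrigger <handler-url>...");
            return;
        }
        HttpClient client = HttpClient.newHttpClient();
        // One request per Solr box, e.g. http://foo/solr/dataimport
        for (String handlerUrl : args) {
            int status = trigger(client, handlerUrl, "delta-import");
            System.out.println(handlerUrl + " -> HTTP " + status);
        }
    }
}
```

A crontab entry could then run something like `java DihTrigger.java http://foo/solr/dataimport` shortly after the XML file has been pushed to each box.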

I would like any files processed by Solr to be deleted after they've been imported

I have never heard that Solr is able to do that. As I wrote above, the idea is that Solr is an index of data which is stored somewhere else. So DIH expects all the documents to be available "somewhere". If you remove the original documents from "somewhere" and update the index, the intended result is that the index content is synchronized with the documents that are (now) available...

Are they lost?

No.

"it reads the start time stored in conf/dataimport.properties. It uses that timestamp to run delta queries and after completion, updates the timestamp in conf/dataimport.properties."

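Solr reads the start time, runs the delta queries and, once it has finished, writes the start time of that run (not the completion time) as the timestamp in conf/dataimport.properties. Files added while the import was running are therefore newer than the stored timestamp and get picked up by the next delta import.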
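To make the timestamp reasoning concrete, here is a toy model of it. This is not DIH code: `DeltaWindow`, its field and its method are made up for illustration, and the only assumption is the behaviour described above, namely that the stored value is the start of the last run.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the delta-import timestamp (hypothetical, not Solr code).
final class DeltaWindow {

    // Stands in for the timestamp kept in conf/dataimport.properties.
    private long lastIndexTime;

    DeltaWindow(long initialTimestamp) {
        this.lastIndexTime = initialTimestamp;
    }

    // files maps a file name to its modification time. Returns the files a
    // delta run starting at startTime would pick up, then records startTime.
    List<String> runDelta(Map<String, Long> files, long startTime) {
        List<String> picked = new ArrayList<>();
        for (Map.Entry<String, Long> entry : files.entrySet()) {
            if (entry.getValue() > lastIndexTime) {
                picked.add(entry.getKey());
            }
        }
        // Store the START of this run, not the time it finished.
        lastIndexTime = startTime;
        return picked;
    }

    public static void main(String[] args) {
        DeltaWindow window = new DeltaWindow(0L);
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("a.xml", 50L);
        System.out.println("run 1: " + window.runDelta(files, 100L));
        // b.xml arrives at t=105, while run 1 (started at t=100) is still busy.
        files.put("b.xml", 105L);
        System.out.println("run 2: " + window.runDelta(files, 200L));
    }
}
```

Because the stored value is 100 rather than the completion time, `b.xml` (modified at 105) still qualifies for the second run. The trade-off in this model is that a file modified between the start of a run and the moment it is scanned can be imported twice, which is harmless when documents are keyed by a unique id.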

does it order the files by name or creation date or ...?

Not sure, but I think it reads the files in whatever order the filesystem returns them.

Got a great response from Erick Erickson on the Solr Mailing List:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201109.mbox/browser

Specific replies below, but what I'd seriously consider is writing my own filesystem-aware hook that pushed documents to known Solr servers rather than using DIH to pull them. You could use the code from FileSystemEntityProcessor as a base and go from there. The FileSystemEntityProcessor isn't really intended to do very complex stuff.....
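The push approach suggested here could be sketched roughly as follows. This is only an outline under several assumptions: Java 11+, files that are already in Solr's XML update format, and update handler URLs passed on the command line. The names `SolrFilePusher`, `Sender` and `HttpSender` are invented for the example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.Arrays;
import java.util.List;

// Hypothetical watcher: pushes new XML files to Solr, then deletes them.
final class SolrFilePusher {

    // Abstraction over "send this file to one Solr server".
    interface Sender {
        void send(String updateUrl, Path file) throws IOException, InterruptedException;
    }

    // Posts the file as XML to the update handler and commits.
    static final class HttpSender implements Sender {
        private final HttpClient client = HttpClient.newHttpClient();

        @Override
        public void send(String updateUrl, Path file) throws IOException, InterruptedException {
            HttpRequest request = HttpRequest.newBuilder(URI.create(updateUrl + "?commit=true"))
                    .header("Content-Type", "text/xml; charset=utf-8")
                    .POST(HttpRequest.BodyPublishers.ofFile(file))
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 200) {
                throw new IOException(updateUrl + " returned HTTP " + response.statusCode());
            }
        }
    }

    // Sends the file to every server; deletes it only if all sends succeeded.
    static boolean pushAndDelete(Path file, List<String> updateUrls, Sender sender) {
        try {
            for (String url : updateUrls) {
                sender.send(url, file);
            }
            Files.delete(file);
            return true;
        } catch (IOException e) {
            System.err.println("Keeping " + file + ": " + e.getMessage());
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: SolrFilePusher <watch-dir> <update-url>...");
            return;
        }
        Path dir = Paths.get(args[0]);
        List<String> urls = Arrays.asList(args).subList(1, args.length);
        Sender sender = new HttpSender();
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            while (true) {
                WatchKey key = watcher.take();  // blocks until something appears
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (event.kind() == StandardWatchEventKinds.OVERFLOW) continue;
                    Path file = dir.resolve((Path) event.context());
                    if (file.toString().endsWith(".xml")) {
                        pushAndDelete(file, urls, sender);
                    }
                }
                if (!key.reset()) break;  // directory is no longer accessible
            }
        }
    }
}
```

One caveat with this sketch: the create event can fire before the writer has finished the file, so the producer should write under a temporary name that does not end in `.xml` and rename it once it is complete. Keeping the file when any server fails means a later retry is possible, at the cost of re-sending to servers that already accepted it.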

1> Don't think this is possible OOB. There's nothing built in to the DIH that puts in filesystem hooks and automatically tries to index it....

2> Nope. DIH is pretty simple that way as per the FileListEntityProcessor.

3> I'm pretty sure this is irrelevant to FileSystemEntityProcessor, it's really used for the database importation.

4> "whatever order Java returns them in". Take a look at the FileListEntityProcessor code, but the relevant bit is below. So the ordering is whatever Java does which I don't know what, if any, guarantees are made.

  private void getFolderFiles(File dir, final List<Map<String,Object>> fileDetails) {
    // Fetch an array of file objects that pass the filter, however the
    // returned array is never populated; accept() always returns false.
    // Rather we make use of the fileDetails array which is populated as
    // a side affect of the accept method.
    dir.list(new FilenameFilter() {
      public boolean accept(File dir, String name) {
        File fileObj = new File(dir, name);
        if (fileObj.isDirectory()) {
          if (recursive) getFolderFiles(fileObj, fileDetails);
        } else if (fileNamePattern == null) {
          addDetails(fileDetails, dir, name);
        } else if (fileNamePattern.matcher(name).find()) {
          if (excludesPattern != null && excludesPattern.matcher(name).find())
            return false;
          addDetails(fileDetails, dir, name);
        }
        return false;
      }
    });
  }
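On the ordering question: the Javadoc for `File.list` states that the returned names are not guaranteed to appear in any specific order. If order matters, one option in a custom processor or push script is to sort explicitly. A minimal sketch follows, with `OrderedFileLister` being a made-up name and "oldest first, then by name" just one possible policy:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: lists the plain files of a directory in a fixed order.
final class OrderedFileLister {

    // Oldest first; ties are broken by file name so the order is deterministic.
    static final Comparator<File> BY_MODIFIED_THEN_NAME =
            Comparator.comparingLong(File::lastModified).thenComparing(File::getName);

    static List<File> listOrdered(File dir, Comparator<File> order) {
        File[] files = dir.listFiles(File::isFile);
        if (files == null) {
            // dir is not a directory, or an I/O error occurred
            return new ArrayList<>();
        }
        Arrays.sort(files, order);
        return new ArrayList<>(Arrays.asList(files));
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (File f : listOrdered(dir, BY_MODIFIED_THEN_NAME)) {
            System.out.println(f.lastModified() + "\t" + f.getName());
        }
    }
}
```

Sorting by name alone (`Comparator.comparing(File::getName)`) works just as well if the files carry a sequence number or timestamp in their names, and it does not depend on modification times surviving the copy to each box.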