What is a good way to index a Solr record in which the source data comes from multiple sources?

I have multiple sources of data from which I want to produce Solr documents. One source is a filesystem, so I plan to iterate through a set of (potentially many) files to collect one portion of the data in each resulting Solr doc. The second source is another Solr index, from which I'd like to pull just a few fields. This second source could also have many (~millions of) records. If it matters, source 1 provides the bulk of the content (the size of each record there is several orders of magnitude greater than that from source 2).

Source 1:

  • /file/band1 -> id="xyz1" name="beatles" era="60s"
  • /file/band2 -> id="xyz2" name="u2" era="80s"
  • ...
  • /file/band4000 -> id="xyz4000" name="clash" era="70s"

Source 2:

  • solr record 1 -> id="xyz2" guitar="edge"
  • solr record 2 -> id="xyz4000" guitar="jones"
  • solr record 3 -> id="xyz1" guitar="george"

My issue is how best to design this workflow. A few high-level choices include:

  1. Fully index the data from source 1 (the filesystem). Next, index the data from source 2 and update the already-indexed records. With Solr, I believe you still can't just add a single field to a record; you replace the entire old record with the new (see the sketch after this list).
  2. Do the reverse of (1), indexing first the data from the Solr source, followed by the data from the filesystem.
  3. Somehow integrate the data before indexing into Solr. In general, we don't know much about the order of traversal in each source; that is, I don't see an easy way to iterate the two sources in lockstep so that xyz1 gets processed from both sources, then xyz2, and so on.
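
To make the concern in option 1 concrete, here is a minimal sketch of the replace-on-same-id behaviour, assuming Python with the requests library and a hypothetical local core named "bands" (the URL and field names are illustrative only):

```python
import requests

SOLR_UPDATE = "http://localhost:8983/solr/bands/update?commit=true"

# First pass: a document built from source 1 (the filesystem).
requests.post(SOLR_UPDATE, json=[{"id": "xyz1", "name": "beatles", "era": "60s"}])

# If a later pass sends only the source-2 field under the same id, the
# source-1 fields are gone: the new document replaces the old one wholesale.
requests.post(SOLR_UPDATE, json=[{"id": "xyz1", "guitar": "george"}])

# Keeping everything means the later pass has to re-send the merged record.
requests.post(SOLR_UPDATE, json=[
    {"id": "xyz1", "name": "beatles", "era": "60s", "guitar": "george"}
])
```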

So some of the factors affecting the decision include the size of the data (can't afford to be too inefficient in terms of computational time or memory) and the performance of Solr when replacing records (does the original size matter much?).

Any ideas would be greatly appreciated.


I would say that if you're not concerned about merging the data from the two sources before indexing, then option 1 or 2 would work fine. I would probably index the larger source first, then "update" with the second.


Go with option 3 — combine the records before updating.

Presumably you would be using a script to iterate over the files and process them before sending them to your final Solr index. Within that script, query the alternate Solr index to fetch any supplemental field information that it might have, using your shared identifier. Combine that as appropriate with the contents of your file, then send the resulting record to Solr for indexing.
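
For illustration, here is a minimal sketch of that workflow, assuming Python with the requests library, hypothetical core names ("guitars" for the supplemental index, "bands" for the target), and the single-line key="value" file format shown in the question:

```python
import glob
import re

import requests

SUPPLEMENT_SELECT = "http://localhost:8983/solr/guitars/select"  # source 2 (existing Solr index)
TARGET_UPDATE = "http://localhost:8983/solr/bands/update"        # final index being built
BATCH_SIZE = 500

def parse_band_file(path):
    """Assumed source-1 format: a line such as  id="xyz1" name="beatles" era="60s"."""
    with open(path) as fh:
        return dict(re.findall(r'(\w+)="([^"]*)"', fh.read()))

def supplemental_fields(doc_id):
    """Fetch the few extra fields (e.g. 'guitar') for one id from the second Solr index."""
    resp = requests.get(SUPPLEMENT_SELECT, params={
        "q": f'id:"{doc_id}"',
        "fl": "guitar",   # pull just the fields we need
        "rows": 1,
        "wt": "json",
    })
    docs = resp.json()["response"]["docs"]
    return docs[0] if docs else {}

batch = []
for path in glob.glob("/file/band*"):
    doc = parse_band_file(path)                  # bulk of the content (source 1)
    doc.update(supplemental_fields(doc["id"]))   # few extra fields (source 2)
    batch.append(doc)
    if len(batch) >= BATCH_SIZE:
        requests.post(TARGET_UPDATE, json=batch)  # send combined docs in batches
        batch = []

if batch:
    requests.post(TARGET_UPDATE, json=batch)

# One commit at the end rather than per batch.
requests.get(TARGET_UPDATE, params={"commit": "true"})
```

The per-id lookup adds one small query per file against the supplemental index, while batching keeps the number of update requests to the target index low; a single commit at the end avoids repeated commit overhead.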

By combining before you update, you don't have to worry about records overwriting each other. You also maintain more control over which source has priority. Furthermore, so long as you're not querying a server on the other side of the country, the request time to the alternate Solr index should be negligible.
