
Solr indexing HTML entities

开发者 https://www.devze.com 2023-03-09 15:53 Source: web
I am using Solr to index documents that were scraped from the web. The documents contain HTML entities (such as &pound; or &#163;), and most of them contain Central European characters. Is there a char filter for this task? I know about solr.MappingCharFilterFactory, but using it would mean that I have to define the mappings myself. I would be happier with a shared solution maintained by a community. Thanks for your help!


There is solr.HTMLStripCharFilterFactory, which converts HTML entities, but it also strips HTML tags.
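If stripping the tags is not acceptable, one possible workaround is to stay with solr.MappingCharFilterFactory but generate the mapping file instead of writing it by hand. The following is only a sketch, not an official Solr tool: it assumes Python's standard html.entities table (HTML 4 named entities) is a good enough entity list, and the function names and the output file name are made up for illustration. Numeric entities for characters outside that table are not covered.

```python
from html.entities import name2codepoint


def build_mapping_lines():
    """Return lines in Solr mapping-file syntax: "source" => "target"."""
    mapping = {}
    for name, codepoint in sorted(name2codepoint.items()):
        # Write the target as a \uXXXX escape so that characters such as
        # the double quote (&quot;) need no extra escaping in the file.
        target = "\\u%04X" % codepoint
        mapping["&%s;" % name] = target        # named form, e.g. &pound;
        mapping["&#%d;" % codepoint] = target  # decimal form, e.g. &#163;
        mapping["&#x%X;" % codepoint] = target # hex form, upper case digits
        mapping["&#x%x;" % codepoint] = target # hex form, lower case digits
    # The dict removes duplicate sources (upper and lower case hex forms
    # are identical when the hex digits contain no letters).
    return ['"%s" => "%s"' % (src, dst) for src, dst in mapping.items()]


def write_mapping_file(path):
    """Write the generated mappings to path (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(build_mapping_lines()) + "\n")
```

Calling write_mapping_file("mapping-html-entities.txt") (a hypothetical file name) produces a file that can be referenced through the mapping attribute of solr.MappingCharFilterFactory in the field type's analyzer. Tags are left untouched, since only the listed entity strings are replaced.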
