thinking_sphinx generated query cuts off my large indexed text

Developer https://www.devze.com 2023-03-16 23:02 Source: Internet

I've got a weird issue with the thinking_sphinx gem for Rails.

We've been indexing contents from documents within our application. Whether it is a pdf, word document or xls, we dump its content directly to the DB using several command line tools. This output is saved to a DB field called :raw_text.

In another model, called thought, we have an index block for thinking_sphinx as such:

define_index sphinx_index_name do
  indexes :title
  indexes :text
  ...
  indexes documents(:raw_text), :as => :thought_document_raw_text
  set_property                  :delta => :datetime, :delta_column => :updated_on, :threshold => TS_DELTA_INDEXING_THRESHOLD
  set_property                  :group_concat_max_len => 4294967295
end

Worth mentioning: the document's raw_text attribute is also indexed on the document model itself. Things get weird when comparing the output of the two queries generated by TS.

When I look at the output of the TS generated query for document_core, I see the whole text of an indexed PDF file. Yay! Exactly what I'd hoped for!

If I run the TS generated query for our thought model, I only get a fraction of the :raw_text in the column named after the field we defined in our index, thought_document_raw_text.

Since the TS index for thought pulls in data from the related document model, the query contains some elements to tie these entities together:

GROUP_CONCAT(DISTINCT IFNULL('documents'.'raw_text', '0') SEPARATOR ' ') AS 'thought_document_raw_text' and LEFT OUTER JOIN 'documents' ON documents.thought_id = thoughts.id
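
To make that fragment concrete, here is a small plain-Ruby model of what the expression computes for one thought. It is only a sketch of the SQL semantics, not anything Thinking Sphinx or MySQL actually runs: the method name `group_concat` and its keyword arguments are made up for illustration, and the `max_len` cut-off is modelled on MySQL's documented `group_concat_max_len` limit (a byte count, 1024 by default).

```ruby
# Plain-Ruby model of
#   GROUP_CONCAT(DISTINCT IFNULL(col, '0') SEPARATOR ' ')
# for the rows that belong to a single GROUP BY group.
# Hypothetical helper for illustration only; it never talks to MySQL.
def group_concat(values, separator: ' ', distinct: true, max_len: nil)
  parts = values.map { |v| v.nil? ? '0' : v.to_s } # IFNULL(col, '0')
  parts = parts.uniq if distinct                   # DISTINCT drops repeated values
  result = parts.join(separator)                   # SEPARATOR ' '
  # Model of the group_concat_max_len cut-off: keep at most max_len bytes.
  # (byteslice can split a multi-byte character; fine for this ASCII demo.)
  max_len ? result.byteslice(0, max_len) : result
end

# Rows produced by the LEFT OUTER JOIN for one thought: the same document
# text shows up twice, and one document has no raw_text at all.
joined_rows = ['report text', 'report text', nil, 'notes']

puts group_concat(joined_rows)                   # duplicates collapsed
puts group_concat(joined_rows, distinct: false)  # duplicates kept
puts group_concat(joined_rows, max_len: 10)      # result cut off after 10 bytes
```

So DISTINCT only collapses repeated values that the join produces for a thought; on its own, this model gives no reason for it to shorten a single large value.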

A stripped down version of the entire query looks like this:

SELECT SQL_NO_CACHE 'thoughts'.'id' * 11 + 8 AS 'id' ,
  'thoughts'.'title' AS 'title',
  'thoughts'.'text' AS 'text',
  GROUP_CONCAT(DISTINCT IFNULL('documents'.'filename', '0') SEPARATOR ' ') AS 'thought_document_filenames',
  GROUP_CONCAT(DISTINCT IFNULL('documents'.'raw_text', '0') SEPARATOR ' ') AS 'thought_document_raw_text',
  'thoughts'.'id' AS 'sphinx_internal_id',
  CAST(IFNULL(CRC32(NULLIF('thoughts'.'type','')), 1577494256) AS UNSIGNED) AS 'class_crc',
  0 AS 'sphinx_deleted'
FROM 'thoughts'
LEFT OUTER JOIN 'documents' ON documents.thought_id = thoughts.id
GROUP BY 'thoughts'.'id', 'thoughts'.'type'
ORDER BY NULL;

When I check the contents of thought_document_raw_text, it is obviously not the whole text, since the byte size is smaller (21876 bytes).

What is the purpose of this distinct within the group concat?

What are my options to avoid the distinct being generated?

Why is my text blob being cut off?

If someone has had the same problem, or some similar problem regarding large amount of text, please let me know. Thanks in advance!

Edit: what I forgot to mention is that when I remove the DISTINCT from the generated query, the output matches that of its fellow TS query for document_core. The problem has to do with this specific part!

This is a MySQL issue, but you can customise it to some extent within Thinking Sphinx. The docs have the lowdown.
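
One way that customisation might look, as a rough and untested sketch: replace the association shorthand with a hand-written SQL snippet so that the field's GROUP_CONCAT is spelled out without DISTINCT. This assumes the legacy define_index DSL used in the question, that string snippets passed to indexes end up in the generated query as written, and that the joined table is referenced as documents (as in the query above). The document_ids attribute is a name invented here, added only so that the association is still referenced and the join stays in the query. The delta settings and the elided fields from the question are left out for brevity.

```ruby
# Sketch only, not verified against a real index. Belongs inside the
# thought model, in place of the original index block.
define_index sphinx_index_name do
  indexes :title
  indexes :text

  # Hand-written SQL instead of `indexes documents(:raw_text)`, so this
  # column's GROUP_CONCAT carries no DISTINCT (assumption: the snippet is
  # used verbatim in the generated SELECT).
  indexes "GROUP_CONCAT(IFNULL(documents.raw_text, '0') SEPARATOR ' ')",
          :as => :thought_document_raw_text

  # Hypothetical attribute: references the association so the
  # LEFT OUTER JOIN on documents is still generated.
  has documents(:id), :as => :document_ids

  # Same setting as in the question.
  set_property :group_concat_max_len => 4294967295
end
```

After changing the index block, the generated query can be compared against the document_core one again to check that the byte sizes now match.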