I've got a weird issue with the thinking_sphinx gem for Rails.
We've been indexing the contents of documents within our application. Whether a document is a PDF, a Word document or an XLS file, we dump its content directly to the DB using several command-line tools. This output is saved to a DB field called :raw_text.
In another model, called thought, we have an index block for thinking_sphinx like this:
define_index sphinx_index_name do
  indexes :title
  indexes :text
  ...
  indexes documents(:raw_text), :as => :thought_document_raw_text
  set_property :delta => :datetime, :delta_column => :updated_on, :threshold => TS_DELTA_INDEXING_THRESHOLD
  set_property :group_concat_max_len => 4294967295
end
Worth mentioning: the document's raw_text attribute is also indexed on the document model itself. Things get weird when comparing the output of the two queries generated by TS.
When I look at the output of the TS-generated query for document_core, I see the whole text of an indexed PDF file. Yay! Exactly what I'd hoped for!
If I run the TS-generated query for our thought model, I only get a fraction of the :raw_text, in the column named as we defined in our index: thought_document_raw_text.
Since the TS index for thought relates to another model, document, the query contains some elements to tie these entities together:
GROUP_CONCAT(DISTINCT IFNULL(`documents`.`raw_text`, '0') SEPARATOR ' ') AS `thought_document_raw_text`
and
LEFT OUTER JOIN `documents` ON documents.thought_id = thoughts.id
A stripped down version of the entire query looks like this:
SELECT SQL_NO_CACHE
  `thoughts`.`id` * 11 + 8 AS `id`,
  `thoughts`.`title` AS `title`,
  `thoughts`.`text` AS `text`,
  GROUP_CONCAT(DISTINCT IFNULL(`documents`.`filename`, '0') SEPARATOR ' ') AS `thought_document_filenames`,
  GROUP_CONCAT(DISTINCT IFNULL(`documents`.`raw_text`, '0') SEPARATOR ' ') AS `thought_document_raw_text`,
  `thoughts`.`id` AS `sphinx_internal_id`,
  CAST(IFNULL(CRC32(NULLIF(`thoughts`.`type`, '')), 1577494256) AS UNSIGNED) AS `class_crc`,
  0 AS `sphinx_deleted`
FROM `thoughts`
LEFT OUTER JOIN `documents` ON documents.thought_id = thoughts.id
GROUP BY `thoughts`.`id`, `thoughts`.`type`
ORDER BY NULL;
When I check the contents of thought_document_raw_text, it is obviously not the whole text, since the byte size is smaller (21876 bytes).
What is the purpose of the DISTINCT within the GROUP_CONCAT?
What are my options to avoid the DISTINCT being generated?
Why is my text blob being cut off?
If someone has had the same problem, or a similar problem with large amounts of text, please let me know. Thanks in advance!
Edit: I forgot to mention that when I remove the DISTINCT from the generated query, the output matches that of the TS query for document_core. The problem has to do with this specific part!
This is a MySQL issue, but you can customise it to some extent within Thinking Sphinx. The docs have the lowdown.
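As for what the DISTINCT does: here is a rough plain-Ruby model of the GROUP_CONCAT(DISTINCT IFNULL(col, '0') SEPARATOR ' ') expression, applied to the joined rows of a single thought. It is only a sketch, not what MySQL or Thinking Sphinx actually run, and it ignores any length limits. The group_concat helper and the sample rows are made up for illustration. The idea is that once an index joins more than one one-to-many association, each document row can show up several times per thought, and DISTINCT presumably exists to collapse those repeats.

```ruby
# Plain-Ruby model of
#   GROUP_CONCAT(DISTINCT IFNULL(col, '0') SEPARATOR ' ')
# for the rows of one group. Illustrative only: this is not how MySQL
# or Thinking Sphinx implement it, and it ignores length limits.
# `group_concat` is a hypothetical helper, not a library method.
def group_concat(values, distinct: true, separator: ' ')
  filled = values.map { |v| v.nil? ? '0' : v } # IFNULL(col, '0')
  filled = filled.uniq if distinct              # DISTINCT
  filled.join(separator)                        # SEPARATOR ' '
end

# Made-up joined rows for one thought: a second one-to-many join
# repeats every document row once per row of the other association.
raw_texts = [
  'first pdf text', 'first pdf text',
  'second pdf text', 'second pdf text'
]

with_distinct    = group_concat(raw_texts)
without_distinct = group_concat(raw_texts, distinct: false)

puts with_distinct    # each document's text appears once
puts without_distinct # each document's text appears twice
```

Note that Array#uniq keeps first-occurrence order, whereas MySQL makes no ordering promise for GROUP_CONCAT unless an ORDER BY is given inside it, so treat the ordering here as an artifact of the sketch.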