I asked this question on meta, but I now realize that it may be more appropriate for the main site, as it is a general question that relates to any tag-based system. (I am happy to close or delete one, depending on where people think this question should go.)
I have a similar system of tagged data and I am running into the same problem SOF did: I have lots of tags that are really the same thing. I am trying to create a tag synonym page similar to SOF's to help organize this information.
A few questions around the relationships and "data model" of tag synonyms:
I assume that a master tag can have multiple synonym tags but a synonym tag can only be a synonym for one master tag. Is that correct?
Also, can a master tag also be a synonym tag? For example, let's say you have a tag called javascript and you had:
Master: js
Synonyms: java-script, js-web
can you also have:
Master: javascript
Synonyms: js
So in the example above, you would keep resolving until js-web ultimately resolves to javascript, because the master tag js is itself a synonym tag.
Also, that makes me think you could also run into a circular reference where you have a
Master: js
Synonyms: java-script
and
Master: javascript
Synonyms: js
How does the system deal with circular references?
It is tempting to give you a more theoretical answer on meta concerning folksonomies, polysemy and such! Since I am answering on the StackOverflow side I will try and give a marginally more technical answer. Running queries using the StackOverflow Data Explorer will allow me to attempt to answer your questions (I am not affiliated with StackOverflow so I can't know for sure).
On StackOverflow the master/synonym tag relationship is carefully stewarded and cultivated. At the time of writing from the Data Explorer:
- Tags has 29488 rows
- TagSynonyms has 1916 rows
It is interesting to contrast this with other folksonomies. One article, "Technorati tags: Good idea, terrible implementation", states:
"Technorati advertises that they're now tracking 466,951 different tags, which is pretty darn impressive when you consider that a typical dictionary has around 75,000 entries"
A quick caveat: I usually write Oracle SQL, and I assume that the Data Explorer is using SQL Server, so my queries may be a little amateurish. Firstly, my presumptions about the data:
- anything listed in the Tags table is a "master tag".
- in the TagSynonyms table, TargetTagName is a "master tag", SourceTagName is the "synonym tag".
Now to your specific queries:
"I assume that a master tag can have multiple synonym tags but a synonym tag can only be a synonym for one master tag. Is that correct?"
select * from TagSynonyms where TargetTagName = 'javascript'
Result: Yes. A master tag can have multiple synonym tags.
select SourceTagName, count(*) from TagSynonyms group by SourceTagName having count(*) > 1
Result: Yes. No synonym tag shows up with more than one row, so a synonym tag can only be a synonym for one master tag.
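For your own system, one way to get the same one-master-per-synonym guarantee is to key the mapping by the synonym. Below is a minimal sketch in Python, assuming a simple in-memory dict. The names `synonym_to_master` and `synonyms_of` are hypothetical and are not anything StackOverflow is known to use.

```python
from collections import defaultdict

# Hypothetical in-memory model: each synonym is a dict key, so it can
# point at only one master tag. A master may appear as a value many times.
synonym_to_master = {
    "java-script": "js",
    "js-web": "js",
    "js": "javascript",
}


def synonyms_of(mapping):
    """Invert the mapping: master tag -> sorted list of its direct synonyms."""
    inverse = defaultdict(list)
    for synonym, master in mapping.items():
        inverse[master].append(synonym)
    return {master: sorted(names) for master, names in inverse.items()}


masters = synonyms_of(synonym_to_master)
```

In a relational table, a unique constraint on the synonym column would play the same role as the dict key does here.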
"Also, can a master tag also be a synonym tag?"
select TagName from Tags
intersect
select SourceTagName from TagSynonyms
Result: Yes. A master tag can also be a synonym tag. When I ran this query there were 465 tags that were both synonym and master.
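If masters can themselves be synonyms, a lookup has to follow the chain until it reaches a tag that is not a synonym of anything. The data does not tell us whether StackOverflow resolves chains this way, so treat the following as a sketch for your own system. The function name `resolve` and the example dict are assumptions.

```python
def resolve(tag, synonym_to_master):
    """Follow synonym links until reaching a tag that is not itself a synonym."""
    seen = {tag}
    while tag in synonym_to_master:
        tag = synonym_to_master[tag]
        if tag in seen:
            # Guard against bad data so the loop cannot run forever.
            raise ValueError(f"circular synonym chain at {tag!r}")
        seen.add(tag)
    return tag


# Example data taken from the question: js-web -> js -> javascript.
example = {"java-script": "js", "js-web": "js", "js": "javascript"}
resolved = resolve("js-web", example)
```

Here `resolved` ends up as `javascript`, and a tag that has no entry in the mapping resolves to itself.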
"How does the system deal with circular references?"
This is where my logic/SQL may let me down. The question is can I find any circular references? To do this I think I need to work out:
- Set a - set of tags that are both master and synonym
- Set b - synonyms for the synonyms of the tags in set a
- Set c - a intersection b
Anything in set c would be a circular reference.
We have already calculated set a above (it has 465 rows).
Set b - synonyms for the synonyms of set a
select SourceTagName from TagSynonyms where TargetTagName in (
    select SourceTagName from TagSynonyms where TargetTagName in (
        select TagName from Tags
        intersect
        select SourceTagName from TagSynonyms
    )
)
Result: 0 rows
We can stop here, there is no point working out set c as we already know set b is empty.
Unless I got my logic or SQL wrong (which is very possible) it seems there are no circular references in StackOverflow. I would imagine there are technical processes in place to prevent circular references from happening (otherwise StackOverflow could suffer StackOverflow!).
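I can't see how StackOverflow enforces this, but in your own system the simplest place to stop circular references is at the moment a synonym is added. A rough sketch, assuming the synonym-to-master mapping is a plain dict that is only ever modified through this (hypothetical) `add_synonym` function:

```python
def add_synonym(synonym_to_master, synonym, master):
    """Add synonym -> master, rejecting links that would close a loop."""
    if synonym == master:
        raise ValueError("a tag cannot be a synonym of itself")
    if synonym in synonym_to_master:
        raise ValueError(f"{synonym!r} already has a master")
    # Walk upward from the proposed master. Reaching `synonym` means the
    # new link would create a cycle. The walk terminates because every
    # earlier link passed this same check, so the mapping has no loops.
    current = master
    while current in synonym_to_master:
        current = synonym_to_master[current]
        if current == synonym:
            raise ValueError(f"{synonym!r} -> {master!r} would create a cycle")
    synonym_to_master[synonym] = master


links = {}
add_synonym(links, "java-script", "js")
add_synonym(links, "js", "javascript")
try:
    # javascript -> java-script would close the loop java-script -> js -> javascript.
    add_synonym(links, "javascript", "java-script")
    cycle_rejected = False
except ValueError:
    cycle_rejected = True
```

Checking on insert keeps lookups cheap, since a resolver can then assume every chain ends at a real master tag.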