What is the data model used for tags and tag synonyms?

Developer (https://www.devze.com), 2023-03-31 14:36. Source: web
I asked this question on meta, but I now realize that it may be more appropriate for the main site, as it is a general question that relates to any tagging-based system. (I am happy to close or delete one of them, depending on where people think this question should go.)


I have a similar system of tagged data and I am running into the same problem that SOF did: I have lots of tags that are really the same thing. I am trying to create a tag synonym page similar to SOF's to help organize this information.

A few questions around the relationships and "data model" of tag synonyms:

I assume that a master tag can have multiple synonym tags but a synonym tag can only be a synonym for one master tag. Is that correct?

Also, can a master tag also be a synonym tag? For example, let's say you have a tag called javascript and you had:

Master: js

Synonyms: java-script, js-web

can you also have:

Master: javascript

Synonyms: js

So in the example above, you would keep resolving until js-web ultimately resolves to javascript, because the master tag js is itself a synonym tag.

That also makes me think you could run into a circular reference, where you have:

Master: js

Synonyms: java-script

and

Master: javascript

Synonyms: js

How does the system deal with circular references?


It is tempting to give you a more theoretical answer on meta concerning folksonomies, polysemy and such! Since I am answering on the StackOverflow side, I will try to give a marginally more technical answer. Running queries using the StackOverflow Data Explorer will allow me to attempt to answer your questions (I am not affiliated with StackOverflow, so I can't know for sure).

On StackOverflow the master/synonym tag relationship is carefully stewarded and cultivated. At the time of writing from the Data Explorer:

  • Tags has 29488 rows
  • TagSynonyms has 1916 rows

It is interesting to contrast this with other folksonomies. One article, "Technorati tags: Good idea, terrible implementation", states:

"Technorati advertises that they're now tracking 466,951 different tags, which is pretty darn impressive when you consider that a typical dictionary has around 75,000 entries"

A quick caveat: I usually write Oracle SQL and I assume that the Data Explorer is using SQL Server, so my queries may be a little amateurish. First, my assumptions about the data:

  • anything listed in the Tags table is a "master tag".
  • in the TagSynonyms table, TargetTagName is a "master tag", SourceTagName is the "synonym tag".
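To make those two assumptions concrete, here is a minimal sketch of the data model using Python's built-in sqlite3 module. The table and column names are taken from the queries in this answer; everything else (the column types, the primary keys, the `try_add_synonym` helper) is an assumption for illustration and not a claim about how StackOverflow actually stores its tags.

```python
import sqlite3

# In-memory database, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Tags (
        TagName TEXT PRIMARY KEY
    );
    -- Assumed layout: making SourceTagName the primary key means a
    -- synonym tag can point at one master tag only.
    CREATE TABLE TagSynonyms (
        SourceTagName TEXT PRIMARY KEY,
        TargetTagName TEXT NOT NULL
    );
""")
conn.executemany("INSERT INTO Tags VALUES (?)", [("javascript",), ("js",)])
conn.executemany(
    "INSERT INTO TagSynonyms VALUES (?, ?)",
    [("js", "javascript"), ("java-script", "javascript")],
)
conn.commit()


def try_add_synonym(source, target):
    """Insert a synonym row; return False if source already has a master."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO TagSynonyms VALUES (?, ?)", (source, target)
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this layout, a master tag can collect any number of synonym rows, while a second row for the same synonym is rejected by the database itself rather than by application code.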

Now to your specific queries:

"I assume that a master tag can have multiple synonym tags but a synonym tag can only be a synonym for one master tag. Is that correct?"

select * from TagSynonyms where TargetTagName = 'javascript'

Result: Yes. A master tag can have multiple synonym tags.

select SourceTagName, count(*) from TagSynonyms group by SourceTagName having count(*) > 1

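Result: Yes. A synonym tag can only be a synonym for one master tag. (The query looks for synonym tags that map to more than one master tag, so an empty result is what confirms this.)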

"Also, can a master tag also be a synonym tag?"

select TagName from Tags
intersect
select SourceTagName from TagSynonyms

Result: Yes. A master tag can also be a synonym tag. When I ran this query there were 465 tags that were both synonym and master.
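Because a master tag can itself be a synonym, looking up a tag may mean following more than one link, as in the js-web example from the question. A minimal sketch of that lookup, assuming the TagSynonyms rows have been loaded into a plain dict from synonym to master (the `resolve` function and the dict are hypothetical, not part of any StackOverflow API):

```python
def resolve(tag, synonyms):
    """Follow synonym -> master links until reaching a tag with no master."""
    seen = {tag}
    while tag in synonyms:
        tag = synonyms[tag]
        if tag in seen:
            # We came back to a tag already visited: the data has a loop.
            raise ValueError(f"circular synonym chain at {tag!r}")
        seen.add(tag)
    return tag


# The example from the question: js is both a master and a synonym.
synonyms = {"java-script": "js", "js-web": "js", "js": "javascript"}
print(resolve("js-web", synonyms))  # javascript
```

The `seen` set is only a safety net: it keeps the loop from running forever if a circular reference ever slips into the data.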

"How does the system deal with circular references?"

This is where my logic/SQL may let me down. The question is: can I find any circular references? To do this I think I need to work out:

  • Set a - set of tags that are both master and synonym
  • Set b - synonyms for the synonyms of the tags in set a
  • Set c - a intersection b

Anything in set c would be a circular reference.

We have already calculated set a above (it has 465 rows).

Set b - synonyms for the synonyms of set a

select SourceTagName from TagSynonyms where TargetTagName in (
select SourceTagName from TagSynonyms where TargetTagName in (
select TagName from Tags
intersect
select SourceTagName from TagSynonyms
))

Result: 0 rows

We can stop here; there is no point working out set c, as we already know set b is empty.

Unless I got my logic or SQL wrong (which is very possible), it seems there are no circular references in StackOverflow. I would imagine there are technical processes in place to prevent circular references from happening (otherwise StackOverflow could suffer StackOverflow!).
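For your own system, one way such a safeguard could look is a check at the moment a synonym is proposed. This is only a sketch under my own assumptions (the mapping is held in a dict, and both function names are made up); I have no knowledge of what StackOverflow actually does.

```python
def would_create_cycle(synonyms, source, target):
    """Return True if mapping source -> target would close a loop."""
    current = target
    seen = set()
    # Walk up from the proposed master; reaching source means a loop.
    while current is not None and current not in seen:
        if current == source:
            return True
        seen.add(current)
        current = synonyms.get(current)
    return False


def add_synonym(synonyms, source, target):
    """Add source -> target, enforcing one master per synonym and no cycles."""
    if source in synonyms:
        raise ValueError(f"{source!r} is already a synonym of {synonyms[source]!r}")
    if would_create_cycle(synonyms, source, target):
        raise ValueError(f"{source!r} -> {target!r} would be circular")
    synonyms[source] = target


synonyms = {"java-script": "js"}
add_synonym(synonyms, "js", "javascript")      # fine: js-web style chains are allowed
# add_synonym(synonyms, "javascript", "js")    # would raise ValueError
```

Checking at insert time keeps the stored data free of loops, so lookups never have to deal with them.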
