Most efficient database schema for counting keywords

Developer https://www.devze.com 2023-03-12 02:23 Source: Internet
I'm working on an iPhone app with a GAE backend. I currently have a database of ~8000 products and each product has 5 keywords, mined from reviews, that are the words used most often to describe the product. Once I deploy the app, I'd like to allow users to add new products, and add their 5 keywords to existing products. So, when "reviewing" an existing product, they would add their 5 words, and these would be reflected in the Top 5 words if they push a word over into the Top 5. These keywords will be selected via a large whitelist with indirect selection so I can control the user input. I'd like the application to scale to thousands of users without hitting my backend too hard.

My question is: What's the most efficient database schema for keeping track of all the words for a product and calculating the top 5 for each product once it's updated?

My two ideas (which may be terrible):

  1. Have a "words" column which contains a 2d array, one dimension is the word, the other is the count for that word. They would then be incremented/decremented as needed.

  2. Have a database with each word as a column and each product as a row and the corresponding row/column would contain the count.


The easiest way to do this would be to have a 'tags' kind, defined something like this (you haven't specified a backend language, so I'm assuming Python):

from google.appengine.ext import db

class Tag(db.Model):
  # Tags should be child entities of Products and have key name based on the tag
  # eg, created with Tag(parent=a_product, key_name='awesome', ...)
  count = db.IntegerProperty(required=True, default=0)

  @classmethod
  def increment_tags(cls, product, tag_names):
    def _tx():
      tags = cls.get_by_key_name(tag_names, parent=product)
      for i, tag in enumerate(tags):
        if tag is None:
          # New tag
          tags[i] = tag = cls(key_name=tag_names[i], parent=product)
        tag.count += 1
      db.put(tags)
    return db.run_in_transaction(_tx)

  @classmethod
  def get_top_product_tags(cls, product, num=5):
    return [x.key().name() for x
            in cls.all().ancestor(product).order('-count').fetch(num)]

The increment_tags method increments the count property on all the relevant tags. Since they all have the same parent entity, they're in the same entity group, so all of the updates can be made in a single transaction.

The get_top_product_tags method does a simple datastore query to find the num top ranked tags for a product.
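
To see what the two methods compute without a datastore at hand, here is a rough plain-Python stand-in. It is only a sketch of the logic: the `_tags_by_product` dict, the "widget" product, and the sample words are made up for illustration, and nothing here gives you the datastore's transactional guarantees.

```python
from collections import defaultdict

# In-memory stand-in for the datastore: product key -> {tag name -> count}.
# Hypothetical structure for illustration; it models the logic only.
_tags_by_product = defaultdict(dict)


def increment_tags(product, tag_names):
    """Add 1 to each named tag under a product, creating missing tags first."""
    tags = _tags_by_product[product]
    for name in tag_names:
        tags[name] = tags.get(name, 0) + 1


def get_top_product_tags(product, num=5):
    """Return up to `num` tag names for a product, highest count first."""
    tags = _tags_by_product[product]
    # Ties are broken alphabetically here so the result is deterministic;
    # that tie-break is an assumption of this sketch.
    ranked = sorted(tags.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _count in ranked[:num]]


# Three users each submit five words for the same (made-up) product.
increment_tags("widget", ["fast", "cheap", "sturdy", "light", "quiet"])
increment_tags("widget", ["fast", "sturdy", "loud", "heavy", "cheap"])
increment_tags("widget", ["fast", "loud", "small", "neat", "sturdy"])
top_five = get_top_product_tags("widget")
```

After the three submissions, "fast" and "sturdy" have a count of 3, "cheap" and "loud" have 2, and one of the single-use words fills the fifth slot.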


You should use a normalized schema and let SQL and the database engine be your friend. Have a single table with a design like this:

create table KeywordUse
( AppID     int
, UserID    int
, Sequence  int
, Word      varchar(50) -- or whatever makes sense
)

You can also have an identity primary key if you like, but AppID + UserID + Sequence is a candidate key (i.e. the combination of these three must be unique).

To find the top 5 keywords for any app, do a SQL query like this:

select top 5
  count(AppID) as Frequency -- If you have an identity PK count that instead.
, Word
from KeywordUse
where AppID = @AppIDVariable...
group by Word, AppID
order by count(AppID) desc

If you are really, really worried about performance you could denormalize the results of this query into a table that shows the words for each app. Then you'd have to work out how often to refresh that snapshot.
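
If you want to try the top-5 query locally, here is a minimal sketch using Python's built-in sqlite3 module instead of the SQL Server dialect used above. SQLite has no `TOP`, so `LIMIT` stands in for it, and the query groups by Word only because the WHERE clause already fixes AppID. The sample rows and the `top_words` helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE KeywordUse (
        AppID    INTEGER NOT NULL,
        UserID   INTEGER NOT NULL,
        Sequence INTEGER NOT NULL,
        Word     TEXT    NOT NULL,
        PRIMARY KEY (AppID, UserID, Sequence)  -- the candidate key
    )
    """
)

# Made-up sample data: (AppID, UserID, Sequence, Word).
rows = [
    (1, 10, 1, "fast"), (1, 10, 2, "cheap"),
    (1, 11, 1, "fast"), (1, 11, 2, "sturdy"),
    (1, 12, 1, "fast"), (1, 12, 2, "cheap"),
    (2, 10, 1, "slow"),
]
conn.executemany("INSERT INTO KeywordUse VALUES (?, ?, ?, ?)", rows)


def top_words(conn, app_id, limit=5):
    """Return (word, frequency) pairs for one app, most frequent first."""
    cur = conn.execute(
        """
        SELECT Word, COUNT(*) AS Frequency
        FROM KeywordUse
        WHERE AppID = ?
        GROUP BY Word
        ORDER BY Frequency DESC, Word ASC
        LIMIT ?
        """,
        (app_id, limit),
    )
    return cur.fetchall()


result = top_words(conn, 1)
```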

REVISED ANSWER:

As Nick Johnson so generously pointed out, aggregate functions are not available in GQL. However, the philosophy of my answer remains unchanged. Let the database engine do its job.

The table should have three columns: AppID, Word, and Frequency, with AppID and Word together as the PK. Each time a user applies a word, the Frequency for that AppID and Word is incremented. Then, when you want to know the top five words for an app, you select where AppID = @Value, order by Frequency descending, and limit the result to 5 rows.
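
As a rough illustration of that counter table, here is a sketch using Python's sqlite3 module. This is a local SQL stand-in, not GQL or the App Engine datastore; the table name `WordFrequency` and the helpers `record_words` and `top_words` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE WordFrequency (
        AppID     INTEGER NOT NULL,
        Word      TEXT    NOT NULL,
        Frequency INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (AppID, Word)
    )
    """
)


def record_words(conn, app_id, words):
    """Increment the counter for each word a user applies to an app."""
    with conn:  # commits on success, rolls back if an exception is raised
        for word in words:
            # Create the row at Frequency 0 if it does not exist yet...
            conn.execute(
                "INSERT OR IGNORE INTO WordFrequency (AppID, Word) VALUES (?, ?)",
                (app_id, word),
            )
            # ...then add one use to it.
            conn.execute(
                "UPDATE WordFrequency SET Frequency = Frequency + 1 "
                "WHERE AppID = ? AND Word = ?",
                (app_id, word),
            )


def top_words(conn, app_id, limit=5):
    """Return the most frequent words for an app, highest Frequency first."""
    cur = conn.execute(
        "SELECT Word FROM WordFrequency WHERE AppID = ? "
        "ORDER BY Frequency DESC, Word ASC LIMIT ?",
        (app_id, limit),
    )
    return [row[0] for row in cur]


record_words(conn, 1, ["fast", "cheap"])
record_words(conn, 1, ["fast", "sturdy"])
top = top_words(conn, 1)
```

Reads stay cheap with this layout because the counts are maintained at write time, so the top-five lookup is a plain ordered select with no aggregation.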

You would need a separate table to track user keywords if that is important.
