Can I expect a performance gain from removing this JOIN?

I have an "items" table with 1 million rows and a "users" table with 20,000 rows. When I select from the "items" table, I join on the "users" table (items.user_id = users.id) so that I can grab the "username" from the users table.

I'm considering adding a username column to the items table and removing the join. Can I expect a decent performance increase from this? It's already quite fast, but it would be nice to decrease my load (which is pretty high).

The downside is that if the user changes their username, items will still reflect their old username, but this is okay with me if I can expect a decent performance increase.

I'm asking stackoverflow because benchmarks aren't telling me too much. Both queries finish very quickly. Regardless, I'm wondering if removing the join would lighten load on the database to any significant degree.

Example query with join:

SELECT Item.id, Item.submitter_id, Item.source_image, Item.cached_image,
       Item.source_title, Item.source_url, Item.width, Item.height,
       Item.status, Item.popular, Item.made_popular, Item.fave_count,
       Item.tags, Item.user_art, Item.nudity, Item.created, Item.modified,
       Item.removed, Item.nofront, Item.test, Item.recs, Item.recs_data,
       User.id, User.username, User.password, User.email, User.fullname,
       User.profileurl, User.homepage, User.bio, User.location, User.avatar,
       User.ff_user, User.ff_key, User.ff_last_faveid, User.twitter_user,
       User.twitter_pass, User.emailalerts, User.showunsafe, User.view,
       User.fb_uid, User.fb_session, User.fb_avatar, User.twitter_uid,
       User.twitter_data, User.twitter_autopost, User.uri, User.created,
       User.modified
FROM items AS Item
LEFT JOIN users AS User ON (Item.submitter_id = User.id)
WHERE Item.nofront != 1
  AND Item.removed != 1
  AND Item.made_popular IS NOT NULL
  AND nudity != 1
ORDER BY Item.made_popular DESC
LIMIT 1040, 290;

Example query without join:

The right answer is to measure it, in the target environment, to see if it makes a difference. Then do a cost/benefit analysis to see if it's worth it.

The cost is the added storage and possibility of data getting out of sync (but see below on how to mitigate this). The benefit is the increased speed or reduced load.
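As a rough idea of what "measure it" could look like, here is a minimal sketch that times both variants side by side. It uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in for your MySQL server, and the two-table schema, the row counts and the helper names (build_db, time_query) are made up for illustration. Numbers from a toy setup like this say nothing about your production load; the point is only the shape of the comparison.

```python
import sqlite3
import time


def build_db(n_users=200, items_per_user=50):
    """Create an in-memory database holding both layouts (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT);
        CREATE TABLE items (
            id INTEGER PRIMARY KEY,
            user_id INTEGER,
            username TEXT          -- assumed denormalized copy of users.username
        );
    """)
    conn.executemany(
        "INSERT INTO users (id, username) VALUES (?, ?)",
        [(u, f"user{u}") for u in range(1, n_users + 1)],
    )
    conn.executemany(
        "INSERT INTO items (user_id, username) VALUES (?, ?)",
        [(u, f"user{u}")
         for u in range(1, n_users + 1)
         for _ in range(items_per_user)],
    )
    conn.commit()
    return conn


WITH_JOIN = """
    SELECT items.id, users.username
    FROM items LEFT JOIN users ON items.user_id = users.id
    ORDER BY items.id
"""
WITHOUT_JOIN = "SELECT id, username FROM items ORDER BY id"


def time_query(conn, sql, repeats=20):
    """Return (best wall-clock seconds, rows) over several runs of one query."""
    best = float("inf")
    rows = []
    for _ in range(repeats):
        start = time.perf_counter()
        rows = conn.execute(sql).fetchall()
        best = min(best, time.perf_counter() - start)
    return best, rows


conn = build_db()
join_time, join_rows = time_query(conn, WITH_JOIN)
flat_time, flat_rows = time_query(conn, WITHOUT_JOIN)
print(f"with join: {join_time:.6f}s  without join: {flat_time:.6f}s")
```

On the real system you would run the real queries against a copy of the real data, ideally under realistic concurrent load, since wall-clock time for a single query is not the same thing as server load.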

Database schemas are not set-and-forget things; they should be tuned periodically as the underlying data changes. That's what DBAs are paid for: continual monitoring and tuning.

In any case, the duplication of a column can be controlled quite easily in a decent DBMS by the use of triggers. By that, I mean place an insert/update trigger on the users table so that, if a user changes their user name, it's also changed in the items table (and possibly vice versa).
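To make the trigger idea concrete, here is a small sketch, again using Python's sqlite3 module as a stand-in. The trigger name sync_item_username and the cut-down schema are my own inventions, and the trigger body is written in SQLite's dialect; MySQL's CREATE TRIGGER syntax is similar but not identical, so treat this as an illustration of the mechanism rather than something to paste into your server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT NOT NULL);
    CREATE TABLE items (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        username TEXT NOT NULL     -- assumed denormalized copy
    );

    -- Hypothetical trigger: push a changed username down to the items table.
    CREATE TRIGGER sync_item_username
    AFTER UPDATE OF username ON users
    FOR EACH ROW
    BEGIN
        UPDATE items SET username = NEW.username WHERE user_id = NEW.id;
    END;
""")

conn.execute("INSERT INTO users (id, username) VALUES (1, 'bob')")
conn.executemany(
    "INSERT INTO items (user_id, username) VALUES (?, ?)",
    [(1, "bob"), (1, "bob")],
)

# The user renames themselves; the trigger updates the copies in items.
conn.execute("UPDATE users SET username = 'bob2' WHERE id = 1")
conn.commit()

item_usernames = [row[0] for row in conn.execute(
    "SELECT username FROM items WHERE user_id = 1 ORDER BY id")]
print(item_usernames)
```

Note the cost side of this: a rename now touches every item row belonging to that user, so it is only a good trade if usernames change rarely.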

Whether MySQL meets my definition of a decent DBMS, I can't comment - I'm a DB2 bod myself. But reversion from third normal form is a tried and tested technique for wringing every last ounce of performance out of databases and, provided you understand the consequences, is quite acceptable. Very few people complain about their databases taking up too much disk space. Many complain about how slow their queries run.

Just keep in mind that reversion is something you do if and when you have a performance problem. It's not something that should be done just because you think it may reduce the load. Unless the load (or time taken) is actually a problem, the benefit part of your cost/benefit analysis is zero, so any sane bean counter will tell you that means "no change".

Based on your added queries, I have a couple of points to make:

First, the nudity column. Please tell me how I can get access to this database :-)
You should only extract the columns you need. If the user name is all you require from the User table, you should not be getting all that extra stuff in the first query. Possibly likewise for the Item stuff - only get what you need.
Make sure you have indexes on all columns used in WHERE clauses - this may also entail combination indexes (those with more than one column). What gets indexed depends on your queries but each column used in the WHERE clauses is a good start for analysis.
For large tables, you can consider "sweeping" removed items periodically into a separate table (e.g., RemovedItems) to minimise the size of Items and speed up the queries. But keep in mind this is only useful if you rarely need to look for removed items, since it will complicate those queries (by forcing them to search in two tables instead of one). Again, this is a cost/benefit thing. One million rows is not really that big a table (at least in my world).
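Two of the points above, selecting only the columns you need and sweeping removed rows into a side table, could be sketched roughly like this. It runs against SQLite through Python's sqlite3 module as a stand-in for MySQL, with a heavily cut-down schema; the removed_items table name and the sample data are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, bio TEXT);
    CREATE TABLE items (
        id INTEGER PRIMARY KEY,
        submitter_id INTEGER,
        source_title TEXT,
        made_popular TEXT,
        removed INTEGER DEFAULT 0
    );
    INSERT INTO users VALUES
        (1, 'alice', 'a long bio'),
        (2, 'bob',   'another long bio');
    INSERT INTO items VALUES
        (1, 1, 'first',  '2010-01-01', 0),
        (2, 2, 'second', '2010-01-03', 0),
        (3, 1, 'third',  '2010-01-02', 1),
        (4, 2, 'fourth', NULL,         0);
""")

# Narrow SELECT: only the columns the page actually displays.
front_page = conn.execute("""
    SELECT i.id, i.source_title, u.username
    FROM items AS i
    LEFT JOIN users AS u ON i.submitter_id = u.id
    WHERE i.removed != 1 AND i.made_popular IS NOT NULL
    ORDER BY i.made_popular DESC
    LIMIT 10
""").fetchall()
print(front_page)

# Periodic sweep: move removed rows into a side table (hypothetical name).
conn.execute("CREATE TABLE removed_items AS SELECT * FROM items WHERE 0")
with conn:  # one transaction, so a row is never in both tables or neither
    conn.execute(
        "INSERT INTO removed_items SELECT * FROM items WHERE removed = 1")
    conn.execute("DELETE FROM items WHERE removed = 1")
```

After the sweep, queries over items no longer need to step past removed rows, at the price of a second table to consult whenever removed items do matter.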

I suggest that you keep it that way to preserve a normalized table. I think it's not a good idea to put usernames in the items table, since it will make the data redundant. Have you tried re-indexing your table?

You will only see a significant performance increase if you were missing an index on either items.user_id or users.id, or if you're using a crappy database. Otherwise, performance won't improve significantly.

JOINs always take up more resources than simple single-table SELECT statements. So yes, removing the JOIN should increase performance.

I have an "items" table with 1 million rows and a "users" table with 20,000 rows.

That is, each user owns about 1M/20k = 50 items on average. So, independently of whether you JOIN or denormalize, you would still transfer roughly 50 times more User information over the wire than strictly necessary. Encoding, transferring and then decoding data adds load.

I'm considering adding a username column to the items table and removing the join.

Why, then, does your original JOIN also bring over all this other (potentially voluminous) information (such as User.profileurl, User.homepage etc.) if all you need is the user name? Remember, for User columns, you are transferring on average 50 copies of each bit of information. Have you considered drastically trimming down the columns you are SELECTing in the JOIN (from both the User and the Item tables)?

I'm asking stackoverflow because benchmarks aren't telling me too much. Both queries finish very quickly. Regardless, I'm wondering if removing the join would lighten load on the database to any significant degree.

In a first phase, removing columns you do not intend to use can reduce load, as less data has to be encoded, transferred (from server to client application) then decoded.

In a second phase, let me start with a question of my own: do you really need all million rows in one shot? If you do not, e.g. if you are user-interface driven and you paginate them (using OFFSET ... LIMIT ...), then you will not necessarily care about the 50x User information duplication (unless the LIMIT gets into the tens of thousands). Otherwise, you may want to measure the advantage of eliminating the 50x duplication. To do that, first SELECT only User.id and User.username into application memory (20k pairs, into a hash-table/map). Then SELECT only Item rows (1M iterations), every time resolving Item.user_id against the hash-table/map at the application level.
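The application-level lookup described above might look something like the following sketch. It uses Python with the built-in sqlite3 module as a stand-in for your MySQL client library; the tiny schema, the sample rows and the function name iter_items_with_username are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE items (id INTEGER PRIMARY KEY, user_id INTEGER, source_title TEXT);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO items VALUES
        (1, 1, 'x'), (2, 2, 'y'), (3, 1, 'z'), (4, 99, 'orphan');
""")

# Step 1: load the small users table once into a dict (id -> username).
username_by_id = dict(conn.execute("SELECT id, username FROM users"))


def iter_items_with_username(conn, username_by_id):
    """Stream item rows and resolve user_id in application memory.

    dict.get() yields None for an unknown user_id, which mirrors what a
    LEFT JOIN would return for an item without a matching user.
    """
    cursor = conn.execute(
        "SELECT id, user_id, source_title FROM items ORDER BY id")
    for item_id, user_id, title in cursor:
        yield item_id, title, username_by_id.get(user_id)


resolved = list(iter_items_with_username(conn, username_by_id))
print(resolved)
```

Each username crosses the wire once instead of roughly 50 times. The map is a snapshot, though, so a long-running process would have to refresh it to pick up new or renamed users.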

Of course, always use EXPLAIN to ensure that the proper indices exist and are being used when an index should be used, and run ANALYZE TABLE after any of your tables grow from under a few hundred rows to thousands or millions.
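As a small illustration of checking index usage, the sketch below compares the query plan before and after adding an index. It runs on SQLite via Python's sqlite3 module, where the statement is EXPLAIN QUERY PLAN and statistics are refreshed with a bare ANALYZE; on MySQL you would use EXPLAIN SELECT ... and ANALYZE TABLE instead, and the output format differs. The index name idx_items_user_id, the helper plan() and the sample data are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO items (user_id, title) VALUES (?, ?)",
    [(n % 100, f"item {n}") for n in range(1000)],
)
conn.commit()


def plan(conn, sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


QUERY = "SELECT title FROM items WHERE user_id = 42"

plan_before = plan(conn, QUERY)      # no index yet: expect a full table scan
conn.execute("CREATE INDEX idx_items_user_id ON items (user_id)")
plan_after = plan(conn, QUERY)       # the plan should now mention the index

conn.execute("ANALYZE")              # refresh the planner's statistics
conn.commit()

print(plan_before)
print(plan_after)
```

The exact wording of the plan lines varies between SQLite versions, so look for the index name in the output rather than matching the text exactly.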