开发者

Reading material on writing optimal SQL database queries

开发者 https://www.devze.com 2023-01-12 03:48 出处:网络
I have no idea whether my database queries are at all clean, efficient, and so on. At best, I can get what I need from the DB but not much else.

I have no idea whether my database queries are at all clean, efficient, and so on. At best, I can get what I need from the DB but not much else.

What books (or websitse) should I read to move forward? I want to know about performance costs of the various conditions and keywords that could potentially go into a q开发者_StackOverflowuery (e.g. JOIN, IN, WHERE, etc.).

I use SQLite on iOS devices.


For specific help with SQLite and/or the iOS, here are some resources that seem to have really good advice.

  • iOS Application Programming Guide - Tuning for Performance and Responsiveness
  • SQLite Optimization FAQ
  • SQLite performance tuning and optimization on embedded systems
  • How to improve SQLite write performance
  • The book The Definitive Guide to SQLite gets high reviews, though it spends a lot of time discussing database design which you may not need.
  • There is a brand-new book, not yet released, that also seems useful: Professional iPhone and iPad Database Application Programming.

Now, more generally, to really do performance tuning for any enterprise database system, here are the areas I think important that you would need to understand:

  • The basic logical data storage architecture of the system you're working with. For example, b-tree, extent, page, the sizes and configurations of these, how much data is read at once, the maximum size of a row (if that is an issue in your DBMS), what is done with out-of-row data (again if that is an issue in your DBMS).

  • Indexes, constraints, and basic ordering of tables and row data: heaps, clustered, nonclustered, unique and non-uniqueness of these indexes, primary keys, unique constraints, included columns. In all these indexes whether nulls are allowed, just one null is allowed, or none. Uniqueifiers. Covering index.

  • SARGability (look up SARG which is short for "Search ARGument").

  • Foreign keys, defaults, cascade deletes/update, their effect on inserts and deletes.

  • Whether NULLs require any storage space and if this is affected by column position. The number of bytes required to store each data type. When trailing spaces are stored or not stored for string data types. Packed vs. non-packed data types (e.g. float and decimal vs. integer). The concept of rows per page (or smallest unit of disk read) in both clustered and nonclustered indexes.

  • Fill factor, fragmentation, statistics, index selectivity, page splits, forwarding pointers.

  • When "batching" an operation can boost performance and why, and how to do it most efficiently.

  • INNER, LEFT, RIGHT, FULL, and CROSS JOINs. Semi-joins (EXISTS) and anti-semi-joins (NOT EXISTS). Any other language-specific syntax such as USING in mySql and CROSS APPLY/OUTER APPLY in SQL Server. The effect of putting a join condition in the ON clause of an outer join vs. putting it in the WHERE clause.

  • Independent subqueries, correlated subqueries, derived tables, common table expressions, understanding that EXISTS and NOT EXISTS generally appear to introduce a correlated subquery, but usually are seen in the execution plan as joins (semi or anti-semi joins).

  • Viewing and understanding execution plans either graphically or in text. Viewing the statistics/profile of CPU, reads, writes, and duration used by whole SQL batches or individual statements. Understanding the limitations of execution plans & profiles, which practically speaking means you generally have to use both to optimize well. Caching and reuse of execution plans, expiration of plans from the cache. Parameter sniffing and parameterization. Dynamic SQL in relation to these.

  • The relative costs of converting data types to other data types or just working with those data types. (For example, a solid rule of thumb is that working with strings is more costly than working with numbers.)

  • The generally exorbitant cost of row-by-row processing as opposed to set-based. The proper use for cursors (rare, though sometimes called for). How functions can hide execution plan costs. The tempting trap of writing functions that get called for every row when the problem could be solved in sets (though this can be tricky to learn how to see, especially because traditional application programming tends to train people to think in terms of functions like this).

  • Seeks, scans, range scans, "skip" scans. Bookmark lookups aka an index seek followed by table seek to the same table using the value found in the index seek. Loop, merge, and hash joins. Eager & lazy spools. Join order. Estimated row count. Actual row count.

  • When a query is too big and should be split into more than one, using temp tables or other means.

  • Multi-processor capabilities and the benefits and gotchas of parallel execution.

  • Tempdb or other temp file usage. Lifetime and scope of temp tables, table variables (if your DB engine has such). Whether statistics are collected for these (in SQL Server temp tables use statistics and table variables do not).

  • Locking, lock granularity, lock types, lock escalation, blocks, deadlocks. Data access pattern (such as UPDATE first, INSERT second, DELETE last). Intent, shared, exclusive locks. Lock hints (e.g. in SQL Server UPDLOCK, HOLDLOCK, READPAST, TABLOCKX).

  • Transactions and transaction isolation. Read committed, read uncommitted, repeatable read, serializable, snapshot, others I can't remember now.

  • Data files, file groups, separate disks, transaction logs, simple recovery, full recovery, oldest open transaction aka minimum log sequence number (LSN), file growth.

  • Sequences, arrays, lists, identity columns, windowing functions, TOP/rownum/limiting number of rows returned.

  • Materialized views aka indexed views. Calculated columns.

  • 1 to 1, 1 to 0 or 1, 1 to many, many to many.

  • UNION, UNION ALL, and other "vertical" joins. SQL Server has EXCEPT and INTERSECT, too.

  • Expansion of IN () lists to OR. Expansion of IsNull(), Coalesce(), or other null-handling mechanisms to CASE statements.

  • The pitfalls of using DISTINCT to "fix" a query instead of dealing with the underlying problem.

  • How linked servers do NOT do joins across the link well, queries to a linked server often become row-by-row, large amounts of data can be pulled across the link to perform a join locally even if this isn't sensible.

  • The pitfall of doing any I/O or error-prone task in a trigger. The scope of triggers (whether they fire for every row or once for each data operation).

  • Making the front-end, GUI, reporting tool, or other client do client-type work (such as formatting dates or numbers as strings) instead of the DB engine.

  • Error handling. Rolling back transactions and how this always rolls back to the first transaction no matter how deeply nested, but a COMMIT only commits one level of work.

The majority of these have some relation to performance. Some are less important to performance but (in my opinion) are important if you want to be a good SQL developer, because it's not enough to just have fast queries, they also need to be right, and play nice with others, and deal with faults properly. If you continue writing SQL professionally and want to improve, then you will eventually need to know most of this stuff. I would start with understanding table organization, indexes, seeks & scans, and hash/merge/loop joins. I don't know much about SQLite but these things are global to any DBMS.

One thing that could help you a lot is a recognition that queries involve searching for data just like you'd look up names, addresses, phone numbers, and other things in a big phone book or series of phone books having various indexes in the back. Perhaps there is a reverse phone number index or an index organized by first name. Thinking out the least-work/shortest-time paths to get the information from these physical objects will help you understand what the task of the query engine is in choosing an execution plan. That understanding will help you say things like "wait a minute, why is it doing a scan when it should be doing a seek? That table is very large and has an index on X!"

Example scenario: You have a phone book organized as usual by last name. You also have an index in the back with just first and last name, sorted by first name.

Task 1: you need to write down all the phone numbers of every person with a first name of Torstein. Best plan: look in the first-name index to find the last names of the 5 people having that first name, then look these last names up in the main phone book. You just did a nonclustered index seek with a bookmark lookup against the clustered index.

Task 2: you need to write down all the phone numbers of every person whose first names start with A. Quickly realizing there's probably no point in even turning to the first-name index, you turn to the main phone book and just read every page starting from page 1. You just did a table scan.

Inherent in choosing a good "execution plan" for these two tasks (which is what we just did) is some domain knowledge you possess: You know that Torstein is a very rare name, and that there are probably tens of thousands of people whose first name starts with A. This domain knowledge is the equivalent of what statistics provide for the query engine. Without statistics a bad execution plan can be chosen.

While these general tips may not all apply in the case of SQLite on the iPhone, getting these concepts really solid in your mind will be of huge help in working with your SQLite database. There are certain principles that are bound to be global, no matter what system you're working with (for example, understanding indexes will never be a wasted effort, as any system that doesn't have indexes will probably not be worth using).

I hope this helps. Feel free to ask for clarification on any point. If you have trouble finding a resource ask and I'll see what I can point you toward.

0

精彩评论

暂无评论...
验证码 换一张
取 消