Multi-tiered / Hierarchical SQL: How does Reddit do it? Which is the most efficient way? And what databases make it simpler?

Developer (开发者) https://www.devze.com 2023-02-03 16:25 Source: web

I've been reading up a bit on how multi-tiered commenting systems are built:

http://articles.sitepoint.com/article/hierarchical-data-database/2

I understand the two methods talked about in that article. In fact I went down the recursive path myself, and I can see how the "Modified Preorder Tree Traversal" method is very useful as well, but I have a few questions:

How well do these two methods perform in a large environment like Reddit's, where you can have thousands and thousands of multi-tiered comments?

Which method does Reddit use? If they use the MPTT method, having to update thousands of rows seems very costly to me. I'm not deluding myself into thinking I am building a system to handle Reddit's traffic; this is simply curiosity.

There's another way of retrieving comments like this ... JOINs via SQL that return the rows with IDs defining their parents. How much slower/faster/better/worse would it be to simply take these unformatted results, loop through them and add them into a formatted array using my language of choice (PHP)?

After reading that sitepoint article, I believe I understand that Oracle offers this functionality in a much simpler, easier to use way, and MySQL does not. Are there any free databases that offer something similar to Oracle?

On a side note, how is SQL pronounced? I'm getting the feeling I've been wrong for the past several years by saying 'sequel' instead of 's - q - l', although "My Sequel" rolls easier off the tongue than "My S Q L"!

MPTT is easier to fetch (a single SQL query), but more expensive to update. Simply delegate the update to a background process (that's what queue managers are for). Also note that most of that update is a single SQL UPDATE command. It might take a long time to process, but a smart RDBMS could make the transaction visible (in cache) to new (read-only) queries before it's committed to disk.
I'd bet Reddit uses MPTT, not only doing the 'hard' update in the background but also, quite likely, rendering the new comment straight into an in-memory cache. This way, the posting user can see the post immediately, without having to wait for so many rows to be updated. Also, SSDs do help in getting high transaction rates.
That's called the adjacency model (or sometimes adjacency list). It's the more obvious way to do it, and simpler to update (it doesn't modify existing records), but FAR less efficient to read. Read naively, you have to do a recursive walk of the tree, with an SQL query at each node. That's what kills you: the number of small queries.
PostgreSQL has recursive SELECTs (WITH RECURSIVE), which do in the server what you envision in PHP. It's better than PHP because it's closer to the data, but it still has the same (huge) number of random-access disk seeks.
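
To make the read-side options for the adjacency list concrete, here is a minimal sketch in Python, with SQLite standing in for the real database (PHP against PostgreSQL would have the same shape). The `comments` table, its columns, and the helper names `fetch_nested` and `fetch_recursive` are assumptions made up for illustration, not anything Reddit is known to use. The first helper pulls every row in one SELECT and nests the rows in application code; the second uses a recursive SELECT so the server walks the tree.

```python
import sqlite3
from collections import defaultdict

# Hypothetical adjacency-list schema: each row only knows its parent.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (id INTEGER PRIMARY KEY, parent_id INTEGER, body TEXT)"
)
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?)",
    [
        (1, None, "top-level A"),
        (2, 1, "reply to A"),
        (3, 2, "reply to the reply"),
        (4, None, "top-level B"),
        (5, 1, "second reply to A"),
    ],
)


def fetch_nested(conn):
    """One SELECT, then nest the flat rows in application code."""
    children = defaultdict(list)
    for cid, parent_id, body in conn.execute(
        "SELECT id, parent_id, body FROM comments ORDER BY id"
    ):
        children[parent_id].append((cid, body))

    def build(parent_id):
        # Recursion happens in memory, not as one query per node.
        return [
            {"id": cid, "body": body, "replies": build(cid)}
            for cid, body in children.get(parent_id, [])
        ]

    return build(None)


def fetch_recursive(conn):
    """Let the database walk the tree with a recursive CTE."""
    sql = """
        WITH RECURSIVE thread(id, body, depth, path) AS (
            SELECT id, body, 0, printf('%08d', id)
            FROM comments
            WHERE parent_id IS NULL
            UNION ALL
            SELECT c.id, c.body, t.depth + 1,
                   t.path || '/' || printf('%08d', c.id)
            FROM comments AS c
            JOIN thread AS t ON c.parent_id = t.id
        )
        SELECT id, depth, body FROM thread ORDER BY path
    """
    # The zero-padded path sorts rows into depth-first display order.
    return conn.execute(sql).fetchall()
```

Both helpers avoid issuing one query per node. Whether either is fast enough for a very large thread depends on indexing and caching, which this sketch does not measure.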

You should take a closer look at the links under "Further reading" at the end of that article. The "Four ways to work with hierarchical data" article on evolt linked there provides another way to approach this problem (the flat table). Since that approach is extremely easy to implement for a threaded discussion board, I wouldn't be surprised if Reddit uses it (or a variation on the theme).
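
A minimal sketch of one reading of the flat table idea, in Python with SQLite: it assumes each comment stores its position in the rendered thread plus its indentation depth, so that display is a single ordered SELECT. The column names `pos` and `depth` and the helpers `render` and `add_reply` are hypothetical, and placing a new reply directly under its parent (newest first) is a simplification chosen here, not something taken from the article.

```python
import sqlite3

# Hypothetical flat-table schema: display order and depth are stored per row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT, pos INTEGER, depth INTEGER)"
)
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?, ?)",
    [(1, "A", 1, 0), (2, "reply to A", 2, 1), (3, "B", 3, 0)],
)


def render(conn):
    """Reading is one ordered SELECT; depth only drives the indentation."""
    rows = conn.execute("SELECT body, depth FROM comments ORDER BY pos")
    return ["  " * depth + body for body, depth in rows]


def add_reply(conn, parent_id, body):
    """Insert a reply directly below its parent (newest reply first)."""
    pos, depth = conn.execute(
        "SELECT pos, depth FROM comments WHERE id = ?", (parent_id,)
    ).fetchone()
    with conn:  # one transaction for the shift and the insert
        # Every row displayed after the parent moves down by one slot.
        conn.execute("UPDATE comments SET pos = pos + 1 WHERE pos > ?", (pos,))
        conn.execute(
            "INSERT INTO comments (body, pos, depth) VALUES (?, ?, ?)",
            (body, pos + 1, depth + 1),
        )
```

Under these assumptions a write still shifts every later row, but reads need no joins or recursion at all.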

I do like MPTT (aka nested set) though, and have used it for hierarchies that are (almost) static.
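
For comparison, here is a small nested set (MPTT) sketch in Python with SQLite, showing why reads are cheap and writes are not. The `comments` table, the `lft`/`rgt` column names, and the helpers `fetch_thread` and `add_reply` are assumptions for illustration only. `add_reply` returns how many row updates the insert caused, which is the cost the question worries about.

```python
import sqlite3

# Hypothetical nested-set schema: every node spans the interval [lft, rgt].
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT, lft INTEGER, rgt INTEGER)"
)
# root(1,8) contains a(2,5) and b(6,7); a contains a1(3,4).
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?, ?)",
    [(1, "root", 1, 8), (2, "a", 2, 5), (3, "a1", 3, 4), (4, "b", 6, 7)],
)


def fetch_thread(conn):
    """Single query: rows in display order, depth = number of ancestors."""
    sql = """
        SELECT node.id, node.body, COUNT(parent.id) - 1 AS depth
        FROM comments AS node
        JOIN comments AS parent
          ON node.lft BETWEEN parent.lft AND parent.rgt
        GROUP BY node.id, node.body, node.lft
        ORDER BY node.lft
    """
    return conn.execute(sql).fetchall()


def add_reply(conn, parent_id, body):
    """Insert as the last child of parent_id; return the number of row updates."""
    (parent_rgt,) = conn.execute(
        "SELECT rgt FROM comments WHERE id = ?", (parent_id,)
    ).fetchone()
    with conn:  # one transaction for the renumbering and the insert
        # Open a gap of width 2 at the parent's right edge.
        touched = conn.execute(
            "UPDATE comments SET rgt = rgt + 2 WHERE rgt >= ?", (parent_rgt,)
        ).rowcount
        touched += conn.execute(
            "UPDATE comments SET lft = lft + 2 WHERE lft > ?", (parent_rgt,)
        ).rowcount
        conn.execute(
            "INSERT INTO comments (body, lft, rgt) VALUES (?, ?, ?)",
            (body, parent_rgt, parent_rgt + 1),
        )
    return touched
```

In this four-row tree, replying to the deepest comment already renumbers every existing row, which is why the approach suits hierarchies that rarely change.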