Hierarchical data from DB - to join or not to join

I am trying to find a clean and performant solution for this problem, but am stuck somehow.

Trivia:

-ASP.Net C# application (.Net 3.5)

-MS-SQL Server 2005

This is what the data looks like:

Category -> Template -> Instance

A Category can contain multiple Templates.

A Template can contain multiple Instances.

There is a class for each of these 3 and a corresponding database table with a lot of columns.

I want to load a complete category from the database into a C# Category class object, including all the related template and instance objects.

I have two options now:

1) Do a join on all 3 tables and read all the data at once.

Upside: A lot faster on the database side, all information in one query.

Downside: I transmit a lot of redundant data, because in each row there is the same category and template data for each instance.

Example (simplified):

CategoryID | CategoryName | TemplateID | TemplateName | InstanceID | InstanceName  
1 | FirstCategory | 1 | FirstTemplate | 1 | FirstInstance   
1 | FirstCategory | 1 | FirstTemplate | 2 | SecondInstance  
1 | FirstCategory | 1 | FirstTemplate | 3 | ThirdInstance  
1 | FirstCategory | 2 | SecondTemplate | 4 | FourthInstance

2) I query each table on its own, first collecting the category data, then the related template data with the category ID and so on.

Upside: Intuitive approach, easier to handle on the code side, no redundant data is fetched.

Downside: Multiple queries to the server, probably slower.

What is the best way to go here? Am I missing an option?

Solution 1 seems to have better performance, but it looks "unclean" to me. I would have to pick the data for a category out of a whole bunch of data rows.

If I choose solution 1, which is the best way to fetch the category and template data?

Read it from the first data row and create a new instance once the value changes?

Do some sort of grouping?

Thanks in advance! This problem has been giving me headaches for days.

I'm using Entity Framework on a project at the moment. When I profiled it under certain scenarios, it did indeed use option 1 and brought back the table with redundant data. So it seems that Microsoft opted for this approach, and since they own the entire stack they presumably know how to make a good decision about this exact problem.

There may be some heuristic that decides to use option 2 under certain scenarios, but I haven't seen it in my profiling. Furthermore I haven't seen EF ever return multiple result sets in a single query.

Assumptions: you're using ADO / Stored procs and you have a normalised data structure.

You could return 3 result sets from one stored procedure call.

1) select c.* from category c where c.id = @categoryId

2) select t.* from templates t
    join category c on t.categoryid = c.id 
    where c.id = @categoryId

3) select i.* from Instance i
    join templates t on i.templateid = t.id  
    join category c on t.categoryid = c.id
    where c.id = @categoryId

Then populate your objects sequentially via a SqlDataReader, using SqlDataReader.Read() to step through the rows and SqlDataReader.NextResult() to move to the next result set.
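A minimal sketch of that read loop, under some assumptions: the three result sets arrive in the order category, templates, instances, and the column names `Id`, `Name` and `TemplateId` are hypothetical stand-ins for the real ones. The loader is written against `IDataReader` (which `SqlDataReader` implements) so it can be exercised without a database; the classes are stripped down to the fields needed here.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public class Instance { public int Id; public int TemplateId; public string Name; }

public class Template
{
    public int Id; public string Name;
    public List<Instance> Instances = new List<Instance>();
}

public class Category
{
    public int Id; public string Name;
    public List<Template> Templates = new List<Template>();
}

public static class CategoryLoader
{
    // Assumed result set order: 1) category, 2) templates, 3) instances.
    // Column names ("Id", "Name", "TemplateId") are illustrative only.
    public static Category Load(IDataReader reader)
    {
        // Result set 1: the single category row.
        if (!reader.Read()) return null;
        var category = new Category
        {
            Id = Convert.ToInt32(reader["Id"]),
            Name = Convert.ToString(reader["Name"])
        };

        // Result set 2: templates, indexed by id so instances can find their parent.
        var templatesById = new Dictionary<int, Template>();
        if (reader.NextResult())
        {
            while (reader.Read())
            {
                var template = new Template
                {
                    Id = Convert.ToInt32(reader["Id"]),
                    Name = Convert.ToString(reader["Name"])
                };
                templatesById[template.Id] = template;
                category.Templates.Add(template);
            }
        }

        // Result set 3: instances, attached to the template they reference.
        if (reader.NextResult())
        {
            while (reader.Read())
            {
                var instance = new Instance
                {
                    Id = Convert.ToInt32(reader["Id"]),
                    TemplateId = Convert.ToInt32(reader["TemplateId"]),
                    Name = Convert.ToString(reader["Name"])
                };
                Template parent;
                if (templatesById.TryGetValue(instance.TemplateId, out parent))
                    parent.Instances.Add(instance);
            }
        }
        return category;
    }
}
```

With a real connection you would pass in the reader returned by `SqlCommand.ExecuteReader()`. Each row is transmitted exactly once, and it is still a single round trip to the server.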

Whether you use set-based data retrieval or something like the above really depends on the size of the data you are returning and how often you will be requesting it.

There is a third option: do a 'select *' on each table and then do the join in memory. You could use LINQ for some lazy evaluation:

  class Category
  {
    public int CategoryId { get; set; }
    public List<Template> Templates
    {
      get
      {
        return Repository.Templates.Where(t => t.CategoryId == this.CategoryId).ToList();
      }
    }
  }

Edit: you can use the same logic for the Template/Instance relationship:

  class Template
  {
    public int CategoryId { get; set; }
    public int TemplateId { get; set; }
    public List<Instance> Instances
    {
      get
      {
        return Repository.Instances.Where(i => i.TemplateId == this.TemplateId).ToList();
      }
    }
  }
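Note that the properties above re-filter the whole list on every access. If that becomes a concern, one possible variation is to group the rows once with `ToLookup` and attach the children up front. This is only a sketch: `InMemoryJoin.Link` is a hypothetical helper, the classes are reduced to their keys, and the three lists are passed in as plain parameters instead of coming from the `Repository` used above.

```csharp
using System.Collections.Generic;
using System.Linq;

public class Instance { public int InstanceId; public int TemplateId; }

public class Template
{
    public int TemplateId; public int CategoryId;
    public List<Instance> Instances = new List<Instance>();
}

public class Category
{
    public int CategoryId;
    public List<Template> Templates = new List<Template>();
}

public static class InMemoryJoin
{
    // Hypothetical helper: wires up the hierarchy from three flat lists
    // (one 'select *' per table), grouping each child list only once.
    public static void Link(
        IEnumerable<Category> categories,
        IEnumerable<Template> templates,
        IEnumerable<Instance> instances)
    {
        var templateList = templates.ToList();

        // Group instances by their parent template id.
        var instancesByTemplate = instances.ToLookup(i => i.TemplateId);
        foreach (var template in templateList)
        {
            // A lookup returns an empty sequence for keys it does not contain.
            template.Instances = instancesByTemplate[template.TemplateId].ToList();
        }

        // Group templates by their parent category id.
        var templatesByCategory = templateList.ToLookup(t => t.CategoryId);
        foreach (var category in categories)
        {
            category.Templates = templatesByCategory[category.CategoryId].ToList();
        }
    }
}
```

The trade-off is that the hierarchy is built eagerly, so it will not reflect later changes to the underlying lists the way the property-based version does.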

For a small amount of data, option 1 sounds good. However, you should change the class structure and use composition for Template and Instance: the Template class should have a collection of Instance objects and the Category class a collection of Template objects. That way you can get rid of the redundant data.

Option 2 is preferable if the data is huge and your network bandwidth is good enough to make frequent DB calls.

If the hierarchies aren't too deep and the number of children on each level is reasonably small, I usually start with option #2, the intuitive approach as you described it. It lets us get away with using whatever methods we already have (getTemplates(), getInstances(234) etc).

But from a performance perspective, executing one query with a 3-table join and processing the records in sorted order will likely be the faster alternative (Option #1).
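Processing the joined rows in sorted order could look roughly like the sketch below, which starts a new parent object whenever the key changes. It assumes the rows arrive ordered by CategoryID and then TemplateID (for example via ORDER BY), and it uses the column names from the simplified example in the question. `JoinedRow` and `HierarchyBuilder` are hypothetical names for illustration.

```csharp
using System.Collections.Generic;

// Hypothetical row shape: one object per row of the 3-table join.
public class JoinedRow
{
    public int CategoryId; public string CategoryName;
    public int TemplateId; public string TemplateName;
    public int InstanceId; public string InstanceName;
}

public class Instance { public int Id; public string Name; }

public class Template
{
    public int Id; public string Name;
    public List<Instance> Instances = new List<Instance>();
}

public class Category
{
    public int Id; public string Name;
    public List<Template> Templates = new List<Template>();
}

public static class HierarchyBuilder
{
    // Assumes rows are sorted by CategoryId, then TemplateId,
    // so a change in key always means a new parent begins.
    public static List<Category> Build(IEnumerable<JoinedRow> rows)
    {
        var categories = new List<Category>();
        Category category = null;
        Template template = null;

        foreach (var row in rows)
        {
            if (category == null || category.Id != row.CategoryId)
            {
                category = new Category { Id = row.CategoryId, Name = row.CategoryName };
                categories.Add(category);
                template = null; // force a new template under the new category
            }
            if (template == null || template.Id != row.TemplateId)
            {
                template = new Template { Id = row.TemplateId, Name = row.TemplateName };
                category.Templates.Add(template);
            }
            template.Instances.Add(new Instance { Id = row.InstanceId, Name = row.InstanceName });
        }
        return categories;
    }
}
```

The redundant category and template columns are read only when a key changes and ignored otherwise. Keep in mind that an inner join returns no rows for a category without templates or a template without instances; with a left join the sketch would additionally need to handle NULL child columns.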