Should I use Drools in this situation?

I'll use a university's library system to explain my use case. Students register in the library system and provide their profile: gender, age, department, previously completed courses, currently registered courses, books already borrowed, etc. Each book in the library system defines some borrowing rules based on the student's profile. For example, a textbook on computer algorithms can only be borrowed by students currently registered in that class; another textbook may only be borrowed by students in the math department; there could also be rules such that a student can borrow at most 2 computer networking books. As a result of the borrowing rules, when a student searches or browses the library system, he will only see the books that he can borrow. So the requirement really comes down to efficiently generating the list of books that a student is eligible to borrow.

Here is how I envision the design using Drools: each book will have a rule with a few field constraints on the student profile as the LHS, and the RHS of the book rule simply adds the book id to a global result list. All the book rules are then loaded into a RuleBase. When a student searches or browses the library system, a stateless session is created from the RuleBase and the student's profile is asserted as the fact. Every book that the student can borrow will then fire its book rule, and the global result list ends up holding the complete list of books that the student can borrow.

A few assumptions: the library will handle millions of books; I don't expect the book rules to be too complicated, at most 3 simple field constraints per rule on average; the number of students that the system needs to handle is in the range of 100K, so the load is fairly heavy. My questions are: how much memory will Drools take if loaded with a million book rules? How fast will it be for all those million rules to fire? If Drools is the right fit, I'd like to hear some best practices for designing such a system from experienced users. Thanks.

First, don't make rules for every book. Make rules for the restrictions instead; there are a lot fewer restrictions defined than books. This will make a huge impact on the running time and memory usage.

Running a ton of books through the rule engine is going to be expensive, especially since you won't show all the results to the user, only 10-50 per page. One idea that comes to mind is to use the rule engine to build a set of query criteria. (I wouldn't actually do this; see below.)

Here's what I have in mind:

// "criteria" is assumed to be declared as a global in the rule file
rule "Only two books for networking"
when
  Student($checkedOutBooks : checkedOutBooks)
  Book(subjects contains "networking", $book1 : id) from $checkedOutBooks
  Book(subjects contains "networking", id != $book1) from $checkedOutBooks
then
  criteria.add("subject is not 'networking'", PRIORITY.LOW);
end

rule "Books allowed for course"
when
  $course : Course($textbooks : textbooks)
  Student(enrolledCourses contains $course)
  Book($book : id) from $textbooks
then
  criteria.add("book_id = " + $book, PRIORITY.HIGH);
end

But I wouldn't actually do that!

This is how I would change the problem instead. Not showing the books to the user is a poor experience: a user may want to peruse the books to see which ones to get next time. Show the books, but disallow the checkout of restricted books. This way, you only have 1-50 books per user to run through the rules at a time, which will be pretty zippy. The above rules would become:

rule "Allowed for course"
   activation-group "Only one rule is fired"
   salience 10000
when
  // This book is about to be displayed on the page, hence inserted into working memory
  $book : Book()
  $course : Course(textbooks contains $book)
  Student(enrolledCourses contains $course)
then
  // Do nothing, allow the book
end

// "disallowedForCheckout" is assumed to be declared as a global map in the rule file
rule "Only two books for networking"
   activation-group "Only one rule is fired"
   salience 100
when
  Student($checkedOutBooks : checkedOutBooks)
  Book(subjects contains "networking", $book1 : id) from $checkedOutBooks
  Book(subjects contains "networking", id != $book1) from $checkedOutBooks
  // This book is about to be displayed on the page, hence inserted into working memory.
  $book : Book(subjects contains "networking")
then
  disallowedForCheckout.put($book, "Cannot have more than two networking books");
end

Here I am using activation-group to make sure only one rule is fired, and salience to make sure the rules are fired in the order I want.

Finally, keep the rules cached. Drools allows, and recommends, that you load the rules only once into a knowledge base and then create sessions from it. Knowledge bases are expensive to build; sessions are cheap.

My experience with Drools (or a rules engine in general) is that it is a good fit if user visibility into the rules is important, if fast changes to the rules without making each one a coding project are important, or if the set of rules is so large that it is hard to manage, reason about and analyze in code (so that business people end up asking technical people to go read the code and tell them what happens in situation X).

That being said, rules engines can be a bottleneck. They don't run at anything close to the performance of code, so you need to manage that up front architecturally. In this specific case there is certainly a database behind the system, which adds to the performance argument: the database will answer a query a whole lot faster than you can analyze the whole set in code.

I would absolutely not implement this by making a million rule objects. Rather, I would define book types that multiple books can be assigned to, run the rules against the book types, and then only show books that belong to an allowed type. This way you could load the types, pass them through the rules engine, and then push the allowed types into a database query that pulls the list of books in those types.

Types get a bit complicated by the fact that, in practice, a book is likely to belong to two types (allowed if you are taking a certain course, or allowed in general if you are part of the department), but the approach should still hold.
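To make the book-type idea a bit more concrete, here is a rough sketch in plain Java. It is only an illustration under my own assumptions: the `Student` and `BookType` classes, the `book_types(book_id, type_id)` join table and the query text are all made up for the example, and the restriction check is a simple `Predicate` standing in for whatever the rules engine would evaluate. A book that belongs to several types is listed if any one of them is allowed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Sketch: evaluate restrictions per book type, then let the database fetch the books. */
final class BookTypeFilter {

    /** Hypothetical student profile; only the fields used below. */
    static final class Student {
        final String department;
        final Set<String> enrolledCourses;

        Student(String department, String... enrolledCourses) {
            this.department = department;
            this.enrolledCourses = new HashSet<>(Arrays.asList(enrolledCourses));
        }
    }

    /** A book type carries one restriction; many books share one type. */
    static final class BookType {
        final int id;
        final Predicate<Student> restriction;

        BookType(int id, Predicate<Student> restriction) {
            this.id = id;
            this.restriction = restriction;
        }
    }

    /** Runs the (few) type restrictions against one student. */
    static List<Integer> allowedTypeIds(Student student, List<BookType> types) {
        List<Integer> ids = new ArrayList<>();
        for (BookType type : types) {
            if (type.restriction.test(student)) {
                ids.add(type.id);
            }
        }
        return ids;
    }

    /**
     * Builds a parameterized query over the hypothetical book_types join table.
     * The caller binds the allowed type ids to the placeholders.
     */
    static String buildQuery(int allowedCount) {
        if (allowedCount == 0) {
            return "SELECT book_id FROM book_types WHERE 1 = 0"; // nothing is allowed
        }
        String placeholders = String.join(", ", Collections.nCopies(allowedCount, "?"));
        return "SELECT DISTINCT book_id FROM book_types WHERE type_id IN (" + placeholders + ")";
    }
}
```

The point of the split is that the per-student work is proportional to the number of types, not the number of books; the million-row part is left to the database, which can also do the paging.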

My questions are: how much memory will Drools take if loaded with a million book rules? How fast will it be for all those million rules to fire?

How fast is your computer and how much memory have you got? In one sense you can only find out by building a proof of concept and filling it with the right quantity of (randomly-generated) test data. My experience is that Drools is faster than you expect, and that you have to have very good knowledge of what's under the hood to be able to predict what is going to make it slow.

Note that you are talking about a million rule session facts (i.e. Book objects), not a million rules. There are only a handful of rules, which won't take long to fire. The potentially slow part is inserting the million objects, because Drools needs to decide which rules to put on the Agenda for each new fact.

It's a shame that none of us has an answer for some particular set-up with a million facts.

As for the implementation, my approach would be to insert a Book object for each book that the student wants to check out, retract the ones that are not allowed, and then use one query to get the remaining (allowed) Book objects and another query to get the list of reasons. Alternatively, use RequestedBook objects that have additional boolean allowed and String reasonDisallowed properties that you can set in your rules.
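For the second variant, the fact class could look roughly like the sketch below. This is plain Java with no Drools dependency; the `bookId` and `subject` fields and the two static helpers are assumptions added for illustration. In a real setup the rule consequences would call `disallow(...)`, and the `applyNetworkingLimit` method here is only a stand-in showing what such a rule would do.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a RequestedBook fact with "allowed" and "reasonDisallowed" properties. */
final class RequestedBook {
    final String bookId;
    final String subject;
    private boolean allowed = true;        // allowed until some rule says otherwise
    private String reasonDisallowed = null;

    RequestedBook(String bookId, String subject) {
        this.bookId = bookId;
        this.subject = subject;
    }

    boolean isAllowed() { return allowed; }

    String getReasonDisallowed() { return reasonDisallowed; }

    /** Keeps the first reason only, so later rules do not overwrite it. */
    void disallow(String reason) {
        if (allowed) {
            allowed = false;
            reasonDisallowed = reason;
        }
    }

    /** Plain-Java stand-in for the "two networking books" rule. */
    static void applyNetworkingLimit(List<RequestedBook> requested, int networkingAlreadyOut) {
        if (networkingAlreadyOut < 2) {
            return;
        }
        for (RequestedBook book : requested) {
            if ("networking".equals(book.subject)) {
                book.disallow("Cannot have more than two networking books");
            }
        }
    }

    /** Equivalent of the query for the remaining (allowed) books. */
    static List<String> allowedIds(List<RequestedBook> requested) {
        List<String> ids = new ArrayList<>();
        for (RequestedBook book : requested) {
            if (book.isAllowed()) {
                ids.add(book.bookId);
            }
        }
        return ids;
    }
}
```

Compared with retracting facts, this keeps every requested book around together with its reason, which is handy if the page should show why a checkout is blocked.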

Any time we are looking at large data sets (which is what this question is about: whether or not Drools is a good fit in a large data set case), it pays to think outside the box (see below). Any time we are talking about "millions of objects" or similar log-N type problems, I don't think the tool in question is necessarily the problem. So yes, Drools (or JBoss Rules) can be used, BUT it would only make sense in a certain context...

When you have log-N of anything (cross-referencing large data-sets against inputs), I would recommend using more novel approaches like database-backed Bloom Filters. These can be implemented as Java objects and referenced by Drools for the fact lookup (custom coding there however).

Since Bloom Filters are tiny memory structures with only basic insert()/contains() functions, they do have a drawback: a false-positive rate of about 1%. So the filter serves as a primary cache. If the Drools question is constructed so that the answer is generally "NO", a Bloom Filter-backed fact-table lookup will be lightning fast with a tiny memory footprint (about 1.1 bytes per record in my implementation, so roughly 1 MB of RAM for a million records). Then, in the "contains" case (which might be a false positive), use the database-backed fact table to confirm. Again, if the lookup is false 80% of the time, the Bloom Filter will be a huge saving in memory and time. Otherwise, pure 1M-record lookups (against anything: Drools facts, a database, etc.) will be very expensive every time, in both memory and speed.
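As a minimal sketch of that two-level lookup (not the implementation mentioned above, which isn't shown here), the following plain-Java Bloom filter uses a `BitSet`, an FNV-1a hash and double hashing; all of those are my own choices for the example. It is sized with the textbook formulas m = -n ln p / (ln 2)^2 and k = (m/n) ln 2, which for a 1% target come to roughly 9.6 bits (about 1.2 bytes) per record, in the same ballpark as the figure quoted above. The `exactLookup` predicate stands in for the database-backed fact table.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.function.Predicate;

/** Minimal Bloom filter sketch used as a first-level cache in front of an exact lookup. */
final class BloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    /** Sizes the filter for an expected item count and a target false-positive rate. */
    BloomFilter(int expectedItems, double falsePositiveRate) {
        double ln2 = Math.log(2);
        this.numBits = (int) Math.ceil(-expectedItems * Math.log(falsePositiveRate) / (ln2 * ln2));
        this.numHashes = Math.max(1, (int) Math.round((double) numBits / expectedItems * ln2));
        this.bits = new BitSet(numBits);
    }

    int sizeInBits() { return numBits; }

    void insert(String key) {
        long h = hash64(key);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(h, i));
        }
    }

    /** false means "definitely absent"; true means "possibly present". */
    boolean contains(String key) {
        long h = hash64(key);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(h, i))) {
                return false;
            }
        }
        return true;
    }

    /** Consults the exact (e.g. database-backed) lookup only when the filter says "maybe". */
    boolean lookup(String key, Predicate<String> exactLookup) {
        return contains(key) && exactLookup.test(key);
    }

    /** Double hashing: the i-th bit index is derived from the two halves of one 64-bit hash. */
    private int index(long h, int i) {
        int h1 = (int) h;
        int h2 = (int) (h >>> 32) | 1; // force an odd step
        return Math.floorMod(h1 + i * h2, numBits);
    }

    /** FNV-1a 64-bit hash over the UTF-8 bytes of the key. */
    private static long hash64(String key) {
        long h = 0xcbf29ce484222325L;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }
}
```

A filter like this never gives a false negative, so a "no" can be trusted without touching the database; only the "maybe" answers pay for the exact lookup.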

I would be worried about needing the number of rules to be a function of the number of students; that could really make things tricky (and it sounds like the biggest problem).