Need help deciding on a database scheme for a reporting project (PHP)_问答_开发者

Update: I'm not looking for someone to come up with a schema for me. I guess I'm looking for input on whether some normalized DB design would work, when I can't clearly define all the relationships of the entities I'm working with, or if something more along the lines of a data warehouse would be something I wanted to look at further (realizing here I know enough about data warehouses to be dangerous - and that's about it.)

I've been tasked at work with 'streamlining' a reporting process for a small call center. Most of my background is in web application, and I'd consider myself an intermediate PHPer (self taught, no college - I'll take a moment of silence for the collective gasp to subside). So this was a bit if a different project from my norm - though there still needs to be a web-based interface, so it's at least a little like home.

The reporting process as it stands involves getting printed reports from an ACD system that needs to be manually entered into Crystal Reports. Additionally, Crystal is used to run reports from a ticketing system to find things like resolution rate for calls taken, etc. The task I've been given is to allow the uploading of electronic CSV files which should be parsed, then loaded into a database. Once loaded into said database, reports should be able to be generated and emailed just by clicking a link on a website (basically).

I usually start out projects by looking at the data I have, and building a database to model that data. I'm no DB rockstar, so the databases are usually pretty simple, but I do try to normalize to at lease 3NF. With this project though, I quickly saw that it would be difficult to reliably determine a relationship - since scripting that I have no control over determines how a lot of this data is related, and the reports I get are not conducive to sniffing out that relationship. So I started looking on the web. I've worn google out. I've read a bunch of questions and answers on SO; many of which have more acronyms in them than I care to look up.

So I come to you, SO, for help. Assuming I've given enough information, can anyone tell me if what I'm looking at would be best served by me running off and learning some more about data warehousing (also, if so, where, who, what should I be reading/doing), or would a pretty denormalized SQL database probably work?

Keep in mind that at most there will be about 300 - 400 rows of data entered a day - and most of that data are simple INTs. This is a very small database.

The business just wants to reduce the amount of manpower used to create the reports. They're not seeking to change the reports.

I hope I've given enough information, if not, I'll do my best to be more specific, based on comments/questions I receive back.

I started down the road of doing a 3NF schema, and ended up with several tables. One for Agents (id, name, email, extension), Agent Groups (id, group name) and Applications (id, application name).

It broke d开发者_高级运维own a bit when I realized that when an Application receives a call, it can go to any number of groups based on any number of criteria, and I have no way of getting that information (I'm not sure anyone does). So I started to think that there wasn't a need to relate these 3 things together at all.

With that in mind, there were going to be 4 more tables, AgentProfiles, AgentEvaluations, AgentGroupSummary, and ApplicationSummary. Each of these tables would have a column that corresponds to the data in the report I'm getting. Additionally, there would be an FK that points back to the Application, Agent or Agent Group associated with this 'line' of data.

I think I started to panic at this point, and I guess I'm looking for some input from people outside the quagmire of this project on how to proceed. I don't want to denormalize to the point that future maintenance becomes a nightmare - I seem to be stuck; afraid I'll over/under design and screw myself in the long run.

Edit:

I resisted the urge to go too much in-depth about the data I was working with for fear of creating a giant wall of text. I'll explain the data that I'll be getting, but the CSV files are so malformed I can't really provide an example of the reports I'll be getting. I can (and do, a little further down) provide an example of the data that would be going into the DB.

Basically, the reports I get are measurements of a call analyst's stats. How many calls they take in an hour, in a day, how long it takes them to answer a call, the length of time they're talking, etc. In the reports, each analyst is called an Agent. Each Agent belongs to an Agent Group, and each Agent Group is associated with an Application.

Once I have the data into the DB, I'll need to make pretty reports that can be exported to management, and also to agents, on a daily basis.

There are 2 reports that deal specifically with Agents - an Agent Profile report, and an Agent Evaluation report. I'll give examples of one of the reports. The rest of the reports aren't exactly conducive to being distilled into text without 40 minutes of typing.

AGNTNAME,07:18:56,03:29:36,26,265,74,0,339,11

Agent Evaluation Report
- Agent name
- Length of time (HH:MM:SS) that an agent was signed in to their phone
- Length of time (HH:MM:SS) that an agent was ready to take calls
- How many calls they took (INT)
- Then there are several calculated averages, these are calculated by whatever generates the reports. Unless otherwise noted, these are integers expressing a total time in seconds (think 180 vs 00:03:00)
  - talk time (time a call starts until an agent disconnects)
  - work time (time spent unavailable after taking a call, until available to take calls again)
  - hold time (time a caller was on hold with a particular agent)
  - handling time (talk time + work time)
  - calls per hour (this is an imaginary #, meaning that it's extrapolated based on the amount of calls an agent takes against the amount of time they were logged in. I'm not sure what the formula is that is used to get this number)

The Agent Profile Report breaks an agents day down into distinct login in periods. Anytime an agent becomes unavailable, a new login period is generated, along with a new line of data structured similarly to the Agent Evaluation Report, but with more fields that measure more and more analytical points. Most of which are averages, or manufactured averages (meaning they aren't true averages of actual numbers, but averages of computed numbers based on some secret-sauce criterion).

Agents are also group into logical subsets based on skill, these are called Agent Groups. Each Agent Group belongs to 1 or more Applications. You can think of an Application as a call queue ("Press 1 for password resets", "Press 2 for Microsoft Office help"; etc). However, each Application has a script that determines how a call gets routed to up to 10 Agent Groups that are associated with an Application.

This is where determining relationships gets hairy, because there's nothing in the reports that tells me "call X was routed to Agent Group Y because of criteria Z". So I end up with 3 objects that are hard to relate together reliably.

Agents
Agent Groups
Applications

An Agent belongs to 1 or more Agent Groups. An Agent belongs to 0 Applications (directly - they get associated through Agent Groups).

An Agent Group can have 1 or more Agents. An Agent Group belongs to 1 or more Applications.

An Application has between 1 and 10 Agent Groups. An Application has 0 Agents (again, directly).

Because I'll be required to keep historical data, I'll need a way to weed out stats for agents that no longer exist, so I'm not emailing stats to nonexistent email addresses.

Hope the extra information helps.

Without a sample of the data you hope to parse and then store it's quite hard to say where you need to go from where you already are. It's possible that someone could suggest a reasonable database schema, or at the very least say that whether a 3NF schema is viable.

Based on what you've provided, the question of whether to learn more or just proceed with a denormalised database requires a consideration of time, effort and the extent to which the database is to be used.

A normalised 3NF schema will:

take the greatest design time
enable a minimal-complexity application
provide an excellent position from which to denormalised with respect to performance

A denormalised schema will:

take the least design time (quickest option: one DB row per CSV line)
shift complexity from the database layer to the application layer
present future maintenance nightmares

The application layer can always compensate for shortcomings of the database layer at the cost of increased application complexity. Complexity can be managed through intelligent software design, and a good OO design can make the complex look simple. Consider where your strongest skills lie.

If your skills aren't in DB schema design and you can handle the increased application complexity, go for a quick schema design and crack on with getting the application working. Results trump perfection in a business environment.

If you have plenty of time, learn more about DB schemas and find a 3NF form that works for your data.

Remember that performance is relative to the frequency of use. Performance can be a pain for users if you need to generate reports from the application one per minute, less of a pain if the reports are run daily, no pain at all if the report generation is automated and happens once a day over night.

An ideal approach would be to:

determine the relationships between data (someone knows how those scripts you don't control work)
create a good 3NF schema
create the least-complex application needed (aided by a good 3NF schema)
iterate: denormalise as needed based on performance and user feedback

Keep in mind the business considerations. Getting out something that works in two weeks instead of something perfect in two months may be a better option. You may have difficulty convincing management of the time and cost sink of the two month solution (which may require extensive learning on your part).

If you're not sure which direction to take:

go for the quickest option that is most management friendly that you know you can handle
make sure you are aware of the shortcomings in your design
make management aware of the shortcomings of your quick option and what time/cost investment is required to overcome the shortcomings
you never know until you start whether any perceived shortcomings will actually be a problem
based on usage levels, your quick-and-it-works option might well be good enough