I have requirements from our client where we basically have to 'parse' PDF files from various different sources.
The solution we have come with, as 'phase 1' (as we have short time to market and will save them a huge amount of time) is
Manually use Able2Extract application to pull out the columns you need from the PDF file, and spit out an Excel file. This excel file is still very 'dirty' as it contains tons of header information, extra fields that we don't need, etc..
Run our application, feeding it this excel file, which will do the remaining of the cleanup. It takes this 'dirty' Excel file and then gives them a very clean excel file which just has the 3 or 4 columns they need all lines up very neatly.
The first solution we are exploring is using VBA/Excel for step 2). They take their dirty output, paste it in Excel, then run our cleanup macro. Excel is great for this sort of stuff - shifting around and scrubbing data that's already in an Excel spreadsheet. We did a proof of concept with one specific 'source' file, and it came out great. Tooks around half a day to develop this one 'scrubbing script'...
Simple enough huh? Not really. This script only works for one specific file type from one specific source. We will have 10 different sources each with possible 3-10 different file types. That means in the end, we may wind up with a huge Excel macro that has 120 of these very specific 'scrubbing scripts'. So my worry is about long term maintainability here. We might also bump into files that we had never seen before that might 'break' our scrubbing script and have to do a quick re-deply / ch开发者_StackOverflow社区ange to a scrubbing script... I've never used Visual Studio Tools for Office and have minimal experience with VBA Excel Macros - but it seems like this might be a good case here.
Any words of wisdom from someone who might have done something similar to this before? Are huge VBA Macros like what could result here nightmares to maintain? Is VSTFO a good alternative that will give me that 'easy to shift/scrub data' functionality, but with scalability and robustness? To be honest; my first instinct was a pure .NET solution with dynamically compiled scripts pulled from the database, using our Syncfusion Excel API to do the cleaning/scrubbing... but perhaps this is overkill.
First, you are going to need 'n' scrubbing programs no matter what. The fact is, that Excel/VBA is not much worse for maintaining this functionality than many other platforms.
You could add an interface using Userform, or play the auto-detect game, spitting out any 'new' file formats that it doesn't understand. There are several robust error handling schemes available as well, so there's no need to worry that things will get broken.
One Oil company paid me to write an Excel application using 4 Userforms and over 5000 lines of VBA as a tool to assist its accountants in doing monthly joint venture reporting. The application was used for 4 years past its End-of-life because the interface was so familiar and easy to use.
...sorry for rambling on about this, but there is a tendency to 'look down' on VBA because so few 'real programmers' use it...
VBA is much much easier to deal with than VSTO. OK, VBA may not be such a nice language to work with, but at least it gives on-the-metal access to the Excel object model. And a solution based on VBA is likely to be much more stable than one built on VSTO.
I'd say go with VBA, and if you're concerned about the maintainability, think about storing the "scrubbing scripts" in separate files. You can either
(a) have one Excel file per scrubbing script, each with a single macro with the same name; your add-in can load (and execute code in) the appropriate Excel file for any given input file
(b) have one text file per scrubbing script, each with the text of the same macro as above; your add-in can create import this as a new module at run-time - either into itself or into a temporary workbook. This is less efficient, but plays better with version control systems, since you can diff between versions of text files but it's not so easy to diff the modules in two Excel workbooks.
In both of these cases, you can store the scrubbing scripts in a shared folder so that you have centralized update if you need to modify a script.
I love programming in C# but I hate VSTO.
The two main problems I have:.
you've got no live access to the code anymore, it's all compile to a DLL which is attached to the workbook, with no debugging-on-the-go (which can be very useful for little RAD pieces). Debugging through Visual Studio is not an alternative to the ability to debug anywhere when using Excel VBA.
you're using the Excel VBA interface wrapped up for .NET usage rather than something that feels native. You have horrible function calls like
sheet.get_Range("A1:B1", System.Type.Missing);
with Missing being in the place of optional parameters.
There are a lot of people who use VSTO but having spent many years on the Excel VBA platform, I found few reasons to migrate at this point. But consider if you need to do some pretty cool stuff in C#/.NET that you can't pull off in VBA (e.g. reflection).
You can write very good code in VBA; it gets a lot of bad press as it's an environment that doesn't penalize you for writing bad code and absolutely anyone can dabble with VBA.
These may be just gripes of a grumpy developer who is experienced with VBA and not VSTO. So having said all that - if you are unfamiliar with VBA, you might just be better off going straight to VSTO. I'm not sure what Microsoft intends to do with VBA down the line; VSTO is supposed to be the future.
I have written many VBA functions within Excel, some of which have become very large and complex. I do not think maintaining them is that much harder then handling any other large project, except in the cases where people do not understand VBA very well. VBA gives you many ways to do things, most of which are not optimal. For example, if you are not very careful you will have a lot of code that looks like
Workbook("name").Sheets("name).Range(Cells(1,1),Cells(3,1)).Value = "Test"
Workbook("name").Sheets("name).Range(Cells(1,1),Cells(3,1)).Font.Bold = True
Workbook("name").Sheets("name).Range(Cells(1,1),Cells(3,1)).Font.Italics = True
Where it should be something like
With Workbook("name").Sheets("name).Range(Cells(1,1),Cells(3,1))
.Value = "Test"
With .Font
.Bold = True
.italics = True
End With
End With
Both will do the same thing, however the second one should be a slightly better performer (there are probably better examples of this) and is atleast in my opinion easier to maintain.
With that said, if you and your team have the experience to write good VBA code, then I think that it is the way to go here. Otherwise, for long term maintainability, I would consider a solution that you have more experience with.
I wouldn't be writing anything that needs long term maintainability in VBA, but if its short term VBA would be fine.
In terms of performance VBA is slightly faster than .NET, but you lose so many nice features, and with the new versions of VSTO gripes such as with debugging and full OM access are things of the past.
If all of the code is purely for Excel OM manipulation I would still consider VBA as it will be slightly faster and no clear advantage to using .NET (other than the mixture of familiarity within the team that was mentioned above).
If you are using other libraries then use .NET - the main reason is that you get rid of 1/2 dozen library dependancies that you'll need to add in VBA such as FSO, ADO, CDO, etc.
Another common complaint you hear is that you have to use get accessor from C# and that you have to use Type.Missing alot.
With the newer version of .NET the type.missing is a thing of the past. The get accessor problem was only with the early version of the interop library, and I think a common misunderstanding of the usage of the range object and the range property in C#.
I've never had to use accessor methods at all, and once you write some wrapper methods for common Excel OM methods, you won't have to write missing parameters at all either. Apparently .NET 4.0 has an even better way to solve this problem.
I reckon you should go with your first instict.
Although pulling dynamically compiled scripts from the DB sure does sound like overkill to me. I probably don't fully understand your problem because I'm not sure what issue pulling dynamically compiled scripts from a DB solves.
You've got Syncfusion Excel API, for step #2 why not just write a pure .net application using Syncfusion to load and manipulate the excel files and re-save them. When you encounter new file types to support you update the application and re-distribute it.
This solution might possibly take a little longer to develop but:
- Will entirely be in .NET (I loathe VBA).
- Will not be using Excel as a server app (which another poster already pointed out is not something that Excel was built to do and MS advise against it for the reasons the other poster mentioned).
- Will (based on my experience) perform an order of magnitude faster than VSTO (interop) and probably VBA too.
If step 2 eventually needs to be a service, and you're willing to invest more time upfront (depends on your deliverable schedule) and you're dealing with excel in Open XML (although possible with the older binary formats as well) - check out the Open XML SDK and have a look at Microsoft's recommended server side automation of Office documents.
If you need to deliver it fast, VBA will get you there. If you want something that is easy to package and distribute, VSTO will get you there with a little more effort. If you need a service, go for something else entirely.
Referring to the wider question, the things to take into account:
- VBA IDE ships with Excel. Not so easy with VSTO if you want a wider group to edit code.
- More people know how to write VBA than VSTO at this stage.
- More online support for VBA at this stage.
- VBA is not designed to be anything more than automation langauge for Office products. It is perfectly adequate for that and won't be going away anytime soon. MS realise that it is one of the things that Office has over OpenOffice - Ken from Accounts is not about to sit down with Eclipse and start typing Public Static Void Main
- There are considerable limitations to VBA once you want to start using it like app code. Just including class libraries is a pain. If this will be widely distributed, I would go with VSTO.
As a poster above stated: 5,000 lines of code is 5,000 lines of code, give or take.
I am not a big fan of VSTO. VBA works for what it is intended to do. No need to rewrite it. If you need to get hard code, use C#.
Maybe Excel Services for Microsoft Office SharePoint Server 2007/2010 could be something? It seems that Excel Services can't be used without SharePoint though [look here].
Excel Services 2007 - Overview
Excel Services 2007 - Architecture
Excel Services 2010 - Overview
Excel Services 2010 - Architecture
What Is Excel Services 2007?
精彩评论