I need to parse a CSV file which has this header:
Company;Registered office;Notifying party;Domicile or Registered office;Holdings of voting rights;;;;;;Publication
;;;;directly held;;additionally counted;;total;;in Germany;;in foreign countries
;;;;percentage;single rights;percentage;single rights;percentage;single rights;Official stock exchange
I was wondering whether this is a standard header format, because I expected to have all the fields listed one after another, like (in the first row) "Holdings of voting rights-directly held-percentage;Holdings of voting rights-directly held-single rights", while I see that information spread over three lines.
Currently my file has 6 header lines (the three shown and another three in another language). How can I detect it if, some day, they add more header lines? The file continues with the following line (the first data row) and so on. The first line of real data isn't always the same.
BBS Kraftfahrzeugtechnik AG;Schiltach;Baumgartner, Heinrich;Deutschland;62,5;;37,5;;100,0;;Börsenzeitung;04.04.2002
I'm also looking for java libraries which are able to parse CSV files.
I disagree with others who claim that only commas are allowed. Wikipedia, for example, gives the case of German CSV, which uses semicolons as the field separator (because commas are used as the decimal separator). I think MS Excel is also pretty flexible about which delimiters to use. It's just programmers' minds that tend to gravitate towards the most simplistic case.
For CSV parsing I recommend Ostermiller Utils.
Q> How can I detect it if, some day, they add more header lines?
A> You can't. The only thing you can rely on is either a dynamic layout (where you know the column names in advance) or a static layout (where you assume that a given column is always the n-th).
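As a rough illustration of the "dynamic layout" option, here is a minimal sketch that looks fields up by column name instead of by fixed position. The class and method names (ColumnLookup, indexByName, field) are made up for this example, and it assumes a single header line with no quoted fields containing semicolons.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: names are illustrative, not from any library.
final class ColumnLookup {

    /** Maps each non-empty header label to its column index. */
    static Map<String, Integer> indexByName(String headerLine) {
        Map<String, Integer> index = new HashMap<>();
        // limit -1 keeps trailing empty cells, which split() would otherwise drop
        String[] names = headerLine.split(";", -1);
        for (int i = 0; i < names.length; i++) {
            String name = names[i].trim();
            if (!name.isEmpty()) {
                index.putIfAbsent(name, i); // the first occurrence of a label wins
            }
        }
        return index;
    }

    /** Returns the field of dataLine that sits under the named column. */
    static String field(String dataLine, Map<String, Integer> index, String column) {
        Integer i = index.get(column);
        if (i == null) {
            throw new IllegalArgumentException("Unknown column: " + column);
        }
        String[] fields = dataLine.split(";", -1);
        return i < fields.length ? fields[i] : "";
    }
}
```

With the first header line from the question, looking up "Registered office" in the sample data row yields "Schiltach". If the provider reorders columns the lookup still works; if they rename a column it fails loudly, which is usually what you want.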
Despite CSV (Comma Separated Values) files having the word comma in their name, I've seen some very weird stuff in the enterprise world.
I would suggest creating your own representation of the data. It sounds like you may be reading in multiple files all formatted a bit differently?
I would approach the problem in a modular fashion. Have importers for the different formats, and bring the data into a normalized representation that you then do what you want with.
This is all assuming that these files contain the same type of data and that you have no control over the files you are receiving.
Even if this is not the case, abstracting the data from its representation and sticking that in a separate project would be useful.
I would also recommend the use of OpenCSV.
Yes, you have a legitimate CSV file. I read it in successfully with Excel, and suspect I would have no problem with OpenOffice. For Excel, I saved it as a .txt file, but then had to tell Excel in the import dialogue that it was delimited by semicolons.
This is "standard" in the sense that it is separating columns by a delimiter (semicolons are OK, as are tabs and of course commas) and rows by new lines.
The reason you were given this format is that the second and third header lines don't line up one-to-one with the first line. "Holdings of voting rights" spans 6 columns. Underneath it, on the second header line, "directly held" spans 2 columns, as do "additionally counted" and "total." The third header line breaks each group of the second header line down into "percentage" and "single rights."
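If you want the single-row names the question expected, the spanning structure described above can be flattened. This is a minimal sketch, assuming that an empty cell in an upper header row means "continue the label to its left" and that a new label in an upper row ends any span in the rows below it. HeaderFlattener and the " - " joiner are my own choices for illustration, not part of any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: flattens stacked header rows into one name per column.
final class HeaderFlattener {

    static List<String> flatten(List<String> headerLines) {
        List<String[]> rows = new ArrayList<>();
        int width = 0;
        for (String line : headerLines) {
            String[] cells = line.split(";", -1); // -1 keeps trailing empty cells
            rows.add(cells);
            width = Math.max(width, cells.length);
        }

        String[] carried = new String[rows.size()]; // current spanning label per row
        List<String> names = new ArrayList<>();
        for (int col = 0; col < width; col++) {
            List<String> parts = new ArrayList<>();
            for (int r = 0; r < rows.size(); r++) {
                String[] cells = rows.get(r);
                String cell = col < cells.length ? cells[col].trim() : "";
                if (!cell.isEmpty()) {
                    carried[r] = cell;
                    // a new label here ends the spans of all rows below it
                    Arrays.fill(carried, r + 1, carried.length, null);
                } else if (r < rows.size() - 1 && carried[r] != null) {
                    cell = carried[r]; // empty cell: inherit the span (never in the last row)
                }
                if (!cell.isEmpty()) {
                    parts.add(cell);
                }
            }
            names.add(String.join(" - ", parts));
        }
        return names;
    }
}
```

Applied to the three header lines from the question, the columns at index 4 and 5 come out as "Holdings of voting rights - directly held - percentage" and "Holdings of voting rights - directly held - single rights". The columns under "Publication" are irregular in the sample, so check their flattened names by hand.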
I don't think you will easily be able to find when the headers stop and the data begins. This is a semantic problem -- one of meaning. It is easier for a human, though!
This is not a CSV file. You need to get the specification for the file from whoever is generating it.
CSV files are Comma-Separated Values, with one record per line. It's a loose specification with regard to how to escape commas and escape characters. Excel puts double quotes around values that need them, and writes a double quote inside a value as two double quotes.
With regard to CSV parsing libraries, I would highly recommend OpenCSV.
Also see: Can you recommend a Java library for reading (and possibly writing) CSV files?
There is no standard header format. It can be seen as a convention that the first line is a comma-separated list of values representing the column headers.
In your case, your table has three header lines (my guess based on counting cells and comparing with the content of your data example).
It is still CSV, but you have to know in advance which line is the first line holding actual data. There is no clue given by the format itself.
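A minimal sketch of that approach, assuming you pass in the number of header lines yourself (6 for the file in the question) and that no field contains a quoted semicolon. KnownHeaderReader and its methods are names I made up; for real files a CSV library is the safer choice.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: skips a known number of header lines.
final class KnownHeaderReader {

    /** Returns the data rows, each split into fields, skipping the header lines. */
    static List<String[]> dataRows(List<String> lines, int headerLineCount) {
        List<String[]> rows = new ArrayList<>();
        for (int i = headerLineCount; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.trim().isEmpty()) {
                continue; // ignore blank lines
            }
            rows.add(line.split(";", -1)); // -1 keeps trailing empty fields
        }
        return rows;
    }

    /** Parses a German-style decimal such as "62,5" (comma as decimal separator). */
    static double parseGermanDecimal(String text) {
        try {
            return NumberFormat.getInstance(Locale.GERMANY).parse(text.trim()).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a German decimal: " + text, e);
        }
    }
}
```

Calling parseGermanDecimal on field 4 of the first row returned is also a cheap sanity check: if it throws, the header count you assumed is probably out of date.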
As far as CSV headers go, there is no standard format. In most cases, we assume that the first line is a header. If the header spans multiple lines (which I am seeing for the first time here), then you would need to know the number of header lines before you start parsing this file. At least that is a start.
The next assumption in CSV files is generally that one line is one row or record. So usually headers and data are separated by a newline. In your case, I am not sure how the file is being generated or how it is planned to be used.