I have a bunch of strings that look like this:
mc_gross=22.99invoice=ff1ca57d9fa80cf93e6b300dd7f063e1protection_eligibility=Ineligibleaddress_status=confirmedpayer_id=SGA8X3TX9HCVYtax=0.00address_street=155 5th ave sepayment_date=16:08:28 Nov 15, 2010 PSTpayment_status=Completedcharset=windows-1252address_zip=98045first_name=jackobmc_fee=1.08address_country_code=USaddress_name=john martinnotify_version=3.0custom=ff1ca5asdf7d9fa80cf93e6b300dd7f063e1payer_status=unverifiedbusiness=gold-me@hotmail.comaddress_country=United Statesaddress_city=north bendquantity=1verify_sign=AZussRXZRkuk7frhfirfxxTkj0BDJGA2dJF3eF263eEsjLixS.xRxCzfaYLpayer_email=me@gmail.comtxn_id=4DU53818WJ271531Mpayment_type=instantlast_name=Martinaddress_state=WAreceiver_email=cravbill@hotmail.compayment_fee=1.08receiver_id=QG8JPB4RZJGG4txn_type=web_acceptitem_name=Some item of consequenceSpecifiemc_currency=USDitem_number=G10W151residence_country=UShandling_amount=0.00transaction_subject=ff1ca57d9fad80cf93e6b300dd7f063e1payment_gross=22.99shipping=0.00
What is the best way to parse this? You'd figure the people who created it would have put some kind of break in it...
Anyhow, any help would be greatly appreciated.
Edit:
I appreciate everyone's post. I was wondering if I could do something like this:
- Create a list of tags, e.g. mc_gross=, first_name=, ...
- Do a replace in the string:
thestring = thestring.Replace("first_name", "\r\nfirst_name");
I'm thinking this will give me the breaks I need to parse it further.
What do you think?
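The replace idea above can be sketched roughly as follows. This is only an illustration, not a tested solution for the full data set: the class and method names (LineBreaker, InsertBreaks) are made up for the example, and it assumes that no tag followed by "=" ever appears inside a value and that no tag is a suffix of another tag.

```csharp
using System;
using System.Collections.Generic;

public static class LineBreaker
{
    // Inserts a line break in front of every known tag.
    // Hypothetical helper; assumes "tag=" never occurs inside a value.
    public static string InsertBreaks(string raw, IEnumerable<string> tags)
    {
        string result = raw;
        foreach (string tag in tags)
        {
            // Include '=' so "address_country" does not match inside "address_country_code".
            result = result.Replace(tag + "=", Environment.NewLine + tag + "=");
        }
        // The first tag also receives a leading line break; strip it.
        return result.TrimStart('\r', '\n');
    }
}
```

Calling LineBreaker.InsertBreaks(thestring, tags) and then splitting the result on Environment.NewLine would give one key=value pair per line.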
Unless this is fixed width (highly doubt it), I would say you are going to need to get a list of the keywords that indicate a field. Put them in a database (SQL, XML, CSV, etc. - doesn't really matter where) and then use them to parse the file. Hopefully this will come in the same order and it won't leave any tags out. If so, do a Substring that finds the value from the end of the equals sign after your tag to the beginning of the next tag in line. That will give you the value that corresponds to the appropriate tag.
So, for example, if we take just the first part, mc_gross=22.99invoice=ff1ca57d9fa80cf93e6b300dd7f063e1protection_eligibility=Ineligibleaddress_status=confirmed, our tags would be mc_gross, invoice, protection_eligibility, and address_status.
We would then start with mc_gross= and find it in the string using IndexOf. For the length to pass to Substring, we would go until we found our next tag, invoice. The Substring line would be complicated but it should do the job. Loop through each tag. When you get to the last tag, you would need to find the end of the string instead of another tag.
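A minimal sketch of the tag-to-next-tag loop described above might look like this. The names TagParser and Parse are invented for illustration, and the code assumes what the answer hopes for: every tag is present, the tags arrive in the listed order, and no value contains the next tag followed by "=".

```csharp
using System;
using System.Collections.Generic;

public static class TagParser
{
    // Extracts each value as the text between "tag=" and the start of the next tag.
    // Hypothetical helper; assumes tags appear in the given order and none is missing.
    public static Dictionary<string, string> Parse(string raw, IList<string> tags)
    {
        var result = new Dictionary<string, string>();
        int searchFrom = 0;
        for (int i = 0; i < tags.Count; i++)
        {
            string marker = tags[i] + "=";
            int start = raw.IndexOf(marker, searchFrom, StringComparison.Ordinal);
            if (start < 0)
                throw new FormatException("Tag not found: " + tags[i]);
            int valueStart = start + marker.Length;

            // The value ends where the next tag begins, or at the end of the string for the last tag.
            int valueEnd = raw.Length;
            if (i + 1 < tags.Count)
            {
                valueEnd = raw.IndexOf(tags[i + 1] + "=", valueStart, StringComparison.Ordinal);
                if (valueEnd < 0)
                    throw new FormatException("Tag not found: " + tags[i + 1]);
            }

            result[tags[i]] = raw.Substring(valueStart, valueEnd - valueStart);
            searchFrom = valueEnd;
        }
        return result;
    }
}
```

Throwing on a missing tag is a deliberate choice in this sketch: if the input ever leaves a tag out, failing loudly seems safer than silently shifting values onto the wrong keys.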
As others have stated, unless you can get the original data to include line breaks in the appropriate areas then the next best thing is to get the list of key names.
I assume that the 60K other lines have the same key names as the one sample line you provided? If so, if someone can't provide you the list, then manually (not programmatically) identifying the key names yourself seems to be the only way.
I tried it myself. It did not seem too bad to do (a few minutes at most) but probably still needs someone knowledgeable to confirm that the key list is correct.
Once you have the list, then you can split by the keys and then recombine them into a new list:
string rawData =
"mc_gross=22.99invoice=ff1ca57d9fa80cf93e6b300dd7f063e1protection_eligibility=Ineligibleaddress_status=confirmedpayer_id=SGA8X3TX9HCVYtax=0.00address_street=155 5th ave sepayment_date=16:08:28 Nov 15, 2010 PSTpayment_status=Completedcharset=windows-1252address_zip=98045first_name=jackobmc_fee=1.08address_country_code=USaddress_name=john martinnotify_version=3.0custom=ff1ca5asdf7d9fa80cf93e6b300dd7f063e1payer_status=unverifiedbusiness=gold-me@hotmail.comaddress_country=United Statesaddress_city=north bendquantity=1verify_sign=AZussRXZRkuk7frhfirfxxTkj0BDJGA2dJF3eF263eEsjLixS.xRxCzfaYLpayer_email=me@gmail.comtxn_id=4DU53818WJ271531Mpayment_type=instantlast_name=Martinaddress_state=WAreceiver_email=cravbill@hotmail.compayment_fee=1.08receiver_id=QG8JPB4RZJGG4txn_type=web_acceptitem_name=Some item of consequenceSpecifiemc_currency=USDitem_number=G10W151residence_country=UShandling_amount=0.00transaction_subject=ff1ca57d9fad80cf93e6b300dd7f063e1payment_gross=22.99shipping=0.00";
// Longer keys that share a prefix with a shorter one (address_country_code vs. address_country)
// are listed first so that Split matches them before the shorter key.
string[] keys = {
    "mc_gross", "invoice", "protection_eligibility", "address_status", "payer_id", "tax",
    "address_street", "payment_date", "payment_status", "charset", "address_zip",
    "first_name", "mc_fee", "address_country_code", "address_name", "notify_version",
    "custom", "payer_status", "business", "address_country", "address_city", "quantity",
    "verify_sign", "payer_email", "txn_id", "payment_type", "last_name", "address_state",
    "receiver_email", "payment_fee", "receiver_id", "txn_type", "item_name",
    "mc_currency", "item_number", "residence_country", "handling_amount",
    "transaction_subject", "payment_gross", "shipping"
};
// Split at every key; each remaining piece is "=" followed by the value.
string[] values = rawData.Split(keys, StringSplitOptions.RemoveEmptyEntries);
// Zip (System.Linq) pairs keys and pieces by position, so this relies on
// every key being present and in the same order as in the keys array.
IEnumerable<string> parsedList = keys.Zip(values, (key, value) => key + value);
foreach (string item in parsedList)
{
    Console.WriteLine(item);
}
This will output the data in this format:
mc_gross=22.99
invoice=ff1ca57d9fa80cf93e6b300dd7f063e1
protection_eligibility=Ineligible
address_status=confirmed
payer_id=SGA8X3TX9HCVY
tax=0.00
address_street=155 5th ave se
payment_date=16:08:28 Nov 15, 2010 PST
payment_status=Completed
charset=windows-1252
address_zip=98045
first_name=jackob
mc_fee=1.08
address_country_code=US
address_name=john martin
notify_version=3.0
custom=ff1ca5asdf7d9fa80cf93e6b300dd7f063e1
payer_status=unverified
business=gold-me@hotmail.com
address_country=United States
address_city=north bend
quantity=1
verify_sign=AZussRXZRkuk7frhfirfxxTkj0BDJGA2dJF3eF263eEsjLixS.xRxCzfaYL
payer_email=me@gmail.com
txn_id=4DU53818WJ271531M
payment_type=instant
last_name=Martin
address_state=WA
receiver_email=cravbill@hotmail.com
payment_fee=1.08
receiver_id=QG8JPB4RZJGG4
txn_type=web_accept
item_name=Some item of consequenceSpecifie
mc_currency=USD
item_number=G10W151
residence_country=US
handling_amount=0.00
transaction_subject=ff1ca57d9fad80cf93e6b300dd7f063e1
payment_gross=22.99
shipping=0.00
You can further parse the list by splitting each item at the equal sign ("="), or replace the original data string with one that now contains the missing line breaks:
string newData = parsedList.Aggregate((data, next) => data + Environment.NewLine + next);
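For the other option mentioned above, splitting each item at the equal sign, a small sketch could look like the following. PairParser and ToDictionary are names made up for this example; the only assumption about the data is that each item has the form key=value.

```csharp
using System;
using System.Collections.Generic;

public static class PairParser
{
    // Turns items of the form "key=value" into a dictionary.
    // Hypothetical helper; items without '=' are skipped.
    public static Dictionary<string, string> ToDictionary(IEnumerable<string> items)
    {
        var result = new Dictionary<string, string>();
        foreach (string item in items)
        {
            // Split only at the first '=' so values that contain '=' stay intact.
            string[] parts = item.Split(new[] { '=' }, 2);
            if (parts.Length == 2)
                result[parts[0]] = parts[1];
        }
        return result;
    }
}
```

With the parsedList from above, PairParser.ToDictionary(parsedList)["payment_status"] would then give the value for that key directly.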
Look into using System.Text.RegularExpressions; regular expressions can be very helpful.
But an easy way to do it would be to use the Split method of the string class.
string head = "mc_gross=22.99invoice=ff1ca57d9fa80cf93e6b300dd7f063e1protection_eligibility=Ineligibleaddress_status=confirmedpayer_id=SGA8X3TX9HCVYtax=0.00address_street=155 5th ave sepayment_date=16:08:28 Nov 15, 2010 PSTpayment_status=Completedcharset=windows-1252address_zip=98045first_name=jackobmc_fee=1.08address_country_code=USaddress_name=john martinnotify_version=3.0custom=ff1ca5asdf7d9fa80cf93e6b300dd7f063e1payer_status=unverifiedbusiness=gold-me@hotmail.comaddress_country=United Statesaddress_city=north bendquantity=1verify_sign=AZussRXZRkuk7frhfirfxxTkj0BDJGA2dJF3eF263eEsjLixS.xRxCzfaYLpayer_email=me@gmail.comtxn_id=4DU53818WJ271531Mpayment_type=instantlast_name=Martinaddress_state=WAreceiver_email=cravbill@hotmail.compayment_fee=1.08receiver_id=QG8JPB4RZJGG4txn_type=web_acceptitem_name=Some item of consequenceSpecifiemc_currency=USDitem_number=G10W151residence_country=UShandling_amount=0.00transaction_subject=ff1ca57d9fad80cf93e6b300dd7f063e1payment_gross=22.99shipping=0.00";
// The brackets go after the type in C#, and the array name must match on every line.
string[] splitStrings = new string[2];
splitStrings[0] = "mc_gross";
splitStrings[1] = "invoice";
string[] headArray = head.Split(splitStrings, StringSplitOptions.RemoveEmptyEntries);
You get the idea: add the remaining key names to the array and it breaks everything into parts.
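For the regular-expression route mentioned at the start of this answer, one possible sketch is below. It is not the only way to write the pattern; RegexPairParser and Parse are hypothetical names, and the code assumes the full list of key names is known and that no value contains a key followed by "=".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexPairParser
{
    // Finds every "key=value" pair, where a value runs until the next known key or the end of input.
    // Hypothetical helper; assumes no value contains "key=" for any known key.
    public static Dictionary<string, string> Parse(string raw, IEnumerable<string> keys)
    {
        // Alternation of all known keys, escaped in case a key contains regex metacharacters.
        string alternation = string.Join("|", keys.Select(Regex.Escape));

        // A key, '=', then the shortest run of text that is followed by another "key=" or the end.
        string pattern = "(?<key>" + alternation + ")=(?<value>.*?)(?=(?:" + alternation + ")=|$)";

        var result = new Dictionary<string, string>();
        foreach (Match m in Regex.Matches(raw, pattern, RegexOptions.Singleline))
        {
            result[m.Groups["key"].Value] = m.Groups["value"].Value;
        }
        return result;
    }
}
```

Because the pattern requires "=" right after the key, the order of the keys in the list should not matter here, and a missing key simply does not show up in the dictionary.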
Equal signs are a very good indicator of where each value starts. For the text between the equal signs, I'd suggest using some lexical tool with a type-inference engine to work out where one value ends and the next key begins.