How can I parse email text for components like <salutation><body><signature><reply text> etc?

Developer, https://www.devze.com, 2023-03-06 03:49. Source: web

I'm writing an application that analyzes emails and it would save me a bunch of time if I could use a Python library that would parse email text down into named components like <salutation><body><signature><reply text> etc.

For example, the following text "Hi Dave,\nLets meet up this Tuesday\nCheers, Tom\n\nOn Sunday, 15 May 2011 at 5:02 PM, Dave Trindall wrote: Hey Tom,\nHow about we get together ..." would be parsed as

Salutation: "Hi Dave,\n"
Body: "Lets meet up this Tuesday\n"
Signature: "Cheers, Tom\n\n"
Reply Text: "On Sunday, 15 May 2011 at 5:02 PM, Dave Trindall wrote: ..."

I know there's no perfect solution for this kind of problem, but even a library that does a good approximation would help. Where can I find one?


https://github.com/Trindaz/EFZP

This provides the functionality asked for in the original question, plus fair recognition of email zones as they commonly appear in email written by native English speakers using common email clients like Outlook and Gmail.


If you score each line based on the types of words it contains, you may get a fairly good indication of which component it belongs to.

E.g. a line with greeting words near the start is probably the salutation (salutations may also contain phrases that refer to the past, e.g. "it was good to see you last time").

A body will typically contain words such as "movie", "concert", etc. It will also contain verbs (go to, run, walk, etc.), question marks and offers (e.g. want to, can we, should we, prefer...). Check out http://nodebox.net/code/index.php/Linguistics#verb_conjugation http://ogden.basic-english.org/ http://osteele.com/projects/pywordnet/

The signature will contain closing words.

If you find a data source that has messages with the structure you want, you could do some frequency analysis to see how often each word occurs in each section.

Each word would get a score [salutation score, body score, signature score, ..]. E.g. "hello" could occur 900 times in the salutation, 10 times in the body, and 3 times in the signature. This means "hello" would get assigned [900, 10, 3, ..], while "cheers" might get assigned [10, 3, 100, ..].

Now you will have a large list of about 500,000 words. Words that don't have a large range aren't useful. E.g. "catch" might have [100, 101, 80, ..] = range of 21 (it was good to catch up, wanna go catch a fish, catch you later). "Catch" can occur anywhere.

By dropping those low-range words you can reduce the list down to about 10,000 words.
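The frequency analysis and range filter described above could be sketched roughly like this. It is only a minimal illustration: the function names, the three-section layout, and the tiny training set are all made up for the example, and it assumes you already have lines labeled with their section.

```python
from collections import defaultdict

# Assumed section order for every score vector (illustrative choice).
SECTIONS = ("salutation", "body", "signature")
PUNCTUATION = ".,!?:;\"'"

def build_word_scores(labeled_lines):
    """Count how often each word occurs in each section.

    labeled_lines: iterable of (section_name, line_text) pairs.
    Returns {word: [salutation_count, body_count, signature_count]}.
    """
    scores = defaultdict(lambda: [0] * len(SECTIONS))
    for section, text in labeled_lines:
        idx = SECTIONS.index(section)
        for raw in text.lower().split():
            word = raw.strip(PUNCTUATION)
            if word:  # skip tokens that were pure punctuation
                scores[word][idx] += 1
    return dict(scores)

def drop_low_range_words(scores, min_range):
    """Keep only words whose counts differ enough between sections."""
    return {w: v for w, v in scores.items() if max(v) - min(v) >= min_range}

# Hypothetical labeled data, far too small for real use.
training = [
    ("salutation", "Hello Dave,"),
    ("salutation", "Hello Tom,"),
    ("body", "Can we catch a movie on Tuesday?"),
    ("signature", "Cheers, Tom"),
    ("signature", "Catch you later, Dave"),
]
scores = build_word_scores(training)
useful = drop_low_range_words(scores, min_range=2)
```

With this toy data "catch" shows up in both body and signature, so it is filtered out, while "hello" only ever appears in salutations and survives.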

Now for each line, give the line a score of the same form [salutation score, body score, signature score, ..].

This score is calculated by adding the score vectors of each word in the line.

E.g. the sentence "hello cheers for giving me your number" could be: [900, 10, 3, ..] + [10, 3, 100, ..] + .. = [900+10+.., 10+3+.., 3+100+.., ..] = [1023, 900, 500, ..], say.

Then, because the biggest number is in the first position (the salutation score), this sentence is a salutation.

So to decide which component any one of your lines belongs to, add up the scores of its words and pick the position with the largest total.
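As a rough sketch of that line-scoring step, something like the following could work. The word table here is hypothetical: the "hello" and "cheers" rows reuse the example figures above, the other rows are invented, and returning None for lines with no known words is just one possible choice.

```python
SECTIONS = ("salutation", "body", "signature")

# Hypothetical score table: [salutation, body, signature] counts per word.
WORD_SCORES = {
    "hello": [900, 10, 3],
    "cheers": [10, 3, 100],
    "regards": [2, 1, 150],
    "tuesday": [5, 120, 4],
}

def score_line(line, word_scores=WORD_SCORES):
    """Sum the score vectors of every known word in the line."""
    total = [0] * len(SECTIONS)
    for raw in line.lower().split():
        vec = word_scores.get(raw.strip(".,!?:;\"'"))
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total

def classify_line(line, word_scores=WORD_SCORES):
    """Return the section with the largest total, or None if no word is known."""
    total = score_line(line, word_scores)
    if not any(total):
        return None
    return SECTIONS[total.index(max(total))]

print(score_line("hello cheers for giving me your number"))     # [910, 13, 103]
print(classify_line("hello cheers for giving me your number"))  # salutation
print(classify_line("Kind regards, Tom"))                       # signature
```

Only "hello" and "cheers" are in the table, so the totals are smaller than in the prose example, but the salutation position still wins.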

Good luck. There is always a trade-off between computational complexity and accuracy. If you can find a good set of words and build a good model to base your calculations on, it will help.


The first approach that comes to mind (not necessarily the best...) would be to start off by using split. Here's a little bit of code:

linearray = emailtext.split('\n')

Now you have an array of strings, one per line of the email,

so linearray[0] would contain the salutation.

Deciding where the reply text starts is a little more tricky. I noticed that there is a double newline just before it, so maybe search for that from the back and hope that the last one indicates the start of the reply text.

Or store some signature words you might expect and search for those from the front, like cheers, regards, and whatever else.

Once you figure out where the signature is, the rest is easy.
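Putting those heuristics together, a minimal sketch might look like the one below. The function name, the list of signature words, and the returned dict keys are all assumptions made for illustration, and it will misfire on plenty of real emails (for example, when the body itself contains a blank line and there is no reply text).

```python
# Hypothetical list of closing words; extend as needed.
SIGNATURE_WORDS = ("cheers", "regards", "thanks", "best")

def split_email(email_text):
    """Very rough split into salutation, body, signature, and reply text."""
    # Reply text: everything after the LAST double newline, searched from the back.
    head, sep, reply = email_text.rpartition("\n\n")
    if not sep:  # no double newline found, so assume there is no reply text
        head, reply = email_text, ""

    lines = head.split("\n")
    salutation = lines[0]  # first line is assumed to be the salutation
    rest = lines[1:]

    # Signature: starts at the first remaining line that begins with a closing word.
    sig_start = len(rest)
    for i, line in enumerate(rest):
        if line.strip().lower().startswith(SIGNATURE_WORDS):
            sig_start = i
            break

    return {
        "salutation": salutation,
        "body": "\n".join(rest[:sig_start]),
        "signature": "\n".join(rest[sig_start:]),
        "reply_text": reply,
    }

emailtext = ("Hi Dave,\nLets meet up this Tuesday\nCheers, Tom\n\n"
             "On Sunday, 15 May 2011 at 5:02 PM, Dave Trindall wrote: Hey Tom,\n"
             "How about we get together ...")
parts = split_email(emailtext)
print(parts["salutation"])  # Hi Dave,
print(parts["signature"])   # Cheers, Tom
```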

Hope this helped.


I actually built a pretty cheap API for this. It parses the contact data from the signatures of emails and email chains. It's called SigParser. You can see the Swagger docs here for it.

Basically you send it an 'x-api-key' header with a JSON body like the one below, and it parses all the contacts in the reply chain of an email.

{
  "subject": "Thanks for meeting...",
  "from_address": "bgates@example.com",
  "from_name": "Bill Gates",
  "htmlbody": "<div>Hi, good seeing you the other day.</div><div>--</div><div>Bill Gates</div><div>Cell 777-444-8888</div><a href=\"https://www.linkedin.com/in/williamhgates/\">LinkedIn</a><a href=\"https://twitter.com/BillGates\">Twitter</a>",
  "plainbody": "Hi, good seeing you the other day. \r\n--\r\nBill Gates\r\nCell 777-444-8888",
  "date": "Mon, 28 May 2018 23:33:40 +0000 (UTC)"
}
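For what it's worth, building such a request in Python could look roughly like this. This is only a sketch using the standard library: the URL is a made-up placeholder (the real endpoint is in the Swagger docs), the function name is hypothetical, and the POST method is an assumption. Nothing is sent until you call urllib.request.urlopen on the result.

```python
import json
import urllib.request

# Hypothetical placeholder; look up the real endpoint in the service's Swagger docs.
API_URL = "https://api.example.invalid/parse"

def build_request(api_key, email, url=API_URL):
    """Build (but do not send) a JSON request carrying the x-api-key header."""
    payload = json.dumps(email).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",  # assumed; check the API docs
    )

email = {
    "subject": "Thanks for meeting...",
    "from_address": "bgates@example.com",
    "from_name": "Bill Gates",
    "plainbody": "Hi, good seeing you the other day. \r\n--\r\nBill Gates\r\nCell 777-444-8888",
    "date": "Mon, 28 May 2018 23:33:40 +0000 (UTC)",
}
req = build_request("YOUR-API-KEY", email)
# To actually send it: urllib.request.urlopen(req)
```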