Decoding UTF-8 email subject?

Developer (https://www.devze.com), 2023-04-08 15:22. Source: Internet

I have a string in this form: =?utf-8?B?zr...

And I want to get the name of the file in proper UTF-8 encoding. Is there a library method somewhere in Maven Central that will do this decoding for me, or will I need to test the pattern and decode Base64 manually?

In MIME terminology, those encoded chunks are called encoded-words. Check out javax.mail.internet.MimeUtility.decodeText in JavaMail. The decodeText method will decode all the encoded-words in a string.

You can pull it in from Maven Central with this dependency:

 <dependency>
   <groupId>javax.mail</groupId>
   <artifactId>mail</artifactId>
   <version>1.4.4</version>
 </dependency>

MimeUtility.decodeText works for me, for example:

MimeUtility.decodeText("=?UTF-8?B?4K6q4K+N4K6q4K+K4K604K6/4K614K+BIQ==?=");

javax.mail.internet.MimeUtility.decodeWord() decodes a single encoded-word.

On the other hand, if you use JavaMail for decoding your emails, you don't have to care about either subject parsing or MIME body (attachments) parsing at all.

By the way, the encoding does not have to be Base64 (common with Apple's clients); it can also be Quoted-Printable (common with MS Outlook).

Thunderbird uses whichever format is shorter (Base64 for Japanese, QP for most European languages).
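To see why the shorter form depends on the language, here is a rough Java sketch that compares the encoded-text length of the two encodings for the UTF-8 bytes of a string. The class and method names are made up for illustration, and this is not Thunderbird's actual algorithm, only the "pick the shorter one" idea under the assumption that the charset is UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration; not taken from any mail client.
final class EncodingLengthSketch {
    /** Length of the B (Base64) encoded-text for the UTF-8 bytes of s. */
    static int base64Length(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(bytes).length();
    }

    /** Length of the Q encoded-text for the UTF-8 bytes of s. */
    static int qLength(String s) {
        int n = 0;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            // Letters, digits and a few symbols stay literal; space becomes '_'.
            boolean oneChar = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9') || v == ' ' || "!*+-/".indexOf(v) >= 0;
            // Every other byte is written as =XX, which costs three characters.
            n += oneChar ? 1 : 3;
        }
        return n;
    }

    /** Returns 'Q' or 'B', whichever yields the shorter encoded-text. */
    static char shorterEncoding(String s) {
        return qLength(s) <= base64Length(s) ? 'Q' : 'B';
    }

    public static void main(String[] args) {
        System.out.println(shorterEncoding("Caf\u00e9 menu for Friday"));
        System.out.println(shorterEncoding("\u3053\u3093\u306b\u3061\u306f"));
    }
}
```

Mostly-ASCII text pays the =XX cost only for the occasional accented letter, while text that is entirely multi-byte triples in size under Q, so Base64 wins there.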

If you really want to implement it yourself, read RFC 2047 and RFC 2184 (since obsoleted by RFC 2231). You have to, because there are a few subtleties, such as a value split across encoded-words in two different character sets, or adjacent encoded-words separated only by folding white space, which must be merged.
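For a feel of what a hand-rolled version involves, here is a minimal Java sketch using only the standard library. The class name EncodedWordSketch and its methods are hypothetical, not part of JavaMail. It handles both B and Q encoded-words, decodes each word with its own charset, and drops white space that only separates two encoded-words. It assumes well-formed input and does not cover every corner of the RFCs, so treat it as an illustration rather than a replacement for MimeUtility.decodeText.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of JavaMail.
final class EncodedWordSketch {
    // Matches =?charset?encoding?encoded-text?= where encoding is B or Q.
    private static final Pattern ENCODED_WORD =
            Pattern.compile("=\\?([^?\\s]+)\\?([bBqQ])\\?([^?\\s]*)\\?=");

    static String decode(String header) {
        StringBuilder out = new StringBuilder();
        Matcher m = ENCODED_WORD.matcher(header);
        int last = 0;
        boolean previousWasEncoded = false;
        while (m.find()) {
            String gap = header.substring(last, m.start());
            // White space that only separates two encoded-words is dropped.
            if (!(previousWasEncoded && gap.trim().isEmpty())) {
                out.append(gap);
            }
            // Ignore an optional "*language" suffix on the charset (RFC 2231).
            String charsetName = m.group(1).split("\\*", 2)[0];
            Charset charset = Charset.forName(charsetName); // throws if unknown
            byte[] raw = m.group(2).equalsIgnoreCase("B")
                    ? Base64.getDecoder().decode(m.group(3))
                    : decodeQ(m.group(3));
            out.append(new String(raw, charset));
            last = m.end();
            previousWasEncoded = true;
        }
        return out.append(header.substring(last)).toString();
    }

    // Q encoding: '_' means space, =XX is a hex-encoded byte, the rest is literal.
    private static byte[] decodeQ(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '_') {
                bytes.write(' ');
            } else if (c == '=' && i + 2 < text.length()) {
                int hi = Character.digit(text.charAt(i + 1), 16);
                int lo = Character.digit(text.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Bad =XX escape in: " + text);
                }
                bytes.write((hi << 4) | lo);
                i += 2;
            } else {
                bytes.write(c);
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // prints "hello world"
        System.out.println(decode("=?UTF-8?B?aGVsbG8=?= =?ISO-8859-1?Q?_world?="));
    }
}
```

Each encoded-word is decoded on its own, which works because RFC 2047 requires every encoded-word to contain whole characters. Malformed Base64 or an unknown charset surfaces as an unchecked exception here, whereas a production decoder would more likely fall back to the raw text.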