开发者

How does "cut and paste" affect character encoding and what can go wrong?

开发者 https://www.devze.com 2022-12-14 05:01 出处:网络
I have a document A in encoding开发者_运维问答 A displayed in tool A and a document B in encoding B displayed in tool B. If I cut and paste (part of) B into A what might be the resultant character enc

I have a document A in encoding开发者_运维问答 A displayed in tool A and a document B in encoding B displayed in tool B. If I cut and paste (part of) B into A what might be the resultant character encoding? I realise this depends on tool A and tool B and the information held in the paste buffer (which presumably can contain an encoding?) and the operating system.

What should high-quality tools do? and in practice how many of the common tools (e.g. Word, TextPad, various IDEs, etc.) do a good job?


First of all, a text editor's internal representation of text has no bearing on how the text is encoded (serialized) when you save the file. So a document is not "in" an encoding; it's a sequence of abstract characters. When the document is saved to a file (or transmitted over the network) then it gets encoded.

It's up to each application to decide what it puts on the clipboard. Typically, a windows app that knows what it's doing will put a number of different representations on the clipboard. When you paste in the other app, the app will look for the representation that best suits its need.

In your case, a text editor (that knows what it's doing) will put a Unicode representation of a selected string onto the clipboard (where Unicode, in Windows, is typically moved around as UTF-16, but that's not important). When you paste in the other app, it will insert that sequence of Unicode characters into the document at the selection point.

There's an app floating around called "ClipSpy" that will help you see what I'm talking about, interactively.


I observed the following behavior when I looked into Unicode normalization: When copying a canonically decomposed string (NFD) in Firefox in macOS 10.15.7, the string is normalized to NFC when pasting it in Chrome. What's weird is that the pasting affects the content of the clipboard: When pasting the string in Firefox again, it's then also canonically composed there. If I don't paste it anywhere else before pasting it in Firefox again, the NFD form survives. Interestingly, the problem doesn't occur in the other direction: When copying a canonically decomposed string in Chrome, it's pasted in NFD form anywhere I can tell. My conclusion is that Firefox stores text to the clipboard differently from other applications. One way to play around with this yourself is to copy 'mañana' === 'mañana' to your JavaScript console. The statement returns false if the NFD form of the string on the right survived the copy & paste.


This is a very good question. When you copy/paste, exactly what is copied/pasted - CHARACTERS or BYTES?. And if BYTES, what encoding are they in?

From the answers, it sounds like the answer is "it depends". Different programs will put different things in the clipboard, sometimes placing multiple representations.

Then the pasting program needs to pick the best one and "do the right thing" with it.


Following my conversion with @Kaspar Etter, I did some testing. Here is what I found:

Copy from and Paste to:

Firefox:
Firefox to Firefox: NO normalization
Other apps to Firefox: NO normalization
Firefox to other apps: normalization

Even if we use AppleScript, JXA, or Python to directly read the SystemClipboard that contains the text copied from Firefox, the text is still normalized. Since copying and pasting from Firefox to Firefox does not involve normalization, Firefox probably does not normalize the text during the copy process. I have no idea when the normalization happens.

Safari (MacOS, not iOS):
Safari to Safari: normalization
Other apps to Safari: normalization
Safari to other apps: NO normalization

For Safari (MacOS), the normalization also happens at least on Canvas by instructure.com. In the fill-in-blank questions of Classic Quizzes, when students type Hebrew words in quizzes and hit "submit", the input was normalized, but the answer key was not. In that of the New Quizzes, however, both the input and the answer key are normalized. It's a mystery to me.

Chrome:
Chrome to Chrome: NO normalization
Other apps to Chrome: NO normalization (Firefox overrides)
Chrome to other apps: NO normalization (Safari overrides)

Conclusion: Firefox and Safari behave in the opposite way. Chrome behaves normally and consistently (except when it is overridden by Firefox and Safari).

0

精彩评论

暂无评论...
验证码 换一张
取 消