Designing a translation API - How to handle spaces

开发者 (Developer) https://www.devze.com 2023-02-18 23:53 Source: the web

My application consumes an external Translation API (no option to use other translation engines). I'm seeing the following unexpected behavior when I call the translation engine.

input

<b1> Hello World. </b1>

expected output

<b1> Hola a todos. </b1>

actual output

<b1>Hola a todos.</b1>

Is it proper for the API to be trimming the spaces? This feels wrong to me.

Note: the API is documented to replace non-HTML tags with <b1></b1> tag pairs (the numbers increment to keep tag pairs unique).

Update: In the end I had to hack around the issue by encoding the spaces before I call the translation API. I don't like it, but I was not able to convince the API owner to change it to GIGO (Garbage In, Garbage Out).
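For anyone facing the same trimming, here is a minimal sketch of one possible workaround. It is not the encoding hack described in the update (the exact encoding is not given there) but a related technique: remember the whitespace at the edges of each tag pair before the call and re-attach it afterwards. The function name `translate_preserving_edges` and the `translate` callable are hypothetical stand-ins for the real API call, and the sketch assumes non-nested `<bN>` pairs that the API returns with the same tag names.

```python
import re
from typing import Callable, Dict, Tuple

# Matches a non-nested <bN>...</bN> pair; the backreference keeps open/close names equal.
_PAIR = re.compile(r"<(b\d+)>(.*?)</\1>", re.DOTALL)


def _edges(text: str) -> Tuple[str, str]:
    """Return the (leading, trailing) whitespace of text."""
    if not text.strip():
        # All whitespace: count it once, as leading.
        return text, ""
    lead = text[: len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]
    return lead, trail


def translate_preserving_edges(source: str, translate: Callable[[str], str]) -> str:
    """Call translate(source), then restore the edge whitespace of each tag pair.

    `translate` is a hypothetical wrapper around the external API.
    """
    saved: Dict[str, Tuple[str, str]] = {
        m.group(1): _edges(m.group(2)) for m in _PAIR.finditer(source)
    }
    translated = translate(source)

    def restore(m: "re.Match[str]") -> str:
        tag, inner = m.group(1), m.group(2)
        lead, trail = saved.get(tag, ("", ""))
        # Strip whatever the API returned, then put the original edges back.
        return f"<{tag}>{lead}{inner.strip()}{trail}</{tag}>"

    return _PAIR.sub(restore, translated)
```

With a stub that returns `<b1>Hola a todos.</b1>` for the input above, the function gives back `<b1> Hola a todos. </b1>`. Because the tag numbers are unique, they serve as the key for matching each output pair to its input pair.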

Well, in general whitespace is not considered part of a word, so it is not really surprising that the API is doing that. Whether this behaviour is OK is probably debatable (at the very least it should be documented), but you should follow the rule "be liberal in what you accept and strict in what you produce". Since you are the one producing the tokens, you should be the stricter side.

As far as I know, whitespace in HTML is not particularly significant: runs of spaces are collapsed to a single space, newlines are treated like spaces, and so on. So it's not much of a surprise that the leading and trailing spaces in that string are being dropped. From the browser's point of view, they're equivalent.

So the question then becomes: is there an option in the API to preserve spaces, or to treat the incoming text as "plain text" and not HTML?
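To make the collapsing rule from the previous answer concrete, here is a rough sketch that approximates it with a regular expression. The helper names `collapse_ws` and `same_after_collapse` are made up for illustration, and the sketch models only the "runs of whitespace become one space" part of default HTML rendering, not the removal of spaces at the start or end of a rendered line.

```python
import re

# Space, tab, line feed, carriage return, and form feed are HTML's ASCII whitespace.
_WS = re.compile(r"[ \t\n\r\f]+")


def collapse_ws(text: str) -> str:
    """Replace every run of ASCII whitespace with a single space."""
    return _WS.sub(" ", text)


def same_after_collapse(a: str, b: str) -> bool:
    """True if a and b differ only in how their whitespace runs are written."""
    return collapse_ws(a) == collapse_ws(b)
```

Under this rule `"a  b"` and `"a\nb"` compare equal, but `<b1> Hello World. </b1>` and `<b1>Hello World.</b1>` do not: collapsing leaves one space next to each tag. Whether that space is visible depends on where the tag pair ends up. At the edge of a line the browser drops it, while in the middle of a sentence it separates the tagged text from its neighbours.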