Confusing language in specification of strtol, et al

The specification for strtol conceptually divides the input string into "initial whitespace", a "subject sequence", and a "final string", and defines the "subject sequence" as:

the longest initial subsequence of the input string, starting with the first non-white-space character that is of the expected form. The subject sequence shall contain no characters if the input string is empty or consists entirely of white-space characters, or if the first non-white-space character is other than a sign or a permissible letter or digit.

At one time I thought the "longest initial subsequence" business was akin to the way scanf works, where "0x@" would scan as "0x", a failed match, followed by "@" as the next unread character. However, after some discussion, I'm mostly convinced that strtol processes the longest initial subsequence that is of the expected form, not the longest initial string which is the initial subsequence of some possible string of the expected form.

What's still confusing me is this language in the specification:

If the subject sequence is empty or does not have the expected form, no conversion is performed; the value of str is stored in the object pointed to by endptr, provided that endptr is not a null pointer.

If we accept what seems to be the correct definition of "subject sequence", there is no such thing as a non-empty subject sequence that does not have the expected form, and instead (to avoid redundancy and confusion) the text should just read:

If the subject sequence is empty, no conversion is performed; the value of str is stored in the object pointed to by endptr, provided that endptr is not a null pointer.
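To make the "no conversion" case concrete, here is a minimal sketch, assuming a conforming strtol. The helper no_conversion is a made-up name for this illustration; it only checks whether endptr comes back equal to the original pointer, which is what the quoted paragraph requires when nothing is converted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper (not part of any library): reports whether strtol
   performed no conversion, i.e. whether it stored s itself in *endptr. */
static bool no_conversion(const char *s, int base)
{
    char *end;
    (void)strtol(s, &end, base);
    return end == s;
}

/* Each of the first four inputs has an empty subject sequence in base 10. */
static void check_empty_subject_sequences(void)
{
    assert(no_conversion("", 10));        /* empty string */
    assert(no_conversion(" \t\n", 10));   /* white space only */
    assert(no_conversion("  @1", 10));    /* first non-space char is no digit or sign */
    assert(no_conversion("  +", 10));     /* a sign with no digits after it */
    assert(!no_conversion("  12@", 10));  /* "12" is converted, so endptr moves */
}
```

Note that in the no-conversion case endptr is the start of the string, not the position after the leading white space.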

Can anyone clarify these issues for me? Perhaps a link to past discussions or any relevant defect reports would be useful.

I think the C99 language is quite clear:

The subject sequence is defined as the longest initial subsequence of the input string, starting with the first non-white-space character, that is of the expected form.

Given "0x@", "0x@" is not of the expected form; "0x" is not of the expected form; therefore "0" is the longest initial subsequence that is of the expected form.
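A small check of that reading, as a sketch only: it assumes a conforming strtol (some older libc versions are known to get this input wrong), and strtol_consumed is a hypothetical helper name introduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: number of characters strtol consumed from s,
   i.e. the offset of *endptr from the start of the string. */
static size_t strtol_consumed(const char *s, int base, long *value)
{
    char *end;
    long v = strtol(s, &end, base);
    if (value != NULL)
        *value = v;
    return (size_t)(end - s);
}

/* For "0x@" only "0" is of the expected form, in base 16 and in base 0. */
static void check_hex_prefix_without_digits(void)
{
    long v = -1;

    assert(strtol_consumed("0x@", 16, &v) == 1);  /* endptr is left at "x@" */
    assert(v == 0);

    v = -1;
    assert(strtol_consumed("0x@", 0, &v) == 1);   /* "0" read as an octal constant */
    assert(v == 0);
}
```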

I agree that this implies that you cannot have a non-empty subject sequence that isn't of the expected form - unless you interpret the following:

In other than the "C" locale, additional locale-specific subject sequence forms may be accepted.

...as allowing a locale to define other possible forms that the subject sequence might have, that are nonetheless not of "the expected form".

The wording in the final paragraph seems to be just "belt-and-braces".

It might be easier to understand if you started at §7.20.1.4 (The strtol, strtoll, strtoul, and strtoull functions) ¶2 of the C99 standard, instead of ¶4:

¶2 The strtol, strtoll, strtoul, and strtoull functions convert the initial portion of the string pointed to by nptr to long int, long long int, unsigned long int, and unsigned long long int representation, respectively. First, they decompose the input string into three parts: an initial, possibly empty, sequence of white-space characters (as specified by the isspace function), a subject sequence resembling an integer represented in some radix determined by the value of base, and a final string of one or more unrecognized characters, including the terminating null character of the input string. Then, they attempt to convert the subject sequence to an integer, and return the result.

¶3 If the value of base is zero, the expected form of the subject sequence is that of an integer constant as described in 6.4.4.1, optionally preceded by a plus or minus sign, but not including an integer suffix. If the value of base is between 2 and 36 (inclusive), the expected form of the subject sequence is a sequence of letters and digits representing an integer with the radix specified by base, optionally preceded by a plus or minus sign, but not including an integer suffix. The letters from a (or A) through z (or Z) are ascribed the values 10 through 35; only letters and digits whose ascribed values are less than that of base are permitted. If the value of base is 16, the characters 0x or 0X may optionally precede the sequence of letters and digits, following the sign if present.

¶4 The subject sequence is defined as the longest initial subsequence of the input string, ...

In particular, ¶3 clarifies what a subject sequence is.
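The rules in paragraph 3 can be exercised directly. The following is a sketch with made-up inputs, assuming the "C" locale and a conforming strtol; check_expected_forms is a hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>

/* Exercises the paragraph 3 rules quoted above; inputs are illustrative only. */
static void check_expected_forms(void)
{
    char *end;

    /* base 0: the form of an integer constant decides the radix */
    assert(strtol("0x1f", NULL, 0) == 31);   /* hexadecimal constant */
    assert(strtol("017", NULL, 0) == 15);    /* octal constant */
    assert(strtol("-42", NULL, 0) == -42);   /* decimal constant with a sign */

    /* base 16: the 0x prefix is optional */
    assert(strtol("1f", NULL, 16) == 31);
    assert(strtol("0X1F", NULL, 16) == 31);

    /* base 36: the letters a..z (or A..Z) carry the values 10..35 */
    assert(strtol("z", NULL, 36) == 35);
    assert(strtol("Zz", NULL, 36) == 35 * 36 + 35);

    /* only digits whose value is below base are permitted:
       in base 8 the subject sequence of "19" is just "1" */
    assert(strtol("19", &end, 8) == 1);
    assert(*end == '9');

    /* an integer suffix is not part of the subject sequence */
    assert(strtol("10L", &end, 0) == 10);
    assert(*end == 'L');
}
```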

The POSIX spec for strtol seems to be clearer:

These functions shall convert the initial portion of the string pointed to by str to a type long and long long representation, respectively. First, they decompose the input string into three parts:

An initial, possibly empty, sequence of white-space characters (as specified by isspace())

A subject sequence interpreted as an integer represented in some radix determined by the value of base

A final string of one or more unrecognized characters, including the terminating NUL character of the input string.

Then they shall attempt to convert the subject sequence to an integer, and return the result.

But of course, it is not normative and "defers to the ISO C standard".
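The three-part decomposition can be reconstructed from what strtol reports. This is only a sketch: struct strtol_parts and split_for_strtol are hypothetical names, and it assumes the "C" locale so that isspace agrees with what strtol skips.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type for this illustration: the three parts of the input. */
struct strtol_parts {
    size_t white_space_len;    /* initial white-space characters */
    size_t subject_len;        /* characters that were converted; 0 if none */
    const char *final_string;  /* unrecognized rest, up to the terminating NUL */
};

static struct strtol_parts split_for_strtol(const char *s, int base)
{
    struct strtol_parts p;
    char *end;
    size_t ws = 0;

    /* isspace must be given an unsigned char value */
    while (isspace((unsigned char)s[ws]))
        ws++;

    (void)strtol(s, &end, base);

    p.white_space_len = ws;
    /* On "no conversion" strtol stores s in end, so the subject is empty. */
    p.subject_len = (end == s) ? 0 : (size_t)(end - s) - ws;
    p.final_string = s + ws + p.subject_len;
    return p;
}

static void check_split(void)
{
    struct strtol_parts p = split_for_strtol("  -12abc", 10);
    assert(p.white_space_len == 2);
    assert(p.subject_len == 3);                  /* "-12" */
    assert(strcmp(p.final_string, "abc") == 0);

    p = split_for_strtol("   xyz", 10);
    assert(p.white_space_len == 3);
    assert(p.subject_len == 0);                  /* nothing to convert */
    assert(strcmp(p.final_string, "xyz") == 0);
}
```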

I completely agree with your assessment: by definition, every non-empty subject sequence is of the expected form, so the wording of the standard is dubious.

In the case of the floating-point conversion functions, there's another blunder (C99:TC3 section 7.20.1.3, §3):

[...] The subject sequence is defined as the longest initial subsequence of the input string, starting with the first non-white-space character, that is of the expected form. The subject sequence contains no characters if the input string is not of the expected form.

This implies that the whole input string must be of expected form, defeating the purpose of the endptr parameter. One could argue that the expected form for the input string is different from the expected form for the subject sequence, but it's still pretty confusing.
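In practice strtod converts a valid prefix and reports the rest through endptr, which a quick check shows. This is a sketch assuming the "C" locale and a conforming strtod; strtod_consumed is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: number of characters strtod consumed from s. */
static size_t strtod_consumed(const char *s, double *value)
{
    char *end;
    double v = strtod(s, &end);
    if (value != NULL)
        *value = v;
    return (size_t)(end - s);
}

/* The input as a whole is not of the expected form, yet a prefix is converted. */
static void check_strtod_prefix(void)
{
    double v = 0.0;

    assert(strtod_consumed("1.5abc", &v) == 3);  /* endptr is left at "abc" */
    assert(v == 1.5);

    assert(strtod_consumed("1.0e+", &v) == 3);   /* "e+" has no digits, so it is left over */
    assert(v == 1.0);

    assert(strtod_consumed("abc", &v) == 0);     /* no conversion at all */
    assert(v == 0.0);
}
```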

You are also correct that the semantics of the strto*() and *scanf() families of functions are different. If both match, they will always agree on the value and consume the same number of characters (and any libc implementation where they do not is broken, including newlib and glibc last time I checked), but *scanf() additionally fails to match in cases where it would need to backtrack more than one character, as in your examples "0x@" and "1.0e+".
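A rough way to compare the two families on your own libc is sketched below. struct scan_report and compare_on are hypothetical names; %li is used because it corresponds to strtol with base 0. The sscanf result for "0x@" is deliberately not asserted, since that is exactly where implementations are said to differ.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical record of what each function reports for one input. */
struct scan_report {
    int sscanf_result;     /* return value of sscanf(s, "%li", ...) */
    long sscanf_value;     /* stays 0 if sscanf stored nothing */
    long strtol_value;
    long strtol_consumed;  /* offset of endptr from the start of s */
};

static struct scan_report compare_on(const char *s)
{
    struct scan_report r;
    char *end;

    r.sscanf_value = 0;
    r.sscanf_result = sscanf(s, "%li", &r.sscanf_value);
    r.strtol_value = strtol(s, &end, 0);
    r.strtol_consumed = (long)(end - s);
    return r;
}

/* On input that both accept, the two must agree on the value. */
static void check_agreement_on_plain_input(void)
{
    struct scan_report r = compare_on("42 rest");
    assert(r.sscanf_result == 1);
    assert(r.sscanf_value == 42);
    assert(r.strtol_value == 42);
    assert(r.strtol_consumed == 2);
}
```

Calling compare_on("0x@") and printing sscanf_result shows which behavior a given libc has: a matching failure gives 0, while strtol is expected to report the value 0 with one character consumed.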