How to distinguish between two Blank Nodes in RDF?

I am having difficulty understanding a passage from w3.org. The confusing passage may be an error, or I may just be confused.

The following is Section 6.6 of the RDF Concepts Specification,

6.6 Blank Nodes

The blank nodes in an RDF graph are drawn from an infinite set. This set of blank nodes, the set of all RDF URI references and the set of all literals are pairwise disjoint.

Otherwise, this set of blank nodes is arbitrary.

RDF makes no reference to any internal structure of blank nodes. Given two blank nodes, it is possible to determine whether or not they are the same.

So, the thing I'm confused about is: if there is no way to know the "internal structure of blank nodes", how can one tell them apart? Is this a typo?

It is not a typo, and I agree that it is not straightforward to understand. This is also a recurring issue. Blank nodes exist because sometimes there is no way to create a URI to represent a node. This happens all the time in OWL when constructing constraints, for example.

A blank node ID is normally created when the RDF file is parsed, and it must be unique within the parsed graph. So by definition you shouldn't find two blank nodes with the same identifier. One way of distinguishing between two blank nodes is to look at all their incoming and outgoing predicates, plus the subjects and objects at the other end, to see whether the connected sub-graphs are identical. This is hard to implement and can be very expensive to compute for large graphs.
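As a rough illustration of the neighbourhood comparison described above, here is a minimal Python sketch. It assumes triples are plain tuples of strings and that blank nodes are written with a `_:` prefix; the helper names `is_blank` and `neighbourhood` and the sample data are hypothetical. It only inspects one hop around each node, so equal neighbourhoods are a necessary condition for structural equivalence, not a full sub-graph isomorphism check.

```python
from typing import FrozenSet, Iterable, Tuple

Triple = Tuple[str, str, str]


def is_blank(term: str) -> bool:
    """Assumption of this sketch: blank nodes are strings starting with '_:'."""
    return term.startswith("_:")


def neighbourhood(node: str, triples: Iterable[Triple]) -> FrozenSet[Tuple[str, str, str]]:
    """Collect the incoming and outgoing edges of a node.

    Other blank nodes are masked as '_:' because their labels carry no meaning.
    """
    edges = set()
    for s, p, o in triples:
        if s == node:
            edges.add(("out", p, "_:" if is_blank(o) else o))
        if o == node:
            edges.add(("in", p, "_:" if is_blank(s) else s))
    return frozenset(edges)


graph = [
    ("_:a", "foaf:name", '"Alice"'),
    ("_:a", "foaf:knows", "_:b"),
    ("_:b", "foaf:name", '"Bob"'),
    ("_:c", "foaf:name", '"Alice"'),
    ("_:c", "foaf:knows", "_:b"),
]

# _:a and _:c have the same one-hop description; _:a and _:b do not.
same_ac = neighbourhood("_:a", graph) == neighbourhood("_:c", graph)
same_ab = neighbourhood("_:a", graph) == neighbourhood("_:b", graph)
```

Comparing whole connected sub-graphs would mean recursing through the masked blank neighbours, which is where the cost mentioned above comes from.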

This problem has been widely discussed in connection with finding differences between RDF graphs. One very interesting article is one of TimBL's Design Issues, "Delta: an ontology for the distribution of differences between RDF graphs". Also have a look at the "How to diff RDF graphs" wiki page from the W3C.

If you are the data publisher, try to avoid blank nodes if possible. If you need blank nodes, try to come up with a hash function that derives an ID from each blank node's construction, so that two blank nodes with the same graph structure get the same ID and blank nodes with different structures get different IDs. That way you can tell them apart.
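One way such a hash function could look is sketched below in Python. This is a simplified illustration, not a standard algorithm: it assumes triples are tuples of strings with `_:`-prefixed blank nodes, and the function name `structural_ids`, the `rounds` parameter, and the sample data are all hypothetical. Each round hashes a blank node's edges, substituting the previous round's IDs for neighbouring blank nodes, so structure further away gradually influences the result.

```python
import hashlib
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def is_blank(term: str) -> bool:
    """Assumption of this sketch: blank nodes are strings starting with '_:'."""
    return term.startswith("_:")


def structural_ids(triples: List[Triple], rounds: int = 2) -> Dict[str, str]:
    """Derive an ID for every blank node from the edges around it."""
    blanks = {t for s, _, o in triples for t in (s, o) if is_blank(t)}
    ids = {b: "" for b in blanks}  # all blank nodes start out indistinguishable
    for _ in range(rounds):
        new_ids = {}
        for b in blanks:
            parts = []
            for s, p, o in triples:
                if s == b:
                    parts.append("out|" + p + "|" + (ids[o] if is_blank(o) else o))
                if o == b:
                    parts.append("in|" + p + "|" + (ids[s] if is_blank(s) else s))
            # Sort so the ID does not depend on triple order.
            payload = "\n".join(sorted(parts)).encode("utf-8")
            new_ids[b] = hashlib.sha256(payload).hexdigest()[:16]
        ids = new_ids
    return ids


graph = [
    ("ex:alice", "ex:address", "_:x"),
    ("_:x", "ex:city", '"Paris"'),
    ("ex:alice", "ex:address", "_:y"),
    ("_:y", "ex:city", '"Paris"'),
    ("ex:bob", "ex:address", "_:z"),
    ("_:z", "ex:city", '"Paris"'),
]

ids = structural_ids(graph)
# _:x and _:y are structurally identical, so they share an ID;
# _:z hangs off ex:bob instead of ex:alice, so its ID differs.
```

Note that nodes sharing an ID are only indistinguishable up to the number of rounds used, and a fixed number of rounds cannot separate every pair of non-equivalent nodes in highly symmetric graphs.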

Note that RDF 1.1, standardised in February 2014, slightly edits this text:

Blank nodes are disjoint from IRIs and literals. Otherwise, the set of possible blank nodes is arbitrary. RDF makes no reference to any internal structure of blank nodes.

and adds a note about blank node identifiers:

Note: Blank node identifiers are local identifiers that are used in some concrete RDF syntaxes or RDF store implementations. They are always locally scoped to the file or RDF store, and are not persistent or portable identifiers for blank nodes. Blank node identifiers are not part of the RDF abstract syntax, but are entirely dependent on the concrete syntax or implementation. The syntactic restrictions on blank node identifiers, if any, therefore also depend on the concrete RDF syntax or implementation. Implementations that handle blank node identifiers in concrete syntaxes need to be careful not to create the same blank node from multiple occurrences of the same blank node identifier except in situations where this is supported by the syntax.
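To make the scoping rule in that note concrete, here is a small Python sketch of how a store might load two documents that both happen to use the label `_:b0`. The function `load_document`, the `_:n` naming scheme, and the sample documents are assumptions made for this illustration, not part of any particular RDF library.

```python
import itertools
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Store-wide counter used to mint fresh blank node labels.
_counter = itertools.count()


def load_document(triples: List[Triple]) -> List[Triple]:
    """Relabel blank nodes so that labels are scoped to this one document."""
    mapping: Dict[str, str] = {}  # a new mapping per document is the key point

    def fresh(term: str) -> str:
        if not term.startswith("_:"):
            return term
        if term not in mapping:
            mapping[term] = f"_:n{next(_counter)}"
        return mapping[term]

    return [(fresh(s), p, fresh(o)) for s, p, o in triples]


doc1 = [("_:b0", "foaf:name", '"Alice"'), ("_:b0", "foaf:age", '"30"')]
doc2 = [("_:b0", "foaf:name", '"Bob"')]

store = load_document(doc1) + load_document(doc2)
# Both occurrences of _:b0 inside doc1 map to one node,
# while _:b0 in doc2 becomes a different node.
```

Merging the two lists without relabelling would wrongly state that a single node is named both "Alice" and "Bob".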

RDF 1.1 also adds a new section that recommends a skolemisation scheme for blank node management, in which blank nodes are replaced with freshly minted IRIs.
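A minimal Python sketch of skolemisation follows. It replaces every blank node with a freshly generated IRI under the `/.well-known/genid/` path that RDF 1.1 suggests for Skolem IRIs. The function name `skolemize`, the `https://example.org` authority, and the use of random UUIDs as the unique part are assumptions of this example.

```python
import uuid
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def skolemize(triples: List[Triple], authority: str = "https://example.org") -> List[Triple]:
    """Replace each blank node with a globally unique Skolem IRI."""
    mapping: Dict[str, str] = {}

    def sk(term: str) -> str:
        if not term.startswith("_:"):
            return term
        if term not in mapping:
            # Placeholder authority; a publisher would use a domain it controls.
            mapping[term] = f"<{authority}/.well-known/genid/{uuid.uuid4().hex}>"
        return mapping[term]

    return [(sk(s), p, sk(o)) for s, p, o in triples]


graph = [
    ("_:a", "foaf:name", '"Alice"'),
    ("_:a", "foaf:knows", "_:b"),
    ("_:b", "foaf:name", '"Bob"'),
]

skolemized = skolemize(graph)
```

After this step the nodes can be compared by simple IRI equality, at the cost of the IRIs being arbitrary: skolemising the same source twice yields different IRIs unless the mapping is kept.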

In any case, you say that:

there is no way to know the "internal structure of blank nodes"

but this is not what the spec says. The spec simply says that it does not define such a way, which means that it is the responsibility of the implementers to decide how they want to internally represent and identify blank nodes. But I agree that the wording of the 2004 spec is confusing.

There is an algorithm discussed in this draft W3C Community Group report:

RDF Dataset Normalization

A Standard RDF Dataset Normalization Algorithm

...

This document outlines an algorithm for generating a normalized RDF dataset given an RDF dataset as input. The algorithm is called the Universal RDF Dataset Normalization Algorithm 2015 or URDNA2015.

...

This specification defines an algorithm for creating stable blank node identifiers repeatably for different serializations possibly using individualized blank node identifiers of the same RDF graph (dataset) by grounding each blank node through the nodes to which it is connected, essentially creating Skolem blank node identifiers. As a result, a graph signature can be obtained by hashing a canonical serialization of the resulting normalized dataset, allowing for the isomorphism and digital signing use cases. As blank node identifiers can be stable even with other changes to a graph (dataset), in some cases it is possible to compute the difference between two graphs (datasets), for example if changes are made only to ground triples, or if new blank nodes are introduced which do not create an automorphic confusion with other existing blank nodes.

-- https://json-ld.github.io/normalization/spec/