SPARQL query to select/construct the latest revision from RDF data

I have an RDF file that's used to track item revisions. Using this data I can trace back the changes made to an item through its lifetime. Once a specific detail has changed, the corresponding data is recorded as a new revision. Have a look:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix mymeta: <http://www.mymeta.com/meta/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .

<urn:ITEMID:12345> rdf:type mymeta:item .
<urn:ITEMID:12345> mymeta:itemchange <urn:ITEMID:12345:REV-1> .
<urn:ITEMID:12345:REV-1> dc:title "Product original name"@en .
<urn:ITEMID:12345:REV-1> dc:issued "2006-12-01"@en .
<urn:ITEMID:12345:REV-1> dc:format "4 x 6 x 1 in"@en .
<urn:ITEMID:12345:REV-1> dc:extent "200"@en .

<urn:ITEMID:12345> rdf:type mymeta:item .
<urn:ITEMID:12345> mymeta:itemchange <urn:ITEMID:12345:REV-2> .
<urn:ITEMID:12345:REV-2> dc:title "Improved Product Name"@en .
<urn:ITEMID:12345:REV-2> dc:issued "2007-06-01"@en .

According to this data, there was an item revision on "2007-06-01" where only the item name was changed to "Improved Product Name". As you can see, "dc:format" and "dc:extent" are missing from the latest data revision. This is on purpose to avoid millions of duplicate records!

I can write a SPARQL query that shows me the latest product revision information (REV-2: dc:title and dc:issued), but it's missing "dc:format" and "dc:extent", which I want carried over from the previous revision (REV-1).

How can I write a SPARQL query to do this? Any help much appreciated!

Not sure you can do this in one query. I'll think more on it if I can, but the following two queries might get you started in the right direction:

1) Find the changes that don't have a format

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mymeta: <http://www.mymeta.com/meta/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

DESCRIBE ?change
WHERE 
{
    ?item a mymeta:item;
             mymeta:itemchange ?change.
    ?change ?p ?o.
    OPTIONAL 
    {
        ?change dc:format ?format .
    }
    FILTER (!bound(?format)) 
}

2) I think this will find the most recent change that does have a format

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mymeta: <http://www.mymeta.com/meta/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT DISTINCT ?format
WHERE {
    ?item a mymeta:item;
             mymeta:itemchange ?change.
    ?change  dc:format ?format;
                  dc:issued ?issued.
    # Look for a more recent change of the same item that also has a format.
    OPTIONAL {
        ?item mymeta:itemchange ?moreRecentChange.
        ?moreRecentChange dc:format ?moreRecentFormat;
                          dc:issued ?moreRecentIssued.
        # The dates are language-tagged literals, so compare their lexical forms.
        FILTER (str(?moreRecentIssued) > str(?issued))
    }
    # Keep only the change for which no more recent one was found.
    FILTER (!bound(?moreRecentIssued))
}

With some more work it should be possible to limit the ?format of (2) to come from those changes with an issue date before the issue date of a result from (1). So for each row from (1) you'd execute (2) to find the format value to use. You might have better results, though, if you used a rule-based reasoning engine rather than SPARQL. I'd recommend EulerSharp or Pellet.
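As a rough sketch of how the two steps might be folded into one query, assuming a SPARQL 1.1 engine (for FILTER NOT EXISTS) and assuming the dc:issued values are always ISO-style dates, so that comparing their lexical forms with str() matches chronological order. It keeps a property value only when no later revision of the same item sets that property again:

```sparql
PREFIX mymeta: <http://www.mymeta.com/meta/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?item ?property ?value
WHERE {
    ?item a mymeta:item ;
          mymeta:itemchange ?change .
    ?change ?property ?value ;
            dc:issued ?issued .
    # Drop the row if a later revision of the same item
    # also carries this property.
    FILTER NOT EXISTS {
        ?item mymeta:itemchange ?laterChange .
        ?laterChange ?property ?laterValue ;
                     dc:issued ?laterIssued .
        # Assumption: ISO date strings, so string order is date order.
        FILTER (str(?laterIssued) > str(?issued))
    }
}
```

Each result row is one property of one item, so a property that was never repeated in a later revision (such as dc:format here) is carried over from the revision that last set it.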

For a single item, this is a pretty straightforward query using SPARQL 1.1's subqueries. The trick is to order the revisions that have a given property by their date, and take the value from the latest revision. The values form is just used to specify the items that you're selecting. If you need to query for more items, you can add them in the values block.

prefix mymeta: <http://www.mymeta.com/meta/> 
prefix dc: <http://purl.org/dc/elements/1.1/> 

select ?item ?title ?format ?extent where {
  values ?item { <urn:ITEMID:12345> }

  #-- Get the title by examining all the revisions that specify a title, 
  #-- ordering them by date, and taking the latest one.  The same approach
  #-- is used for the format and extent.
  { select ?title { ?item mymeta:itemchange [ dc:title ?title ; dc:issued ?date ] . }
    order by desc(?date) limit 1 }

  { select ?format { ?item mymeta:itemchange [ dc:format ?format ; dc:issued ?date ] . }
    order by desc(?date) limit 1 }

  { select ?extent { ?item mymeta:itemchange [ dc:extent ?extent ; dc:issued ?date ] . }
    order by desc(?date) limit 1 }
}

$ sparql --data data.n3  --query query.rq
----------------------------------------------------------------------------------
| item               | title                      | format            | extent   |
==================================================================================
| <urn:ITEMID:12345> | "Improved Product Name"@en | "4 x 6 x 1 in"@en | "200"@en |
----------------------------------------------------------------------------------

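If you actually need to do this for all the items, you can use another subquery to select the items. That is, instead of values ?item { ... }, use:

{ select ?item { ?item a mymeta:item } }

One caveat: the three limit 1 subqueries above do not project ?item, and a subquery only shares its projected variables with the outer query, so each one returns the single latest value in the whole dataset. That is fine while the data describes one item. With several items, each subquery would need to project ?item and pick the latest date per item (for example with group by and max) instead of using order by with limit 1.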
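A sketch of a multi-item variant that keeps the one-row-per-item shape. It is an alternative formulation rather than the query above: each lookup is tied to ?item directly and uses filter not exists in place of order by and limit 1. It assumes ISO-style date strings, so that str() comparison follows date order, and it wraps each lookup in optional so that an item lacking one of the properties still appears:

```sparql
prefix mymeta: <http://www.mymeta.com/meta/>
prefix dc: <http://purl.org/dc/elements/1.1/>

select ?item ?title ?format ?extent where {
  ?item a mymeta:item .

  #-- Latest title: a revision with a title such that no later
  #-- revision of the same item also has a title.
  optional {
    ?item mymeta:itemchange [ dc:title ?title ; dc:issued ?titleDate ] .
    filter not exists {
      ?item mymeta:itemchange [ dc:title [] ; dc:issued ?later ] .
      filter ( str(?later) > str(?titleDate) )
    }
  }

  #-- Same pattern for the format.
  optional {
    ?item mymeta:itemchange [ dc:format ?format ; dc:issued ?formatDate ] .
    filter not exists {
      ?item mymeta:itemchange [ dc:format [] ; dc:issued ?later ] .
      filter ( str(?later) > str(?formatDate) )
    }
  }

  #-- Same pattern for the extent.
  optional {
    ?item mymeta:itemchange [ dc:extent ?extent ; dc:issued ?extentDate ] .
    filter not exists {
      ?item mymeta:itemchange [ dc:extent [] ; dc:issued ?later ] .
      filter ( str(?later) > str(?extentDate) )
    }
  }
}
```

If two revisions of an item share the same date and both set a property, this returns a row for each of them, so unique revision dates per item are assumed as well.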

Though it wasn't mentioned in the original question, it has come up in the comments: if you're interested in getting the most recent values for all the properties, you can use a subquery as in the following, which is based on "How to limit SPARQL solution group size?".

select ?item ?property ?value {
  values ?item { <urn:ITEMID:12345> }

  ?item mymeta:itemchange [ ?property ?value ; dc:issued ?date ]

  #-- This subquery finds the latest date for each property in
  #-- the graph for each item.  Then, outside the subquery, we 
  #-- retrieve the particular value associated with that date.  
  #-- ?item is projected so that the dates are joined per item.
  {
    select ?item ?property (max(?date_) as ?date) {
      ?item mymeta:itemchange [ ?property [] ; dc:issued ?date_ ]
    }
    group by ?item ?property
  }
}

---------------------------------------------------------------
| item               | property  | value                      |
===============================================================
| <urn:ITEMID:12345> | dc:issued | "2007-06-01"@en            |
| <urn:ITEMID:12345> | dc:title  | "Improved Product Name"@en |
| <urn:ITEMID:12345> | dc:extent | "200"@en                   |
| <urn:ITEMID:12345> | dc:format | "4 x 6 x 1 in"@en          |
---------------------------------------------------------------

I have implemented this using RDF quads, storing each revision inside a separate named graph and using a well-known named graph to track the latest revision for each item, along with all of the revisions.
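A minimal sketch of how such a layout might be queried. The index graph name <urn:revisions:index> and the property mymeta:latestRevision are hypothetical names chosen for illustration, not part of the implementation described here, and the sketch assumes each revision graph holds the item's complete state:

```sparql
PREFIX mymeta: <http://www.mymeta.com/meta/>

SELECT ?item ?property ?value
WHERE {
    # Hypothetical index graph: one mymeta:latestRevision triple per item,
    # whose object is the name of the graph holding that revision.
    GRAPH <urn:revisions:index> {
        ?item mymeta:latestRevision ?revisionGraph .
    }
    # Read the item's triples from that revision's named graph only.
    GRAPH ?revisionGraph {
        ?item ?property ?value .
    }
}
```

With this arrangement the query never has to compare dates: finding the current state is a lookup in the index graph followed by a read of one named graph.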

The theory of your patch algorithm is currently flawed: you don't have a method for identifying the latest revision, and you cannot easily trace back through the revisions to find the last time a triple occurred. In addition, if you always fall back to previous revisions whenever a triple is missing from the most recent one, how do you know whether a triple was legitimately removed in a revision?

An RDF database should be able to limit the amount of duplication by storing each literal and URI only once and using pointers to construct triples or quads. You may be able to make it work in the naive case where everything is stored for each revision that you are keeping.