
How to select random DBPedia nodes from SPARQL?

Developer https://www.devze.com 2023-02-24 21:47 Source: Internet

How can I select a random sample from DBpedia using the SPARQL endpoint?

This query

SELECT ?s WHERE { ?s ?p ?o . FILTER ( 1 > bif:rnd (10, ?s, ?p, ?o) ) } LIMIT 10

(found here) seems to work OK on most SPARQL endpoints, but on http://dbpedia.org/sparql the result gets cached, so it always returns the same 10 nodes.

If I try it from Jena, I get the following exception:

Unresolved prefixed name: bif:rnd

And I can't find what the 'bif' namespace is.

Any idea on how to solve this?

Mulone


In SPARQL 1.1 you can do:

SELECT ?s
WHERE {
  ?s ?p ?o
}
ORDER BY RAND()
LIMIT 10

I don't know offhand how many stores will optimise this, or even implement it yet, though.

[see comment below, this doesn't quite work]

An alternative is:

SELECT (SAMPLE(?s) AS ?ss)
WHERE { ?s ?p ?o }
GROUP BY ?s

But I'd think that's even less likely to be optimised.


bif:rnd is not part of the SPARQL standard and is therefore not portable across SPARQL endpoints. You can use ORDER BY, OFFSET and LIMIT to simulate a random sample with a standard query. Something like:

SELECT * WHERE { ?s ?p ?o } 
ORDER BY ?s OFFSET $some_random_number$ LIMIT 10

Here some_random_number is a number generated by your application. This should avoid the caching problem, but the query is still quite expensive and I don't know whether public endpoints will support it.

Try to avoid completely open patterns like ?s ?p ?o and your query will be much more efficient.
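
As a rough sketch of the application side of the OFFSET approach, the snippet below fills a random offset into the query shown above. The function name random_offset_query and the max_offset parameter are made up for illustration; max_offset is assumed to be an upper bound you already know (for example from a separate COUNT query). Note that the result is 10 consecutive rows in ?s order starting at a random position, not 10 independently drawn rows.

```python
import random

# The standard-SPARQL query from above, with placeholders for OFFSET and LIMIT.
QUERY_TEMPLATE = """SELECT * WHERE {{ ?s ?p ?o }}
ORDER BY ?s OFFSET {offset} LIMIT {limit}"""


def random_offset_query(max_offset, limit=10, rng=random):
    """Return the query with a randomly chosen OFFSET in [0, max_offset].

    max_offset is an assumption: the caller has to supply a sensible
    upper bound, e.g. the number of matches minus the limit.
    """
    if max_offset < 0:
        raise ValueError("max_offset must be non-negative")
    offset = rng.randint(0, max_offset)  # inclusive on both ends
    return QUERY_TEMPLATE.format(offset=offset, limit=limit)


if __name__ == "__main__":
    print(random_offset_query(100000))
```

Because the offset changes the query text on each call, a cache keyed on the query string is unlikely to return a stale result.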


bif:rnd is a Virtuoso-specific extension and will thus only work against Virtuoso SPARQL endpoints.

bif is the prefix for Virtuoso Built-In Functions, which allows any Virtuoso function to be called from SPARQL; rnd is the Virtuoso function that returns random numbers.


I encountered the same problem and none of the solutions here addressed my issue. Here is my solution; it was non-trivial and quite a hack. This works for DBPedia as of now, and may work for other SPARQL endpoints, but it is not guaranteed to work for future releases.

DBPedia uses Virtuoso, which supports an undocumented argument to the RAND function; the argument effectively specifies the range to use for the PRNG. The game is to trick Virtuoso into believing that the input argument cannot be statically-evaluated before each result row is computed, forcing the program to evaluate RAND() for every binding:

select * {
    ?s dbo:isPartOf ?o .  # Whatever your pattern is
    bind(rand(1 + strlen(str(?s))*0) as ?rid)
} order by ?rid

The magic happens in rand(1 + strlen(str(?s))*0), which generates the equivalent of rand() but forces it to run on every match, exploiting the fact that the program cannot predict the value of an expression that involves a variable (in this case, we just compute the length of the IRI as a string). The actual expression is not important, since we multiply it by 0 to ignore it completely, then add 1 to make rand execute normally.

This only works because the developers did not go this far in their static-code-evaluation of expressions. They could have easily written a branch for "multiply by zero", but alas they did not :)


None of the above methods works with Jena/Fuseki, so I've done it in another way:

SELECT DISTINCT ?s ?p ?o
{
  ?s ?p ?o.
  BIND ( MD5 ( STR ( ?s ) ) AS ?rnd)  # MD5 expects a string, so convert the IRI with STR first
}
ORDER BY ?rnd ?p
LIMIT 100

Obviously this doesn't select truly random triples, and on unchanged data it returns the same result on every run. But the first k subjects in MD5 order should behave like a representative sample of the entire population, i.e. one with no particular selection bias.
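
To illustrate the idea outside of SPARQL, here is a small Python sketch of the same hash-ordering trick. The function name md5_ordered_sample and the example.org IRIs are hypothetical and only serve the illustration; this is not what Jena does internally.

```python
import hashlib


def md5_ordered_sample(subjects, k):
    """Return the k distinct subjects whose MD5 hex digests sort first."""
    def key(s):
        # Same role as MD5(STR(?s)) in the query: hash the IRI as a string.
        return hashlib.md5(s.encode("utf-8")).hexdigest()

    return sorted(set(subjects), key=key)[:k]


# Hypothetical IRIs, standing in for the ?s bindings of a real dataset.
iris = [f"http://example.org/item/{i}" for i in range(1000)]
first = md5_ordered_sample(iris, 5)
second = md5_ordered_sample(iris, 5)

if __name__ == "__main__":
    print(first)
    print(first == second)
```

The two calls return the same five IRIs, which is the trade-off of this approach: the sample is spread across the data by the hash, but it is fixed for a given dataset.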


SELECT ?s WHERE { 
    ?s ?p ?o . 
    bind(<SHORT_OR_LONG::bif:rnd> (10, ?s, ?p, ?o) as ?rid)
}
ORDER BY ?rid
LIMIT 10

How about this one?

<SHORT_OR_LONG::bif:rnd> may be better than <bif:rnd>. (http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtTipsAndTricksGuideRandomSampleAllTriples)

You simply bind a random id (?rid) to each binding row (?s ?p ?o), then order the results by that random id.


After much experimentation I have ended up with the following solution, a combination of using a hash to avoid RAND() being statically-evaluated and RAND() to avoid the selection biases caused by only using a hash.

SELECT ?s WHERE {
  ?s ?p ?o .
  BIND(SHA512(CONCAT(STR(RAND()), STR(?s))) AS ?random) .
} ORDER BY ?random
LIMIT 1
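
A loose Python analogue of this ordering, in case it helps to see the idea outside of SPARQL: each item gets a fresh random number that is concatenated with the item and hashed, and the items are sorted by that hash. The function name salted_hash_order and the rng parameter are made up for this sketch; it mimics the query's logic and says nothing about how any endpoint evaluates RAND().

```python
import hashlib
import random


def salted_hash_order(subjects, rng=random):
    """Order subjects by SHA-512 of a fresh random number plus the subject."""
    def key(s):
        salt = str(rng.random())  # plays the role of STR(RAND())
        return hashlib.sha512((salt + s).encode("utf-8")).hexdigest()

    # sorted() calls key once per element, so every subject gets its own salt.
    return sorted(subjects, key=key)


if __name__ == "__main__":
    items = [f"http://example.org/item/{i}" for i in range(10)]  # hypothetical IRIs
    print(salted_hash_order(items)[0])  # corresponds to LIMIT 1
```
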

The same approach, used here to select a random valley glacier from Wikidata:

SELECT ?item ?itemLabel ?random WHERE {
  ?item wdt:P31 wd:Q11762356 .
  BIND(SHA512(CONCAT(STR(RAND()), STR(?item))) AS ?random) .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en" . }
} ORDER BY ?random
LIMIT 1

The service caches responses; you can bypass the cache by adding a new comment to the query before running it.
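
If you run the query from code, the comment trick can be automated. Here is a minimal sketch, assuming the cache is keyed on the query text (which is what the comment workaround suggests, but I have not verified it); the helper name with_cache_buster is made up.

```python
import uuid


def with_cache_buster(query):
    """Append a unique SPARQL comment so the query text differs on every call.

    SPARQL comments start with '#' and run to the end of the line, so the
    added line does not change what the query means.
    """
    return f"{query.rstrip()}\n# cache-buster {uuid.uuid4().hex}\n"


if __name__ == "__main__":
    print(with_cache_buster("SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"))
```
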
