Exclude results from DBpedia SPARQL query based on URI prefix

It might seem a little awkward, but your comment about casting to a string and doing some string-based checks is probably on the right track. You can do it a little bit more efficiently using the SPARQL 1.1 function strstarts:

SELECT DISTINCT ?concept
WHERE {
    ?x a ?concept
    FILTER ( !strstarts(str(?concept), "http://dbpedia.org/class/yago/") )
}
LIMIT 100
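
To make the filter’s behaviour concrete without hitting the endpoint, here is a minimal Python sketch of the same prefix test applied to plain strings. The helper name exclude_by_prefix and the sample IRIs are assumptions made up for illustration; they are not output from the query above.

```python
YAGO_PREFIX = "http://dbpedia.org/class/yago/"


def exclude_by_prefix(iris, prefix=YAGO_PREFIX):
    """Keep only the IRIs that do not start with the given prefix."""
    # Mirrors FILTER(!strstarts(str(?concept), prefix)) on plain strings.
    return [iri for iri in iris if not iri.startswith(prefix)]


# Illustrative sample only, not real query results.
sample = [
    "http://dbpedia.org/ontology/Person",
    "http://dbpedia.org/class/yago/Thing104424418",
    "http://www.w3.org/2002/07/owl#Thing",
]
kept = exclude_by_prefix(sample)
```

In this sketch only the YAGO IRI is dropped; the other two pass through unchanged.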


The other alternative would be to find a top-level YAGO class and to exclude those concepts that are rdfs:subClassOf that top-level class. This would probably be a better solution in the long run, since it doesn’t require casting to strings and it’s based on graph structure. Unfortunately, it doesn’t look like there is a single top-level YAGO class comparable to owl:Thing. I just downloaded the YAGO type hierarchy from DBpedia’s download page and ran this query, which asks for classes with no superclasses, against it:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select distinct ?root where {
  [] rdfs:subClassOf ?root 
  filter not exists { ?root rdfs:subClassOf ?superRoot }
}

and I got these nine results:

----------------------------------------------------------------
| root                                                         |
================================================================
| <http://dbpedia.org/class/yago/YagoLegalActorGeo>            |
| <http://dbpedia.org/class/yago/WaterNymph109550125>          |
| <http://dbpedia.org/class/yago/PhysicalEntity100001930>      |
| <http://dbpedia.org/class/yago/Abstraction100002137>         |
| <http://dbpedia.org/class/yago/YagoIdentifier>               |
| <http://dbpedia.org/class/yago/YagoLiteral>                  |
| <http://dbpedia.org/class/yago/YagoPermanentlyLocatedEntity> |
| <http://dbpedia.org/class/yago/Thing104424418>               |
| <http://dbpedia.org/class/yago/Dryad109551040>               |
----------------------------------------------------------------
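
The logic of that root-finding query can be sketched in a few lines of Python: a root is any class that appears as a superclass somewhere but never as a subclass. The function name find_roots and the toy hierarchy below are hypothetical and stand in for the real YAGO data.

```python
def find_roots(subclass_edges):
    """Return classes that occur as a superclass but never as a subclass."""
    subs = {sub for sub, _ in subclass_edges}
    supers = {sup for _, sup in subclass_edges}
    # Same idea as: [] rdfs:subClassOf ?root
    #               filter not exists { ?root rdfs:subClassOf ?superRoot }
    return supers - subs


# Toy (subclass, superclass) pairs, invented for illustration.
edges = [
    ("Dog", "Mammal"),
    ("Cat", "Mammal"),
    ("Mammal", "Animal"),
    ("Oak", "Plant"),
]
roots = find_roots(edges)
```

On this toy data the roots are Animal and Plant, since neither has a superclass of its own.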

Given that the YAGO concepts aren’t quite as structured as some of the others, it looks like the string-based approach may be the best in this case. However, if you wanted to, you could write a non-string-based query like this, which asks for 100 concepts, excluding those which have one of those nine results as a superclass:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix yago: <http://dbpedia.org/class/yago/>

select distinct ?concept where {
  [] a ?concept .
  filter not exists {
    ?concept rdfs:subClassOf* ?super .
    values ?super { 
      yago:YagoLegalActorGeo
      yago:WaterNymph109550125
      yago:PhysicalEntity100001930
      yago:Abstraction100002137
      yago:YagoIdentifier
      yago:YagoLiteral
      yago:YagoPermanentlyLocatedEntity
      yago:Thing104424418
      yago:Dryad109551040
    }
  }
}
limit 100
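
As a rough illustration of what the rdfs:subClassOf* test does, the following Python sketch walks superclass links from each concept and drops it if it reaches an excluded class. Like the * path, the walk includes the concept itself (a zero-length path). The names has_excluded_ancestor, parents, and the toy classes are assumptions for illustration only, not part of DBpedia.

```python
def has_excluded_ancestor(concept, parents, excluded):
    """True if concept, or any class reachable via superclass links, is excluded."""
    stack, seen = [concept], set()
    while stack:
        current = stack.pop()
        if current in excluded:
            return True
        if current in seen:
            continue  # guard against cycles in the hierarchy
        seen.add(current)
        stack.extend(parents.get(current, ()))
    return False


# Toy hierarchy: class -> list of direct superclasses (invented data).
parents = {"Dog": ["Mammal"], "Mammal": ["Animal"], "Oak": ["Plant"]}
excluded = {"Animal"}
concepts = ["Dog", "Oak", "Animal", "Rock"]
kept = [c for c in concepts if not has_excluded_ancestor(c, parents, excluded)]
```

Here Dog and Animal are filtered out, while Oak and Rock are kept because no superclass chain from them reaches an excluded class.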


I’m not sure which ends up being faster. The first requires a conversion to string, and strstarts, if implemented in a naïve fashion, has to consume as much as http://dbpedia.org/class/ in each concept before it hits a mismatch. The second requires nine comparisons that, if IRIs are interned, are just object identity checks. It’s an interesting question for further investigation.
