How do I limit the number of results for a specific variable in a SPARQL query? - query-optimization

Let's say I have a SPARQL query like this, looking for resources that have some shared property with a focal resource, and also getting some other statements about the focal resource:
CONSTRUCT {
  ?focal pred:icate ?shared .
  ?other pred:icate ?shared .
}
WHERE {
  ?focal pred:icate ?shared ;
         more:info ?etc ;
         a "foobar" .
  ?other pred:icate ?shared .
}
LIMIT 500
If there are more than 500 other resources, that LIMIT might exclude that more:info statement and object. So, is there a way to say "I only want at most 500 of ?other", or do I have to break this query into multiple pieces?

You can use LIMIT in subqueries, i.e. something like the following:
CONSTRUCT {
  ?focal pred:icate ?shared .
  ?other pred:icate ?shared .
}
WHERE {
  ?focal pred:icate ?shared ;
         more:info ?etc ;
         a "foobar" .
  {
    SELECT ?shared {
      ?other pred:icate ?shared .
    }
    LIMIT 500
  }
}

http://www.w3.org/TR/2012/WD-sparql11-query-20120105/#modResultLimit
The LIMIT clause puts an upper bound on the number of solutions
returned. If the number of actual solutions, after OFFSET is applied,
is greater than the limit, then at most the limit number of solutions
will be returned.
You can only limit the number of solutions to your query, not a specific subset of it. You can use a subquery with a LIMIT clause though: http://www.w3.org/TR/sparql-features/#Subqueries.

Related

SPARQL how to use concatenated strings for a subject of a subquery?

I created a SPARQL query like the one below. I get ?year and ?month from a SERVICE query to Wikidata, and would like to use them as part of the subject of a subquery (i.e. the ?uri ?p ?o part). I managed to concatenate ?year and ?month and generate a URI, but somehow the query does not return a result.
I tested both the subquery part (e.g. only using <https://example.com/date/10-1> ?p ?o) and the SERVICE query individually. They both return results properly (13 and 5 results respectively, so the size is not an issue). My guess is that the concatenated variable is a string, not a URI, and so cannot be used as a subject. But I am not sure. As I am not sure what is wrong, I tried similar queries, but they time out, I think due to the subquery. Can you spot the problem and let me know how to fix it? Many thanks in advance!
SELECT DISTINCT ?event ?eventLabel ?d1 ?d2 ?d3 ?date ?year ?month ?uri ?p ?o
WHERE {
  SERVICE <https://query.wikidata.org/sparql> {
    select DISTINCT ?event ?eventLabel ?d1 ?d2 ?d3 ?date ?year ?month
    where {
      ?event wdt:P31/wdt:P279* wd:Q13418847 .
      ?event wdt:P276 wd:Q1741 .
      OPTIONAL { ?event wdt:P580 ?d1 }
      OPTIONAL { ?event wdt:P585 ?d2 }
      OPTIONAL { ?event wdt:P582 ?d3 }
      BIND(IF(!BOUND(?d1), IF(!BOUND(?d2), ?d3, ?d2), ?d1) AS ?date)
      BIND(year(?date) AS ?year)
      BIND(month(?date) AS ?month)
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en". }
    }
    ORDER BY ?date
    LIMIT 100
  }
  BIND(CONCAT("<https://example.com/date/", str(?year), "-", str(?month), ">") AS ?uri) .
  {
    SELECT ?uri ?p ?o
    WHERE { ?uri ?p ?o . }
    LIMIT 10
  }
}
You can use the IRI function to turn the string (without the < and > delimiters) into a URI:
BIND(IRI(CONCAT("https://example.com/date/", str(?year), "-", str(?month))) AS ?uri) .

Perl performance is slow, file I/O issue or due to while loop

I have the following code in my while loop and it is significantly slow. Any suggestions on how to improve this?
open IN, "<$FileDir/$file" || Err( "Failed to open $file at location: $FileDir" );
my $linenum = 0;
while ( $line = <IN> ) {
if ( $linenum == 0 ) {
Log(" This is header line : $line");
$linenum++;
} else {
$linenum++;
my $csv = Text::CSV_XS->new();
my $status = $csv->parse($line);
my #val = $csv->fields();
$index = 0;
Log("number of parameters for this file is: $sth->{NUM_OF_PARAMS}");
for ( $index = 0; $index <= $#val; $index++ ) {
if ( $index < $sth->{NUM_OF_PARAMS} ) {
$sth->bind_param( $index + 1, $val[$index] );
}
}
if ( $sth->execute() ) {
$ifa_dbh->commit();
} else {
Log("line $linenum insert failed");
$ifa_dbh->rollback();
exit(1);
}
}
}
By far the most expensive operation there is accessing the database server; it's a network trip, hundreds of milliseconds or some such, each time.
Are those DB operations inserts, as they appear? If so, then instead of inserting row by row, construct a string for one insert statement with multiple rows (in principle as many as there are) in that loop, and then run that single statement in one transaction.
Test and scale down as needed if that adds up to too many rows: keep adding rows to the string for the insert statement up to a decided maximum number, insert that batch, then keep going.†
A few more readily seen inefficiencies:
Don't construct an object every time through the loop. Build it once before the loop, and then use/repopulate it as needed inside the loop. Then there is no need for parse+fields here either, and getline is also a bit faster.
There's no need for that if statement on every read. First read one line of data, and that's your header. Then enter the loop, without the if.
Altogether, without placeholders, which may not be needed now, something like:
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });

# There's a $table earlier, with its @fields to populate
my $qry = "INSERT into $table (" . join(',', @fields) . ") VALUES ";

open my $IN, '<', "$FileDir/$file"
    or Err( "Failed to open $file at location: $FileDir" );

my $header_arrayref = $csv->getline($IN);
Log( "This is header line : @$header_arrayref" );

my @sql_values;
while ( my $row = $csv->getline($IN) ) {
    # Use as many elements in the row (@$row) as there are @fields
    push @sql_values, '(' .
        join(',', map { $dbh->quote($_) } @$row[0..$#fields]) . ')';
    # May want to do more to sanitize input further
}
$qry .= join ', ', @sql_values;

# Now $qry is ready. It is
# INSERT into table_name (f1,f2,...) VALUES (v11,v12...), (v21,v22...),...

$dbh->do($qry) or die $DBI::errstr;
I've also corrected the error handling when opening the file: the || in the question binds too tightly, so there's effectively open IN, ( "<$FileDir/$file" || Err(...) ), and since the filename string is always true, Err() can never run. We need the low-precedence or instead of || there. Then, the three-argument open is better. See perlopentut.
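To make the precedence issue concrete, a minimal sketch contrasting how the original line parses with the corrected form:

# || binds tighter than the comma, so the original
#     open IN, "<$FileDir/$file" || Err( "..." );
# parses as
#     open IN, ( "<$FileDir/$file" || Err( "..." ) );
# and Err() can never run, because the filename string is always true.
# The low-precedence 'or', applied after the whole open call, gives the intended check:
open my $IN, '<', "$FileDir/$file"
    or Err( "Failed to open $file at location: $FileDir" );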
If you do need the placeholders, perhaps because you can't have a single insert and it must be broken into many, or for security reasons, then you need to generate the exact ?-tuples for each row to be inserted, and later supply the right number of values for them.
You can assemble the data first and then build the ?-tuples based on it:
my $qry = "INSERT into $table (", join(',', #fields), ") VALUES ";
...
my #data;
while ( my $row = $csv->getline($IN) ) {
push #data, [ #$row[0..$#fields] ];
}
# Append the right number of (?,?...),... with the right number of ? in each
$qry .= join ', ', map { '(' . join(',', ('?')x#$_) . ')' } #data;
# Now $qry is ready to bind and execute
# INSERT into table_name (f1,f2,...) VALUES (?,?,...), (?,?,...), ...
$dbh->do($qry, undef, map { #$_ } #data) or die $DBI::errstr;
This may generate a very large string, which may push the limits of your RDBMS or of some other resource. In that case, break @data into smaller batches. Then prepare the statement with the right number of (?,?,...) row-values for one batch, and execute it in the loop over the batches.‡
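A minimal sketch of such batching, assuming the @data and @fields built above (the batch size of 1,000 is an arbitrary example to tune for your RDBMS; do is used for brevity, though with equal-sized batches you could prepare one statement and execute it per batch):

use List::Util qw(min);

my $batch_size = 1_000;    # example value; tune for your RDBMS
for (my $i = 0; $i <= $#data; $i += $batch_size) {
    my @batch  = @data[ $i .. min($i + $batch_size - 1, $#data) ];
    # one (?,?,...) tuple per row in this batch
    my $tuples = join ', ', map { '(' . join(',', ('?') x @$_) . ')' } @batch;
    my $sql    = "INSERT into $table (" . join(',', @fields) . ") VALUES $tuples";
    $dbh->do($sql, undef, map { @$_ } @batch) or die $DBI::errstr;
}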
Finally, another way altogether is to directly load the data from a file, using the database's own tool for that purpose. This will be far faster than going through DBI, probably even counting the extra step of processing your input CSV into another file that has only the needed data.
Since you don't need all data from your input CSV file, first read and process the file as above and write out a file with only the needed data (@data above). Then there are two possible ways:
Either use an SQL command for this: COPY in PostgreSQL, LOAD DATA [LOCAL] INFILE in MySQL and Oracle, etc.; or
Use a dedicated tool for importing/loading files from your RDBMS: mysqlimport (MySQL), SQL*Loader/sqlldr (Oracle), etc. I'd expect this to be the fastest way.
The second of these options can also be done from within a program, by running the appropriate tool as an external command via system (or, better yet, via suitable libraries), as in the sketch below.
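For instance, a hypothetical sketch of that second option, writing out the reduced data and handing it to mysqlimport via system ($db_name, the file name, and the flags are illustrative only; mysqlimport derives the table name from the file's basename):

use Text::CSV_XS;

my $out_file = "$FileDir/mytable.csv";   # hypothetical: mysqlimport loads this into table "mytable"
my $csv_out  = Text::CSV_XS->new({ binary => 1, eol => "\n" });

open my $OUT, '>', $out_file or die "Can't write $out_file: $!";
$csv_out->print($OUT, $_) for @data;     # @data rows built as above
close $OUT or die "Can't close $out_file: $!";

# flags are examples only; $db_name is assumed to hold the target database name
system('mysqlimport', '--local', '--fields-terminated-by=,', $db_name, $out_file) == 0
    or die "Bulk load failed: $?";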
† In one application I've put together as many as millions of rows in the initial insert -- the string itself for that statement was in the high tens of MB -- and that keeps running with ~100k rows inserted in a single statement daily, for a few years by now. This is PostgreSQL on good servers, and of course YMMV.
‡ Some RDBMSs do not support a multi-row (batch) insert query like the one used here; in particular, Oracle seems not to. (We were informed in the end that that's the database used here.) But there are other ways to do it in Oracle; please see the links in the comments, and search for more. The script will then need to construct a different query, but the principle of operation is the same.

Cypher statement with distinct match conditions is returning the same result

I am using Neo4j as a database to store voting information related to another database object.
I have a Vote object which has fields:
type:String with values of UP or DOWN.
argId:String which is a string ID value linking to a unique argument object
I am trying to query the number of votes assigned to a given argId using the following queries:
MATCH (v:Vote) WHERE v.argId = '214' AND v.type='DOWN'
RETURN {downvotes: COUNT(v)} AS votes
UNION
MATCH (v:Vote) WHERE v.argId = '214' AND v.type='UP'
RETURN {upvotes: COUNT(v)} AS votes
Note that the above Cypher works and returns the expected result, like so:
[
  {
    "downvotes": 1
  },
  {
    "upvotes": 10
  }
]
But I feel like the query could be a bit neater and want to write something like this:
MATCH (v:Vote) WHERE v.argId = '214' AND v.type='UP'
MATCH (b:Vote) WHERE b.argId = '214' AND b.type='DOWN'
RETURN {upvotes: COUNT(v), downvotes: COUNT(b)}
Just reading it through, I think it makes sense: b and v are declared as separate variables, so all should be good (or so I thought).
But running it gives me this:
{
  "upvotes": 10,
  "downvotes": 10
}
But it should be what I have above.
Why is this?
I'm kinda new to neo4j and cypher so I've probably not understood how cypher works fully.
Can anyone shine any light?
Thank you!
p.s. I'm using Neo4j 3.5.6 and running the queries via the Desktop web browser app.
I think if you run this query you will get a clearer picture of what is happening. Your query produces a cartesian product of the upvotes (10) and the downvotes (1). The product is a result set of 10 rows. When they are subsequently counted, there are ten of each.
MATCH (v:Vote) WHERE v.argId = '214' AND v.type='UP'
MATCH (b:Vote) WHERE b.argId = '214' AND b.type='DOWN'
RETURN v.type, b.type
In order to get the result you want, you need to filter the values and count them individually.
Rather than having two match statements, have a single match statement that retrieves all of the values of interest, and then use a conditional expression to filter them into upvotes and downvotes buckets.
Something like this may suit you.
MATCH (v:Vote {argId: '214'})
WHERE v.type IN ['UP', 'DOWN']
RETURN {
  upvotes:   count(CASE WHEN v.type = 'UP' THEN 1 END),
  downvotes: count(CASE WHEN v.type = 'DOWN' THEN 1 END)
} AS vote_result
Using APOC, you could do something like this, where you use the type values themselves to aggregate the counts and then use APOC to convert the collected pairs into a map with the types as keys.
MATCH (v:Vote {argId: '214'})
WHERE v.type IN ['UP', 'DOWN']
WITH [v.type, count(*)] AS vote_pair
RETURN apoc.map.fromPairs(collect(vote_pair)) AS votes

SPARQL DBPedia query for seating capacity, optimize and remove duplicates

I want to get all objects with seating capacity information on DBPedia. Optionally, I want to get their label, address, lat and lon information.
My issue is that I get a lot of duplicates even after filtering by language. How can I get distinct entries based on, say, 'address', or any other attribute?
Also, can you tell me which part of this query can be improved so that it doesn't time out when I use the public DBpedia endpoint? Thanks!
PREFIX dbpediaO: <http://dbpedia.org/ontology/>
SELECT ?place ?label ?capacity ?address ?lat ?lon WHERE {
  ?place dbpedia2:seatingCapacity ?capacity .
  OPTIONAL {
    ?place dbpediaO:address ?address .
    ?place rdfs:label ?label .
    ?place geo:lat ?lat .
    ?place geo:long ?lon .
  }
  filter (lang(?label) = "en" || lang(?label) = "eng")
  filter (lang(?address) = "en" || lang(?address) = "eng")
}
Your places have multiple values of, for example, address. The unique thing is the URI itself. Moreover, you should put each property in a separate OPTIONAL, or at least use separate OPTIONAL clauses for lat/long. For label you do not need an OPTIONAL clause at all in DBpedia. The only way to get unique places is to group by the place and sample or group_concat all other properties. Something like this:
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?place (sample(?_label) as ?label)
       (group_concat(?capacity; separator=";") as ?capacities)
       (group_concat(?address; separator=";") as ?addresses) ?lat ?lon
WHERE {
  ?place dbo:seatingCapacity ?capacity ;
         rdfs:label ?_label .
  filter (langmatches(lang(?_label), "en"))
  OPTIONAL {
    ?place dbo:address ?address .
    filter (langmatches(lang(?address), "en"))
  } OPTIONAL {
    ?place geo:lat ?lat ; geo:long ?lon .
  }
}
group by ?place ?lat ?lon
order by desc(?place)
limit 100
As you can see, there are also multiple capacity values for places.

How to check if the value matches the one from previous ?th row? (? is dynamic)

Here is my data set.
Data in
I'd like to check whether the gender of each "Potential Original" row matches the gender of its "Potential Duplicate" row. There is no explicit group, but one duplicate followed by one or more originals acts like a group.
Here is the output I want (for a duplicate it's NA, because it would be comparing to itself).
Data out
Appreciate your help. Thanks.
Thanks Rahul for looking into this. This is what I tried, and I think it worked. The logic is to first create a sequence number for each block of Duplicate and Original rows, and then pull the lagged value at the corresponding distance.
library(data.table)

# sequence number within each block that starts at a "Potential Duplicate" row
setDT(df)[, counter := seq_len(.N), by = list(cumsum(Status == "Potential Duplicate"))]

for (i in 1:nrow(df)) {
  if (df$Status[i] == "Potential Duplicate") {
    df$Gender_LAG[i] <- df$Gender[i]
  } else {
    # step back counter - 1 rows to reach the duplicate at the start of the block
    df$Gender_LAG[i] <- df$Gender[i - df$counter[i] + 1]
  }
}
Thanks.
Looking forward to seeing other options.
