Getting all possible predicates between entity types in DBpedia using SPARQL - query-optimization

I am trying to get all predicates that can occur between two entity types. Below is an example for two entities of type Person.
SELECT distinct ?p
WHERE { ?url1 rdf:type <http://dbpedia.org/ontology/Person> .
?url2 rdf:type <http://dbpedia.org/ontology/Person> .
?url1 ?p ?url2 .
FILTER(STRSTARTS(STR(?p), "http://dbpedia.org/ontology")).
}
However, the resulting output contains predicates like birthPlace and deathPlace, which definitely cannot hold between two Person instances.
Am I missing any constraints to get more logical outputs?

Not sure whether it's worth providing this as an answer ...
You're missing only one point:
The data in DBpedia is neither perfect nor free of noise, since it is automatically extracted from Wikipedia.
You can check why this happens with a query like
SELECT * WHERE {
?url1 rdf:type <http://dbpedia.org/ontology/Person> .
?url2 rdf:type <http://dbpedia.org/ontology/Person> .
?url1 <http://dbpedia.org/ontology/birthPlace> ?url2 .
}
limit 10
Among others, you get
+-------------------------------------------+------------------------------------------------------------------------+
| url1 | url2 |
+-------------------------------------------+------------------------------------------------------------------------+
| http://dbpedia.org/resource/Analía_Núñez | http://dbpedia.org/resource/David |
| http://dbpedia.org/resource/Jorvan_Vieira | http://dbpedia.org/resource/Luís_Alves_de_Lima_e_Silva,_Duke_of_Caxias |
| http://dbpedia.org/resource/Adebayo_Lawal | http://dbpedia.org/resource/Offa_of_Mercia |
| ... | ... |
+-------------------------------------------+------------------------------------------------------------------------+
Let's have a look at http://dbpedia.org/resource/Analía_Núñez:
DESCRIBE <http://dbpedia.org/resource/Analía_Núñez>
Among others, it returns the triples
dbr:Analía_Núñez  dbo:birthPlace  dbr:Panama ,
                                  dbr:David ,
                                  <http://dbpedia.org/resource/David,_Chiriqu\u00ED> .
You can see three birth places. While the value should be only http://dbpedia.org/resource/David,_Chiriqu%C3%AD , something clearly went wrong during the extraction from the infobox
in the Wikipedia article about Analía Núñez.
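If it helps for inspecting such cases, here is a rough sketch (Python with SPARQLWrapper is my assumption, not something from the question) that runs the same predicate query but also pulls each predicate's declared rdfs:range, which makes schema-inconsistent hits like dbo:birthPlace between two persons easy to spot. It may be slow or time out on the public endpoint, so consider adding a LIMIT.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT ?p ?range
    WHERE { ?url1 rdf:type <http://dbpedia.org/ontology/Person> .
            ?url2 rdf:type <http://dbpedia.org/ontology/Person> .
            ?url1 ?p ?url2 .
            OPTIONAL { ?p rdfs:range ?range }
            FILTER(STRSTARTS(STR(?p), "http://dbpedia.org/ontology"))
    }
""")
# Print each predicate next to its declared range (if any) to judge which
# results are consistent with the schema and which are extraction noise.
for b in sparql.query().convert()["results"]["bindings"]:
    print(b["p"]["value"], b.get("range", {}).get("value", "no declared range"))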

Related

Choice of data structure and integral computations for time based events

This question should go in the category: "a practical, answerable problem that is unique to software development".
Having a system that receives time-based events in the following form (simplified for clarity):
2020-04-10T11:00:00.000Z:system1,UP
2020-04-15T07:00:00.000Z:system1,DOWN
2020-04-20T09:15:00.000Z:system2,UP
2020-05-10T06:30:00.000Z:system3,UP
2020-05-15T22:30:00.000Z:system3,DOWN
2020-05-16T08:30:00.000Z:system3,UP
And imagine we graph this time series on a chart where:
X-axis: time
Y-axis: state (DOWN/UP, to which we can assign the values 1/0). Note that the values are reversed (DOWN = 1, UP = 0) so that the graph area computed in this question makes sense.
How can I save these events to a database (or equivalent) and compute the area between the X-axis and the graph (i.e., numerical integration)? The start and end boundaries for the computation can be a full calendar month or the fragment of the current month so far.
Example requirement
| <---- compute interval ----> |
|--------------------------------------------- <= DOWN
| xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
S1 . . .|--------|. . . . . . . . . . . . . . . . . . . . . . . . . . <= UP
| |
S2 . . . . . . . . . . . .|------------------------------------ . . .
| |
|--|
|xx|
S3 . . . . . . . . . . . . . . . . . . .|--------| |---------- . . .
| |
time: 10Apr 15Apr 20Apr 1May 10May 15May 16May 20May/now
Compute the amount of time (marked with xxx in the graphs above) each of the systems was unavailable in May, taking 2020-05-20 as the current date. Note that:
system1 and system2 didn't receive any events in May at all. They are "historically" DOWN and UP respectively, and this state can extend indefinitely far into the past.
system3 had a 10-hour interruption in May
And the result should be:
system1: X minutes (all minutes so far)
system2: 0 minutes
system3: 600 minutes
What I tried so far
I have studied Elasticsearch a bit, and this use case does not look easy to solve there (one would probably have to combine transform APIs and several aggregations).
A potential solution with relational databases could be to save each new event with two timestamps (begin, end) and, upon saving, search the history and close the previous still-open event for the same system. Then comes collecting these intervals, intersecting them with the period/month in question, etc.
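To make that second idea concrete, here is a minimal sketch of the interval arithmetic in Python, assuming the events have already been paired up into closed (begin, end) DOWN intervals per system; the data below just hand-encodes the example above, and None marks an interval that is still open:

from datetime import datetime, timezone

down_intervals = {
    "system1": [(datetime(2020, 4, 15, 7, 0, tzinfo=timezone.utc), None)],   # still DOWN
    "system2": [],                                                            # never DOWN
    "system3": [(datetime(2020, 5, 15, 22, 30, tzinfo=timezone.utc),
                 datetime(2020, 5, 16, 8, 30, tzinfo=timezone.utc))],
}

def downtime_minutes(intervals, window_start, window_end):
    """Sum the overlap of each DOWN interval with [window_start, window_end)."""
    total = 0.0
    for begin, end in intervals:
        end = end or window_end            # open interval: still DOWN "now"
        start = max(begin, window_start)   # clip to the compute window
        stop = min(end, window_end)
        if stop > start:
            total += (stop - start).total_seconds() / 60
    return total

window_start = datetime(2020, 5, 1, tzinfo=timezone.utc)
window_end = datetime(2020, 5, 20, tzinfo=timezone.utc)   # "now"
for system, intervals in down_intervals.items():
    print(system, downtime_minutes(intervals, window_start, window_end))
# system3 comes out as 600.0 minutes, system1 as the whole window so far, system2 as 0.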

Cassandra's ORDER BY does not work as expected

So, I'm storing some statistics in Cassandra.
I want to get the top 10 entries based on a specific column, which in this case is kills.
As there is no ORDER BY command like in MySQL, I have to create a PARTITION KEY.
I've created the following table:
CREATE TABLE IF NOT EXISTS stats (
    uuid uuid,
    kills int,
    deaths int,
    playedGames int,
    wins int,
    srt int,
    PRIMARY KEY (srt, kills)
) WITH CLUSTERING ORDER BY (kills DESC);
The problem I have is the following: as you can see above, I'm using the column srt for ordering, because when I use the column uuid instead, the result of my select query is totally random and not sorted as expected.
So I tried adding a column that always holds the same value to use as my PARTITION KEY. Sorting works now, but not really well. When I run SELECT * FROM stats;, the result is the following:
srt | kills | deaths | playedgames | uuid | wins
-----+-------+--------+-------------+--------------------------------------+------
0 | 49 | 35 | 48 | 6f284e6f-bd9a-491f-9f52-690ea2375fef | 2
0 | 48 | 21 | 30 | 4842ad78-50e4-470c-8ee9-71c5a731c935 | 4
0 | 47 | 48 | 14 | 91f41144-ef5a-4071-8c79-228a7e192f34 | 42
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
0 | 2 | 32 | 20 | 387448a7-a08e-46d4-81a2-33d8a893fdb6 | 31
0 | 1 | 16 | 17 | fe4efbcd-34c3-419a-a52e-f9ae8866f2bf | 12
0 | 0 | 31 | 25 | 82b13d11-7eeb-411c-a521-c2c2f9b8a764 | 10
The problem with the result is that there is only one row per kills value - but there should definitely be more.
So, any idea how to sort in Cassandra without data getting stripped out?
I also heard about DataStax Enterprise (DSE), which supports Solr in queries, but DSE is only free for non-production use (and only for 6 months), and the paid version is, at least from what I've heard, pretty expensive (around $4000 per node). So, is there any alternative like a DataStax Enterprise Community Edition? That may not make sense, but I'm just asking. I haven't found anything by googling, so: can I also use Solr with "normal" Cassandra?
Thank you for your help!
PS: Please don't mark this as a duplicate of "order by clause not working in Cassandra query", because it didn't help me. I have already googled for about an hour and a half for a solution.
EDIT:
Because my primary key is PRIMARY KEY (srt, kills), the combination of (srt, kills) must be unique, which basically means that rows with the same number of kills overwrite each other. I could use PRIMARY KEY (uuid, kills), which would solve the overwriting problem, but when I then run SELECT * FROM stats LIMIT 10, the results come back in a random order and are not sorted by kills.
If you want to use a column for sorting, take it out of the partition key. Rows are sorted by this column within every partition - Cassandra distributes data between nodes using the partition key and orders it within each partition using the clustering key:
PRIMARY KEY ((srt), kills)
EDIT:
You need to understand the concepts a little better; I suggest you take one of the free courses on the DSE site, it can help you with further development.
Anyway, about your question:
The primary key is the set of columns that makes each row unique.
There are two types of columns in the primary key - partition key columns and clustering columns.
You can't use the partition key for sorting or range queries - that goes against Cassandra's model, because such a query would have to be split across several nodes, or even all nodes and SSTables. If you want to use both of the listed columns for sorting, you can use another column for partitioning (a random number from 1 to 100, for example) and then execute your query once per such "bucket"; or simply use another column that has a high enough number of unique values (at least 100), where the data is evenly distributed between those values and accessed through all of them - otherwise you will end up with hot nodes/partitions.
PRIMARY KEY ((another_column), kills, srt)
What you have to understand is that you can order your data only within partitions, not between partitions.
"only one row per kills value" - can you elaborate? There is only one row for each key in Cassandra; if you insert several rows with the same key, they are overwritten with the values of the last insert (read about upserts).
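As a rough sketch of what that can look like end to end, here is the bucketed table with uuid added as a clustering column so that equal kills values no longer overwrite each other (Python with the DataStax cassandra-driver; the contact point, the keyspace name "game" and the single bucket value 0 are my assumptions, not taken from the question):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("game")

# bucket plays the role of srt above; with a single bucket value all rows live
# in one partition, which keeps them globally sorted but can become a hot/large
# partition - hence the advice above about spreading data over e.g. 100 buckets.
session.execute("""
    CREATE TABLE IF NOT EXISTS stats (
        bucket int, kills int, uuid uuid,
        deaths int, playedgames int, wins int,
        PRIMARY KEY ((bucket), kills, uuid)
    ) WITH CLUSTERING ORDER BY (kills DESC, uuid ASC)
""")

# Top 10 by kills, read from one partition in clustering order.
for row in session.execute("SELECT uuid, kills FROM stats WHERE bucket = 0 LIMIT 10"):
    print(row.uuid, row.kills)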

How can I specify the type of a column in a python behave step data table?

Assume I have a step defined as follows:
Then I would expect to see the following distribution for Ford
| engine | doors | color |
| 2.1L | 4 | red |
And I have the step implementation that reads the table and does the assert as follows:
@then('I would expect to see the following distribution for {car_type}')
def step(context, car_type):
    car = find_car_method(car_type)
    for row in context.table:
        for heading in row.headings:
            assertEqual(getattr(car, heading),
                        row[heading],
                        "%s does not match. " % heading +
                        "Found %s" % getattr(car, heading))
(I do it like this because the approach allows adding more fields while staying generic enough for many different checks of the car's attributes.)
When my car object has 4 doors (as an int), it does not match, because the data table supplies '4' doors (as a unicode string).
I could implement this method to check the name of the column and handle different fields differently, but then maintenance becomes harder when adding a new field, as there is one more place to update. I would prefer to specify the type in the step data table instead. Something like:
Then I would expect to see the following distribution for Ford
| engine | doors:int | color |
| 2.1L | 4 | red |
Is there something similar that I can use to achieve this (as this does not work)?
Note that I also have cases where I need to create the object from the data table, and I have the same problem there. That makes it useless to try to derive the type from the 'car' object's attribute, as it is None in that case.
Thank you,
Baire
After some digging I was not able to find anything, so I decided to implement my own solution. I am posting it here as it might help someone in the future.
I created a helper method:
import datetime
import re

def convert_to_type(full_field_name, value):
    """ Converts the value from a behave table into its correct type based on the name
        of the column (header). If it is wrapped in a convert method, then use it to
        determine the value type the column should contain.
        Returns: a tuple with the newly converted value and the name of the field (without
        the conversion method specified). E.g. int(size) will return size as the new field
        name and the value will be converted to an int and returned.
    """
    field_name = full_field_name.strip()
    matchers = [(re.compile(r'int\((.*)\)'), lambda val: int(val)),
                (re.compile(r'float\((.*)\)'), lambda val: float(val)),
                (re.compile(r'date\((.*)\)'), lambda val: datetime.datetime.strptime(val, '%Y-%m-%d'))]
    for (matcher, func) in matchers:
        matched = matcher.match(field_name)
        if matched:
            return (func(value), matched.group(1))
    return (value, full_field_name)
I could then set up my scenario as follows:
Then I would expect to see the following distribution for Ford
| engine | int(doors) | color |
| 2.1L | 4 | red |
I then changed my step as follows:
@then('I would expect to see the following distribution for {car_type}')
def step(context, car_type):
    car = find_car_method(car_type)
    for row in context.table:
        for heading in row.headings:
            (value, field_name) = convert_to_type(heading, row[heading])
            assertEqual(getattr(car, field_name),
                        value,
                        "%s does not match. " % field_name +
                        "Found %s" % getattr(car, field_name))
It should be easy to move 'matchers' to the module level so that they don't have to be re-created each time the method is called. It is also easy to extend with more conversion methods (e.g. liter() and cc() for parsing engine sizes while also converting to a standard unit). A sketch of that refactoring follows.
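For illustration (the liter() converter is hypothetical, just to show the extension point):

import datetime
import re

# Compiled once at import time instead of on every call.
MATCHERS = [
    (re.compile(r'int\((.*)\)'), int),
    (re.compile(r'float\((.*)\)'), float),
    (re.compile(r'date\((.*)\)'),
     lambda val: datetime.datetime.strptime(val, '%Y-%m-%d')),
    (re.compile(r'liter\((.*)\)'),
     lambda val: float(val.rstrip('Ll'))),   # hypothetical: "2.1L" -> 2.1
]

def convert_to_type(full_field_name, value):
    field_name = full_field_name.strip()
    for matcher, func in MATCHERS:
        matched = matcher.match(field_name)
        if matched:
            return func(value), matched.group(1)
    return value, full_field_name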

'nested fields' in MS Access

I have a primary field called 'Unit', and I need to specify how many objects (exhibiting great variability) were found within each one. There are 116 types of these objects, which can be broken down into 6 categories. Each category dictates which attributes may be used to describe its object types. For every unit, I need to record how many types of objects are found within it and document how many of them exhibit each attribute. I sketched out examples of the schema and how I need to apply it. Perhaps the simplest solution would be to create a table for each type and relate them to the table containing the unit list, but then there would be a great many tables (is there a limit in MS Access?). Otherwise, is it possible to create 'nested fields' in Access? I just made that term up, but it seems to describe what I'm looking to do.
Category 1 (attribs: a, b, c, d)
Type 1
Type 2
Category 2 (attribs: x, y, z)
Type 3
Type 4
Unit
Type 1 | a | b | c | d |
Type 2 | a | b | c | d |
Type 3 | x | y | z |
Type 4 | x | y | z |
UPDATE:
To clarify, I essentially need to create subtables for each field of my primary table. Each field has sub-attributes, and I need to be able to specify the distribution of objects at this finer-grained resolution.
What you want is similar to this question on SO: Database design - multi category products with properties.
So you'll have a third table which relates each attribute value to a Unit. And to control which attributes each Category can have, a fourth table is required which specifies the attribute names for each category.
Unit:
.id
.categoryid
Category:
.id
.cat_name
Category_Attributes:
.attribID
.categoryid
.attribute_name
Unit_Attributes:
.unitid
.attribID
.attrib_value
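For what it's worth, here is a quick sketch of how the four tables fit together, using sqlite3 in Python purely as a stand-in for Access (the table and column names follow the answer; the query itself is plain SQL and translates directly):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Category (id INTEGER PRIMARY KEY, cat_name TEXT);
    CREATE TABLE Unit (id INTEGER PRIMARY KEY,
                       categoryid INTEGER REFERENCES Category(id));
    CREATE TABLE Category_Attributes (attribID INTEGER PRIMARY KEY,
                                      categoryid INTEGER REFERENCES Category(id),
                                      attribute_name TEXT);
    CREATE TABLE Unit_Attributes (unitid INTEGER REFERENCES Unit(id),
                                  attribID INTEGER REFERENCES Category_Attributes(attribID),
                                  attrib_value TEXT);
""")

# All attribute values recorded for unit 1, together with their attribute names.
rows = con.execute("""
    SELECT ca.attribute_name, ua.attrib_value
    FROM Unit_Attributes AS ua
    JOIN Category_Attributes AS ca ON ca.attribID = ua.attribID
    WHERE ua.unitid = 1
""").fetchall()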

LUBM benchmark classes

I have used the Lehigh University Benchmark (LUBM) to test my application.
What I know about LUBM is that its ontology contains 43 classes.
But when I query over the classes, I get 14 classes!
Also, when I used the Sesame workbench and checked the "Types in Repository" section, I got these 14 classes:
AssistantProfessor
AssociateProfessor
Course
Department
Fullprofessor
GraduateCourse
GraduateStudent
Lecturer
Publication
ResearchAssistant
ResearchGroup
TeachingAssistant
UndergraduateStudent
University
Could anyone explain the difference between them to me?
Edit: The problem is partially solved, but now: how can I retrieve RDF instances from the upper levels of the ontology (e.g. Employee, Book, Article, Chair, College, Director, PostDoc, JournalArticle, etc.), or let's say from all 43 classes? I can only retrieve instances for the lower-level classes (the 14 classes above) - for example, the instances of ub:Department.
You didn't mention what data you're using, so we can't be sure that you're actually using the correct data, or even which version of it you're using. The OWL ontology can be downloaded from the Lehigh University Benchmark (LUBM) site; the OWL version of the ontology is univ-bench.owl.
Based on that data, you can use a query like this to find out how many OWL classes there are:
prefix owl: <http://www.w3.org/2002/07/owl#>
select (count(?class) as ?numClasses) where { ?class a owl:Class }
--------------
| numClasses |
==============
| 43 |
--------------
I'm not familiar with the Sesame workbench, so I'm not sure how it's counting types, but it's easy to see that different ways of counting types can lead to different results. For instance, if we only count the types of which there are instances, we only get six classes (and they're the OWL meta-classes, so this isn't particularly useful):
select distinct ?class where { ?x a ?class }
--------------------------
| class |
==========================
| owl:Class |
| owl:TransitiveProperty |
| owl:ObjectProperty |
| owl:Ontology |
| owl:DatatypeProperty |
| owl:Restriction |
--------------------------
Now, that's what happens if you're just querying on the ontology itself. The ontology only provides the definitions of the vocabulary that you might use to describe some actual situation. But where can you get descriptions of actual (or fictitious) situations? Note that at SWAT Projects - the Lehigh University Benchmark (LUBM) there's a link below the Ontology download:
Data Generator (UBA):
This tool generates synthetic OWL or DAML+OIL data
over the Univ-Bench ontology in the unit of a university. These data
are repeatable and customizable, by allowing the user to specify the seed for
random number generation, the number of universities, and the starting
index of the universities.
* What do the data look like?
If you follow the "what do the data look like" link, you'll get another link to an actual sample file,
http://swat.cse.lehigh.edu/projects/lubm/University0_0.owl
That actually has some data in it. You can run a query like the following at sparql.org's query processor and get some useful results:
select ?individual ?class
from <http://swat.cse.lehigh.edu/projects/lubm/University0_0.owl>
where {
?individual a ?class
}
-------------------------------------------------------------------------------------------------------------------------------------------------------------
| individual | class |
=============================================================================================================================================================
| <http://www.Department0.University0.edu/AssociateProfessor9> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#AssociateProfessor> |
| <http://www.Department0.University0.edu/GraduateStudent127> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#GraduateStudent> |
| <http://www.Department0.University0.edu/UndergraduateStudent98> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#UndergraduateStudent> |
| <http://www.Department0.University0.edu/UndergraduateStudent182> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#UndergraduateStudent> |
| <http://www.Department0.University0.edu/GraduateStudent1> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#TeachingAssistant> |
| <http://www.Department0.University0.edu/AssistantProfessor4/Publication4> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#Publication> |
| <http://www.Department0.University0.edu/UndergraduateStudent271> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#UndergraduateStudent> |
| <http://www.Department0.University0.edu/UndergraduateStudent499> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#UndergraduateStudent> |
| <http://www.Department0.University0.edu/UndergraduateStudent502> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#UndergraduateStudent> |
| <http://www.Department0.University0.edu/GraduateCourse61> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#GraduateCourse> |
| <http://www.Department0.University0.edu/AssociateProfessor10> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#AssociateProfessor> |
| <http://www.Department0.University0.edu/UndergraduateStudent404> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#UndergraduateStudent> |
…
I think that to get the kind of results you're looking for, you need to download this data, or download a version of the UBA test data generators and generate some of your own data.
If you use UBA (the LUBM data generator), you get instance data, where instances are declared to be of certain types. E.g.
<http://www.Department0.University31.edu/FullProfessor4> rdf:type ub:FullProfessor .
It turns out, when you run UBA, it only asserts instances into the 14 classes you mention.
The LUBM ontology actually defines 43 classes. These are classes that are available to be used in instance data sets.
In OpenRDF Sesame, when it lists "Types in Repository", it is apparently just showing those types which are actually "used" in the data. (I.e., there is at least 1 instance asserted to be of that type/class in the data.)
That is the difference between your two lists. When you look at the ontology, there are 43 classes defined (available for use), but when you look at actual instance data generated by LUBM UBA, only those 14 classes are directly used.
(NOTE: If you had a triple store with OWL reasoning turned on, the reasoner would assert the instances into more of the classes defined in the ontology.)
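If you don't have an OWL reasoner available, one partial workaround (my addition, not part of the original answer) is to follow the explicit rdfs:subClassOf links with a SPARQL 1.1 property path. That reaches upper-level classes such as ub:Professor, but classes defined only through OWL restrictions (e.g. ub:Chair) still require reasoning. A rough sketch with rdflib, assuming a local copy of univ-bench.owl next to the script (the filename is a placeholder):

from rdflib import Graph

g = Graph()
g.parse("univ-bench.owl", format="xml")   # local copy of the LUBM ontology
g.parse("http://swat.cse.lehigh.edu/projects/lubm/University0_0.owl", format="xml")

# Instances of ub:Professor, reached via explicit rdf:type plus rdfs:subClassOf links
# (FullProfessor, AssociateProfessor, AssistantProfessor are subclasses of Professor).
query = """
    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ub:   <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
    SELECT DISTINCT ?x WHERE { ?x rdf:type/rdfs:subClassOf* ub:Professor }
"""
for row in g.query(query):
    print(row[0])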
