LUBM benchmark classes - dataset

I have used the Lehigh University Benchmark (LUBM) to test my application.
What I know about LUBM is that its ontology contains 43 classes.
But when I query for the classes, I get only 14!
Also, when I use the Sesame workbench and check the "Types in Repository" section, I get the following 14 classes:
AssistantProfessor
AssociateProfessor
Course
Department
FullProfessor
GraduateCourse
GraduateStudent
Lecturer
Publication
ResearchAssistant
ResearchGroup
TeachingAssistant
UndergraduateStudent
University
Could anyone explain to me the difference between them?
Edit: The problem is partially solved, but now: how can I retrieve RDF instances from the upper levels of the ontology (e.g. Employee, Book, Article, Chair, College, Director, PostDoc, JournalArticle, etc.), or indeed from all 43 classes? I can only retrieve instances for the lower-level classes (the 14 classes above); the picture below shows retrieving the instances of ub:Department.

You didn't mention what data you're using, so we can't be sure that you're actually using the correct data, or even know which version of it you're using. The OWL ontology can be downloaded from the Lehigh University Benchmark (LUBM) page, where the OWL version of the ontology is univ-bench.owl.
Based on that data, you can use a query like this to find out how many OWL classes there are:
prefix owl: <http://www.w3.org/2002/07/owl#>
select (count(?class) as ?numClasses) where { ?class a owl:Class }
--------------
| numClasses |
==============
| 43         |
--------------
I'm not familiar with the Sesame workbench, so I'm not sure how it's counting types, but it's easy to see that different ways of counting types can lead to different results. For instance, if we only count the types of which there are instances, we only get six classes (and they're the OWL meta-classes, so this isn't particularly useful):
select distinct ?class where { ?x a ?class }
--------------------------
| class                  |
==========================
| owl:Class              |
| owl:TransitiveProperty |
| owl:ObjectProperty     |
| owl:Ontology           |
| owl:DatatypeProperty   |
| owl:Restriction        |
--------------------------
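If you want to confirm that the ontology alone contains no individuals at all, you can filter out the OWL namespace. This is only a sketch, and it assumes a SPARQL 1.1 processor (strstarts is 1.1-only); run against univ-bench.owl it should return no rows, which is exactly the point: the ontology defines vocabulary but describes no instances.
prefix owl: <http://www.w3.org/2002/07/owl#>

# types in use, excluding the OWL meta-classes themselves
select distinct ?class where {
  ?x a ?class
  filter (!strstarts(str(?class), str(owl:)))
}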
Now, that's what happens if you're just querying the ontology itself. The ontology only provides the definitions of the vocabulary that you might use to describe some actual situation. But where can you get descriptions of actual (or fictitious) situations? Note that at SWAT Projects - the Lehigh University Benchmark (LUBM) there's a link below the ontology download:
Data Generator (UBA):
This tool generates synthetic OWL or DAML+OIL data
over the Univ-Bench ontology in the unit of a university. These data
are repeatable and customizable, by allowing the user to specify the seed for
random number generation, the number of universities, and the starting
index of the universities.
* What do the data look like?
If you follow the "what do the data look like" link, you'll get another link to an actual sample file,
http://swat.cse.lehigh.edu/projects/lubm/University0_0.owl
That actually has some data in it. You can run a query like the following at sparql.org's query processor and get some useful results:
select ?individual ?class
from <http://swat.cse.lehigh.edu/projects/lubm/University0_0.owl>
where {
  ?individual a ?class
}
-------------------------------------------------------------------------------------------------------------------------------------------------------------
| individual | class |
=============================================================================================================================================================
| <http://www.Department0.University0.edu/AssociateProfessor9> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#AssociateProfessor> |
| <http://www.Department0.University0.edu/GraduateStudent127> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#GraduateStudent> |
| <http://www.Department0.University0.edu/UndergraduateStudent98> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#UndergraduateStudent> |
| <http://www.Department0.University0.edu/UndergraduateStudent182> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#UndergraduateStudent> |
| <http://www.Department0.University0.edu/GraduateStudent1> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#TeachingAssistant> |
| <http://www.Department0.University0.edu/AssistantProfessor4/Publication4> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#Publication> |
| <http://www.Department0.University0.edu/UndergraduateStudent271> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#UndergraduateStudent> |
| <http://www.Department0.University0.edu/UndergraduateStudent499> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#UndergraduateStudent> |
| <http://www.Department0.University0.edu/UndergraduateStudent502> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#UndergraduateStudent> |
| <http://www.Department0.University0.edu/GraduateCourse61> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#GraduateCourse> |
| <http://www.Department0.University0.edu/AssociateProfessor10> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#AssociateProfessor> |
| <http://www.Department0.University0.edu/UndergraduateStudent404> | <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#UndergraduateStudent> |
…
I think that to get the kind of results you're looking for, you need to download this data, or download a version of the UBA test data generator and generate some data of your own.

If you use UBA (the LUBM data generator), you get instance data in which individuals are declared to be of certain types, e.g.:
<http://www.Department0.University31.edu/FullProfessor4> rdf:type ub:FullProfessor
It turns out that when you run UBA, it only asserts instances into the 14 classes you mention.
The LUBM ontology actually defines 43 classes. These are classes that are available to be used in instance data sets.
In OpenRDF Sesame, when it lists "Types in Repository", it is apparently just showing the types that are actually used in the data, i.e., those with at least one instance asserted to be of that type/class.
That is the difference between your two lists. When you look at the ontology, there are 43 classes defined (available for use), but when you look at actual instance data generated by the LUBM UBA, only those 14 classes are directly used.
(NOTE: If you had a triple store with OWL reasoning turned on, the reasoner would also assert the instances into the relevant superclasses defined in the ontology.)
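If your store does not do OWL reasoning, you can approximate the subclass part of it in SPARQL 1.1 with a property path over rdfs:subClassOf. A minimal sketch (the ub: prefix below is the univ-bench namespace; substitute any of the 43 classes for ub:Employee):
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix ub:   <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>

# instances whose asserted type is ub:Employee or any subclass of it
# (FullProfessor, Lecturer, ...), computed without a reasoner
select distinct ?x where {
  ?x a/rdfs:subClassOf* ub:Employee
}
Note that the ontology itself must be loaded into the same repository as the instance data, since the path walks its rdfs:subClassOf triples. Classes defined only through OWL restrictions (e.g. Chair, defined by headOf some Department) still need a real reasoner.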

Related

Need help creating a simple form for reviewing a (very) large number of diagnosis codes

OK, I've been lurking here for a long time, but I've never asked a question before, so apologies for the long and complicated one. I have a very large Excel sheet with nearly 40,000 unique codes from the ICD-10 classification system, which classifies essentially all known diseases. This is a hierarchical classification system in which codes are organized into twenty-odd chapters and gradually more specific codes with three or more positions. For example, the code A22 is anthrax, with a number of sub-codes: A22.0 = Cutaneous anthrax, A22.1 = Pulmonary anthrax, etc. However, for some diseases there are no 4-digit codes under the 3-digit code (e.g. C01, below), or only one 4-digit code that is meaningful for us to recognize (e.g. C00, below). For other diseases, we want full precision (e.g. G23).
Example table
| 3-digit code | Specific code | Description |
| -------- | -------- | -------- |
| C00 | C00.0 | External upper lip |
| C00 | C00.1 | External lower lip |
| C00 | C00.2 | External lip, unspecified |
| C00 | C00.3 | Upper lip, inner aspect |
| C01 | C01 | Malignant neoplasm of base of tongue |
| G23 | G23 | Other degenerative diseases of basal ganglia |
| G23 | G23.0 | Hallervorden-Spatz disease |
| G23 | G23.1 | Progressive supranuclear ophthalmoplegia [Steele-Richardson-Olszewski] |
| G23 | G23.2 | Multiple system atrophy, parkinsonian type [MSA-P] |
| G23 | G23.3 | Multiple system atrophy, cerebellar type [MSA-C] |
The issue at hand is that I'm conducting a large-scale research study based on a health register in which diagnoses are coded using this system. Due to a policy of information minimization/data privacy, we need to select for which of these 40,000 codes we need full precision (i.e., the 4-digit level) and for which the 3-digit codes are sufficient. This is a very tedious task, and I need to make it as efficient as possible. My idea is to create a simple form that links to my large table (which has exactly the format above, only longer) and presents each 3-digit code one by one, with a simple checkbox or similar that allows me to select whether or not this group should have full precision. I'm envisioning something simple like this:
(mock-up screenshot of the envisioned form omitted)
Sorry for the stupidly long prelude, but my question is much simpler: what would be a simple way to achieve this? I don't "know" any graphical programming languages, but I have used SAS, R, and other statistical programming systems for about 20 years, so I really just need a push in the right direction. Could it, for example, be done using an Access form? Any help would be much appreciated!
Thanks,
Gustaf
So, I haven't really tried anything yet as I don't even know where to start.

Combining fields in Google Data Studio

I have a CSV file of the form (unimportant columns hidden)
player,game1,game2,game3,game4,game5,game6,game7,game8
Example data:
Alice,0,-10,-30,-60,-30,-50,-10,30
Bob,10,20,30,40,50,60,70,80
Charlie,20,0,20,0,20,0,20,0
Derek,1,2,3,4,5,6,7,8
Emily,-40,-30,-20,-10,10,20,30,40
Francine,1,4,9,16,25,36,49,64
Gina,0,0,0,0,0,0,0,0
Hank,-50,50,-50,50,-50,50,-50,50
Irene,-20,-20,-20,50,50,-20,-20,-20
I am looking for a way to make a Data Studio view where I can see a chart of all the results of a certain player. How would I make a custom field that combines the data from game1 to game8 so I can make a chart of it?
| Name | Scores |
|----------|---------------------------------|
| Alice | [0,-10,-30,-60,-30,-50,-10,30] |
| Bob | [10,20,30,40,50,60,70,80] |
| Charlie | [20,0,20,0,20,0,20,0] |
| Derek | [1,2,3,4,5,6,7,8] |
| Emily | [-40,-30,-20,-10,10,20,30,40] |
| Francine | [1,4,9,16,25,36,49,64] |
| Gina | [0,0,0,0,0,0,0,0] |
| Hank | [-50,50,-50,50,-50,50,-50,50] |
| Irene | [-20,-20,-20,50,50,-20,-20,-20] |
The goal of the resulting chart would be something like this, where game1 is the first point and so on.
If this is not possible, how would I best represent the data so what I am looking for can work in Data Studio? I currently have it implemented in a Google Sheet, but the issue is there's no way to make views, so when someone selects a row it changes for everyone viewing it.
If you have the two game files as data sources, I guess that you want to combine them by the player name, right?
You can do that with the data blending option; I think it's under Resource > Manage blends.
Then you can create a blended data source, merging them by the name.
You can also add both score fields, with different labels.
Here is some documentation about it: https://support.google.com/datastudio/answer/9061420?hl=en

Best database storage for matching products from offers

I have the following problem. I have products, offers, and their parameters (about 300,000,000 rows in MySQL). Based on the offer parameters and their rates (parameters are dynamic, and every parameter type has a different rate), I must join offers to products. Of course, there will be a lot of updates, deletes, and inserts (around 5,000 req/s, for example).
The second piece of functionality will be sending the joined information via an API. Does anyone have recommendations for which NoSQL database, relational database, or similar to use for storage?
Edit
I'll show my example on a small sample of data in MySQL:
Offer
+----------+-----------------+
| offer_id | name |
+----------+-----------------+
| 1 | iphone_se_black |
| 2 | iphone_se_red |
| 3 | iphone_se_white |
+----------+-----------------+
Parameter_rating
+--------------+----------------+--------+
| parameter_id | parameter_name | rating |
+--------------+----------------+--------+
| 1 | os | 10 |
| 2 | processor | 10 |
| 3 | ram | 10 |
| 4 | color | 1 |
+--------------+----------------+--------+
Parameter_value
+----+--------------+----------------+
| id | parameter_id | value |
+----+--------------+----------------+
| 1 | 1 | iOS |
| 2 | 2 | some_processor |
| 3 | 3 | 2GB |
| 4 | 4 | black |
| 5 | 4 | red |
| 6 | 4 | white |
+----+--------------+----------------+
Parameter_to_value
+----------+--------------------+
| offer_id | parameter_value_id |
+----------+--------------------+
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 2 | 1 |
| 2 | 2 |
| 2 | 3 |
| 2 | 5 |
| 3 | 1 |
| 3 | 2 |
| 3 | 3 |
| 3 | 6 |
+----------+--------------------+
Based on this data, I must return that offers 1, 2, and 3 are one product.
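To make the matching rule concrete, it could be expressed in MySQL roughly like this (a sketch only: Parameter_value stands for the "Parameter_value" table above, and I treat every parameter with rating 10 as product-defining):
-- fingerprint each offer by its product-defining parameter values;
-- offers with identical fingerprints belong to the same product
SELECT ptv.offer_id,
       GROUP_CONCAT(pv.id ORDER BY pv.id) AS product_key
FROM Parameter_to_value AS ptv
JOIN Parameter_value    AS pv ON pv.id = ptv.parameter_value_id
JOIN Parameter_rating   AS pr ON pr.parameter_id = pv.parameter_id
WHERE pr.rating >= 10            -- color (rating 1) is ignored
GROUP BY ptv.offer_id;
All three offers above get the key 1,2,3, so they collapse into one product; at thousands of writes per second, the fingerprint would have to be precomputed and indexed rather than derived on every query.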
The biggest problem is that the data change often: prices change, offers are removed, and so on. Therefore, I don't think MySQL is the most suitable technology, and I'm trying to choose another.
Platform
any recommendations what NoSQL, relational database or something similar to use for storage?
Therefore, I do not think that MySQL is the most suitable technology and I try to choose another.
All that is ordinary fare for a Relational database. Tens of thousands of banks run trading and pricing systems that are extremely active from hundreds of thousands of users, on such systems. Every day. The changes you allude to are normal on such systems (eg. pricing and pricing basis, change all the time, in response to Buys & Sells).
But they use genuine SQL platforms. Freeware/shareware/vapourware/nowhere suites such as MySQL and PostgreSQL are neither SQL-compliant, nor viable platforms for high-throughput OLTP systems (no server architecture; no ACID Transactions; etc). They are still implementing the basics that SQL platforms have had since 1984, which is very difficult (impossible!) because they do not have a server architecture.
Therefore MySQL and PostgreSQL are not suitable for the reason of abject performance; zero concurrency; etc, and not for any database design concerns.
For an appreciation of the value of a genuine OLTP Server Architecture, refer to Oracle vs Sybase ASE. Although the article deals with Oracle explicitly, it applies to all freeware because all freeware has the same non-architecture that Oracle has. Actually, even less than Oracle. You get what you pay for.
Data Analysis
This answer is limited to Relational databases; SQL, its designated data sublanguage; and a genuine, commercially viable, SQL platform.
It appears the system supports an auction of some kind, which means you have to maintain an inventory of available/sold items. The database design that is required is quite ordinary.
However, your question is not clear enough to be answered. You are making many assumptions that we are not party to. Allow me to ask some leading questions, which you need to consider and answer (update your question):
What are the fundamental things that the system transacts operations against?
(Products such as phones?)
How are those things identified?
(Not the ID, but how do humans identify each thing?)
What are the properties of those things?
(Please, not "parameter" ... maybe OS; RAM; Processor; Colour?)
Then the property values can be understood.
(You can't mess with the attributes of a thing unless you hold and maintain the thing.)
What are the operations or transactions against those things?
(a) internal or admin transactions
(e.g. AddProperty; AddPropertyValue; AddProduct; etc.)
(b) external or online user transactions
(e.g. BidProduct [offer to buy]; CloseBid; etc.)
Who are the operators to whom those transactions are permitted?
(e.g. Admins; product suppliers; online bidders; etc.)
I can't make any sense of your Parameter_to_value, please explain.
What is rating? Some kind of weighting of the property vs the other properties, or something the bidders declare?
Database Design • Tentative
This might take a few iterations.
Don't worry about ID fields on each and every file: first we have to understand the data, how it relates to other data, and how it is identified. We can add ID fields at the end.
Note
All my data models are rendered in IDEF1X, the Standard for modelling Relational databases since 1993
My IDEF1X Introduction is essential reading for beginners.
The IDEF1X Anatomy is a refresher for those who have lapsed.
If you have trouble reading the Predicates from the Data Model, let me know and I will produce them in text form.

Mapping Tables of Database in NiFi

Here is my requirement.
I have a big table in Vertica, say base_table, as follows:
base_table
ID | path     | service | experience
20 | /abc/xyz | trz     | moderate
22 | /wer/cmz | brd     | professional
Mapping Tables
map_table1
path_id | path
1       | /abc/xyz
map_table2
exp_id | experience
1      | beginner
Final Table
ID | path_id | service | exp_id
20 | 1       | trz     | -
22 | -       | brd     | 2
In the first case, I need to get path_id 1, since that path is present in map_table1 as well as in the base table, and insert the resulting record into the final table.
In the second case, I need to insert the experience "professional" into map_table2 with exp_id 2, since it is not present in that table yet, and also insert the resulting record into the final table.
Which processors should I use, and how should the flow look in NiFi?
I am not sure if I understand your question correctly, but to generalize the situation: you want to insert a record if it does not exist, and then get the value of the corresponding ID (which may or may not have existed before).
The good news is that NiFi can easily work with a database like Vertica; have a look at the QueryDatabaseTable processor.
The challenge here, however, is that NiFi is designed to efficiently handle many individual messages and is therefore deliberately not very context-aware. For your use case you would probably want a tool that is built to work with tables. In general the solution for this would be Spark, or perhaps it can be built into your database with some stored procedures.
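If you do push it into the database, the generalized step ("insert if missing, then fetch the id") might look roughly like this in SQL. This is only a sketch: exp_seq is a hypothetical sequence, the exact syntax varies by database (some dialects need a dummy FROM clause), and the two-step form is not safe under concurrent writers without a unique constraint or locking:
-- add 'professional' to map_table2 only if it is not already there
INSERT INTO map_table2 (exp_id, experience)
SELECT NEXTVAL('exp_seq'), 'professional'
WHERE NOT EXISTS (
    SELECT 1 FROM map_table2 WHERE experience = 'professional'
);

-- fetch the id, whether it existed before or was just inserted
SELECT exp_id FROM map_table2 WHERE experience = 'professional';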

Implementing a Model in a Relational Database

I have a super-class/subclass hierarchical relationship as follows:
Super-class: IT Specialist
Sub-classes: Databases, Java, UNIX, PHP
Given that each instance of a super-class may not be a member of a subclass and a super-class instance may be a member of two or more sub-classes, how would I go about implementing this system?
I haven't been given any attributes to assign to the entities so I find this very vague and I'm at a loss where to start.
To get started, you would have one table that contains all of your super-classes. In your example there would only be IT Specialist, but it could also contain things like Networking Specialist or Digital Specialist; I've included these to give a bit more flavour:
ID | Name                  |
----------------------------
1  | IT Specialist         |
2  | Networking Specialist |
3  | Digital Specialist    |
You also would have another table that contains all of your sub-classes:
ID | Name      |
----------------
1  | Databases |
2  | Java      |
3  | UNIX      |
4  | PHP       |
For example, let's say that a Networking Specialist needs to know about Databases, and a Digital Specialist needs to know about both Java and PHP. An IT Specialist would need to know all four fields listed above.
There are two possible ways to go about this. One such way would be to set 'flags' in the sub-class table:
ID | Name      | Is_IT | Is_Networking | Is_Digital
---------------------------------------------------
1  | Databases | 1     | 1             | 0
2  | Java      | 1     | 0             | 1
3  | UNIX      | 1     | 0             | 0
4  | PHP       | 1     | 0             | 1
Keep in mind, this is only using a small number of skills. If you started to have a lot of super-classes, the columns in the sub-class table could get out of hand pretty quickly.
Fortunately, you can also use something known as a bridging table (also known as an associative entity). Essentially, a bridging table allows you to have two foreign keys that are primary keys in another table, solving the problem of a many-to-many relationship.
You would set this up by having a new table that associates which sub-classes belong with which super-classes:
ID | Sub-class ID | Super-class ID |
------------------------------------
1  | 1            | 1              |
2  | 1            | 2              |
3  | 2            | 1              |
4  | 2            | 3              |
5  | 3            | 1              |
6  | 4            | 1              |
7  | 4            | 3              |
Note that there are 'duplicates' in both the sub-class ID and super-class ID fields, yet no duplicates in the ID field. This is because the bridging table has unique IDs, which it uses to make independent associations. Sub-class 1 (Databases) needs to be associated to two different groups (IT Specialist and Networking Specialist). Thus, two different associations need to be formed.
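To make the bridging-table approach concrete, here is a minimal sketch in SQL; the table and column names are illustrative, not prescribed:
CREATE TABLE super_class (
    id   INT PRIMARY KEY,
    name VARCHAR(50) NOT NULL
);

CREATE TABLE sub_class (
    id   INT PRIMARY KEY,
    name VARCHAR(50) NOT NULL
);

-- the bridging table: each row is one association between
-- a sub-class and a super-class
CREATE TABLE class_membership (
    id             INT PRIMARY KEY,
    sub_class_id   INT NOT NULL REFERENCES sub_class (id),
    super_class_id INT NOT NULL REFERENCES super_class (id),
    UNIQUE (sub_class_id, super_class_id)  -- no duplicate associations
);

-- all skills an IT Specialist needs, via the bridge
SELECT sub.name
FROM class_membership AS m
JOIN sub_class   AS sub ON sub.id = m.sub_class_id
JOIN super_class AS sup ON sup.id = m.super_class_id
WHERE sup.name = 'IT Specialist';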
Both approaches above give the same 'result'. The only real difference here is that a bridging table will give you more rows, while setting multiple flags will give you more columns. Obviously, the way in which you craft your query will be different as well.
Which of the two approaches you choose to go with really depends on how much data you're dealing with, and how much scope the database is going to have for expansion in the future :)
Hope this helps! :)
