Ontology Matching Evaluation - OWL

Fellow students and I have developed an ontology-matching algorithm in a student project.
We have chiefly evaluated our algorithm on the OAEI datasets
(http://oaei.ontologymatching.org/), but we would like to test it on others as well.
Does anyone know of other datasets or challenges like OAEI?

Related

Difference between homonyms and synonyms in data science with examples

Please share the difference between homonyms and synonyms in data science, with examples.
Synonyms for concepts:
When you determine that two concepts are synonyms (say, sofa and couch), you use the OWL property owl:equivalentClass. The entailment is that any instance that was a member of the class sofa is now also a member of the class couch, and vice versa. One of the nice things about this approach is that the "context" of the equivalence is automatically scoped to the ontology in which you make the equivalence statement. If you had a very small mapping ontology between a furniture ontology and an interior decorating ontology, you could state in the mapping that these two classes are equivalent. In another situation, if you needed to retain the (subtle) difference between a couch and a sofa, you could do that by simply not including the mapping ontology that declared them equivalent.
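A minimal mapping ontology along these lines might look like the following Turtle sketch (the furn: and decor: namespaces are hypothetical):

```turtle
@prefix owl:   <http://www.w3.org/2002/07/owl#> .
@prefix furn:  <http://example.org/furniture#> .
@prefix decor: <http://example.org/decorating#> .

# Stating the equivalence only in this small mapping ontology scopes it:
# leave this file out of the import closure and the two classes stay distinct.
furn:Sofa owl:equivalentClass decor:Couch .
```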
Homonyms for concepts:
As Led Zeppelin says, "and you know sometimes words have two meanings…" When a "word" has two meanings, we have what WordNet would call "word senses." In a particular language, a set of characters may represent more than one concept. One example is the English word "mole," for which WordNet has six word senses. The Semantic Web approach is to give each sense its own namespace; for instance, I might refer to the counterspy mole as cia:mole and the burrowing rodent as mammal:mole. (These are shortened qnames for what would be full namespace names.) The nice thing about this is that if the CIA ever needed to refer to the rodent, they could unambiguously refer to mammal:mole.
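The two senses of "mole" could be kept apart the same way; in this hypothetical Turtle sketch, only the human-readable label coincides:

```turtle
@prefix cia:    <http://example.org/intelligence#> .
@prefix mammal: <http://example.org/zoology#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .

# Same label, two unrelated concepts in different namespaces.
cia:mole    rdfs:label "mole" .
mammal:mole rdfs:label "mole" .
```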
1. Homonyms are words that have the same sound (or spelling) but differ in meaning.
2. Synonyms are words that have the same, or almost the same, meaning.
Homonyms
Machine learning algorithms are now the subject of ethical debate, and several everyday words carry different meanings across statistics, machine learning, and computer science:

Bias: in layman's terms, a pre-formed view created before the facts are known; in machine learning and data mining, an estimating procedure's tendency to produce estimates or predictions that are, on average, off target.

Confidence: one of several ways to measure the strength of a rule or policy; for statisticians, a metric for how reliable a sample is (we are 95 percent confident that the average blood sugar in the group lies between X and Y, based on a sample of N patients).

Decision tree: a diagram that shows the decisions being made and the outcomes available; also a family of algorithms that split the data into pieces that become more and more homogeneous with respect to the outcome measure.

Normalise: to rescale a statistic to match the scale of other variables in a model; in databases, the practice of organising relational tables and their columns so that table relationships are consistent.

Graph: what statisticians would call a plot or chart; to computer programmers, a data structure that captures the ties and links among items.
Synonyms
Record, instance, sample, and example are largely interchangeable terms for a single data point; the spreadsheet format, in which each column is a variable and each row is a record, is perhaps the most common non-time-series data layout. Likewise, what a statistician calls a variable may be called an attribute, input variable, or feature in computer science and machine learning. Prediction and estimation overlap as well, though in statistics estimation more often refers to using a sample statistic to measure something, and the term is generally limited to numeric outcomes. Finally, modelling in machine learning and artificial intelligence often begins with very low-level prediction data, and predictive modelling involves aggregating those low-level predictors into more informative "features".

short text syntactic classification

I am a newbie at machine learning and data mining. Here's the problem: I currently have one input variable, a short text made up of non-standard nouns, and I want to classify it into a target category. I have labels for about 40% of the dataset to use as training data; the remaining 60% we would like to classify as accurately as possible. The following are some input values, across multiple observations, that are assigned the 'LEAD_GENERATION_REPRESENTATIVE' title.
"Business Development Representative MFG"
"Business Development Director Retail-KK"
"Branch Staff"
"Account Development Rep"
"New Business Rep"
"Hong Kong Cloud"
"Lead Gen, New Business Development"
"Strategic Alliances EMEA"
"ENG-BDE"
I think the above gives an idea of what I mean by non-standard nouns. I can see a few tokens here that are meaningful, like 'development', 'lead', and 'rep'. Others seem random, without any semantics, but they may appear multiple times in the data. Another issue is that some tokens, like 'rep' and 'account', can appear in multiple categories. I think that will make weighting/similarity a challenging task.
My first question is: is it worth automating this kind of classification?
Second: is it a good problem for learning machine-learning classification? There are only 30k such entries and a handful of target categories; I could find someone to do it manually, which would also be more accurate.
Here's my take on the problem so far:
Full-text engine: something like Solr, building an index and query rules that draw matches based on tokens - words, phrases, synonyms, acronyms, descriptions. I can get someone to define a detailed taxonomy for each category, and use boosting and a pluggable scoring library.
Machine learning:
Naive Bayes classification
Decision tree
SVM
I have tried Solr for this with a reverse lookup, since I don't have a taxonomy available at the moment. It seems I can get about 80% true positives (I'll have to dig further into the confusion matrix to reduce false positives). My query is a bunch of boolean terms and phrases with proximity, boosts, and negations to reduce errors. I'm afraid this approach may overfit and won't scale.
I am aware that people usually try multiple modelling techniques to see which one works best, or derive a combination of techniques. I want to understand this problem from a feasibility and complexity point of view. If it's too broad a question, please just comment on the feasibility of a solution.
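As a feasibility sketch of the Naive Bayes option, here is a tiny multinomial Naive Bayes over title tokens in stdlib Python. The second category and the engineering titles are invented for illustration; real data would have many more categories and examples:

```python
import math
from collections import Counter, defaultdict

def tokenize(title):
    """Lowercase a job title and split it into simple word tokens."""
    return [t for t in title.lower().replace("-", " ").replace(",", " ").split() if t]

class TitleNB:
    """Multinomial Naive Bayes over title tokens, with add-one smoothing."""

    def fit(self, titles, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)   # class -> token -> count
        self.vocab = set()
        for title, label in zip(titles, labels):
            for tok in tokenize(title):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, title):
        toks = tokenize(title)
        best_label, best_score = None, -math.inf
        total = sum(self.class_counts.values())
        for label, n in self.class_counts.items():
            score = math.log(n / total)            # log class prior
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for tok in toks:                        # log likelihood, smoothed
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training set: the engineering titles and label are made up.
train = [
    ("Business Development Representative MFG", "LEAD_GENERATION_REPRESENTATIVE"),
    ("Account Development Rep",                 "LEAD_GENERATION_REPRESENTATIVE"),
    ("New Business Rep",                        "LEAD_GENERATION_REPRESENTATIVE"),
    ("Lead Gen, New Business Development",      "LEAD_GENERATION_REPRESENTATIVE"),
    ("Software Engineer Backend",               "ENGINEER"),
    ("Senior Software Engineer",                "ENGINEER"),
]
model = TitleNB().fit([t for t, _ in train], [c for _, c in train])
print(model.predict("Business Development Rep"))  # -> LEAD_GENERATION_REPRESENTATIVE
```

Even a model this naive shows how shared tokens like 'rep' and 'development' carry the weight, which is exactly where the cross-category token problem mentioned above will surface.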

How to get a trained Watson natural language classifier to NOT pick up a class?

When using the nice demo at http://watson-on-classifier.mybluemix.net, you sometimes get the answer "Sorry, I don't understand the question. Please try to rephrase it." when your question is not related to any of the supported themes.
I don't understand how to do this with the Watson natural language classifier: it seems to me that whatever the input, it chooses one of the classes it has been trained on... How do you reject some inputs as "does not match any of the classes with enough confidence"?
Thanks for your help.
Roughly speaking, what NLC does behind the scenes (I guess) is try to correlate one statement with another, based on concepts parsed from the input text and computed against some ontology, so it can find synonyms or concepts that are "kind of" or "part of" other concepts.
So, in order to get a rejection, I can see three possible ways:
the entry has no correlation to any of the data used in the classifier because the concepts are too far from the concepts of the training data, in the ontology
the entry has equal correlation to more than one category, so the system can't tell if it belongs to one or another
the entry has correlation with one category, but the confidence level is too low, so it does not satisfy some threshold defined by the system
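The three rejection cases above can be sketched as a post-processing step on classifier confidences. The scores and thresholds here are illustrative, not Watson's actual defaults:

```python
def classify_with_rejection(scores, min_conf=0.5, min_margin=0.1):
    """Pick the top class from a {class: confidence} dict, or reject (None).

    Rejection mirrors the three cases: no correlation with any class,
    top confidence below a threshold, or the top two confidences too
    close to call. Threshold values are made up for illustration.
    """
    if not scores:
        return None                      # no correlation with any class
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_class, top_conf = ranked[0]
    if top_conf < min_conf:
        return None                      # confidence too low
    if len(ranked) > 1 and top_conf - ranked[1][1] < min_margin:
        return None                      # ambiguous between two classes
    return top_class

print(classify_with_rejection({"billing": 0.92, "tech": 0.05}))  # -> billing
print(classify_with_rejection({"billing": 0.48, "tech": 0.45}))  # -> None
```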
NLC always returns answers in order of confidence, and the system is set up so that if all intents fall below a certain level of confidence it will not return an answer.
That threshold is defined by the person writing the application.

Ontology Dataset for "Project"?

I am looking for an ontology dataset that can describe the concept "Project". For example:
the type of the project (new project, improvement, translation, ...)
industries involved in the project (mobile applications, design, music)
products of the project
But I can't even find a keyword to search for in such a case.
Are there any suitable datasets for my case?
Check this one: http://ckan.citek.ipn.pt/dataset/r-d-industry-ontology-and-data
Also, check the website below for similar datasets:
http://datahub.io/dataset

Data Warehousing - OLAP operations

I would like to know how to find the standard deviation of final scores in a data warehouse (represented by a schema) for a university's gradebook, using OLAP operations (slicing, drilling). I cannot post the image of the schema because I don't have enough reputation points.
The schema has the following dimensions:
course
student
semester
instructor
department
gradebook
Could you please help with this?
I think you need to be more specific in your question. Are you talking about a specific product vendor in relation to OLAP databases - Analysis Services, Oracle OLAP, etc.? For example, if you are using Analysis Services, the MDX language has functions (StdDev and StdDevP) that calculate the sample and population standard deviation of an appropriate set you pass in. Oracle doesn't use MDX but has an equivalent function (STDDEV).
If, more generally, you want to understand how to calculate the standard deviation, then it's a mathematical formula that has nothing to do with OLAP, and I would recommend Brandon Foltz's excellent, easy-to-consume series of videos on a broad range of statistical topics. There are three on standard deviation on his blog - you will find them in the middle of the page (Statistics 101: Standard Deviation and NFL Field Goals - Parts 1/3, 2/3, 3/3).
Either way, the second suggestion will help you understand what to do with the first, once you consult the appropriate product documentation.
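To make the sample-vs-population distinction behind StdDev and StdDevP concrete, the two can be compared with a few lines of stdlib Python (the scores are made up):

```python
import statistics

final_scores = [72, 85, 90, 66, 78]   # hypothetical gradebook scores

# Sample standard deviation (divides by n-1), like MDX's StdDev.
sample_sd = statistics.stdev(final_scores)
# Population standard deviation (divides by n), like MDX's StdDevP.
population_sd = statistics.pstdev(final_scores)

print(round(sample_sd, 2), round(population_sd, 2))  # -> 9.65 8.63
```

The sample figure is always a bit larger, because dividing by n-1 corrects for estimating the mean from the same data.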

Resources