Contextualize expanded JSON-LD for a given @context

The expansion algorithm https://www.w3.org/TR/json-ld-api/#expansion expands a JSON-LD document using its @context.
Is there also a "reverse" algorithm which, given an expanded JSON-LD document and a context definition (IRI or document), will generate JSON-LD where the absolute IRIs, blank node identifiers, or keywords are "compacted" according to the context definitions?
I can't find it in https://github.com/jsonld-java/jsonld-java or on json-ld.org/spec/latest/json-ld-api/

The Compaction Algorithm in the JSON-LD API does much of that. Non-document-relative IRIs are compacted to terms or prefixed names consistent with the definitions in your context. For document-relative IRIs (such as the value of @id), if you include @base within the context, IRIs will be made relative to that base. To make IRIs relative without hard-coding @base in your context, a processor may provide a mechanism to pass a base into the compaction algorithm; I can't say what jsonld-java does. In the Ruby JSON-LD gem, the compact algorithm accepts a base option for doing this.
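For concreteness, here is a hedged sketch using the Python pyld processor (an assumption; the question is about jsonld-java, whose API may differ). jsonld.compact applies the Compaction Algorithm, and the base is passed as an option rather than hard-coded via @base in the context; whether @id values are actually relativized against it depends on the processor and its options. The IRIs are illustrative.
import json
from pyld import jsonld

# An expanded JSON-LD document (illustrative IRIs).
expanded = [{
    "@id": "http://example.org/docs/widget1",
    "http://schema.org/name": [{"@value": "Widget"}]
}]

context = {"@context": {"name": "http://schema.org/name"}}

# Compact against the context; "base" is supplied as an option instead of
# putting @base into the context itself (option support may vary by processor).
compacted = jsonld.compact(expanded, context, {"base": "http://example.org/docs/"})
print(json.dumps(compacted, indent=2))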

Related

how to do logistic partial least squares using ordinal explanatory variables

This is a general question without code.
My dataframe consists of a binary response variable and ordinal predictor variables (Likert-type scale). I want to do partial least squares by extracting the most relevant components from the predictor variables (first stage) and using those as my new predictors in a logit model in the second stage (since my response is binary).
So far, the package plsRglm seems the most applicable, since it allows a logit in the second stage. The challenge is that plsRglm does not seem to have a provision for ordinal factor variables. If you know the plsRglm package, could you please suggest how to handle ordinal factor variables?
Or could you suggest another package that solves this problem?
Thanks
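To make the two-stage workflow described above concrete, here is a minimal sketch using scikit-learn as a stand-in (an assumption; the question is about R's plsRglm, and coding the Likert items as plain integers ignores their ordinal nature, so treat this as illustrative only).
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(200, 10)).astype(float)   # 10 Likert-type items coded 1-5
y = (X[:, :3].sum(axis=1) + rng.normal(size=200) > 9).astype(int)  # binary response

# Stage 1: extract a few PLS components from the predictors.
pls = PLSRegression(n_components=3)
pls.fit(X, y)
scores = pls.transform(X)              # latent component scores (n_samples x 3)

# Stage 2: logistic regression on the component scores.
logit = LogisticRegression().fit(scores, y)
print(logit.score(scores, y))          # in-sample accuracy, just to show it runs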

Best way to represent part-of (mereological) transitivity for OWL classes?

I have a background in frame-based ontologies, in which classes represent concepts and there's no restriction against assertion of class:class relationships.
I am now working with an OWL2 ontology and trying to determine the best/recommended way to represent "canonical part-of" relationships - conceptually, these are relationships that are true, by definition, of the things represented by each class (i.e., all instances). The part-of relationship is transitive, and I want to take advantage of that so that I'd be able to query the ontology for "all parts of (a canonical) X".
For example, I might want to represent:
"engine" is a part of "car", and
"piston" is a part of "engine"
and then query transitively, using SPARQL, for parts of cars, getting back both engine and piston. Note that I want to be able to represent individual cars later (and be able to deduce their parts by reference to their rdf:type), and of course I want to be able to represent sub-classes of cars as well, so I can't model the above-described classes as individuals - they must be classes.
It seems that I have 3 options using OWL, none ideal. Is one of these recommended (or are any discouraged), and am I missing any?
1. OWL restriction axioms:
rdfs:subClassOf(engine, someValuesFrom(partOf, car))
rdfs:subClassOf(piston, someValuesFrom(partOf, engine))
The major drawback of the above is that there's no way in SPARQL to query transitively over the partOf relationship, since it's embedded in an OWL restriction. I would need some kind of generalized recursion feature in SPARQL - or I would need the following rule, which is not part of any standard OWL profile as far as I can tell:
antecedent (body):
subClassOf(B, (P some A)) ^
subClassOf(C, (P some B)) ^
transitiveProperty(P)
consequent (head):
subClassOf(C, (P some A))
2. OWL2 punning: I could effectively represent the partOf relationships on canonical instances of the classes, by asserting the object-property directly on the classes. I'm not sure that that'd work transparently with SPARQL though, since the partOf relationships would be asserted on instances (via punning) and any subClassOf relationships would be asserted on classes. So if I had, for example, a subclass six_cylinder_engine, the following SPARQL snippet would not bind six_cylinder_engine:
?part (rdfs:subClassOf*/partOf*)+ car
3. Annotation property: I could assert partOf as an annotation property on the classes, with values that are also classes. I think that would work (minus transitivity, but I could recover that easily enough with SPARQL paths as in the query above), but it seems like an abuse of the intended use of annotation properties.
I think you have performed a good analysis of the problem and the advantages/disadvantages of different approaches. I don't know if any one is discouraged or encouraged. IMHO this problem has not received sufficient attention, and is a bigger problem in some domains than others (I work in bio-ontologies which frequently use partonomies, and hence this is very important).
For 1, your rule is valid and justified by OWL semantics. There are other ways to implement this using OWL reasoners, as well as RDF-level reasoners. For example, using the ROBOT command-line wrapper for the OWLAPI, you can run the reason command with an Expression Materializing Reasoner, e.g.:
robot reason --exclude-tautologies true --include-indirect true -r emr -i engine.owl -o engine-reasoned.owl
This will give you an axiom piston subClassOf partOf some car that can be queried using a non-transitive SPARQL query.
The --exclude-tautologies option removes inferences to owl:Thing, and --include-indirect will include indirect (transitive) inferences.
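For concreteness, here is a hedged sketch of that non-transitive query using rdflib in Python; the file name comes from the robot command above, while the :partOf and :car IRIs are placeholders for whatever your ontology uses.
from rdflib import Graph

g = Graph()
g.parse("engine-reasoned.owl", format="xml")   # RDF/XML output of the robot step above

q = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX :     <http://example.org/vehicles#>

SELECT ?part WHERE {
    # Materialized existential: ?part subClassOf (partOf some car)
    ?part rdfs:subClassOf [ owl:onProperty :partOf ;
                            owl:someValuesFrom :car ] .
}
"""
for row in g.query(q):
    print(row.part)   # expected to include the piston class after materialization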
For your option 2, you have to be careful that you may introduce incorrect inferences. For example, assume there are some engines without pistons, i.e. engine SubClassOf inverse(partOf) some piston does not hold. However, in your punned shadow world, this would be entailed. This may or may not be a problem depending on your use case.
A variant of your option 2 is to introduce different mapping rules for layering OWL TBoxes onto RDF, such as described in my OWLStar proposal. With this proposal, existentials would be mapped to direct triples, but there is another mechanism (e.g. reification) to indicate the intended quantification. This allows writing rules that are both safe (no undesired inferences) and complete (for anything expressible in OWL-RL). Here there is no punning (under the alternative RDF-to-OWL interpretation). You can also use the exact same transitive SPARQL query you wrote to get the desired results.

Efficient Way of Filtering URLs by looking through keyword list

What is the best way to filter URLs by checking whether a keyword is inside the URL or not?
I have a list of keywords (a kind of blacklist) which contains 50000 words.
The search method uses the following steps:
While (not at end of keywords)
1. Get the keyword from database
2. Check whether the keyword is in the url
3. Redirect the user to a specific page.
When I use this method, the CPU usage goes up to around 90%. Is there an efficient way to do this? It seems that I can't use regex, since the keywords always change.
The problem is multi-pattern search and can be solved effectively with the Aho-Corasick algorithm. This algorithm searches for a set of strings simultaneously. Its complexity is linear in the total length of the keywords plus the length of the URL plus the number of output matches.
Check whether the keyword is in the url
[...]
Is there an efficient way to do this?
The other way around will be much more efficient: split the URL into candidate keywords and look them up in the database.
To speed up the database lookup, you can use a variety of methods: for example, sort the database and do a binary search, use a trie structure, a hash table, and so on.
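A minimal sketch of that reverse lookup, assuming the keyword list has been loaded from the database into an in-memory Python set (all names here are illustrative). Note that it only catches keywords aligned with token boundaries, unlike the Aho-Corasick approach below, which matches arbitrary substrings.
import re

blacklist = {"poker", "casino", "warez"}   # hypothetical keywords loaded from the DB

def is_blocked(url):
    # Split the URL into tokens and check each one against the keyword set;
    # average-case set lookup is O(1), so the cost scales with the URL length
    # rather than with the 50,000 keywords.
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    return any(token in blacklist for token in tokens)

print(is_blocked("http://example.com/free-poker-games"))   # True
print(is_blocked("http://example.com/recipes"))            # False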
The Aho-Corasick algorithm is the best solution for this problem.
Here is a Python implementation: pyahocorasick.
Below is a code sample:
import ahocorasick
A = ahocorasick.Automaton()
for index, word in enumerate('asim sinan yuksel uksel sel sina sim asi as nan an in ina uks .com .co www. http//'.split()):
A.add_word(word, (index, word))
A.make_automaton()
for item in A.iter('http://wwww.asimsinanyuksel.com'):
    print(item)

Inserting a TermVector into Lucene

Learning how to use Lucene!
I have an index in Lucene which is configured to store term vectors.
I also have a set of documents I have already constructed custom term vectors for (for an unrelated purpose) not using Lucene.
Is there a way to insert them directly into the Lucene inverted index in lieu of the original contents of the documents?
I imagine one way to do this would be to generate bogus text using the term vector, with the appropriate number of term occurrences, and then to feed the bogus text in as the contents of the document. This seems silly because ultimately Lucene will have to convert the bogus text back into a term vector in order to index it.
I'm not entirely sure what you want to do with these term vectors ultimately (score? just retrieve?) but here's one strategy I might advocate for.
Instead of focusing on faking out the text attribute of term vectors, consider looking into payloads, which attach arbitrary metadata to each token. During analysis, text is converted to tokens. This includes emitting a number of attributes about each token. There are standard attributes like position, term character offsets, and the term string itself. All of these can be part of the uninverted term vector. Another attribute is the payload, which is arbitrary metadata you can attach to a term.
You can store any token attribute uninverted as a "term vector" including payloads, which you can access at scoring time.
To do this you need to
Configure your field to store term vectors, including term vectors with payload
Customize analysis to emit payloads that correspond to your terms. You can read more here
Use IndexReader.getTermVector to pull back Terms. From that you can get a TermsEnum. You can then use that to get a DocsAndPositionsEnum, which has an accessor for the current payload
If you want to use this in scoring, consider a custom query or custom score query

how to do fuzzy search in big data

I'm new to this area and I'm mostly wondering what the state of the art is and where I can read about it.
Let's assume that I just have a key/value store and I have some distance(key1,key2) defined somehow (not sure if it must be a metric, i.e. if the triangle inequality must hold always).
What I want is mostly a search(key) function which returns all items with keys up to a certain distance from the search key. Maybe that distance limit is configurable. Maybe this is also just a lazy iterator. Maybe there can also be a count limit, and an item (key, value) is in the returned set with some probability P, where P = 1/distance(key, search-key) or so (i.e., a perfect match would certainly be in the set and close matches at least with high probability).
One example application is fingerprint matching in MusicBrainz. They use the AcoustID fingerprint and have defined this compare function. They use the PostgreSQL GIN index and, I guess (although I haven't fully understood/read the acoustid-server code), the GIN partial match algorithm, but I haven't fully understood whether that is what I asked for and how it works.
For text, what I have found so far is to use some phonetic algorithm to simplify words based on their pronunciation. An example is here. This is mostly to break the search-space down to a smaller space. However, that has several limitations, e.g. it must still be a perfect match in the smaller space.
But anyway, I am also searching for a more generic solution, if that exists.
There is no (fast) generic solution; each application will need a different approach.
Neither of the two examples actually does traditional nearest-neighbor search. AcoustID (I'm the author) is just looking for exact matches, but it searches over a very high number of hashes in the hope that some of them will match. The phonetic search example uses Metaphone to convert words to their phonetic representation and is also only looking for exact matches.
You will find that if you have a lot of data, exact search using huge hash tables is the only thing you can realistically do. The problem then becomes how to convert your fuzzy matching to exact search.
A common approach is to use locality-sensitive hashing (LSH) with a smart hashing method, but as you can see in your two examples, sometimes you can get away with an even simpler approach.
Btw, if you are looking specifically for text search, the simplest thing you can do is split your input into N-grams and index those. Depending on how your distance function is defined, that might give you the right candidate matches without too much work.
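A minimal sketch of that N-gram idea, using plain Python dictionaries as a stand-in for a real key/value store (the names and the min_shared threshold are illustrative, not a tuned recipe).
from collections import defaultdict

def ngrams(text, n=3):
    # Character trigrams of the string.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

index = defaultdict(set)              # n-gram -> set of keys containing it

def add(key):
    for g in ngrams(key):
        index[g].add(key)

def candidates(query, min_shared=2):
    # Keys sharing at least `min_shared` n-grams with the query become
    # candidates; exact-match lookups in the index replace a fuzzy comparison
    # against every stored key.
    counts = defaultdict(int)
    for g in ngrams(query):
        for key in index[g]:
            counts[key] += 1
    return [k for k, c in counts.items() if c >= min_shared]

for k in ["metaphone", "megaphone", "telephone"]:
    add(k)
print(candidates("metafone"))         # ['metaphone']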
I suggest you take a look at FLANN (Fast Library for Approximate Nearest Neighbors). Fuzzy search in big data is also known as approximate nearest-neighbor search.
This library offers you different metrics, e.g. Euclidean and Hamming, and different indexing methods, LSH or k-means trees for instance.
The search is always in two phases. First you feed the system with data to train the algorithm; this is potentially time consuming depending on your data.
I successfully clustered 13 million data points in less than a minute, though (using LSH).
Then comes the search phase, which is very fast. You can specify a maximum distance and/or a maximum number of neighbors.
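A hedged sketch of those two phases using the pyflann bindings (the package name and option values are assumptions about your setup; treat this as illustrative rather than the exact API).
import numpy as np
from pyflann import FLANN

dataset = np.random.rand(10000, 128)   # vectors to index
queries = np.random.rand(5, 128)       # search vectors

flann = FLANN()
# Phase 1 (index build) and phase 2 (search) both happen inside nn(); a
# k-means tree is requested here, and num_neighbors caps the results per query.
neighbors, dists = flann.nn(dataset, queries, num_neighbors=3,
                            algorithm="kmeans", branching=32,
                            iterations=7, checks=16)
print(neighbors)   # indices into dataset for the approximate nearest neighbors
print(dists)       # corresponding distances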
As Lukas said, there is no good generic solution; each domain will have its own tricks to make it faster, or a better way that exploits the inherent properties of the data you're using.
Shazam uses a special technique with geometrical projections to quickly find your song. In computer vision we often use the BOW: Bag of words, which originally appeared in text retrieval.
If you can see your data as a graph, there are other methods for approximate matching using spectral graph theory for instance.
Let us know.
Depending on what your keys/values are like, the Levenshtein algorithm (also called edit distance) can help. It calculates the least number of edit operations necessary to transform one string into another.
http://en.wikipedia.org/wiki/Levenshtein_distance
http://www.levenshtein.net/
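For reference, here is a short, self-contained Python implementation of the Levenshtein distance described above (the standard dynamic-programming formulation).
def levenshtein(a, b):
    # prev[j] holds the edit distance between the processed prefix of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))   # 3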

Resources