I need to write some metadata to an index in Lucene. This metadata describes the relationships between indexes, which helps me run cross-index queries.
The metadata is structured as key-value pairs. The key may be an Integer or a String, and the value is a list of Integers or Strings.
At first I tried to extend the Codec, but this key-value data obviously does not belong in any of the existing Formats. Then I turned to writing it as a field, but it does not belong to the indexed documents either, and fields are hard to change.
How can I store this metadata? Thank you.
I'm working on a transformation task in which I need to transform a property graph dataset into an RDF dataset. There are many n-ary relationships that need to be treated as classes, but I do not know how to assign a unique identifier to these relationships. I tried to use the row index, but I have more than one file in this project, so that can't work. So I would like to know how you assign a unique identifier to relationships; if a URI is the solution, how do we do this in an OntoRefine mapping? Thank you for your answers.
Lee
There are several ways to address this:
- Ideally, use some characteristics of the related entities to make a deterministic URL (see the sketch after this list). E.g., if you're making a position (membership) node between a person and an org that involves a mandatory role and start date, you could use a URL like org/<org_id>/person/<person_id>/role/<role_id>/date/<date>.
- Use a blank node. In that case you don't need to worry about a URI at all.
- Use the row index, but prepend it with the table/file name (as a constant).
- Use the GREL function random(). It doesn't produce a globally unique identifier, but if you ask for a large enough range, it'll be unique with very high probability.
- Use a Jython function, as shown in "How to create UUID in Openrefine based on the MD5 hash of the values".
- If you do your mapping using SPARQL, use the built-in uuid() function.
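A minimal Java sketch of the first option, building a deterministic URI from the related entities (the base namespace and the identifiers are illustrative, not anything OntoRefine prescribes):

```java
// Build a deterministic URI for an n-ary "membership" relationship from the
// identifiers of the entities it connects. The same inputs always produce
// the same URI, so re-running the mapping over several files is idempotent.
public final class MembershipUri {

    private static final String BASE = "http://example.org/"; // illustrative namespace

    public static String of(String orgId, String personId, String roleId, String startDate) {
        return BASE + "org/" + orgId
                + "/person/" + personId
                + "/role/" + roleId
                + "/date/" + startDate;
    }

    public static void main(String[] args) {
        // -> http://example.org/org/acme/person/42/role/ceo/date/2020-01-01
        System.out.println(of("acme", "42", "ceo", "2020-01-01"));
    }
}
```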
Let's say you have a simple forms automation application, and you want to index every submitted form in a Solr collection. Let's also say that form content is open-ended so that the user can create custom fields on the form and so forth.
Since users can define custom forms, we can't really predefine fields in Solr, so we've been using Solr's "schema-less" (managed schema) mode. It works well, except for one problem.
Let's say a form comes through with a field called "ID" and a value of "9". If this is the first time Solr has seen a field called "ID", it dutifully updates its schema, and since the value of this field is numeric, Solr assigns it one of its numeric data types (we see "plong" a lot).
Now, let's say that the next day, someone submits another instance of this same form, but in the ID field, they type their name instead of entering a number. Solr spits this out and won't index this record because the schema says ID should be numeric, but on this record, it's not.
The way we've been dealing with this so far is to trap the exception we get when a field's data type disagrees with the schema, and then use the Solr Schema API to alter the schema, making the field in question a text or string type instead of a numeric one (see the sketch below).
Of course, when we do this, we need to reindex the entire collection since the schema changed, and so we need to persist all the original data just in case we need to re-index everything after one of these schema data-type collisions. We're big Solr fans, but at the same time, we wonder whether the benefits of using the search engine outweigh all this extra work that gets triggered if a user simply enters character data in a previously numeric field.
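A hedged sketch of that recovery step using SolrJ's Schema API (the collection name "forms" and the field "ID" are illustrative; how you detect the type collision depends on your indexing code):

```java
import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;

public class SchemaRepair {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            // Replace the previously guessed numeric type with a string type.
            // After this call the whole collection must be reindexed, because
            // existing documents were indexed under the old type.
            Map<String, Object> field = Map.of(
                    "name", "ID",
                    "type", "string", // or "text_general" for tokenized search
                    "stored", true);
            new SchemaRequest.ReplaceField(field).process(solr, "forms");
        }
    }
}
```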
Is there a way to just have Solr always assign something like "text_general" for every field, or is there some other better way?
I would say that you might need to handle the ID values at your application end.
It would be good to add validation for ID, so that ID is always either a string or numeric.
This would resolve your issue permanently: once the type is decided, you don't have to do anything on the Solr side.
The alternative approach would be to have a fixed schema.xml, and in it add a field ID with a fixed fieldType.
I would suggest you go with string as the fieldType for ID if you don't want Solr to tokenize the data and you want exact matches in search. If you would like more flexibility in searching the ID field, you can use the text_general fieldType instead. You can also create your own fieldType, with a tokenizer and filters configured according to your requirements for the ID field.
Also, don't use schemaless mode in production. Another option is to map your field names to a dynamic field definition: create a dynamic field such as *_t for text fields, and every field whose name ends with _t will be mapped to it (see the sketch below).
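A minimal SolrJ sketch of that mapping idea, assuming a *_t dynamic field of type text_general is already defined in the schema (the collection and field names are illustrative):

```java
import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DynamicFieldIndexer {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            // User-defined form fields, names unknown in advance.
            Map<String, String> formFields = Map.of("ID", "not-a-number", "Comments", "hello");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "form-123"); // the collection's uniqueKey
            // Route every custom field to the *_t dynamic field, so each one
            // is always indexed as text regardless of what the value looks like.
            formFields.forEach((name, value) -> doc.addField(name + "_t", value));

            solr.add("forms", doc);
            solr.commit("forms");
        }
    }
}
```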
Forgive me, I'm just a newbie with Solr, trying to understand some of its basic concepts.
I quoted the following passage about the inverted index:
This is like retrieving pages in a book related to a keyword by scanning the index at the back of a book, as opposed to searching every word of every page of the book.
This type of index is called an inverted index, because it inverts a page-centric data structure (page->words) to a keyword-centric data structure (word->pages).
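To make the quoted passage concrete, here is a toy in-memory sketch of the word->documents structure in Java (nothing Solr-specific, just the idea):

```java
import java.util.*;

public class ToyInvertedIndex {
    // word -> set of ids of the documents containing that word
    private final Map<String, Set<Integer>> index = new HashMap<>();

    public void add(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            index.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
        }
    }

    public Set<Integer> search(String word) {
        return index.getOrDefault(word.toLowerCase(), Set.of());
    }

    public static void main(String[] args) {
        ToyInvertedIndex idx = new ToyInvertedIndex();
        idx.add(1, "Solr is built on Lucene");
        idx.add(2, "Lucene uses an inverted index");
        System.out.println(idx.search("lucene")); // prints [1, 2]
    }
}
```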
In my understanding, the index maps each token term to the documents that contain it. But I can't understand what the fields of a document are used for during indexing and querying. As I understand it, at query time Solr just searches the index and finds the document, so it has nothing to do with fields. Right? Thanks.
Documents (which can have one or more fields) are the I/O entities exchanged between client and server during the index and the query phases. The inverted index is a low-level concept (hidden from the client); it is the immutable, underlying data structure that Solr uses to organize its data.
Solr uses fields for searching and indexing; a document is a logical grouping of them. Speaking (improperly) in RDBMS terminology:
- Document = record
- Field = column values belonging to that record
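A small Lucene sketch of this relationship (Lucene's Java API, since Solr is built on it): every indexed term lives under a field, and a query names the field it searches.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class FieldsDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // Two fields in the same document: terms are indexed per field.
            doc.add(new TextField("title", "inverted index basics", Field.Store.YES));
            doc.add(new TextField("body", "the index maps terms to documents", Field.Store.YES));
            writer.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // The query is field-scoped: "basics" matches in title, not in body.
            System.out.println(searcher.search(new TermQuery(new Term("title", "basics")), 10).totalHits);
            System.out.println(searcher.search(new TermQuery(new Term("body", "basics")), 10).totalHits);
        }
    }
}
```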
I am trying to index millions of strings, that are associated with metadata objects.
Each metadata object can have n thousands of strings.
I need to be able to search both string content, and the associated object metadata.
Currently this means that I am indexing copies of the relevant metadata fields alongside each string, which leads to ridiculous amounts of duplication and incredibly large index sizes.
In a relational DB model, I could just store one copy of the metadata and join the tables to be able to filter and search by the combined fields, but I can't see any way of eliminating this duplication in Solr.
Is there something obvious I am missing, or is Solr just the wrong tool for the job?
Solr has support for join, which behaves more like subquery than join in relational database terms, but might do what you want. You can have Solr return metadata objects that have one or more strings that match your query. With another non-join query, you can also find out which strings are matched. (Note: This SO question explains why you cannot get both the metadata objects and the matched strings with one query yet.) If your metadata objects and the strings have a 1-to-N relationship, then you should also look into block join, which is designed for such relationship. You can index the metadata objects as parent documents, and the strings as child documents.
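A hedged SolrJ sketch of the block-join setup (the collection and field names are illustrative; block join also relies on the _root_ field, which the default schema provides):

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BlockJoinDemo {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            // Parent document: the metadata object, stored exactly once.
            SolrInputDocument meta = new SolrInputDocument();
            meta.addField("id", "meta-1");
            meta.addField("doc_type_s", "metadata");
            meta.addField("owner_s", "alice");

            // Child documents: the strings, carrying no copied metadata.
            for (int i = 0; i < 3; i++) {
                SolrInputDocument s = new SolrInputDocument();
                s.addField("id", "meta-1-str-" + i);
                s.addField("doc_type_s", "string");
                s.addField("content_t", "example string number " + i);
                meta.addChildDocument(s);
            }
            solr.add("strings", meta);
            solr.commit("strings");

            // Block join: return parent metadata objects whose child strings match.
            SolrQuery q = new SolrQuery("{!parent which='doc_type_s:metadata'}content_t:example");
            System.out.println(solr.query("strings", q).getResults().getNumFound());
        }
    }
}
```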
Say I have entities that I want to store in datomic. If the attributes are all known in advance, I just add them to my datomic schema once and can then make use of them.
What if in addition to known attributes, entities could have an arbitrary number of arbitrary keys, mapping to arbitrary values. Of course I can just store that list in some "blob" attribute that I also add to the schema, but then I couldn't easily query those attributes.
The solution that I've come up with is to define a key and a value attribute in datomic, each of type string, and treat every one of those additional key/value entries as entities in their own right, using aforementioned attributes. Then I can connect all those key/value-entities to the actual entity by means of a 1:n relation using the ref type.
That allows me to query. Is that the way to go or is there a better way?
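A minimal sketch of that model using the Datomic peer Java API (the attribute names :kv/key, :kv/value, and :entity/attrs are illustrative):

```java
import datomic.Connection;
import datomic.Peer;
import datomic.Util;
import java.util.List;

public class KvEntities {
    public static void main(String[] args) throws Exception {
        String uri = "datomic:mem://kv-demo";
        Peer.createDatabase(uri);
        Connection conn = Peer.connect(uri);

        // Schema: generic string key/value attributes, plus a many-ref from
        // the actual entity to its key/value entities.
        List schema = (List) Util.read(
            "[{:db/ident :kv/key   :db/valueType :db.type/string :db/cardinality :db.cardinality/one}"
          + " {:db/ident :kv/value :db/valueType :db.type/string :db/cardinality :db.cardinality/one}"
          + " {:db/ident :entity/attrs :db/valueType :db.type/ref :db/cardinality :db.cardinality/many}]");
        conn.transact(schema).get();

        // One entity with a dynamic key/value pair attached.
        List data = (List) Util.read(
            "[{:db/id \"e\" :entity/attrs \"kv1\"}"
          + " {:db/id \"kv1\" :kv/key \"color\" :kv/value \"red\"}]");
        conn.transact(data).get();

        // Query: entities whose dynamic key "color" has value "red".
        System.out.println(Peer.q(
            "[:find ?e :in $ ?k ?v :where"
          + " [?e :entity/attrs ?kv] [?kv :kv/key ?k] [?kv :kv/value ?v]]",
            conn.db(), "color", "red"));
    }
}
```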
I would be reluctant to lose the power of attribute definitions. Datomic attributes can be added at any time, and the limit is reasonably high (2^20), so it may be reasonable to model the dynamic keys and values as they come along, creating a new attribute for each.
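A sketch of that alternative, installing a proper attribute at runtime the first time an unknown key appears (again the Datomic peer Java API; the :dynamic namespace is illustrative):

```java
import datomic.Connection;
import datomic.Peer;
import datomic.Util;
import java.util.List;

public class DynamicAttributes {
    // Install a real schema attribute for a dynamic key, e.g. "color"
    // becomes :dynamic/color. Re-transacting the same definition is harmless.
    static void ensureAttribute(Connection conn, String key) throws Exception {
        List tx = (List) Util.read(
            "[{:db/ident :dynamic/" + key
          + " :db/valueType :db.type/string :db/cardinality :db.cardinality/one}]");
        conn.transact(tx).get();
    }

    public static void main(String[] args) throws Exception {
        String uri = "datomic:mem://dyn-demo";
        Peer.createDatabase(uri);
        Connection conn = Peer.connect(uri);

        ensureAttribute(conn, "color");
        conn.transact((List) Util.read("[{:dynamic/color \"red\"}]")).get();

        // Dynamic keys now query exactly like predefined attributes.
        System.out.println(Peer.q(
            "[:find ?e :where [?e :dynamic/color \"red\"]]", conn.db()));
    }
}
```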