SOLR indexing arbitrary data - solr

Let's say you have a simple forms automation application, and you want to index every submitted form in a Solr collection. Let's also say that form content is open-ended so that the user can create custom fields on the form and so forth.
Since users can define custom forms, you can't really predefine fields to Solr, so we've been using Solr's "schema-less" or managed schema mode. It works well, except for one problem.
Let's say a form comes through with a field called "ID" and a value of "9". If this is the first time Solr has seen a field called "ID", it dutifully updates its schema, and since the value of this field is numeric, Solr assigns it one of its numeric data types (we see "plong" a lot).
Now, let's say that the next day, someone submits another instance of this same form, but in the ID field, they type their name instead of entering a number. Solr spits this out and won't index this record because the schema says ID should be numeric, but on this record, it's not.
The way we've been dealing with this so far is to trap the exception we get when a field's data type disagrees with the schema, and then we use the Solr API to alter the schema, making the field in question a text or string instead of a numeric.
Of course, when we do this, we need to reindex the entire collection since the schema changed, and so we need to persist all the original data just in case we need to re-index everything after one of these schema data-type collisions. We're big Solr fans, but at the same time, we wonder whether the benefits of using the search engine outweigh all this extra work that gets triggered if a user simply enters character data in a previously numeric field.
Is there a way to just have Solr always assign something like "text_general" for every field, or is there some other better way?
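For reference, the schema change described above can be expressed as a Schema API command body, which is POSTed to the collection's /schema endpoint. This is only a sketch of the payload: the helper name is made up for illustration, and the "stored" setting is an assumption rather than something taken from our actual configuration.

```python
import json


def replace_field_command(name, field_type="text_general"):
    """Build the JSON body for a Schema API 'replace-field' command.

    'replace-field' expects the complete new field definition, so every
    property the field should keep has to be listed here. 'stored' is an
    assumed setting for this sketch.
    """
    command = {
        "replace-field": {
            "name": name,
            "type": field_type,
            "stored": True,
        }
    }
    return json.dumps(command)


# Body that would be POSTed to /solr/<collection>/schema
body = replace_field_command("ID")
print(body)
```

Documents indexed under the old type still have to be re-indexed afterwards, which is the extra work described above.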

I would say that you might need to handle the Id values on the application side.
It would be good to add validation so that Id is always either a string or a number.
This would resolve your issue permanently. Once the type is decided, you don't have to do anything on the Solr side.
The alternative approach would be to have a fixed schema.xml.
In it, add a field Id with a fixed fieldType.
I would suggest going with string as the fieldType for Id if you don't want the data to be tokenized and want exact matches in search.
If you would like more flexibility when searching the Id field, you can use the text_general field type for it instead.
You can also create your own fieldType, with a tokenizer and filters chosen according to your requirements for the Id field.
Also, don't use schemaless mode in production. You can also map your field names to a dynamic field definition. Create a dynamic field such as *_t for the text fields. All fields whose names end with _t will be mapped to it.
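A minimal sketch of the dynamic field idea on the application side, assuming the schema defines a *_t dynamic field of a text type. The function names and the "id" key are hypothetical, and the name clean-up rule is just one possible convention.

```python
import re


def to_dynamic_field(name, suffix="_t"):
    """Turn an arbitrary form field name into one that matches *_t."""
    # Collapse anything other than letters, digits and underscores into "_".
    safe = re.sub(r"[^0-9A-Za-z_]+", "_", name.strip()).strip("_")
    return f"{safe or 'field'}{suffix}"


def to_solr_doc(form_id, form):
    """Build a Solr document in which every user-defined field is text."""
    doc = {"id": form_id}
    for key, value in form.items():
        # Values are sent as strings, so "9" and "Bob" both index fine.
        doc[to_dynamic_field(key)] = str(value)
    return doc


print(to_solr_doc("form-1", {"ID": 9, "Full Name": "Ann"}))
```

Because every custom field ends up in a text field, a later submission with a name in the ID field no longer conflicts with the schema.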

Related

Does Solr allow indexing a single field with multiple data types?

In my scenario, there is a table with a column "column1" of string type, and column1 can store int, float, or any other type of value in string format. I'm creating a Solr index on this table.
Can Solr index a single field (column1) with multiple data types such as string, int, etc.?
No, it can't. Solr needs a field to have a defined type (either through a defined field or through a dynamic field), since querying and sorting are defined differently for different field types. If you're used to processing all the values as strings today, you might be best suited to go with a StrField in the future as well.
Another option is to decide what type of value the field holds when indexing, and add the value to different fields based on that type (since it's a string field in the database, you'll have to come up with a heuristic that matches your expected types for the different values in the field). For example, you could have column1_int, column1_string, and column1_float, and index each value to the field of the correct type. When querying, you'll query all the relevant fields, either based on the input data type or by massaging the data appropriately.
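One possible heuristic for the per-type routing, as a sketch only. It relies on Python's own int() and float() parsing, which is more permissive than you may want (for example, float() also accepts "nan" and "inf"), so treat the rules as assumptions to adapt. The function name is hypothetical.

```python
from typing import Tuple


def typed_field(base: str, raw: str) -> Tuple[str, object]:
    """Pick column1_int, column1_float or column1_string for a raw value."""
    text = raw.strip()
    try:
        return f"{base}_int", int(text)
    except ValueError:
        pass
    try:
        return f"{base}_float", float(text)
    except ValueError:
        # Anything that is not numeric is kept unchanged as a string.
        return f"{base}_string", raw


for value in ["42", "3.5", "abc"]:
    print(typed_field("column1", value))
```

Each row then contributes exactly one of the three fields to its Solr document.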

Difference between Standard Fields and Custom Fields in Salesforce

I am kind of new to Salesforce. Could you please let me know what is the difference between Standard Fields and Custom Fields in Salesforce? Can I consider combination of Standard Fields as the unique identifier for a record?
Custom fields are just that. Fields that have been added to the standard Salesforce schema to tailor the data for each object. The user who creates the field can specify the field type and any applicable limitations, such as the maximum number of characters in a text field. These fields might be added to an Org via a managed package or through direct customization.
Standard fields in contrast are those that are already present in the Salesforce schema when a new Organization is created. They are present in all Orgs where the same features are enabled. You can't customize these fields to the same degree. E.g. you could change the display label, but not the underlying API name or data type.
You can see the list of the standard fields in the Salesforce Field Reference Guide.
From an API perspective, custom fields are usually identified by a __c suffix (there are a few exceptions, such as GeoLocation fields).
Can I consider combination of Standard Fields as the unique identifier for a record?
You would usually rely on the Id field to be unique. If you wanted to augment this with another unique value, you would create a custom field and mark it as an External ID.
A composite key isn't directly supported. Instead you need to create a Unique Text field and then use a workflow field update or before trigger to populate the unique field with the components of the composite key.
Incidentally, the salesforce.stackexchange.com site is a great place to ask Salesforce specific questions.

changing solr id from string to uuid

I am very new to solr.
Initially the "id" in my solr schema was of type string.
I have 30,000 documents, but now I want to use uuid instead of a string.
I tried simply changing the id type to uuid, following the instructions from http://wiki.apache.org/solr/UniqueKey
It did not work because Solr tried to read the existing string ids as UUIDs and failed.
My question is: how do I change my id to uuid without deleting any data?
Any info on this will be helpful.
I assume your id field is declared as the uniqueKey in schema.xml. That means every document in your Solr instance must contain the id field. When you modify the type of any field in the schema, the index previously created for that field becomes unusable. You can no longer query on that field, even though the data is still present in your Solr instance.
What good is data that you indexed in order to query if you cannot query it? There is no point in keeping the old documents in Solr if you can't query them. And this time you have modified the uniqueKey field, so you must re-index. If you had modified the type of some field other than the uniqueKey, then an atomic (partial) update would have been a solution.
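If the original data is still available, the re-index can assign the new ids on the client side. The sketch below is an assumption-laden illustration, not the procedure from the wiki page: it derives a name-based UUID from each old id instead of letting Solr generate one, and it keeps the old id in a hypothetical legacy_id_s field (assuming a *_s string dynamic field exists in the schema).

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
ID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def with_uuid_id(doc, legacy_field="legacy_id_s"):
    """Return a copy of doc whose id is a UUID derived from the old id."""
    new_doc = dict(doc)
    old_id = str(new_doc["id"])
    # Keep the old id so existing references can still be resolved.
    new_doc[legacy_field] = old_id
    # uuid5 is deterministic: the same old id always yields the same UUID.
    new_doc["id"] = str(uuid.uuid5(ID_NAMESPACE, old_id))
    return new_doc


print(with_uuid_id({"id": "doc-17", "title": "Example"}))
```

Because the mapping is deterministic, running the re-index twice produces the same ids rather than duplicate documents.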

Obtaining record from Solr using single Key

I am using Solr and have looked over the documentation, but I couldn't find a way to get a single record from Solr by using a key.
If I know the key value of the record, what query do I need to pass to Solr to obtain this record?
Thanks.
I'm not sure what you mean by key, but guessing from context, you mean a field defined in your schema. If that is the case, you could issue the following:
// Assumes Id is a schema field
// If via solr admin
q=Id:1
// Properly escaped
q=Id%3A1
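The escaping can also be left to a library. A small sketch using Python's standard urllib; the base URL and collection name are placeholders, and wrapping the value in double quotes is an extra precaution added here, not something the query above requires.

```python
from urllib.parse import urlencode

# The plain query from above, URL-encoded.
print(urlencode({"q": "Id:1"}))  # prints q=Id%3A1


def lookup_url(base_url, field, value):
    """Build a /select URL that fetches one document by a field value."""
    # Quote the value so characters such as ':' or spaces inside it
    # are treated as part of the value, not as query syntax.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    query = f'{field}:"{escaped}"'
    return f"{base_url}/select?{urlencode({'q': query, 'rows': 1})}"


# Placeholder host and collection name.
print(lookup_url("http://localhost:8983/solr/mycollection", "Id", 1))
```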

Index document "linked" to multiple users

Hi, I want to index a Solr document and tag it with multiple associated users. I want to enable searches like "give me the documents associated with userid 1000,1003...9300 containing the word X". More people will be added to the document during its lifetime, and I want to potentially associate thousands of users with one document. There is no need to show the associated users in the results; they are only needed for search. Will indexing the userid or the username be more performant and scalable? And which field type would be more performant and scalable: appending to a text field, a multivalued field, or some other approach?
I believe that using the userid (as an integer) would be the most performant. (At least from my experience so far). Also, using a multivalued field will allow you to use a filter query on the userid field to help improve the query response time.
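A rough sketch of such a request, with the word search in q and the user restriction in fq. The field names text and userids are assumptions for illustration, as is the helper name.

```python
from urllib.parse import urlencode


def user_filtered_params(term, user_ids, text_field="text", user_field="userids"):
    """Build q/fq parameters: search for term, restricted to the given users."""
    # int() guards against non-numeric ids ending up in the query string.
    id_clause = " OR ".join(str(int(u)) for u in user_ids)
    return [
        ("q", f"{text_field}:{term}"),
        # The filter query matches documents tagged with any listed user.
        ("fq", f"{user_field}:({id_clause})"),
    ]


params = user_filtered_params("X", [1000, 1003, 9300])
print(params)
print(urlencode(params))
```

With thousands of ids the query string gets long, so sending the parameters in a POST body is probably safer than a GET URL.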

Resources