I am pretty new to Solr, and I don't know what the best practice is for the id column.
Currently I wish to exclude the internal "id" field from Solr search results (I am using my custom user_id field).
I know I can use fl=field1,field2, but this means specifying all my fields there. I don't have deep knowledge of Solr, and I fear this will hurt performance.
Another question: is it recommended to add another field, user_id, or to overwrite the default id field?
Thank you very much.
If the value you have in your user_id field is unique, index that into your id column or define the user_id field as your unique key instead and don't use the id field.
The important thing is that there's a unique field in your document so that Solr knows when a document should be updated compared to when a new document should be added instead.
If the id field is not relevant / secret, I'm not sure why you'd be worried about including it.
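As a rough illustration of the fl approach mentioned in the question, the sketch below builds a select query string that returns only the listed fields, so id is simply left out. The field names (user_id, name) and the match-all query are assumptions for the example, not taken from a real schema.

```python
from urllib.parse import urlencode

def build_select_params(query, fields):
    """Build a Solr /select query string that returns only the given fields."""
    # fl takes a comma-separated list of field names; any field not listed
    # (for example the internal id) is left out of the returned documents.
    return urlencode({"q": query, "fl": ",".join(fields)})

# Hypothetical field names, used for illustration only.
params = build_select_params("*:*", ["user_id", "name"])
print(params)
```

If user_id is declared as the unique key instead, as suggested above, there is no separate id field left to exclude.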
First I want to say that the concept of a dedicated search engine is all new to me, so please be indulgent :-)
How does a transactional database entity with an Id and a Name translate into an Azure Search index field?
Should we add only Name, or both Id and Name?
For example, let's say I want the Client in my index.
I want both to search and have facets on Client.
Should I add only ClientName into the index?
What if ClientName is renamed?
What if ClientName is not unique?
Should I add both fields into the index and have:
ClientName: Searchable
ClientId: Facetable, Filterable
I understand having ClientId Facetable (instead of ClientName) will make it more work to show the facets, since I'll have to fetch the names corresponding to the ClientId returned by Azure Search myself.
Also, having the ClientId Filterable, I assume it would allow me to perform a batch rename of ClientName.
Is my reasoning ok?
Are there any best practices / guidelines?
EDIT
Here is a more concrete example.
Let's say that in the transactional db, we have tables with Id and Name for Format, Location, Author, Genre, Region, ...
If we were to build those facets in Azure Search, would the recommended approach be to add both the Id and Name for each of them, and set the Id field as Facetable?
It's probably a good idea to add both Id and Name, since potentially Name can change. Also, the Name field can contain arbitrary characters, while document id can only contain alphanumeric characters, dashes, underscores and equal signs (see Naming Rules).
Only the id field must be unique (it has the same semantics as the primary key in a relational database). All other fields can have non-unique values. If a value changes, you just update the document (using the merge or mergeOrUpload indexing action).
Azure Search supports batches of up to 1000 documents. If you want to update more documents than that, you'll have to break your updates into multiple batches. See Indexing API. The link shows the REST API, but of course the same functionality is available in the .NET SDK, if you're on .NET.
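To make the 1000-document limit concrete, here is a minimal sketch of splitting a large rename into batches of merge actions. It only builds the request bodies; the key field name (DocId), the ClientName field, the sample value, and the helper name are assumptions for illustration, and actually sending each body to the indexing endpoint is left out.

```python
MAX_BATCH_SIZE = 1000  # batch limit stated above

def build_merge_batches(doc_keys, new_name, batch_size=MAX_BATCH_SIZE):
    """Yield index-request bodies, each holding at most batch_size merge actions."""
    for start in range(0, len(doc_keys), batch_size):
        chunk = doc_keys[start:start + batch_size]
        yield {
            "value": [
                {
                    "@search.action": "merge",  # update only the listed fields
                    "DocId": key,               # hypothetical key field name
                    "ClientName": new_name,     # hypothetical field being renamed
                }
                for key in chunk
            ]
        }

# 2500 hypothetical document keys are split into three batches.
keys = [str(i) for i in range(2500)]
batches = list(build_merge_batches(keys, "New Client Name"))
print([len(b["value"]) for b in batches])  # [1000, 1000, 500]
```

Each action carries the document key, which is how the service knows which document to update.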
Should I add both fields into the index and have:
ClientName: Searchable
ClientId: Facetable, Filterable
I understand having ClientId Facetable (instead of ClientName) will make it more work to show the facets, since I'll have to fetch the names corresponding to the ClientId returned by Azure Search myself.
We do not recommend making ClientId facetable. Facets work best on fields with a relatively small number of unique values. Since ClientId by definition must be unique, faceting will not be useful and any faceting queries that reference ClientId will probably perform poorly if you have many documents in your index. It is reasonable to make ClientId filterable though, since there may be situations when you want to retrieve or exclude certain documents by ClientId.
Also, having the ClientId Filterable, I assume it would allow me to perform a batch rename of ClientName.
This is not necessary. Making ClientId filterable allows you to filter by ClientId, nothing more. You always need to specify document IDs when updating fields using the Index API, but that doesn't require the ID field to be filterable.
I hope this gets you started, and as you have more specific questions, you can post them here.
I want to store URLs in an index, but I want each URL to be unique.
I'm making POST requests to store my documents, but I want to avoid duplicate documents based on the url field.
Is there a way to specify a unique constraint on the url field?
I have around 5 million documents, so I don't want to use the url as the document ID, as that would slow down my search queries.
No, the _id is the only field that can have the uniqueness restriction. You probably know this, but a new document with an existing id would overwrite the existing document with the same id. You can use op_type=create or /my_index/my_type/ID/_create in order to get back an error if a document with the same id already exists.
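One way to apply this to URLs is to derive the _id from the URL and index with create-only semantics, so a second request for the same URL returns an error rather than overwriting. The sketch below only builds the request paths. Hashing the URL with SHA-1 to get a compact id is an assumption added here, not something prescribed above, and my_index / my_type are the placeholder names already used in the answer.

```python
import hashlib

def url_to_doc_id(url):
    """Derive a fixed-length document id from a URL (an assumed scheme)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()

def create_only_paths(index, doc_type, url):
    """Return the two request paths mentioned above for a create-only write."""
    doc_id = url_to_doc_id(url)
    return (
        f"/{index}/{doc_type}/{doc_id}?op_type=create",
        f"/{index}/{doc_type}/{doc_id}/_create",
    )

for path in create_only_paths("my_index", "my_type", "http://example.com/page"):
    print(path)
```

The same URL always maps to the same id, so the uniqueness of _id does the work of a unique constraint on url.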
I am very new to Solr.
Initially the "id" in my Solr schema was of type string.
I have 30,000 documents, but now I want to use uuid instead of a string.
I tried simply changing the id to uuid, following the instructions from http://wiki.apache.org/solr/UniqueKey
It did not work, because Solr tried to read the existing string ids as uuids and failed.
My question is: how do I change my id to uuid without deleting any data?
Any info on this will be helpful.
I assume your id field is declared as the uniqueKey in schema.xml. That means every document in your Solr instance must contain the id field. When you modify the type of a field in the schema, the index previously built for that field is no longer usable: you can't query on the field, even though the old values are still present in your Solr instance.
What good is data you indexed for querying if you cannot query it? There is no point in keeping the old documents in Solr if you can't query them. And in this case you have modified the uniqueKey field itself, so you must re-index. If you had changed the type of a field other than the uniqueKey, an atomic (partial) update could have been a solution.
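Since a re-index is needed anyway, the new ids can be assigned while preparing the documents for it. A minimal sketch, assuming the documents are available as dictionaries from the original source and that deriving a name-based UUID from the old string id is acceptable. Both are assumptions, and the old_id field is hypothetical and would have to exist in the schema.

```python
import uuid

def with_uuid_ids(docs):
    """Return copies of docs whose id is a UUID derived from the old string id."""
    converted = []
    for doc in docs:
        new_doc = dict(doc)
        # uuid5 is deterministic: the same old id always maps to the same UUID,
        # so re-running the conversion does not create duplicates.
        # The namespace choice is arbitrary here.
        new_doc["id"] = str(uuid.uuid5(uuid.NAMESPACE_URL, doc["id"]))
        new_doc["old_id"] = doc["id"]  # hypothetical field keeping the old key
        converted.append(new_doc)
    return converted

docs = [{"id": "doc-1", "title": "first"}, {"id": "doc-2", "title": "second"}]
for doc in with_uuid_ids(docs):
    print(doc["id"], doc["old_id"])
```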
What is pk in a Solr DIH delta import? I am trying to delta-index multiple fields in Solr.
I believe it is whatever field you specify in your schema.xml file as the id field.
It is the name of the Solr field that serves as the unique key for that record. You define your mapping from the source to that Solr column, and then, after mapping, Solr checks its presence and values based on the pk field you specified.
It is different from primaryKey because you may be generating the primaryKey, or it may not be suitable for some reason. But it could be the same. I think the clearest Wiki explanation may be in the example for HttpDataSource.
I believe you may also be able to define a compound pk for when you are flattening inner source entries into one Solr entry.
I think the problem is in your delta-query for the child entity. You have given,
deltaQuery="select id from cc_gadget_lang where '${cc_gadget.last_modified_date}' > '${dataimporter.last_index_time}'"
I think the where condition in the above query always evaluates to TRUE, so it serves no purpose.
The solution I would suggest is to have a separate "last_modified_date" field in the "cc_gadget_lang" table in your database and use that in the delta query of your child entity.
I also believe there is no need to have the "pk" of the child entity in your schema file, because it is stored and used only temporarily during delta-imports and does not need to be stored permanently in the index.
I am using Solr and have looked over the documentation, but couldn't find a way to get a single record from Solr by using a key.
If I know the key value of the record what is the query I need to pass to Solr to obtain this record?
Thanks.
I'm not sure what you mean by key, but guessing from context, you mean a field defined by your schema. If that is the case, you could issue the following:
// Assumes Id is a schema field
// If via solr admin
q=Id:1
// Properly escaped
q=Id%3A1
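The same escaping can be produced programmatically. A small sketch, assuming a field named Id as in the example above; it only builds the query string and does not contact a Solr server.

```python
from urllib.parse import urlencode

def lookup_query(field, value):
    """Build the query string for fetching a document by a key field."""
    return urlencode({"q": f"{field}:{value}"})

print(lookup_query("Id", 1))  # q=Id%3A1
```

Values that contain Solr query syntax characters (such as a colon or a space) would also need query-level escaping or quoting, which this sketch does not handle.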