First I want to say that the concept of a dedicated search engine is all new to me, so please be indulgent :-)
How does a transactional database entity with an Id and a Name translate into an Azure Search index field?
Should we add only Name, or both Id and Name ?
For example, let's say I want the Client in my index.
I want both to search and have facets on Client.
Should I add only ClientName into the index ?
What if ClientName is renamed?
What if ClientName is not unique ?
Should I add both fields into the index and have:
ClientName: Searchable
ClientId: Facetable, Filterable
I understand that making ClientId Facetable (instead of ClientName) will make it more work to show the facets, since I'll have to fetch the names corresponding to the ClientId values returned by Azure Search myself.
Also, having the ClientId Filterable, I assume it would allow me to perform a batch rename of ClientName.
Is my reasoning ok ?
Are there any best practices / guidelines?
EDIT
Here is a more concrete example.
Let's say that in the transactional db, we have tables with Id and Name for Format, Location, Author, Genre, Region, ...
If we were to build those facets in Azure Search, would the recommended approach be to add both the Id and Name for each of them, and set the Id field as Facetable ?
It's probably a good idea to add both Id and Name, since potentially Name can change. Also, the Name field can contain arbitrary characters, while document id can only contain alphanumeric characters, dashes, underscores and equal signs (see Naming Rules).
Only id field must be unique (it has the same semantics as the primary key in a relational database). All other fields can have non-unique values. If a value changes, you just update the document (using merge or mergeOrUpload indexing action).
Azure Search supports batches of up to 1000 documents. If you want to update more documents than that, you'll have to break your updates into multiple batches. See Indexing API. The link shows the REST API, but of course the same functionality is available in the .NET SDK, if you're on .NET.
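As a rough illustration of the batching described above, here is a minimal Python sketch that splits a large update into request bodies of at most 1000 merge actions each. The helper name merge_batches, the key field name "id" and the sample data are assumptions made for the example; sending the bodies to the service is left out.

```python
from typing import Dict, Iterable, Iterator, List

MAX_BATCH_SIZE = 1000  # per-request document limit mentioned in the answer


def merge_batches(docs: Iterable[Dict], batch_size: int = MAX_BATCH_SIZE) -> Iterator[Dict]:
    """Yield request bodies, each holding at most batch_size merge actions."""
    batch: List[Dict] = []
    for doc in docs:
        # "merge" updates only the fields present in the submitted document.
        batch.append({"@search.action": "merge", **doc})
        if len(batch) == batch_size:
            yield {"value": batch}
            batch = []
    if batch:
        # Emit the final, partially filled batch.
        yield {"value": batch}


# Hypothetical rename: 2500 documents get a new ClientName.
# "id" is assumed to be the index key field; the values are made up.
renamed = [{"id": str(i), "ClientName": "Contoso Ltd"} for i in range(2500)]
bodies = list(merge_batches(renamed))
print([len(b["value"]) for b in bodies])  # [1000, 1000, 500]
```

Each document only carries its key and the changed field, which is what makes a rename cheap: the other fields are left untouched by the merge action.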
Should I add both fields into the index and have:
ClientName: Searchable
ClientId: Facetable, Filterable
I understand that making ClientId Facetable (instead of ClientName) will make it more work to show the facets, since I'll have to fetch the names corresponding to the ClientId values returned by Azure Search myself.
We do not recommend making ClientId facetable. Facets work best on fields with a relatively small number of unique values. Since ClientId by definition must be unique, faceting will not be useful and any faceting queries that reference ClientId will probably perform poorly if you have many documents in your index. It is reasonable to make ClientId filterable though, since there may be situations when you want to retrieve or exclude certain documents by ClientId.
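To make the attribute choice concrete, here is a sketch of what the field definitions might look like, written as a Python dict that mirrors the JSON index definition. The index name "documents", the key field "id" and the Edm.String types are assumptions; only the ClientId and ClientName attributes follow the advice above.

```python
import json

# Field names follow the question; the key field "id" is an assumption.
fields = [
    {"name": "id", "type": "Edm.String", "key": True},
    # Filterable so documents can be retrieved or excluded by ClientId; not facetable.
    {"name": "ClientId", "type": "Edm.String", "filterable": True, "facetable": False},
    # Searchable for full-text queries on the client name.
    {"name": "ClientName", "type": "Edm.String", "searchable": True},
]

index_definition = {"name": "documents", "fields": fields}
print(json.dumps(index_definition, indent=2))
```

Whether any field is worth faceting on depends on how many distinct values it holds, as explained above.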
Also, having the ClientId Filterable, I assume it would allow me to perform a batch rename of ClientName.
This is not necessary. Making ClientId filterable allows you to filter by ClientId, nothing more. You always need to specify document IDs when updating fields using the Index API, but that doesn't require the ID field to be filterable.
I hope this gets you started, and as you have more specific questions, you can post them here.
Related
I am pretty new to Solr, and I don't know what the best practice is for the id column.
Currently I wish to exclude the internal "id" field from Solr search results (I am using my custom user_id field).
I know I can use fl=field1,field2, but this means specifying all my fields there. I don't have deep knowledge of Solr, and I fear this will hurt performance.
Another question: is it recommended to add another field, user_id, or to overwrite the default id field?
Thank you very much.
If the value you have in your user_id field is unique, index that into your id column or define the user_id field as your unique key instead and don't use the id field.
The important thing is that there's a unique field in your document so that Solr knows when a document should be updated compared to when a new document should be added instead.
If the id field is not relevant / secret, I'm not sure why you'd be worried about including it.
In MySQL I used auto-increment to generate an id for every user. I would like to create a similar user table in Google Datastore where the id for a user will be unique. According to these docs: https://cloud.google.com/appengine/docs/java/datastore/entities
System-allocated ID values are guaranteed unique to the entity group.
But according to this post: Ever see duplicate IDs when using Google App Engine and ndb? the ids are not unique. I need this id to be unique. It is confusing because the docs say the id is unique, while the post says it is the key that is unique, not the id. My objective is for no two users to have the same id. How can I guarantee this? I would prefer the database to take care of this for me, as opposed to having to create large ids manually using things such as UUIDs.
As Igor correctly observed, IDs are always unique as long as the entity has no parent.
I can't think of any reason to make user entities children of some other entities, so you are safe.
Note that IDs will not be sequential, as it helps to spread the load equally across the entire dataset - it's a by-product of how the Datastore is designed.
I want to store URLs in an index, but I want each URL to be unique.
I'm making a POST request to store my documents, but I want to avoid duplicate documents based on the url field.
Is there a way to specify a unique constraint on the url field?
I have around 5 million documents, so I don't want to use the URL as the document ID, as it would slow down my search queries.
No, the _id is the only field that can have a uniqueness restriction. You probably know this, but a new document with an existing id overwrites the existing document with the same id. You can use op_type=create or /my_index/my_type/ID/_create in order to get back an error if a document with the same id already exists.
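One way to combine this with the URL requirement is to derive the _id from the URL, so that the create-only endpoint rejects a second document for the same URL. The sketch below is an assumption that goes beyond the answer: it hashes the URL with SHA-1 to get a short, fixed-length id and builds the /_create path in the form quoted above. The function name create_path and the index and type names are hypothetical.

```python
import hashlib


def create_path(index: str, doc_type: str, url: str) -> str:
    """Return the create-only path for a document whose _id is a hash of its URL."""
    # The same URL always hashes to the same 40-character hex id.
    doc_id = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"/{index}/{doc_type}/{doc_id}/_create"


first = create_path("my_index", "my_type", "https://example.com/a")
again = create_path("my_index", "my_type", "https://example.com/a")
other = create_path("my_index", "my_type", "https://example.com/b")
print(first == again, first == other)  # True False
```

Indexing through such a path would then return an error for a duplicate URL instead of silently overwriting the earlier document.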
I am struggling with the overall view of how (and whether it is possible) to index multiple different types of records in one single Solr core. By different types I mean that they have different unique keys.
We are inclined to want to use a single core because we want to be able to, at certain levels, search everything all at once and not have to cobble cores together.
So, for example, I have products that have the fields:
product_code <--- unique key
product_title
product_description
etc...
then there are job listings that have the fields:
job_id <---- unique key
job_description
job_title
etc...
There are multiple other entities, including a Nutch search index, which will have a unique id of 'id'.
Is it possible to include more than one unique key in the schema.xml, so that I do not have to send each different kind of record to a different Solr core?
The main concern I have is that in identifying the <uniqueKey>s at least one of them has to be required, but not all records sent to the solr index will have the required key.
Is there an accepted way to get around this problem in Solr?
See https://wiki.apache.org/solr/MultipleIndexes#Flattening_Data_Into_a_Single_Index and https://wiki.apache.org/solr/UniqueKey
Solr does not need a uniqueKey. If you do not specify one, be aware of the following: when you post a new doc that has the same key as an existing doc, the new doc will not replace the old one, so you will have to delete the old one manually first and then add the new one (and commit, of course).
If you need a unique key, then add a prefix to the IDs based on the record type. You can then keep two other fields alongside it, such as the original id and the type. So, for example:
uniquekey: P1
product_code: 1
type: product

uniquekey: J1
job_id: 1
type: job
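The prefixing scheme in this example can be sketched in a few lines of Python. The TYPE_KEYS mapping and the with_unique_key helper are hypothetical names introduced for illustration; the field names come from the question.

```python
from typing import Dict

# Hypothetical mapping from record type to (prefix, original key field).
TYPE_KEYS = {"product": ("P", "product_code"), "job": ("J", "job_id")}


def with_unique_key(record: Dict, record_type: str) -> Dict:
    """Return a copy of record with a type-prefixed uniquekey and a type field."""
    prefix, key_field = TYPE_KEYS[record_type]
    return {"uniquekey": f"{prefix}{record[key_field]}", "type": record_type, **record}


product = with_unique_key({"product_code": 1, "product_title": "Widget"}, "product")
job = with_unique_key({"job_id": 1, "job_title": "Editor"}, "job")
print(product["uniquekey"], job["uniquekey"])  # P1 J1
```

Because the prefix differs per type, a product and a job that share the same numeric id still end up with distinct unique keys in the single core.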
Hi, I want to index a Solr document and tag it with multiple associated users. I want to enable searches like "give me the documents associated with userid 1000,1003...9300 containing the word X". More people will be added to the document during its lifetime, and I want to potentially associate thousands of users with one document. There is no need to show the associated users in the results; they are only needed for search. Will indexing userid or username be more performant and scalable? And what field type would be more performant and scalable: appending to a text field, a multivalued field, or some other approach?
I believe that using the userid (as an integer) would be the most performant. (At least from my experience so far). Also, using a multivalued field will allow you to use a filter query on the userid field to help improve the query response time.
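As a sketch of the filter-query idea, the Python snippet below builds Solr request parameters that keep the keyword in q and put the user restriction in fq. The field names "text" and "userid" and the helper name build_params are assumptions; the multivalued integer field itself would be defined in the schema.

```python
from typing import Dict, Iterable


def build_params(word: str, user_ids: Iterable[int]) -> Dict[str, str]:
    """Build q/fq parameters matching docs that contain word and any of user_ids."""
    ids = [str(int(u)) for u in user_ids]
    if not ids:
        raise ValueError("at least one user id is required")
    # fq restricts the result set without affecting relevance scoring.
    return {"q": f"text:{word}", "fq": "userid:(" + " OR ".join(ids) + ")"}


params = build_params("X", [1000, 1003, 9300])
print(params["fq"])  # userid:(1000 OR 1003 OR 9300)
```

Keeping the user restriction in fq rather than in q lets Solr cache the filter separately from the keyword query.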