Hi, I want to index a Solr document and tag it with multiple associated users, so that I can run searches like "give me the documents associated with user IDs 1000, 1003, ..., 9300 that contain the word X". More users will be added to the document during its lifetime, and one document could end up associated with thousands of users. There is no need to show the associated users in the results; the field is only for search. Which would be more performant and scalable to index: the user ID or the username? And which field type would be more performant and scalable: appending to a text field, a multivalued field, or some other approach?
I believe that using the userid (as an integer) would be the most performant, at least in my experience so far. Also, using a multivalued field will allow you to use a filter query on the userid field, which helps improve query response time.
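For illustration, a minimal sketch of such a query (the collection name docs, the field names userid and text, and the endpoint are all assumptions):

```python
import requests

# Hypothetical endpoint and field names; adjust to your collection.
params = {
    "q": "text:X",                          # the full-text part of the search
    "fq": "userid:(1000 OR 1003 OR 9300)",  # filter on the multivalued int field
    "wt": "json",
    "rows": 10,
}
resp = requests.get("http://localhost:8983/solr/docs/select", params=params)
for doc in resp.json()["response"]["docs"]:
    print(doc["id"])
```

The fq clause is cached in Solr's filter cache independently of the scoring query, which is where the response-time benefit comes from.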
Let's say you have a simple forms automation application, and you want to index every submitted form in a Solr collection. Let's also say that form content is open-ended so that the user can create custom fields on the form and so forth.
Since users can define custom forms, you can't really predefine the fields in Solr, so we've been using Solr's "schema-less" (managed schema) mode. It works well, except for one problem.
Let's say a form comes through with a field called "ID" and a value of "9". If this is the first time Solr has seen a field called "ID", it dutifully updates its schema, and since the value of this field is numeric, Solr assigns it one of its numeric data types (we see "plong" a lot).
Now, let's say that the next day, someone submits another instance of this same form, but in the ID field, they type their name instead of entering a number. Solr rejects this and won't index the record, because the schema says ID should be numeric, but on this record it's not.
The way we've been dealing with this so far is to trap the exception we get when a field's data type disagrees with the schema, and then we use the Solr API to alter the schema, making the field in question a text or string instead of a numeric.
Of course, when we do this, we need to reindex the entire collection since the schema changed, and so we need to persist all the original data just in case we need to re-index everything after one of these schema data-type collisions. We're big Solr fans, but at the same time, we wonder whether the benefits of using the search engine outweigh all this extra work that gets triggered if a user simply enters character data in a previously numeric field.
Is there a way to just have Solr always assign something like "text_general" for every field, or is there some other better way?
I would say that you might need to handle the ID values on your application end.
It would be good to add validation for the ID, so that it is always either a string or a number.
This would resolve your issue permanently: once the type is decided, you don't have to do anything on the Solr side.
The alternative approach would be to have a fixed schema.xml.
In this, add an ID field with a fixed fieldType.
I would suggest you go with string as the fieldType for ID if you don't want Solr to tokenize the data and you want exact matches in search.
If you would like more flexibility when searching the ID field, you can use the text_general field type instead.
You can also create your own fieldType, with a tokenizer and filters chosen according to your requirements for the ID field.
Also, don't use schemaless mode in production. You can also map your field names to a dynamic field definition: create a dynamic field such as *_t for text fields, and all your fields ending with _t will be mapped to it.
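For illustration, a rough sketch of both options via the Schema API (the collection name forms and the URL are assumptions):

```python
import requests

SCHEMA_URL = "http://localhost:8983/solr/forms/schema"  # hypothetical collection

# Option 1: pin ID to a fixed string type so numeric and text values
# both index without a schema conflict.
requests.post(SCHEMA_URL, json={
    "add-field": {"name": "ID", "type": "string", "indexed": True, "stored": True}
})

# Option 2: a dynamic field so any field named *_t is indexed as general text.
requests.post(SCHEMA_URL, json={
    "add-dynamic-field": {"name": "*_t", "type": "text_general",
                          "indexed": True, "stored": True}
})
```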
I'm in the process of writing a SuiteTalk integration, and I've hit an interesting data transformation issue. In the target system, we have a sort of notes table which has a category column and then the notes column. Data going into that table from NetSuite could be several different fields on a single entity in NetSuite terms, but several records of different categories in our terms.
If you take the example of a Sales Order, you might have two text fields that we need to bring across as notes. For each of those fields I need to create a row, with both note values in the same column but on separate rows. This would allow me to add a dynamic column that gives the category for each of those fields.
So instead of

    SO Number   Notes 1      Notes 2
    SO1234567   some text1   some text2

you'd get

    SO Number   Category     Text
    SO1234567   category 1   some text1
    SO1234567   category 2   some text2
The two problems I’m really trying to solve here are:
Where can I store the category name? It can’t be the field name in NetSuite. It needs to be configurable per customer as the number of notes fields in each record type might vary across implementations. This is currently my main blocker.
Performance – I could create a saved search for each type of note, and bring one row across each time, but that’s not really an acceptable performance hit if I can do it all in one call.
I use Saved Searches in NetSuite to provide a configurable way of filtering the data to import into the target system.
If I were writing a SQL query, I would use the UNION clause, with the first column being a dynamic column denoting the category and the second column being the actual data field from NetSuite. Ideally I could do a similar thing either as a single saved search, or as one saved search per entity, without having to create any additional fields within NetSuite itself, so that from the SuiteTalk side I can just query the search and pull in the data.
As a temporary kludge, I now have multiple saved searches in NetSuite, one per category, and the ID of each saved search encodes the category name and an indicator of the record type. I then have a parent search which gives me the searches for that record type. It's very clunky, and ultimately results in far too many round trips for me to be satisfied.
Any idea if something like this is at all possible? Or, if not, is there a way of solving this without hard-coding the category values in the front end? Even being able to bring back multiple recordsets in one call would be a performance improvement.
I've asked the same question on the NetSuite forums but to no avail.
Thanks
At first read it sounds like you are trying to query a set of fields from entities, where the fields may be custom fields or built-in fields. Can you not just query the entities with a saved search that has all the potential category columns, and then transform the received data into categories?
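For example, a minimal sketch of that transform step, assuming you have already fetched each sales order with all of its potential note columns (the field IDs and category names below are made up and would come from per-customer configuration):

```python
# Hypothetical per-customer mapping of NetSuite field IDs to category names.
FIELD_TO_CATEGORY = {
    "custbody_notes1": "category 1",
    "custbody_notes2": "category 2",
}

def unpivot_notes(order):
    """Turn one wide result row into (SO number, category, text) rows."""
    rows = []
    for field_id, category in FIELD_TO_CATEGORY.items():
        text = order.get(field_id)
        if text:  # skip empty note fields
            rows.append({"so_number": order["tranid"],
                         "category": category,
                         "text": text})
    return rows

order = {"tranid": "SO1234567",
         "custbody_notes1": "some text1",
         "custbody_notes2": "some text2"}
print(unpivot_notes(order))
```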
Otherwise, please provide more specifics, in NetSuite terms, about what you are trying to do.
I need to create a new collection on my Solr 6.1.0 cluster where every document is a piece of content, and each content item can belong to one or many categories, which are specified in a multivalued categories field.
In my web app the user can search by category and, if desired, even group the results by category. But if the user orders by category, what happens to content that belongs to more than one category?
In that case, the search results page should show the same content multiple times, once under each category. I don't want the web application to filter and order the results itself, because it would then have to ask Solr for every row (which I know is discouraged for performance reasons). So is there a way to have Solr do this, for example repeating the same content under two categories when a flag is enabled, or when I ask Solr to sort by category?
Until now I have bypassed the problem by cloning one record for every category and storing the category ID in a single-valued int field. But this is not optimal: my index is much bigger than it needs to be, and all the content metadata apart from the category is identical across the clones. That is why I would like to have 1 content = 1 Solr record.
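For reference, one thing the clone-per-category workaround does enable is Solr's result grouping, since group.field requires a single-valued field. A minimal sketch, assuming a contents collection and a single-valued category_id field (both names are placeholders):

```python
import requests

# Hypothetical collection URL and field names; adjust to your schema.
params = {
    "q": "text:something",
    "group": "true",
    "group.field": "category_id",  # must be single-valued for grouping
    "group.limit": 10,             # documents returned per category
    "sort": "category_id asc",
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/contents/select", params=params)
for group in resp.json()["grouped"]["category_id"]["groups"]:
    print(group["groupValue"], group["doclist"]["numFound"])
```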
First I want to say that the concept of a dedicated search engine is all new to me, so please be indulgent :-)
How does a transactional database entity with an Id and a Name translate into Azure Search index fields?
Should we add only Name, or both Id and Name ?
For example, let's say I want the Client in my index.
I want both to search and have facets on Client.
Should I add only ClientName to the index?
What if ClientName is renamed?
What if ClientName is not unique?
Should I add both fields into the index and have:
ClientName: Searchable
ClientId: Facetable, Filterable
I understand having ClientId Facetable (instead of ClientName) will make it more work to show the facets, since I'll have to fetch the names corresponding to the ClientIds returned by Azure Search myself.
Also, I assume that having ClientId Filterable would allow me to perform a batch rename of ClientName.
Is my reasoning OK?
Are there any best practices or guidelines?
EDIT
Here is a more concrete example.
Let's say that in the transactional DB, we have tables with Id and Name for Format, Location, Author, Genre, Region, ...
If we were to build those facets in Azure Search, would the recommended approach be to add both the Id and the Name for each of them, and set the Id field as Facetable?
It's probably a good idea to add both Id and Name, since the Name can potentially change. Also, the Name field can contain arbitrary characters, while the document id can only contain alphanumeric characters, dashes, underscores, and equal signs (see Naming Rules).
Only the id field must be unique (it has the same semantics as a primary key in a relational database). All other fields can have non-unique values. If a value changes, you just update the document (using the merge or mergeOrUpload indexing action).
Azure Search supports batches of up to 1000 documents. If you want to update more documents than that, you'll have to break your updates into multiple batches. See Indexing API. The link shows the REST API, but the same functionality is of course available in the .NET SDK, if you're on .NET.
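As an illustrative sketch against the REST API (the service name, index name, document keys, and API key are all placeholders), updating ClientName across a batch of documents might look like this:

```python
import requests

# Hypothetical service, index, and admin key.
URL = "https://myservice.search.windows.net/indexes/books/docs/index?api-version=2020-06-30"
HEADERS = {"Content-Type": "application/json", "api-key": "<admin-key>"}

# One batch of up to 1000 actions; mergeOrUpload updates only the listed fields.
batch = {
    "value": [
        {"@search.action": "mergeOrUpload", "id": "42", "ClientName": "New Name"},
        {"@search.action": "mergeOrUpload", "id": "43", "ClientName": "New Name"},
    ]
}
resp = requests.post(URL, headers=HEADERS, json=batch)
print(resp.json())  # per-document status for each action
```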
Should I add both fields into the index and have:
ClientName: Searchable
ClientId: Facetable, Filterable
I understand having ClientId Facetable (instead of ClientName) will make it more work to show the facets, since I'll have to fetch the names corresponding to the ClientIds returned by Azure Search myself.
We do not recommend making ClientId facetable. Facets work best on fields with a relatively small number of unique values. Since ClientId by definition must be unique, faceting will not be useful and any faceting queries that reference ClientId will probably perform poorly if you have many documents in your index. It is reasonable to make ClientId filterable though, since there may be situations when you want to retrieve or exclude certain documents by ClientId.
Also, I assume that having ClientId Filterable would allow me to perform a batch rename of ClientName.
This is not necessary. Making ClientId filterable allows you to filter by ClientId, nothing more. You always need to specify document IDs when updating fields using the Index API, but that doesn't require the ID field to be filterable.
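To make that concrete, a hedged sketch of a query that filters by ClientId while faceting on a lower-cardinality field such as Genre from your edit (the names, service, and key are placeholders):

```python
import requests

# Hypothetical service, index, and query key.
URL = "https://myservice.search.windows.net/indexes/books/docs/search?api-version=2020-06-30"
HEADERS = {"Content-Type": "application/json", "api-key": "<query-key>"}

body = {
    "search": "history",
    "filter": "ClientId eq 'client-123'",  # OData filter on the filterable field
    "facets": ["Genre"],                   # facet on a field with few unique values
}
resp = requests.post(URL, headers=HEADERS, json=body)
print(resp.json().get("@search.facets"))
```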
I hope this gets you started, and as you have more specific questions, you can post them here.
We have a situation where we are keeping two indexes with different schemas.
For example: suppose we have a seller index where the key is the seller ID and the other attributes hold seller information, and another book index where the book ID is the unique key and the fields hold book-related information.
Is it possible to query both these indexes in a single query and get collective results?
I have checked Solr; as per my findings, we can do this through distributed search, but that works with the same kind of schema distributed across at most 3 indexes.
I am a newbie to Solr, so please forgive me if this is a stupid question.
You need to think about what makes sense for a search query but there are some rules.
The first requirement is that the unique keys need to have the same name and be unique across collections or Solr cannot collate results.
If you are then hoping to get some kind of sensible ranking of your results, you need some common fields. For example, I have two collections: one of product data and one containing product-related documents. I have a unique key, id, and common title and contents fields for when I want to query across the two collections. I also have an advanced search interface where I can query specific fields like product id.
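In SolrCloud you can then query across both collections in one request by passing a comma-separated collection parameter; a minimal sketch, assuming collections named products and documents that share the id key and the common fields:

```python
import requests

params = {
    "q": "title:widget OR contents:widget",  # query only the shared fields
    "collection": "products,documents",      # collate results across both collections
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/products/select", params=params)
print(resp.json()["response"]["numFound"])
```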
A "unification core" is a typical way of handling search across two or more cores, see this Stack Overflow answer on how to set that up
Query multiple collections with different fields in solr
Other techniques are to use federated search with something like Carrot or to issue two queries and show the results in different tabs in the search results.