Azure Cognitive Search index size estimation

I have an Azure Search index with many fields (1000 fields). However, for any given document only a few fields have values, maybe 50. Does that matter when determining how much storage will be consumed by the index? Do only populated values take up space in the index?
Also, a similar question related to the suggester/autocomplete: if most of my fields are defined to use the suggester, but only a few have values per document, is the performance of the index still negatively impacted?

I have several indexes similar to yours. The index schema has 1000-1200 properties and usually only 50 or so properties are populated. In my case 50,000 items take about 1 GB of storage, which works out to about 20 KB per item.
My conclusion is that the storage taken by the extra properties is negligible.
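For illustration, here is a minimal sketch of such a sparse index using the azure-search-documents Python SDK (the service endpoint, key, and field names are placeholders): the wide schema is declared once, and uploaded documents simply omit the fields they do not use.

    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents import SearchClient
    from azure.search.documents.indexes import SearchIndexClient
    from azure.search.documents.indexes.models import (
        SearchIndex, SearchableField, SearchFieldDataType, SimpleField,
    )

    endpoint = "https://<service>.search.windows.net"  # placeholder
    credential = AzureKeyCredential("<admin-key>")     # placeholder

    # Declare a wide schema: a key plus many optional string fields.
    fields = [SimpleField(name="id", type=SearchFieldDataType.String, key=True)]
    fields += [SearchableField(name="field_%04d" % i, type=SearchFieldDataType.String)
               for i in range(1, 1000)]
    SearchIndexClient(endpoint, credential).create_or_update_index(
        SearchIndex(name="wide-index", fields=fields))

    # A document that populates only a handful of the declared fields; the
    # unset fields are simply absent rather than stored as empty values.
    SearchClient(endpoint, "wide-index", credential).upload_documents(
        [{"id": "1", "field_0001": "hello", "field_0002": "world"}])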

Related

Does having more non-searchable, retrievable fields reduce Azure Search performance?

I have included the maximum possible number of fields while creating the search index, to be safe in case I need to retrieve those fields in the future. Many fields are currently not searchable or retrievable. Does having more fields that are not retrievable, searchable, filterable, sortable, or facetable in the index reduce search performance?
Yes, both indexing and search will be affected by having more fields. Obviously, the fewer features those fields support, the less work the indexer has to do. If a field is only retrievable, you only have to store that data. If it's searchable, you have to build the index.
As an example, I tested the same data with 2 fields and with 1000 fields. Indexing performance went from around 1300 documents per second to 30 documents per second. I expect the total amount of data to have the biggest effect on indexing performance.
The effect on query performance is negligible.
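As a concrete sketch of those per-field capabilities (again with the azure-search-documents Python SDK, and hypothetical field names), each feature is opted into per field, so a retrievable-only field asks the least of the indexer:

    from azure.search.documents.indexes.models import (
        SearchableField, SearchFieldDataType, SimpleField,
    )

    fields = [
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        # Retrievable only: the value is stored, but no inverted index,
        # filter structure, or sort structure is built for it.
        SimpleField(name="raw_payload", type=SearchFieldDataType.String),
        # Searchable + filterable + sortable: each attribute adds index
        # structures and therefore indexing work.
        SearchableField(name="title", type=SearchFieldDataType.String,
                        filterable=True, sortable=True),
    ]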

Get only N months of data indexed in a collection, on a rolling basis

Currently I am facing an issue where a MongoDB collection might have billions of records; it contains documents generated by rapidly occurring events in the system. These events get logged in the DB collection.
Since we have some 2-3 composite indexes on the same collection, searches definitely become slow.
The way out is that our customer has agreed that if we index only N months of data in MongoDB, read efficiency can increase, instead of keeping 2-3 years of data indexed while we perform read operations.
My thoughts on solution 1: we can use TTL indexes and set an expiry. After this expiry the data gets deleted from the main collection. We can somehow back up those expired records. This way we keep only the required data in the main collection.
My thoughts on solution 2: I can drop all the current indexes and create them again based on a time frame, with the condition that the indexes cover only the past N months of data. This way I can maintain a limited index, but I am not sure how feasible that is.
Question: I need more help on how I can achieve this selective indexing. It must also be rolling: records get added every day, and so must the indexing.
If you're on Mongo 3.2 or above, you should be able to use a partial index to create the "selective index" that you want -- https://docs.mongodb.com/manual/core/index-partial/#index-type-partial You'll just need to be sure that your queries share the same partial filter expression that the index has.
(I suspect that there might also be issues with the indexes you currently have, and that reducing index size won't necessarily have a huge impact on search duration. Mongo indexes are stored in a B-tree, so the time to navigate the tree to find a single item is going to scale relative to the log of the number of items. It might be worth examining the explain output for the queries that you have to see what mongo is actually doing.)
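A sketch of both ideas with pymongo; the events collection, field names, and the six-month window are hypothetical:

    import datetime
    from pymongo import ASCENDING, MongoClient

    events = MongoClient()["mydb"]["events"]

    # Solution 1: a TTL index. Documents are deleted roughly N months after
    # their createdAt, so the collection and all its indexes stay small.
    events.create_index([("createdAt", ASCENDING)],
                        expireAfterSeconds=180 * 24 * 3600)

    # Solution 2, via a partial index: keep every document, but index only
    # those newer than a cutoff. The cutoff is baked in at creation time, so
    # a rolling window means periodically rebuilding the index.
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=180)
    events.create_index(
        [("createdAt", ASCENDING), ("eventType", ASCENDING)],
        partialFilterExpression={"createdAt": {"$gte": cutoff}},
    )

    # Queries must include the partial filter's condition, or the planner
    # will not consider the partial index.
    recent = events.find({"createdAt": {"$gte": cutoff}, "eventType": "alert"})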

How to reduce the size of a generated Lucene/Solr index?

I am working on a prototype of a search system.
I have a table in Oracle with some fields, and I generated data that looks real: around 300,000 rows.
For example:
PaymentNo | Datetime         | AmountEuro | PayersName             | PayersPhoneNo | ReceiversLegal | ReceiversAcc
2314      | 2015-07-21T15:14 | 15.63      | Clinton, Barack Anjela | 1.918.0060657 | Nasa           | 5555569778664190000
230338    | 2015-08-01T15:14 | 34.87      | Merkel, George Donald  | 1.653.0060658 | PepsiCo        | 7777828443194736000
(actually there are more columns)
The size of the table in Oracle is 62 MB (as reported by Toad).
I imported the table into Solr 5.2.1 (on Windows).
The size of the index with data is 88 MB (on disk).
The size of the index without data is 67 MB.
My question is: Can I decrease the size of index?
These options have already been tested:
Decreasing the number of indexed table columns.
Switching off data storage in Solr.
Excluding some of the rows from the index.
I need an additional way to decrease the size of the index. Do you know any?
You can use all the insights provided here; some additional points I wanted to share:
Solr duplicates the data in several structures to provide fast search over it. One important thing about Solr is that it uses immutable data structures (Lucene segments) for storing all the data.
Term Dictionary: the dictionary of indexed terms, along with their frequencies and offsets into the posting lists.
Term Vectors: Solr stores a term vector for each indexed document. This is essentially a separate inverted index per document, and it is usually storage-heavy.
Stored Docs: stores each document with its fields in sequential order.
Doc Values: stores a field's values for all documents together, similar to columnar storage.
You can disable per-document term vector storage if you are not using Solr's highlighting feature.
Additionally, Solr uses different compression techniques for different types of data: bit packing/vInt compression for posting lists and numeric values, and LZ4 compression for stored fields and term vectors. It uses an FST, a specialized implementation of a trie, for storing the term dictionary.
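To act on the term-vector point above: a sketch using Solr's Schema API (available with a managed schema in Solr 5.x); the core and field name come from the example table and are otherwise placeholders. A full reindex is needed afterwards, since the existing immutable segments keep their size until they are rewritten.

    import requests

    SCHEMA_URL = "http://localhost:8983/solr/payments/schema"  # placeholder core

    requests.post(SCHEMA_URL, json={
        "replace-field": {
            "name": "PayersName",
            "type": "text_general",
            "indexed": True,       # keep the field searchable
            "stored": False,       # drop the stored copy (no highlighting/return)
            "termVectors": False,  # drop the per-document term vectors
        }
    })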

GAE — Performance of queries on indexed properties

If I had an entity with an indexed property, say "name," what would the performance of == queries on that property be like?
Of course, I understand that no exact answers are possible, but how does the performance correlate with the total number of entities for which name == x for some x, the total number of entities in the datastore, etc.?
How much slower would a query on name == x be if I had 1000 entities with name equalling x, versus 100 entities? Has any sort of benchmarking been done on this?
Some not very strenuous testing on my part indicated response times increased roughly linearly with the number of results returned. Note that even if you have 1000 entities, if you add a limit=100 to your query, it'll perform the same as if you only had 100 entities.
This is in line with the documentation which indicates that perf varies with the number of entities returned.
When I say not very strenuous, I mean that the response times were all over the place, and the line I drew through them was a very rough estimate. I'd often see an order-of-magnitude difference in performance on the same request.
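A sketch of that limit behavior with the classic App Engine Python datastore API (the model and value are hypothetical):

    from google.appengine.ext import db

    class Person(db.Model):
        name = db.StringProperty()  # indexed by default

    # With the limit, the index scan stops after 100 matches, so this costs
    # about the same whether 100 or 1000 entities have name == "Alice".
    people = Person.all().filter("name =", "Alice").fetch(limit=100)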
AppEngine does queries in a very optimized way, so from a performance standpoint it is virtually irrelevant whether you query on the name property or just do a batch-get with the keys only. Either will be linear in the number of entities returned: 1000 entities returned will be pretty much exactly 10 times slower than 100 entities returned. The total number of entities stored in your database does not make a difference. What does make a tiny difference, though, is the number of different values for "name" that occur in your database.
The way this is done is via the indices (or indexes, as preferred) stored along with your data. An index for the "name" property consists of a table that has all names sorted in alphabetical order (and a second one sorted in reverse alphabetical order, if you use descending order in any of your queries). A query then simply finds the first occurrence of the name you are querying in that table and starts returning results in order; this is called a "scan".
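A toy illustration of that scan in plain Python (not the datastore's actual implementation): the index is a sorted table of (name, key) pairs, a query bisects to the first match, then reads sequentially until the name changes or the limit is reached.

    import bisect

    # Hypothetical index table: (property value, entity key), kept sorted.
    index = sorted([("alice", "k3"), ("alice", "k7"),
                    ("bob", "k1"), ("carol", "k2")])

    def scan(name, limit):
        i = bisect.bisect_left(index, (name, ""))  # first row with this name
        results = []
        while i < len(index) and index[i][0] == name and len(results) < limit:
            results.append(index[i][1])  # cost grows with results returned
            i += 1
        return results

    print(scan("alice", 100))  # ['k3', 'k7']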
This video is a bit technical, but it explains in detail how all this works and if you're concerned about coding for maximum performance, might be a good time investment:
Google I/O 2008: Under the Covers of the Google App Engine Datastore
(the video quality is fairly bad, but the slides are also online; see the link above the video)

Help with index on database

Is it a good idea to create an index on a field of type VARCHAR(500)? I am going to do a lot of searching on it, but I am not sure whether creating an index on such a 'big' field is a good idea.
What do you think?
It is usually not a good idea, since the index files will be huge and the search relatively slow. It is better to index a prefix of the field, such as the first 32 or 64 characters. Another possibility, if it makes sense for your data, is to use a full-text index.
In general it's a good idea to create indexes on fields that you'll use for search. But, depending on the use, there are better options:
Full text search (from wikipedia): In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.
Partial index (again, from Wikipedia): in databases, a partial index, also known as a filtered index, is an index which has some condition applied to it so that it includes only a subset of the rows in the table.
Maybe you should consider giving more information on the use that index will have.
You should put indexes where frequently used queries will run faster; however, there are a number of issues to consider:
Indexes have a limited key size, e.g. MSSQL has a 900-byte limit.
Many indexes may incur overhead while writing (although it was minimal the last time I benchmarked inserting a million entries into a table with 9 indexes).
Many indexes take up precious space in the DB.
Many indexes may create deadlocks when inserting data.
Also take a look at the documentation for the database you use. Most databases have support for efficient searching in text columns; see the sketch below.
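A sketch of the prefix-index and full-text suggestions above, assuming MySQL (the table, column, and connection details are hypothetical):

    import mysql.connector

    cnx = mysql.connector.connect(user="app", password="<secret>", database="mydb")
    cur = cnx.cursor()

    # Prefix index: only the first 64 characters enter the B-tree, keeping it
    # small while still serving equality and LIKE 'abc%' lookups.
    cur.execute("CREATE INDEX idx_descr_prefix ON items (description(64))")

    # Full-text index: the better fit when searching for words inside the value.
    cur.execute("CREATE FULLTEXT INDEX idx_descr_ft ON items (description)")
    cur.execute("SELECT id FROM items WHERE MATCH(description) AGAINST (%s)",
                ("widget",))
    print(cur.fetchall())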
