Full text indexing on large files (more than 32k) - azure-cognitive-search

Is it possible to use Azure Search on blobs over 32 kB in size? I have around 500 GB of text files stored as blobs on Azure. The average blob size is around 1 MB. I was excited to try Azure Search to get full text search over these files. However, it looks like an Edm.String index field cannot be larger than 32 kB. I couldn't find this exact limit documented anywhere; I extracted it from an error message in the portal.
Is there any out-of-the-box solution on Azure that I can use to add full text search over blobs? Does the Azure team plan to remove the 32 kB field size limit?

Two different limits are potentially relevant here:
Azure Search has a limit on how many characters it will extract from a blob, which depends on the pricing tier. For the free tier, that limit is 32 * 1024 characters. For the Standard S1 and S2 pricing tiers, it's 4 million characters.
Separately, there's a limit on the size of a single term in the search index, which also happens to be 32 KB. If the content field in your search index is marked as filterable, facetable or sortable, then you'll hit this limit (regardless of whether the field is marked as searchable or not). Typically for large searchable content you want to enable searchable (and sometimes retrievable) but not the rest; that way you won't hit content-length limits from the index side.
We realize that the first limit in particular isn't documented yet; we'll reflect this in our Quotas and Limits page soon.
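To illustrate the index-side point, here is a minimal sketch of such an index definition using the Python azure-search-documents SDK (the index and field names are made up): the large content field is marked searchable, but not filterable, sortable or facetable, so the 32 KB per-term limit doesn't come into play.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchFieldDataType, SearchIndex, SearchableField, SimpleField,
)

fields = [
    # Key field: small value, safe to filter/sort on.
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    # Large blob text: searchable (and retrievable by default), but NOT
    # filterable/sortable/facetable, so the 32 KB per-term limit is avoided.
    SearchableField(name="content", type=SearchFieldDataType.String,
                    filterable=False, sortable=False, facetable=False),
]

client = SearchIndexClient("https://<service>.search.windows.net",
                           AzureKeyCredential("<admin-key>"))
client.create_index(SearchIndex(name="blob-text-index", fields=fields))
```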

Related

Azure cognitive search index size estimation

I have an Azure Search index with many fields (1000 fields). However, for any given document only a few fields have values - maybe 50. Does that matter when determining how much storage the index will consume? Do only populated values take up space in the index?
A similar question relates to the suggester/autocomplete: if most of my fields are defined to use the suggester, but only a few have values per document, is the performance of the index still negatively impacted?
I have several indexes similar to yours. The index schema has 1000-1200 properties and usually only 50 or so are populated. In my case, 50,000 items take about 1 GB of storage, which works out to roughly 20 kB per item.
My conclusion is that the storage taken by the unpopulated properties is negligible.
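If you'd rather measure than estimate, one option (a sketch against the index stats REST endpoint; the service, index and key are placeholders, and the api-version may need updating) is to divide the reported storage size by the document count:

```python
import requests

SERVICE = "https://<service>.search.windows.net"  # placeholder
INDEX = "<index-name>"                            # placeholder
API_KEY = "<admin-key>"                           # placeholder

# The stats endpoint reports documentCount and storageSize (in bytes).
resp = requests.get(
    f"{SERVICE}/indexes/{INDEX}/stats",
    params={"api-version": "2020-06-30"},
    headers={"api-key": API_KEY},
)
resp.raise_for_status()
stats = resp.json()
per_doc_kb = stats["storageSize"] / stats["documentCount"] / 1024
print(f"{stats['documentCount']} docs, ~{per_doc_kb:.1f} kB per document")
```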

Best Practice to Combine both DB and Lucene Search

I am developing an advanced search engine using .NET where users can build their query based on several fields:
Title
Content of the Document
Date From, Date To
From Modified Date, To modified Date
Owner
Location
Other Metadata
I am using Lucene to index document content and the corresponding IDs. However, the other metadata resides in an MS SQL database (to avoid enlarging the index and having to update it on every metadata modification).
How can I perform the search?
When a user searches for a term:
Narrow down the search results according to the criteria selected by the user, by looking them up in the SQL DB.
Return the matching IDs to the Lucene searcher web service, which searches for the entered keyword within the document IDs returned from the advanced search web service.
Then get the relevant metadata for the document IDs (returned from Lucene) by looking in the DB again.
As you can see, there is one lookup in the DB, then Lucene, and finally the DB again to get the values to be displayed in the grid.
Questions:
How can I overcome this situation? I thought about searching Lucene first, but this has a drawback once the indexed documents reach 2 million. (I think narrowing down the results using the DB first has a large effect on performance.)
Another issue is passing IDs to the Lucene search service: how efficient is passing hundreds of thousands of IDs, and what is the alternative solution?
I welcome any ideas, so please share your thoughts.
Your current solution incurs the following overhead at query-time:
1) Narrowing search space via MS-SQL
Generating query in your app
Sending it over the wire to MS-SQL
Parsing/Optimizing/Execution of SQL query
[!!] I/O overhead of returning 100,000s of IDs
2) Executing bounded full-text search via Lucene.NET
[!!] Lucene memory overhead of generating/executing large BooleanQuery containing 100,000s of ID clauses in app (you'll need to first override the default limit of 1024 clauses to even measure this effect)
Standard Lucene full text search execution
Returning matching IDs
3) Materializing result details via MS-SQL
Fast, indexed, ID-based lookup of search result documents (only needed for the first page of displayed results, usually about 10-25 records)
There are two assumptions you may be making that would be worth reconsidering:
A) Indexing all metadata (dates, author, location, etc...) will unacceptably increase the size of the index.
Try it out first: This is the best practice, and you'll massively reduce your query execution overhead by letting Lucene do all of the filtering for you in addition to text search.
Also, the size of your index mostly has to do with the cardinality of each field. For example, if you have only 500 unique owner names, then only those 500 strings will be stored, and each Lucene document will internally reference its owner through a symbol-table lookup (4-byte integer * 2MM docs + 500 strings = < 8MB additional).
B) MS-SQL queries will be the quickest way to filter on non-text metadata.
Reconsider this: with your metadata properly indexed using the appropriate Lucene types, you won't incur any additional overhead querying Lucene vs. querying MS-SQL. (In some cases, Lucene may even be faster.)
Your mileage may vary, but in my experience, this type of filtered-full-text-search when executed on a Lucene collection of 2MM documents will typically run in well under 100ms.
So to summarize the best practice:
Index all of the data that you want to query or filter by. (No need to store source data since MS-SQL is your system-of-record).
Run filtered queries against Lucene (e.g. text AND date ranges, owner, location, etc...)
Return IDs
Materialize documents from MS-SQL using returned IDs.
I'd also recommend exploring a move to a standalone search server (Solr or Elasticsearch) for a number of reasons; a short query sketch follows this list:
You won't have to worry about search-index memory requirements cannibalizing application memory requirements.
You'll take advantage of sophisticated filter caching performance boosts and OS-based I/O optimizations.
You'll be able to iterate upon your search solution easily from a mostly configuration-based environment that is widely used/supported.
You'll have tools in place to scale/tune/backup/restore search without impacting your application.
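To make that flow concrete, here is a small illustrative sketch using Solr via pysolr (the core URL, field names, connection string and SQL table are all assumptions; the same pattern applies to Lucene.NET or Elasticsearch): one filtered full-text query returns only IDs, and MS-SQL materializes just the page being displayed.

```python
import pysolr
import pyodbc

solr = pysolr.Solr("http://localhost:8983/solr/documents", timeout=10)

# One query does both the text search and the metadata filtering.
results = solr.search(
    "content:contract AND title:lease",
    fq=[
        "owner:alice",
        "modified:[2023-01-01T00:00:00Z TO 2023-12-31T23:59:59Z]",
    ],
    fl="id",   # return only IDs; MS-SQL stays the system of record
    rows=25,   # first page of displayed results
)
ids = [doc["id"] for doc in results]

# Materialize only the displayed page from the database.
if ids:
    conn = pyodbc.connect("DSN=docs")  # placeholder connection string
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        "SELECT Id, Title, Owner, ModifiedDate FROM dbo.Documents "
        f"WHERE Id IN ({placeholders})",
        ids,
    ).fetchall()
```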

Is there a limit on the number of files in Google Cloud Storage (GCS)?

I believe there should not be any limit, but I just wanted to confirm (as there is no mention in the official docs):
1. Is there a limit on the number of files (objects) in Google Cloud Storage (GCS)?
2. Is there a performance impact (on access and write operations) if I have a very large number of files in GCS?
3. Is there a limit on file name length (since I could use the filename to create a pseudo directory structure)?
Re (3): per https://cloud.google.com/storage/docs/bucket-naming, bucket names are limited to 222 characters (with several other limitations); per https://cloud.google.com/storage/docs/naming-objects, object names are limited to 1024 characters (when UTF-8 encoded), with one mandatory limitation ("must not contain Carriage Return or Line Feed characters") and several "strongly recommended" conventions (no control characters, avoid certain punctuation characters).
Re (1) and (2), to the best of my knowledge there are no limits on the number of objects you can store in GCS, nor performance implications that depend on that number. Google's online docs do specifically say "any amount of data".
However, if you need a firm commitment for a Project of Unusual Size (many petabytes, not the mere terabytes mentioned at https://cloud.google.com/storage/docs/overview) you may be best advised to get such a commitment "officially", by contacting Sales at https://cloud.google.com/contact/ .
http://googlecloudplatform.blogspot.com/2013/11/justdevelopit-migrates-petabytes-of-data-to-google-cloud-storage.html specifically interviews a customer using Cloud Storage for "over 10 petabytes [[growing]] at a rate of 800 terabytes a month", so, at least up to such orders of magnitude, you should definitely be fine.
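To illustrate the pseudo-directory point from (3), here is a short sketch with the google-cloud-storage Python client (the bucket and prefix names are made up): object names are just flat keys, and prefix/delimiter listing makes them behave like directories.

```python
from google.cloud import storage

client = storage.Client()

# Objects named like "logs/2024/01/app.log" live in a flat namespace;
# prefix + delimiter listing simulates a directory hierarchy.
blobs = client.list_blobs("my-bucket", prefix="logs/2024/", delimiter="/")
for blob in blobs:
    print(blob.name, blob.size)

# "Subdirectories" (deeper prefixes) are collected while iterating.
print(sorted(blobs.prefixes))
```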
There might be a limit
I am doing backups of a large number of files using Hyper Backup from a Synology DiskStation to Google Cloud Storage. While backup jobs with fewer files work well, bigger tasks always fail with the error "Authorization failed".
Synology support told me that this is because there are too many files on the Google Cloud Storage side.
I am using the legacy S3-compatible access - not the native Google access - so maybe it is due to this.

Can Apache Solr Handle TeraByte Large Data

I have been an Apache Solr user for about a year. I have used Solr for simple search tools, but now I want to use it with 5 TB of data. I assume the 5 TB of data will become 7 TB once Solr indexes it, given the filters that I use. I will then add nearly 50 MB of data per hour to the same index.
1- Are there any problems with using a single Solr server with 5 TB of data (without shards)?
a- Can the Solr server answer queries in an acceptable time?
b- What is the expected time for committing 50 MB of data on a 7 TB index?
c- Is there an upper limit on index size?
2- What suggestions do you offer?
a- How many shards should I use?
b- Should I use Solr cores?
c- What commit frequency do you suggest? (Is 1 hour OK?)
3- Are there any test results for this kind of large data?
There is no 5 TB of data available yet; I just want to estimate what the result will be.
Note: You can assume that hardware resources are not a problem.
If your sizes are for plain text, rather than binary files (whose extracted text is usually much smaller), then I don't think you can expect to do this on a single machine.
This sounds a lot like Loggly, and they use SolrCloud to handle that amount of data.
OK, if they are all rich documents, then the total text size to index will be much smaller (for me it's about 7% of the starting size). Even with that reduced amount, I still think you have too much data for a single instance.
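For reference, a SolrCloud collection spread across several shards can be created through the Collections API; the sketch below (host, collection name and shard/replica counts are placeholders to size against your own data) just calls it over HTTP with Python requests.

```python
import requests

# Create a sharded collection so no single node holds the whole 5-7 TB index.
resp = requests.get(
    "http://solr-host:8983/solr/admin/collections",
    params={
        "action": "CREATE",
        "name": "bigindex",            # placeholder collection name
        "numShards": 8,                # placeholder; size against your data
        "replicationFactor": 2,
        "collection.configName": "_default",
    },
)
resp.raise_for_status()
print(resp.json())
```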

sql server text fields seem to be taking up a lot of space

I don't have full-text indexing set up, but my text fields seem to be taking up a lot of space. It just doesn't 'feel' right.
Is it better to move the text out of the database and store it as files in the file system?
Unless your text fields are indexed, they have no reason to take up much more space than what they contain (i.e. byte length plus some overhead, minus any compression applied by your database engine).
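If you want to check what the text columns actually occupy before moving anything out of the database, a quick sketch along these lines (the table and column names are placeholders) compares the summed DATALENGTH of the column with what sp_spaceused reports for the whole table:

```python
import pyodbc

conn = pyodbc.connect("DSN=mydb")  # placeholder connection string

# Actual bytes stored in the text column.
row = conn.execute(
    "SELECT COUNT(*) AS row_cnt, "
    "       SUM(DATALENGTH(Body)) / 1024.0 / 1024.0 AS body_mb "
    "FROM dbo.Articles"
).fetchone()
print(f"{row.row_cnt} rows, ~{row.body_mb:.1f} MB in the Body column")

# Overall table + index footprint as reported by SQL Server.
for r in conn.execute("EXEC sp_spaceused 'dbo.Articles'"):
    print(list(r))
```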
