Azure Search - size maxed - any options? - azure-cognitive-search

Azure Search service maxes out at 300GB of data. As of today, we've exceeded that. Our database table consists mainly of unstructured text from website news articles around the world.
Do we have any options at all here? We like Azure Search and have built our entire back-end infrastructure around it, but now we're dead in the water with being able to add any more documents to it. Does Azure Search allow compression on the documents?

Azure Search offers a variety of SKUs. The biggest one allows you to index up to 2.4 TB per service. You can find more details here.
Note, changing SKUs requires re-indexing the data.
We don't provide data compression. If you'd like to talk to Azure Search program managers about your capacity requirements, feel free to reach out to #liamca.

Related

azure cognitive search migration to other product?

We have some products that heavily rely on azure cognitive search, it is a good product, but gradually we got quite a lot of problems with it, including:
You can't scale up or scale down without deleting the whole instance.
You can't use pipeline to create/update indexes unless you call it using web api. Modifying/delete field in an index is also not straightforward.
No data replication between search instances.
No cross index query even in the same instance.
No case insensitive search
Suggestions for above have been sitting in Microsoft's suggestion site for years and nothing was every done to address it. I really have no idea when or ever Microsoft will bother to provide better service.
As a result, I am starting to look around for alternative products (looking at elastic search at the moment). Is there any product that supports search syntax translation that make the migration easier (so we don't have to break so many things)?

Elasticsearch as primary data source for chat application

As per many questions from stackoverflow, to choose elasticsearch as a primary datasource is solely depends on the usecase. So if I want to implement chat application similar to Microsoft teams or any, Will it be a good idea to use it as a primary source of truth?
I want to implement app-wide search in the application.
reads are more than writes. around 1000 reads per sec while writes could just be just 20 per sec plus files upto size 10MB
another point worth mentioning would be that the data type is json data and the search is done on all textual data from a given set of fields
I recommend to use apache cassandra. it is developed by facebook engineer as chat server and it was donated as open source project to apache at 2008.
elasticsearch is a search engine and it is suitable if you want all kind of search on user's chat content.
cassandra is very faster in read/write and you can support more user than elasticsearch.

Is it possible to get data from other companies' databases?

I was wondering how so many job sites have so many job offers/information regarding other companies' offers. For instance, if I were to start my own job searching engine, how would I be able to get the information that sites like indeed.com have in my own databases? One site (jobmaps.us) says that it's "powered by indeed" and seems to be following the same format as indeed.com (as do all other job searching websites). Is there some universal job searching template that I can use?
Thanks in advance.
Some services offer an API which allows you to "federate" searches (relay them to multiple data sources, then gather all the results together for display in one place). Alternatively, some offer a mechanism that would allow you to download/retrieve data, so you can load it into your own search index.
The latter approach is usually faster and gives you total control but requires you to maintain a search index and track when data items are updated/added/deleted on remote systems. That's not always trivial.
In either case, some APIs will be open/free and some will require registration and/or a license. Most will have rate limits. It's all down to whoever owns the data.
It's possible to emulate a user browsing a website, sending HTTP requests and analysing the response from a web server. By knowing the structure of the HTML, it's possible to extract ("scrape") the information you need.
This approach is often against site policies and is likely to get you blocked. If you do go for this approach, ensure that you respect any robots.txt policies to avoid being blacklisted.

Storing 100k map markers in App Engine

I'm designing yet another "Find Objects near my location" web site and mobile app.
My requirements are:
Store up to 100k objects;
Query for objects that are close to the point (my location, city, etc). And other search criteria (like object type);
Display results on the Google Maps with smooth performance.
Let user filter objects by object time.
I'm thinking about using Google App Engine for this project.
Could You recommend what would be the best data storage option for this?
And couple of words about dynamic data loading strategy.
I kinda feel overwhelmed with options at the moment and looking for hints where should I continue my research.
Thanks a lot!
I'm going to to assume that you are using the datastore. I'm not familiar with Google Cloud SQL (which I believe aims to offer MySQL-like features in the cloud), so I can't speak if it can do geospatial queries.
I've been looking into the whole "get locations in proximity of a location" problem for a while now. I have some good and bad news for you, unfortunately.
The best way to do the proximity search in the Google Environment is via the Search Service (https://developers.google.com/appengine/docs/python/search/ or find the JAVA link ). Reason being is that it supports a "Geopoint Field", and allows you to query in such a way.
Ok, cool, so there is support, right? However, "A query is complex if its query string includes the name of a geopoint field or at least one OR or NOT boolean operator". The free quota for Complex Search Queries are 100/day. Per 10,000 queries, it costs 60 cents. Depending on your application, this may be an issue.
I'm not too familar with the Google Maps API you might be able to pull off something like this :(https://developers.google.com/maps/articles/phpsqlsearch_v3)
My current project/problem involves moving locations, and not "static" ones (stores, landmarks,etc). I've decided to go with Amazon's Dynamodb and they have a library which supports geospatial indexing : http://aws.amazon.com/about-aws/whats-new/2013/09/05/announcing-amazon-dynamodb-geospatial-indexing/

Searching over documents stored in Hadoop - which tool to use?

I'm lost in: Hadoop, Hbase, Lucene, Carrot2, Cloudera, Tika, ZooKeeper, Solr, Katta, Cascading, POI...
When you read about the one you can be often sure that each of the others tools is going to be mentioned.
I don't expect you to explain every tool to me - sure not. If you could help me to narrow this set for my particular scenario it would be great. So far I'm not sure which of the above will fit and it looks like (as always) there are more then one way of doing what's to be done.
The scenario is: 500GB - ~20 TB of documents stored in Hadoop. Text documents in multiple formats: email, doc, pdf, odt. Metadata about those documents stored in SQL db (sender, recipients, date, department etc.) Main source of documents will be ExchangeServer (emails and attachments), but not only. Now to the search: User needs to be able to do complex full-text searches over those documents. Basicaly he'll be presented with some search-config panel (java desktop application, not webapp) - he'll set date range, document types, senders/recipients, keywords etc. - fire the search and get the resulting list of the documents (and for each document info why its included in search results i.e. which keywords are found in document).
Which tools I should take into consideration and which not? The point is to develop such solution with only minimal required "glue"-code. I'm proficient in SQLdbs but quite uncomfortable with Apache-and-related technologies.
Basic workflow looks like this: ExchangeServer/other source -> conversion from doc/pdf/... -> deduplication -> Hadopp + SQL (metadata) -> build/update an index <- search through the docs (and do it fast) -> present search results
Thank you!
Going with solr is a good option. I have used it for similar scenario you described above. You can use solr for real huge data as its a distributed index server.
But to get the meta data about all of these documents formats you should be using some other tool. Basically your workflow will be this.
1) Use hadoop cluster to store data.
2) Extract data in hadoop cluster using map/redcue
3) Do document identification( identify document type)
4) Extract meta data from these document.
5) Index metadata in solr server, store other ingestion information in database
6) Solr server is distributed index server, so for each ingestion you could create a new shard or index.
7) When search is required search on all the indexs.
8) Solr supports all the complex searches , so you don't have to make your own search engine.
9) It also does paging for you as well.
We've done exactly this for some of our clients by using Solr as a "secondary indexer" to HBase. Updates to HBase are sent to Solr, and you can query against it. Typically folks start with HBase, and then graft search on. Sounds like you know from the get go that search is what you want, so you can probably embed the secondary indexing in from your pipeline that feeds HBase.
You may find though that just using Solr does everything you need.
Another project to look at is Lily, http://www.lilyproject.org/lily/index.html, which has already done the work of integrating Solr with a distributed database.
Also, I do not see why you would not want to use a browser for this application. You are describing exactly what faceted search is. While you certainly could set up a desktop app that communicates with the server (parses JSON) and displays the results in a thick client GUI, all of this work is already done for you in the browser. And, Solr comes with a free faceted search system out of the box: just follow along the tutorial.
Going with Solr (http://lucene.apache.org/solr) is a good solution, but be ready to have to deal with some non-obvious things. First is planning your indexes properly. Multiple terabytes of data will almost definitely need multiple shards on Solr for any level of reasonable performance and you'll be in charge of managing those yourself. It does provide distributed search (doing the queries off multiple shards), but that is only half the battle.
ElasticSearch (http://www.elasticsearch.org/) is another popular alternative, but i don't have much experience with it regarding scale. It uses the same Lucene engine so i'd expect the search feature-set to be similar.
Another type of solution is something like SenseiDB - open sourced from LinkedIn - which gives the full-text search functionality (also Lucene-based) as well as proven scale for large amounts of data:
http://senseidb.com
They've definitely done a lot of work on search over there and my casual use of it is pretty promising.
Assuming all your data is already in Hadoop, you could write some custom MR jobs that pull the data in a consistent schema-friendly format into SenseiDB. SenseiDB already provides a Hadoop MR indexer which you can look at.
The only caveat is it is a little more complex to setup, but will save you with the scaling issues many times over - especially around indexing performance and faceting functionality. It also provides clustering support if HA is important to you - which is still in Alpha for Solr (Solr 4.x is alpha atm).
Hope that helps and good luck!
Update:
I asked a friend who is more versed in ElasticSearch than me and it does have the advantage of clustering and rebalancing based on the # of machines and shards you have. This is a definite win over Solr - especially if you're dealing with TBs of data. The only downside is the current state of documentation on ElasticSearch leaves a lot to be desired.
As a side note, you can't say the documents are stored in Hadoop, they are stored in a distributed file system (most probably HDFS since you mentioned Hadoop).
Regarding searching/indexing: Lucene is the tool to use for your scenario. You can use it for both indexing and searching. It's a java library. There is also an associated project (called Solr) which allows you to access the indexing/searching system through WebServices. So you should also take a look at Solr as it allows the handling of different types of documents (Lucene puts the responsability of interpreting the document (PDF, Word, etc) on your shoulders but you, probably, can already do that)

Resources