Big Query vs Text Search API - google-app-engine

I wonder whether BigQuery is going to replace or compete with the Text Search API. It may be a silly question, but the Text Search API has been in beta for a few months and has a very strict limit on API calls, whereas BigQuery is already available and looks very promising. Any hints on what to choose for searching over constantly arriving error logs?

Google BigQuery and the App Engine Search API fulfill the needs of different types of applications.
BigQuery is excellent for aggregate queries (think: full table scans) over fixed-schema data in very large tables. The aim is speed and flexibility. BigQuery lacks the concept of indexes (by design). While it can be used for "needle in a haystack" searches, it really shines over large, structured datasets with a fixed schema. As for document-type searches, BigQuery records have a fixed maximum size, so it is not ideal as a document search engine. I would use BigQuery for queries such as: In my 200 GB of log files, what are the 10 most common referral domains, and how often did I see them?
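To make the shape of that kind of aggregate query concrete, here is a minimal local sketch in plain Python. It is not BigQuery code: the function name `top_referral_domains` and the sample URLs are made up for illustration. It only shows the "scan everything, group, count, take the top N" pattern that BigQuery runs at scale.

```python
from collections import Counter
from urllib.parse import urlparse


def top_referral_domains(referrer_urls, n=10):
    """Scan every referrer URL and return the n most common domains with counts."""
    counts = Counter()
    for url in referrer_urls:
        domain = urlparse(url).netloc.lower()
        if domain:  # skip empty or malformed referrers
            counts[domain] += 1
    return counts.most_common(n)


# Hypothetical sample data standing in for the referrer column of a log table.
sample = [
    "http://example.com/a",
    "http://example.com/b",
    "https://news.example.org/story",
    "",
]
print(top_referral_domains(sample, n=2))
```

Note that every row is read once and no index is consulted, which is why this style of question suits a full-scan engine better than a document index.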
The Search API provides sorted search results over various types of document data (text, HTML, geopoint, etc.). It is well suited to queries such as finding the documents that contain a particular string. In general, the Search API is great for document retrieval based on a query input.

Related

What's the difference between GAE Search API and Datastore queries?

I'm trying to understand which of the Search API and Datastore queries is the better fit for a search engine in my app.
I want something very scalable and fast. That means being able to find something among up to millions of results very quickly, and as soon as data has been saved it must be immediately searchable.
I'm looking to build an autocomplete search system inspired by Google search (with suggested results in real time).
So, which is the more appropriate option for a Google App Engine user?
Thanks
Both Search API and the Datastore are very fast, and their performance is not dependent on the number of documents that you index.
Search API was specifically designed to index text documents. This means you can search for text inside these text documents, and any other fields that you store in the index. Search API only supports full-word search.
The Datastore will not allow you to index text inside your documents. The Datastore, however, allows searching for entries that start with a specific character sequence.
Neither of these platforms will meet your requirements out of the box. Personally, I used the Datastore to build a search engine for music meta-data (artist names, album and recording titles) which supports both search inside a text string and suggestions (entries that contain words at any position which start with a specific character sequence). I had to implement my own inverted index to accomplish that. It works very well, especially in combination with Memcache.
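The answer does not show its implementation, so the following is only a minimal in-memory sketch of the idea it describes: an inverted index that supports both whole-word lookup and suggestions for words that start with a given prefix at any position in the text. The class and method names are hypothetical, and the Datastore and Memcache parts are left out entirely.

```python
import bisect
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index: word -> ids of the entries that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of entry ids
        self.sorted_words = []            # sorted vocabulary, used for prefix scans

    def add(self, entry_id, text):
        for word in text.lower().split():
            if word not in self.postings:
                bisect.insort(self.sorted_words, word)
            self.postings[word].add(entry_id)

    def search(self, word):
        """Entries containing the exact word."""
        return set(self.postings.get(word.lower(), set()))

    def suggest(self, prefix):
        """Entries containing any word that starts with the prefix."""
        prefix = prefix.lower()
        start = bisect.bisect_left(self.sorted_words, prefix)
        hits = set()
        for word in self.sorted_words[start:]:
            if not word.startswith(prefix):
                break  # sorted order: no later word can match
            hits |= self.postings[word]
        return hits


index = InvertedIndex()
index.add(1, "Pink Floyd")
index.add(2, "The Flaming Lips")
index.add(3, "Florence and the Machine")
print(sorted(index.suggest("fl")))
print(sorted(index.search("the")))
```

The sorted vocabulary plays the role that a "starts with" range query would play in a datastore: all words sharing a prefix sit next to each other, so a suggestion lookup is one seek plus a short scan.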

Azure Search Keeping Record of Documents Satisfying Queries

Is there any way, through Elasticsearch or Lucene metadata, to store a count of how many times a particular document has satisfied queries, even if the document was never retrieved for processing?
For example, say you issue a query and get 100 results. You process the first 10 and go no further. We would like to flag ALL 100 documents that satisfied the search criteria for later analysis.
Thanks
Currently, Azure Search does not expose this information (and neither does Elasticsearch or Lucene). However, we're working on building better ranking models, and we're thinking about capturing (and potentially exposing) this type of data.
We'd be very interested in learning more about your scenario. Could you email me at eugenesh at the usual Microsoft domain? Thanks!

Using Neo4j and Lucene in a distributed system

I am looking into Neo4j as a stripped-down document store. A key aspect of document storage is search, and I know Neo4j includes full text search via legacy indices provided by Lucene.
I would be very interested in hearing the limitations of Neo4j search capabilities in a distributed environment. Does it provide a distributed index? In what ways is it inferior to Solr or ElasticSearch? How far can I take it before I must install Solr?
-- EDIT --
We are trying to integrate two distinct search efforts. The first is standard text content search. For instance, using the Enron emails, we want to search for every email that matches "bananas" or "going to the store" and get those document bodies in response. This is where people often turn to Solr.
The second case is more complicated. We have attached a great deal of metadata to each document. We may have decided that "these" emails were the result of late-night drunk-dialing. Now I want to search for all emails that may have been the result of late-night drunk-dialing. For this kind of metadata, we believe a graph database is in order.
In a perfect world, I can use one platform to perform both queries. I appreciate that neither Neo4j nor OrientDB, Arango, etc. is designed as a full-text search database, but I'm trying to understand the limitations thereof.
In terms of volume, we are dealing at a very large scale with batch-style nightly updates. The data is content heavy, with some documents running into hundreds of pages of text, but mostly on the order of a page or two.
I once worked on a health social network where we needed search and connection-search functionality. We first went with Neo4j and were very impressed by the Cypher query language: we could express any request. However, when you throw billions of nodes at it you start to pay the price, so we started considering another graph database. This time we did a lot of research and testing, and OrientDB was clearly the winner. OrientDB is highly scalable, but you have to code your own "search algorithm" if you want to do advanced things (for example, what do these two nodes have in common?). Otherwise you have the SQL-like query language (I don't know or remember whether it has a name), and you can do some interesting things with it.
So in conclusion, I would definitely go with OrientDB.
Neo4j can provide a "distributed index" in the sense that the high availability cluster can make your index available on more than one machine, but I'm pretty sure that's not what you're after. Related to this issue is a different answer I wrote about graph partitioning, and what it takes to distribute a really large number of nodes/relationships across multiple machines. (It's not terribly simple)
Solr and Lucene do two different things (although Solr is built on top of Lucene). I think Solr and Neo4j are not comparable because they're trying to do completely different things. This site isn't about software recommendations, so I can't tell you what you should use other than to say you should read up on Solr and Neo4j and figure out which set of functionality you want. As far as I know, this is an exclusive decision, as I'm not aware of people integrating Solr with Neo4j.
Your question is very difficult to answer. I'd recommend expanding on what you are trying to do and what you have tried; you'll probably get better responses.

GAE and Prospective search: empty query

I want to create a prospective search subscription with an empty query, but GAE raises an exception:
QuerySyntaxError: query:'' detail:'Query is empty.'
This is inconsistent with the Search API, which allows empty queries. Are there any workarounds? Should I file an issue?
The Prospective Search Service is intended to support applications that filter a stream of documents; applications that want less than all documents matched. In such an application, an "empty query" would normally be considered evidence of a bug. Admittedly, empty queries might sometimes be useful for various debugging purposes, however, the decision was made to design the interface's contracts with production use in mind.
As suggested by Will Brown, if you want a subscription that will match all documents, then insert some dummy field with a constant value into your documents and then create a query that matches just that field and value. Given that there is such an easy work-around available for those rare cases when "all documents" are needed, I think it unlikely that we would provide support for empty queries. It might also be interesting to note that the prohibition against empty queries is not just in the AppEngine code but also in the backend servers that AppEngine accesses to provide the Prospective Search Service.
Although the "Search API" (which really should be called the "Retrospective Search API") may support empty queries, it is important to realize that resource utilization patterns for prospective search are very, very different from those of retrospective search. For instance, you might have an application that is streaming hundreds of documents per second into both a document index (using retrospective search) and through a query index (using prospective search). In such a system, an empty retrospective query returns only a few documents each time that query is submitted. A prospective query, on the other hand, would generate a real-time stream of all documents. The presence of just a few such prospective queries could thus generate significant load on your application. In general, if you want a firehose, real-time push feed of everything published, it is best to code that up explicitly.
You can file a feature request for this, but it is by design (I don't know why). If you know that incoming documents will have something in common, you can write a query for those; for example, if you add a field "alldocuments" with content "yes" to the document when you send the request, you could register a query like "alldocuments:yes" to match all documents.
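The dummy-field workaround described in both answers can be illustrated with a toy matcher. This is deliberately not the App Engine Prospective Search API: `ProspectiveMatcher`, `with_marker`, and the restriction to simple `field:value` queries are all assumptions made for the sketch. It only shows why a constant marker field turns "match everything" into an ordinary non-empty query.

```python
class ProspectiveMatcher:
    """Toy stand-in for a prospective search service (hypothetical, not the GAE API)."""

    def __init__(self):
        self.subscriptions = {}  # subscription id -> (field, value)

    def subscribe(self, sub_id, query):
        # Mirror the behaviour from the question: empty queries are rejected.
        if not query.strip():
            raise ValueError("Query is empty.")
        field, _, value = query.partition(":")  # only "field:value" is supported here
        self.subscriptions[sub_id] = (field, value)

    def match(self, document):
        """Return the ids of all subscriptions whose query matches the document."""
        return sorted(
            sub_id
            for sub_id, (field, value) in self.subscriptions.items()
            if document.get(field) == value
        )


def with_marker(document):
    """Copy the document and add the constant field used by the catch-all query."""
    tagged = dict(document)
    tagged["alldocuments"] = "yes"
    return tagged


matcher = ProspectiveMatcher()
matcher.subscribe("everything", "alldocuments:yes")
matcher.subscribe("errors", "level:error")
print(matcher.match(with_marker({"level": "info"})))
print(matcher.match(with_marker({"level": "error"})))
```

Because the marker is added at publish time, the catch-all subscription stays an explicit, visible choice in the application code rather than a side effect of an empty query.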

Searching over documents stored in Hadoop - which tool to use?

I'm lost in: Hadoop, Hbase, Lucene, Carrot2, Cloudera, Tika, ZooKeeper, Solr, Katta, Cascading, POI...
When you read about one of them, you can be almost sure that each of the other tools is going to be mentioned.
I don't expect you to explain every tool to me. If you could help me narrow this set down for my particular scenario, it would be great. So far I'm not sure which of the above will fit, and it looks like (as always) there is more than one way of doing what needs to be done.
The scenario is: 500 GB to ~20 TB of documents stored in Hadoop. Text documents in multiple formats: email, doc, pdf, odt. Metadata about those documents is stored in a SQL database (sender, recipients, date, department, etc.). The main source of documents will be Exchange Server (emails and attachments), but not the only one. Now to the search: the user needs to be able to do complex full-text searches over those documents. Basically, he'll be presented with a search configuration panel (a Java desktop application, not a web app) where he'll set the date range, document types, senders/recipients, keywords, etc., fire the search, and get the resulting list of documents (and, for each document, information on why it is included in the search results, i.e. which keywords were found in the document).
Which tools should I take into consideration and which not? The point is to develop such a solution with only the minimal required "glue" code. I'm proficient with SQL databases but quite uncomfortable with Apache and related technologies.
The basic workflow looks like this: Exchange Server/other source -> conversion from doc/pdf/... -> deduplication -> Hadoop + SQL (metadata) -> build/update an index <- search through the docs (and do it fast) -> present search results
Thank you!
Going with Solr is a good option. I have used it for a scenario similar to the one you describe. You can use Solr for really huge data sets, as it is a distributed index server.
But to get the metadata for all of these document formats you should use some other tool. Basically, your workflow will be this:
1) Use a Hadoop cluster to store the data.
2) Extract the data in the Hadoop cluster using MapReduce.
3) Do document identification (identify the document type).
4) Extract metadata from these documents.
5) Index the metadata in the Solr server and store other ingestion information in a database.
6) The Solr server is a distributed index server, so for each ingestion you could create a new shard or index.
7) When a search is required, search across all the indexes.
8) Solr supports all the complex searches, so you don't have to build your own search engine.
9) It also does paging for you.
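Steps 6, 7 and 9 above (one shard per ingestion, search across all shards, page the merged results) can be sketched in a few lines. This is a toy illustration only and uses no Solr API; Solr performs the distributed search and paging itself. The function names, the dictionary-based "shards", and the term-count scoring are all assumptions made for the example.

```python
def search_shard(shard, term):
    """Search one shard (doc id -> text); the score is a plain term count."""
    term = term.lower()
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term)
        if score:
            hits.append((score, doc_id))
    return hits


def search_all(shards, term, page=0, page_size=10):
    """Query every shard, merge the hits by score, and return one page."""
    merged = []
    for shard in shards:
        merged.extend(search_shard(shard, term))
    # Highest score first; ties broken by doc id so paging is stable.
    merged.sort(key=lambda hit: (-hit[0], hit[1]))
    start = page * page_size
    return merged[start:start + page_size]


# Two hypothetical shards, one per ingestion run.
shards = [
    {"a": "budget report budget", "b": "lunch menu"},
    {"c": "budget meeting"},
]
print(search_all(shards, "budget"))
print(search_all(shards, "budget", page=1, page_size=1))
```

The merge step is the part that grows with the number of shards, which is one reason the shard layout is worth planning before indexing terabytes of data.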
We've done exactly this for some of our clients by using Solr as a "secondary indexer" to HBase. Updates to HBase are sent to Solr, and you can query against it. Typically folks start with HBase, and then graft search on. Sounds like you know from the get go that search is what you want, so you can probably embed the secondary indexing in from your pipeline that feeds HBase.
You may find though that just using Solr does everything you need.
Another project to look at is Lily, http://www.lilyproject.org/lily/index.html, which has already done the work of integrating Solr with a distributed database.
Also, I do not see why you would not want to use a browser for this application. You are describing exactly what faceted search is. While you certainly could set up a desktop app that communicates with the server (parses JSON) and displays the results in a thick-client GUI, all of this work is already done for you in the browser. And Solr comes with a free faceted search system out of the box: just follow along with the tutorial.
Going with Solr (http://lucene.apache.org/solr) is a good solution, but be ready to have to deal with some non-obvious things. First is planning your indexes properly. Multiple terabytes of data will almost definitely need multiple shards on Solr for any level of reasonable performance and you'll be in charge of managing those yourself. It does provide distributed search (doing the queries off multiple shards), but that is only half the battle.
ElasticSearch (http://www.elasticsearch.org/) is another popular alternative, but I don't have much experience with it at scale. It uses the same Lucene engine, so I'd expect the search feature set to be similar.
Another type of solution is something like SenseiDB - open sourced from LinkedIn - which gives the full-text search functionality (also Lucene-based) as well as proven scale for large amounts of data:
http://senseidb.com
They've definitely done a lot of work on search over there and my casual use of it is pretty promising.
Assuming all your data is already in Hadoop, you could write some custom MR jobs that pull the data in a consistent schema-friendly format into SenseiDB. SenseiDB already provides a Hadoop MR indexer which you can look at.
The only caveat is that it is a little more complex to set up, but it will save you from scaling issues many times over, especially around indexing performance and faceting functionality. It also provides clustering support if HA is important to you, which is still in alpha for Solr (Solr 4.x is alpha at the moment).
Hope that helps and good luck!
Update:
I asked a friend who is more versed in ElasticSearch than me and it does have the advantage of clustering and rebalancing based on the # of machines and shards you have. This is a definite win over Solr - especially if you're dealing with TBs of data. The only downside is the current state of documentation on ElasticSearch leaves a lot to be desired.
As a side note, you can't really say the documents are stored in Hadoop; they are stored in a distributed file system (most probably HDFS, since you mentioned Hadoop).
Regarding searching and indexing: Lucene is the tool to use for your scenario. You can use it for both indexing and searching. It's a Java library. There is also an associated project, Solr, which lets you access the indexing and searching system through web services. You should also take a look at Solr, as it handles different types of documents (Lucene puts the responsibility for interpreting the document (PDF, Word, etc.) on your shoulders, but you can probably already do that).

Resources