I just wanted to know which databases Google, Yahoo or Bing use for natural language query processing. They can take in very complex queries in natural language. Do they process the query programmatically, break it down into some kind of hash, and then map that to results?
Please don't mind if the question is silly; I am just a newbie. I just wanted to know what kinds of databases are used for such purposes.
Type Lucene OR
Type Solr OR
Type Sphinx
on Google or Bing.
Apart from indexing data structures, you would then also learn something about stemming, thesauri, synonyms, query expansion in search engines, Metaphone, etc. All of this would help you answer your question.
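To make those terms a bit more concrete, here is a minimal sketch of an inverted index with crude stemming and synonym-based query expansion. It is a toy, not how Lucene, Solr or Sphinx are implemented: the `stem` function is a stand-in for a real stemmer, and the `SYNONYMS` table and sample documents are made up for illustration.

```python
from collections import defaultdict

# Hypothetical synonym table; real engines load this from a thesaurus file.
SYNONYMS = {"car": {"automobile"}, "automobile": {"car"}}


def stem(word):
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, stem each token."""
    tokens = (raw.strip(".,!?\"'()") for raw in text.lower().split())
    return [stem(tok) for tok in tokens if tok]


def build_index(docs):
    """Map each stemmed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """AND across query terms; each term is OR-ed with its synonyms."""
    result = None
    for term in tokenize(query):
        variants = {term} | {stem(s) for s in SYNONYMS.get(term, ())}
        matches = set()
        for variant in variants:
            matches |= index.get(variant, set())
        result = matches if result is None else result & matches
    return result or set()


docs = {
    1: "Cars need regular servicing.",
    2: "An automobile engine is complex.",
    3: "Searching engines index documents.",
}
index = build_index(docs)
print(search(index, "car"))
print(search(index, "searching engine"))
```

Searching for "car" also finds the document that only says "automobile", which is the query-expansion idea in miniature.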
Once you are through with the above, you can read more about establishing semantic relationships between keywords, collective intelligence, and wisdom-of-the-crowd techniques. These would help you establish similarity between keywords such as java, jee, jsp and servlets.
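One simple way to approximate that kind of keyword similarity is to look at how often two keywords are used together by many people, for example as tags on the same post. The sketch below uses Jaccard similarity over co-occurrence; the `posts` data is invented, and real systems use far larger corpora and more refined measures.

```python
from collections import defaultdict

# Hypothetical "crowd" data: the set of tags users put on each post.
posts = [
    {"java", "jsp", "servlets"},
    {"java", "jee", "servlets"},
    {"jsp", "servlets"},
    {"python", "django"},
    {"java", "jee"},
]

# For every keyword, remember which posts it appears on.
occurrences = defaultdict(set)
for post_id, tags in enumerate(posts):
    for tag in tags:
        occurrences[tag].add(post_id)


def similarity(a, b):
    """Jaccard similarity: shared posts divided by all posts using either tag."""
    union = occurrences[a] | occurrences[b]
    if not union:
        return 0.0
    return len(occurrences[a] & occurrences[b]) / len(union)


print(similarity("java", "jee"))
print(similarity("java", "python"))
```

Keywords that people tend to use together (java and jee here) score high, while unrelated ones (java and python in this sample) score zero.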
What I am doing and what I have done so far:
I'm developing a question answering system using Solr. My data set is product reviews in JSON format (each record contains a product id and its reviews from different users). I have indexed the data set and successfully got responses back from the indexed data.
Requirements:
In my Q/A system I will provide a query in natural language, for example "why should I buy X (product name)". The system should be capable of recognizing phrases in the reviews such as "it's easy to use, flexible product" and should frame its answer based on those phrases.
I would like to know the following
How can I turn a natural language query into an executable Solr query?
How can I prepare my answer to the query?
What kind of NLP models should I use?
How should I train my Q/A system?
Any other information that can help me achieve these requirements is welcome.
You are nowhere near Solr yet. You have to go back and look for the actual NLP (Natural Language Processing) system. If it uses Solr (or OpenNLP, which integrates with Solr), great. If not, you have to invent this bridge yourself. It does not just come with Solr, as this is still cutting-edge research.
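To show what the very simplest form of such a bridge might look like, here is a naive sketch that strips filler words from a question and builds Solr request parameters from what is left. This is plain keyword extraction, not real NLP, and it does nothing to compose an answer. The stopword list and the field names `product_name` and `review_text` are assumptions; `q`, `defType`, `qf` and `rows` are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Hypothetical filler words for this domain; a real system needs a proper NLP step.
STOPWORDS = {"why", "should", "i", "buy", "the", "a", "an", "is", "what", "how", "do"}


def question_to_solr_params(question,
                            product_field="product_name",
                            text_field="review_text"):
    """Turn a question into Solr parameters (field names are assumptions)."""
    terms = [w.strip("?.,!\"'").lower() for w in question.split()]
    keywords = [t for t in terms if t and t not in STOPWORDS]
    return {
        "q": " ".join(keywords),
        "defType": "edismax",                          # extended DisMax parser
        "qf": f"{product_field}^2 {text_field}",       # boost product name matches
        "rows": 10,
    }


params = question_to_solr_params("Why should I buy the Acme Blender?")
query_string = urlencode(params)
print(query_string)
```

The resulting query string would be appended to the select handler URL of your own core. Turning the matching reviews into a framed answer is the part that still needs an actual NLP component.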
I've seen threads discussing both of these topics:
Does Azure Search handle synonyms
Fuzzy Search in the Search API
I see that Liam Cavanagh from the Azure Search team seems to be the guy who's answered queries on these threads.
Liam, are you able to confirm the following yet please:
When full synonym support will be added to Azure Search
Do you definitely plan to support synonyms with Azure Search, or is it possible that you will recommend that customers use the Bing Synonyms product instead?
Do you have any plans to go beyond fuzzy logic and offer more advanced support for misspellings (i.e. multiple letters missing or in the wrong order, which stemming won't cover)?
Many thanks,
Ali
I don't know why you got a negative vote as I think these are really good questions. Let me answer your questions as best as I can:
You are correct that we have not implemented "full synonym support". It is one of our most highly requested features, so it is definitely on our near-term list, although I am sorry that I cannot provide a date yet. If you have time, please cast your vote for this here: http://feedback.azure.com/forums/263029-azure-search/suggestions/8410635-support-custom-dictionary In the meantime, there are "hacks" that are far from perfect but can get you part of the way. One example is to add a Collection field and then populate it with the relevant synonyms for each document.
I can not say that this is a "definite" feature, but given how often we hear this request, hopefully I have given you an insight into the likelihood that it will be implemented.
I am curious whether you have tried our brand new Lucene Query Expression support (https://msdn.microsoft.com/en-us/library/mt589323.aspx)? There are some really great capabilities for fuzzy search, and also for things like RegEx searches. This is pretty awesome (IMO).
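As a rough illustration of the Collection-field workaround for synonyms mentioned above, the sketch below enriches each document with a list of synonyms before it is uploaded. The field name `synonyms`, the lookup table and the sample document are all hypothetical; in the index the field would be declared as a searchable Collection(Edm.String).

```python
# Hypothetical synonym table maintained by the application.
SYNONYMS = {
    "laptop": ["notebook", "ultrabook"],
    "tv": ["television"],
}


def add_synonyms(doc, text_fields=("title",), synonyms=SYNONYMS):
    """Return a copy of doc with a 'synonyms' collection built from its text."""
    found = []
    for field in text_fields:
        for word in doc.get(field, "").lower().split():
            for syn in synonyms.get(word, []):
                if syn not in found:      # keep order, skip duplicates
                    found.append(syn)
    return {**doc, "synonyms": found}


doc = add_synonyms({"id": "1", "title": "Lightweight laptop"})
print(doc)
```

A search for "notebook" would then match this document through the extra field, at the cost of re-indexing documents whenever the synonym table changes.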
I hope this helps, and I am sorry that I am not yet able to give more definitive dates on some of these questions.
Liam
I am looking into Neo4j as a stripped-down document store. A key aspect of document storage is search, and I know Neo4j includes full text search via legacy indices provided by Lucene.
I would be very interested in hearing the limitations of Neo4j search capabilities in a distributed environment. Does it provide a distributed index? In what ways is it inferior to Solr or ElasticSearch? How far can I take it before I must install Solr?
-- EDIT --
We are trying to integrate two distinct search efforts. The first is standard text content search. For instance, using the Enron emails, we want to search for every email that matches "bananas" or "going to the store" and get those document bodies in response. This is where people often turn to Solr.
The second case is more complicated, we have attached a great deal of meta-data to each document. We may have decided that "these" emails were the result of late-night drunk-dialing. Now I want to search for all emails that may have been the result of late-night drunk-dialing. For this kind of meta-data, we believe a graph database is in order.
In a perfect world, I could use one platform to perform both queries. I appreciate that neither Neo4j nor OrientDB, Arango, etc. is designed as a full-text search database, but I'm trying to understand the limitations.
In terms of volume, we are dealing at a very large scale with batch-style nightly updates. The data is content heavy, with some documents running into hundreds of pages of text, but mostly on the order of a page or two.
I once worked on a health social network where we needed both ordinary search and connection search. We first went with Neo4j and were very impressed by the Cypher query language: we could express any request. However, when you throw billions of nodes at it you start to pay the price, so we started considering another graph database. This time we did a lot of research and testing, and OrientDB was clearly the winner. OrientDB is highly scalable, but you have to code your own "search algorithm" if you want to do advanced things (such as finding what two nodes have in common). Otherwise you have the SQL-like query language (I don't remember whether it has a name), and you can do some interesting things with it.
So in conclusion, I would definitely go with OrientDB.
Neo4j can provide a "distributed index" in the sense that the high availability cluster can make your index available on more than one machine, but I'm pretty sure that's not what you're after. Related to this issue is a different answer I wrote about graph partitioning, and what it takes to distribute a really large number of nodes/relationships across multiple machines. (It's not terribly simple)
Solr and Lucene do two different things (although Solr is built on top of Lucene). I think Solr and Neo4j are not comparable because they're trying to do completely different things. This site isn't about software recommendations, so I can't tell you what you should use other than to say you should read up on Solr and Neo4j and figure out which set of functionality you want. As far as I know, this is an exclusive decision, as I'm not aware of people integrating Solr with Neo4j.
Your question is very difficult to answer, I'd recommend expanding on what you are trying to do and what you have tried, you'll probably get better responses.
Are either HBase/Hive suitable replacements as your traditional (non)relational database? Will they be able to serve up web-requests from web clients and respond in a timely manner? Are HBase/Hive only suitable for large dataset analysis? Sorry I'm a noob at this subject. Thanks in advance!
Hive is not at all suitable for any real-time need such as timely web responses. You can use HBase, though. But don't think of either HBase or Hive as a replacement for a traditional RDBMS; they were meant to serve different needs. If your data is not large enough to need them, you are better off with an RDBMS. RDBMSs are still the best choice (if they fit your requirements). Technically speaking, HBase is really more a datastore than a database, because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages.
The most important thing that could strike a newbie is HBase's lack of SQL support, since it belongs to the NoSQL family of stores.
HBase and Hive are also not the only options for handling large datasets. You have several others, such as Cassandra, Hypertable, MongoDB and Accumulo, but each one is meant to solve a specific problem. For example, MongoDB is used for handling document data. So you need to analyze your use case first and then choose the datastore that suits your requirements.
You might find this list, which compares different NoSQL datastores, useful.
HTH
Hive is a data warehouse tool, and it is mainly used for batch processing.
HBase is a NoSQL database which allows random access based on the rowkey (primary key). It is used for transactional access. It doesn't have secondary indexing support, which could be a limitation for your needs.
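To illustrate what "random access based on rowkey" and "no indexing support" mean in practice, here is a small in-memory sketch of a store that keeps rows sorted by key. It is a conceptual model only, not the HBase API; the class and method names are made up.

```python
import bisect


class RowKeyStore:
    """Toy model of a store that keeps rows sorted by rowkey."""

    def __init__(self):
        self._keys = []   # sorted list of rowkeys
        self._rows = {}   # rowkey -> dict of column values

    def put(self, rowkey, row):
        if rowkey not in self._rows:
            bisect.insort(self._keys, rowkey)
        self._rows[rowkey] = row

    def get(self, rowkey):
        """Fast point lookup by rowkey."""
        return self._rows.get(rowkey)

    def scan_prefix(self, prefix):
        """Fast range scan: jump to the first matching key, read in order."""
        start = bisect.bisect_left(self._keys, prefix)
        out = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            out.append((key, self._rows[key]))
        return out

    def find_by_value(self, column, value):
        """No secondary index: every row has to be inspected."""
        return [k for k in self._keys if self._rows[k].get(column) == value]


store = RowKeyStore()
store.put("user#002", {"name": "Bob", "city": "Paris"})
store.put("user#001", {"name": "Ann", "city": "Oslo"})
store.put("order#001", {"total": 42})
print(store.get("user#001"))
print(store.scan_prefix("user#"))
print(store.find_by_value("city", "Oslo"))
```

Lookups and scans by key are cheap, while a query on any other column degrades into a full scan, which is why rowkey design matters so much for this kind of store.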
Thanks,
Dino
I'm familiar with developing desktop apps in Clojure (I've written a multithreaded interactive visualization system). However, I'm fairly new to web development using Clojure.
I plan to use Clojure on the server for handling logic, and ClojureScript for handling client-side work. However, I don't know what to use for my database server. Should I use something like MongoDB? Or Hadoop? Or ...?
The app is something very simple: a basic forum. The total number of concurrent users will be < 100 at any given time. One thing that is important to me is easy backup and data consistency -- it's very, very important to me that I can easily make daily backups (and not lose all the data).
Thanks!
You can use many databases; if the database has an API for Java, you should be good to go. MySQL, MongoDB, Postgres, Hadoop… and more.
For a nice overview of the webstack in Clojure, check out brehaut's article on the matter.
For getting up and running quickly with Clojure and ClojureScript, try ClojureScriptOne.
There are many ways to write what you want to write; if you're already familiar with Clojure, it shouldn't be too hard to get going.
Haven't used it myself, but Datomic ( http://datomic.com/ ) looks great for anyone coming from Clojure.
Datomic is an amazing database, and I'd highly recommend it. It has many features which set it apart from other database systems:
Like Clojure's data structures, it's persistent, meaning that by default, adding new facts to the database doesn't delete old facts, allowing you to query the state of the database at a previous point in time, enhancing audit-ability and assistance in debugging.
The underlying Entity Attribute Value (EAV/triple) data model (at least partly inspired by RDF & the Semantic Web), is extremely flexible, allowing you to express arbitrary graph structures and effortlessly deal with polymorphism.
The query language is a flavor of Datalog, a sort of pattern-matching-based query language that is strictly more expressive than SQL and the like in that it can do recursive queries, making it particularly well suited to graph data and graph queries.
In addition to Datalog queries, there's a pull API, which lets you pull data out of the database more simply using a GraphQL-like expression that specifies the shape of the document-like structure you'd like to get back. These queries can even be used from within the :find clause of a Datalog query.
You can use Clojure functions from within your queries.
The indexing system is very smart and more or less automatic, in stark contrast with the work that typically goes into tuning SQL databases for performance.
Transactions go through a different API/function call than queries, meaning that the number one security risk identified by OWASP (SQL injection) is literally impossible in Datomic.
The transactor/read-replica design makes it super easy to scale reads/queries, while keeping pressure off the transactor.
It's fun as hell.
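To give a feel for the first two points in the list above (facts are never overwritten, and everything is an entity/attribute/value triple), here is a toy append-only fact store in Python. It is only a conceptual sketch, not Datomic's API: the class, its methods and the sample data are invented, and it assumes every attribute holds a single value.

```python
class FactStore:
    """Toy append-only store of (tx, entity, attribute, value) facts."""

    def __init__(self):
        self._log = []   # facts are only ever appended, never updated
        self._tx = 0     # monotonically increasing transaction id

    def transact(self, facts):
        """Append a batch of (entity, attribute, value) facts; return the tx id."""
        self._tx += 1
        for entity, attribute, value in facts:
            self._log.append((self._tx, entity, attribute, value))
        return self._tx

    def entity(self, entity, as_of=None):
        """Latest value of each attribute, optionally as of an earlier tx."""
        state = {}
        for tx, e, a, v in self._log:
            if e == entity and (as_of is None or tx <= as_of):
                state[a] = v   # later facts supersede earlier ones
        return state


db = FactStore()
t1 = db.transact([("user/1", "name", "Ann"),
                  ("user/1", "email", "ann@old.example")])
t2 = db.transact([("user/1", "email", "ann@new.example")])
print(db.entity("user/1"))
print(db.entity("user/1", as_of=t1))
```

Because the old email fact is still in the log, asking for the entity as of the first transaction returns the earlier state, which is the audit and debugging benefit described above.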
One of the things worth pointing out here is that by embracing the EAV data model and Datalog/pull queries, Datomic ends up having structural flexibility closer to that of a NoSQL database, while still being fundamentally relational, and even more expressive in its relational queries than SQL.
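As an example of the kind of recursive question this is good at, consider a forum where posts reply to other posts and you want every ancestor of a given post. The sketch below answers that over plain EAV triples in Python; it shows what a recursive rule computes, not Datalog syntax or Datomic's API, and the attribute names and data are made up.

```python
def reachable(triples, attribute, start):
    """All values reachable from start by following attribute repeatedly."""
    edges = {}
    for entity, attr, value in triples:
        if attr == attribute:
            edges.setdefault(entity, set()).add(value)

    seen = set()
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:       # guard against cycles
                seen.add(nxt)
                frontier.append(nxt)
    return seen


triples = [
    ("post/3", "reply-to", "post/2"),
    ("post/2", "reply-to", "post/1"),
    ("post/4", "reply-to", "post/1"),
    ("post/1", "author", "ann"),
]
print(reachable(triples, "reply-to", "post/3"))
```

In a Datalog-style language the same thing is typically a two-clause recursive rule, which is the expressiveness being praised here.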
It's amazing and you should absolutely give it a shot. It will melt your brain a little. In the good way.
It's also worth noting that its popularity has inspired a number of successful open source projects, so the underlying approach is not going anywhere any time soon:
DataScript: In memory clj/cljs partial implementation
Datahike: Fork of DataScript which queries over on disk indices, meaning you don't have to keep everything in memory to query
Mentat: Mozilla project trying to make a Datomic-alike for a Mozilla project