What programming language is used to IMPLEMENT google algorithm? - database

It is well known that Google has the best search and indexing algorithms.
They also have good relevancy.
They are also quicker at picking up the latest results.
All that's fine.
What programming language (C, C++, Java, etc.) and database (Oracle, MySQL, etc.) have they used to achieve this (since they have to manipulate large volumes of data quickly and effectively)?
Though I'm not looking for their in-depth architecture (in case that violates their company policies), an overview of all such things would be useful.
Could anybody please add valuable suggestions and insight on this?

Google internally uses C++, Java, and Python. See Rhino on Rails:
One of the (hundreds of) cool things
about working for Google is that they
let teams experiment, as long as it's
done within certain broad and
well-defined boundaries. One of the
fences in this big playground is your
choice of programming language. You
have to play inside the fence defined
by C++, Java, Python, and JavaScript.
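Google's index-building pipeline has relied heavily on MapReduce, a programming model that stems from functional programming techniques and is implemented in C++. MapReduce is the batch processing framework behind the index rather than the ranking algorithm itself.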
Google has its own storage mechanism for this called the Google File System.
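To give a feel for the MapReduce model mentioned above, here is a minimal single-process sketch in Python that builds an inverted index. It is only an illustration of the map / shuffle / reduce shape, not Google's C++ implementation; the function names and the sample documents are made up.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit one (word, doc_id) pair per word in a document."""
    for word in text.lower().split():
        yield word, doc_id


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, doc_ids):
    """Reduce step: collapse grouped values into a sorted, de-duplicated posting list."""
    return word, sorted(set(doc_ids))


def build_index(docs):
    """Run map, shuffle, and reduce in sequence over a dict of {doc_id: text}."""
    pairs = (pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text))
    return dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())


# Hypothetical sample documents
docs = {1: "pigeons rank pages", 2: "pages link to pages"}
index = build_index(docs)
```

In a real deployment the map and reduce calls would run in parallel on many machines, with input and output living in a distributed file system; here everything runs in one process so the data flow is easy to follow.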

Mainly pigeons:
PigeonRank's success relies primarily on the superior trainability of the domestic pigeon (Columba livia) and its unique capacity to recognize objects regardless of spatial orientation. The common gray pigeon can easily distinguish among items displaying only the minutest differences, an ability that enables it to select relevant web sites from among thousands of similar pages.

The relevance of search results is governed by the quality of the information retrieval algorithms they use, not the programming language.
But C++ is what most of their backend code is written in (for most services).
They don't use any off-the-shelf RDBMS products for data storage. All of that is written in-house.

Check out Bigtable.

Related

Find topics in Machine learning in gcloud

I'm new to machine learning. I see some services on Google Cloud Platform related to AI, and I think these are easy to use.
Here is what I need: I have around 20K paragraphs (3 or 4 lines each), and I need to find the paragraph that best matches a user's question. The user asks any question or types any sentence, and I need to find the paragraph most similar to that sentence. How can I do that? Which services do I need to achieve this? I want to use Google Cloud Platform. Is it possible in gcloud, and if yes, how?
I think you can start by looking through the GCP AI & ML product list. To narrow down the initial request and find the best match for your use case, I would advise getting more details about the GCP AutoML products, which offer complete solutions for generic machine learning models, such as the AutoML Natural Language model, designed specifically for document and text analysis tasks.
I would encourage you to start with the AutoML Natural Language beginner's guide to get more context, and to have a look at features and capabilities such as the classification, entity extraction, and sentiment analysis training approaches.
From the developer's perspective, Cloud AutoML Natural Language supports client libraries for most well-known programming languages and offers good REST API documentation.
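If you want to see the underlying idea before committing to a managed service, one classic baseline for "find the most similar paragraph" is TF-IDF weighting with cosine similarity. Below is a minimal sketch using only the Python standard library. It is not an AutoML or Google Cloud API, it matches on shared words only (no synonyms or stemming), and the function names and sample paragraphs are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(paragraphs):
    """Compute a smoothed inverse document frequency for every corpus word."""
    n = len(paragraphs)
    doc_freq = Counter()
    for paragraph in paragraphs:
        doc_freq.update(set(tokenize(paragraph)))
    return {word: math.log((1 + n) / (1 + count)) + 1.0 for word, count in doc_freq.items()}


def vectorize(text, idf):
    """Turn text into a sparse TF-IDF vector; words unseen in the corpus are ignored."""
    term_freq = Counter(tokenize(text))
    return {word: count * idf[word] for word, count in term_freq.items() if word in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(question, paragraphs):
    """Return (index, score) of the paragraph most similar to the question."""
    idf = build_idf(paragraphs)
    vectors = [vectorize(p, idf) for p in paragraphs]
    query = vectorize(question, idf)
    scores = [cosine(query, v) for v in vectors]
    best = max(range(len(paragraphs)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical sample data
paragraphs = [
    "Refunds are issued within five business days of a cancelled order.",
    "Passwords can be reset from the account settings page.",
    "Shipping to Europe usually takes one to two weeks.",
]
idx, score = best_match("How do I reset my password?", paragraphs)
```

With 20K short paragraphs this approach is cheap enough to run on one machine. Its main weakness is that it relies on exact word overlap, which is where embedding-based or managed NLP services tend to do better.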

comments in multilingual support

I am building a web application to support factory automation, which will have users from various countries who speak different languages. I have internationalized all the strings in the website so it is understandable by all users. However, users have to write comments on the website related to factory operations. They will write these in their own language, so the comments may not be understandable by users in other countries.
I would like to know the best practices for handling this scenario.
One approach I was considering is to not let users write free-text comments and instead provide a set of possible comments in a drop-down for them to select from. I can then internationalize those options. But this is not an elegant solution, since the 'possible comments' may not be comprehensive.
There isn't really a solid no-fail solution available for this kind of problem, but here are some possibilities:
Leverage a translation engine and computer-translate the comments. How well this works depends on the engine used and the language, but it gives the reader a gist of the meaning. This solution loses a lot of use when there are a lot of technical or proprietary terms used. A lot of international webshops actually use this technique.
Encourage your users to post comments in a common language, or a language that most of your users will know, like English, Chinese, or Spanish, depending on your markets.
Employ translators to regularly translate essential comments.
The solution you mentioned is also pretty decent when the possible text is limited, otherwise it will spin out of control very fast.

What factors to consider when choosing a Multi-model DBMS? (OrientDB vs ArangoDB)

I am looking to dip my toes into the world of multi-model DBMSs. I have no particular use case; I just want to start learning.
I find that there are two prominent ones, OrientDB and ArangoDB, but I was unable to find any meaningful, unopinionated comparison between them. Can someone shed some light on the difference in features between the two, and any caveats in using one over the other? If I learn one, would I be able to easily transition to the other?
(I tagged FoundationDB as well, but it is proprietary and I probably won't consider it)
This question asks for a general comparison between OrientDB vs ArangoDB for someone looking to learn about Multi-model DBMS, and not an opinionated answer about which is better.
Disclaimer: I would no longer recommend OrientDB, see my comments below.
I can provide a slightly less biased opinion, having used both ArangoDB and OrientDB. It's still biased as I'm the author of OrientDB's node.js driver - oriento but I don't have a vested interest in either company or product, I've just necessarily used OrientDB more.
ArangoDB and OrientDB are both targeting a similar market and have a lot of similarities:
Both are multi-model, you can use them to store documents, graphs and simple key / values.
Both have support for Gremlin, but it's firmly a second class citizen compared to their own preferred query languages.
Both support server-side "stored procedures" in JavaScript. In both systems this comes via a slightly less than idiomatic JavaScript API, although ArangoDB's is a lot better. This is getting fixed in a forthcoming version of OrientDB.
Both offer REST APIs, both aim to be usable as an "API Server" via JavaScript request handlers. This is a lot more practical in ArangoDB than OrientDB.
Both are distributed under a permissive license.
Both are ACID and have transaction support, but in both the transactions are server-side operations - they're more like atomic batches of commands rather than the kinds of transactions you might be used to in a traditional RDBMS.
However, there are a lot of differences:
ArangoDB has no concept of "links", which are a very useful feature in OrientDB. They allow unidirectional relationships (just like a hyperlink on the web), without the overhead of edges.
ArangoDB is written in C++ (and JavaScript), whereas OrientDB is written in Java. Both have their advantages:
Being written in C++ means ArangoDB uses V8, the same high performance JavaScript engine that powers node.js and Google Chrome. Whereas being written in Java means OrientDB uses Nashorn, which is still fast but not the fastest. This means that ArangoDB can offer a greater level of compatibility with the node.js ecosystem compared to OrientDB.
Being written in Java means that OrientDB runs on more platforms, including e.g. Raspberry Pi. It also means that OrientDB can leverage a lot of other technologies written in Java, e.g. OrientDB has superb full text / geospatial search support via Lucene, which is not available to ArangoDB.
OrientDB uses a dialect of SQL as its query language, whereas ArangoDB uses its own custom language called AQL. In theory, AQL is better because it's designed explicitly for the problem. In practice, though, it feels quite similar to SQL but with different keywords, and it is yet another language to learn, while OrientDB's implementation feels a lot more comfortable if you're used to SQL. SQL is declarative whereas AQL is imperative - YMMV here.
ArangoDB is a "mostly-memory" database, it works best when most of your data fits in RAM. This may or may not be suitable for your needs. OrientDB doesn't have this restriction (but also loves RAM).
OrientDB is fully object oriented - it supports classes with properties and inheritance. This is exceptionally useful because it means that your database structure can map 1-1 to your application structure, with no need for ugly hacks like ActiveRecord. ArangoDB supports something fairly similar via models in Foxx, but it's more like an optional addon rather than a core part of how the database works.
ArangoDB offers a lot of flexibility via Foxx, but it has not been designed by people with strong server-side JS backgrounds and reinvents the wheel a lot of the time. Rather than leveraging frameworks like express for their request handling, they created their own clone of Sinatra, which of course makes it almost the same as express (express is also a Sinatra clone), but subtly different, and means that none of express's middleware or plugins can be reused. Similarly, they embed V8, but not libuv, which means they do not offer the same non blocking APIs as node.js and therefore users cannot be sure about whether a given npm module will work there. This means that non trivial applications cannot use ArangoDB as a replacement for the backend, which negates a lot of the potential usefulness of Foxx.
OrientDB supports first class property level and database level indices. You can query and insert into specific indexes directly for maximum efficiency. I've not seen support for this in ArangoDB.
OrientDB is the more established option, with many high profile users. ArangoDB is newer, less well known, but growing fast.
ArangoDB's documentation is excellent, and they offer official drivers for many different programming languages. OrientDB's documentation is not quite as good, and while there are drivers for most platforms, they're community powered and therefore not always kept up to date with bleeding edge OrientDB features.
If you're using Java (or a Java bridge), you can embed OrientDB directly within your application, as a library. This use case is not possible in ArangoDB.
OrientDB has the concept of users and roles, as well as Record Level Security. This may be a killer feature for you; it is for me. It also supports token based authentication, so it's possible to use OrientDB as your primary means of authorizing/authenticating users. OrientDB also has LDAP integration. In contrast, ArangoDB supports only a very simple auth option.
Both systems have their own advantages, so choosing between them comes down to your own situation:
If you're building a small application, and you're a web developer optimizing for developer productivity, it will probably be easier to get up and running quickly with ArangoDB.
If you're building a larger application, which could potentially store many gigabytes or terabytes of data, or have many thousands of concurrent users, or have "enterprise" use cases, or need fine grained security controls, OrientDB is the one for you.
If you're storing RDF or similarly structured linked data, choose OrientDB.
If you're using Java, just choose OrientDB.
Note: This is (my opinion of) the state of play today, things change quickly and I would not underestimate the ruthless efficiency of the awesome team behind ArangoDB, I just think that it's not quite there yet :)
Charles Pick (codemix.com)

Semantic Search Engine

I want to design a semantic search engine for the final year of my Master's degree. I have been doing a fair amount of reading, both casually on the web and in academic papers, so I am not a total noob in this field.
My aim is to build a semantic search engine that parses HTML content into its equivalent RDF triples and stores the triples in a triplestore, which the engine then queries using SPARQL to answer the user's query. I want to do something out of the box, unlike the other students, so I decided to build a semantic search engine.
Right now, I have a running search engine built on Solr that performs keyword search; what I want to add is semantic search. I know some open source tools for Web 3.0 but am not sure whether they are compatible with Solr.
So, can you please give me some help with building this?
Thanks.
Regards
It may sound harsh, but you will not be able to capture everything.
You need a lot of data. Of course, there is already a lot of data arranged in formats like OWL and RDF which you may use (e.g. WordNet, YAGO, GeoNames, etc.), but although these datasets are huge, they cover only very small portions of a possible universe of discourse.
Developing a good semantic search takes a lot of resources and brain power. Projects such as KompParse at the German Research Center for Artificial Intelligence, which focus only on a small part of human conversation (gossip or buying furniture), have been running for several years with several employees and are still just "ok".
Understanding semantics has already been implemented in different search engines; take Google, for example, or Wolfram Alpha. So this topic might not even be as much "out of the box" as you think.
So I will go with user723630 and strongly advise you to focus on a smaller topic. You will still achieve a lot, but you will not get frustrated.
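For a small-scope start, the triple-store idea from the question can be prototyped in a few lines before reaching for a real RDF store and SPARQL. The following is a minimal in-memory sketch in Python; the class name, the sample triples, and the `None`-as-wildcard convention are made up for illustration and are far simpler than SPARQL.

```python
class TripleStore:
    """A toy in-memory store of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        """Store one triple; duplicates are ignored because a set is used."""
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return all triples matching the pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            triple
            for triple in self.triples
            if all(want is None or want == have for want, have in zip(pattern, triple))
        )


# Hypothetical sample data
store = TripleStore()
store.add("Solr", "type", "SearchEngine")
store.add("Solr", "builtOn", "Lucene")
store.add("Lucene", "type", "Library")

engines = store.match(predicate="type", obj="SearchEngine")
about_solr = store.match(subject="Solr")
```

A prototype like this makes it easy to test the HTML-to-triples extraction step on a narrow domain first, and the stored triples can later be loaded into a proper triplestore.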

Which programming language is Google App Engine most likely to support next, and why?

Their roadmap says their next release will be in March 2009, and that they'll be adding a new 'runtime language'. I'm hoping it's either Java or PHP but I'm really not sure, and I would like to know which language is the most probable so I can plan accordingly for a project I intend to host on Google App Engine.
Any ideas?
I'd say Java, if only for the reason Android (or, at least, the SDK) is written in Java and they went to the trouble of writing their own interpreter/VM.
If not Java, then Ruby would be my guess. Not sure why, but it feels like a good fit.
I would say that you have to look at a few factors:
The language needs to:
be sandboxable
be controllable
be expandable
be different from python
appeal to people who want to write massively scalable applications
be easy to run on developer computers
run on Linux
Sandboxable
The language must be safe to run on Google servers. Portions of the language/VM/modules|libraries must be able to be disabled and/or replaced.
Controllable
Notice how Google uses languages that are not controlled by companies?
Python's BDFL GvR works for Google.
Dunno about JavaScript.
Java is open-sourced enough for their taste I suppose.
So the language evolution must allow Google's input at the very least.
Expandable
Google needs to be able to add stuff to the language, and that nearly implies an open-source language. I don't think they are interested in doing an internal fork of an existing language.
Different from Python
Python is mature, easy to learn, and powerful. The new language would have to be significantly different from Python; otherwise, why not just use Python? Maybe a very functional language?
Appeal to massive scalability
Execution time would not be necessarily critical, but the language must be able to support easy start and stop, easy provisioning to other servers, and appeal to the sort of people who are into writing massively scalable applications.
Developer computers
The language needs to be able to be easy to install, maintain, and develop for on Windows, Mac, and Linux. It has to be either fully manageable with text editors or already have rock solid tools for editing and managing on these platforms.
Linux
Google servers would run the programs, so these must be safely transferable to Google servers and runnable there, and must be controllable by the Google App Engine load-balancer, so they need to be unixy.
Brainstorming
I don't think it will be Java (too heavy, hard to modify VM), PHP (too leaky), Ruby (hard to modify VM), or C++ (can't be sandboxed, as far as I know). I don't think it would be JavaScript either, because it's hard to modularize, and it's not an easy language to learn. That rules out Lisp as well--the hard-to-learn part.
So something else.
Remember though that they want adoption of the tool, and they need a language that would be adoptable by a lot of people and a lot of businesses.
So I lean toward C# with Mono. I think that makes the most sense. I know it sounds scary, but lately the developers of the language have been looking at changing C# quite a bit to incorporate Python-like dynamic typing, that sort of thing.
Conclusion
So that's what I think. And if they can pull that off, they will be able to leapfrog the competition. Mono is under MIT X11 license (as of April 2008), and I guess Miguel de Icaza can be hired by Google in the future, along with key team members.
So my prediction is C#.
Languages used for production code inside Google are limited to C++, Java, Python, and JavaScript.
App Engine already runs Python, so what's next?
It's most likely JavaScript. I recall Steve Yegge working on a Rails equivalent for JavaScript. See Stevey's Blog Rants: Rhino on Rails.
Java is less likely, but possible. Java servlet containers tend to be heavy-weight.
C++ is possible (Native Client and Chrome are two examples of sandboxed C++ code), but unlikely at this point.
I would say Java too, so they can support Ruby with JRuby, compatible with Python with Jython, Groovy and so on.
My guess is C# just to stick it to Microsoft.
Yup, JavaScript.
Why?
First, it fits. While there are obvious architectural differences (notably the OOP system) between Python and JavaScript, they are more alike than different, so converting the GAE Python API to a JS API should not be a dramatic leap in design or implementation. In the end, the JS API will likely have much the same flavor as the Python API.
Second, safety. The JS runtime idiom is identical to the Python idiom in that effectively you're going to have JS processes running independently from each other for each request. That is, the classic Apache forking model.
As a hosting service, this model is extremely robust and much, much easier to control than something like Java. What you lose in efficiency compared to a threaded implementation, you gain by simply being Google with a gazillion machines. At Google's scale, administrative overhead trumps performance every day of the week. Simpler and more robust is better, and that's what the process model is.
Third, technology speed. JS is moving VERY quickly right now. Look at the large number of commercial enterprises writing JS interpreters/compilers/runtimes, as well as the advancements of the language itself. JS has rushed to the front with a vengeance.
Finally, popularity.
While not popular on the server side, JS is still likely the most deployed language in the world, and thereby the most accessible language in the world. Every hack web designer on the planet is becoming a JS programmer, whether they like it or not.
Now, I don't know how many web designers you've met, but most of the ones I have met are NOT programmers. So adopting JS is going to be a painful, cut-and-paste experience for them, but it's pretty much a requirement for the modern web. Taking that skill and applying it to some lightweight processing on the back end, in the SAME LANGUAGE, will be a boon to these people. Do not discount the power of familiarity in a normally scary environment (and despite the advances, computers are still "scary" to the vast majority of the population).
JS, it's not a toy any more, it's a sleeping giant. Really.
JRuby on Rails.
It already works with Python. There have been rumors about PHP, which is a logical choice considering its popularity.
I'm going to throw in my 2 cents on Java as well. They have a heavy number of tools already written in Java (GWT anyone? etc. etc.)
Though JavaScript would be the most intriguing.
I once heard that Google likes Python the most!

Resources