High performance site - database

What technologies should I use when designing for a large social website (with a lot of transactions, like twitter)? using open source solutions
- database
- webserver
- os

Twitter uses Ruby-on-rails and Scala
Facebook uses PHP
StackOverflow uses asp.net mvc
As you can see, it doesn't really matter what you choose; all of these sites have lots of traffic, but are based on very different technologies.

What matters most in a social networking sites is the backend, since most of the bottleneck will be from there. You might want to consider No-SQL databases.
Facebook and Twitter use Cassandra
LinkedIn uses Voldemort
There are a few others like:
Hypertable
MongoDB, used by Sourceforge.
CouchDB
As for the programming language, as others have said, it does not matter that much. But if you really can not decide, you might want to consider a non-blocking webserver like Tornado.

Doesn't matter what kind of scripting language you'll choose, as long as you'll heavily utilize memcached. Having the right caching hierarchy is a must.

At the end of the day, this is a matter of personal preference. Twitter uses Ruby on Rails. Wikipedia runs on PHP. Reddit uses a Python library called web.py, but intitially, it was written in Lisp. I would say pick the technologies you are most familiar with.

A good book on optimizing for high performance websites from the Yahoo engineers is High Performance Web Sites: Essential Knowledge for Front-End Engineers. It is nice and short and basically a bulleted guide on the steps to take to make websites faster by optimizing the less well explored front-end.

As Joel says
People all over the world are constantly building web applications using .NET, using Java, and using PHP all the time. None of them are failing because of the choice of technology.
Choose whichever of the "big 3" (.Net, Java or PHP) that you know best - these technologies are known to be scalable, the real question of whether or not your site will scale is how the site is structured and the quality of the code - using whichever framework you are most familiar with gives you the best chance of achieving that.

Any technologies that suite your taste, In your situation I think algorithms is more important.

Technologies, techniques ,
research what other scaled sites have used and done and what the problems they had were less than he successes, there are podcasts on iTunes, talks and interviews on Youtube
look at industry best practices and follow them to a degree
don't take peoples word for it, make sure you see the problem or the success as opposed to the pr glitz about it
avoid obvious things that do not scale vertically or horizontally, database connectivity, sessions - cookies and the like
look at nosql storage as an sql alternative less overhead but less functionality
take care when looking at the language/framework. frameworks come with lots of baggage you do not need, they speed you up initially and slow you down eventually, i.e. you spend more time hacking the framework than building the site, same with languages does it do what you want rather than be trendy, cool to programme in etc.

If you are building something like Facebook, then your choices are a little limited, Facebook made their own PHP Runtime, check HipHop For PHP

Related

In retrospect, best open source stack/tools to build facebook

I have been doing high performance scientific computing in c++ most of my life. I am trying to learn to developing AJAXy web applications. As an exercise, I would like to build something that has a subset of functionality of facebook (profiles, posts with comment threads, friend lists) + the ability to search any post/comment.
I have no experience developing these kind of apps, except minor amount of toying with Google Appengine with GWT+Java and little bit of python. What tools/stack would you suggest using for it? I understand that this a very vague question, but I'd like to get a few opinions and your thought process about how would you go about using it.
How does the choice change, if you want a fast prototype as fast as possible, vs if you are trying to build something that can scale and last a few cycles of feature requests.
To be more specific, I'm lost in questions like, should I consider Drupal, should I consider Lucene for search, Would GWT get me what I want in the UI or would python+django be faster to develop. Probably I should not over think and pick something. But some perspective from others would be nice.
If you have started out with Python, that might be the easiest to get going with, especially since you have some experience with Google's App Engine already. However, if you have spent most of your time working with C++ ... did you know that C++ has at least two different full-stack web frameworks?
CppCMS
Wt (WebToolkit)
Remember, it's what you develop fastest in that makes the difference in the long run. What will slow you down most of all is dealing with what you dislike. So, if long compile times kill you, then try Python, Ruby, PHP, or some other dynamic language. If having code that is less than perfectly optimized (and slower that it could be) is what bothers you most, use C++, C#, or Java instead.
One disadvantage with google app engine is that there is no CMS like drupal or Joomla for google app engine so you're going to have to write your own if you want some of that functionality. The advantages of google app engine however outweigh the disadvantage since you have easier development, easier deployment, won't have to fiddle with phpmyadmin or other ugly sql interfaces, with app engine you also leverage google's huge infrastructure and since it's cloud computing you only pay for what you use. If you want something you as a developer will be most happy about - then I recommend you choose Google App Engine.

I know the big picture but can't put it in place

I'm interested in web development and by that I mean the bigger projects like facebook or twitter. I know the basics of java, css, php and mysql. I know there is a lot more out there. I read about it. But I don't know what the purpose is and how to put in place.
Things like: Scribe, thrift, casandra, Unix/Linux, shell/perl/python scripting, PostgreSQL, MongoDB, non-relational NoSQL datastores, JVM, nginx
I want to know why they need it, how they use it and what te purpose is.
What I need is a book like technical background of facebook for dummies or so.
Are there any books or websites that explain this from scratch?
Thank you!
EDIT:
Thank you for your answers! You have been very helpful. I was in the assumption, experienced programmers know almost anything about the technology there's used today. But as I read, you can only know so much and I need to figure out which technology to use. I take on the encouragement to start building small. And will take on php and improve my skills from there.
Thanks again!
http://highscalability.com/
This is one of the best sites out there. There are several case studies describing what and why many websites use, and pointers to further references. I would also look at the Google Scalability Conference 2007 talks
http://www.google.com/search?q=Google+Scalability&hl=en&client=firefox-a&hs=YUg&rls=org.mozilla:en-US:official&prmd=v&source=univ&tbs=vid:1&tbo=u&ei=fl4OTPUkorIwueCQxQw&sa=X&oi=video_result_group&ct=title&resnum=4&ved=0CDIQqwQwAw
It's all about choosing the right tool for the job in my eyes. There is so much technology out there it's impossible to learn it all. Just choose the subset that will work for you.
The best place to start is by building small simple websites, and as you come accross problems that you need solved you research the tools needed to solve those problems.
If you attack all of the areas at once, it's going to be overwhelming and you will not get anywhere.
For a general overview on what each of the technologies does, Wikipedia gives a good overview on most technologies.
If you are interested in database content which it seems like you are, a good place to start is reading up on normalisation.
Scribe, thrift, casandra, Unix/Linux, shell/perl/python scripting, PostgreSQL, MongoDB, non-relational NoSQL datastores, JVM, nginx
Those I would search on Wikipedia for to get a quick overview. Facebook is written in PHP/MySQL. There are some books on the subject of creating social networking sites, and some books have gotten decent reviews on Amazon.com, however, I have not read any of them myself.
If I were you, I'd start with PHP/MySQL and sit down and write a simple social network. Break the project down into components and tasks and Google for each challenge you encounter such as sessions, database structure, security, friend structure, and processing POST and GET requests.
You'll learn a lot and you get the big picture. Once you see the big picture, you can take another look at different technologies that are available and then decide which component you could have developed better with other tools. I personally don't think that looking too much into the technology available is good for someone who is still in the beginning stages. Start doing, learn from it, and then your questions become much more specific and a lot of things will make more sense.
The problem you're having is you're looking at smaller, specialty products, and not at larger, more mature technologies. Wikipedia will actually give you a decent overview of most of the medium-and-large projects out there.
Cassandra, Hadoop, Mongo, and NoSQL are all lovely... but they're specialty tools. SQL is a general purpose solution that works for 99% of the sites on the net.
Unix/Linux isn't a specialty tool; you might want to try going to Ubuntu's website and installing Linux, and just using it day-to-day, the way you'd use Windows. When you need to figure out something new, like setting up a webserver, do it on the Linux box and a Windows box, and you'll eventually learn linux pretty darn well.
As far as scripting, O'Reilly makes a great line of books on Bash, Perl, and Python.
JVM is a Java Virtual Machine, which is a core of getting Java code to go. Sun's website has a great set of tutorials on learning Java.
It might be much, much easier to pick a project (or three) that you'd like to learn, and learn some of these by doing. I'd probably suggest learning some SQL before learning the newly established alternatives; that lets you learn the rest of the system, as SQL is pretty easy. Once you've got the rest of the thing solid, try swapping in a NoSQL solution at that point.
There are a lot of frameworks that do a lot of different things. You've named a lot of different things from a lot of different areas. The best way to think of these things is to group them by category. Here's an example:
Suppose you have a laptop and you want to host a website. You'll need the following at a minimum:
1) Web Server software. Two popular options are Microsoft's IIS and Apache Web Server.
That's really all you need. You can set up your www_root folder and load files into it. Assuming everything is configured properly, you can now load HTML pages into that folder and access them through your IP address. Every page you view in your web browser is in HTML format. CSS is a stylesheet language that defines how your HTML will be formatted. You can also start writing Javascript, as most modern browsers support the client-side scripting language.
Chances are you'll want the following as well:
2) Database software. Two popular options are Microsoft's SQL Server and MySQL
3) Server-side scripting. PHP is very popular, as is ASP. You'll need the runtime deployed on your server. Python, Ruby, Perl, etc all fall under this category.
4) Web Application Framework(s). This will provide you with libraries for your language of choice to help develop web applications and websites. CakePHP, Ruby on Rails, and the Google Web Toolkit are examples of web application frameworks.
Additionally, you may want to utilize:
5) Additional libraries. JQuery, for example, is quickly becoming a popular library for Javascript that handles a lot of common tasks for you. Instead of writing complex effects code and what-not yourself, just use the pre-written code in the JQuery library.
6) Data interchange technology. If you are passing a lot of information back and forth, you will likely want to encapsulate this data in a logical format. Ideally, this format would describe the data and allow your applications to easily read/process it following a standard. This is where XML and JSON come into play.
I can't recommend a good book for you to learn this stuff, but I feel that the collective replies to your question here should be more than enough to get you started.
Ultimately, what you need to do is determine what technologies you need, and then choose the right one for the job. Don't go building an application using Ruby on Rails just because it's what Twitter used, but rather choose it because it provides some advantage to you over the other options.

Which web solution should I use for my project?

I'm going to create a fairly large (from my point of view anyway) web project with a friend. We will create a site with roads and other road related info.
Our calculations is that we will have around 100k items in our database. Each item will contain some information like location, name etc. (about 30 thing each). We are counting on having a few hundred thousand unique visitors per month.
The 100k items and their locations (that will be searchable) will be the main part of the page but we will also have some articles, comments, news and later on some more social functions (accounts, forums, picture uploads etc.).
We were going to use Google AppEngine to develop our project since it is really scalable and free (at least for a while). But I'm actually starting to doubt that AppEngine is right for us. It seems to be for webbapps and not sites like ours.
Which system (language/framework etc.) would you guys recommend us to use? It doesn't really mater if we know the language since before (we like learning new stuff) but it would be good if it's something that is future proof.
I think that GAE can do the job. Google claims that Google App Engine is able to handle 5 million visitors for free and you will have to start paying only if you exceed their free quota.
It's also pretty easy to get started. If you don't have experience on administrating websites and choose a regular hosting service, you will have to worry about several things that you don't even imagine now.
My only concern would be with respect of the kind of data and queries you will have to do, since it does not have a relational database. Anyway, there is an open source project for GAE, called GeoModel that gives GAE the ability to do complex geo spacial queries, like proximity fetch. Have a look at their tutorial and the demo app.
About your impression that GAE was intended only for small web apps, there are a couple of CMS that run on it.
Good luck!
If once of your concerns is scalability, and you don't want to depend on expensive or commercial tools, I would recommend that you take a look at this tech stack:
Erlang - A programming language designed for concurrency and distribution.
Nitrogen - An Erlang web framework with a lot of cool stuff, like transparent AJAX.
NoSQL scalable databases, such as CouchDB or Riak - Save the the hassle of SQL code and are more scalable than plain MySQL. Both has direct native Erlang API.
To be honest, I don't know if this tool set is your cup of tea; These are not mainstream solutions. I just suggest these to everyone who ask about size-sensitive web applications.
All serious web frameworks will provide you with what you need. The real issues (for example scalability) might be tackled in a different way depending on what you use, but you wont be limited if you choose a well-known one. The choice of database system might be more important for that (sql vs nosql), even if both of those will do fine too.
It's all about
knowing how to use
enjoying to use
the tool(s) you've chosen.
In either case, name-dropping some suggestions:
Rails (Ruby)
Django (Python)
Nitrogen (Erlang)
ASP.NET MVC (C#)
And please note, if you really want to learn everything from the bottom, you'd be fine with any of these (or one of the other gazillion out there). But if you want to perform your best, choose one that supports a language you know well or uses techniques/tools you have experience of etc. Think twice about how you value this is fun and we learn a lot against we want to be productive and do a really good job.

Pointers towards developing a quick and dirty business app

Some people have approached me lately about creating a business app for them (I'm a computer tech student specializing in programming, with a bit of experience in systems and driver programming) and it does sound simple, but I don't really have much of an idea how or where to start.
It should be a small-ish app with a database backend. Basically keeping track of invoices, clients, products and the attached data.
Are there any APIs that would make creating such an application much faster and easier? Platform isn't really an issue. I have a Mac, a Windows PC, and I am somewhat well-versed in linux in general, and the client will move to a platform of my choice.
I know very little MySQL, I know Objective C, C and a few others, but building a database product this way seems like a very complicated endeavour considering that a large amount of the code I'll be writing has probably been written before and by better programmers than I.
EDIT : If possible, I would definitely like not having to play around with web frameworks. This is not to say I don't want to see them, it's just that I'm not used at all to the web development model.
I would suggest that you look into Ruby on Rails for soemthing like this. It will take care of a lot of the low level details of database access for you and because it is built around the Model-View-Controller paradigm, it will take away some of the architectural decision from you and make you focus on getting the app done. Using Ruby on Rails, I've built a couple of sites of smallish scale that sound like what you have done in no time at all.
For quick and dirty, I suggest Ruby on Rails (if you fancy a bit of Ruby), or Grails (if you fancy a bit of Java/Groovy, and is essentially the Java platform equivalent of RoR).

Web Scraping with Google App Engine

I am trying to scrape some website and republish the data as a RSS feed. How hard is this to setup with Google App Engine? Disadvantages and Advantages using GAE. Any recommendations and guidelines greatly appreciated!
Google AppEngine offers much more functionality (and complexity) than you will need if truly all you will want to do is republish some structured data as RSS.
Personally, I would use something like Yahoo pipes for a task like this.
That being said... if you want/need to get your feet wet with GAE, go for it!
Working with Google App Engine is pretty straight forward. I would recommend going through the Getting Started guide. It's short and simple and touches on essential GAE topics. There are more pros and cons than I will list here.
Pros:
In general, App Engine is designed for high traffic web applications that need to scale. Furthermore, it is designed from a programmer's perspective. Much of the scalability issues (database optimization, server administration, etc) are dealt with by Google. Having said that, I find it to be a nice platform. It is still being actively developed by Google engineers, and scheduling of tasks (a feature that has been long requested) is in the current road map.
Cons:
Perhaps the biggest downside right now is again the lack of official scheduling support and the quota limits currently set for free accounts. However you can't complain much if its free. Currently it only supports Python as a programming interface (although a new language [Java I predict] is coming soon). Furthermore, Python 2.6 (and 3.0 for that matter) are not yet supported. In addition, Django 1.0 is not officially supported in App Engine (although you can package Django 1.0 with your application).
Harder than it would be in most other technologies.
GAE can sort of do scheduled batch stuff like this now, but it's really not intended for that type of thing. Pick pretty much any other language and platform for this particular task, and you'll make your life a lot easier.
I think BeautifulSoup could run on GAE, so all your scraping needs are handled :D
Also, GAE has a geturl thingy. The only problem I think you might have is not having enough time to get the data (30 secs limitation).
I am working on a same project and I've decided that it's easier to prepare the data on another server and push them to GAE.
You might also want to look into Yahoo! Query Language (YQL)

Resources