serve huge static files with horizontal scale - database

I am hoping to find a distributed filesystem that is easy to configure, easy to use, and easy to learn.
Can anyone help with this?

Since the question gives no details about the intended usage, going by what I can infer from it, you should try MogileFS (easy to set up and maintain). It comes from the makers of memcached and is used to serve images and similar files.
See the link below for a better explanation.
http://code.google.com/p/mogilefs/
Lustre, Gluster or MogileFS?? for video storage, encoding and streaming

I suggest you consider using Apache Hadoop. It has many services and technologies that work with it (Cassandra, HBase, etc.). A quote from the official site:
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Basically, Hadoop is a large framework. You can use Karmasphere Studio with Hadoop. I suppose that with its help you can learn Hadoop much more quickly and get deeper into distributed systems.
About HDFS: read the article "GridGain and Hadoop". Short quote from there:
today HDFS is probably the most economical way to keep very large static data set of TB and PB scale in distributed file system for a long term storage

Check out Amazon Simple Storage Service (Amazon S3).
It has (practically) unlimited storage and very high availability (though not literally 100% uptime), and it ticks most of the boxes needed for most situations. It isn't free, but it is very cheap considering what you get.

Related

Kubernetes and Cloud Databases

Could someone explain the benefits/issues with hosting a database in Kubernetes via a persistent volume claim combined with a storage volume over using an actual cloud database resource?
It's essentially a trade-off: convenience vs control. Take a concrete example: let's say you pay Amazon money to use Athena, which is really just a nicely packaged version of Facebook Presto that AWS kindly operates for you in exchange for $$$. You could run Presto on EKS yourself, but why would you?
Now, let's say you want or need to use Apache Drill or Apache Impala. Amazon doesn't offer them. Nor do any of the other big public cloud providers at the time of writing, as far as I know.
Another thought: what if you want to migrate off of AWS? Your data has gravity as well.
Could someone explain the benefits/issues with hosting a database in Kubernetes ... over using an actual cloud database resource?
As the previous excellent answer noted:
It's essentially a trade-off: convenience vs control
In addition to the previous example (Athena), take a look at RDS as well and see what you would need to handle yourself (and, as already said, why would you?):
Automatic backups
Multizone deployments
Snapshots
Engine upgrades
Read replicas
and other bells and whistles that come with a managed service as opposed to a self-hosted, self-managed one.
But there is more to it than just convenience/control, which is what this post is trying to shed light on:
Kubernetes adds another layer of abstraction (pods, services...), and depending on how you handle storage (persistent volumes), you have two additional considerations:
Access speed (depending on your use case, this can be negligible or a showstopper).
The storage you have at hand might not be optimized for relational-database-style I/O (or might restrict your ability to schedule pods efficiently). These are the very same reasons you are advised not to run a database on NFS, for example.
Several recent conference talks on Kubernetes have argued that databases are a big no-no on Kubernetes. This is highly opinionated (we do run average-load MySQL and PostgreSQL databases in k8s), but heavy load and fast I/O are somewhat challenging to get right on k8s, whereas in a managed cloud solution somebody has already fine-tuned everything for you.
In conclusion:
It is all about convenience, controls and capabilities.

Hadoop vs Teradata: what is the difference?

I've worked with Teradata. I've never touched Hadoop, but since yesterday I have been doing some research on it. From the descriptions of both, they seem quite interchangeable, but some papers say they serve different purposes. Everything I have found is vague, and I am confused.
Does anybody have experience with both of them? What are the significant differences between them?
Simple example: I want to build an ETL process that will transform billions of rows of raw data and organize them into a DWH, and then run some resource-expensive analysis on them. Why use TD? Why Hadoop? Or why not?
I think the article titled 'MapReduce and Parallel DBMSs: Friends or Foes' does quite a good job of describing the situations where each technology works best. In a nutshell, Hadoop is excellent for storing unstructured data and running parallel transformations to 'sanitize' incoming data, whereas DBMSs excel at executing complex queries quickly.
Hadoop, Hadoop with Extensions, RDBMS Feature/Property Comparison
I am not an expert in this area, but in the coursera.com course, Introduction to Data Science, there is a lecture titled: Comparing MapReduce and Databases as well as a lecture on Parallel databases within the map reduce section of the course.
Here is a summary from these lectures on the comparison of MapReduce vs. RDBMS (not necessarily parallel RDBMS).
One point to remember is that the comparison is different if you include extensions to Hadoop like Pig, Hive, etc. In parentheses I list the MapReduce extensions that add each piece of functionality or property.
Some functionality/properties that RDBMS have but not native MapReduce:
Declarative query languages (Pig, Hive)
Schemas (Hive, Pig, DryadLINQ, Hadapt)
Logical Data Independence
Indexing (HBase)
Algebraic Optimization (Pig, Dryad, Hive)
Caching/Materialized Views
ACID/Transactions
Some functionality/properties that MapReduce has (relative to a regular RDBMS, not necessarily a parallel RDBMS):
High Scalability
Fault-tolerance
“One-person deployment”
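To make the contrast with a declarative query language concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases in Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample records are made up for illustration. In an RDBMS the same result would come from a single `SELECT status, COUNT(*) ... GROUP BY status` statement.

```python
from collections import defaultdict

# Hypothetical raw log records: (url, http_status)
RECORDS = [
    ("/a", 200), ("/b", 404), ("/a", 200), ("/c", 500), ("/b", 200),
]

def map_phase(records):
    """Emit one (key, 1) pair per record, keyed by status code."""
    for _url, status in records:
        yield status, 1

def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(RECORDS)))
print(counts)
```

In a real cluster the map and reduce functions would run on many machines and the shuffle would move data over the network; the sketch only shows the shape of the programming model.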
I've been asked this question several times. The answer I usually give is a car analogy (which is pretty silly because I'm not a car person, but it seems to work).
Teradata is the car/dbms for the masses - it is reliable, mature, works well and is there when you need it. It is difficult (compared to Hadoop) to customise and add functionality to the base product.
Hadoop is the car/dbms for the enthusiast - it isn't as reliable or mature, it works well so long as you attend to it. It is easy (compared to Teradata) to customise and add functionality to the base product.
Put another way, Teradata is the reliable workhorse where you put your mission critical process (operational reporting, enterprise reporting, decision support etc).
Hadoop is the place where you can do a lot of this stuff, but don't be surprised if you come in one morning and find that your regulatory reports can't be produced because someone applied a patch or you've suddenly got a "too many small files" problem.
To loop back to the analogy, if you don't want to be too techy and the manufacturer's product (DBMS and/or car) works for you out of the box, Teradata is a good option.
On the other hand, if you like to tinker under the hood, swap out the carburettor (or whatever), adjust the gear ratios, tweak the fuel-air mixture depending on whether you are country or city driving, bolt on a turbocharger, and/or your family complains about how long you spend in the garage on weekends, Hadoop is the place for you.
IMHO, most, if not all, organisations need both.
I hope this helps :-)
To begin with, vanilla Apache Hadoop is 100% open source. But if you need commercial support along with consultancy, there are companies like Cloudera, MapR, Hortonworks, etc.
Hadoop is backed by a growing community that fixes bugs and makes improvements on a consistent basis. Hadoop's storage model, HDFS, is based on Google's GFS architecture, which is proven to handle large quantities of data. Furthermore, Hadoop's analysis model, MapReduce, is based on Google's MapReduce model.
Hadoop is used by tech giants like Facebook, Yahoo, Twitter, eBay, etc. to store and analyse their high volumes of data, in real time as well as passively.
As for your question about ETL systems, read these slides.
OK, now why Hadoop?
Open Source
Proven Storage and Analysis model for Large Quantities of data
Minimal hardware requirements to set up and run.
OK, now why TD?
Commercial Support

Riak vs GlusterFS

I need to set up a data store that can hold petabytes of files (mostly small JSON, image and CSV files, but some of them can be ~100 MB binary files).
I am looking for distributed data storage that is masterless and has no single point of failure.
I found Riak and GlusterFS.
Has anyone here used both of them before?
I know that their interfaces (DB/Map) are very different.
But it seems to me that they both use hashing and similar distributed techniques.
Will they have similar performance, consistency and availability?
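The hashing idea the question refers to can be sketched as a generic consistent-hash ring. This is a toy illustration in Python, not the actual placement code of Riak or GlusterFS; the class name `HashRing`, the node names and the choice of 64 virtual nodes per server are all assumptions made for the example.

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=64):
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        # Stable hash so placement is identical across processes.
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

    def node_for(self, key):
        """Return the first node clockwise from the key's position."""
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"file-{i}.json" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

# Adding a node should only move the keys that now belong to the new node.
bigger = HashRing(["node-a", "node-b", "node-c", "node-d"])
moved = [k for k in keys if bigger.node_for(k) != before[k]]
print(len(moved), "of", len(keys), "keys moved")
```

Because there is no central lookup table, any client can compute where a key lives, which is what makes a masterless design possible. Performance, consistency and availability depend on much more than the hashing scheme, though, so a shared placement technique does not imply similar behaviour.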
We are running a 17 node (24GB RAM, 2T disk) Riak cluster with a Bitcask backend, storing around 1 billion 3k objects. This setup is performant but very resource intensive. We are considering moving away from Riak to GlusterFS as performance is not that important for us. Perhaps using LevelDB as a backend would also mitigate our worries.
At the moment the self-healing properties of Riak seem stronger and the configuration seems a tad easier. In your case I'd be more comfortable storing 100 MB files on GlusterFS.
Storing larger files like the 100MB files you mention would not be the right choice for plain OSS Riak.
What you really should use in that case is the newly announced RiakCS (http://basho.com/products/riakcs/) from Basho instead.
The choice depends mostly on your requirements. Generally I'd recommend Riak if you do not actually need a real filesystem (with mount points, ACL management and so on) and are just going to use or serve files programmatically, and GlusterFS otherwise.

Social Networking backend architecture

Ideally, where would an application like Facebook store its "Friends" data?
In a database table? in an xml file?
From Facebook's engineering page:
"Already, we are the second most-trafficked PHP site in the world (Yahoo is #1), and one of the largest MySQL installations anywhere, running thousands of databases."
and
"We've built a lightweight but powerful multi-language RPC framework that allows us to seamlessly and easily tie together subsystems written in any language, running on any platform. Facebook is built in PHP, C++, Perl, Python, Erlang, Java, and even a little bit of ML—and it all works together.
* We are the largest user in the world of memcached, an open-source caching system. Originally developed by LiveJournal, we've since made so many scalability improvements and performance upgrades that we will be the primary contributor of features in the next major release.
* We've created a custom-built search engine serving millions of queries a day, completely distributed and entirely in-memory, with real-time updates."
Relational databases?
Check out this blog: http://highscalability.com/. It has many real-world examples of system architectures to learn from.
"Friends" data is well-described in a graph database. Neo4j is an example, though I know it's not the way Facebook stores this information.
Facebook uses a number of database technologies that may be involved:
a patched version of MySQL
Cassandra
Hadoop
... others
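The graph model mentioned above can be sketched with plain adjacency sets. This is a minimal in-memory illustration in Python, not how Facebook or Neo4j store the data; the class name `FriendGraph` and the sample users are hypothetical.

```python
from collections import defaultdict

class FriendGraph:
    """Undirected friendship graph stored as adjacency sets."""

    def __init__(self):
        self._adj = defaultdict(set)

    def add_friendship(self, a, b):
        # Friendship is symmetric, so store the edge in both directions.
        self._adj[a].add(b)
        self._adj[b].add(a)

    def friends(self, user):
        return set(self._adj.get(user, ()))

    def mutual_friends(self, a, b):
        return self.friends(a) & self.friends(b)

    def friends_of_friends(self, user):
        """People two hops away who are not already friends with `user`."""
        direct = self.friends(user)
        result = set()
        for friend in direct:
            result |= self.friends(friend)
        return result - direct - {user}

g = FriendGraph()
for a, b in [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "dan")]:
    g.add_friendship(a, b)

print(g.mutual_friends("ann", "bob"))
print(g.friends_of_friends("ann"))
```

In a relational database the same structure is usually a two-column table of user ID pairs, with an index on each column; queries like "friends of friends" then become self-joins, which is where a graph-oriented store can be more natural.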
Most probably it uses some other mechanism. For example, a search engine does not keep its index in a database or an XML file. For maximum performance, such systems generally keep a tree (a binary search tree or something more complicated) and store it on disk in a performance-efficient manner. So I would guess a similar mechanism is used here.
Certainly not in an XML file.
Yes, in a database, in one or several tables. And in the specific case of Facebook, on several servers.

Which embedded database capable of 100 million records has an efficient C or C++ API

I'm looking for a cross-platform database engine that can handle databases of up to hundreds of millions of records without severe degradation in query performance. It needs to have a C or C++ API which will allow easy, fast construction of records and parsing of returned data.
Highly discouraged are products where data has to be translated to and from strings just to get it into the database. The technical users storing things like IP addresses don't want or need this overhead. This is a very important criterion, so if you're going to refer to products, please be explicit about how they offer such a direct API. Not wishing to be rude, but I can use Google - please assume I've found most mainstream products and I'm asking because it's often hard to work out just what direct API they offer, rather than just a C wrapper around SQL.
It does not need to be an RDBMS - a simple ISAM record-oriented approach would be sufficient.
Whilst the primary need is for a single-user database, expansion to some kind of shared file or server operations is likely for future use.
Access to source code, either open source or via licensing, is highly desirable if the database comes from a small company. It must not be GPL or LGPL.
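The kind of record the question describes, binary fields stored without any string conversion, can be sketched as a fixed-width layout. The example below uses Python's `struct` module for brevity; the same layout corresponds to a packed C struct. The field choice (IPv4 address, port, hit count) and the names `RECORD`, `pack_record` and `unpack_record` are assumptions made for illustration, not part of any particular database's API.

```python
import ipaddress
import struct

# Hypothetical fixed-width record: IPv4 address, port, hit count.
# "!" = network byte order, no padding; I = uint32, H = uint16, Q = uint64.
RECORD = struct.Struct("!IHQ")  # 4 + 2 + 8 = 14 bytes

def pack_record(ip, port, hits):
    """Encode one record as 14 raw bytes."""
    return RECORD.pack(int(ipaddress.IPv4Address(ip)), port, hits)

def unpack_record(raw):
    """Decode 14 raw bytes back into (ip, port, hits)."""
    ip_int, port, hits = RECORD.unpack(raw)
    return str(ipaddress.IPv4Address(ip_int)), port, hits

raw = pack_record("192.168.1.10", 8080, 123456789)
print(len(raw), raw.hex())
print(unpack_record(raw))
```

A key/value or ISAM-style engine that accepts arbitrary byte strings as keys and values can store such a record directly, and big-endian keys sort in numeric order under a plain byte-wise comparison.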
you might consider C-Tree by FairCom - tell 'em I sent you ;-)
I'm the author of hamsterdb.
Tokyo Cabinet and BerkeleyDB should work fine. hamsterdb definitely will work. It has a plain C API, is open source, platform independent and very fast, and has been tested with databases of up to several hundred GB and hundreds of millions of items.
If you are willing to evaluate it and need support, drop me a mail (contact form on hamsterdb.com) - I will help as best I can!
bye
Christoph
You didn't mention what platform you are on, but if Windows only is OK, take a look at the Extensible Storage Engine (previously known as Jet Blue), the embedded ISAM table engine included in Windows 2000 and later. It's used for Active Directory, Exchange, and other internal components, optimized for a small number of large tables.
It has a C interface and supports binary data types natively. It supports indexes, transactions and uses a log to ensure atomicity and durability. There is no query language; you have to work with the tables and indexes directly yourself.
ESE doesn't like to open files over a network, and doesn't support sharing a database through file sharing. You're going to be hard pressed to find any database engine that supports sharing through file sharing. The Access Jet database engine (AKA Jet Red, totally separate code base) is the only one I know of, and it's notorious for corrupting files over the network, especially if they're large (>100 MB).
Whatever engine you use, you'll most likely have to implement the shared usage functions yourself in your own network server process or use a discrete database engine.
For anyone finding this page a few years later, I'm now using LevelDB with some scaffolding on top to add the multiple indexing necessary. In particular, it's a nice fit for embedded databases on iOS. I ended up writing a book about it! (Getting Started with LevelDB, from Packt in late 2013).
One option could be Firebird. It offers both a server based product, as well as an embedded product.
It is also open source and there are a large number of providers for all types of languages.
I believe what you are looking for is BerkeleyDB:
http://www.oracle.com/technology/products/berkeley-db/db/index.html
Never mind that it's Oracle, the license is free, and it's open-source -- the only catch is that if you redistribute your software that uses BerkeleyDB, you must make your source available as well -- or buy a license.
It does not provide SQL support, but rather direct lookups (via b-tree or hash-table structure, whichever makes more sense for your needs). It's extremely reliable, fast, ACID, has built-in replication support, and so on.
Here is a small quote from the page I refer to above, that lists a few features:
Data Storage
Berkeley DB stores data quickly and easily without the overhead found in other databases. Berkeley DB is a C library that runs in the same process as your application, avoiding the interprocess communication delays of using a remote database server. Shared caches keep the most active data in memory, avoiding costly disk access.
Local, in-process data storage
Schema-neutral, application native data format
Indexed and sequential retrieval (Btree, Queue, Recno, Hash)
Multiple processes per application and multiple threads per process
Fine grained and configurable locking for highly concurrent systems
Multi-version concurrency control (MVCC)
Support for secondary indexes
In-memory, on disk or both
Online Btree compaction
Online Btree disk space reclamation
Online abandoned lock removal
On disk data encryption (AES)
Records up to 4GB and tables up to 256TB
Update: Just ran across this project and thought of the question you posted:
http://tokyocabinet.sourceforge.net/index.html . It is under LGPL, so not compatible with your restrictions, but an interesting project to check out, nonetheless.
SQLite would meet those criteria, except for the eventual shared file scenario in the future (and actually it could probably do that too if the network file system implements file locks correctly).
Many good solutions (such as SQLite) have been mentioned. Let me add two, since you don't require SQL:
HamsterDB: fast, simple to use, can store arbitrary binary data. No provision for shared databases.
The GLib Hash Table module seems quite interesting too and is very common, so you won't risk running into a dead end. On the other hand, I'm not sure there is an easy way to store the database on disk; it's mostly for in-memory use.
I've tested both on multi-million records projects.
As you are familiar with Fairtree, then you are probably also familiar with Raima RDM.
It went open source a few years ago, then dbstar claimed that they had somehow acquired the copyright. This seems debatable though. From reading the original Raima license, this does not seem possible. Of course it is possible to stay with the original code release. It is rather rare, but I have a copy archived away.
SQLite tends to be the first option. It doesn't store data as strings but I think you have to build a SQL command to do the insertion and that command will have some string building.
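On the string-building concern: with prepared statements the SQL text is a constant and the values are bound separately, so binary data does not have to be formatted into the command. In the C API this is done with `sqlite3_prepare_v2` and the `sqlite3_bind_*` functions such as `sqlite3_bind_blob`. The sketch below shows the same idea with Python's built-in `sqlite3` module; the table name `hosts` and its columns are made up for the example.

```python
import sqlite3
import struct

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hosts (ip BLOB PRIMARY KEY, hits INTEGER)")

# The SQL text is a constant; values are bound separately, so the
# binary key is never converted to or from a string.
INSERT = "INSERT INTO hosts (ip, hits) VALUES (?, ?)"
rows = [(struct.pack("!I", 0xC0A80100 + i), i * 10) for i in range(1, 4)]
with conn:  # one transaction for the whole batch
    conn.executemany(INSERT, rows)

key = struct.pack("!I", 0xC0A80102)  # 192.168.1.2 as 4 raw bytes
hits = conn.execute("SELECT hits FROM hosts WHERE ip = ?", (key,)).fetchone()[0]
print(hits)
```

Wrapping many inserts in one transaction, as above, is usually the biggest factor in SQLite insert speed, so it is worth measuring before ruling SQLite out.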
BerkeleyDB is a well-engineered product if you don't need a relational DB. I have no idea what Oracle charges for it or whether you would need a license for your application.
Personally, I would reconsider why you have some of your requirements. Have you done testing to verify the requirement that you need to do direct insertion into the database? It seems like you could take a couple of hours to write a wrapper that converts from whatever API you want to SQL and then see whether SQLite, MySQL, etc. meet your speed requirements.
There used to be a product called Btrieve, but I'm not sure if source code was included. I think it has been discontinued. The only database engine I know of with an ISAM orientation is c-tree.

Resources