I am studying various distributed file systems.
Does the IBM General Parallel File System (GPFS) support Map/Reduce jobs on its own, without using 3rd-party software (like Hadoop Map/Reduce)?
Thanks!
In 2009, GPFS was extended to work seamlessly with Hadoop as the GPFS Shared Nothing Cluster architecture, which is now available under the name GPFS File Placement Optimizer (FPO). FPO allows complete control over data placement for all replicas, if the application so desires. Of course, you can easily configure it to match the HDFS allocation scheme.
Check out details at http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=%2Fcom.ibm.cluster.gpfs.v3r5.gpfs200.doc%2Fbl1adv_fposettings.htm
GPFS was developed years, nearly decades, before Map/Reduce was invented as a distributed-computing paradigm. GPFS by itself has no Map/Reduce capability. It is mainly aimed at HPC, where the storage nodes are distinct from the compute nodes.
Therefore Map/Reduce can be done with 3rd-party software (by mounting GPFS on all Hadoop nodes), but it would not be very effective because all the data is far away: no data locality can be exploited, caches are more or less useless, and so on.
More specifically, what use cases does Hazelcast Jet solve that Flink does not solve (equally well), and vice versa?
NOTE: I belong to Hazelcast Jet's core engineering team.
I'd say the main advantage of Hazelcast Jet isn't in offering a brand-new computing model, but in bringing the same level of convenience that Hazelcast is known for to the realm of DAG-based distributed computing.
If you currently have a Java application running in a cluster, adding Jet will be a snap: add the Maven dependency and write one line of code to start a Jet instance on the local member. The instances will self-discover to form their own cluster, and you can now submit your job to it.
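A minimal sketch of that embedded setup (the class name is made up; the dependency is the hazelcast-jet Maven artifact):

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;

public class EmbeddedJetMember {
    public static void main(String[] args) {
        // The one line that starts a Jet member inside this JVM; members
        // on the same network discover each other and form a cluster.
        JetInstance jet = Jet.newJetInstance();
    }
}
```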
If you want a dedicated distributed computing cluster, you can download the distribution ZIP to the cluster machines. Jet has native support for the most popular cloud environments, allowing the nodes you start to self-discover. You can then connect to the cluster using a Jet client.
Needless to say, Jet makes it very convenient to use a Hazelcast IMap or IList as a data source. A Jet cluster can host Hazelcast structures directly; then you benefit from data locality and get the data with no network traffic. On the other hand, the choice of data source is completely unconstrained, and there is a public API dedicated to implementing fast, arbitrarily partitioned, custom data sources.
Jet solves the concerns of infinite stream processing, such as aggregating over time-based windows, dealing with reordered events, and staying resilient to changes in the cluster topology (e.g., failure of individual Jet nodes), while maintaining the exactly-once processing guarantee.
Jet's main programming paradigm is the Pipeline API, which is quite similar to the java.util.stream API but adapted to the specifics of distributed computing (lambda serialization and other concerns).
The Pipeline API builds upon a lower-level DAG-based model that is also exposed as a public API.
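For a feel of the Pipeline API, here is a hedged sketch that reads from a co-located IMap and writes to an IList. The structure names are made up, and drawFrom/drainTo are the names from the early Pipeline API (later releases renamed them to readFrom/writeTo):

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class WordLengths {
    public static void main(String[] args) {
        JetInstance jet = Jet.newJetInstance();

        Pipeline p = Pipeline.create();
        p.drawFrom(Sources.<String, String>map("inputMap"))   // co-located IMap, read with data locality
         .map(entry -> entry.getValue().length())             // lambdas must be serializable; Jet's function types handle this
         .drainTo(Sinks.list("lengths"));                     // results land in a Hazelcast IList

        jet.newJob(p).join();
    }
}
```

The fluent style is deliberately close to java.util.stream, but the job definition is submitted to the whole cluster and the map's partitions are processed in parallel on the members that own them.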
In my opinion, Flink seems to offer some very useful streaming features which are not yet offered by Hazelcast Jet:
Different, flexible window operators, which can also handle out-of-order and late items.
Fault tolerance on the cluster and delivery guarantees.
Besides this, Flink also seems to be more stable and better known at the moment.
For example, you can use it as a runtime for Apache Beam and then migrate easily between Google Dataflow in the cloud and your own deployment.
So I would currently use Flink.
Best
I am currently searching for a good distributed file system.
It should:
be open-source
be horizontally scalable (replication and sharding)
have no single point of failure
have a relatively small footprint
Here are the four most promising candidates in my opinion:
GridFS (based on MongoDB)
GlusterFS
Ceph
HekaFS
The filesystem will be used mainly for media files (images and audio). There are very small as well as medium-sized files (1 KB - 10 MB). The number of files will be around several million.
Are there any benchmarks regarding performance, CPU load, memory consumption and scalability? What are your experiences using these or other distributed filesystems?
I'm not sure your list is quite correct. It depends on what you mean by a file system.
If you mean a file system that is mountable in an operating system and usable by any application that reads and writes files using POSIX calls, then GridFS doesn't really qualify. It is just how MongoDB stores BSON-formatted objects. It is an Object system rather than a File system.
There is a project to make GridFS mountable, but it is a little weird because GridFS doesn't have concepts such as hierarchical directories, although paths are allowed. Also, I'm not sure how well gridfs-fuse would handle distributed writes.
GlusterFS and Ceph are comparable: both are distributed, replicable, mountable file systems. You can read a comparison between the two here (and a follow-up update of the comparison), although keep in mind that the benchmarks were done by someone who is a little biased. You can also watch this debate on the topic.
As for HekaFS, it is GlusterFS that is set up for cloud computing, adding encryption and multitenancy as well as an administrative UI.
After working with Ceph for 11 months, I came to the conclusion that it utterly sucks, so I suggest avoiding it. I tried XtreemFS, RozoFS and QuantcastFS but found them not good enough either.
I wholeheartedly recommend LizardFS, a fork of the now-proprietary MooseFS. LizardFS features data integrity, monitoring and superior performance with very few dependencies.
2019 update: the situation has changed and LizardFS is not actively maintained any more.
MooseFS is stronger than ever and free from most LizardFS bugs. MooseFS is well maintained and it is faster than LizardFS.
RozoFS has matured and may be worth a try.
GfarmFS has its niche, but today I would choose MooseFS for most applications.
OrangeFS, anyone?
I am looking for a HPC DFS and found this discussion here:
http://forums.gentoo.org/viewtopic-t-901744-start-0.html
Lots of good data and comparisons :)
After some talk the OP decided on OrangeFS, quoting:
"OrangeFS. It does not support quotas nor file locks (though all i/o operations are atomic and this
way consistency is kept without locks). But it works, and works well and stable. Furthermore this is
not a general file storage oriented system, but HPC dedicated one, targeted on parallel I/O including
ROMIO support. All test were done for stripe data distribution.
a) No quotas — to hell quotas. I gave up on them anyway, even glusterfs supports not common
uid/gid based quotas, but directory size limitations, more like LVM works.
b) Multiple active metadata servers are supported and stable. Compared to dedicated metadata
storage (single node) this gives +50% performance on small files and no significant difference on
large ones.
c) Excellent performance on large data chunks (dd bs=1M). It is limited by a sum of local hard drive
(do not forget each node participates as a data server as well) speed and available network bandwidth.
CPU consumption on such load is decent and is about 50% of single core on a client node and about
10% percents on each other data server nodes.
d) Fair performance on large sets of small files. For the test I untared linux kernel 3.1. It took 5 minutes
over OrangeFS (with tuned parameters) and almost 2 minutes over NFSv4 (tuned as well) for comparison.
CPU load is about 50% of single core (of course, it is actually distributed between cores) on the client and
about several percents on each node.
e) Support of ROMIO MPI I/O API. This is a sweet yummy for MPI aware applications, which allows to use
PVFS2/OrangeFS parallel input-output features directly from applications.
f) No support for special files (sockets, fifo, block devices). Thus can't be safely used as /home and I use
NFSv4 for that task providing users quota-restricted small home space. Though most distributed
filesystems don't support special files anyway. "
I do not know about the other systems you posted, but I have made a comparison of 3 PHP CMSes/frameworks on local storage vs GlusterFS to see if it does better in real-world tests than in raw benchmarks. Sadly not.
http://blog.lavoie.sl/2013/12/glusterfs-performance-on-different-frameworks.html
I've touched Teradata. I've never touched Hadoop, but since yesterday I have been doing some research on it. From the descriptions of both, they seem quite interchangeable, but some papers write that they serve different purposes. Still, everything I found is vague and I am confused.
Does anybody have experience with both of them? What is the serious difference between them?
Simple example: I want to build an ETL pipeline that will transform billions of rows of raw data and organize them into a DWH, and then run some resource-expensive analysis on them. Why use TD? Why Hadoop? Or why not?
I think the article titled 'MapReduce and Parallel DBMSs: Friends or Foes' does quite a good job of describing the situations where each technology works best. In a nutshell, Hadoop is excellent for storing unstructured data and running parallel transformations to 'sanitize' incoming data, whereas DBMSs excel at executing complex queries quickly.
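To make the "parallel transformation" point concrete, here is a hedged sketch of a map-only Hadoop job that sanitizes raw rows before they are loaded anywhere. The tab delimiter and the 12-field record width are invented for illustration:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Drops malformed rows and trims fields; runs in parallel over all input splits. */
public class SanitizeMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private static final int EXPECTED_FIELDS = 12;  // hypothetical record width

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t", -1);
        if (fields.length != EXPECTED_FIELDS) {
            return;  // discard malformed input instead of loading it into the DWH
        }
        StringBuilder cleaned = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) cleaned.append('\t');
            cleaned.append(fields[i].trim());
        }
        context.write(new Text(cleaned.toString()), NullWritable.get());
    }
}
```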
Hadoop, Hadoop with Extensions, RDBMS Feature/Property Comparison
I am not an expert in this area, but in the coursera.com course Introduction to Data Science there is a lecture titled 'Comparing MapReduce and Databases', as well as a lecture on parallel databases, within the MapReduce section of the course.
Here is a summary from these lectures on the comparison of MapReduce vs. RDBMS (not necessarily parallel RDBMS).
One point to remember is that the comparison is different if you include extensions to Hadoop like Pig, Hive, etc. I will put in parentheses the MapReduce extensions that add some of this functionality/these properties.
Some functionality/properties that RDBMSs have but native MapReduce does not:
Declarative query languages (Pig, Hive)
Schemas (Hive, Pig, DryadLINQ, Hadapt)
Logical Data Independence
Indexing (HBase)
Algebraic Optimization (Pig, Dryad, Hive)
Caching/Materialized Views
ACID/Transactions
Some functionality/properties that MapReduce has (relative to a regular RDBMS, not necessarily a parallel RDBMS):
High Scalability
Fault-tolerance
“One-person deployment”
I've been asked this question several times; the answer I usually give is a car analogy (which is pretty silly because I'm not a car person - but it seems to work).
Teradata is the car/dbms for the masses - it is reliable, mature, works well and is there when you need it. It is difficult (compared to Hadoop) to customise and add functionality to the base product.
Hadoop is the car/dbms for the enthusiast - it isn't as reliable or mature, it works well so long as you attend to it. It is easy (compared to Teradata) to customise and add functionality to the base product.
Put another way, Teradata is the reliable workhorse where you put your mission critical process (operational reporting, enterprise reporting, decision support etc).
Hadoop is the place where you can do a lot of this stuff, but don't be surprised if you come in one morning and find that your regulatory reports can't be produced because someone applied a patch or you've suddenly got a "too many small files" problem.
To loop back into the analogy, if you don't want to be too techy and the manufacturers product (dbms and/or car) works for you out of the box, Teradata is a good option.
On the other hand, if you like to tinker under the hood, swap out the carburettor (or whatever), adjust the gear ratios, tweak the fuel air mixture depending on whether you are country or city driving, bolt on a Turbo charger and/or your family complain about how long you spend in the garage on weekends - Hadoop is the place for you.
IMHO, most, if not all, organisations need both.
I hope this helps :-)
To begin with, vanilla Apache Hadoop is 100% open source. But if you need commercial support along with consultancy, there are companies like Cloudera, MapR, Hortonworks, etc.
Hadoop is backed by a growing community fixing bugs and making improvements on a consistent basis. Hadoop's storage model, HDFS, is based on Google's GFS architecture, which is proven to handle large quantities of data. Furthermore, Hadoop's analysis model, MapReduce, is based on Google's MapReduce model.
Hadoop is used by tech giants like Facebook, Yahoo, Twitter, eBay, etc. to store and analyze their high volumes of data, both in real time and in batch.
For your question about ETL systems, read these slides.
OK, now why Hadoop?
Open source
Proven storage and analysis model for large quantities of data
Minimal hardware requirements to set up and run
OK, now why TD?
Commercial support
Are there any enterprise-grade database engines (Oracle, MS SQL, etc.) that can handle large RDF datasets (320 million triples) and SPARQL queries? I guess my question is also: is SPARQL/RDF/OWL ready to serve large real-world data warehouses for an enterprise? If not, are there efficient mechanisms for adapting SPARQL/RDF to a typical data warehouse star schema?
Thanks!
Virtuoso is the datastore used by Bio2RDF and DBpedia.
Following on from Kaarel's suggestion, one of the entries presented at ISWC this year used 4store, which does scale that far, though the competitor set it up in some weird configuration that the CTO of Garlik (who develop 4store) described to me and colleagues as 'crazy'. Still, 4store would be capable of that scale - http://4store.org
Also, Virtuoso supports stores at this scale; they have a live application that you can use to run SPARQL queries over the majority of the major LOD (Linked Open Data) data sources, which total around 9 billion triples.
Virtuoso - http://virtuoso.openlinksw.com
LOD Application - http://lod.openlinksw.com/sparql
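For illustration only (not part of the answer above): a hedged sketch of how a Java client could query such an endpoint, assuming Apache Jena (ARQ) is on the classpath and using the public LOD endpoint linked above:

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class LodSparqlDemo {
    public static void main(String[] args) {
        // A trivial query: list a handful of distinct RDF types.
        String query = "SELECT DISTINCT ?type WHERE { ?s a ?type } LIMIT 10";

        // Sends the query to the remote Virtuoso-backed SPARQL endpoint.
        try (QueryExecution qe = QueryExecutionFactory.sparqlService(
                "http://lod.openlinksw.com/sparql", query)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("type"));
            }
        }
    }
}
```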
I maintain this list of large triplestores on the W3C wiki:
http://esw.w3.org/topic/LargeTripleStores
There are seven triplestores that are known to be able to hold over a billion triples. Four of them are open source. Please update the above-mentioned wiki page if you have more information.
Obviously, performance depends on what you use it for. I used Virtuoso in a large-scale industrial project, and it is quite fast.
Neo4j handles around 1+ billion triples out of the box via the SAIL API, while you still have the whole graph available for advanced things with tools like Gremlin or SPARQL.
Disclaimer: I am part of the Neo4j team.
Intellidimension provides a solution called Semantic Server that is developed on top of Microsoft's SQL Server 2005 or 2008. It easily scales to hundreds of millions of triples, and I know they have at least one customer happily running an enterprise deployment with over a billion statements.
I am one of their customers working with datasets > 100 million. Our plans are to move towards the 10s of billions of statements.
4store looks to be a good solution; however, the documentation is pretty sparse at this time, and when I last looked at it there was no ability to delete an individual triple from the graph.
I would also take a look at Bigdata.
Here is a quote from their main page summarizing their offering.
Bigdata(R) is an open-source scale-out storage and computing fabric supporting optional transactions, very high concurrency, and very high aggregate IO rates. Bigdata was designed from the ground up as a distributed database architecture optimized for very high aggregate IO rates running over clusters of 100s to 1000s of machines, but can also run in a single-server mode. Bigdata offers a distributed file system, similar to the Google File System but also useful for workflow queues, a data extensible sparse row store, similar to Googles widely recognized bigtable project, and map/reduce processing for parallelizing data intensive workflows over a cluster.
Bigdata(R) comes packaged with a very high-performance RDF store supporting RDF(S) and OWL Lite inference. The Bigdata RDF Store is currently the only RDF database capable of operating distributed on a cluster with dynamic key-range partitioning of indices. The Bigdata RDF Store was designed specifically to meet requirements for very large scale semantic alignment and federation. RDF is a Semantic Web technology particularly well-suited to modeling graph-shaped data and metadata, such as an associative entity-link model, whereby actors are linked to one another in an ad-hoc fashion within the context of an evolving ontology of concepts for entity types and link types related to a particular problem domain. The Bigdata RDF Store is used operationally in data harvesting systems to create mash-ups of structured, semi-structured, and unstructured data from myriad sources in a schema-flexible manner.
I'm looking for a cross-platform database engine that can handle databases of up to hundreds of millions of records without severe degradation in query performance. It needs to have a C or C++ API which will allow easy, fast construction of records and parsing of returned data.
Highly discouraged are products where data has to be translated to and from strings just to get it into the database. The technical users storing things like IP addresses don't want or need this overhead. This is a very important criterion, so if you're going to refer to products, please be explicit about how they offer such a direct API. Not wishing to be rude, but I can use Google - please assume I've found most mainstream products, and I'm asking because it's often hard to work out just what direct API they offer, rather than just a C wrapper around SQL.
It does not need to be an RDBMS - a simple ISAM record-oriented approach would be sufficient.
Whilst the primary need is for a single-user database, expansion to some kind of shared file or server operations is likely for future use.
Access to source code, either open source or via licensing, is highly desirable if the database comes from a small company. It must not be GPL or LGPL.
You might consider c-tree by FairCom - tell 'em I sent you ;-)
I'm the author of hamsterdb.
Tokyo Cabinet and BerkeleyDB should work fine. hamsterdb definitely will work. It has a plain C API, is open source and platform-independent, is very fast, and has been tested with databases up to several hundred GB and hundreds of millions of items.
If you are willing to evaluate it and need support, drop me a mail (contact form on hamsterdb.com) - I will help as well as I can!
bye
Christoph
You didn't mention what platform you are on, but if Windows-only is OK, take a look at the Extensible Storage Engine (previously known as Jet Blue), the embedded ISAM table engine included in Windows 2000 and later. It's used for Active Directory, Exchange, and other internal components, and is optimized for a small number of large tables.
It has a C interface and supports binary data types natively. It supports indexes, transactions and uses a log to ensure atomicity and durability. There is no query language; you have to work with the tables and indexes directly yourself.
ESE doesn't like to open files over a network, and doesn't support sharing a database through file sharing. You're going to be hard pressed to find any database engine that supports sharing through file sharing. The Access Jet database engine (AKA Jet Red, totally separate code base) is the only one I know of, and it's notorious for corrupting files over the network, especially if they're large (>100 MB).
Whatever engine you use, you'll most likely have to implement the shared usage functions yourself in your own network server process or use a discrete database engine.
For anyone finding this page a few years later, I'm now using LevelDB with some scaffolding on top to add the multiple indexing necessary. In particular, it's a nice fit for embedded databases on iOS. I ended up writing a book about it! (Getting Started with LevelDB, from Packt in late 2013).
One option could be Firebird. It offers both a server based product, as well as an embedded product.
It is also open source and there are a large number of providers for all types of languages.
I believe what you are looking for is BerkeleyDB:
http://www.oracle.com/technology/products/berkeley-db/db/index.html
Never mind that it's Oracle, the license is free, and it's open-source -- the only catch is that if you redistribute your software that uses BerkeleyDB, you must make your source available as well -- or buy a license.
It does not provide SQL support, but rather direct lookups (via b-tree or hash-table structure, whichever makes more sense for your needs). It's extremely reliable, fast, ACID, has built-in replication support, and so on.
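To show the direct put/get style described above, here is a hedged sketch using the pure-Java Berkeley DB Java Edition API (the database name and key bytes are made up); the C library follows the same pattern with DB->put/DB->get and DBT structs:

```java
import java.io.File;
import java.nio.charset.StandardCharsets;

import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

public class BdbDirectLookup {
    public static void main(String[] args) throws Exception {
        // The environment home directory must exist before opening it.
        File home = new File("bdb-env");
        home.mkdirs();

        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(home, envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        Database db = env.openDatabase(null, "hosts", dbConfig);

        // Keys and values are raw byte arrays - no string conversion of the data.
        DatabaseEntry key = new DatabaseEntry(new byte[] {10, 0, 0, 1});
        DatabaseEntry value = new DatabaseEntry("gateway".getBytes(StandardCharsets.UTF_8));
        db.put(null, key, value);

        DatabaseEntry found = new DatabaseEntry();
        if (db.get(null, key, found, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
            System.out.println(new String(found.getData(), StandardCharsets.UTF_8));
        }

        db.close();
        env.close();
    }
}
```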
Here is a small quote from the page I refer to above, that lists a few features:
Data Storage
Berkeley DB stores data quickly and easily without the overhead found in other databases. Berkeley DB is a C library that runs in the same process as your application, avoiding the interprocess communication delays of using a remote database server. Shared caches keep the most active data in memory, avoiding costly disk access.
Local, in-process data storage
Schema-neutral, application native data format
Indexed and sequential retrieval (Btree, Queue, Recno, Hash)
Multiple processes per application and multiple threads per process
Fine grained and configurable locking for highly concurrent systems
Multi-version concurrency control (MVCC)
Support for secondary indexes
In-memory, on disk or both
Online Btree compaction
Online Btree disk space reclamation
Online abandoned lock removal
On disk data encryption (AES)
Records up to 4GB and tables up to 256TB
Update: Just ran across this project and thought of the question you posted:
http://tokyocabinet.sourceforge.net/index.html. It is under LGPL, so not compatible with your restrictions, but an interesting project to check out nonetheless.
SQLite would meet those criteria, except for the eventual shared-file scenario in the future (and actually it could probably do that too, if the network file system implements file locks correctly).
Many good solutions (such as SQLite) have been mentioned. Let me add two, since you don't require SQL:
HamsterDB: fast, simple to use, can store arbitrary binary data. No provision for shared databases.
GLib's HashTable module seems quite interesting too and is very common, so you won't risk going into a dead end. On the other hand, I'm not sure there is an easy way to store the database on disk; it's mostly for in-memory stuff.
I've tested both on multi-million records projects.
Since you are familiar with FairCom's c-tree, you are probably also familiar with Raima RDM.
It went open source a few years ago, then dbstar claimed that they had somehow acquired the copyright. This seems debatable though. From reading the original Raima license, this does not seem possible. Of course it is possible to stay with the original code release. It is rather rare, but I have a copy archived away.
SQLite tends to be the first option. It doesn't store data as strings; you do have to prepare a SQL statement for the insertion, but with bound parameters the per-row data itself never has to be converted to or from strings (a sketch of this is below).
BerkeleyDB is a well-engineered product if you don't need a relational DB. I have no idea what Oracle charges for it, or whether you would need a license for your application.
Personally, I would consider why you have some of your requirements. Have you done testing to verify the requirement that you need to do direct insertion into the database? It seems like you could take a couple of hours to write a wrapper that converts from whatever API you want to SQL and then see if SQLite, MySQL, ... meet your speed requirements.
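A hedged sketch of that wrapper idea, shown with JDBC and the xerial sqlite-jdbc driver for brevity (the table and column names are invented). SQLite's C API gives the same pattern via sqlite3_prepare_v2 and sqlite3_bind_blob, so the IP bytes never become strings:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class SqliteBlobDemo {
    public static void main(String[] args) throws Exception {
        // Assumes the sqlite-jdbc driver is on the classpath.
        try (Connection c = DriverManager.getConnection("jdbc:sqlite:records.db")) {
            try (Statement s = c.createStatement()) {
                s.execute("CREATE TABLE IF NOT EXISTS hosts (ip BLOB PRIMARY KEY, seen INTEGER)");
            }
            // The SQL text is built once; the per-row data is bound as-is,
            // so an IP address is stored as raw bytes, not as a string.
            try (PreparedStatement ps =
                     c.prepareStatement("INSERT OR REPLACE INTO hosts (ip, seen) VALUES (?, ?)")) {
                ps.setBytes(1, new byte[] {10, 0, 0, 1});
                ps.setLong(2, System.currentTimeMillis() / 1000);
                ps.executeUpdate();
            }
        }
    }
}
```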
There used to be a product called Btrieve, but I'm not sure if source code was included. I think it has been discontinued. The only database engine I know of with an ISAM orientation is c-tree.