Columnar Database Comparisons and DBA efforts

I'm trying to find a database solution and I came across Infobright and Amazon Redshift as potential solutions. Both are columnar databases. Infobright has been around for quite some time, whereas Amazon Redshift is newer.
How does the DBA effort compare between Infobright and Amazon Redshift?
How accessible is Infobright (API, query interface, etc.) vs AWS?
Where do both sit in your system architecture? Do they operate as a layer on top of your traditional RDBMS?
What is the DevOps effort of setting up both Infobright and Redshift?
I'm leaning a bit more towards Redshift because my application is hosted on AWS and I thought this would create tangible benefits in the long run since everything is in AWS. Thank you in advance!

Firstly, I'll admit that I work for Infobright. I've done significant research into Redshift, and I feel I can give an honest opinion. I just wrote up a comparison between the two technologies; it can be found here: https://www.infobright.com/wp-content/plugins/download-monitor/download.php?id=37
DBA Effort - Infobright requires very little administration. You cannot create indexes, and you don't need to partition, etc. It's an SMP architecture and scales well, so you won't be dealing with multiple nodes. Redshift is also fairly simple. You will need to maintain sort order as well as ensure ANALYZE is run often enough.
Infobright uses a MySQL shell, so any tool that can talk to MySQL can talk to Infobright, and you have the same set of tools/interfaces/APIs for Infobright as you do with MySQL. Redshift does have a SQL interface, and it does have some API capabilities. It does require that you load directly from S3. Infobright loads from flat files and named pipes from local or remote servers.
Both databases are analytic databases. You would not want to use either as a transactional database. Instead, you typically push data from your transactional system to your analytic database.
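As a rough illustration of pushing data from a transactional system to an analytic one, here is a minimal sketch that dumps a table to a flat CSV stream, the kind of file a bulk loader could pick up. The `orders` table, its columns, and the use of in-memory SQLite as the "transactional system" are all assumptions made for the example, not anything specific to Infobright or Redshift.

```python
import csv
import io
import sqlite3


def export_orders_to_csv(oltp: sqlite3.Connection, out) -> int:
    """Write every row of the (hypothetical) orders table to a CSV stream."""
    writer = csv.writer(out)
    writer.writerow(["order_id", "customer_id", "amount"])
    count = 0
    query = "SELECT order_id, customer_id, amount FROM orders ORDER BY order_id"
    for row in oltp.execute(query):
        writer.writerow(row)
        count += 1
    return count


# In-memory SQLite stands in for the transactional database (assumption).
oltp = sqlite3.connect(":memory:")
oltp.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
oltp.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 19.5), (2, 11, 5.0), (3, 10, 42.0)],
)

buffer = io.StringIO()  # a real job would open a file or named pipe instead
exported = export_orders_to_csv(oltp, buffer)
print(exported, "rows exported")
```

The analytic side would then bulk-load that file with whatever loader it provides; that step is left out because it is product-specific.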
The DevOps effort to set up Infobright will be lower than for Redshift. However, Redshift is not overly complicated either. Maintenance of the environment is more of a requirement for Redshift, though.
Infobright does have many AWS-specific installations. In fact, we have implementations that approach nearly 100TB of raw storage on one server. That said, Redshift with many nodes can achieve petabyte scale on an implementation.
There are other factors that can impact your choice. For example, Redshift has very nice failover/HA options already built in. On the flip side, Infobright can support many concurrent queries and users; Redshift limits concurrent queries to 15 regardless of cluster size.
Take a look at the document, and feel free to contact me if you have any specific questions about either technology.

Related

Can we use Snowflake as a database for a data-driven web application?

I am an ASP.NET MVC/SQL Server developer and I am very new to all of this, so I may be on completely the wrong path.
I learned by googling that Snowflake can put/get data from AWS S3, Google Storage and Azure, and that Snowflake has its own databases and tables as well.
I have the following questions:
Why should one use Snowflake when you can compute your data with cloud storage (S3 etc.) and Talend or any other ETL tool?
Can we use Snowflake as a database for a data-driven web application? If yes, could you provide a link or something to get started?
Once again, I am very new to all of this and am hoping to get ideas and the best way to work around this.
Thank you in advance.

Why should one use Snowflake when you can compute your data with cloud storage (S3 etc.) and Talend or any other ETL tool?
You're talking about three different classes of technology product there, which are not equivalent:
Snowflake is a database platform. Like other database technologies, it provides data storage, metadata, and a SQL interface for data manipulation and management.
AWS S3 (and similar products) provides scalable cloud storage for files of any kind. You generally need to implement an additional technology such as Spark, Presto, or Amazon Athena to query data stored as files in cloud storage. Snowflake can also make use of data files in cloud storage, either querying the files directly as an "external table" or using a COPY statement to load the data into Snowflake itself.
Talend and other ETL or data integration tools are used to move data between source and target platforms. Usually this will be from a line of business application, such as an ERP system, to a data warehouse or data lake.
So you need to think about three things when considering Snowflake:
Where is your analytical data going to be stored? Is it going to be files in cloud storage, loaded into a database or a mix of both? There are advantages and disadvantages to each scenario.
How do you want to query the data? It's fairly likely you'll want something that supports SQL queries. As mentioned above, there are numerous technologies that support SQL on files in cloud storage. Query performance will generally be significantly better if the data is loaded into a dedicated analytical database, though.
How will the data get from the data sources to the analytical data repository, whatever that may be? Typically this will involve either a third party ETL tool, or rolling your own solution (which can be a cheaper option initially but can become a significant management and support overhead).
Can we use Snowflake as a database for a data-driven web application?
The answer to that is yes, in theory. It very much depends on what your web application does, because Snowflake is a database designed for analytics, i.e. crunching through large amounts of data to find answers to questions. It's not designed as a transactional database for a system that involves lots of updates and inserts of small amounts of data. For example, Snowflake doesn't enforce referential integrity.
However, if your web application is an analytical one (for example it has embedded reports that would query a large amount of data and users will typically be reading data and not adding it) then you could use Snowflake as a backend for the analytical part, although you would probably still want a traditional database to manage data for things like users and sessions.

You can connect your web application to Snowflake with one of the connectors, like https://docs.snowflake.com/en/user-guide/odbc.html
Snowflake excels at large analytic workloads that are difficult to scale and tune. If, for example, you have many (millions or billions of) events that you want to aggregate into dashboards, then Snowflake might be a good fit.

I agree with much of what Nathan said. To add to that, in my experience, every time I've created a database for an application it's been with an OLTP database like PostgreSQL, Azure SQL Database, or SQL Server.
One big problem of using MPP/Distributed Databases is that they don't enforce referential integrity, so if that's important to you then you don't want to use MPP/Distributed Databases.
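To make "enforcing referential integrity" concrete, here is a small sketch using SQLite, which lets you switch foreign key enforcement on and off per connection. This is only a stand-in to show the behavioural difference; it is not Snowflake code, and the `customers`/`orders` tables are made up for the example.

```python
import sqlite3


def insert_orphan(enforce: bool) -> bool:
    """Try to insert an order whose customer does not exist.

    Returns True if the database accepted the orphan row.
    """
    conn = sqlite3.connect(":memory:")
    # SQLite only checks foreign keys when this pragma is ON.
    conn.execute(f"PRAGMA foreign_keys = {'ON' if enforce else 'OFF'}")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, "
        "customer_id INTEGER REFERENCES customers(id))"
    )
    try:
        conn.execute("INSERT INTO orders VALUES (1, 999)")  # customer 999 is missing
        return True
    except sqlite3.IntegrityError:
        return False
    finally:
        conn.close()


print("accepted without enforcement:", insert_orphan(enforce=False))
print("accepted with enforcement:", insert_orphan(enforce=True))
```

With enforcement off, the constraint is effectively documentation: the orphan row goes in and it is up to the application or the load process to keep the data consistent.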
Snowflake and other MPP/Distributed Databases are NOT meant for OLTP workloads but instead for OLAP workloads. No matter what snake oil companies like Databricks and Snowflake try to sell you, MPP/Distributed databases are NOT meant for OLTP. The costs alone would be tremendous, even with auto-scaling.
If you think about it, Databricks, Snowflake, etc. have a limit to how much they want to optimize their platforms, because the longer a query runs the more money they make. To make money they have to optimize performance, but not too much, otherwise it will affect their income.
This can be an in-depth topic so I would recommend doing more research into OLTP Vs. OLAP.

Enforcing referential integrity is a double-edged sword. The downside is that as the data volume grows, the referential violation check significantly slows down inserts and deletes. This results in the developer having to put the RI check in the program (with a dirty read) and turn off RI enforcement in the database, finally ending up in a Snowflake-like situation.
The bottom line is that Snowflake not enforcing RI should not be a limitation for OLTP applications.
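For what it's worth, the "RI check in the program" described above might look roughly like the sketch below. The table names and the `insert_order_checked` helper are hypothetical, and SQLite is used only so the example runs anywhere. Note that a check-then-insert like this is only as safe as the isolation around it: another session could delete the parent row between the two statements.

```python
import sqlite3


def insert_order_checked(conn: sqlite3.Connection, order_id: int, customer_id: int) -> bool:
    """Insert an order only if the referenced customer exists (application-side RI)."""
    parent = conn.execute(
        "SELECT 1 FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    if parent is None:
        return False  # would be an orphan row, so refuse it in code
    conn.execute(
        "INSERT INTO orders (id, customer_id) VALUES (?, ?)", (order_id, customer_id)
    )
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
# No foreign key is declared: the database itself enforces nothing here.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.execute("INSERT INTO customers VALUES (1)")

print(insert_order_checked(conn, 100, 1))    # parent exists
print(insert_order_checked(conn, 101, 999))  # parent missing
```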

Cloud/hosted database/datastore services to replace local SQL Server instance

As a .NET web developer, I've always used SQL Server as my database store because it's already in the MSFT ecosystem and easy to work with from the .NET platform.
Recently, however, I had a computer almost literally blow up, and consequently lost all my data in SQL Server on that machine.
Now that I've got a new computer, I want to start using an off-site database so that this doesn't happen again: a database hosted by a third party (i.e. a hosting company) or a cloud service.
It doesn't have to be SQL Server or even an RDBMS necessarily, but if it's not, it'd have to be something cutting-edge (e.g. redis, Cassandra, MongoDB, CouchDB, etc.) and not just MySQL or PostgreSQL or something.
Does anyone have any recommendations for those with little financial means?
I'd like to be able to use it during development of projects, and if they ever go live, not have to migrate the data anywhere to a new service--keep the data right there where it is and point my live domain requiring the data to the same service it pointed while in development.

It's not so much a question of available hosted services as of what setup you want for your standard development environment. If one of the cloud datastores doesn't work for you, you can always get a virtual server and install whatever you need.
However, you may want to rethink the idea of putting dev databases in the cloud. Performance will not be as good as something running locally (particularly if you are working with things like bulk import), and turning a dev database into a production database isn't a particularly good idea. I think what you are really looking for is a combination of easy backup, schema management and data setup.
Backup on a live server is easy enough - either you are backing up the entire server or have a script that uploads the backup file somewhere. For dev I don't bother as I prefer to set up disposable environments - have code that can set up the database if it doesn't already exist and add any necessary default data. Most apps don't need much data unless there is some sort of import process involved, and the same code works quite nicely when you first set up the live environment.
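A disposable environment of the kind described above can be as simple as a setup function that is safe to run on every start. This is a minimal sketch, assuming SQLite and a made-up `roles` table with made-up default rows; the same idea carries over to any database that supports "create if not exists" and an idempotent insert.

```python
import sqlite3

DEFAULT_ROLES = ["admin", "editor", "viewer"]  # hypothetical default data


def ensure_database(conn: sqlite3.Connection) -> None:
    """Create the schema and seed default rows; safe to call repeatedly."""
    conn.execute("CREATE TABLE IF NOT EXISTS roles (name TEXT PRIMARY KEY)")
    # INSERT OR IGNORE skips rows that already exist, so re-running is harmless.
    conn.executemany(
        "INSERT OR IGNORE INTO roles (name) VALUES (?)",
        [(role,) for role in DEFAULT_ROLES],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
ensure_database(conn)
ensure_database(conn)  # second run changes nothing
print(conn.execute("SELECT COUNT(*) FROM roles").fetchone()[0], "roles present")
```

Because the function is idempotent, the same code can bootstrap a fresh dev machine and the first live deployment.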
Schema management is one of the more painful aspects of working with SQL and where NoSQL systems can make life a lot easier as most have the schema defined entirely by the code that is using it - I mostly use redis myself, but whether or not it is appropriate for you will depend on the type of project you work on - if you need a lot of joins or transactions you probably need SQL, but if you just need basic data storage most NoSQL platforms would be better.

May I suggest looking into Windows Azure table storage? It is quite different from the pure relational play of SQL Server, is the "next big thing" from Microsoft, and is in general somewhat of a paradigm shift for folks used to relational databases.
If you're ever going to come face to face with Azure in the future (and I suspect many .NET people will), it may be a beneficial experience to have.
With respect to costs, they're negligible for individual use. 10,000 transactions a month cost a penny. A gigabyte per month of storage costs 15 cents, and data transfers are 10-15 cents per gigabyte.
If you have only "development" projects that store their data in the cloud, I'll be damned if you pay more than $2-3/month to MS... if that :)
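To sanity-check that estimate, here is a small calculation using only the rates quoted above (a penny per 10,000 transactions, 15 cents per GB-month of storage, and the upper 15 cents per GB of transfer). The usage figures in the example call are made up, and the rates are the ones from this answer, not current pricing.

```python
def monthly_cost(transactions: int, storage_gb: float, transfer_gb: float,
                 transfer_rate: float = 0.15) -> float:
    """Estimate a monthly bill in dollars from the rates quoted in the answer."""
    transaction_cost = transactions / 10_000 * 0.01  # $0.01 per 10,000 transactions
    storage_cost = storage_gb * 0.15                 # $0.15 per GB per month
    transfer_cost = transfer_gb * transfer_rate      # $0.10 to $0.15 per GB
    return round(transaction_cost + storage_cost + transfer_cost, 2)


# Hypothetical dev workload: 500,000 transactions, 5 GB stored, 5 GB transferred.
print(monthly_cost(500_000, storage_gb=5, transfer_gb=5))
```

That workload comes to $0.50 + $0.75 + $0.75, i.e. about $2 a month, which is in line with the figure above.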

Google Cloud Datastore is in beta now and could be a good option for you. It's free up to 1GB and 50K requests per day. The API is rather low level. However, I wrote a high level ORM for GCD called Pogo that serializes and deserializes plain old objects into GCD entities.
Take a look at the documentation and open source here - http://code.thecodeprose.com/pogo
It's also available on NuGet as "Pogo".

Suggestions for a hosted database

I would like to have a SQL database online, but don't want to deal with its care and feeding. There are some commercial offerings out there for hosted DBs, for example Amazon SimpleDB. Can anybody suggest others, and if they used any of these services what their impressions were? Anything that helps me make an informed decision would be appreciated.
Edit: Since there's no one true answer, I've made this a community wiki.

Did you take a look at the Amazon Relational Database Service? It is a MySQL instance, and it is priced in a similar fashion to the EC2 products.

Google's AppEngine also has a SQL Database: http://code.google.com/appengine that is free, but it doesn't scale very well.
Amazon's SimpleDB is lacking a large chunk of the MySQL API, so if you want to go this route try to stick to SQL92 as much as possible. Also, keep in mind that you are charged per query. This means you want to make every query count. One way of doing that is by using relative updates:
UPDATE persondata SET age=age+1;
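To show why the relative update above saves queries, here is a small sketch that runs it against an in-memory SQLite table (SQLite is only a stand-in so the example is runnable; the sample rows are made up). One statement updates every row, whereas reading each age into the application and writing it back would cost one SELECT plus one UPDATE per row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE persondata (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO persondata VALUES (?, ?)",
    [("ann", 30), ("bob", 41), ("cy", 25)],
)

# A single relative UPDATE touches every row in one statement.
cursor = conn.execute("UPDATE persondata SET age = age + 1")
rows_changed = cursor.rowcount

# The read-modify-write alternative: one SELECT, then one UPDATE per row.
statements_naive = 1 + rows_changed

ages = [age for (age,) in conn.execute("SELECT age FROM persondata ORDER BY name")]
print(f"1 statement instead of {statements_naive}; ages are now {ages}")
```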
To be honest SimpleDB is a waste of money unless you need a large SQL cluster. I'd start off with a local sql db, when your load starts to get out of hand, move the sql db to its own server. After that, you will be looking at clustering your SQL db, and then SimpleDB starts to become an attractive solution.

What database to use for big data storage and manipulation?

I have to decide which database server to use for my next project, but the simple decision to use MySQL, as in almost all the projects I've done, is harder now because I expect a very large number of records.
The database will store a user list, some other irrelevant tables, and the last one, some user-collected data. Let's say I have 6000 users responding to a quiz about each other. Simple math shows that if each one completes the quiz about everyone else (and in my project it is 99% sure that will happen), I'll end up with 35.99 million records (they exclude themselves, so in this particular situation the operation is 6000*5999). Unfortunately, 6000 may be a small number, with the real one growing day by day.
What to choose? MySQL, and if things go well and the project grows, expand it into a cluster? PostgreSQL, MSSQL? Oracle?
I've read about all of them, and each one has its pros and cons, but I still don't know what to choose. The advantage of MySQL and PostgreSQL is, of course, the starting price of $0, which is pretty nice in a usual self-funded startup.
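The record count in that estimate is just n * (n - 1), which grows quadratically with the number of users. A quick sketch (the function name is made up for illustration):

```python
def quiz_records(users: int) -> int:
    """Records produced when every user answers the quiz about every other user."""
    return users * (users - 1)


for users in (6_000, 12_000):
    print(f"{users:>6} users -> {quiz_records(users):,} records")
```

So doubling the user base from 6000 to 12000 roughly quadruples the table, from 35,994,000 to 143,988,000 rows.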
Any opinions, pieces of advice? If you encountered this situation in your experience as developers, I'd love to hear from you.

These days, free isn't something that differentiates databases any more. Both Oracle and SQL Server have free versions, but the limitations are on resources: 4 GB database size, RAM, and single-CPU utilization. Millions of records is not a concern; what matters is which datatypes you're using.
I saw the OP's comment about not liking MS software. That's your prerogative, but the free versions of both Oracle and SQL Server benefit from a seamless transition to the larger editions of the respective database.
Personally, my choice would be either Oracle or SQL Server because of what are, IMHO, real feature considerations like hierarchical query support, subquery factoring/CTE, packages (long before I get concerned with functions/procedures), full text searching, XML support, etc.

MySQL will handle 35 million records no problem. Worry about scalability when you get there. You can easily add RAID hard disks backing your database tables, and if you really start getting big you can get a Compellent SAN that will scream... Don't worry about the DB engine as much as the underlying hardware. MySQL rocks for us with millions of records.

I've had no problems handling tables as large as 36,000,000 rows on MySQL and Oracle.
Just be sure that you index the proper columns, run EXPLAINs for your queries, and maintain proper design principles.
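Here is a small sketch of the "index, then check with EXPLAIN" loop. It uses SQLite's `EXPLAIN QUERY PLAN` because that runs anywhere; MySQL and Oracle have their own EXPLAIN output formats, so treat this as an illustration of the workflow rather than of their syntax. The `responses` table and index name are invented for the example.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params=()) -> str:
    """Return SQLite's plan for a query as one line of text."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row is the human-readable plan detail.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE responses ("
    "id INTEGER PRIMARY KEY, rater_id INTEGER, subject_id INTEGER, score INTEGER)"
)
query = "SELECT * FROM responses WHERE subject_id = ?"

plan_before = query_plan(conn, query, (42,))  # no usable index: full table scan
conn.execute("CREATE INDEX idx_responses_subject ON responses (subject_id)")
plan_after = query_plan(conn, query, (42,))   # planner can now use the index

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan text varies between SQLite versions, but the first plan reports a scan of the table and the second names the index.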

Most of the truly large scale web properties use a distributed key-value store. That said, 35 million is large, but not that large. With most modern databases, your main two scaling worries should be throughput and what happens when no single box can contain your entire database anymore. And both of these problems can be solved to some degree for any database you choose to use. (Caching, replication, sharding, etc.)
Use MySQL until you can't anymore. At that point, you ought to be rolling in dough anyways and you now have a very desirable problem.
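Of the techniques mentioned above, sharding is probably the least obvious, so here is a minimal sketch of hash-based shard routing. The shard count and the choice of `user_id` as the shard key are assumptions for illustration; real setups also have to deal with resharding and cross-shard queries, which this ignores.

```python
import hashlib


def shard_for(user_id: int, shard_count: int) -> int:
    """Map a user to a shard deterministically via a stable hash of the key."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Count how 6000 hypothetical users spread over 4 shards.
SHARDS = 4
counts = [0] * SHARDS
for user_id in range(6000):
    counts[shard_for(user_id, SHARDS)] += 1
print(counts)
```

A cryptographic hash is used rather than Python's built-in `hash()` so the mapping stays the same across processes and restarts.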

Use MySQL as it's free and you have experience with it.
Besides, in my opinion, how you design the tables matters more than which database you use.

35 million records can be easily handled by MS SQL Server (assuming proper database design, indices, etc.). You can start with the free SQL Server Express edition and later, if you need, you can upgrade to the full version which supports clustering, etc.
SQL Server Express does have some limitations - single CPU, 1 GB memory, max 4 GB database size and a few other things. I'm not sure how quickly these limitations will become a problem but you can always move to the full version when you run into them.

MySQL(i) & PostgreSQL
$0 cost
large community
many tutorials
well documented
MSSQL
You can get "money" from MS if you promote that you are using MSSQL (secret information from some companies I worked for)
MS tools work very well
Complete tool set from C# IDE over .NET lib to Windows Server 2003
Oracle
Professional and commercial provider
Used by many large companies (I also heard about Blizzard (World of Warcraft) using Oracle)
- expensive
The final decision depends on the very special requirements of your project.
Make yourself a quick list of the things that ARE IMPORTANT for your project (e.g. fast queries) and look up which database's pros best match your requirements.
Everything is about design. SQL databases are a bit like cars: you just have to know which component goes where.
Make a clear design and you won't struggle with any of them.

Maybe you can try Firebird.
Blog post about big Firebird database here
MySQL licence is here (not always free).
Postgresql and Firebird are free.

First of all, don't think about performance. Premature optimization being the root of all evil and all that. You can always throw more hardware and/or tuning at it later.
All of the databases mentioned should perform nicely if tuned/maintained correctly. I'd focus on manageability and familiarity. IMHO open source databases excel at manageability (perhaps not the best GUIs, but the CLI has been my home for a long long time).
And if the database becomes the bottleneck, why limit yourself to those choices? How about a key-value distributed database? Or perhaps serialize data directly to disk? Storing data outside of a RDBMS, while often frowned upon, might be the correct path. Or simply use the common route of denormalization.
Always remember not to optimize prematurely.
As far as opinions go (since you specifically asked for it) I favor open source databases, specifically PostgreSQL. It's rock solid, fast and very well-featured. And even with (relatively) large datasets it has performed superbly on mediocre hardware (some tuning involved, of course, but you can't skip that step no matter which db you end up choosing).

SQL Server and Oracle, which one is better in terms of scalability? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
MS SQL Server and Oracle, which one is better in terms of scalability?
For example, if the data size reach 500 TB etc.

Both Oracle and SQL Server are shared-disk databases so they are constrained by disk bandwidth for queries that table scan over large volumes of data. Products such as Teradata, Netezza or DB/2 Parallel Edition are 'shared nothing' architectures where the database stores horizontal partitions on the individual nodes. This type of architecture gives the best parallel query performance as the local disks on each node are not constrained through a central bottleneck on a SAN.
Shared disk systems (such as Oracle Real Application Clusters or Clustered SQL Server installations) still require a shared SAN, which has constrained bandwidth for streaming. On a VLDB this can seriously restrict the table-scanning performance that is possible to achieve. Most data warehouse queries run table or range scans across large blocks of data. If the query will hit more than a few percent of rows, a single table scan is often the optimal query plan.
Multiple local direct-attach disk arrays on nodes gives more disk bandwidth.
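The shared-nothing idea can be sketched in a few lines: rows are split into horizontal partitions by key, each "node" scans only its own partition, and a coordinator combines the partial results. This is a toy model using threads in one process; the node count, row layout, and function names are all made up, and it says nothing about how Teradata or Netezza actually implement it.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_rows(rows, node_count):
    """Assign each (key, value) row to a node by key, i.e. horizontal partitioning."""
    partitions = [[] for _ in range(node_count)]
    for key, value in rows:
        partitions[key % node_count].append((key, value))
    return partitions


def scan_partition(partition):
    """Each node scans only its local rows and returns a partial aggregate."""
    return sum(value for _, value in partition)


def parallel_total(rows, node_count=4):
    """Coordinator: fan the scan out to every node, then combine the partials."""
    partitions = partition_rows(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(scan_partition, partitions))
    return sum(partials)


rows = [(key, key * 2) for key in range(1000)]
print(parallel_total(rows))
```

The point of the architecture is that each partial scan reads from disks local to that node, so adding nodes adds scan bandwidth instead of contending for one shared SAN.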
Having said that, I am aware of an Oracle DW shop (a major European telco) that has an Oracle-based data warehouse that loads 600 GB per day, so the shared disk architecture does not appear to impose insurmountable limitations.
Between MS-SQL and Oracle there are some differences. IMHO Oracle has better VLDB support than SQL server for the following reasons:
Oracle has native support for bitmap indexes, which are an index structure suitable for high speed data warehouse queries. They essentially do a CPU for I/O tradeoff as they are run-length encoded and use relatively little space. On the other hand, Microsoft claim that Index Intersection is not appreciably slower.
Oracle has better table partitioning facilities than SQL Server. IIRC The table partitioning in SQL Server 2005 can only be done on a single column.
Oracle can be run on somewhat larger hardware than SQL Server, although one can run SQL server on some quite respectably large systems.
Oracle has more mature support for Materialized views and Query rewrite to optimise relational queries. SQL2005 does have some query rewrite capability but it is poorly documented and I haven't seen it used in a production system. However, Microsoft will suggest that you use Analysis Services, which does actually support shared nothing configurations.
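For readers who haven't met bitmap indexes, here is a toy sketch of the idea mentioned in the first point above: one bitmap per distinct column value, run-length encoding to keep the bitmaps small, and a bitwise AND to combine predicates. It is a conceptual illustration with invented data and function names, not a description of Oracle's on-disk format.

```python
from itertools import groupby


def build_bitmap_index(values):
    """One bitmap per distinct value; bit i is 1 when row i holds that value."""
    index = {}
    for position, value in enumerate(values):
        index.setdefault(value, [0] * len(values))[position] = 1
    return index


def run_length_encode(bits):
    """Compress a bitmap into (bit, run length) pairs."""
    return [(bit, len(list(group))) for bit, group in groupby(bits)]


def rows_matching(bitmap_a, bitmap_b):
    """AND two bitmaps and return the row numbers that satisfy both predicates."""
    return [i for i, (a, b) in enumerate(zip(bitmap_a, bitmap_b)) if a and b]


region = ["EU", "EU", "EU", "US", "US", "EU"]
status = ["paid", "open", "paid", "paid", "open", "paid"]
region_index = build_bitmap_index(region)
status_index = build_bitmap_index(status)

print(run_length_encode(region_index["EU"]))
print(rows_matching(region_index["EU"], status_index["paid"]))
```

The CPU-for-I/O tradeoff shows up here in miniature: the encoded bitmaps are small to read, but answering a query means decoding and combining them in memory.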
Unless you have truly biblical data volumes and are choosing between Oracle and a shared nothing architecture such as Teradata you will probably see little practical difference between Oracle and SQL Server. Particularly since the introduction of SQL2005 the partitioning facilities in SQL Server are viewed as good enough and there are plenty of examples of multi-terabyte systems that have been successfully implemented on it.

When you are talking 500TB, that is (a) big and (b) specialized.
I'd be going to a consultancy firm with appropriate specialists to look at the existing skill sets, integration with existing technology stacks, expected usage, backup/recovery/DR requirements....
In short, it's not the sort of project I'd be heading into based on opinions from stackoverflow. No offence intended, but there's simply too many factors to take into account, a lot of which would be business confidential.

Whether Oracle or MSSQL will scale / perform better is question #15. The data model is the first make-it or break-it item regardless of whether you're running Oracle, MSSQL, Informix or anything else. The data model structure, what kind of application it is, how it accesses the db, which platform your developers know well enough to target for a large system, etc. are the first questions you should ask yourself.

I've worked as a DBA on Oracle (although some years back) and I use MSSQL extensively now, although not as a formal DBA. My advice would be that in the vast majority of cases both will meet everything you can throw at them, and your performance issues will be much more dependent upon database design and deployment than the underlying characteristics of the products, which in both cases are absolutely and utterly solid (MSSQL is the best product that MS makes in many people's opinion, so don't let the usual perception of MS blind you on that).
Myself I would tend towards MSSQL unless your system is going to be very large and truly enterprise level (massive numbers of users, multiple 9's uptime etc.) simply because in my experience Oracle tends to require a higher level of DBA knowledge and maintenance than MSSQL to get the best out of it. Oracle also tends to be more expensive, both for initial deployment and in the cost to hire DBAs for it. OTOH if you are looking at an enterprise system then Oracle would have the edge, not least because if you can afford it their support is second to none.

I have to agree with those who said design was more important.
I've worked with superfast and super slow databases of many different flavors (the absolute worst being an Oracle database, but it wasn't Oracle's fault). Design of the database and how you decide to index it and partition it and query it have far more to do with the scalability than whether the product is MS SQL Server or Oracle.
I think you may more easily find Oracle DBAs with terabyte database experience (running a large database is a specialty just like knowing a particular flavor of SQL) but that could depend on your local area.

Oracle people will tell you Oracle is better, SQL Server people will tell you SQL Server is better.
I say they scale pretty much the same. Use what you know better. There are databases out there of that size on Oracle as well as on SQL Server.

When you get to OBSCENE database sizes (where over 1TB is really big enough, and 500TB is frigging massive), then operational support must come very high up on the list of requirements. With that much data, you don't mess about with penny pinching system specifications.
How are you going to back up a system of that size? Upgrade the OS and patch the database? Are scalability and reliability a concern?
I have experience of both Oracle and MS SQL, and for the really really big systems (users, data or importance) then Oracle is better designed for operational support and data management.
Ever tried to back up and restore a 1TB+ SQL Server database split over multiple databases on multiple instances with transaction log files being spat out everywhere by each database and trying to keep it all in sync? Good luck with that.
With Oracle, you have ONE database (so I disagree that the "shared nothing" approach is better) with ONE set of REDO logs(1) and one set of archive logs(2) and you can just add extra hardware nodes without changing (i.e. repartitioning) your application and data.
(1) Redo logs are, of course, mirrored.
(2) Archive logs are, of course, stored in multiple locations.

It would also depend on what your application is meant for. If it uses only inserts with very few updates, then I think MSSQL would be more scalable and better in terms of performance. However, if it has lots of updates, then Oracle would scale up better.

I very much doubt that you are going to get an objective answer to that particular question until you come across someone who has implemented the same database (schema, data, etc.) on both platforms.
However, given the fact that you can find millions of happy users of both databases, I dare say it's not too much of a stretch to say either will scale just fine (I've seen a snappy SQL 2005 implementation of 300 TB that seemed pretty responsive).

Oracle is like a high-quality manual film camera, which needs the best photographer to take the best picture, while MS SQL is like an automatic digital camera. In the old days, of course, all professional photographers used film cameras; now think about how many professional photographers use automatic digital cameras.
