Data migration from Sybase to Teradata

I am trying to do a case study on migrating our data from Sybase to Teradata. We would then try to implement the same in our organization. Based on my study, Teradata is excellent when it comes to supporting a high number of concurrent queries, more active sessions per system, and a high volume of transactions including ETL jobs. But are there any other advantages we would gain by moving to Teradata (areas where Sybase falls short)? Also, what key factors or constraints should I consider while performing such a migration?

Actually, Michael, Sybase IQ can be compared to Teradata as an OLAP solution.
To OP: cost would be a huge advantage for Sybase. Also, you can maintain Sybase yourself, whilst for Teradata you're likely to contract Teradata staff to maintain the hardware.
But no matter the size, you can bring Teradata to its knees just as easily with a handful of concurrent queries or poorly written ETL jobs. Teradata does perform very well on heavy aggregation over huge dimensions, provided the data is spread evenly (not skewed) across the controllers.
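As a rough illustration of what "spread evenly across the controllers" means in practice: Teradata distributes rows to its parallel units (AMPs) by hashing the PRIMARY INDEX, so picking a high-cardinality column keeps both the data and the aggregation work balanced. A minimal sketch with hypothetical table and column names, not from the question:

CREATE MULTISET TABLE sales_fact (
    sale_id     BIGINT NOT NULL,
    store_id    INTEGER,
    sale_date   DATE,
    sale_amount DECIMAL(12,2)
)
PRIMARY INDEX (sale_id);   -- high cardinality: rows spread evenly across AMPs
-- PRIMARY INDEX (store_id) would pile each store's rows onto one AMP and skew the work.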

Related

Choosing the right database engine for relational data with billions of records

My Python application's data structure is purely relational.
My estimate for the biggest table is around 10 billion rows per year (all the other tables are very small).
Each row is about 20-30 bytes.
What is the right database engine for me?
You might consider the following options, though of course the right choice depends on what your data looks like and how your app and users need to interact with it. This is not an exhaustive list; it's only the stuff I have used.
Greenplum, an open source distributed Postgres database: http://greenplum.org/
It scales nicely and supports pretty much all Postgres features except full-text indexing, last I knew.
Apache Phoenix: an open source SQL layer on top of Hadoop/HBase. It scales nicely, but the ecosystem is a bit complex (as per Hadoop). Cloudera's Impala is similar. https://phoenix.apache.org/
Oracle Partitioning (preferably on RAC). If you can afford the license, Oracle Partitioning allows sharding of your data in various ways. If you have it with RAC, that will also give you parallel query execution.
Just partition your data (on any RDBMS) and put the partitions on good disks; see the Postgres sketch below.
Those are the four ideas I have actually used, and remember: on good hardware, with some table partitioning, 10B rows isn't really all that much, so you might just need to get a better box (or boxes) and hook it up to a SAN with some kind of SSD over a 10G network or better. Also think about putting indexes on a separate disk from the database files, and always use SSDs if you can afford them.
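A minimal sketch of those last two ideas in PostgreSQL (10+ declarative partitioning); the table, column, and tablespace names here are hypothetical, not from the question:

-- Hypothetical tablespace on a separate physical disk, just for indexes.
CREATE TABLESPACE index_disk LOCATION '/mnt/ssd1/pg_indexes';

-- Range-partitioned big table; each year lands in its own partition.
CREATE TABLE events (
    event_id   bigint      NOT NULL,
    event_time timestamptz NOT NULL,
    value      real
) PARTITION BY RANGE (event_time);

CREATE TABLE events_2023 PARTITION OF events
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE events_2024 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- Keep the indexes on the separate disk, away from the data files.
CREATE INDEX ON events_2023 (event_time) TABLESPACE index_disk;
CREATE INDEX ON events_2024 (event_time) TABLESPACE index_disk;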
Anyway, HTH
MG
At 30 bytes per row that's less than 300GB, which is a small database, well within the capabilities of Oracle or SQL Server Enterprise editions. You won't need Oracle RAC.
You'll need to pay attention to application design and indexing/partitioning. Query and storage optimization will have a greater impact on performance than the choice of DBMS will.

Hive vs SQL Server performance

1) I started using Hive about 2 months ago. I have the same task that I run in SQL Server. I found that Hive is slow and takes a long time to execute queries, while SQL Server executes them in minutes or even seconds.
After executing the task in Hive, when I cross-checked the results in both (SQL Server and Hive), I found some differences (not in all tables, but in some).
E.g.: I have one table with 2012 records; when I executed the task on the same table in Hive, I got 2007 records.
Why is this happening?
2) What should I do to speed up execution in Hive?
(Currently I am running all of this on a single cluster only. If I increase the size of the cluster, how many nodes would I need to improve performance?)
Please suggest some solutions or good practices so that I can do this properly.
Thanks.
Hive and SQL Server are not comparable in any way other than the similarity in the syntax of the query language.
While SQL Server is built to respond in real time from a single machine, Hive is for processing large data sets that may span hundreds or thousands of machines.
Hive (via Hadoop) has a lot of overhead for starting up a job.
Hive and Hadoop will not cache data in memory the way SQL Server does.
Hive has only recently added indexes, so most queries end up being full table scans.
If your dataset fits on a single computer, you probably want to stick with SQL Server and not Hive. Hive performance tuning is mostly a matter of Hadoop performance tuning, although depending on the types of queries you run there can be free performance gains from using the LazyBinarySerDe.
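For example, an intermediate table can be stored with that SerDe via CTAS; this is a minimal HiveQL sketch with hypothetical table names (staging_binary, source_table), not something from the question:

-- Store a working/intermediate table in a compact binary format rather
-- than delimited text, which can shave time off downstream queries.
CREATE TABLE staging_binary
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe'
STORED AS SEQUENCEFILE
AS
SELECT * FROM source_table;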
Hive does have some differences from regular SQL that may be affecting your query. Without more details I can't speculate as to why.
Ignore the "they aren't comparable in any way" comment. If it stores data, it is comparable to any other method of storing data.
But be aware that SQL Server, 13 years ago, already had 1000+ people being paid full-time to improve the product. While that doesn't "prove" anything, it does increase one's confidence that more work = more results.
More importantly, look for any non-trivial benchmark done on an open source and/or non-relational method of storing data vs one of the mainstream relational databases. You won't find them. That says a lot to me. (Also, mainstream isn't necessary since the current world's fastest data engine isn't even mainstream. But if that level is needed, look at ExoSol.)
If your need is to learn to work with the technology at your job and that technology is Hive, my recommendation is to find someone who is really focused on getting as much out of Hive query performance as possible. If there is a Hive query guru out there, find them. But if you need a lot more than what they can give you, you're using the wrong technology.
And if Hive isn't a requirement, I would avoid it and other technologies lacking the compelling business model that will guarantee their survival past 5 years and move them out of the niche category they currently occupy (currently 20 times less popular than any mainstream data engine - https://db-engines.com/en/ranking).

Best database for multi million row store/query

We have a database that has been growing for around 5 years. The main table has nearly 100 columns and 700 million rows (and growing).
The common use case is to count how many rows match a given set of criteria, for example:
select count(*) from main_table where column1 = 'TypeA' and column2 = 'BlockC';
The other use case is to retrieve the rows that match a criteria.
The queries started out taking a little while; now they take a couple of minutes.
I want to find a DBMS that lets me make these two use cases as fast as possible.
I've been looking into some column-store databases and Apache Cassandra, but I still have no idea what the best option is. Any ideas?
Update: these days I'd recommend Hive 3 or PrestoDB for big data analysis
I am going to assume this is an analytic (historical) database with no current data. If not, you should consider separating your databases.
You are going to want a few features to help speed up analysis:
Materialized views. This is essentially pre-calculating values and then storing the results for later analysis. MySQL and Postgres (coming soon in Postgres 9.3) do not support this, but you can mimic them with triggers (see the sketch after this list).
Easy OLAP analysis. You could use the Mondrian OLAP server (Java); Excel doesn't talk to it easily, but JasperSoft and Pentaho do.
You might want to change the schema for easier OLAP analysis, i.e. a star schema. Good book:
http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247/ref=pd_sim_b_1
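A minimal sketch of the materialized-view idea in PostgreSQL 9.3+, assuming the main table is called main_table and reusing the column names from the question's count query; the view name is hypothetical:

-- Pre-aggregate the counts the application keeps asking for.
CREATE MATERIALIZED VIEW match_counts AS
SELECT column1, column2, count(*) AS row_count
FROM   main_table
GROUP  BY column1, column2;

CREATE INDEX ON match_counts (column1, column2);

-- The count query now reads one pre-computed row instead of 700M rows.
SELECT row_count FROM match_counts
WHERE  column1 = 'TypeA' AND column2 = 'BlockC';

-- Refresh periodically (e.g. from cron) to pick up newly loaded data.
REFRESH MATERIALIZED VIEW match_counts;

On MySQL, or on Postgres before 9.3, you can approximate the same thing with a summary table kept up to date by triggers or a scheduled job, as noted above.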
If you want open source, I'd go Postgres (it doesn't choke on big queries like MySQL can), plus Mondrian, plus Pentaho.
If not open source, then best bang for buck is likely Microsoft SQL Server with Analysis Services.

A huge data storage problem

I'm starting to design a new application that will be used by about 50,000 devices. Each device generates about 1440 records a day (one per minute), which means over 72 million records will be stored per day. This data must be queryable from a Java (J2EE) application, so it needs to be fast to write, fast to read, and indexed to allow report generation.
Devices only insert data, and the J2EE application will need to read it occasionally.
Now I'm looking at software alternatives to support this kind of operation.
Putting all this data in a single table would be catastrophic, because I wouldn't be able to use it given the volume accumulated over a year.
I'm using Postgres, and database partitioning doesn't seem to be the answer, since I'd need to partition tables by month, or perhaps take an even more granular approach, by day for example.
I was thinking of a solution using SQLite: each device would have its own SQLite database, so the information would be granular enough for easy maintenance and fast inserts and queries.
What do you think?
Record only changes of device position - most of the time a device will not move: a car will be parked, a person will sit or sleep, a phone will be on an unmoving person or left charging, etc. This would give you an order of magnitude less data to store.
You'll be generating at most about 1TB a year (even without the first suggestion), which is not a very big amount of data - on average only a few tens of KB/s of writes, which a single SATA drive can handle.
Even a simple unpartitioned Postgres database on modest hardware should be able to handle this. The only problem could be when you need to query or back up - this can be resolved by using a hot standby mirror with streaming replication, a new feature in the soon-to-be-released PostgreSQL 9.0. Just query against / back up the mirror - if it is busy it will temporarily and automatically queue changes, and catch up later.
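For reference, a minimal sketch of the streaming-replication settings that refers to (PostgreSQL 9.0-era configuration; host names, user names, and addresses are hypothetical):

# On the primary, postgresql.conf:
wal_level = hot_standby
max_wal_senders = 3
wal_keep_segments = 32

# On the primary, pg_hba.conf: allow the standby to connect for replication.
host replication replicator 192.168.1.20/32 md5

# On the standby, postgresql.conf:
hot_standby = on

# On the standby, recovery.conf:
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com port=5432 user=replicator'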
When you really need to partition, do it for example on device_id modulo 256 instead of time. This way writes are spread out over every partition. If you partition on time, just one partition will be very busy at any moment while the others sit idle. Postgres supports partitioning this way very well. You can then also spread the load across several storage devices using tablespaces, which are also well supported in Postgres.
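In current PostgreSQL (11+) that modulo-style split can be written directly as declarative hash partitioning; in the 9.0-era setup the answer refers to, the same layout was built with table inheritance and CHECK constraints on device_id % 256. A minimal sketch with hypothetical table, column, and tablespace names (the tablespaces are assumed to exist already):

-- Rows are routed to partitions by a hash of device_id, so inserts are
-- spread evenly across all partitions instead of hammering one of them.
CREATE TABLE readings (
    device_id   integer     NOT NULL,
    recorded_at timestamptz NOT NULL,
    position    point
) PARTITION BY HASH (device_id);

-- Two of the 256 partitions, placed on different tablespaces (disks).
CREATE TABLE readings_p0 PARTITION OF readings
    FOR VALUES WITH (MODULUS 256, REMAINDER 0) TABLESPACE disk_a;
CREATE TABLE readings_p1 PARTITION OF readings
    FOR VALUES WITH (MODULUS 256, REMAINDER 1) TABLESPACE disk_b;
-- ... and so on up to REMAINDER 255.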
Time-interval partitioning is a very good solution, even if you have to roll your own. Maintaining separate connections to 50,000 SQLite databases is much less practical than a single Postgres database, even for millions of inserts a day.
Depending on the kind of queries that you need to run against your dataset, you might consider partitioning your remote devices across several servers, and then query those servers to write aggregate data to a backend server.
The key to high-volume tables is: minimize the amount of data you write and the number of indexes that have to be updated; don't do UPDATEs or DELETEs, only INSERTS (and use partitioning for data that you will delete in the future—DROP TABLE is much faster than DELETE FROM TABLE!).
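As a small illustration of that last point (PostgreSQL syntax, hypothetical names): dropping an expired partition is a near-instant metadata operation, whereas a bulk DELETE has to touch every row and every index.

-- Retiring January's data when the table is partitioned by month:
ALTER TABLE measurements DETACH PARTITION measurements_2024_01;
DROP TABLE measurements_2024_01;     -- effectively instant

-- Versus the unpartitioned alternative, which rewrites and re-indexes
-- a huge amount of data:
DELETE FROM measurements
WHERE recorded_at >= '2024-01-01' AND recorded_at < '2024-02-01';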
Table design and query optimization becomes very database-specific as you start to challenge the database engine. Consider hiring a Postgres expert to at least consult on your design.
Maybe it is time for a db that you can shard over many machines? Cassandra? Redis? Don't limit yourself to sql db's.
Database partition management can be automated; time-based partitioning of the data is a standard way of dealing with this type of problem, and I'm not sure I can see any reason why this can't be done with PostgreSQL.
You have approximately 72 million rows per day - assuming a device ID, a timestamp, and two floats for coordinates, you will have (say) 16-20 bytes per row plus some minor page metadata overhead. A back-of-the-envelope capacity plan suggests around 1-1.5GB of data per day, or 400-500GB per year, plus indexes if necessary.
If you can live with periodically refreshed data (i.e. not completely up to date) you could build a separate reporting table and periodically update this with an ETL process. If this table is stored on separate physical disk volumes it can be queried without significantly affecting the performance of your transactional data.
A separate reporting database for historical data would also allow you to prune your operational table by dropping older partitions, which would probably help with application performance. You could also index the reporting tables and create summary tables to optimise reporting performance.
If you need low latency data (i.e. reporting on up-to-date data), it may also be possible to build a view where the lead partitions are reported off the operational system and the historical data is reported from the data mart. This would allow the bulk queries to take place on reporting tables optimised for this, while relatively small volumes of current data can be read directly from the operational system.
Most low-latency reporting systems use some variation of this approach - a leading partition can be updated by a real-time process (perhaps triggers) and contains relatively little data, so it can be queried quickly, but contains no baggage that slows down the update. The rest of the historical data can be heavily indexed for reporting. Partitioning by date means that the system will automatically start populating the next partition, and a periodic process can move, re-index or do whatever needs to be done for the historical data to optimise it for reporting.
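A minimal sketch of that kind of combined view (hypothetical table names), where recent rows come from the small, write-optimised operational table and history comes from the heavily indexed reporting table:

CREATE VIEW device_activity AS
SELECT device_id, recorded_at, position
FROM   operational_current      -- small lead partition, updated in real time
UNION ALL
SELECT device_id, recorded_at, position
FROM   reporting_history;       -- large, heavily indexed, refreshed by ETL
-- Assumes the ETL guarantees the two tables never overlap in time.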
Note: If your budget runs to PostgreSQL rather than Oracle, you will probably find that direct-attach storage is appreciably faster than a SAN unless you want to spend a lot of money on SAN hardware.
That is a bit of a vague question you are asking. And I think you are not facing a choice of database software, but an architectural problem.
Some considerations:
How reliable are the devices, and how well are they connected to the querying software?
How failsafe do you need the storage to be?
How much extra processing power do the devices have to process your queries?
Basically, your idea of spatial partitioning is a good one. That does not exclude temporal partitioning as well, if necessary. Whether you do that in Postgres or SQLite depends on other factors, like the processing power and libraries available.
Another consideration would be whether your devices are reliable and powerful enough to handle your queries. Otherwise, you might want to work with a centralized cluster of databases instead, which you can still query in parallel.

RDBMS data-relation burden

Our in-house system is built on SQL Server 2008 with a 40-table 6NF schema. Most of the tables have foreign keys to 3 others; a key few have as many as 7. The system will ultimately support hundreds of employees working with tens of thousands of customers and store hundreds of thousands of transactional records - prime-time access should peak at around 1000 rows per second.
Is there any reason to think that this depth of RDBMS inter-relation would overburden a system built using modern hardware with ample RAM? I'm attempting to evaluate whether we need to adjust our design or project direction/goals before we approach the final development phase (in a couple of months).
In SQL Server terms, what you describe is a smallish database. With correct design, SQL Server can handle terabytes of data.
That is not a guarantee that your current design will perform well, though. There are many ways to write poorly performing T-SQL and many bad database design choices.
If I were you, I would load test data at twice the size you expect the tables to reach and then start testing your code. Load testing might also be a good idea. It is far easier to fix database performance problems before they go to production. Far, far easier!
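A minimal T-SQL sketch of stuffing one table with generated rows before such a test; the table and column names (dbo.Transactions, CustomerId, Amount, CreatedAt) are hypothetical, not from the question:

-- Generate 2,000,000 synthetic rows by cross-joining a system catalog
-- view with itself to get a large row source, then numbering the rows.
;WITH n AS (
    SELECT TOP (2000000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
    FROM sys.all_columns AS a
    CROSS JOIN sys.all_columns AS b
)
INSERT INTO dbo.Transactions (CustomerId, Amount, CreatedAt)
SELECT i % 50000,                                    -- spread across fake customers
       CAST(RAND(CHECKSUM(NEWID())) * 100 AS decimal(10,2)),
       DATEADD(SECOND, -CAST(i AS int), GETDATE())   -- spread over time
FROM n;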
