How to take an HBase table snapshot? - export

I am new to HBase and I was looking for an HBase backup and restore solution. Can you please explain how to take a snapshot of HBase or an HBase table and restore it as part of a recovery solution?
Thanks in advance!!!

Log in to the HBase shell:
snapshot 'table_name','snapshot_name'
To check whether the snapshot was created, type list_snapshots from the HBase shell.

There is not currently a way to do this in HBase directly (though there is work going on for this even as we speak: see HBASE-6055, which is targeted for 0.96, meaning late 2012 or early 2013).
In the meantime, if you use another underlying file system besides HDFS, some of them (like MapR, a closed-source commercial product) offer a snapshot feature (for a price).

Creating a snapshot of a table is as simple as running this command from the HBase shell:
hbase(main)> snapshot 'myTable', 'MySnapShot'
Restoring is as simple as issuing these commands from the shell:
hbase(main)> disable 'myTable'
hbase(main)> restore_snapshot 'MySnapShot'
hbase(main)> enable 'myTable'
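Since the title also asks about exporting: once a snapshot exists, it can be copied to another cluster with the ExportSnapshot MapReduce tool. A minimal sketch from the command line, assuming a destination cluster reachable at backup-cluster (the hostname, HBase root path, and mapper count are placeholders):
# Run on the source cluster; copies the snapshot metadata and HFiles to the other cluster
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot MySnapShot \
  -copy-to hdfs://backup-cluster:8020/hbase \
  -mappers 4
The exported snapshot should then show up in list_snapshots on the destination cluster and can be restored there with restore_snapshot as above.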


Should I still VACUUM a PostgreSQL database if I am truncating and refilling the tables instead of deleting/inserting/upserting?

The production database at my company is running significantly slower than the test database (local ~5ms, test ~18ms, production ~1-2 sec). We've been trying to look into why and will be doing some EXPLAIN ANALYZE on key queries on a secure shell psql instance in our cloud.
I've been trying to read up on database optimization and have come across PostgreSQL's VACUUM, and I am wondering if running it might help. We don't update the production database often -- once each release -- though the migrations involve dropping or truncating tables as necessary and then importing new data. I'm curious whether VACUUM would be potentially helpful here? If it would be, would we be seeing similar slowdowns in a spiped instance of our test database?
VACUUM could help if there are a lot of DELETEs and UPDATEs in the database.
See https://dba.stackexchange.com/questions/36984/how-to-determine-if-a-postgres-database-needs-to-be-vaccumed
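To see whether VACUUM is even relevant, you can check how many dead row versions your tables are carrying. A quick sketch from the shell (the database name is a placeholder; pg_stat_user_tables is a standard statistics view):
# Tables with the most dead tuples; heavy DELETE/UPDATE churn shows up here
psql -d mydb -c "SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
                 FROM pg_stat_user_tables
                 ORDER BY n_dead_tup DESC
                 LIMIT 20;"
# If bloat looks significant, a manual pass is safe to try; ANALYZE also refreshes
# planner statistics, which is often the bigger win for slow queries
psql -d mydb -c "VACUUM (ANALYZE, VERBOSE);"
In a truncate-and-reload workflow like yours, running ANALYZE after each import is often the more likely fix, since TRUNCATE itself leaves no dead tuples behind.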

AWS RDS Backups

So I recently started using AWS and Elastic Beanstalk with RDS.
I wonder what the best practices are for creating database backups.
So far my setup is this:
Enable automatic backups.
A bash script that creates manual snapshots every day and removes manual snapshots older than 8 days.
A bash script that creates a SQL dump of the database and uploads it to S3.
The reason I am creating the manual snapshots is that if I were to delete the database by mistake, I would still have the snapshots.
The bash scripts are on an EC2 instance launched with an IAM role that is allowed to execute these scripts.
Am I on the right track here?
I really appreciate answers, thanks.
A bit of context...
That automated backups are not saved after DB deletion is a very important technical gotcha. I've seen it catch devs on the team unawares, so thanks for bringing this up.
After the DB instance is deleted, RDS retains this final DB snapshot and all other manual DB snapshots indefinitely. However, all automated backups are deleted and cannot be recovered when you delete a DB instance. source
I suspect for most folks, the final snapshot is sufficient.
Onto the question at hand...
Yes. 110%. Absolutely.
I wouldn't create manual snapshots; rather, copy the automated ones.
Option 1: You already have the automated snapshots available. Why not just copy the automated snapshot (less unnecessary DB load; though, admittedly, less of an issue if you're multi-AZ since you'll be snapshotting from the replica), which creates a manual snapshot? I'd automate this using the AWS SDK and a cron job; see the sketch after these options.
Option 2: Requires manual adherence. Simply copy your automated snapshots (to create manual snapshots) before terminating a DB.
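A rough sketch of Option 1 using the AWS CLI rather than the SDK (the instance identifier, naming scheme, and the choice of cron are all assumptions):
#!/bin/bash
# Hypothetical nightly cron job: copy the newest automated RDS snapshot to a manual one
set -euo pipefail
INSTANCE="mydb"
DATE=$(date +%Y-%m-%d)
# Find the newest automated snapshot for the instance
SOURCE=$(aws rds describe-db-snapshots \
  --db-instance-identifier "$INSTANCE" \
  --snapshot-type automated \
  --query 'reverse(sort_by(DBSnapshots, &SnapshotCreateTime))[0].DBSnapshotIdentifier' \
  --output text)
# Copying an automated snapshot produces a manual snapshot, which survives instance deletion
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier "$SOURCE" \
  --target-db-snapshot-identifier "${INSTANCE}-manual-${DATE}"
Copies made this way are manual snapshots, so they are only removed when you delete them; your existing 8-day cleanup script can prune them the same way.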
Unclear why you'd need the S3 dump if you have the snapshots.
For schema changes?: If you're doing it for schema changes, these should really be handled with migrations (we use knex.js, but pick your poison). If that's a bridge too far, remember that there's an option for schema-only dumps (pg_dump --schema-only). Much more manageable.
Getting at the data?: Your snapshots are already on S3 (see the FAQ). You can always load a snapshot and SQL-dump it if you choose. I don't see an immediately obvious reason for purposely duplicating your data.
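If you do keep a dump around for schema tracking, a schema-only dump is small enough to commit alongside your migrations. A one-liner sketch with a hypothetical RDS endpoint and user:
pg_dump --schema-only -h mydb.example.rds.amazonaws.com -U myuser -d mydb > schema.sql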

Copy database between two PostgreSQL servers

Is there some tool to copy a database from one PostgreSQL server to another on the fly, NOT INVOLVING BACKUPS/RESTORES? A tool which automatically keeps the database structure on the slave server in sync with the master server, ideally with a differential mode that looks at records' primary keys.
I could use replication, but the problem is that it ties the two servers together in a permanent manner, and I do not need continuous replication. I need to start it manually, and it should terminate when it finishes.
I had started to write my own .NET tool using reflection etc., but thought that maybe somebody has already written such a tool.
Replication is the term you are looking for.
There are many variations on how to do this. Start by reading the manual and then google a little.
If the whole-system replication built in to recent versions of PostgreSQL isn't to your taste, then try searching for "slony", "pgpool", or "bucardo" (among others).
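If a one-off copy is acceptable and the objection is only to intermediate dump files, note that pg_dump output can be streamed straight into the target server. This is still a logical dump/restore under the hood, so it may not satisfy the "no backups" constraint; hostnames and the database name below are placeholders:
# Create an empty database on the target, then stream the copy across with no file on disk
createdb -h target-host mydb
pg_dump -h source-host mydb | psql -h target-host -d mydb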

How to backup a Solr database?

I wonder how to back up (dump) a Solr database?
If it is only a matter of copying some files, then please specify which files (filename, location, etc.).
Thanks
We use Solr Replication to do our backup.
You can either have a slave that is dedicated to being a backup, or use the "backup" command to make a backup on the master (I have never used that last method).
Typically, the index is stored in $SOLR_HOME/data.
Back up that entire folder.
In Solr 8/9, backup and restore are available via the replication handler.
It creates a snapshot of the data, which you can also restore later.
You can find more useful information on this page of the Solr documentation:
https://solr.apache.org/guide/8_9/making-and-restoring-backups.html#standalone-mode-backups
So this can be used with the newer 8/9 versions if someone is looking for it.
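A sketch of the replication-handler approach over HTTP for a standalone Solr node (the host, core name, backup name, and location are placeholders):
# Trigger a named backup of the index
curl "http://localhost:8983/solr/mycore/replication?command=backup&name=nightly&location=/var/backups/solr"
# Poll the handler to see when the backup has completed
curl "http://localhost:8983/solr/mycore/replication?command=details"
# Later, restore that backup into the core
curl "http://localhost:8983/solr/mycore/replication?command=restore&name=nightly"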

What is Multiversion Concurrency Control (MVCC) and who supports it? [closed]

Recently Jeff posted regarding his trouble with database deadlocks related to reading. Multiversion Concurrency Control (MVCC) claims to solve this problem. What is it, and what databases support it?
Updated: these support it (which others?)
Oracle
PostgreSQL
Oracle has had an excellent multi-version control system in place for a very long time (at least since Oracle 8.0).
The following should help.
At time T1, user A starts a transaction and begins updating 1000 rows with some value.
At time T2, user B reads the same 1000 rows.
User A updates row 543 with value Y (original value X).
User B reaches row 543 and finds that a transaction has been in operation since time T1.
The database returns the unmodified record from the undo (rollback) segments. The returned value is the one that was committed at a time less than or equal to T2.
If the record cannot be reconstructed from the undo segments, it means the database is not set up appropriately: more space needs to be allocated to them (this is the classic "snapshot too old" error).
This way read consistency is achieved: within a transaction, the returned results are always the same with respect to the transaction's start time.
I have tried to explain this in the simplest terms possible... there is a lot more to multiversioning in databases.
See PostgreSQL's documentation on Multi-Version Concurrency Control, as well as this article, which features diagrams of how MVCC works when issuing INSERT, UPDATE, and DELETE statements.
The following have an implementation of MVCC:
SQL Server 2005 (Non-default, SET READ_COMMITTED_SNAPSHOT ON)
http://msdn.microsoft.com/en-us/library/ms345124.aspx
Oracle (since version 8)
MySQL 5 (only with InnoDB tables)
PostgreSQL
Firebird
Informix
I'm pretty sure Sybase and IBM DB2 Mainframe/LUW do not have an implementation of MVCC
Firebird does it; they call it MGA (Multi-Generational Architecture).
They keep the original version intact and add a new version that only the session using it can see. When committed, the older version is disabled and the newer version is enabled for everybody (the file piles up with data and needs regular cleanup).
Oracle overwrites the data itself and uses rollback segments/undo tablespaces to serve other sessions and to roll back.
XtremeData dbX supports MVCC.
In addition, dbX can make use of SQL primitives implemented in FPGA hardware.
SAP HANA also uses MVCC.
SAP HANA is a full in-memory computing system, so the MVCC cost for selects is very low... :)
Here is a link to the PostgreSQL doc page on MVCC. The choice quote (emphasis mine):
The main advantage to using the MVCC model of concurrency control rather than locking is that in MVCC locks acquired for querying (reading) data do not conflict with locks acquired for writing data, and so reading never blocks writing and writing never blocks reading.
This is why Jeff was so confounded by his deadlocks. A read should never be able to cause them.
SQL Server 2005 and up offer MVCC as an option; it isn't the default, however. MS calls it snapshot isolation, if memory serves.
MVCC can also be implemented manually, by adding a version number column to your tables, and always doing inserts instead of updates.
The cost of this is a much larger database, and slower selects since each one needs a subquery to find the latest record.
It's an excellent solution for systems that require 100% auditing for all changes.
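A minimal sketch of that hand-rolled approach, shown here in PostgreSQL via psql (the table and column names are invented for illustration):
psql -d mydb <<'SQL'
CREATE TABLE accounts (
    id      integer NOT NULL,
    version integer NOT NULL,
    balance numeric NOT NULL,
    PRIMARY KEY (id, version)
);

-- An "update" is really an insert of a new version; old rows are never touched
INSERT INTO accounts (id, version, balance) VALUES (1, 1, 100.00);
INSERT INTO accounts (id, version, balance) VALUES (1, 2, 80.00);

-- Every read pays for a subquery that finds the newest version of each row
SELECT a.*
FROM accounts a
WHERE a.version = (SELECT max(version) FROM accounts b WHERE b.id = a.id);
SQL
Because every "update" adds a row instead of changing one, the full history is preserved, which is what makes this attractive for 100% auditing.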
MySQL also uses MVCC by default if you use InnoDB tables:
http://dev.mysql.com/doc/refman/5.0/en/innodb-multi-versioning.html
McObject announced in 11/09 that it has added an optional MVCC transaction manager to its eXtremeDB embedded database:
http://www.mcobject.com/november9/2009
eXtremeDB, originally developed as an in-memory database system (IMDS), is now available in editions with hybrid (in-memory/on-disk) storage, High Availability, 64-bit support and more.
There's a good explanation of MVCC -- with diagrams -- and some performance numbers for eXtremeDB in this article, written by McObject's co-founder and CEO, in RTC Magazine:
http://www.rtcmagazine.com/articles/view/101612
Clearly MVCC is increasingly beneficial as an application scales to include many tasks executing on multiple CPU cores.
DB2 version 9.7 has a licensed version of Postgres Plus in it. This means that DB2 (in the right mode) supports this feature.
Berkeley DB also supports MVCC.
And when the BDB storage engine is used in MySQL, MySQL also supports MVCC.
Berkeley DB is a very powerful, customizable, fully ACID-compliant DBMS. It supports several different methods of indexing and master-slave replication, and it can be used as a pure key-value store with its own dynamic API or queried with SQL if wanted. Worth taking a look at.
Another document-oriented DBMS embracing MVCC is CouchDB. MVCC there is also a big plus for the built-in peer-to-peer replication.
From http://vschart.com/list/multiversion-concurrency-control/
Couchbase,
OrientDB,
CouchDB,
PostgreSQL,
Project Voldemort,
BigTable,
Percona Server,
HyperGraphDB,
Drizzle,
Cloudant,
IBM DB2,
InterSystems Caché,
InterBase
