Understanding Background merges in ClickHouse

I wanted to understand background merges better. I understand that background merges keep happening until a part reaches 150GB, as defined by max_bytes_to_merge_at_max_space_in_pool.
I would like to understand the following.
Are all parts ordered by the ORDER BY key? For example, if ORDER BY is on a date column and new inserts contain older dates, will they eventually be merged into the corresponding older part that holds matching dates (even if that older part is already 150GB and needs no merging), or is ordering not maintained across parts, so the new insert can instead end up merged into a different 150GB part?
In table engines like ReplacingMergeTree, if the duplicate entries to be removed sit in large parts that are already 150GB in size, do all of those parts get reprocessed (given that deduplication only occurs during merges, and parts that have reached 150GB no longer need merging)?
I am trying to understand all of this because I would like ClickHouse to make no (or minimal) changes to parts once they reach the size defined by max_bytes_to_merge_at_max_space_in_pool. The reason is that as our data grows we would like to use slower, larger disks and possibly S3 for older data. I understand I can also create multiple volumes, have ClickHouse move parts to such a volume once they reach 150GB using max_data_part_size_bytes, and set prefer_not_to_merge on that volume. What would be the disadvantages of doing this?
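For context, this is roughly how I picture the table side of that setup (the table and the 'tiered' policy name are just placeholders; the volumes themselves, with max_data_part_size_bytes on the hot volume and prefer_not_to_merge on the cold one, would be declared in the server's storage_configuration):

    CREATE TABLE events
    (
        event_date Date,
        id         UInt64,
        payload    String
    )
    ENGINE = ReplacingMergeTree
    ORDER BY (event_date, id)
    SETTINGS
        storage_policy = 'tiered',
        -- explicit 150GB cap on merged parts (this is also the default value)
        max_bytes_to_merge_at_max_space_in_pool = 161061273600;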

Related

Deleting Huge Data In Cassandra Cluster

I have a Cassandra cluster with three nodes. We have close to 7 TB of data from the last 4 years. Because of limited space available on the servers, we would like to keep only the last 2 years of data. But we don't want to delete the older data completely; we want to keep specific records even if they are older than 2 years.
Currently I can think of one approach:
1) Java client using "MutationBatch object". I can get all the records key which fall into date range and excluding rows which we don't want to delete. Then deleting records in a batch. But this solution raises concern over performance as data is huge.
Is it possible to handle it at the server level(opscenter). I read about TTL but how can I apply it to an existing data and also restrict some of the data which I want to keep even if it is older than 2 years.
Please help me in finding out the best solution.
The main thing that you need to understand is that when you remove data in Cassandra, you're actually adding data by writing a tombstone; the deletion of the actual data happens later, during compaction.
So it's very important to perform deletion correctly. There are different types of deletes: individual cells, rows, ranges, and partitions (from least effective to most effective by the number of tombstones generated). The best option for you is to remove by partition; the second best is to remove by ranges inside a partition. The following article describes how data is removed in great detail.
You may need to perform the deletion in several steps so that you don't add too many tombstones at once. You also need to check that you have enough disk space for compaction.
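For illustration, with a hypothetical table whose primary key is (sensor_id, event_day), a whole-partition delete and a range delete inside a partition look like this (names are made up):

    -- one tombstone covering the whole partition (most efficient)
    DELETE FROM metrics.readings WHERE sensor_id = 42;

    -- one range tombstone covering part of a partition (second best)
    DELETE FROM metrics.readings
     WHERE sensor_id = 42
       AND event_day < '2021-01-01';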

Visual Studio/IIS changes the order of Access file's table data

I encountered a weird problem when trying to use an Access database file in my project.
I have a table named 'CarGears'. When I add some rows to it, Visual Studio shows them at the start of the table instead of putting them at the bottom.
I am including pics of the problem
Any solution for it?
Thanks
Access is in fact sorting the table by the primary key, and thus it IS sorting the data every time.
Most modern database systems consider all data un-ordered, and you cannot assume any order of the data. The data is treated as an un-ordered bucket. You can't assume order since 5 users might be adding data, and that order can't be known, since records can be added or removed by other users.
If you are using punched-card computers, or reading a text file, you can assume order, but a database is not paper, not punched cards, and the order of data can NEVER be assumed.
Bottom line:
If you want ordered data, then your query has to sort the data; any other assumption is NOT valid.
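For example, if you want the rows back in primary key order, ask for it explicitly (assuming the primary key column is called ID):

    SELECT * FROM CarGears ORDER BY ID;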
Access can sort 100,000 records with an index in about 1/1000th of a second or less. The performance of such sorts is a non-issue, and even if it weren't, order still cannot be assumed with a modern database system. This lesson and rule is usually covered on the first day of any computer class that involves database theory.
You are doomed in the use of computers and data processing if you ignore the above simple fact of computers and how database systems work.

detecting when data has changed

Ok, so the story is like this:
-- I have lots of files (pretty big, around 25GB) that are in a particular format and need to be imported into a datastore
-- these files are continuously updated with data, sometimes new, sometimes the same data
-- I am trying to figure out an algorithm for detecting whether something has changed for a particular line in a file, in order to minimize the time spent updating the database
-- the way it currently works is that I drop all the data in the database each time and then reimport it, but this won't work anymore since I'll need a timestamp for when an item has changed
-- the files contain strings and numbers (titles, orders, prices etc.)
The only solutions I could think of are:
-- compute a hash for each row in the database, which is compared against the hash of the corresponding row from the file; if they differ, update the database
-- keep 2 copies of the files, the previous one and the current one, make diffs between them (which is probably faster than updating the db), and update the db based on those diffs
Since the amount of data is very big, I am kind of out of options for now. In the long run, I'll get rid of the files and data will be pushed straight into the database, but the problem still remains.
Any advice will be appreciated.
Problem definition as understood.
Let’s say your file contains
ID,Name,Age
1,Jim,20
2,Tim,30
3,Kim,40
As you stated, rows can be added or updated, so the file becomes
ID,Name,Age
1,Jim,20 -- to be discarded
2,Tim,35 -- to be updated
3,Kim,40 -- to be discarded
4,Zim,30 -- to be inserted
Now the requirement is to update the database by inserting/updating only the above 2 records, either as two SQL queries or as one batch containing two SQL statements.
I am making the following assumptions here:
You cannot modify the existing process that creates the files.
You are using some batch processing [read from file, process in memory, write to DB] to upload the data into the database.
Store the hash value of the record [Name, Age] against the ID in an in-memory map, where the ID is the key and the value is the hash [if you require scalability, use Hazelcast].
Your batch framework, when loading the data [again assuming it treats one line of the file as one record], needs to check the computed hash value against the ID in the in-memory map. The first-time population of the map can also be done by your batch framework while reading the files.
If the ID is present:
--- compare the hashes
--- if they are the same, discard the row
--- if they differ, create an UPDATE SQL statement
If the ID is not present in the in-memory map, create an INSERT SQL statement and store the hash value.
You might go for parallel processing, chunk processing, and in-memory data partitioning using Spring Batch and Hazelcast.
http://www.hazelcast.com/
http://static.springframework.org/spring-batch/
Hope this helps.
Instead of computing the hash for each row from the database on demand, why don't you store the hash value instead?
Then you could just compute the hash values for the rows of the file in question and compare them against the ones stored in the database.
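As a rough sketch, assuming the file has first been bulk-loaded into a staging table and the main table carries a stored row_hash column (the table and column names, and the MySQL-style MD5/CONCAT_WS functions, are just for illustration):

    -- rows whose content hash differs from the stored one -> UPDATE candidates
    SELECT s.id
      FROM staging s
      JOIN items i ON i.id = s.id
     WHERE i.row_hash <> MD5(CONCAT_WS('|', s.title, s.orders, s.price));

    -- rows present in the file but not in the table -> INSERT candidates
    SELECT s.id
      FROM staging s
      LEFT JOIN items i ON i.id = s.id
     WHERE i.id IS NULL;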
Update:
Another option that came to my mind is to store the Last Modified date/time information on the database and then compare it against that of the file in question. This should work, provided the information cannot be changed either intentionally or by accident.
Well, regardless of what you use, your worst case is going to be O(n), which for n ~ 25GB of data is not so pretty.
Unless you can modify the process that writes the files.
Since you are not updating all of the 25GB all of the time, that is your biggest potential for saving cycles.
1. Don't write randomly
Why don't you make the process that writes the data append only? This way you'll have more data, but you'll have full history and you can track which data you already processed (what you already put in the datastore).
2. Keep a list of changes if you must write randomly
Alternatively, if you really must do the random writes, you could keep a list of updated rows. This list can then be processed as in #1, and you can track which changes you have processed. If you want to save some space, you can keep a list of blocks in which the data changed (where a block is a unit that you define).
Furthermore you can keep checksums/hashes of changed block/lines. However this might not be very interesting - it is not so cheap to compute and direct comparison might be cheaper (if you have free CPU cycles during writing it might save you some reading time later, YMMV).
Note(s)
Both #1 and #2 are interesting only if you can make adjustments to the process that writes the data to disk.
If you cannot modify the process that writes the 25GB of data, then I don't see how checksums/hashes can help: you have to read all the data anyway to compute the hashes (since you don't know what changed), so you can directly compare while you read and come up with a list of rows to update/add (or update/add directly).
Using diff algorithms might be suboptimal: a diff algorithm will not only look for the lines that changed but also check for the minimal edit distance between two text files given certain formatting options. (In diff, this can be controlled with -H or --minimal to work slower or faster, i.e. to search for the exact minimal solution or to use a heuristic algorithm, for which, if I recall correctly, the complexity becomes O(n log n); that is not bad, but still slower than the O(n) that is available to you if you do a direct line-by-line comparison.)
Practically, this is the kind of problem that backup software has to solve, so why not use some of their standard solutions?
The best one would be to hook the WriteFile calls so that you receive a callback on each update. This would work pretty well with binary records.
Something that I cannot understand: are the files actually text files that are not just appended to, but updated in place? That is highly inefficient (as is the idea of keeping 2 copies of the files, because it will make file caching work even worse).

database archiving vs time-period based tables/fields

I am working on an employee objectives web application.
A lead/manager sets objectives for team members after discussing them with each person. This happens yearly/half-yearly/quarterly depending on the appraisal cycle the organization follows.
Now the question is: is it a better approach to add time-period-based fields, or to archive the previous quarter's/year's data? When a user wants to see previous objectives (not a frequent activity), the archive that belongs to that date could be restored into some temp table and shown to the employee.
Points to start with
archiving: reduces db size, results in simpler db queries, adds overhead when someone tries to see old data.
time-period based fields/tables: one or more extra joins in queries, but previous data is treated the same as current data, so there is no overhead in retrieving old data.
PS: it is not about storage cost; my point is whether we can achieve some optimization in terms of performance, as this is a web app and at peak times all the employees in the organization will be viewing/updating it. Also, removing the time period makes my queries a lot simpler.
Thanks
Assuming you're talking about data that changes over time, as opposed to logging-type data, my preferred approach is to keep only the "latest" version of the data in your primary table(s), and to automatically copy the previous version of the data into an archive table. This archive table would mirror the primary one, with the addition of versioning fields, such as timestamps. The archiving can be done with a trigger.
The main benefit that I see with this approach is that it doesn't compromise your database design. In particular, you don't have to worry about using composite keys that incorporate the version fields (in fact using time-based fields as keys may not even be permitted by your database).
If you need to go and look at the old data, you can run a select against the archive table and add version constraints to the query.
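A minimal sketch of that archiving trigger, in PostgreSQL syntax against a hypothetical objectives table (names and columns are illustrative):

    CREATE TABLE objectives_archive (LIKE objectives);
    ALTER TABLE objectives_archive
        ADD COLUMN archived_at timestamptz NOT NULL DEFAULT now();

    CREATE FUNCTION archive_objective() RETURNS trigger AS $$
    BEGIN
        -- copy the previous version of the row into the archive table;
        -- archived_at is filled in by its default
        INSERT INTO objectives_archive SELECT OLD.*;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER objectives_archive_trg
        BEFORE UPDATE ON objectives
        FOR EACH ROW EXECUTE FUNCTION archive_objective();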
I would start off adding your time period fields and waiting until size becomes an issue. The kind of data you are describing does not sound like it is going to consume a lot of storage space.
Should it grow uncontrollably you can always look at the archive approach later - but the coding is going to take much longer than simply storing the relevant period with your data.
It seems to me that if you have the requirement that a user can look arbitrarily far back in the past, then you really must keep the data accessible.
This just won't be sustainable:
the archive that belongs to that date could be restored into some temp table and shown to the employee.
My recommendation would be to periodically (read: only when absolutely necessary) move 'very old' data to another table for this purpose. Disk space is extremely cheap at this point, so keeping that data around is not nearly as expensive as implementing a system that can go back to an arbitrary time and restore an archive.

How do I model data that slowly changes over time?

Let's say I'm getting a large (2 million rows?) amount of data that's supposed to be static and unchanging. Supposed to be. And this data gets republished monthly. What methods are available to 1) be aware of what data points have changed from month to month and 2) consume the data given a point in time?
Solution 1) Naively save every snapshot of data, annotated by date. Diff awareness is handled by some in-house program, but consumption of the data by date is trivial. Cons, space requirements balloon by an order of magnitude.
Solution 2A) Using an in-house program, track when the diffs happen and store them in an EAV table, annotated by date. Space requirements are low, but consumption integrated with the original data becomes unwieldy.
Solution 2B) Using an in-house program, track when the diffs happen and store them in a sparsely filled table that looks much like the original table, filled only with the data that's changed and the date when changed. Cons, model is sparse and consumption integrated with the original data is non-trivial.
I guess, basically, how do I integrate the dimension of time into a relational database, keeping in mind both the viewing of the data and awareness of differences between time periods?
Does this relate to data warehousing at all?
Smells like... Slowly changing dimension?
I had a similar problem - big flat files imported to the database once per day. Most of the data is unchanging.
Add two extra columns to the table, starting_date and ending_date. The default value for ending_date should be sometime in the future.
To compare one file to the next, sort them both by the key columns, then read one row from each file.
If the keys are equal: compare the rest of the columns to see if the data has changed. If the row data is equal, the row is already in the database and there's nothing to do; if it's different, update the existing row in the database with an ending_date of today and insert a new row with a starting_date of today. Read a new row from both files.
If the key from the old file is smaller: the row was deleted. Update ending_date to today. Read a new row from the old file.
If the key from the new file is smaller: a row was inserted. Insert the row into the database with a starting_date of today. Read a new row from the new file.
Repeat until you've read everything from both files.
Now, to query for the rows that were valid at any given date, just select with a WHERE clause: test_date BETWEEN starting_date AND ending_date.
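Concretely, with a hypothetical table (names are illustrative, and :test_date is a bind parameter), the versioned layout and the point-in-time query look like this:

    -- current rows carry a far-future ending_date
    CREATE TABLE people_history (
        id            INT          NOT NULL,
        name          VARCHAR(100) NOT NULL,
        age           INT          NOT NULL,
        starting_date DATE         NOT NULL,
        ending_date   DATE         NOT NULL DEFAULT '9999-12-31',
        PRIMARY KEY (id, starting_date)
    );

    -- "what did the data look like on :test_date?"
    SELECT id, name, age
      FROM people_history
     WHERE :test_date BETWEEN starting_date AND ending_date;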
You could also take a leaf from the data warehousing book. There are basically three ways of dealing with changing data.
Have a look at this Wikipedia article on slowly changing dimensions (SCDs); in essence it comes down to how you structure the tables:
http://en.wikipedia.org/wiki/Slowly_changing_dimension
A lot of this depends on how you're storing the data. There are two factors to consider:
How often does the data change?
How much does the data change?
The distinction is important. If it changes often but not much then annotated snapshots are going to be extremely inefficient. If it changes infrequently but a lot then they're a better solution.
It also depends on if you need to see what the data looked like at a specific point in time.
If you're using Oracle, for example, you can use flashback queries to see a consistent view of the data at some arbitrary point.
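An Oracle flashback query against a hypothetical table looks like this (how far back you can go is limited by your undo retention settings):

    -- the table as it looked one day ago
    SELECT *
      FROM orders AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '1' DAY);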
Personally I think you're better off storing it incrementally and, at a minimum, using some form of auditing to track changes so you can recover an historic snapshot if it's ever required. But like I said, this depends on many factors.
If it were me, I'd save the whole thing every month (not necessarily in a database, but as a data file or text file off-line) - you will be glad you did. Even at a row size of 4096 bytes (wild-ass guess), you are only talking about 8G of disk per month. You can save a LOT of months on a 300G drive. I did something similar for years, when I was getting over 1G per day in downloads to a data warehouse.
This sounds to me rather like the problem faced by source code version control systems. These store patches that describe the changes as they occur, so if a file does not change, or only a few lines change, the patch that needs to be stored is relatively small. The system also records which version each patch contributes to. So, when viewing a particular version of a particular file, the initial version is recovered and all the patches up to the requested version are applied.
In your very general situation, you need to divide up your data into chunks. Hopefully there are natural divisions you can use, but if this division has to be arbitrary, that should be OK. Whenever a change occurs, store the patch for the affected chunk and record a new version. Then, when you want to view a particular date, find the last version that predates the view date, apply the patches for the chunk that has been requested, and display.
Could you do the following:
1. Each month BCP all data into a temporary table
2. Run a script or stored procedure to update the primary table (which has an additional DateTime column as part of a composite key) with any changes made.
3. Repeat each month.
This should give you a table that you can query for data as of a particular date.
In addition each change will be logged, and the size of the table shouldn't change dramatically over time.
However, as a backup to this, I would store each data file as Brennan suggests.
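A rough T-SQL sketch of step 2, assuming the monthly file has been BCP'd into a staging_load table and the primary table is keyed on (id, load_date); all names are illustrative:

    DECLARE @load_date datetime = GETDATE();

    -- add a new dated version for every row that is new
    -- or differs from its latest stored version
    INSERT INTO primary_table (id, name, price, load_date)
    SELECT s.id, s.name, s.price, @load_date
      FROM staging_load AS s
      LEFT JOIN primary_table AS p
             ON p.id = s.id
            AND p.load_date = (SELECT MAX(load_date)
                                 FROM primary_table
                                WHERE id = s.id)
     WHERE p.id IS NULL                              -- brand-new row
        OR p.name <> s.name OR p.price <> s.price;   -- changed row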

Resources