ClickHouse-client insert optimization

I'm inserting a lot of CSV data files into a remote ClickHouse database that already holds a lot of data. I'm doing it with a simple script like this:
...
for j in *.csv; do
clickhouse-client --max_insert_threads=32 --receive_timeout=30000 --input_format_allow_errors_num=999999999 --host "..." --database "..." --port 9000 --user "..." --password "..." --query "INSERT INTO ... FORMAT CSV" < "$j"
done
...
So my question is: how can I optimize these inserts? I have already used these options for optimization:
--max_insert_threads=32 --receive_timeout=30000
Are there any more options in clickhouse-client I should use for better performance, and what are they for? One file can be around 300-500 MB (and sometimes more). According to this article, using parallel processes won't help, which is why I'm inserting one file at a time.

max_insert_threads is not applicable here; it only affects INSERT ... SELECT statements executed inside the ClickHouse server.
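For contrast, here is a minimal sketch of the kind of statement it does affect, a server-side INSERT ... SELECT (the table names are hypothetical):
# hypothetical example: this parallelizes the insert side of an INSERT ... SELECT on the server,
# which is different from streaming CSV through the client as above
clickhouse-client --host "..." --max_insert_threads=8 --query "INSERT INTO target_table SELECT * FROM source_table"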
"According to this article using parallel processes won't help"
It should help (how much depends on CPU and disk throughput); just try it:
# parallelism 6 (-P6)
find . -type f -name '*.csv' | xargs -P 6 -n 1 clickhouse-client --input_format_parallel_parsing=0 --receive_timeout=30000 --input_format_allow_errors_num=999999999 --host "..." --database "..." --port 9000 --user "..." --password "..." --query "INSERT INTO ... FORMAT CSV"
I set input_format_parallel_parsing=0 deliberately; disabling it improves total throughput when multiple loads run in parallel.

Related

Is there any way to specify clickhouse-client timeout?

I'm inserting a lot of CSVs into a ClickHouse database. Sometimes it gets stuck on one of the files, or something goes wrong with the remote server I'm inserting into, so it waits the default amount of time and then outputs: Code: 209. DB::NetException: Timeout exceeded while reading from socket (ip, 300000 ms): while receiving packet from ip:9000: (in query: ...). (SOCKET_TIMEOUT)
Is there any way to specify this timeout so I don't need to wait for 5 minutes? I'm inserting with a script like this:
clickhouse-client --host "Host" --database "db" --port 9000 --user "User" --password "Password" --query "INSERT INTO table FORMAT CSV" < "file.csv"
You can try --receive_timeout.
There are a bunch of timeout options available in clickhouse-client. Check this command:
clickhouse-client --help | grep timeout
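For example, to fail faster than the 5-minute default you can lower the receive timeout per invocation (the value is in seconds; 60 here is only an illustration):
clickhouse-client --receive_timeout=60 --host "Host" --database "db" --port 9000 --user "User" --password "Password" --query "INSERT INTO table FORMAT CSV" < "file.csv"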

What is the correct way to export/import data using cassandra-loader/cassandra-unloader for YugaByte DB on a table with JSONB column(s)?

I tried to follow the steps described here: https://docs.yugabyte.com/v1.1/manage/data-migration/cassandra/bulk-export/
wget https://github.com/YugaByte/cassandra-loader/releases/download/v0.0.27-yb-2/cassandra-loader
wget https://github.com/YugaByte/cassandra-loader/releases/download/v0.0.27-yb-2/cassandra-unloader
chmod a+x cassandra-unloader
chmod a+x cassandra-loader
Since the above tools are JVM-based, I installed OpenJDK:
sudo yum install java-1.8.0-openjdk
Then I exported the rows using:
% cd /home/yugabyte/entity
% ./cassandra-unloader -schema "my_ksp.my_table(id,type,details)" -host <tserver-ip> -f export.csv -numThreads 3
Total rows retrieved: 10000
Here details is a JSONB column. Next, I create a new table my_table_new in the same cluster and try to load this data into it:
./cassandra-loader -schema "my_ksp.my_table_new(id,type,details)" -host <tserver-ip> -f /home/yugabyte/entity -numThreads 3 -progressRate 200000 -numFutures 256 -rate 5000 -queryTimeout 65
But I get errors of the form:
Row has different number of fields (12) than expected (3)
It looks like the default delimiter "," in the CSV file is causing the issue, since the JSONB data in the CSV file also contains commas.
As an alternative, I tried passing -delim "\t" to cassandra-unloader, but that seems to insert the two characters "\" and "t" rather than a single tab character. Is that expected?
You are correct that, with cassandra-unloader/cassandra-loader, the default delimiter (",") doesn't work in the presence of YCQL JSONB columns in Yugabyte DB.
Regarding:
<< As an alternative, I tried passing -delim "\t" to cassandra-unloader, but that seems to insert the two characters "\" and "t" rather than a single tab character. Is that expected? >>
Using tab as the delimiter character should work fine, but the Unix shell needs special quoting to pass a literal tab (rather than the two characters "\" and "t") to the program. Please see: https://superuser.com/questions/362235/how-do-i-enter-a-literal-tab-character-in-a-bash-shell
Use:
-delim $'\t'
instead of
-delim "\t"
So for example for the export, try:
./cassandra-unloader -schema "my_ksp.my_table(id,type,details)" -host <tserver-ip> -f export.csv -numThreads 3 -delim $'\t'
and for the import, try:
./cassandra-loader -schema "my_ksp.my_table_new(id,type,details)" -host <tserver-ip> -f /home/yugabyte/entity -numThreads 3 -progressRate 200000 -numFutures 256 -rate 5000 -queryTimeout 65 -delim $'\t'
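If you want to double-check what the shell actually passes, here is a quick sketch using od:
printf '%s' "\t" | od -c     # two characters: a backslash and the letter t
printf '%s' $'\t' | od -c    # a single tab character (od prints it as \t)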

SQL Server table swap

I have a table that is truncated and loaded with data every day. The problem is that truncating the table takes a while, and users are noticing this. What I am wondering is: is there a way to have two copies of the same table, truncate one and load the new data into it, then have the users use that new table, and just keep switching between the two tables?
If you're clearing out the old table as well as populating the new one, you could use the OUTPUT clause. Be mindful of the potential for log growth; consider a loop/batch approach if this may be a problem.
DELETE OldDatabase.dbo.MyTable
OUTPUT
    DELETED.col1
  , DELETED.col2
  , DELETED.col3
INTO NewDatabase.dbo.MyTable
Or you can use BCP, which is a handy alternative to be aware of. Note this uses SQLCMD syntax.
:setvar SourceServer OldServer
:setvar SourceDatabase OldDatabase
:setvar DestinationServer NewServer
:setvar DestinationDatabase NewDatabase
:setvar BCPFilePath "C:\"
!!bcp "$(SourceDatabase).dbo.MyTable" FORMAT nul -S "$(SourceServer)" -T -n -q -f "$(BCPFilePath)MyTable.fmt"
!!bcp "SELECT * FROM $(SourceDatabase).dbo.MyTable WHERE col1=x AND col2=y" queryout "$(BCPFilePath)MyTable.dat" -S "$(SourceServer)" -T -q -f "$(BCPFilePath)MyTable.fmt" -> "$(BCPFilePath)MyTable.txt"
!!bcp "$(DestinationDatabase).dbo.MyTable" in $(BCPFilePath)MyTable.dat -S $(DestinationServer) -T -E -q -b 2500 -h "TABLOCK" -f $(BCPFilePath)MyTable.fmt

How Do I Generate Sybase BCP Fmt file?

I have a huge database that I want to dump out using BCP and then load elsewhere. I have done quite a bit of research on the Sybase version of BCP (being more familiar with the MSSQL one), and I see how to use a format file, but I can't figure out for the life of me how to create one.
I am currently creating my Sybase bcp out data files like this:
bcp mytester.dbo.XTABLE out XTABLE.bcp -U sa -P mypass -T -n
and trying to import them back in like this:
bcp mytester.dbo.XTABLE in XTABLE.bcp -E -n -S Sybase_157 -U sa -P SyAdmin
Right now, the in step gives me an error about IDENTITY_INSERT regardless of whether the table has an identity column or not:
Server Message: Sybase157 - Msg 7756, Level 16, State 1: Cannot use
'SET IDENTITY_INSERT' for table 'mytester.dbo.XTABLE' because the
table does not have the identity property.
I have often used the great info on this site for help, but this is the first time I've posted a question, so I humbly request any guidance you all can provide :)
In your bcp in, the -E flag tells bcp to take identity column values from the input file; I would try running it without that flag. Format (.fmt) files in Sybase are a bit finicky, and I would avoid them if possible. As long as your schemas are the same between the two systems, the following command should work:
bcp mytester.dbo.XTABLE in XTABLE.bcp -n -S Sybase_157 -U sa -P SyAdmin
Also, the -T flag on your bcp out seems odd. In SQL Server, -T means a trusted connection, but in Sybase it indicates the maximum size of a text or image column and is followed by a number, e.g. -T 32000 (which would be 32 KB).
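Under that reading, a hedged sketch of the export drops the bare -T and names the server explicitly, reusing the values from the question (add -T with a size such as -T 32000 only if you actually need a larger text/image cap):
bcp mytester.dbo.XTABLE out XTABLE.bcp -U sa -P mypass -S Sybase_157 -n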
But to answer the question in your title: if you run bcp out interactively (without specifying -c, -n, or -f), it will step through each column, prompting for information. At the end it will ask whether you want to create a format file and let you specify the file name.
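Roughly, such an interactive run looks like this (the prompt wording is paraphrased and the column name is just a placeholder):
bcp mytester.dbo.XTABLE out XTABLE.dat -U sa -P mypass -S Sybase_157
   Enter the file storage type of field id [int]:
   Enter prefix-length of field id [0]:
   ...
   Do you want to save this format information in a file? [Y/n] Y
   Host filename [bcp.fmt]: XTABLE.fmt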
For reference, here is the syntax and available flags:
http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc30191.1550/html/utility/X14951.htm
And the chapter in the Utility Guide:
http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc30191.1550/html/utility/BABGCCIC.htm

How do I restore one database from a mysqldump containing multiple databases?

I have a MySQL dump containing 5 databases and would like to know if there is a way to import just one of them (using mysqldump or another tool).
Suggestions appreciated.
You can use the mysql command-line client's --one-database option:
mysql -u root -p --one-database YOURDBNAME < YOURFILE.SQL
Of course be careful when you do this.
You can also use a mysql dumpsplitter.
You can pipe the dumped SQL through sed and have it extract the database for you. Something like:
cat mysqldumped.sql | \
sed -n -e '/^CREATE DATABASE.*`the_database_you_want`/,/^CREATE DATABASE/ p' | \
sed -e '$d' | \
mysql
The two sed commands:
1. Print only the lines between the matching CREATE DATABASE lines (including both CREATE DATABASE lines), and
2. Delete the last CREATE DATABASE line from the output, since we don't want mysqld to create a second database.
If your dump does not contain the CREATE DATABASE lines, you can also match against the USE lines.
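A minimal variant of the same approach for that case, assuming the dump contains lines like USE `dbname`; (adjust the pattern to match your dump; the same last-line caveat applies):
sed -n '/^USE `the_database_you_want`/,/^USE `/ p' mysqldumped.sql | sed -e '$d' | mysql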
