Pentaho Kettle - CPU Utilization 100% for Table Input & Text File Output - sql-server

I have a basic Pentaho transformation in my job that reads 5,000 records from a stored procedure in SQL Server via a 'Table Input' step. The data has 5 columns, one of which is an XML column. After the 'Table Input', a 'Text File Output' step runs; it takes the output path from one of the columns and the XML data as the only field provided in the Fields tab. This creates 5,000 XML files in the given location by streaming data from the 'Table Input' to the 'Text File Output'.
When this job is executed it runs at 99-100% CPU utilization for the duration of the job and then drops back to ~5-10% CPU utilization afterwards. Is there any way to control the CPU utilization, either through Pentaho or the command prompt? This is running on a Windows Server 2012 R2 machine with 4 GB of RAM and an Intel Xeon E5-2680 v2 @ 2.8 GHz processor. I have seen that memory usage can be controlled through Spoon.bat, but I haven't found anything online about controlling CPU usage.

In my experience, neither of those steps is CPU intensive under normal circumstances. Two causes I can think of are:
It's choking trying to format the XML. That can be fixed by checking the 'Lazy conversion' option in the Table input step and 'Fast data dump (no formatting)' in the Text file output step. Then it should just stream the string data through.
The other is that the XMLs are huge and the CPU usage is actually garbage collection because Pentaho is running out of memory. Test this by increasing the maximum heap space (the -Xmx1024m option in the startup script).

Related

How to configure the configuration file in /var/lib/taos/vnode/vnode2/wal/

I'm using TDengine 3.0. I have found that a large number of 0000000.log files are generated under /var/lib/taos/vnode/vnode2/wal/, which take up a lot of space.
How should the log file be configured, and how should the file be cleaned up?
You could set the WAL_RETENTION_PERIOD value to 0; then each WAL file is deleted immediately after its contents are written to disk, which frees the space right away.
from https://docs.tdengine.com/taos-sql/database/
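For example, the setting can be applied through the TDengine Python connector. This is a minimal sketch, assuming the taospy package is installed, the defaults from taos.cfg apply, and that your 3.0.x release allows altering WAL_RETENTION_PERIOD on an existing database (otherwise set it at CREATE DATABASE time; check the linked docs for the exact syntax of your version):

import taos  # TDengine Python connector (taospy)

conn = taos.connect()  # assumes default host/user/password from taos.cfg
cur = conn.cursor()
# 'your_db' is a hypothetical database name; with a retention period of 0,
# WAL files are removed as soon as their contents are flushed to disk.
cur.execute("ALTER DATABASE your_db WAL_RETENTION_PERIOD 0")
cur.close()
conn.close()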

Script to alert a resize is necessary for database to fit into memory?

I want to create a bash script that will alert me that the size of my database is becoming too large to fit into memory. An effective bash script would run daily and only alert me via email when db_size / memory >= 0.5.
For example: You should size up your Digital Ocean Droplet soon for your PostgreSQL server: db_size = 0.6 GB; system_memory = 1.0 GB;
The problem is I am not sure exactly how to make this comparison accurately. My first stab at this was to use the pg_size_pretty() function via psql and the Linux command free. Is there a good/accurate comparison that I can make between db_size and total memory size (or another field) to alert me when I need to resize my droplet? Or are there any tools already in place that do this type of thing?
Bash Script
#!/usr/bin/env bash
db_size=`psql dbname username -c "SELECT pg_size_pretty( pg_database_size('dbname') );"`
echo $db_size
free
Output
user@postgres-server:~/gitrepo/scripts/cronjobs# bash postgres_db_size_check.sh
 pg_size_pretty
----------------
 6976 kB
(1 row)
total used free shared buff/cache available
Mem: 1016000 48648 142404 54492 824948 699852
Swap: 0 0 0
What is the correct comparison to use in the output above? Right now the size of the PostgreSQL DB is only 6,976 kB (~7 MB) and the server's total memory is 1,016,000 kB (~1 GB), so the database takes up roughly 0.7% of RAM (6,976 / 1,016,000) with < 20 users. As I scale my database I expect its size to increase quickly, and I want to stay ahead of it before RAM becomes an issue.
After doing some digging into PostgreSQL tuning I have gathered that giving PostgreSQL more memory to use is not as straightforward as just increasing the amount of RAM. Several variables likely need to be set to use the additional memory including work_mem, effective_cache_size, and shared_buffers. I say this to note that I realize increasing the size of my server's RAM is only one step in the process.
I asked this question on Database Administrators but no one got back to me over there.
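One way to make the comparison concrete, sketched in Python purely for illustration (the same two values can be pulled just as easily from bash): compare pg_database_size(), which returns bytes, with MemTotal from /proc/meminfo, so both sides are in the same unit before dividing. The dbname/username arguments and the 0.5 threshold come from the question above; the email alert is left as a stub.

#!/usr/bin/env python3
# Sketch: compare database size (bytes) against total system memory (bytes).
import subprocess

THRESHOLD = 0.5  # alert when db_size / total_memory >= 0.5

# pg_database_size() returns bytes; -t -A makes psql print just the value.
out = subprocess.run(
    ["psql", "dbname", "username", "-t", "-A", "-c",
     "SELECT pg_database_size('dbname');"],
    capture_output=True, text=True, check=True).stdout
db_size = int(out.strip())

# MemTotal in /proc/meminfo is reported in kB.
with open("/proc/meminfo") as meminfo:
    mem_total_kb = next(int(line.split()[1]) for line in meminfo
                        if line.startswith("MemTotal:"))
mem_total = mem_total_kb * 1024

ratio = db_size / mem_total
print(f"db_size = {db_size / 1e9:.1f} GB; system_memory = {mem_total / 1e9:.1f} GB; ratio = {ratio:.2f}")
if ratio >= THRESHOLD:
    pass  # send the alert email here, e.g. via smtplib or the mail command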

Trying to transfer the output data from multiple python scripts to SQLite3

I'm running 12 different Python scripts at the same time to screen thousands of data points each hour against a given set of criteria.
I would like to save the output data to a "master csv file" but figured it would be better to put the data into SQLite3 instead as I would be overwhelmed with csv files.
I am trying to transfer the output straight to SQLite3.
This is my script so far:
symbol = []
# Read the symbols from the results CSV; the 'with' block closes the file automatically.
with open(r'c:\Users\Desktop\Results.csv') as f:
    for line in f:
        symbol.append(line.strip())

path_out = r'c:\Users\Desktop\Results.csv'
i = 0
While you can use a sqlite database concurrently, it might slow your processes down considerably.
Whenever a program wants to write to the database file, the whole database has to be locked:
When SQLite tries to access a file that is locked by another process, the default behavior is to return SQLITE_BUSY.
So you could end up with more than one of your twelve processes waiting for the database to become available because one of them is writing.
Basically, concurrent read/write access is what client/server databases like PostgreSQL are made for. It is not a primary use case for sqlite.
So having the twelve programs each write a separate CSV file and merging them later is probably not such a bad choice, in my opinion. It is much easier than setting up a PostgreSQL server, at any rate.
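If you do stick with separate CSV files, merging them into SQLite afterwards only takes a few lines. Below is a minimal sketch, assuming hypothetical per-process files results_1.csv through results_12.csv that each hold one symbol per line, with the database path and results table name chosen purely for illustration:

import csv
import glob
import sqlite3

# Hypothetical paths and table name, purely for illustration.
conn = sqlite3.connect(r'c:\Users\Desktop\master.db')
conn.execute("CREATE TABLE IF NOT EXISTS results (symbol TEXT)")

for csv_path in glob.glob(r'c:\Users\Desktop\results_*.csv'):
    with open(csv_path, newline='') as f:
        rows = [(row[0],) for row in csv.reader(f) if row]
    conn.executemany("INSERT INTO results (symbol) VALUES (?)", rows)

conn.commit()
conn.close()

Run once an hour (or at the end of the day), this keeps the twelve writers out of each other's way while still ending up with a single queryable database.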

Read a large .sql file in Java and write the data to a database

I have to read a large .sql file (50 GB) in Java and write the content to a database. I tried it with small files (200 MB): it works, but only for the first file; by the second file it becomes too slow and doesn't terminate correctly (OOM: Java heap space). I changed -Xmx to 6144m, but it is still slow even for the first file. Can I free the memory after every iteration?
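The usual fix is to stream the file and execute it in bounded batches instead of reading it into memory all at once, so the heap stays flat regardless of file size. Here is a minimal sketch of that pattern, shown in Python against SQLite purely to illustrate the idea (in Java the equivalent is a BufferedReader combined with JDBC addBatch/executeBatch and periodic commits); the file and database names are hypothetical:

import sqlite3

BATCH_SIZE = 1000  # commit every N statements so the open transaction stays small

conn = sqlite3.connect("target.db")            # hypothetical target database
buffer, executed = [], 0
with open("dump.sql", encoding="utf-8") as f:  # hypothetical large dump file
    for line in f:                             # streams line by line, never the whole file
        buffer.append(line)
        if line.rstrip().endswith(";"):        # naive end-of-statement detection
            conn.execute("".join(buffer))
            buffer.clear()
            executed += 1
            if executed % BATCH_SIZE == 0:
                conn.commit()                  # flush the batch
conn.commit()
conn.close()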

Solr search with multiple word in autocomplete

I'm trying to build a Solr search using a .txt file specified in the sourceLocation attribute of the suggest searchComponent. I've used this example:
sample dict
hard disk hitachi
hard disk jjdd 3.0
hard disk wd 2.0
and make this query
host/solr/suggest?q=hard%20disk%20h&spellcheck=true&spellcheck.collate=true&spellcheck.build=true
but the response is
(spellcheck response, abridged: after building the dictionary, the suggestions returned include hard disk jjdd, hard disk wd and hard disk hitachi, and the collation is hard disk jjdd disk)
I want to have only one result, hard disk hitachi.
If I write a query with the param q=hard disk, I get the same result, and the collation tag contains hard disk jjdd disk.
It seems the search doesn't work with multiple words.
Can someone help me?
