Can't start clickhouse service, too many files in ../data/default/<TableName> - database

I have a strange problem with my standalone clickhouse-server installation. The server was running for some time with a nearly default config, except that the data and tmp directories were moved to a separate disk:
cat /etc/clickhouse-server/config.d/my_config.xml
<?xml version="1.0"?>
<yandex>
    <path>/data/clickhouse/</path>
    <tmp_path>/data/clickhouse/tmp/</tmp_path>
</yandex>
Today the server stopped responding with a connection refused error. It was rebooted, and after that the service couldn't finish starting:
2018.05.28 13:15:44.248373 [ 2 ] <Information> DatabaseOrdinary (default): 42.86%
2018.05.28 13:15:44.259860 [ 2 ] <Debug> default.event_4648 (Data): Loading data parts
2018.05.28 13:16:02.531851 [ 2 ] <Debug> default.event_4648 (Data): Loaded data parts (2168 items)
2018.05.28 13:16:02.532130 [ 2 ] <Information> DatabaseOrdinary (default): 57.14%
2018.05.28 13:16:02.534622 [ 2 ] <Debug> default.event_5156 (Data): Loading data parts
2018.05.28 13:34:01.731053 [ 3 ] <Information> Application: Received termination signal (Terminated)
Actually, I stopped the process at 57% because startup was taking too long (maybe it would have finished in an hour or two, I didn't try).
The log level is "trace" by default, but it didn't show any reason for this behavior.
I think the problem is the number of files in /data/clickhouse/data/default/event_5156.
There are now 626023 directories in it, and ls -la does not work properly in that directory; I had to use find to count the entries:
# time find . -maxdepth 1 | wc -l
626023
real 5m0.302s
user 0m3.114s
sys 0m24.848s
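As an aside (not from the original post): each of those directories is a MergeTree data part, so once the server is up, roughly the same number can be read from the system.parts table instead of walking the filesystem. A minimal sketch, assuming the Python clickhouse-driver package and default credentials:

from clickhouse_driver import Client

client = Client("localhost")
# system.parts has one row per data part; 'active' filters out parts that
# have already been merged away but not yet removed from disk.
rows = client.execute(
    "SELECT count() FROM system.parts "
    "WHERE database = 'default' AND table = 'event_5156' AND active"
)
print(rows[0][0])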
I have two questions:
1) Why did clickhouse-server generate so many files and directories with the default config?
2) How can I start the service in a reasonable time, without data loss?

The issue was in the data insertion method. I was using a script with the JDBC connector and sending one row per request. After switching to batch inserts, the issue was solved.
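For illustration only: the actual fix was on the JDBC side, but the same idea sketched in Python with the clickhouse-driver package looks like this (the column names are made up). Every INSERT statement creates at least one new on-disk part per partition, so one row per INSERT is what produced the huge directory count, while one INSERT per batch keeps the part count low:

from clickhouse_driver import Client

client = Client("localhost")
rows = [(i, "event payload %d" % i) for i in range(100000)]  # sample data

# Anti-pattern: one INSERT per row -> one new MergeTree part per request.
# for row in rows:
#     client.execute("INSERT INTO default.event_5156 (id, payload) VALUES", [row])

# Batch insert: the whole list goes in as a single part.
client.execute("INSERT INTO default.event_5156 (id, payload) VALUES", rows)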

Related

Race Condition in Sagemaker Batch Transform Job

We are facing this bug in our production environment. I have been looking for a solution for a while and I cannot seem to solve it. Any help would be appreciated.
We are using SageMaker Batch Transform to perform inference with our machine learning models. Each job is supposed to create one instance using a Docker image from our ECR repository. The job then consumes a payload and starts processing it with a PyTorch script. When the job is done, the script calls an API to store the results.
The issue is that when we check the CloudWatch logs for a SINGLE job, we see that it is repeated. After the job is repeated multiple times, the individual instances of the same job may or may not finish, and the whole operation returns an error.
Basically, we see the following pattern in our CloudWatch logs and cannot figure out what is causing it:
2022-04-24 19:41:47,865 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Starting to Process Task: 12345678-abcd-1234-efgh-123456ab12c3
...
[The job is running and printing logs]
...
[There is no error but the job doesn't seem to run anymore, the same job seems to roll back again]
...
2022-04-24 19:52:09,522 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Starting to Process Task: 12345678-abcd-1234-efgh-123456ab12c3
...
[The job is running and printing logs]
...
[There is no error but the job doesn't seem to run anymore, the same job seems to roll back again]
...
2022-04-24 20:12:11,834 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Starting to Process Task: 12345678-abcd-1234-efgh-123456ab12c3
...
[The job is running and printing logs]
...
[There are no errors but the CloudWatch logs stop here. SageMaker returns an error to the client.]
The following sample code is what we are using to run the jobs:
def inference_batch(self):
    batch_input = f"s3://{self.cnf.SAGEMAKER_BUCKET}/batch-input/batch.csv"
    batch_output = f"s3://{self.cnf.SAGEMAKER_BUCKET}/batch-output/"
    job_name = f"{self.cnf.SAGEMAKER_MODEL}-{datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"

    transform_input = {
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': batch_input
            }
        },
        'ContentType': 'text/csv',
        'SplitType': 'Line',
    }

    transform_output = {
        'S3OutputPath': batch_output
    }

    transform_resources = {
        'InstanceType': self.cnf.SAGEMAKER_BATCH_INSTANCE,
        'InstanceCount': 1
    }

    # self.sm_boto_client is an instance of boto3.Session(region_name="some-region").client("sagemaker")
    self.sm_boto_client.create_transform_job(
        TransformJobName=job_name,
        ModelName=self.cnf.SAGEMAKER_MODEL,
        TransformInput=transform_input,
        TransformOutput=transform_output,
        TransformResources=transform_resources
    )

    status = self.sm_boto_client.describe_transform_job(TransformJobName=job_name)
    print(f'Executing transform job {job_name}...')

    while status['TransformJobStatus'] == 'InProgress':
        time.sleep(5)
        status = self.sm_boto_client.describe_transform_job(TransformJobName=job_name)

    if status['TransformJobStatus'] == 'Completed':
        print(f'Batch transform job {job_name} successfully completed.')
    else:
        raise Exception(f'Batch transform job {job_name} failed.')
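As a side note on the client code above (unrelated to the repeated-job behaviour): instead of hand-rolling the describe/sleep loop, the polling can be delegated to a boto3 waiter. A sketch, assuming a reasonably recent boto3 that exposes the transform_job_completed_or_stopped waiter:

import boto3

def wait_for_transform(job_name, region="some-region"):
    """Block until the batch transform job finishes, then return its status."""
    sm = boto3.Session(region_name=region).client("sagemaker")
    waiter = sm.get_waiter("transform_job_completed_or_stopped")
    # Poll every 30 seconds, give up after roughly 2 hours.
    waiter.wait(TransformJobName=job_name,
                WaiterConfig={"Delay": 30, "MaxAttempts": 240})
    return sm.describe_transform_job(TransformJobName=job_name)["TransformJobStatus"]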

Flink - unable to recover after yarn node termination

We are running Flink on YARN. We were performing disaster recovery testing, and as part of that we manually terminated one of the nodes that had a Flink application running. Once the instance was brought back up, the application went through multiple restart attempts, and each attempt had the following error:
AM Container for appattempt_1602902099413_0006_000027 exited with exitCode: -1000
Failing this attempt. Diagnostics: Could not obtain block: BP-986419965-xx.xx.xx.xx-1602902058651:blk_1073743332_2508
file=/user/hadoop/.flink/application_1602902099413_0006/application_1602902099413_0006-flink-conf.yaml1528536851005494481.tmp
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-986419965-10.61.71.85-1602902058651:blk_1073743332_2508 file=/user/hadoop/.flink/application_1602902099413_0006/application_1602902099413_0006-flink-conf.yaml1528536851005494481.tmp
    at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1053)
    at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1036)
    at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1015)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:647)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:926)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:982)
    at java.io.DataInputStream.read(DataInputStream.java:100)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:90)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:64)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:125)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:369)
    at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:267)
    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
For more detailed output, check the application tracking page: http://<>.compute.internal:8088/cluster/app/application_1602902099413_0006 Then click on links to logs of each attempt.
Could someone let us know what content is being stored in HDFS and if this could be redirected to S3?
Adding checkpoint-related settings:
StateBackend rocksDbStateBackend = new RocksDBStateBackend("s3://Path", true);
streamExecutionEnvironment.setStateBackend(rocksDbStateBackend);
streamExecutionEnvironment.enableCheckpointing(10000);
streamExecutionEnvironment.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
streamExecutionEnvironment.getCheckpointConfig().setMinPauseBetweenCheckpoints(5000);
streamExecutionEnvironment.getCheckpointConfig().setCheckpointTimeout(60000);
streamExecutionEnvironment.getCheckpointConfig().setMaxConcurrentCheckpoints(60000);
streamExecutionEnvironment.getCheckpointConfig().enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
streamExecutionEnvironment.getCheckpointConfig().setPreferCheckpointForRecovery(true);
I had the same issue. I fixed it with the following settings for hdfs-site:
{
    "Classification": "hdfs-site",
    "Properties": {
        "dfs.client.use.datanode.hostname": "true",
        "dfs.replication": "2",
        "dfs.namenode.replication.min": "2",
        "dfs.namenode.maintenance.replication.min": "2"
    }
}
I think HDFS lost data when the node was terminated, so I replicate the data across multiple nodes.
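The classification block above is EMR's configuration format, so assuming the cluster is EMR and is created programmatically, the same properties can be supplied at launch time. A sketch with boto3 (cluster name, release label, instance types and roles are placeholders):

import boto3

hdfs_site = {
    "Classification": "hdfs-site",
    "Properties": {
        "dfs.client.use.datanode.hostname": "true",
        "dfs.replication": "2",
        "dfs.namenode.replication.min": "2",
        "dfs.namenode.maintenance.replication.min": "2",
    },
}

emr = boto3.client("emr", region_name="us-east-1")
emr.run_job_flow(
    Name="flink-on-yarn",                      # placeholder
    ReleaseLabel="emr-5.31.0",                 # placeholder
    Applications=[{"Name": "Hadoop"}, {"Name": "Flink"}],
    Configurations=[hdfs_site],                # the hdfs-site settings above
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)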

Gatling not logging to influxdb?

I've tried following the guide at http://gatling.io/docs/2.2.3/realtime_monitoring/index.html to log my test results to InfluxDB and display the data in a Grafana instance I had previously set up. However, I can't see any of the data that Gatling is supposed to log anywhere in InfluxDB.
I've edited my influxdb.conf file so that it contains the following fields:
[[graphite]]
  enabled = true
  database = "gatlingdb"
  bind-address = ":2003"
  protocol = "tcp"
  consistency-level = "one"
  name-separator = "."
  templates = [
    "gatling.*.*.*.count measurement.simulation.request.status.field",
    "gatling.*.*.*.min measurement.simulation.request.status.field",
    "gatling.*.*.*.max measurement.simulation.request.status.field",
    "gatling.*.*.*.percentiles50 measurement.simulation.request.status.field",
    "gatling.*.*.*.percentiles75 measurement.simulation.request.status.field",
    "gatling.*.*.*.percentiles95 measurement.simulation.request.status.field",
    "gatling.*.*.*.percentiles99 measurement.simulation.request.status.field"
  ]
and my gatling.conf file contains the following fields:
data {
  writers = [console, file, graphite] # The list of DataWriters to which Gatling write simulation data (currently supported : console, file, graphite, jdbc)
  console {
    #light = false # When set to true, displays a light version without detailed request stats
  }
  graphite {
    #light = false # only send the all* stats
    host = "127.0.0.1" # The host where the Carbon server is located
    port = 2003 # The port to which the Carbon server listens to (2003 is default for plaintext, 2004 is default for pickle)
    protocol = "tcp" # The protocol used to send data to Carbon (currently supported : "tcp", "udp")
    rootPathPrefix = "gatling" # The common prefix of all metrics sent to Graphite
    #bufferSize = 8192 # GraphiteDataWriter's internal data buffer size, in bytes
    #writeInterval = 1 # GraphiteDataWriter's write interval, in seconds
  }
}
Whenever I run my Gatling tests I see no error messages or anything that indicates something is wrong, but there is nothing in the influxd logs to suggest that anything has been written to InfluxDB, nor can I see any data in the gatlingdb database. I am using InfluxDB v0.10 and Gatling v2.2.3 on Ubuntu.
Can anyone help me figure out what I am doing wrong?
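One way to narrow this down (a hypothetical check, not part of the original post) is to bypass Gatling entirely and push a single metric in the Graphite plaintext protocol straight at the listener; if the point shows up in the gatlingdb database, the listener side is fine and the problem is on the Gatling side:

import socket
import time

# One Graphite plaintext line: "<metric path> <value> <unix timestamp>\n".
# The path matches the "measurement.simulation.request.status.field"
# template from influxdb.conf; the simulation/request names are made up.
metric = "gatling.testsim.testrequest.ok.count 1 %d\n" % int(time.time())

with socket.create_connection(("127.0.0.1", 2003), timeout=5) as sock:
    sock.sendall(metric.encode("ascii"))

If nothing appears in gatlingdb after that, the [[graphite]] section is not being picked up by the running influxd at all.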
I updated to InfluxDB v1.1 and the problem seemed to resolve itself.

Error enabling AD authentication in GPFS filesystem

I've created a filesystem on a two-node Linux GPFS cluster (both nodes running RHEL 7). I am trying to enable AD authentication but am receiving an error and have been unable to find a fix. Here is the process I am following on the manager node:
./spectrumscale file auth ad
I answer Yes to edit the template.
I fill in the template with the following info:
[file_ad]
servers = bdtestdc01 <--- my test AD server
netbios_name = gpfscluser <--- the name I gave the cluster during setup (is this field looking for another name?)
idmap_role = master
bind_username = administrator
bind_password = the domain password of the administrator account
unixmap_domains = bdtest.subdomain.company.com
I save the template and set the password. I then run:
./spectrumscale deploy
It errors at Installing Authentication. The log file says:
Error executing action run on resource 'execute[Configure file authentication]
2015-12-21 10:45:31,440 [ TRACE ] bdgpfs01.subdomain.company.com Chef Client failed. 1 resources updated in 3.641691552 seconds
2015-12-21 10:45:31,456 [ TRACE ] bdgpfs01.subdomain.company.com [2015-12-21T10:45:31-08:00] ERROR: execute[Configure file authentication] (auth::auth_file_configure line 22) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'
2015-12-21 10:45:31,456 [ TRACE ] bdgpfs01.subdomain.company.com ---- Begin output of /usr/lpp/mmfs/bin/mmuserauth service create --data-access-method file --type ad --servers 'bdtestbluedc01' --netbios-name 'gpfscluster' --idmap-role 'master' --user-name 'administrator' --password XXXXXX --unixmap-domains 'bdtest.subdomain.company.com' --idmap-range '10000000-299999999' --idmap-range-size '1000000' --enable-nfs-kerberos ----
2015-12-21 10:45:31,456 [ TRACE ] bdgpfs01.subdomain.company.com STDOUT:
2015-12-21 10:45:31,457 [ TRACE ] bdgpfs01.subdomain.company.com STDERR: mmuserauth service create: Syntax error. The correct syntax is:
2015-12-21 10:45:31,457 [ TRACE ] bdgpfs01.subdomain.company.com --unixmap-domains domain(lower value-higher value)
2015-12-21 10:45:31,457 [ TRACE ] bdgpfs01.subdomain.company.com mmuserauth service create: Command failed. Examine previous error messages to determine cause.
2015-12-21 10:45:31,438 [ TRACE ] bdgpfs01.subdomain.company.com mmuserauth service create: Command failed. Examine previous error messages to determine cause.
It turned out the domain field was looking for non-whitespace characters and an ID map range.
Instead of using the configuration template, I ran the following command, which enabled authentication successfully:
mmuserauth service create --type ad --data-access-method file --netbios-name bdtestnode --user-name administrator --idmap-role master --servers myADserver --password Passwr0rd --idmap-range-size 1000000 --idmap-range 10000000-299999999
I then ran the following command to test:
id "testdomain\administrator"
It returned the proper groups and IDs

How to load data from the online GAE datastore into the local development server?

I have previously used the approach described in the GAE docs to download backups of my entities from the live datastore.
Currently, I have a CSV file per entity kind, which I got by writing a bulkloader.yaml and using this command:
appcfg.py download_data --config_file=bulkloader.yaml --filename=users.csv --kind=Permission --url=http://your_app_id.appspot.com/_ah/remote_api
I also have a sql3 dump file that I got using the command:
appcfg.py download_data --kind=<kind> --url=http://your_app_id.appspot.com/_ah/remote_api --filename=<data-filename>
Now if I try this command:
appcfg.py upload_data --url=http://your_app_id.appspot.com/_ah/remote_api --kind=<kind> --filename=<data-filename>
Replacing the URL with localhost:8080, it asks me for a username/password. Even if I provide a mock username (test@example.com) at http://localhost:8080/_ah/remote_api and check the "admin" checkbox, it always gives me an authentication error.
The other alternative mentioned in the docs is using this:
appcfg.py upload_data --config_file=album_loader.py --filename=album_data.csv --kind=Album --url=http://localhost:8080/_ah/remote_api <app-directory>
I wrote a loader and tried it out; it also asks for a username and password, but it accepts anything here. The output is as follows:
/usr/local/google_appengine/google/appengine/api/search/search.py:232: UserWarning: DocumentOperationResult._code is deprecated. Use OperationResult._code instead.
'Use OperationResult.%s instead.' % (name, name))
/usr/local/google_appengine/google/appengine/api/search/search.py:232: UserWarning: DocumentOperationResult._CODES is deprecated. Use OperationResult._CODES instead.
'Use OperationResult.%s instead.' % (name, name))
Application: knowledgetestgame
Uploading data records.
[INFO ] Logging to bulkloader-log-20121113.210613
[INFO ] Throttling transfers:
[INFO ] Bandwidth: 250000 bytes/second
[INFO ] HTTP connections: 8/second
[INFO ] Entities inserted/fetched/modified: 20/second
[INFO ] Batch Size: 10
[INFO ] Opening database: bulkloader-progress-20121113.210613.sql3
Please enter login credentials for localhost
Email: test@example.com
Password for test@example.com:
[INFO ] Connecting to localhost:8080/_ah/remote_api
[INFO ] Starting import; maximum 10 entities per post
[ERROR ] [WorkerThread-4] WorkerThread:
Traceback (most recent call last):
File "/usr/local/google_appengine/google/appengine/tools/adaptive_thread_pool.py", line 176, in WorkOnItems
status, instruction = item.PerformWork(self.__thread_pool)
File "/usr/local/google_appengine/google/appengine/tools/bulkloader.py", line 764, in PerformWork
transfer_time = self._TransferItem(thread_pool)
File "/usr/local/google_appengine/google/appengine/tools/bulkloader.py", line 933, in _TransferItem
self.content = self.request_manager.EncodeContent(self.rows)
File "/usr/local/google_appengine/google/appengine/tools/bulkloader.py", line 1394, in EncodeContent
entity = loader.create_entity(values, key_name=key, parent=parent)
File "/usr/local/google_appengine/google/appengine/tools/bulkloader.py", line 2728, in create_entity
(len(self.__properties), len(values)))
AssertionError: Expected 17 columns, found 18.
[INFO ] [WorkerThread-5] Backing off due to errors: 1.0 seconds
[INFO ] Unexpected thread death: WorkerThread-4
[INFO ] An error occurred. Shutting down...
[ERROR ] Error in WorkerThread-4: Expected 17 columns, found 18.
[INFO ] 980 entities total, 0 previously transferred
[INFO ] 0 entities (278 bytes) transferred in 5.9 seconds
[INFO ] Some entities not successfully transferred
I have ~4000 entities in total; it says here that 980 were transferred, but when I check the local datastore I find none of them.
Below is the loader I use (I used NDB for the Guess entity):
import datetime

from google.appengine.ext import db
from google.appengine.tools import bulkloader
from google.appengine.ext.ndb import key


class Guess(db.Model):
    pass


class GuessLoader(bulkloader.Loader):
    def __init__(self):
        bulkloader.Loader.__init__(self, 'Guess',
            [('selectedAssociation', lambda x: x.decode('utf-8')),
             ('suggestionsList', lambda x: x.decode('utf-8')),
             ('associationIndexInList', int),
             ('timeEntered',
              lambda x: datetime.datetime.strptime(x, '%m/%d/%Y').date()),
             ('rank', int),
             ('topicName', lambda x: x.decode('utf-8')),
             ('topic', int),
             ('player', int),
             ('game', int),
             ('guessString', lambda x: x.decode('utf-8')),
             ('guessTime',
              lambda x: datetime.datetime.strptime(x, '%m/%d/%Y').date()),
             ('accountType', lambda x: x.decode('utf-8')),
             ('nthGuess', int),
             ('score', float),
             ('cutByRoundEnd', bool),
             ('suggestionsListDelay', int),
             ('occurrences', float)
            ])

loaders = [GuessLoader]
Edit: I just noticed this part of the error message: [ERROR ] Error in WorkerThread-0: Expected 17 columns, found 18. I went through the whole CSV file and made sure that every line has 18 columns. I checked the loader and found that I was missing the key column; I gave it type int, but this doesn't work.
If you have problems with the authentication, put the following in your appengine_config.py:
import os

if os.environ.get('SERVER_SOFTWARE', '').startswith('Development'):
    remoteapi_CUSTOM_ENVIRONMENT_AUTHENTICATION = (
        'REMOTE_ADDR', ['127.0.0.1'])
then run
appcfg.py download_data --url=http://APPNAME.appspot.com/_ah/remote_api --filename=dump --kind=EntityName
appcfg.py upload_data --url=http://localhost:8080/_ah/remote_api --filename=dump --application=dev~APPNAME
Try just pressing Enter (no username/password). This seemed to do the trick for me. My command (wrapped in a bash script to prevent import errors that I occasionally received) is:
#!/bin/bash

# Modify path
export PYTHONPATH=$PYTHONPATH:.

# Load data
python /path/to/app/config/appcfg.py upload_data \
  --config_file=<my_loader.py> \
  --filename=<output.csv> \
  --kind=<kind> \
  --application=dev~<application_id> \
  --url=http://localhost:8088/_ah/remote_api ./
When prompted for the Email, I hit enter and all is uploaded to the dev server. I am not using NDB in this case, although I do not believe that should make a difference.
