We try to import our data into SolrCloud using MapReduce batch indexing. We face a problem at the reduce phase, that solr.xml cannot be found. We create a 'twitter' collection but looking at the logs, after it failed to load in solr.xml, it uses the default one and tries to create 'collection1' (failed) and 'core1' (success) SolrCore. I'm not sure if we need to create our own solr.xml and where to put it (we try to put it at several places but it seems not to load in). Below is the log:
2022 [main] INFO org.apache.solr.hadoop.HeartBeater - Heart beat reporting class is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
2025 [main] INFO org.apache.solr.hadoop.SolrRecordWriter - Using this unpacked directory as solr home: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip
2025 [main] INFO org.apache.solr.hadoop.SolrRecordWriter - Creating embedded Solr server with solrHomeDir: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip, fs: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1828461666_1, ugi=nguyen (auth:SIMPLE)]], outputShardDir: hdfs://master.hadoop:8020/user/nguyen/twitter/outdir/reducers/_temporary/_attempt_201311191613_0320_r_000014_0/part-r-00014
2029 [Thread-64] INFO org.apache.solr.hadoop.HeartBeater - HeartBeat thread running
2030 [Thread-64] INFO org.apache.solr.hadoop.HeartBeater - Issuing heart beat for 1 threads
2083 [main] INFO org.apache.solr.core.SolrResourceLoader - new SolrResourceLoader for directory: '/data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/'
2259 [main] INFO org.apache.solr.hadoop.SolrRecordWriter - Constructed instance information solr.home /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip (/data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip), instance dir /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/, conf dir /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/conf/, writing index to solr.data.dir hdfs://master.hadoop:8020/user/nguyen/twitter/outdir/reducers/_temporary/_attempt_201311191613_0320_r_000014_0/part-r-00014/data, with permdir hdfs://master.hadoop:8020/user/nguyen/twitter/outdir/reducers/_temporary/_attempt_201311191613_0320_r_000014_0/part-r-00014
2266 [main] INFO org.apache.solr.core.ConfigSolr - Loading container configuration from /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/solr.xml
2267 [main] INFO org.apache.solr.core.ConfigSolr - /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/solr.xml does not exist, using default configuration
2505 [main] INFO org.apache.solr.core.CoreContainer - New CoreContainer 696103669
2505 [main] INFO org.apache.solr.core.CoreContainer - Loading cores into CoreContainer [instanceDir=/data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/]
2515 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting socketTimeout to: 0
2515 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting urlScheme to: http://
2515 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting connTimeout to: 0
2515 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting maxConnectionsPerHost to: 20
2516 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting corePoolSize to: 0
2516 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting maximumPoolSize to: 2147483647
2516 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting maxThreadIdleTime to: 5
2516 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting sizeOfQueue to: -1
2516 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting fairnessPolicy to: false
2527 [main] INFO org.apache.solr.client.solrj.impl.HttpClientUtil - Creating new http client, config:maxConnectionsPerHost=20&maxConnections=10000&socketTimeout=0&connTimeout=0&retry=false
2648 [main] INFO org.apache.solr.logging.LogWatcher - Registering Log Listener [Log4j (org.slf4j.impl.Log4jLoggerFactory)]
2676 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.core.CoreContainer - Creating SolrCore 'collection1' using instanceDir: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/collection1
2677 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.core.SolrResourceLoader - new SolrResourceLoader for directory: '/data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/collection1/'
2691 [coreLoadExecutor-3-thread-1] ERROR org.apache.solr.core.CoreContainer - Failed to load file /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/collection1/solrconfig.xml
2693 [coreLoadExecutor-3-thread-1] ERROR org.apache.solr.core.CoreContainer - Unable to create core: collection1
org.apache.solr.common.SolrException: Could not load config for solrconfig.xml
at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:596)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:661)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:368)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:360)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: java.io.IOException: Can't find resource 'solrconfig.xml' in classpath or '/data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/collection1/conf/', cwd=/data/05/mapred/local/taskTracker/nguyen/jobcache/job_201311191613_0320/attempt_201311191613_0320_r_000014_0/work
at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:322)
at org.apache.solr.core.SolrResourceLoader.openConfig(SolrResourceLoader.java:287)
at org.apache.solr.core.Config.<init>(Config.java:116)
at org.apache.solr.core.Config.<init>(Config.java:86)
at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:120)
at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:593)
... 11 more
2695 [coreLoadExecutor-3-thread-1] ERROR org.apache.solr.core.CoreContainer - null:org.apache.solr.common.SolrException: Unable to create core: collection1
at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:1158)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:670)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:368)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:360)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.solr.common.SolrException: Could not load config for solrconfig.xml
at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:596)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:661)
... 10 more
Caused by: java.io.IOException: Can't find resource 'solrconfig.xml' in classpath or '/data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/collection1/conf/', cwd=/data/05/mapred/local/taskTracker/nguyen/jobcache/job_201311191613_0320/attempt_201311191613_0320_r_000014_0/work
at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:322)
at org.apache.solr.core.SolrResourceLoader.openConfig(SolrResourceLoader.java:287)
at org.apache.solr.core.Config.<init>(Config.java:116)
at org.apache.solr.core.Config.<init>(Config.java:86)
at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:120)
at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:593)
... 11 more
2697 [main] INFO org.apache.solr.core.CoreContainer - Creating SolrCore 'core1' using instanceDir: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip
2697 [main] INFO org.apache.solr.core.SolrResourceLoader - new SolrResourceLoader for directory: '/data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/'
2751 [main] INFO org.apache.solr.core.SolrConfig - Adding specified lib dirs to ClassLoader
2752 [main] WARN org.apache.solr.core.SolrResourceLoader - Can't find (or read) directory to add to classloader: ../../../contrib/extraction/lib (resolved as: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/../../../contrib/extraction/lib).
2752 [main] WARN org.apache.solr.core.SolrResourceLoader - Can't find (or read) directory to add to classloader: ../../../dist/ (resolved as: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/../../../dist).
2752 [main] WARN org.apache.solr.core.SolrResourceLoader - Can't find (or read) directory to add to classloader: ../../../contrib/clustering/lib/ (resolved as: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/../../../contrib/clustering/lib).
2753 [main] WARN org.apache.solr.core.SolrResourceLoader - Can't find (or read) directory to add to classloader: ../../../dist/ (resolved as: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/../../../dist).
2753 [main] WARN org.apache.solr.core.SolrResourceLoader - Can't find (or read) directory to add to classloader: ../../../contrib/langid/lib/ (resolved as: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/../../../contrib/langid/lib).
2753 [main] WARN org.apache.solr.core.SolrResourceLoader - Can't find (or read) directory to add to classloader: ../../../dist/ (resolved as: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/../../../dist).
2753 [main] WARN org.apache.solr.core.SolrResourceLoader - Can't find (or read) directory to add to classloader: ../../../contrib/velocity/lib (resolved as: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/../../../contrib/velocity/lib).
2753 [main] WARN org.apache.solr.core.SolrResourceLoader - Can't find (or read) directory to add to classloader: ../../../dist/ (resolved as: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/../../../dist).
2785 [main] INFO org.apache.solr.update.SolrIndexConfig - IndexWriter infoStream solr logging is enabled
2790 [main] INFO org.apache.solr.core.SolrConfig - Using Lucene MatchVersion: LUCENE_44
2869 [main] INFO org.apache.solr.core.Config - Loaded SolrConfig: solrconfig.xml
2879 [main] INFO org.apache.solr.schema.IndexSchema - Reading Solr Schema from schema.xml
2937 [main] INFO org.apache.solr.schema.IndexSchema - [core1] Schema name=twitter
3352 [main] INFO org.apache.solr.schema.IndexSchema - unique key field: id
3471 [main] INFO org.apache.solr.schema.FileExchangeRateProvider - Reloading exchange rates from file currency.xml
3478 [main] INFO org.apache.solr.schema.FileExchangeRateProvider - Reloading exchange rates from file currency.xml
3635 [main] INFO org.apache.solr.core.HdfsDirectoryFactory - Solr Kerberos Authentication disabled
3636 [main] INFO org.apache.solr.core.JmxMonitoredMap - No JMX servers found, not exposing Solr information with JMX.
3652 [main] INFO org.apache.solr.core.HdfsDirectoryFactory - creating directory factory for path hdfs://master.hadoop:8020/user/nguyen/twitter/outdir/reducers/_temporary/_attempt_201311191613_0320_r_000014_0/part-r-00014/data
3686 [main] INFO org.apache.solr.core.CachingDirectoryFactory - return new directory for hdfs://master.hadoop:8020/user/nguyen/twitter/outdir/reducers/_temporary/_attempt_201311191613_0320_r_000014_0/part-r-00014/data
3711 [main] WARN org.apache.solr.core.SolrCore - [core1] Solr index directory 'hdfs:/master.hadoop:8020/user/nguyen/twitter/outdir/reducers/_temporary/_attempt_201311191613_0320_r_000014_0/part-r-00014/data/index' doesn't exist. Creating new index...
3719 [main] INFO org.apache.solr.core.HdfsDirectoryFactory - creating directory factory for path hdfs://master.hadoop:8020/user/nguyen/twitter/outdir/reducers/_temporary/_attempt_201311191613_0320_r_000014_0/part-r-00014/data/index
3719 [main] INFO org.apache.solr.core.HdfsDirectoryFactory - Number of slabs of block cache [1] with direct memory allocation set to [true]
3720 [main] INFO org.apache.solr.core.HdfsDirectoryFactory - Block cache target memory usage, slab size of [134217728] will allocate [1] slabs and use ~[134217728] bytes
3721 [main] INFO org.apache.solr.store.blockcache.BufferStore - Initializing the 1024 buffers with [8192] buffers.
3740 [main] INFO org.apache.solr.store.blockcache.BufferStore - Initializing the 8192 buffers with [8192] buffers.
3891 [main] INFO org.apache.solr.core.CachingDirectoryFactory - return new directory for hdfs://master.hadoop:8020/user/nguyen/twitter/outdir/reducers/_temporary/_attempt_201311191613_0320_r_000014_0/part-r-00014/data/index
3988 [main] INFO org.apache.solr.update.LoggingInfoStream - [IFD][main]: init: current segments file is "null"; deletionPolicy=org.apache.solr.core.IndexDeletionPolicyWrapper#65b01d5d
3992 [main] INFO org.apache.solr.update.LoggingInfoStream - [IFD][main]: now checkpoint "" [0 segments ; isCommit = false]
3992 [main] INFO org.apache.solr.update.LoggingInfoStream - [IFD][main]: 0 msec to checkpoint
3992 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: init: create=true
3992 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]:
dir=NRTCachingDirectory(org.apache.solr.store.hdfs.HdfsDirectory#17e5a6d8 lockFactory=org.apache.solr.store.hdfs.HdfsLockFactory#7f117668; maxCacheMB=192.0 maxMergeSizeMB=16.0)
solr looks for solr.home parameter and searchs solrConfig.xml file there. if there is none it tries to load default configuration.
it looks like your solr home is
/data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/collection1/
check that folder for solrconfig.xml file
if there is none, copy one from example directory of solr
if there is one, match the file/folder permissions with the server instance
Related
I am executing Camel code in Windows using Eclipse and it is working fine.
However, when I execute the same code in standalone from Linux, the route has print first log but when fetching file it stops without any error.
Here is my code:
from("timer://alertstrigtimer?period=90s&repeatCount=1")
.log(LoggingLevel.INFO, "*******************************Job-Alert-System: Started: alertstrigtimer******************************" + getFileURI(getWorkFilePath(), getWorkFileName()))
.pollEnrich(getFileURI(getWorkFilePath(), getWorkFileName()))
.log(LoggingLevel.INFO, "*******************************Job-Alert-System: Started: alertstrigtimer******************************" + getFileURI(getWorkFilePath(), getWorkFileName()))
.choice()
.when(header("CamelFileName").isNull())
.log(LoggingLevel.INFO, "No File")
.process(new Processor() {
public void process(Exchange exchange) throws Exception {
log.info("Job-Alert-System: No Date File Exist!!!! Calculate 15 Minutes Back and fetching data from Masterdata");
// Do something
}
})
.otherwise()
.log(LoggingLevel.INFO, "Job Alert System: Date File Loaded: ${header.CamelFileName} at ${header.CamelFileLastModified}")
.process(new Processor() {
// Do something by a processor
})
public static String getFileURI(String filePath, String fileName) {
return "file://" + filePath + "?fileName=" + fileName
+ "&preMove=$simple{file:onlyname.noext}.$simple{date:now:yyyy-MM-dd'T'hh-mm-ss}";
}
Here are my logs from the Linux environment:
[main] INFO org.apache.camel.impl.DefaultCamelContext - Apache Camel 2.21.1 (CamelContext: camel-1) is starting
[main] INFO org.apache.camel.management.ManagedManagementStrategy - JMX is enabled
[main] INFO org.apache.camel.impl.converter.DefaultTypeConverter - Type converters loaded (core: 194, classpath: 0)
[main] INFO org.apache.camel.impl.DefaultCamelContext - StreamCaching is not in use. If using streams then its recommended to enable stream caching. See more details at http://camel.apache.org/stream-caching.html
[main] INFO org.apache.camel.impl.DefaultCamelContext - Route: route1 started and consuming from: timer://alertstrigtimer?period=90s&repeatCount=1
[main] INFO org.apache.camel.impl.DefaultCamelContext - Route: loadDataAndAlerts started and consuming from: direct://loadDataAndAlerts
[main] INFO org.apache.camel.impl.DefaultCamelContext - Total 2 routes, of which 2 are started
[main] INFO org.apache.camel.impl.DefaultCamelContext - Apache Camel 2.21.1 (CamelContext: camel-1) started in 0.664 seconds
[Camel (camel-1) thread #1 - timer://alertstrigtimer] INFO route1 - *******************************Job-Alert-System: Started: alertstrigtimer******************************file:///shared/wildfly/work-files/alerts?fileName=LastExecutionTime_JobAlerts.txt&preMove=.2020-10-12T06-48-16
It stops here. It creates a directory structure, but does not move forward.
Logs from My Local Machine:
[main] INFO org.apache.camel.impl.DefaultCamelContext - Apache Camel 2.21.1 (CamelContext: camel-1) is starting
[main] INFO org.apache.camel.management.ManagedManagementStrategy - JMX is enabled
[main] INFO org.apache.camel.impl.converter.DefaultTypeConverter - Type converters loaded (core: 194, classpath: 5)
[main] INFO org.apache.camel.impl.DefaultCamelContext - StreamCaching is not in use. If using streams then its recommended to enable stream caching. See more details at http://camel.apache.org/stream-caching.html
[main] INFO org.apache.camel.impl.DefaultCamelContext - Route: route1 started and consuming from: timer://alertstrigtimer?period=90s&repeatCount=1
[main] INFO org.apache.camel.impl.DefaultCamelContext - Route: loadDataAndAlerts started and consuming from: direct://loadDataAndAlerts
[main] INFO org.apache.camel.impl.DefaultCamelContext - Total 2 routes, of which 2 are started
[main] INFO org.apache.camel.impl.DefaultCamelContext - Apache Camel 2.21.1 (CamelContext: camel-1) started in 0.845 seconds
[Camel (camel-1) thread #1 - timer://alertstrigtimer] INFO route1 - *******************************Job-Alert-System: Started: alertstrigtimer******************************file://null?fileName=null&preMove=null.2020-10-12T10-28-51
[Camel (camel-1) thread #1 - timer://alertstrigtimer] INFO route1 - Job Alert System: Date File Loaded: null.2020-10-12T10-28-51 at 0
It creates directory structure in addition to a file, but the file is not present and it moves forward.
The problem:
The flink job-manager couldn't recovery from a checkpoint.
Caused by: java.lang.IllegalStateException: There is no operator for the state
Background:
I'm running a flink 1.6.3 over k8s. and I'm using incremental checkpoint on rocksdb.
I tryied to pass the parameter --allowNonRestoredState in order to skip savepoint state that cannot be restored
From my log:
2019-02-06 08:51:08.068 [main] INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint -
--allowNonRestoredState
2019-02-06 08:51:22.827 [flink-akka.actor.default-dispatcher-14] INFO
o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore -
Recovering checkpoints from ZooKeeper. 2019-02-06 08:51:22.883
[flink-akka.actor.default-dispatcher-14] INFO
o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 1
checkpoints in ZooKeeper. 2019-02-06 08:51:22.883
[flink-akka.actor.default-dispatcher-14] INFO
o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to fetch 1 checkpoints from storage. 2019-02-06 08:51:22.884
[flink-akka.actor.default-dispatcher-14] INFO
o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 1612. 2019-02-06 08:51:22.977
[flink-akka.actor.default-dispatcher-14] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Restoring
job 00000000000000000000000000000000 from latest valid checkpoint:
Checkpoint 1612 # 1549376250641 for 00000000000000000000000000000000.
2019-02-06 08:51:22.982 [flink-akka.actor.default-dispatcher-14] ERROR
org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error
occurred in the cluster entrypoint. java.lang.RuntimeException:
org.apache.flink.runtime.client.JobExecutionException: Could not set
up JobManager
at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.runtime.client.JobExecutionException:
Could not set up JobManager
at org.apache.flink.runtime.jobmaster.JobManagerRunner.(JobManagerRunner.java:176)
at org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:1058)
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:308)
at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
... 7 common frames omitted Caused by: java.lang.IllegalStateException: There is no operator for the state
b22e6e8baea7d7e562d5a233f3301ce1
at org.apache.flink.runtime.checkpoint.StateAssignmentOperation.checkStateMappingCompleteness(StateAssignmentOperation.java:569)
at org.apache.flink.runtime.checkpoint.StateAssignmentOperation.assignStates(StateAssignmentOperation.java:77)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreLatestCheckpointedState(CheckpointCoordinator.java:1049)
at org.apache.flink.runtime.jobmaster.JobMaster.createAndRestoreExecutionGraph(JobMaster.java:1138)
at org.apache.flink.runtime.jobmaster.JobMaster.(JobMaster.java:294)
at org.apache.flink.runtime.jobmaster.JobManagerRunner.(JobManagerRunner.java:157)
... 10 common frames omitted 2019-02-06 08:51:23.013 [TransientBlobCache shutdown hook] INFO
org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB
cache 2019-02-06 08:51:23.033 [BlobServer shutdown hook] INFO
org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at
0.0.0.0:6124
Expected result:
The job will start running from the latest checkpoint and will skip the state that cannot be restored
I am running a yarn 3 node cluster on EMR(1 Master 2 Core nodes). I am using 1.6.0. I have check-pointing enabled(rocksdb), writing to S3. Check-pointing seems to work correctly in other tests. In the case where yarn crashes(In this case, I killed the yarn processes) on the master node, I an unable to resume my application from the last checkpoint. Here is the output when I try and restart:
[hadoop#emr flink-1.6.0]$ bin/flink run -s s3://bucket/kinesis-pipeline-checkpoint/a8a9ceb95845c3ea9833e025b5771470 -p 1 -d ~/pipeline-assembly-0.2.0.jar
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/flink-1.6.0/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2018-11-08 19:01:06,069 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - Found Yarn properties file under /tmp/.yarn-properties-hadoop.
2018-11-08 19:01:06,069 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - Found Yarn properties file under /tmp/.yarn-properties-hadoop.
2018-11-08 19:01:06,488 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - YARN properties set default parallelism to 1
2018-11-08 19:01:06,488 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - YARN properties set default parallelism to 1
YARN properties set default parallelism to 1
2018-11-08 19:01:06,637 INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at emr:8032
2018-11-08 19:01:06,745 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2018-11-08 19:01:06,745 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2018-11-08 19:01:06,845 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Found application JobManager host name 'emr' and port '39541' from supplied application id 'application_1541703591281_0001'
Starting execution of program
------------------------------------------------------------
The program finished with the following exception:
org.apache.flink.client.program.ProgramInvocationException: Could not submit job (JobID: c701b6511ad76b5e4faae703763f388e)
at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:249)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:486)
at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:432)
at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:804)
at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:280)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1044)
at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1120)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1120)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:379)
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
... 12 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
... 10 more
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.rest.util.RestClientException: [Job submission failed.]
at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:953)
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
... 4 more
Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Job submission failed.]
at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:310)
at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:294)
at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
... 5 more
Is this expected behavior, or am I doing something wrong in this situation?
Thank you
UPDATE: jobmanager.log
LogType:jobmanager.log
Log Upload Time:Tue Nov 20 16:37:52 +0000 2018
LogLength:49255
Log Contents:
2018-11-20 16:33:33,276 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
2018-11-20 16:33:33,277 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting YarnSessionClusterEntrypoint (Version: 1.6.0, Rev:ff472b4, Date:07.08.2018 # 13:31:13 UTC)
2018-11-20 16:33:33,278 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - OS current user: yarn
2018-11-20 16:33:33,672 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Current Hadoop/Kerberos user: hadoop
2018-11-20 16:33:33,672 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
2018-11-20 16:33:33,672 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Maximum heap size: 13653 MiBytes
2018-11-20 16:33:33,672 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JAVA_HOME: /usr/lib/jvm/java-openjdk
2018-11-20 16:33:33,673 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Hadoop version: 2.8.3
2018-11-20 16:33:33,673 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM Options:
2018-11-20 16:33:33,673 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Xmx15360m
2018-11-20 16:33:33,673 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlog.file=/var/log/hadoop-yarn/containers/application_1542731534971_0001/container_1542731534971_0001_01_000001/jobmanager.log
2018-11-20 16:33:33,673 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlogback.configurationFile=file:logback.xml
2018-11-20 16:33:33,673 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlog4j.configuration=file:log4j.properties
2018-11-20 16:33:33,673 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program Arguments: (none)
2018-11-20 16:33:33,674 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
2018-11-20 16:33:33,675 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-11-20 16:33:33,678 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - YARN daemon is running as: hadoop Yarn client user obtainer: hadoop
2018-11-20 16:33:33,680 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend.fs.checkpointdir, s3://bucket/kinesis-checkpoint
2018-11-20 16:33:33,680 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: web.timeout, 60000
2018-11-20 16:33:33,680 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: high-availability.cluster-id, application_1542731534971_0001
2018-11-20 16:33:33,680 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost
2018-11-20 16:33:33,681 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2018-11-20 16:33:33,681 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: rest.port, 8081
2018-11-20 16:33:33,681 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: internal.cluster.execution-mode, NORMAL
2018-11-20 16:33:33,681 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.fraction, 0.9
2018-11-20 16:33:33,681 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2018-11-20 16:33:33,681 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2018-11-20 16:33:33,681 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend, rocksdb
2018-11-20 16:33:33,681 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: akka.ask.timeout, 60s
2018-11-20 16:33:33,681 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 20480m
2018-11-20 16:33:33,682 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 20480m
2018-11-20 16:33:33,682 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.checkpoints.dir, s3://bucket/kinesis-checkpoint
2018-11-20 16:33:33,695 INFO org.apache.flink.runtime.clusterframework.BootstrapTools - Setting directories for temporary files to: /mnt/yarn/usercache/hadoop/appcache/application_1542731534971_0001
2018-11-20 16:33:33,708 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting YarnSessionClusterEntrypoint.
2018-11-20 16:33:33,708 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Install default filesystem.
2018-11-20 16:33:33,772 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to hadoop (auth:SIMPLE)
2018-11-20 16:33:33,786 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Initializing cluster services.
2018-11-20 16:33:33,791 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Trying to start actor system at ip-172-31-18-80.us-west-2.compute.internal:45751
2018-11-20 16:33:34,239 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2018-11-20 16:33:34,328 INFO akka.remote.Remoting - Starting remoting
2018-11-20 16:33:34,428 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink#ip-172-31-18-80.us-west-2.compute.internal:45751]
2018-11-20 16:33:34,437 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Actor system started at akka.tcp://flink#ip-172-31-18-80.us-west-2.compute.internal:45751
2018-11-20 16:33:34,469 INFO org.apache.flink.runtime.blob.BlobServer - Created BLOB server storage directory /mnt/yarn/usercache/hadoop/appcache/application_1542731534971_0001/blobStore-1dc43ec8-8ed7-4342-adae-c8d20a691640
2018-11-20 16:33:34,473 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:39955 - max concurrent requests: 50 - max backlog: 1000
2018-11-20 16:33:34,488 INFO org.apache.flink.runtime.metrics.MetricRegistryImpl - No metrics reporter configured, no metrics will be exposed/reported.
2018-11-20 16:33:34,492 INFO org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore - Initializing FileArchivedExecutionGraphStore: Storage directory /mnt/yarn/usercache/hadoop/appcache/application_1542731534971_0001/executionGraphStore-0c4fd7ac-17d2-40d6-b279-dfef5041a76f, expiration time 3600000, maximum cache size 52428800 bytes.
2018-11-20 16:33:34,514 INFO org.apache.flink.runtime.blob.TransientBlobCache - Created BLOB cache storage directory /mnt/yarn/usercache/hadoop/appcache/application_1542731534971_0001/blobStore-4c662c5c-afa5-4bf2-8a01-3acc0b9aa491
2018-11-20 16:33:34,521 WARN org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Upload directory /tmp/flink-web-6885656b-18cc-451f-8853-03ff7cf14b0e/flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available.
2018-11-20 16:33:34,522 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Created directory /tmp/flink-web-6885656b-18cc-451f-8853-03ff7cf14b0e/flink-web-upload for file uploads.
2018-11-20 16:33:34,525 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Starting rest endpoint.
2018-11-20 16:33:34,702 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of main cluster component log file: /var/log/hadoop-yarn/containers/application_1542731534971_0001/container_1542731534971_0001_01_000001/jobmanager.log
2018-11-20 16:33:34,702 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of main cluster component stdout file: /var/log/hadoop-yarn/containers/application_1542731534971_0001/container_1542731534971_0001_01_000001/jobmanager.out
2018-11-20 16:33:34,844 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Rest endpoint listening at ip-172-31-18-80.us-west-2.compute.internal:35939
2018-11-20 16:33:34,844 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - http://ip-172-31-18-80.us-west-2.compute.internal:35939 was granted leadership with leaderSessionID=00000000-0000-0000-0000-000000000000
2018-11-20 16:33:34,844 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Web frontend listening at http://ip-172-31-18-80.us-west-2.compute.internal:35939.
2018-11-20 16:33:34,857 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.yarn.YarnResourceManager at akka://flink/user/resourcemanager .
2018-11-20 16:33:34,948 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2018-11-20 16:33:34,981 INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ip-172-31-30-52.us-west-2.compute.internal/172.31.30.52:8030
2018-11-20 16:33:35,234 INFO org.apache.flink.yarn.YarnResourceManager - Recovered 0 containers from previous attempts ([]).
2018-11-20 16:33:35,237 INFO org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy - yarn.client.max-cached-nodemanagers-proxies : 0
2018-11-20 16:33:35,238 INFO org.apache.flink.yarn.YarnResourceManager - ResourceManager akka.tcp://flink#ip-172-31-18-80.us-west-2.compute.internal:45751/user/resourcemanager was granted leadership with fencing token 00000000000000000000000000000000
2018-11-20 16:33:35,239 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Starting the SlotManager.
2018-11-20 16:33:35,252 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Dispatcher akka.tcp://flink#ip-172-31-18-80.us-west-2.compute.internal:45751/user/dispatcher was granted leadership with fencing token 00000000-0000-0000-0000-000000000000
2018-11-20 16:33:35,252 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Recovering all persisted jobs.
2018-11-20 16:34:20,094 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Submitting job bd0d5dbaeba3990a3bef1eebee49cd79 (Data Session Pipeline v0.0.7).
2018-11-20 16:34:20,108 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.jobmaster.JobMaster at akka://flink/user/jobmanager_0 .
2018-11-20 16:34:20,115 INFO org.apache.flink.runtime.jobmaster.JobMaster - Initializing job Data Session Pipeline v0.0.7 (bd0d5dbaeba3990a3bef1eebee49cd79).
2018-11-20 16:34:20,124 INFO org.apache.flink.runtime.jobmaster.JobMaster - Using restart strategy FixedDelayRestartStrategy(maxNumberRestartAttempts=2147483647, delayBetweenRestartAttempts=0) for Data Session Pipeline v0.0.7 (bd0d5dbaeba3990a3bef1eebee49cd79).
2018-11-20 16:34:20,127 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.jobmaster.slotpool.SlotPool at akka://flink/user/0e6f5de3-53ad-4bae-acf3-3c66106c0a54 .
2018-11-20 16:34:20,148 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job recovers via failover strategy: full graph restart
2018-11-20 16:34:20,170 INFO org.apache.flink.runtime.jobmaster.JobMaster - Running initialization on master for job Data Session Pipeline v0.0.7 (bd0d5dbaeba3990a3bef1eebee49cd79).
2018-11-20 16:34:20,170 INFO org.apache.flink.runtime.jobmaster.JobMaster - Successfully ran initialization on master in 0 ms.
2018-11-20 16:34:20,203 INFO org.apache.flink.runtime.jobmaster.JobMaster - Using application-defined state backend: RocksDBStateBackend{checkpointStreamBackend=File State Backend (checkpoints: 's3://bucket/kinesis-checkpoint', savepoints: 'null', asynchronous: UNDEFINED, fileStateThreshold: -1), localRocksDbDirectories=null, enableIncrementalCheckpointing=TRUE}
2018-11-20 16:34:20,203 INFO org.apache.flink.runtime.jobmaster.JobMaster - Configuring application-defined state backend with job/cluster config
2018-11-20 16:34:22,624 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Starting job bd0d5dbaeba3990a3bef1eebee49cd79 from savepoint s3://bucket/kinesis-pipeline-checkpoint/8a6e5aeebeef202a2daddd3cf9419a80 ()
2018-11-20 16:34:22,663 ERROR org.apache.flink.runtime.rest.handler.job.JobSubmitHandler - Exception occurred in REST handler.
org.apache.flink.runtime.rest.handler.RestHandlerException: Job submission failed.
at org.apache.flink.runtime.rest.handler.job.JobSubmitHandler.lambda$handleRequest$2(JobSubmitHandler.java:119)
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534)
at akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:20)
at akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:18)
at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436)
at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit job.
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$submitJob$2(Dispatcher.java:256)
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:690)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
... 4 more
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit job.
... 24 more
Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:708)
at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:687)
... 18 more
Caused by: java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
at org.apache.flink.util.function.ConsumerWithException.accept(ConsumerWithException.java:40)
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$waitForTerminatingJobManager$29(Dispatcher.java:820)
at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:705)
... 19 more
Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:176)
at org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:936)
at org.apache.flink.runtime.dispatcher.Dispatcher.createJobManagerRunner(Dispatcher.java:291)
at org.apache.flink.runtime.dispatcher.Dispatcher.runJob(Dispatcher.java:281)
at org.apache.flink.runtime.dispatcher.Dispatcher.persistAndRunJob(Dispatcher.java:266)
at org.apache.flink.util.function.ConsumerWithException.accept(ConsumerWithException.java:38)
... 21 more
Caused by: java.io.FileNotFoundException: Cannot find meta data file '_metadata' in directory 's3://sledfs/kinesis-pipeline-checkpoint/8a6e5aeebeef202a2daddd3cf9419a80'. Please try to load the checkpoint/savepoint directly from the metadata file instead of the directory.
at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpointPointer(AbstractFsCheckpointStorage.java:256)
at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpoint(AbstractFsCheckpointStorage.java:109)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1102)
at org.apache.flink.runtime.jobmaster.JobMaster.tryRestoreExecutionGraphFromSavepoint(JobMaster.java:1220)
at org.apache.flink.runtime.jobmaster.JobMaster.createAndRestoreExecutionGraph(JobMaster.java:1144)
at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:295)
at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:157)
... 26 more
2018-11-20 16:37:52,321 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2018-11-20 16:37:52,322 INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
2018-11-20 16:37:52,340 INFO org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at 0.0.0.0:39955
The checkpoint you are referring to s3://bucket/kinesis-pipeline-checkpoint/a8a9ceb95845c3ea9833e025b5771470 does not contain a valid _metadata file. This indicates that this checkpoint was started but could not be completed. Please choose a checkpoint which has been successfully completed.
I am trying to start Solr in SolrCloud mode. I have created a new collection from collection1 and changed its name in file core.properties by setting the property name=logmail.
But when I start Solr, I am getting the following error
$ java -Dcollection.configName=logmail -DzkRun -Dnumshards=2 -DBootstrap_confdir=./solr/logmail/conf -jar start.jar
2165 [main] INFO org.apache.solr.common.cloud.ZkStateReader –
Updating cluster state from ZooKeeper... 2179
[OverseerStateUpdate-94955713964081152-127.0.1.1:8983_solr-n_0000000001]
INFO org.apache.solr.cloud.Overseer – Starting to work on the main
queue 2197 [main] INFO org.apache.solr.core.CoresLocator – Looking
for core definitions underneath /home/rahul/Desktop/dev/solrcloud/solr
2203 [main] INFO org.apache.solr.core.CoresLocator – Found core
logmail in /home/rahul/Desktop/dev/solrcloud/solr/logmail/ 2204 [main]
INFO org.apache.solr.core.CoresLocator – Found core collection1 in
/home/rahul/Desktop/dev/solrcloud/solr/collection1/ 2204 [main] INFO
org.apache.solr.core.CoresLocator – Found 2 core definitions 2207
[coreLoadExecutor-6-thread-1] INFO org.apache.solr.cloud.ZkController
– publishing core=logmail state=down collection=logmail 2207
[coreLoadExecutor-6-thread-2] INFO org.apache.solr.cloud.ZkController
– publishing core=collection1 state=down collection=collection1 2208
[coreLoadExecutor-6-thread-1] INFO org.apache.solr.cloud.ZkController
– numShards not found on descriptor - reading it from system property
2208 [coreLoadExecutor-6-thread-2] INFO
org.apache.solr.cloud.ZkController – numShards not found on
descriptor - reading it from system property 2214
[coreLoadExecutor-6-thread-1] INFO org.apache.solr.cloud.ZkController
– look for our core node name 2214 [coreLoadExecutor-6-thread-1] INFO
org.apache.solr.cloud.ZkController – waiting to find shard id in
clusterstate for logmail 2214 [zkCallback-2-thread-1] INFO
org.apache.solr.cloud.DistributedQueue – NodeChildrenChanged fired on
path /overseer/queue state SyncConnected 2215
[coreLoadExecutor-6-thread-1] INFO org.apache.solr.cloud.ZkController
– Check for collection zkNode:logmail 2222
[coreLoadExecutor-6-thread-2] INFO org.apache.solr.cloud.ZkController
– look for our core node name 2222 [coreLoadExecutor-6-thread-1] INFO
org.apache.solr.cloud.ZkController – Creating collection in
ZooKeeper:logmail 2222 [coreLoadExecutor-6-thread-2] INFO
org.apache.solr.cloud.ZkController – waiting to find shard id in
clusterstate for collection1 2223 [coreLoadExecutor-6-thread-1] INFO
org.apache.solr.cloud.ZkController – Looking for collection
configName 2223 [coreLoadExecutor-6-thread-2] INFO
org.apache.solr.cloud.ZkController – Check for collection
zkNode:collection1 2224 [coreLoadExecutor-6-thread-2] INFO
org.apache.solr.cloud.ZkController – Creating collection in
ZooKeeper:collection1 2224 [coreLoadExecutor-6-thread-2] INFO
org.apache.solr.cloud.ZkController – Looking for collection
configName 2225 [coreLoadExecutor-6-thread-1] INFO
org.apache.solr.cloud.ZkController – Could not find collection
configName - pausing for 3 seconds and trying again - try: 1 2226
[coreLoadExecutor-6-thread-2] INFO org.apache.solr.cloud.ZkController
– Could not find collection configName - pausing for 3 seconds and
trying again - try: 1 2226
[OverseerStateUpdate-94955713964081152-127.0.1.1:8983_solr-n_0000000001]
INFO org.apache.solr.cloud.Overseer – Update state numShards=null
message={ "core":"logmail", "roles":null,
"base_url":"http://127.0.1.1:8983/solr",
"node_name":"127.0.1.1:8983_solr", "state":"down", "shard":null,
"collection":"logmail", "operation":"state"} 2226
[OverseerStateUpdate-94955713964081152-127.0.1.1:8983_solr-n_0000000001]
INFO org.apache.solr.cloud.Overseer – node=core_node1 is already
registered 2227
[OverseerStateUpdate-94955713964081152-127.0.1.1:8983_solr-n_0000000001]
INFO org.apache.solr.cloud.Overseer – shard=shard1 is already
registered 2255 [zkCallback-2-thread-1] INFO
org.apache.solr.common.cloud.ZkStateReader – A cluster state change:
WatchedEvent state:SyncConnected type:NodeDataChanged
path:/clusterstate.json, has occurred - updating... (live nodes size:
1) 2268
[OverseerStateUpdate-94955713964081152-127.0.1.1:8983_solr-n_0000000001]
INFO org.apache.solr.cloud.Overseer – Update state numShards=null
message={ "core":"collection1", "roles":null,
"base_url":"http://127.0.1.1:8983/solr",
"node_name":"127.0.1.1:8983_solr", "state":"down", "shard":null,
"collection":"collection1", "operation":"state"} 2268
[OverseerStateUpdate-94955713964081152-127.0.1.1:8983_solr-n_0000000001]
INFO org.apache.solr.cloud.Overseer – node=core_node1 is already
registered 2269
[OverseerStateUpdate-94955713964081152-127.0.1.1:8983_solr-n_0000000001]
INFO org.apache.solr.cloud.Overseer – shard=shard1 is already
registered 2288 [zkCallback-2-thread-1] INFO
org.apache.solr.cloud.DistributedQueue – NodeChildrenChanged fired on
path /overseer/queue state SyncConnected 2318 [zkCallback-2-thread-1]
INFO org.apache.solr.common.cloud.ZkStateReader – A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/clusterstate.json, has occurred - updating... (live nodes size:
1) 5227 [coreLoadExecutor-6-thread-1] INFO
org.apache.solr.cloud.ZkController – Could not find collection
configName - pausing for 3 seconds and trying again - try: 2 5228
[coreLoadExecutor-6-thread-2] INFO org.apache.solr.cloud.ZkController
– Could not find collection configName - pausing for 3 seconds and
trying again - try: 2 8229 [coreLoadExecutor-6-thread-1] INFO
org.apache.solr.cloud.ZkController – Could not find collection
configName - pausing for 3 seconds and trying again - try: 3 8229
[coreLoadExecutor-6-thread-2] INFO org.apache.solr.cloud.ZkController
– Could not find collection configName - pausing for 3 seconds and
trying again - try: 3 11232 [coreLoadExecutor-6-thread-1] INFO
org.apache.solr.cloud.ZkController – Could not find collection
configName - pausing for 3 seconds and trying again - try: 4 11232
[coreLoadExecutor-6-thread-2] INFO org.apache.solr.cloud.ZkController
– Could not find collection configName - pausing for 3 seconds and
trying again - try: 4 14237 [coreLoadExecutor-6-thread-1] INFO
org.apache.solr.cloud.ZkController – Could not find collection
configName - pausing for 3 seconds and trying again - try: 5 14237
[coreLoadExecutor-6-thread-2] INFO org.apache.solr.cloud.ZkController
– Could not find collection configName - pausing for 3 seconds and
trying again - try: 5 17237 [coreLoadExecutor-6-thread-1] ERROR
org.apache.solr.cloud.ZkController – Could not find configName for
collection logmail 17238 [coreLoadExecutor-6-thread-2] ERROR
org.apache.solr.cloud.ZkController – Could not find configName for
collection collection1 17240 [coreLoadExecutor-6-thread-1] ERROR
org.apache.solr.core.CoreContainer – Error creating core [logmail]:
Could not find configName for collection logmail found:null
org.apache.solr.common.cloud.ZooKeeperException: Could not find
configName for collection logmail found:null
It looks like there may be a discrepancy between what Solr has on the filesystem for collections based on your commands and what is in zookeeper.
These are hard to fix; if possible I would recommend to delete your configuration files out of zookeeper and reload them.
You have a typo in your command. This should do the trick:
$ java -Dcollection.configName=logmail -DzkRun -Dnumshards=2 -Dbootstrap_confdir=./solr/logmail/conf -jar start.jar
I wanted to run a solr cloud with solr 4.3.0.
(I am using aws ubuntu-12.04-lts micro instances)
So I followed this toturial:
which basically says, start the zookeeper and connect the solr instances to it.
Here's how I start the zookeeper.
First I copied the config like described in the tutorial
sudo cp zookeeper-3.4.5/conf/zoo_sample.cfg zookeeper-3.4.5/conf/zoo.cfg
Then I started the zookeeper
ubuntu#ip-10-48-159-36:/opt$ sudo zookeeper-3.4.5/bin/zkServer.sh start
JMX enabled by default
Using config: /opt/zookeeper-3.4.5/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
Looks fine so far.
I checked the status:
ubuntu#ip-10-48-159-36:/opt$ sudo zookeeper-3.4.5/bin/zkServer.sh status
JMX enabled by default
Using config: /opt/zookeeper-3.4.5/bin/../conf/zoo.cfg
Error contacting service. It is probably not running.
Which seems a bit weird already.
If I try to connect with the client (remote as well as local), its seems to work
ubuntu#ip-10-234-223-69:/opt$ zookeeper-3.4.5/bin/zkCli.sh -server ec2-54-247-144-120.eu-west-1.compute.amazonaws.com:2181
Connecting to ec2-54-247-144-120.eu-west-1.compute.amazonaws.com:2181
2013-06-07 11:07:01,996 [myid:] - INFO [main:Environment#100] - Client environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT
2013-06-07 11:07:02,000 [myid:] - INFO [main:Environment#100] - Client environment:host.name=ip-10-234-223-69.eu-west-1.compute.internal
2013-06-07 11:07:02,000 [myid:] - INFO [main:Environment#100] - Client environment:java.version=1.6.0_27
2013-06-07 11:07:02,002 [myid:] - INFO [main:Environment#100] - Client environment:java.vendor=Sun Microsystems Inc.
2013-06-07 11:07:02,003 [myid:] - INFO [main:Environment#100] - Client environment:java.home=/usr/lib/jvm/java-6-openjdk-amd64/jre
2013-06-07 11:07:02,003 [myid:] - INFO [main:Environment#100] - Client environment:java.class.path=/opt/zookeeper-3.4.5/bin/../build/classes:/opt/zookeeper-3.4.5/bin/../build/lib/*.jar:/opt/zookeeper-3.4.5/bin/../lib/slf4j-log4j12-1.6.1.jar:/opt/zookeeper-3.4.5/bin/../lib/slf4j-api-1.6.1.jar:/opt/zookeeper-3.4.5/bin/../lib/netty-3.2.2.Final.jar:/opt/zookeeper-3.4.5/bin/../lib/log4j-1.2.15.jar:/opt/zookeeper-3.4.5/bin/../lib/jline-0.9.94.jar:/opt/zookeeper-3.4.5/bin/../zookeeper-3.4.5.jar:/opt/zookeeper-3.4.5/bin/../src/java/lib/*.jar:/opt/zookeeper-3.4.5/bin/../conf:
2013-06-07 11:07:02,004 [myid:] - INFO [main:Environment#100] - Client environment:java.library.path=/usr/lib/jvm/java-6-openjdk-amd64/jre/lib/amd64/server:/usr/lib/jvm/java-6-openjdk-amd64/jre/lib/amd64:/usr/lib/jvm/java-6-openjdk-amd64/jre/../lib/amd64:/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2013-06-07 11:07:02,008 [myid:] - INFO [main:Environment#100] - Client environment:java.io.tmpdir=/tmp
2013-06-07 11:07:02,009 [myid:] - INFO [main:Environment#100] - Client environment:java.compiler=<NA>
2013-06-07 11:07:02,018 [myid:] - INFO [main:Environment#100] - Client environment:os.name=Linux
2013-06-07 11:07:02,019 [myid:] - INFO [main:Environment#100] - Client environment:os.arch=amd64
2013-06-07 11:07:02,019 [myid:] - INFO [main:Environment#100] - Client environment:os.version=3.2.0-40-virtual
2013-06-07 11:07:02,020 [myid:] - INFO [main:Environment#100] - Client environment:user.name=ubuntu
2013-06-07 11:07:02,020 [myid:] - INFO [main:Environment#100] - Client environment:user.home=/home/ubuntu
2013-06-07 11:07:02,021 [myid:] - INFO [main:Environment#100] - Client environment:user.dir=/opt
2013-06-07 11:07:02,029 [myid:] - INFO [main:ZooKeeper#438] - Initiating client connection, connectString=ec2-54-247-144-120.eu-west-1.compute.amazonaws.com:2181 sessionTimeout=30000 watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher#182d9c06
Welcome to ZooKeeper!
2013-06-07 11:07:02,074 [myid:] - INFO [main-SendThread(ip-10-48-159-36.eu-west-1.compute.internal:2181):ClientCnxn$SendThread#966] - Opening socket connection to server ip-10-48-159-36.eu-west-1.compute.internal/10.48.159.36:2181. Will not attempt to authenticate using SASL (unknown error)
JLine support is enabled
[zk: ec2-54-247-144-120.eu-west-1.compute.amazonaws.com:2181(CONNECTING) 0] 2013-06-07 11:07:32,100 [myid:] - INFO [main-SendThread(ip-10-48-159-36.eu-west-1.compute.internal:2181):ClientCnxn$SendThread#1083] - Client session timed out, have not heard from server in 30038ms for sessionid 0x0, closing socket connection and attempting reconnect
2013-06-07 11:07:33,204 [myid:] - INFO [main-SendThread(ip-10-48-159-36.eu-west-1.compute.internal:2181):ClientCnxn$SendThread#966] - Opening socket connection to server ip-10-48-159-36.eu-west-1.compute.internal/10.48.159.36:2181. Will not attempt to authenticate using SASL (unknown error)
Now I tried to connect a solr instance to it. In the web interface of tomcat7 it only tells me "503 - Server is shutting down", so I checked the solr logs
2013-06-07 11:16:36,065 [pool-2-thread-1] INFO org.apache.solr.servlet.SolrDispatchFilter . SolrDispatchFilter.init()
2013-06-07 11:16:36,100 [pool-2-thread-1] INFO org.apache.solr.core.SolrResourceLoader . Using JNDI solr.home: /opt/solr-4.3.0/example/solr
2013-06-07 11:16:36,132 [pool-2-thread-1] INFO org.apache.solr.core.CoreContainer . looking for solr config file: /opt/solr-4.3.0/example/solr/solr.xml
2013-06-07 11:16:36,138 [pool-2-thread-1] INFO org.apache.solr.core.CoreContainer . New CoreContainer 1285984216
2013-06-07 11:16:36,146 [pool-2-thread-1] INFO org.apache.solr.core.CoreContainer . Loading CoreContainer using Solr Home: '/opt/solr-4.3.0/example/solr/'
2013-06-07 11:16:36,152 [pool-2-thread-1] INFO org.apache.solr.core.SolrResourceLoader . new SolrResourceLoader for directory: '/opt/solr-4.3.0/example/solr/'
2013-06-07 11:16:36,567 [pool-2-thread-1] INFO org.apache.solr.handler.component.HttpShardHandlerFactory . Setting socketTimeout to: 0
2013-06-07 11:16:36,568 [pool-2-thread-1] INFO org.apache.solr.handler.component.HttpShardHandlerFactory . Setting urlScheme to: http://
2013-06-07 11:16:36,568 [pool-2-thread-1] INFO org.apache.solr.handler.component.HttpShardHandlerFactory . Setting connTimeout to: 0
2013-06-07 11:16:36,568 [pool-2-thread-1] INFO org.apache.solr.handler.component.HttpShardHandlerFactory . Setting maxConnectionsPerHost to: 20
2013-06-07 11:16:36,568 [pool-2-thread-1] INFO org.apache.solr.handler.component.HttpShardHandlerFactory . Setting corePoolSize to: 0
2013-06-07 11:16:36,568 [pool-2-thread-1] INFO org.apache.solr.handler.component.HttpShardHandlerFactory . Setting maximumPoolSize to: 2147483647
2013-06-07 11:16:36,568 [pool-2-thread-1] INFO org.apache.solr.handler.component.HttpShardHandlerFactory . Setting maxThreadIdleTime to: 5
2013-06-07 11:16:36,569 [pool-2-thread-1] INFO org.apache.solr.handler.component.HttpShardHandlerFactory . Setting sizeOfQueue to: -1
2013-06-07 11:16:36,569 [pool-2-thread-1] INFO org.apache.solr.handler.component.HttpShardHandlerFactory . Setting fairnessPolicy to: false
2013-06-07 11:16:36,578 [pool-2-thread-1] INFO org.apache.solr.client.solrj.impl.HttpClientUtil . Creating new http client, config:maxConnectionsPerHost=20&maxConnections=10000&socketTimeout=0&connTimeout=0&retry=false
2013-06-07 11:16:36,879 [pool-2-thread-1] INFO org.apache.solr.core.CoreContainer . Registering Log Listener
2013-06-07 11:16:36,881 [pool-2-thread-1] INFO org.apache.solr.core.CoreContainer . Zookeeper client=ec2-54-247-144-120.eu-west-1.compute.amazonaws.com:2181
2013-06-07 11:16:36,888 [pool-2-thread-1] INFO org.apache.solr.client.solrj.impl.HttpClientUtil . Creating new http client, config:maxConnections=500&maxConnectionsPerHost=16&socketTimeout=0&connTimeout=0
2013-06-07 11:16:37,040 [pool-2-thread-1] INFO org.apache.solr.common.cloud.ConnectionManager . Waiting for client to connect to ZooKeeper
2013-06-07 11:16:52,046 [pool-2-thread-1] ERROR org.apache.solr.servlet.SolrDispatchFilter . Could not start Solr. Check solr/home property and the logs
2013-06-07 11:16:52,103 [pool-2-thread-1] ERROR org.apache.solr.core.SolrCore . null:java.lang.RuntimeException: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper ec2-54-247-144-120.eu-west-1.compute.amazonaws.com:2181 within 15000 ms
at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:130)
at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:88)
at org.apache.solr.cloud.ZkController.<init>(ZkController.java:170)
at org.apache.solr.core.CoreContainer.initZooKeeper(CoreContainer.java:242)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:495)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:358)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:326)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:124)
at org.apache.catalina.core.ApplicationFilterConfig.initFilter(ApplicationFilterConfig.java:277)
at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:258)
at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:382)
at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:103)
at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4638)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5294)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:895)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:871)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:615)
at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:649)
at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1581)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:679)
Caused by: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper ec2-54-247-144-120.eu-west-1.compute.amazonaws.com:2181 within 15000 ms
at org.apache.solr.common.cloud.ConnectionManager.waitForConnected(ConnectionManager.java:173)
at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:127)
... 25 more
2013-06-07 11:16:52,104 [pool-2-thread-1] INFO org.apache.solr.servlet.SolrDispatchFilter . SolrDispatchFilter.init() done
What does it tell me?
On the same instance I just connected with the client successfully... :(
So where is the problem?
[Edit:]
Instead of using amazons ec**.amazon.* address I used the network addresses 10.X.X.X for telling solr where the zookeeper is.
It seems to work.
You have your answer - Your ZooKeeper in inaccessible!
Check your firewall configuration.
You can also check it with
zkCli.sh -server localhost:2181
There must have been some sort of connectivity problem.
I see you have it resolved now.
Next time you run into a situation like this, you should log onto the box that is having problems connecting and use telnet to see if you can connect.
eg: from your solr box:
telnet ec2-54-247-144-120.eu-west-1.compute.amazonaws.com 2181
and then try from the zk box too. It should start to illuminate where your issues are.
That eliminates any application layer issues and will tell you quite reliably wether or not you can connect. It you can't connect, then it's almost always some sort of security issue - either a firewall running somewhere (try - $service iptables stop) or it will be an issue with security group configuration in amazon.
The last potential problem is network availability. Despite what people think, the network is NOT reliable and should never be considered so. Anyone working in SOA/distributed systems will know this well :)
http://aphyr.com/posts/288-the-network-is-reliable
"A team from the University of Toronto and Microsoft Research studied the behavior of network failures in several of Microsoft’s datacenters. They found an average failure rate of 5.2 devices per day and 40.8 links per day with a median time to repair of approximately five minutes (and up to one week). "
While setting up SolrCloud and ZooKeeper I also ran into the "Error contacting service. It is probably not running." issue. The reason was a typo in a file name that ZooKeeper needs. The correct file name is "myid". I wrote "myip" by mistake. After the renaming of the file and restarting ZooKeeper (./zkServer.sh restart), my issue was resolved.
try to stop your solr instance solr.shutdown() so that you can create new CloudSolrServer instance for each thread