spark ssc.textFileStream is not streaming any files from directory - filesystems

I am trying to execute the code below from Eclipse (with a Maven configuration) against a cluster with 2 workers, each with 2 cores; I have also tried it with spark-submit.
import java.io.Serializable;
import java.util.List;

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingWorkCount implements Serializable {
    public static void main(String[] args) {
        Logger.getLogger("org.apache.spark").setLevel(Level.WARN);
        JavaStreamingContext jssc = new JavaStreamingContext(
                "spark://192.168.1.19:7077", "JavaWordCount",
                new Duration(1000));
        // Monitor the directory for new text files
        JavaDStream<String> trainingData = jssc.textFileStream(
                "/home/bdi-user/kaushal-drive/spark/data/training").cache();
        trainingData.foreach(new Function<JavaRDD<String>, Void>() {
            public Void call(JavaRDD<String> rdd) throws Exception {
                List<String> output = rdd.collect();
                System.out.println("Sentences Collected from files " + output);
                return null;
            }
        });
        trainingData.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
And the log from that run:
15/01/22 21:57:13 INFO FileInputDStream: New files at time 1421944033000 ms:
15/01/22 21:57:13 INFO JobScheduler: Added jobs for time 1421944033000 ms
15/01/22 21:57:13 INFO JobScheduler: Starting job streaming job 1421944033000 ms.0 from job set of time 1421944033000 ms
15/01/22 21:57:13 INFO SparkContext: Starting job: foreach at StreamingKMean.java:33
15/01/22 21:57:13 INFO DAGScheduler: Job 3 finished: foreach at StreamingKMean.java:33, took 0.000094 s
Sentences Collected from files []
-------------------------------------------
15/01/22 21:57:13 INFO JobScheduler: Finished job streaming job 1421944033000 ms.0 from job set of time 1421944033000 ms
Time: 1421944033000 ms
-------------------------------------------
15/01/22 21:57:13 INFO JobScheduler: Starting job streaming job 1421944033000 ms.1 from job set of time 1421944033000 ms
15/01/22 21:57:13 INFO JobScheduler: Finished job streaming job 1421944033000 ms.1 from job set of time 1421944033000 ms
15/01/22 21:57:13 INFO JobScheduler: Total delay: 0.028 s for time 1421944033000 ms (execution: 0.013 s)
15/01/22 21:57:13 INFO MappedRDD: Removing RDD 5 from persistence list
15/01/22 21:57:13 INFO BlockManager: Removing RDD 5
15/01/22 21:57:13 INFO FileInputDStream: Cleared 0 old files that were older than 1421943973000 ms:
15/01/22 21:57:13 INFO FileInputDStream: Cleared 0 old files that were older than 1421943973000 ms:
15/01/22 21:57:13 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer()
The problem is that I am not getting any data from the files that are already in the directory. Please help me.

Try it with another directory, and then copy the files into that directory while the job is running.

I had the same problem.
Here is my code:
lines = jssc.textFileStream("file:///Users/projects/spark/test/data");
textFileStream is very sensitive; what I ended up doing was:
1. Run the Spark program
2. touch datafile
3. mv datafile datafile2
4. mv datafile2 /Users/projects/spark/test/data
and that did it.

I think you need to add the scheme, i.e. file:// or hdfs:// in front of your path.
To be clear: it is in fact file:// or hdfs:// that needs to be added in front of the path, so the full path becomes file:///tmp/file.txt or hdfs:///user/data. If there is no NameNode set in the configuration, the latter needs to be hdfs://host:port/user/data.

The JavaDoc suggests that this function only streams new files.
Ref:
https://spark.apache.org/docs/1.0.1/api/java/org/apache/spark/streaming/api/java/JavaStreamingContext.html#textFileStream(java.lang.String)
Create an input stream that monitors a Hadoop-compatible filesystem for new files and reads them as text files (using key as LongWritable, value as Text and input format as TextInputFormat). Files must be written to the monitored directory by "moving" them from another location within the same file system. File names starting with . are ignored.
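In other words, the file has to appear in the watched directory atomically, which a same-filesystem rename gives you. A minimal Java sketch of that pattern (the staging path and file name are made up for illustration; the training directory is the one from the question):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class MoveIntoWatchedDir {
    public static void main(String[] args) throws Exception {
        // Write the file somewhere outside the monitored directory (on the same filesystem),
        // then rename it in, so Spark sees a complete "new" file in a single step.
        Path staged = Paths.get("/home/bdi-user/kaushal-drive/spark/data/staging/part-0001.txt");
        Path target = Paths.get("/home/bdi-user/kaushal-drive/spark/data/training/part-0001.txt");
        Files.move(staged, target, StandardCopyOption.ATOMIC_MOVE);
    }
}

This is essentially what the touch/mv workaround above does by hand.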

textFileStream only monitors a folder for new files that are added while the stream is running.
If you just want to read files that already exist, you can use SparkContext.textFile instead (see the sketch below).
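A minimal sketch of that batch alternative, assuming the Spark 1.x Java API and reusing the master URL and directory from the question (the class name is just for illustration):

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class BatchRead {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setMaster("spark://192.168.1.19:7077")
                .setAppName("BatchRead");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // textFile reads every file that already exists under the directory, once, as a batch RDD
        JavaRDD<String> lines = sc.textFile("/home/bdi-user/kaushal-drive/spark/data/training");
        List<String> output = lines.collect();
        System.out.println("Sentences collected from files " + output);

        sc.stop();
    }
}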

You have to take into account that Spark Streaming will only read new files in the directory, not ones updated after they are already there, and they must all have the same format.
Source

I've been scratching my head for hours, and what worked for me is
the answer from https://stackoverflow.com/a/33030590/1170677:
I had forgotten to start the streaming process, so make sure you call ssc.start().

Related

Flink checkpoint not replaying the kafka events which were in process during the savepoint/checkpoint

I want to test end-to-end exactly-once processing in Flink. My job is:
Kafka-source -> mapper1 -> mapper-2 -> kafka-sink
I put a Thread.sleep(100000) in mapper1 and then ran the job. I took a savepoint while stopping the job and then removed the Thread.sleep(100000) from mapper1, expecting the event to be replayed since it had not reached the sink. But that did not happen, and the job just waits for new events.
My Kafka source:
KafkaSource.<String>builder()
.setBootstrapServers(consumerConfig.getBrokers())
.setTopics(consumerConfig.getTopic())
.setGroupId(consumerConfig.getGroupId())
.setStartingOffsets(OffsetsInitializer.latest())
.setValueOnlyDeserializer(new SimpleStringSchema())
.setProperty("commit.offsets.on.checkpoint", "true")
.build();
My Kafka sink:
KafkaSink.<String>builder()
.setBootstrapServers(producerConfig.getBootstrapServers())
.setDeliverGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
.setRecordSerializer(KafkaRecordSerializationSchema.builder()
.setTopic(producerConfig.getTopic())
.setValueSerializationSchema(new SimpleStringSchema()).build())
.build();
My environment setup for the Flink job:
StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
environment.enableCheckpointing(2000);
environment.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
environment.getCheckpointConfig().setMinPauseBetweenCheckpoints(100);
environment.getCheckpointConfig().setCheckpointTimeout(60000);
environment.getCheckpointConfig().setTolerableCheckpointFailureNumber(2);
environment.getCheckpointConfig().setExternalizedCheckpointCleanup(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
environment.getCheckpointConfig().setCheckpointTimeout(1000);
environment.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
environment.getCheckpointConfig().enableUnalignedCheckpoints();
environment.getCheckpointConfig().setCheckpointStorage("file:///tmp/flink-checkpoints");
Configuration configuration = new Configuration();
configuration.set(ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH, true);
environment.configure(configuration);
What am I doing wrong here?
I want any event that is in process during the cancellation/stop of the job to be reprocessed on restart.
EDIT 1:
I observed that Kafka was showing an offset lag for my Flink kafka-source consumer group. I am assuming this means my checkpointing is behaving correctly, is that right?
I also observed that when I restarted my job from the checkpoint, it did not start consuming from the remaining offsets, even though I have the consumer offset set to EARLIEST. I had to send more events to trigger consumption on the kafka-source side, and then it consumed all the events.
For exactly-once, you must provide a TransactionalIdPrefix that is unique across all applications running against the same Kafka cluster (this is a change compared to the legacy FlinkKafkaProducer):
KafkaSink<T> sink =
KafkaSink.<T>builder()
.setBootstrapServers(...)
.setKafkaProducerConfig(...)
.setRecordSerializer(...)
.setDeliverGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
.setTransactionalIdPrefix("unique-id-for-your-app")
.build();
When resuming from a checkpoint, Flink always uses the offsets stored in the checkpoint rather than those configured in the code or stored in the broker.
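As a side note on the EDIT: setStartingOffsets is only consulted on a fresh start with no checkpoint or savepoint. If the intent is "use the committed group offsets, falling back to EARLIEST", that can be expressed in the OffsetsInitializer itself. A sketch assuming the new KafkaSource connector, with placeholder broker/topic/group values:

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.kafka.clients.consumer.OffsetResetStrategy;

// Only consulted when starting without restored state: read the committed group offsets,
// falling back to EARLIEST if the group has none. On recovery from a checkpoint/savepoint,
// the offsets stored in the snapshot take precedence over this setting.
KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("localhost:9092")   // placeholder
        .setTopics("input-topic")                // placeholder
        .setGroupId("my-consumer-group")         // placeholder
        .setStartingOffsets(
                OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST))
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();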

How to monitor Apache Flink in AWS EMR (ElasticMapReduce)?

I currently have Flink set up with a job running on EMR, and I'm now trying to add monitoring by sending metrics off to Prometheus.
I have come across an issue with running Flink on EMR. I'm using Terraform to provision EMR (I run Ansible afterwards to download and run a job). Out of the box, it does not look like EMR's Flink distribution includes the optional jars (flink-metrics-prometheus, flink-cep, etc.).
Looking at Flink's documentation, it says
"In order to use this reporter you must copy /opt/flink-metrics-prometheus-1.6.1.jar into the /lib folder of your Flink distribution"
https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/metrics.html#prometheuspushgateway-orgapacheflinkmetricsprometheusprometheuspushgatewayreporter
But when logging into the EMR master node, neither /etc/flink nor /usr/lib/flink has a directory called opt, and I cannot see flink-metrics-prometheus-1.6.1.jar anywhere.
I know Flink has other optional libs you'd usually have to copy if you want to use them, such as flink-cep, but I'm not sure how to do this when using EMR.
This is the exception I get, which I believe is because it cannot find the metrics jar on its classpath:
java.lang.ClassNotFoundException: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.flink.runtime.metrics.MetricRegistryImpl.<init>(MetricRegistryImpl.java:144)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.createMetricRegistry(ClusterEntrypoint.java:419)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:276)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:227)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:191)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:190)
at org.apache.flink.yarn.entrypoint.YarnSessionClusterEntrypoint.main(YarnSessionClusterEntrypoint.java:137)
EMR resource in Terraform:
resource "aws_emr_cluster" "emr_flink" {
name = "ce-emr-flink-arn"
release_label = "emr-5.20.0" # 5.21.0 is not found, could be a region thing
applications = ["Flink"]
ec2_attributes {
key_name = "ce_test"
subnet_id = "${aws_subnet.ce_test_subnet_public.id}"
instance_profile = "${aws_iam_instance_profile.emr_profile.arn}"
emr_managed_master_security_group = "${aws_security_group.allow_all_vpc.id}"
emr_managed_slave_security_group = "${aws_security_group.allow_all_vpc.id}"
additional_master_security_groups = "${aws_security_group.external_connectivity.id}"
additional_slave_security_groups = "${aws_security_group.external_connectivity.id}"
}
ebs_root_volume_size = 100
master_instance_type = "m4.xlarge"
core_instance_type = "m4.xlarge"
core_instance_count = 2
service_role = "${aws_iam_role.iam_emr_service_role.arn}"
configurations_json = <<EOF
[
{
"Classification": "flink-conf",
"Properties": {
"parallelism.default": "8",
"state.backend": "RocksDB",
"state.backend.async": "true",
"state.backend.incremental": "true",
"state.savepoints.dir": "file:///savepoints",
"state.checkpoints.dir": "file:///checkpoints",
"web.submit.enable": "true",
"metrics.reporter.promgateway.class": "org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter",
"metrics.reporter.promgateway.host": "${aws_instance.monitoring.private_ip}",
"metrics.reporter.promgateway.port": "9091",
"metrics.reporter.promgateway.jobName": "ce-test",
"metrics.reporter.promgateway.randomJobNameSuffix": "true",
"metrics.reporter.promgateway.deleteOnShutdown": "false"
}
}
]
EOF
}
I suspect I may have to download the jar in the bootstrap stage, but I wanted to check here first and see if there are any examples of this being done.
I haven't used Terraform, but note that you typically need to provision (set up jars) on both the master and the slaves in EMR. One way to figure out where EMR thinks jars should go is to log onto a slave when a job is running, do ps auxwww | grep java, find the TaskManager process, look at the jars added to the classpath when it launched, and find where those are located on the server. Or at least that worked for me in the past.
I selected the EMR release emr-5.24.0 and got monitoring working successfully with the InfluxDB .jar.
I copied the .jar file into the /usr/lib/flink/lib folder and restarted the Flink cluster with the following bash commands (with sudo permission):
/usr/lib/flink/bin/stop-cluster.sh && /usr/lib/flink/bin/start-cluster.sh
I assume you can solve your issue with the same steps for Prometheus:
[ec2-user@ip-10-0-11-17 ~]$ ls /usr/lib/flink/opt/flink-metrics-*
flink-metrics-datadog-1.8.0.jar    flink-metrics-influxdb-1.8.0.jar     flink-metrics-slf4j-1.8.0.jar
flink-metrics-graphite-1.8.0.jar   flink-metrics-prometheus-1.8.0.jar   flink-metrics-statsd-1.8.0.jar
[ec2-user@ip-10-0-11-17 ~]$ ll /usr/lib/flink/opt/flink-metrics-prometheus-1.8.0.jar
-rw-r--r-- 1 root root 101984 may 14 19:21 /usr/lib/flink/opt/flink-metrics-prometheus-1.8.0.jar
[ec2-user@ip-10-0-11-17 ~]$ uname -a
Linux ip-10-0-11-17 4.14.114-83.126.amzn1.x86_64 #1 SMP Tue May 7 02:26:58 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Apache Zeppelin Flink Interpretor can not connect Flink 1.5.2

I tried to connect to Flink 1.5.2 from Zeppelin.
The Flink cluster was started on the local host with the default flink-conf.yaml:
/opt/flink1.5.2/current/bin/start-cluster.sh
Zeppelin
I did a git clone and built from source, then started the daemon:
git clone https://github.com/apache/zeppelin.git
cd zeppelin
mvn clean package -DskipTests -Dflink.version=1.5.2 -Pscala-2.11
bin/zeppelin-daemon.sh start
The interpreter is configured as in the screenshot (omitted here).
Then I submitted the sample Flink notebook:
%flink // let Zeppelin know what interpreter to use.
val text = benv.fromElements("In the time of chimpanzees, I was a monkey", // some lines of text to analyze
"Butane in my veins and I'm out to cut the junkie",
"With the plastic eyeballs, spray paint the vegetables",
"Dog food stalls with the beefcake pantyhose",
"Kill the headlights and put it in neutral",
"Stock car flamin' with a loser in the cruise control",
"Baby's in Reno with the Vitamin D",
"Got a couple of couches, sleep on the love seat",
"Someone came in sayin' I'm insane to complain",
"About a shotgun wedding and a stain on my shirt",
"Don't believe everything that you breathe",
"You get a parking violation and a maggot on your sleeve",
"So shave your face with some mace in the dark",
"Savin' all your food stamps and burnin' down the trailer park",
"Yo, cut it")
/* The meat and potatoes:
this tells Flink to iterate through the elements, in this case strings,
transform the string to lower case and split the string at white space into individual words
then finally aggregate the occurrence of each word.
This creates the count variable which is a list of tuples of the form (word, occurances)
counts.collect().foreach(println(_)) // execute the script and print each element in the counts list
*/
val counts = text.flatMap{ _.toLowerCase.split("\\W+") }.map { (_,1) }.groupBy(0).sum(1)
counts.collect().foreach(println(_)) // execute the script and print each element in the counts list
Error
ERROR [2018-09-11 19:21:18,004] ({pool-2-thread-2} Job.java[run]:174) - Job failed
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:125)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:485)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1081)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:502)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:487)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:904)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:198)
at org.apache.flink.shaded.netty4.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:348)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:748)
INFO [2018-09-11 19:21:18,034] ({pool-2-thread-2} SchedulerFactory.java[jobFinished]:115) - Job paragraph_1536657575367_1976000602 finished by scheduler interpreter_1496279000
Why is the address already in use?
I do not want to run the job on a Flink instance started inside Zeppelin;
I want to run the job on the Flink cluster that already exists.

Appengine deployments are extraordinarily slow today?

We have a small Java project that we need to deploy.
It includes 9000+ files.
Command: mvn gcloud:deploy
But I get this log:
...
[INFO] INFO: Uploading [/home/steven/work/idigisign/target/appengine-staging/__static__/node_modules/rx/src/core/linq/observable/when.js] to [7dfb30ad32893c5042dba03601f006a40419fab0]
[INFO] DEBUG: Uploading [/home/steven/work/idigisign/target/appengine-staging/assets/global/plugins/bootstrap-switch/js/bootstrap-switch.min.js] to [7e0725897d7b99c3c33b56915d202e2dde552ea9]
[INFO] INFO: Uploading [/home/steven/work/idigisign/target/appengine-staging/assets/global/plugins/bootstrap-switch/js/bootstrap-switch.min.js] to [7e0725897d7b99c3c33b56915d202e2dde552ea9]
[INFO] DEBUG: Uploading [/home/steven/work/idigisign/target/appengine-staging/node_modules/is-redirect/index.js] to [7e0afe4775bf7f8558665760171c01948c22f771]
[INFO] INFO: Uploading [/home/steven/work/idigisign/target/appengine-staging/node_modules/is-redirect/index.js] to [7e0afe4775bf7f8558665760171c01948c22f771]
[INFO] DEBUG: Uploading [/home/steven/work/idigisign/target/appengine-staging/node_modules/rxjs/src/util/Map.ts] to [7e11722f4cd9ce91ec99b97710fbc4e7f40be09d]
...
About 50 files per minute,
so it will take about 180 minutes...
It is extraordinarily slow.
Can anybody help me?
Set the environment variable CLOUDSDK_APP_USE_GSUTIL=1 and try again; this uses a less-reliable but faster codepath for file upload (there are plans to speed up the default codepath).
We had the same issue; it was very slow. I think we have solved it.
First, we traced the gcloud logs and found that many files were being uploaded again even though they had not been modified. So we traced the source code of gcloud and found that the issue is caused by the "Google Cloud Storage JSON API".
When it queried the list of the bucket, it returned 1000 items, but we have 1325 items, so we figured we had found the issue.
Then we looked at the API reference and found a parameter, maxResults; we tried modifying the source code (cloud_storage.py), but it has no effect when its value is over 1000.
Finally, we found another parameter, nextPageToken, and we now query the list until nextPageToken is None. With that, it gets all items from Google Cloud Storage, and files that already exist are not uploaded again.
def ListBucket(bucket_ref, client):
  request = STORAGE_MESSAGES.StorageObjectsListRequest(bucket=bucket_ref.bucket)
  items = set()
  try:
    response = client.objects.List(request)
    for item in response.items:
      items.add(item.name)
    while response.nextPageToken:
      request = STORAGE_MESSAGES.StorageObjectsListRequest(bucket=bucket_ref.bucket,
                                                           pageToken=response.nextPageToken)
      response = client.objects.List(request)
      for item in response.items:
        items.add(item.name)
  except api_exceptions.HttpError as e:
    raise UploadError('Error uploading files: {e}'.format(e=e))
  return items

Error while importing mysql database

I am getting this error while importing a database to my local machine. I am using WampServer. Can anyone please help me?
Fatal error: Maximum execution time of 360 seconds exceeded in
C:\wamp\apps\phpmyadmin4.1.14\libraries\dbi\DBIMysqli.class.php on line 285
and line 285 is: return mysqli_query($link, $query, $method);
Thanks
Option 1
Use the MySQL console
Go to wampmanager -> MySQL -> MySQL Console (or, from a Windows cmd in the mysql folder, run mysql.exe -u root), then:
USE YourDatabase;
SOURCE C:/yourpath/file.sql;
Option 2
Modify phpmyadmin.conf (in the alias folder):
php_admin_value upload_max_filesize 128M
php_admin_value post_max_size 128M
php_admin_value max_execution_time 360
php_admin_value max_input_time 360
</Directory>
Change the sizes to what you want.
Sample values (depending on your needs):
post_max_size = 750M
upload_max_filesize = 750M
max_execution_time = 5000
max_input_time = 5000
memory_limit = 1000M
For answer completeness:
If the above doesn't work (it should), then go ahead and add
$cfg['ExecTimeLimit'] = <LargeValue>; (something like 5000-6000)
to phpMyAdmin\libraries\config.inc.php.
Don't edit config.default.php directly.
Try adding this at the beginning of your PHP file:
ini_set('max_execution_time', 0); // 0 means no time limit; note the directive name is lowercase
If you are using WAMP and the problem is caused by it:
Increase max_execution_time in the php.ini file used by phpMyAdmin/Apache, then go to
C:\wamp\apps\phpmyadmin3.4.10.1\libraries (or change the path according to your installation),
open config.default.php, and change the value of $cfg['ExecTimeLimit'] to 0:
$cfg['ExecTimeLimit'] = 0;
this should resolve your issue.
