I have developed a Word Count program using PyFlink. The program does not throw any errors, yet it does not produce the desired output. According to the code, it should create a new text file, but no file is generated when it executes. Kindly help; my code is attached below.
from flink.plan.Constants import WriteMode
from flink.plan.Environment import get_environment
from flink.functions.FlatMapFunction import FlatMapFunction
from flink.functions.GroupReduceFunction import GroupReduceFunction
from pyflink import datastream
from pyflink.common import WatermarkStrategy, Encoder, Types
from pyflink.datastream import StreamExecutionEnvironment, RuntimeExecutionMode
from pyflink.datastream.connectors import (FileSource, StreamFormat, FileSink, OutputFileConfig, RollingPolicy)
class Tokenizer(FlatMapFunction):
    def flat_map(self, value, collector):
        for word in value.lower().split(","):
            if len(word) > 1:
                collector.collect((word, 1))
if __name__ == '__main__':
    env = get_environment()
    env.set_parallelism(2)
    data = env.read_text("h.txt")
    tokenized = data.flat_map(Tokenizer())
    count = tokenized.group_by(0).sum(1)
    count.write_text("D:/Cyber Security/Apache Flink")
Try calling env.execute("Word Count Example...") at the end of the program; that call is what actually kicks off the execution.
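For example, the end of the main block could look like this (a minimal sketch; the job name string is arbitrary):
if __name__ == '__main__':
    env = get_environment()
    env.set_parallelism(2)
    data = env.read_text("h.txt")
    tokenized = data.flat_map(Tokenizer())
    count = tokenized.group_by(0).sum(1)
    count.write_text("D:/Cyber Security/Apache Flink")
    # Flink builds the job lazily; nothing runs until execute() is called.
    env.execute("Word Count Example")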
import pandas as pd
import os
import sqlalchemy
import sys
import time
from sqlalchemy import create_engine
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler
class EventHandler(FileSystemEventHandler):
    def on_any_event(self, event):
        print("EVENT")
        print(event.event_type)
        print(event.src_path)
        print()
if __name__ == "__main__":
    path = 'input path here'
    event_handler = EventHandler()
    observer = Observer()
    observer.schedule(event_handler, path, recursive=True)
    print("Monitoring started")
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()

    engine = create_engine('Database information')
    cursor = engine.raw_connection().cursor()
    for file in os.listdir('.'):
        file_basename, extension = file.split('.')
        if extension == 'xlsx':
            df = pd.read_excel(os.path.abspath(file))
            df.to_sql(file_basename, con=engine, if_exists='replace')
I am okay with the first part, up to where observer.join() ends. But the next part, starting with engine = create_engine(...), is where I am having trouble: that part is supposed to send the file to the SQL server, but the code is not doing that. I have researched on the internet but haven't found anything. Any help is appreciated.
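For reference, a minimal sketch of what that Excel-to-SQL-Server step typically looks like with pandas and SQLAlchemy (the connection string below is a placeholder assumption; adjust the driver, server, and database names to your environment):
import os
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string (assumption) -- swap in your own server,
# database, credentials, and ODBC driver name.
engine = create_engine(
    "mssql+pyodbc://user:password@SERVER/DATABASE?driver=ODBC+Driver+17+for+SQL+Server"
)

for file in os.listdir('.'):
    name, ext = os.path.splitext(file)
    if ext.lower() == '.xlsx':
        df = pd.read_excel(os.path.abspath(file))
        # Each workbook becomes (or replaces) a table named after the file.
        df.to_sql(name, con=engine, if_exists='replace', index=False)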
The CSV file is about 47 MB. If I do not use batch, it takes too much time to load the file, but QPS can reach 8000+:
[screen capture: without batch]
But if I use batch(10000), or even batch(100000), QPS is very low, about 800+ (the file has 100,000 records):
[screen capture: use batch(100000)]
The script looks like this:
import io.gatling.core.Predef._
import io.gatling.core.feeder.BatchableFeederBuilder
import io.gatling.core.structure.ScenarioBuilder
import io.gatling.http.Predef._
import io.gatling.http.protocol.HttpProtocolBuilder
import scala.concurrent.duration._
class SuggestDot extends Simulation {
val protoal: HttpProtocolBuilder = http.warmUp("https://www.baidu.com")
.baseUrl("http://192.168.106.142:8080")
.connectionHeader("keep-alive")
.check(jsonPath("$.ret").ofType[Int] is 0)
.check(jsonPath("$.data.data[0]") exists)
val files: BatchableFeederBuilder[String]#F = csv("suggestDot.csv").batch(100000).circular
val test: ScenarioBuilder = scenario("lalamap.myhll.cn").forever(
feed(files).exec(
http("suggest/doc")
.get("${url}&isNewVersion=true")
)
)
setUp(
test.inject(atOnceUsers(100)).protocols(protoal)
).maxDuration(60 seconds)
}
This is something that will be fixed in the upcoming Gatling 3.4.0; see https://github.com/gatling/gatling/issues/3943 and https://github.com/gatling/gatling/issues/3944.
BTW, I can see from your screen captures that you've forked Gatling. Please consider being a good open-source citizen and contributing upstream.
I am experiencing an infinite loop when using an AsyncFunction in unordered mode.
It can be reproduced using the following code:
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.AsyncFunction;
import org.junit.Test;
import java.util.Arrays;
import java.util.Collections;
import java.util.concurrent.TimeUnit;
public class AsyncTest {
@Test
public void test() throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Integer> withTimestamps = env.fromCollection(Arrays.asList(1,2,3,4,5));
AsyncDataStream.unorderedWait(withTimestamps,
(AsyncFunction<Integer, String>) (input, collector) -> {
if (input == 3){
collector.collect(new RuntimeException("Test"));
return;
}
collector.collect(Collections.singleton("Ok"));
}, 10, TimeUnit.MILLISECONDS)
.returns(String.class)
.print();
env.execute("unit-test");
}
}
My guess would be that the UnorderedStreamElementQueue is the reason for the infinite loop.
It seems to add the StreamElementQueueEntry containing the failing future into its firstSet field but never removes it (as this failing future does not trigger the onCompleteHandler method).
Does anyone know if this could be right, or am I making a mistake in my code?
There is indeed a bug in the AsyncWaitOperator. The problem is that it does not react to results that are exceptions (user exceptions as well as timeout exceptions).
In the case of ordered waiting, this means that exceptional results are only detected once another async result has completed (without an exception).
In the case of unordered waiting, this means that the StreamElementQueueEntry is never moved from the firstSet into the completedQueue.
An issue has been filed for this problem; it will hopefully be fixed in the next couple of days.
I've tried everything to get the microtime in Flash using the sampler class but to no avail. Here is my code so far:
import flash.sampler.Sample;
import flash.sampler.getLexicalScopes;
import flash.sampler.getMemberNames;
import flash.sampler.getSampleCount;
import flash.sampler.getSamples;
import flash.sampler.getSize;
import flash.sampler.startSampling;
import flash.sampler.stopSampling;
private function init():void {
startSampling();
var x:String = "Hello world";
stopSampling();
var samples:Object = getSamples();
var sampleCount:int = getSampleCount(); // 0
}
Here are the Sampler and SamplerScript extension classes; the implementation calls setconst_time to get the time, and it may also be calling another method:
uint64_t Sampler::nowMicros()
{
return GC::ticksToMicros(VMPI_getPerformanceCounter());
}
As seen in the Sampler core class.
I was hoping there was a getMicroTime() method (along the lines of getTimer()), but there is not. So I was trying to create samples before and after a code block. Calling getSamples() should return an array of Sample objects, and each Sample instance should have a time property with the time in microseconds. But with the code above, no samples are taken; the sample count is zero.
I'm trying to export Google Cloud Datastore data to Avro files in Google Cloud Storage and then load those files into BigQuery.
Firstly, I know that BigQuery loads Datastore backups. This has several disadvantages that I'd like to avoid:
Backup tool is closed source
Backup tool format is undocumented.
Backup tool format cannot be read directly by Dataflow
Backup scheduling for appengine is in (apparently perpetual) alpha.
It is possible to implement your own backup handler in appengine, but it is fire and forget. You won't know when exactly the backup has finished or what the file name will be.
With the motivation for this experiment clarified, here is my Dataflow pipeline to export the data to Avro format:
package com.example.dataflow;
import com.google.api.services.datastore.DatastoreV1;
import com.google.api.services.datastore.DatastoreV1.Entity;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.coders.AvroCoder;
import com.google.cloud.dataflow.sdk.io.AvroIO;
import com.google.cloud.dataflow.sdk.io.DatastoreIO;
import com.google.cloud.dataflow.sdk.io.Read;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.file.SeekableByteArrayInput;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.protobuf.ProtobufData;
import org.apache.avro.protobuf.ProtobufDatumWriter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.ByteArrayOutputStream;
public class GCDSEntitiesToAvroSSCCEPipeline {
private static final String GCS_TARGET_URI = "gs://myBucket/datastore/dummy";
private static final String ENTITY_KIND = "Dummy";
private static Schema getSchema() {
return ProtobufData.get().getSchema(Entity.class);
}
private static final Logger LOG = LoggerFactory.getLogger(GCDSEntitiesToAvroSSCCEPipeline.class);
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(options);
DatastoreV1.Query.Builder q = DatastoreV1.Query.newBuilder()
.addKind(DatastoreV1.KindExpression.newBuilder().setName(ENTITY_KIND));
p.apply(Read.named("DatastoreQuery").from(DatastoreIO.source()
.withDataset(options.as(DataflowPipelineOptions.class).getProject())
.withQuery(q.build())))
.apply(ParDo.named("ProtoBufToAvro").of(new ProtoBufToAvro()))
.setCoder(AvroCoder.of(getSchema()))
.apply(AvroIO.Write.named("WriteToAvro")
.to(GCS_TARGET_URI)
.withSchema(getSchema())
.withSuffix(".avro"));
p.run();
}
private static class ProtoBufToAvro extends DoFn<Entity, GenericRecord> {
private static final long serialVersionUID = 1L;
@Override
public void processElement(ProcessContext c) throws Exception {
Schema schema = getSchema();
ProtobufDatumWriter<Entity> pbWriter = new ProtobufDatumWriter<>(Entity.class);
DataFileWriter<Entity> dataFileWriter = new DataFileWriter<>(pbWriter);
ByteArrayOutputStream bos = new ByteArrayOutputStream();
dataFileWriter.create(schema, bos);
dataFileWriter.append(c.element());
dataFileWriter.close();
DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(
new SeekableByteArrayInput(bos.toByteArray()), datumReader);
c.output(dataFileReader.next());
}
}
}
The pipeline runs fine; however, when I try to load the resulting Avro file into BigQuery, I get the following error:
bq load --project_id=roodev001 --source_format=AVRO dummy.dummy_1 gs://roodev001.appspot.com/datastore/dummy-00000-of-00001.avro
Waiting on bqjob_r5c9b81a49572a53b_00000154951eb523_1 ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'roodev001:bqjob_r5c9b81a49572a53b_00000154951eb523_1': The Apache Avro library failed to parse file
gs://roodev001.appspot.com/datastore/dummy-00000-of-00001.avro.
However, if I load the resulting Avro file with avro-tools, everything is just fine:
avro-tools tojson datastore-dummy-00000-of-00001.avro | head
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
{"key":{"com.google.api.services.datastore.DatastoreV1$.Key":{"partition_id":{"com.google.api.services.datastore.DatastoreV1$.PartitionId":{"dataset_id":"s~roodev001","namespace":""}},"path_element":[{"kind":"Dummy","id":4503905778008064,"name":""}]}},"property":[{"name":"number","value":{"boolean_value":false,"integer_value":879,"double_value":0.0,"timestamp_microseconds_value":0,"key_value":null,"blob_key_value":"","string_value":"","blob_value":"","entity_value":null,"list_value":[],"meaning":0,"indexed":true}}]}
...
I used this code to populate the datastore with dummy data before running the Dataflow pipeline:
package com.example.datastore;
import com.google.gcloud.AuthCredentials;
import com.google.gcloud.datastore.*;
import java.io.IOException;
public static void main(String[] args) throws IOException {
Datastore datastore = DatastoreOptions.builder()
.projectId("myProjectId")
.authCredentials(AuthCredentials.createApplicationDefaults())
.build().service();
KeyFactory dummyKeyFactory = datastore.newKeyFactory().kind("Dummy");
Batch batch = datastore.newBatch();
int batchCount = 0;
for (int i = 0; i < 4000; i++){
IncompleteKey key = dummyKeyFactory.newKey();
System.out.println("adding entity " + i);
batch.add(Entity.builder(key).set("number", i).build());
batchCount++;
if (batchCount > 99) {
batch.submit();
batch = datastore.newBatch();
batchCount = 0;
}
}
System.out.println("done");
}
So why is BigQuery rejecting my Avro files?
BigQuery uses the C++ Avro library, and apparently it doesn't like the "$" in the namespace. Here's the error message:
Invalid namespace: com.google.api.services.datastore.DatastoreV1$
We're working on getting these Avro error messages out to the end user.
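One quick way to confirm the offending namespace is to read the writer schema embedded in the generated file, for example with the fastavro Python package (a minimal sketch, assuming the file has been copied down from GCS):
from fastavro import reader

# Print the namespace recorded in the file's writer schema; for the file
# above it ends with "$", which the C++ Avro parser rejects.
with open("dummy-00000-of-00001.avro", "rb") as fo:
    avro_reader = reader(fo)
    print(avro_reader.writer_schema.get("namespace"))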