I have a Flink job running in AWS Kinesis Analytics that does the following.
1 - I have a table on a Kinesis stream, called MainEvents.
2 - I have a sink table that points to a Kinesis stream, called perMinute.
The perMinute table is populated from the MainEvents table using a sliding-window (HOP) aggregation.
So far so good.
My final consumer is a Python script that reads its input from the perMinute Kinesis stream.
This is my consumer script:
import time

import boto3

stream_name = 'perMinute'
ses = boto3.session.Session()
kinesis_client = ses.client('kinesis')
# Only the first shard of the stream is read
response = kinesis_client.describe_stream(StreamName=stream_name)
shard_id = response['StreamDescription']['Shards'][0]['ShardId']
response = kinesis_client.get_shard_iterator(
    StreamName=stream_name,
    ShardId=shard_id,
    ShardIteratorType='LATEST'
)
shard_iterator = response['ShardIterator']
while shard_iterator is not None:
    result = kinesis_client.get_records(ShardIterator=shard_iterator, Limit=1)
    records = result["Records"]
    shard_iterator = result["NextShardIterator"]
    for record in records:
        data = str(record["Data"])
        print(data)
    time.sleep(1)
The issue I have is that I get encoded data that looks like this:
b'{"window_start":"2022-09-28 04:01:46","window_end":"2022-09-28 04:02:46","counts":300}'
b'{"window_start":"2022-09-28 04:02:06","window_end":"2022-09-28 04:03:06","counts":478}'
b'\xf3\x89\x9a\xc2\n$4a540599-485d-47c5-9a7e-ca46173b30de\n$2349a5a3-7949-4bde-95a8-4019a077586b\x1aX\x08\x00\x1aT{"window_start":"2022-09-28 04:02:16","window_end":"2022-09-28 04:03:16","counts":504}\x1aX\x08\x01\x1aT{"window_start":"2022-09-28 04:02:18","window_end":"2022-09-28 04:03:18","counts":503}\xc3\xa1\xfe\xfa9j\xeb\x1aP\x917F\xf3\xd2\xb7\x02'
b'\xf3\x89\x9a\xc2\n$23a0d76c-6939-4eda-b5ee-8cd2b3dc1c1e\n$7ddf1c0c-16fe-47a0-bd99-ef9470cade28\x1aX\x08\x00\x1aT{"window_start":"2022-09-28 04:02:30","window_end":"2022-09-28 04:03:30","counts":531}\x1aX\x08\x01\x1aT{"window_start":"2022-09-28 04:02:36","window_end":"2022-09-28 04:03:36","counts":560}\x0c>.\xbd\x0b\xac.\x9a\xe8z\x04\x850\xd5\xa6\xb3'
b'\xf3\x89\x9a\xc2\n$2cacfdf8-a09b-4fa3-b032-6f1707c966c3\n$27458e17-8a3a-434e-9afd-4995c8e6a1a4\n$11774332-d906-4486-a959-28ceec0d134a\x1aY\x08\x00\x1aU{"window_start":"2022-09-28 04:02:42","window_end":"2022-09-28 04:03:42","counts":1625}\x1aY\x08\x01\x1aU{"window_start":"2022-09-28 04:02:50","window_end":"2022-09-28 04:03:50","counts":2713}\x1aY\x08\x02\x1aU{"window_start":"2022-09-28 04:03:00","window_end":"2022-09-28 04:04:00","counts":3009}\xe1G\x18\xe7_a\x07\xd3\x81O\x03\xf9Q\xaa\x0b_'
Some records are valid (the first two), while the other records seem to contain multiple entries on the same row.
How can I get rid of the extra characters that are not part of the JSON payload and get one line per invocation?
If I use decode('utf-8'), a few records come out fine, but at some point it fails:
while shard_iterator is not None:
    result = kinesis_client.get_records(ShardIterator=shard_iterator, Limit=1)
    records = result["Records"]
    shard_iterator = result["NextShardIterator"]
    for record in records:
        data = record["Data"].decode('utf-8')
        # data = record["Data"].decode('latin-1')
        print(data)
    time.sleep(1)
{"window_start":"2022-09-28 03:59:24","window_end":"2022-09-28 04:00:24","counts":319}
{"window_start":"2022-09-28 03:59:28","window_end":"2022-09-28 04:00:28","counts":366}
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-108-0e632a57c871> in <module>
39 shard_iterator = result["NextShardIterator"]
40 for record in records:
---> 41 data = record["Data"].decode('utf-8')
43 print(data)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-2: invalid continuation byte
If I use decode('latin-1') it does not fail, but I get a lot of garbage text in the output:
{"window_start":"2022-09-28 04:02:06","window_end":"2022-09-28 04:03:06","counts":478}
óÂ
$4a540599-485d-47c5-9a7e-ca46173b30de
$2349a5a3-7949-4bde-95a8-4019a077586bXT{"window_start":"2022-09-28 04:02:16","window_end":"2022-09-28 04:03:16","counts":504}XT{"window_start":"2022-09-28 04:02:18","window_end":"2022-09-28 04:03:18","counts":503}áþú9jëP7FóÒ·
óÂ
Here is the Flink code for the stream producer:
-- create sink
CREATE TABLE perMinute (
  window_start TIMESTAMP(3) NOT NULL,
  window_end TIMESTAMP(3) NOT NULL,
  counts BIGINT NOT NULL
)
WITH (
  'connector' = 'kinesis',
  'stream' = 'perMinute',
  'aws.region' = 'ap-southeast-2',
  'scan.stream.initpos' = 'LATEST',
  'format' = 'json',
  'json.timestamp-format.standard' = 'ISO-8601'
);
%flink.ssql(type=update)
insert into perMinute
SELECT window_start, window_end, COUNT(DISTINCT event) as counts
FROM TABLE(
  HOP(TABLE MainEvents, DESCRIPTOR(eventtime), INTERVAL '5' SECOND, INTERVAL '60' SECOND))
GROUP BY window_start, window_end;
Thanks
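For what it's worth, the `\xf3\x89\x9a\xc2` prefix on the longer records is the magic number of the Kinesis Producer Library (KPL) aggregation format: several user records are packed into one Kinesis record as a protobuf message, followed by a 16-byte MD5 checksum. The sketch below is a minimal hand-rolled de-aggregator under that assumption. The names `deaggregate`, `_fields` and `_read_varint` are made up for illustration, the checksum is not verified, and a maintained library (for example the `aws_kinesis_agg` package) is probably the safer choice in practice.

```python
KPL_MAGIC = b"\xf3\x89\x9a\xc2"  # prefix of a KPL-aggregated record
MD5_LEN = 16                     # trailing checksum, not verified in this sketch


def _read_varint(buf, pos):
    """Decode a protobuf varint starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def _fields(buf):
    """Yield (field_number, wire_type, value) for each field of a protobuf message."""
    pos = 0
    while pos < len(buf):
        key, pos = _read_varint(buf, pos)
        field_no, wire_type = key >> 3, key & 0x07
        if wire_type == 0:        # varint
            value, pos = _read_varint(buf, pos)
        elif wire_type == 2:      # length-delimited (strings, bytes, sub-messages)
            length, pos = _read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError("unexpected wire type %d" % wire_type)
        yield field_no, wire_type, value


def deaggregate(data):
    """Return the user payloads (bytes) contained in one Kinesis record."""
    if not data.startswith(KPL_MAGIC):
        return [data]             # plain record, nothing to unpack
    body = data[len(KPL_MAGIC):-MD5_LEN]
    payloads = []
    for field_no, wire_type, value in _fields(body):
        if field_no == 3 and wire_type == 2:        # one embedded record
            for sub_no, sub_type, sub_value in _fields(value):
                if sub_no == 3 and sub_type == 2:   # that record's data bytes
                    payloads.append(sub_value)
    return payloads
```

In the consumer loop this would replace the direct decode, along the lines of `for payload in deaggregate(record["Data"]): print(payload.decode("utf-8"))`, so that each JSON document is printed on its own line.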
I have the following Flink streaming application running locally, written with the SQL API:
object StreamingKafkaJsonsToCsvLocalFs {
  val brokers = "localhost:9092"
  val topic = "test-topic"
  val consumerGroupId = "test-consumer"
  val kafkaTableName = "KafKaTable"
  val targetTable = "TargetCsv"
  val targetPath = f"file://${new java.io.File(".").getCanonicalPath}/kafka-to-fs-csv"

  def generateKafkaTableDDL(): String = {
    s"""
       |CREATE TABLE $kafkaTableName (
       |  `kafka_offset` BIGINT METADATA FROM 'offset',
       |  `seller_id` STRING
       |) WITH (
       |  'connector' = 'kafka',
       |  'topic' = '$topic',
       |  'properties.bootstrap.servers' = 'localhost:9092',
       |  'properties.group.id' = '$consumerGroupId',
       |  'scan.startup.mode' = 'earliest-offset',
       |  'format' = 'json'
       |)
       |""".stripMargin
  }

  def generateTargetTableDDL(): String = {
    s"""
       |CREATE TABLE $targetTable (
       |  `kafka_offset` BIGINT,
       |  `seller_id` STRING
       |)
       |WITH (
       |  'connector' = 'filesystem',
       |  'path' = '$targetPath',
       |  'format' = 'csv',
       |  'sink.rolling-policy.rollover-interval' = '10 seconds',
       |  'sink.rolling-policy.check-interval' = '1 seconds'
       |)
       |""".stripMargin
  }

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI()
    env.enableCheckpointing(1000)
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    env.getCheckpointConfig.setCheckpointStorage(s"$targetPath/checkpoints")
    val settings = EnvironmentSettings.newInstance()
      .inStreamingMode()
      .build()
    val tblEnv = StreamTableEnvironment.create(env, settings)
    tblEnv.executeSql(generateKafkaTableDDL())
    tblEnv.executeSql(generateTargetTableDDL())
    tblEnv.from(kafkaTableName).executeInsert(targetTable).await()
    tblEnv.executeSql("kafka-json-to-fs")
  }
}
As you can see, the checkpointing is enabled and when I execute this application I see that the checkpoint folder is created and populated.
The problem I am facing is that when I stop and start my application (from the IDE), I expect it to resume from the point where the previous execution stopped. Instead, it consumes everything again from the earliest offset in the topic (I can see this because the newly generated output files contain offset zero, although the previous run already processed those offsets).
What am I missing about checkpointing in Flink? I would expect it to be exactly once.
Flink only restarts from a checkpoint when recovering from a failure, or when explicitly restarted from a retained checkpoint via the command line or REST API. Otherwise, the KafkaSource starts from the offsets configured in the code, which defaults to the earliest offsets.
If you have no other state, you could instead rely on the committed offsets as the source of truth, and configure the Kafka connector to use the committed offsets as the starting position.
Flink's fault tolerance via checkpointing isn't designed to support mini-cluster deployments like the one used when running in an IDE. Normally the job manager and task managers are running in separate processes, and the job manager can detect that a task manager has failed, and can arrange for a restart.
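As a rough illustration of the committed-offsets suggestion above: the Kafka SQL connector accepts `'scan.startup.mode' = 'group-offsets'`, which starts from the offsets committed for the consumer group instead of the earliest ones. The snippet is written in Python only to keep it self-contained; `kafka_table_ddl` is a hypothetical helper, and the same option would go into `generateKafkaTableDDL()` in the question. The `properties.auto.offset.reset` line is an assumption about what should happen on the very first run, when nothing has been committed yet.

```python
def kafka_table_ddl(table, topic, brokers, group_id, startup_mode="group-offsets"):
    """Build a Flink SQL DDL string for a Kafka source table."""
    return f"""
CREATE TABLE {table} (
  `kafka_offset` BIGINT METADATA FROM 'offset',
  `seller_id` STRING
) WITH (
  'connector' = 'kafka',
  'topic' = '{topic}',
  'properties.bootstrap.servers' = '{brokers}',
  'properties.group.id' = '{group_id}',
  'scan.startup.mode' = '{startup_mode}',
  'properties.auto.offset.reset' = 'earliest',
  'format' = 'json'
)
"""


# Same table, topic and group names as in the question
ddl = kafka_table_ddl("KafKaTable", "test-topic", "localhost:9092", "test-consumer")
print(ddl)
```

With checkpointing enabled, the Kafka source commits its offsets back to Kafka when a checkpoint completes, so a plain stop and start from the IDE would pick up roughly where the last completed checkpoint left off. This gives at-least-once behaviour across restarts, not exactly-once.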
According to the Snowflake docs, when a user executes a copy command it will return 1 of 3 status values:
loaded
load failed
partially loaded
My question is: if I use the Python Snowflake Connector (see example code below) to execute a copy command, is an exception raised if the status returned is load failed or partially loaded?
Thank you!
import snowflake.connector

copy_dml = 'copy into database.schema.table ' \
           'from @fully_qualified_stage pattern = \'.*' + table_name + '.*[.]json\' ' \
           'file_format = (format_name = fully_qualified_json_format) ' \
           'force = true;'
try:
    #---------------------------------------------------------------------------
    # snowflake variables
    snowflake_warehouse = credentials.iloc[0]['snowflake_warehouse']
    snowflake_account = credentials.iloc[0]['snowflake_account']
    snowflake_role = credentials.iloc[0]['snowflake_role']
    snowflake_username = credentials.iloc[0]['Username']
    snowflake_password = credentials.iloc[0]['Password']
    snowflake_connection = ''
    cs = ''  # snowflake connection cursor
    exec_copy_dml = ''
    copy_result_field_metadata = ''
    copy_result = ''
    snowflake_copy_result_df = ''
    #---------------------------------------------------------------------------
    # load JSON file(s) into Snowflake
    snowflake_connection = snowflake.connector.connect(
        user=snowflake_username,
        password=snowflake_password,
        account=snowflake_account,
        warehouse=snowflake_warehouse,
        role=snowflake_role)
    cs = snowflake_connection.cursor()
    exec_copy_dml = cs.execute(copy_dml)
    copy_result = exec_copy_dml.fetchall()
    copy_result_field_metadata = cs.description
    snowflake_copy_result_df = snowflake_results_df(copy_result_field_metadata, copy_result)
except snowflake.connector.errors.ProgrammingError as copy_error:
    copy_exception_message = "There was a problem loading JSON files to Snowflake," + \
        "a snowflake.connector.errors.ProgrammingError exception was raised."
    print(copy_exception_message)
    raise
except Exception as error_message:
    raise
finally:
    snowflake_connection.close()
I believe it won't raise an exception for the load status; you have to check the load status yourself and take action if required.
After you issue your COPY INTO statement, you can run the following query:
SELECT * FROM TABLE(VALIDATE(TABLE_NAME, job_id => '_last'))
This gives you details on the files that you were trying to load. It normally returns an empty result unless you encountered issues during the load.
You can save these results in an object and adjust your control flow accordingly.
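To make the "check the load status" step concrete, here is a minimal sketch that inspects the rows returned by the COPY statement itself. It assumes the result has one row per file with `file`, `status` and `first_error` columns, and that the status text is one of LOADED, LOAD_FAILED or PARTIALLY_LOADED. The function name `failed_files` is hypothetical, and the column names should be checked against what your account actually returns.

```python
def failed_files(description, rows):
    """Return (file, status, first_error) for every file that did not load fully.

    description: cursor.description from the COPY statement (name is item 0)
    rows:        the result of fetchall() on the same cursor
    """
    names = [col[0].lower() for col in description]
    if "file" not in names or "status" not in names:
        # e.g. a single informational row when no files were processed
        return []
    file_idx = names.index("file")
    status_idx = names.index("status")
    error_idx = names.index("first_error") if "first_error" in names else None
    problems = []
    for row in rows:
        status = str(row[status_idx]).upper()
        if status != "LOADED":
            first_error = row[error_idx] if error_idx is not None else None
            problems.append((row[file_idx], status, first_error))
    return problems
```

With the code from the question this could be called as `failed_files(cs.description, copy_result)`, raising your own exception when the returned list is not empty.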
I am trying to solve a classic ETL problem using streaming. I have a batch of segments. Each segment holds information about its records (number of records, URL to retrieve them from, etc.), which is used to issue an HTTP request to collect the data. I need to extract the records from a source with a page size of 100 records, merge the pages of records for each segment, and wrap the result in an XML header and footer. Each XML payload (one per segment) is then sent to a target.
                     {http}
                     page 1
                   /        \
         seg 1  >    page 2   -> merge -> wrapHeaderAndFooter -> http target
       /           \        /
      /              page n
     /
    /
batch - seg 2   "   -> http target
      \ seg n   "   -> http target
val loadSegment: Flow[Segment, Response, NotUsed] = {
  Flow[Segment].mapAsync(parallelism = 5) { segment =>
    val pages: Source[ByteString, NotUsed] = pagedPayload(segment).map(page => page.payload)
    // Using source concatenation to prepend and append
    val wrappedInXML: Source[ByteString, NotUsed] = xmlRootStartTag ++ pages ++ xmlRootEndTag
    val httpEntity: HttpEntity = HttpEntity(MediaTypes.`application/octet-stream`, pages)
    invokeTargetLoad(httpEntity, request, segment)
  }
}

def pagedPayload(segment: Segment): Source[Payload, NotUsed] = {
  val totalPages: Int = calculateTotalPages(segment.instanceCount)
  Source(0 until totalPages).mapAsyncUnordered(parallelism = 5)(i => {
    sendPayloadRequest(request, segment, i).mapTo[Try[Payload]].map(_.get)
  })
}

val batch: Batch = someBatch
Source(batch.segments)
  .via(loadSegment)
  .runWith(Sink.ignore)
  .andThen {
    case Success(value) => log("success")
    case Failure(error) => report(error)
  }
Is there a better approach? I am trying to use HttpEntity.Chunked encoding to stream the pages. Sometimes the first request to the source takes longer because of warm-up, and the target truncates the stream with no data. Is there a way to delay the actual connection to the target until the first page is available in the stream?
I would prefer to do something like the following. If that is possible, how would I implement wrapXMLHeader and toHttpEntity?
val splitPages: Flow[BuildSequenceSegment, Seq[PageRequest], NotUsed] = ???
val requestPayload: Flow[Seq[PageRequest], Seq[PageResponse], NotUsed] = ???
val wrapXMLHeader: Flow[Seq[PageResponse], Seq[PageResponse], NotUsed] = ???
val toHttpEntity: Flow[Seq[PageResponse], HttpEntity.Chunked, NotUsed] = ???
val invokeTargetLoad: Flow[HttpEntity.Chunked, RestResponse, NotUsed] = ???

Source(batch.segments)
  .via(splitPages)
  .via(requestPayload)
  .via(wrapXMLHeader)
  .via(toHttpEntity)
  .via(invokeTargetLoad)
  .runWith(Sink.ignore)
I am currently working on a Google Cloud project in free trial mode. I have a cron job that fetches data from a data vendor and stores it in the datastore. I wrote the fetching code a couple of weeks ago and it was all working fine, but for the last two days I have suddenly been receiving the error "DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded". I believe a cron job is only supposed to time out after 60 minutes. Any idea why I am getting this error?
cron task
def run():
    try:
        config = cron.config
        actual_data_source = config['xxx']['xxxx']
        original_data_source = actual_data_source
        company_list = cron.rest_client.load(config, "companies", '')
        if not company_list:
            logging.info("Company list is empty")
            return "Ok"
        for row in company_list:
            company_repository.save(row, original_data_source, actual_data_source)
        return "OK"
    except Exception:
        # handler assumed; the snippet as posted ends before its except clause
        logging.exception("cron task: error occurred")
        raise
Repository code
def save(dto, org_ds, act_dp):
    try:
        key = 'FIN/%s' % (dto['ticker'])
        company = CompanyInfo(id=key)
        company.stock_code = key
        company.ticker = dto['ticker']
        company.name = dto['name']
        company.original_data_source = org_ds
        company.actual_data_provider = act_dp
        company.put()
        return company
    except Exception:
        logging.exception("company_repository: error occurred saving the company record ")
        raise
RestClient
def load(config, resource, filter):
    try:
        username = config['xxxx']['xxxx']
        password = config['xxxx']['xxxx']
        headers = {"Authorization": "Basic %s" % base64.b64encode(username + ":" + password)}
        if filter:
            from_date = filter['from']
            to_date = filter['to']
            ticker = filter['ticker']
            start_date = datetime.strptime(from_date, '%Y%m%d').strftime("%Y-%m-%d")
            end_date = datetime.strptime(to_date, '%Y%m%d').strftime("%Y-%m-%d")
        current_page = 1
        data = []
        while True:
            if (filter):
                url = config['xxxx']["endpoints"][resource] % (ticker, current_page, start_date, end_date)
            else:
                url = config['xxxx']["endpoints"][resource] % (current_page)
            response = urlfetch.fetch(
                url=url,
                deadline=60,
                method=urlfetch.GET,
                headers=headers,
                follow_redirects=False,
            )
            if response.status_code != 200:
                logging.error("xxxx GET received status code %d!" % (response.status_code))
                logging.error("error happened for url: %s with headers %s", url, headers)
                return 'Sorry, xxxx API request failed', 500
            db = json.loads(response.content)
            if not db['data']:
                break
            data.extend(db['data'])
            if db['total_pages'] == current_page:
                break
            current_page += 1
        return data
    except Exception:
        logging.exception("Error occurred with xxxx API request")
        raise
I'm guessing this is the same question as this, but now with more code:
DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded
I modified your code to write to the database after each urlfetch. If there are more pages, then it relaunches itself in a deferred task, which should be well before the 10 minute timeout.
Uncaught exceptions in a deferred task cause it to retry, so be mindful of that.
It was unclear to me how actual_data_source & original_data_source worked, but I think you should be able to modify that part.
crontask
from google.appengine.ext import deferred

def run(current_page=0):
    try:
        config = cron.config
        actual_data_source = config['xxx']['xxxx']
        original_data_source = actual_data_source
        data, more = cron.rest_client.load(config, "companies", '', current_page)
        for row in data:
            company_repository.save(row, original_data_source, actual_data_source)
        # fetch the rest in a new deferred task
        if more:
            deferred.defer(run, current_page + 1)
    except Exception as e:
        logging.exception("run() experienced an error: %s" % e)
RestClient
def load(config, resource, filter, current_page):
    try:
        username = config['xxxx']['xxxx']
        password = config['xxxx']['xxxx']
        headers = {"Authorization": "Basic %s" % base64.b64encode(username + ":" + password)}
        if filter:
            from_date = filter['from']
            to_date = filter['to']
            ticker = filter['ticker']
            start_date = datetime.strptime(from_date, '%Y%m%d').strftime("%Y-%m-%d")
            end_date = datetime.strptime(to_date, '%Y%m%d').strftime("%Y-%m-%d")
            url = config['xxxx']["endpoints"][resource] % (ticker, current_page, start_date, end_date)
        else:
            url = config['xxxx']["endpoints"][resource] % (current_page)
        response = urlfetch.fetch(
            url=url,
            deadline=60,
            method=urlfetch.GET,
            headers=headers,
            follow_redirects=False,
        )
        if response.status_code != 200:
            logging.error("xxxx GET received status code %d!" % (response.status_code))
            logging.error("error happened for url: %s with headers %s", url, headers)
            return [], False
        db = json.loads(response.content)
        return db['data'], (db['total_pages'] != current_page)
    except Exception as e:
        logging.exception("Error occurred with xxxx API request: %s" % e)
        return [], False
I would prefer to write this as a comment, but I need more reputation to do that.
What happens when you run the actual data fetch directly instead of through the cron job?
Have you tried measuring a time delta from the start to the end of the job?
Has the number of companies being retrieved increased dramatically?
You appear to be doing some form of stock quote aggregation - is it possible that the provider has started blocking you?
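For the time-delta question, a small wrapper like the one below would be enough to see how long the job really runs. This is only a sketch: `run_timed` is a made-up helper name, and it logs through the standard `logging` module as the code in the question does.

```python
import logging
import time


def run_timed(job, *args, **kwargs):
    """Call job(*args, **kwargs) and log how long it took, even if it raises."""
    start = time.time()
    try:
        return job(*args, **kwargs)
    finally:
        elapsed = time.time() - start
        logging.info("%s finished after %.1f seconds", job.__name__, elapsed)
```

The cron handler could then call `run_timed(run)` instead of `run()`, and the elapsed time shows up in the request log next to the deadline error.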
I'm trying to exclude posts which have a tag named meta from my selection, by:
meta_id = db(db.tags.name == "meta").select().first().id
not_meta = ~db.posts.tags.contains(meta_id)
posts=db(db.posts).select(not_meta)
But those posts still show up in my selection.
What is the right way to write that expression?
My tables look like:
db.define_table('tags',
    db.Field('name', 'string'),
    db.Field('desc', 'text', default="")
)
db.define_table('posts',
    db.Field('title', 'string'),
    db.Field('message', 'text'),
    db.Field('tags', 'list:reference tags'),
    db.Field('time', 'datetime', default=datetime.utcnow())
)
I'm using Web2Py 1.99.7 on GAE with High Replication DataStore on Python 2.7.2
UPDATE:
I just tried posts=db(not_meta).select() as suggested by @Anthony, but it gives me a ticket with the following traceback:
Traceback (most recent call last):
File "E:\Programming\Python\web2py\gluon\restricted.py", line 205, in restricted
exec ccode in environment
File "E:/Programming/Python/web2py/applications/vote_up/controllers/default.py", line 391, in <module>
File "E:\Programming\Python\web2py\gluon\globals.py", line 173, in <lambda>
self._caller = lambda f: f()
File "E:/Programming/Python/web2py/applications/vote_up/controllers/default.py", line 8, in index
posts=db(not_meta).select()#orderby=settings.sel.posts, limitby=(0, settings.delta)
File "E:\Programming\Python\web2py\gluon\dal.py", line 7578, in select
return adapter.select(self.query,fields,attributes)
File "E:\Programming\Python\web2py\gluon\dal.py", line 3752, in select
(items, tablename, fields) = self.select_raw(query,fields,attributes)
File "E:\Programming\Python\web2py\gluon\dal.py", line 3709, in select_raw
filters = self.expand(query)
File "E:\Programming\Python\web2py\gluon\dal.py", line 3589, in expand
return expression.op(expression.first)
File "E:\Programming\Python\web2py\gluon\dal.py", line 3678, in NOT
raise SyntaxError, "Not suported %s" % first.op.__name__
SyntaxError: Not suported CONTAINS
UPDATE 2:
As ~ isn't currently working on GAE with Datastore, I'm using the following as a temporary work-around:
meta = db.posts.tags.contains(settings.meta_id)
all = db(db.posts).select()  # , limitby=(0, settings.delta)
meta = db(meta).select()
posts = []
i = 0
for post in all:
    if i == settings.delta: break
    if post in meta: continue
    else:
        posts.append(post)
        i += 1
# settings.delta is a long integer to be used with limitby
Try:
meta_id = db(db.tags.name == "meta").select().first().id
not_meta = ~db.posts.tags.contains(meta_id)
posts = db(not_meta).select()
First, your initial query returns a complete Row object, so you need to pull out just the "id" field. Second, not_meta is a Query object, so it goes inside db(not_meta) to create a Set object defining the set of records to select (the select() method takes a list of fields to return for each record, as well as a few other arguments, such as orderby, groupby, etc.).
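Where the adapter rejects the negated `contains` (as in the GAE traceback in the question), the exclusion can be done in Python after the select. The following is a minimal sketch of that idea: `exclude_tag` is a hypothetical helper, and it assumes each row exposes its `list:reference` field as `row.tags`, holding a list of tag ids or None.

```python
from itertools import islice


def exclude_tag(rows, tag_id, limit=None):
    """Return up to `limit` rows whose tags do not include tag_id."""
    kept = (row for row in rows if tag_id not in (row.tags or []))
    return list(islice(kept, limit))
```

With the names from the question this would be something like `posts = exclude_tag(db(db.posts).select(), meta_id, settings.delta)`. It needs only one query, but it still reads every post from the datastore before filtering.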