YCSB - Difference between recordcount, operationcount and insertcount parameters - benchmarking

Please, what is the difference between the recordcount, operationcount and insertcount parameters?
Thanks in advance.

As per this GitHub issue:
recordcount: the number of records YCSB assumes are present (or will be present) in the backing data store.
operationcount: the number of workload operations YCSB should perform (scan/read/whatever, according to the % breakdown in the workload).
insertcount: the number of records a particular YCSB client should insert during a load phase.
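For example, splitting the load phase across two clients could look something like this (a sketch with made-up numbers; recordcount, operationcount, insertstart and insertcount are YCSB core properties, while the basic binding and the workloada file are just placeholders):
# load phase, client 1: insert the first half of the key space
bin/ycsb load basic -P workloads/workloada -p recordcount=1000000 -p insertstart=0 -p insertcount=500000
# load phase, client 2: insert the second half
bin/ycsb load basic -P workloads/workloada -p recordcount=1000000 -p insertstart=500000 -p insertcount=500000
# run phase: execute 100000 operations against the 1000000 loaded records
bin/ycsb run basic -P workloads/workloada -p recordcount=1000000 -p operationcount=100000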


Blockfrost.io get transaction data with addresses (from/to)

I am looking at using Blockfrost.io API in order to read cardano transactions, I am looking to get the bare minimum which is:
Address from
Address to
Assets transferred (type + amount)
Fees
So far I cannot find how to retrieve a transaction's from and to addresses while using:
https://docs.blockfrost.io/#tag/Cardano-Transactions/paths/~1txs~1%7Bhash%7D/get
Am I missing something?
So to answer my question:
Cardano uses something called UTxOs (unspent transaction outputs) to handle transactions, and I would invite everyone to read about these.
Regarding Blockfrost.io, this means you need to have a look at the transactions API:
https://docs.blockfrost.io/#tag/Cardano-Transactions/paths/%7E1txs%7E1%7Bhash%7D/get
and also combine it with the UTXOs API:
https://docs.blockfrost.io/#tag/Cardano-Transactions/paths/~1txs~1{hash}~1utxos/get
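For illustration, a minimal Python sketch of combining the two endpoints (using the requests library; the project_id header and the field names are taken from the docs linked above and should be double-checked against the current API):

import requests

BASE = "https://cardano-mainnet.blockfrost.io/api/v0"
HEADERS = {"project_id": "YOUR_PROJECT_ID"}  # your Blockfrost API key

def tx_summary(tx_hash):
    # /txs/{hash} carries the fees, /txs/{hash}/utxos carries the inputs and outputs
    tx = requests.get(f"{BASE}/txs/{tx_hash}", headers=HEADERS).json()
    utxos = requests.get(f"{BASE}/txs/{tx_hash}/utxos", headers=HEADERS).json()
    return {
        "from": [i["address"] for i in utxos["inputs"]],
        "to": [o["address"] for o in utxos["outputs"]],
        "assets": [o["amount"] for o in utxos["outputs"]],  # lists of {unit, quantity}
        "fees": tx["fees"],
    }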

Flink CEP cannot get correct results on a unioned table

I use Flink SQL and CEP to recognize some really simple patterns. However, I found a weird thing (likely a bug). I have two example tables password_change and transfer as below.
transfer
transid,accountnumber,sortcode,value,channel,eventtime,eventtype
1,123,1,100,ONL,2020-01-01T01:00:01Z,transfer
3,123,1,100,ONL,2020-01-01T01:00:02Z,transfer
4,123,1,200,ONL,2020-01-01T01:00:03Z,transfer
5,456,1,200,ONL,2020-01-01T01:00:04Z,transfer
password_change
accountnumber,channel,eventtime,eventtype
123,ONL,2020-01-01T01:00:05Z,password_change
456,ONL,2020-01-01T01:00:06Z,password_change
123,ONL,2020-01-01T01:00:08Z,password_change
123,ONL,2020-01-01T01:00:09Z,password_change
Here are my SQL queries.
First create a temporary view event as
(SELECT accountnumber,rowtime,eventtype FROM password_change WHERE channel='ONL')
UNION ALL
(SELECT accountnumber,rowtime, eventtype FROM transfer WHERE channel = 'ONL' )
The rowtime column is the event time, extracted directly from the original eventtime column, with a periodic bounded watermark of 1 second.
Then output the query result of
SELECT * FROM `event`
MATCH_RECOGNIZE (
PARTITION BY accountnumber
ORDER BY rowtime
MEASURES
transfer.eventtype AS event_type,
transfer.rowtime AS transfer_time
ONE ROW PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (transfer password_change ) WITHIN INTERVAL '5' SECOND
DEFINE
password_change AS eventtype='password_change',
transfer AS eventtype='transfer'
)
It should output
123,transfer,2020-01-01T01:00:03Z
456,transfer,2020-01-01T01:00:04Z
But I got nothing when running Flink 1.11.1 (also no output for 1.10.1).
What's more, if I change the pattern to only password_change, it still outputs nothing, but if I change the pattern to transfer alone it outputs several rows, though not all transfer rows. If I swap the event times of the two tables, i.e. let the password_change events happen first, then the pattern password_change outputs several rows while transfer does not.
On the other hand, if I extract those columns from the two tables, merge them into one table manually, and then emit them into Flink, the result is correct.
I searched and tried a lot to get this right, including changing the SQL statement, the watermark, the buffer timeout and so on, but nothing helped. I hope someone here can help. Thanks.
10/10/2020 update:
I use Kafka as the table source. tEnv is the StreamTableEnvironment.
Kafka kafka = new Kafka()
        .version("universal")
        .property("bootstrap.servers", "localhost:9092");

tEnv.connect(
        kafka.topic("transfer")
).withFormat(
        new Json().failOnMissingField(true)
).withSchema(
        new Schema()
                // event time taken from the eventtime field, with 1 s bounded out-of-orderness
                .field("rowtime", DataTypes.TIMESTAMP(3))
                .rowtime(new Rowtime()
                        .timestampsFromField("eventtime")
                        .watermarksPeriodicBounded(1000))
                .field("channel", DataTypes.STRING())
                .field("eventtype", DataTypes.STRING())
                .field("transid", DataTypes.STRING())
                .field("accountnumber", DataTypes.STRING())
                .field("value", DataTypes.DECIMAL(38, 18))
).createTemporaryTable("transfer");

tEnv.connect(
        kafka.topic("pchange")
).withFormat(
        new Json().failOnMissingField(true)
).withSchema(
        new Schema()
                .field("rowtime", DataTypes.TIMESTAMP(3))
                .rowtime(new Rowtime()
                        .timestampsFromField("eventtime")
                        .watermarksPeriodicBounded(1000))
                .field("channel", DataTypes.STRING())
                .field("accountnumber", DataTypes.STRING())
                .field("eventtype", DataTypes.STRING())
).createTemporaryTable("password_change");
Thanks to @Dawid Wysakowicz's answer. To confirm it, I added 4,123,1,200,ONL,2020-01-01T01:00:10Z,transfer to the end of the transfer table, and then the output became correct, which means it really is a watermark problem.
So now the question is how to fix it. Since a user will not change his/her password frequently, the time gap between these two tables is unavoidable. I just need the UNION ALL table to behave the same way as the manually merged one.
Update Nov. 4th 2020:
WatermarkStrategy with idle sources may help.
Most likely the problem is somewhere around watermark generation in conjunction with the UNION ALL operator. Could you share how you create the two tables, including how you define the time attributes and which connectors you use? That would let me confirm my suspicions.
I think the problem is that one of the sources stops emitting Watermarks. Once the transfer table (the table with the lower timestamps) stops producing records, it stops emitting Watermarks as well. After emitting its fourth row (timestamp 01:00:04) it emits Watermark = 01:00:03 (the timestamp minus the 1-second bound). The Watermark of a union of inputs is the smallest of the Watermarks of the two, so the stalled transfer input holds the Watermark back at 01:00:03. That is why you see no progress for the original query, yet see some records emitted for the pattern over the table with the smaller timestamps.
If you manually merge the two tables, you have just a single input with a single source of Watermarks, so the Watermark progresses further and you see some results.
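If that diagnosis is correct, one possible workaround (a sketch only; it assumes the table.exec.source.idle-timeout option is available in your Flink version) is to let the planner mark a source as idle after a timeout, so a stalled input no longer holds back the union's watermark:
tEnv.getConfig().getConfiguration()
    .setString("table.exec.source.idle-timeout", "1 s");
When building sources with the DataStream API instead, a WatermarkStrategy with withIdleness(...) achieves the same effect, which is what the Nov. 4th update above refers to.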

Django Model: Best Approach to update?

I am trying to update hundreds of objects in my job, which is scheduled every two hours.
I have an articles table in my model. All articles are parsed and then different attributes are saved for each article.
First I query for all unparsed articles, then parse the URL saved against each article and save the received attributes.
Below is my code:
articles = Articles.objects.filter(status=0)  # hundreds of articles
for art in articles:
    try:
        url = art.link
        result = ArticleParser(url)  # custom function which does all the parsing
        art.author = result.articleauthor
        art.description = result.articlecontent[:5000]
        art.imageurl = result.articleImage
        art.status = 1
        art.save()
    except Exception as e:
        art.author = ""
        art.description = ""
        art.imageurl = ""
        art.status = 2
        art.save()
The thing is, when this job is running, CPU utilization is very high and DB process utilization is also very high. I am trying to pinpoint when and where it spikes.
Question: Is this the right way to update multiple objects, or is there a better way to do it? Any suggestions?
Appreciate your help.
Regards
Edit 1: Sorry for the confusion; some explanation is needed. Fields like author, description etc. will be different for every article; they are returned after I parse the URL. The reason I am updating in a loop is that these fields differ on every iteration according to the URL. I have updated the code; I hope it clears up the confusion.
You are doing 100s of DB operations in a relatively tight loop, so it is expected that there is some load on the DB.
If you have a lot of articles, make sure you have an index on the status column to avoid a table scan.
You can try disabling autocommit and wrapping the whole update in one transaction instead.
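As a sketch of that last suggestion (reusing the ArticleParser and fields from the question; transaction.atomic commits the whole batch once instead of once per row):

from django.db import transaction

with transaction.atomic():
    for art in Articles.objects.filter(status=0):
        try:
            result = ArticleParser(art.link)
            art.author = result.articleauthor
            art.description = result.articlecontent[:5000]
            art.imageurl = result.articleImage
            art.status = 1
        except Exception:
            art.author = art.description = art.imageurl = ""
            art.status = 2
        art.save(update_fields=["author", "description", "imageurl", "status"])  # only write the changed columns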
From my understanding, you do NOT want to set the fields author, description and imageurl to the same value on all articles, so QuerySet.update won't work for you.
Django recommends this way when you want to update or delete multi-objects: https://docs.djangoproject.com/en/1.6/topics/db/optimization/#use-queryset-update-and-delete
1. Better not to catch the generic Exception; specify concrete exceptions instead: KeyError, IndexError, etc.
2. The data can be created once. Something like this:
data = dict(
    author=articleauthor,
    description=articlecontent[:5000],
    imageurl=articleImage,
    status=1
)
Articles.objects.filter(status=0).update(**data)
To Edit 1: You probably want to set up periodic tasks with Celery, that is, a separate task for each article, as sketched below. For help, see the Celery documentation.
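A rough sketch of that idea (hedged: the task names and wiring are made up, and it assumes Celery is already configured for the project, with the dispatcher scheduled via celery beat):

from celery import shared_task

@shared_task
def parse_article(article_id):
    # one article per task, so failures and load are isolated
    art = Articles.objects.get(pk=article_id)
    try:
        result = ArticleParser(art.link)
        art.author = result.articleauthor
        art.description = result.articlecontent[:5000]
        art.imageurl = result.articleImage
        art.status = 1
    except Exception:
        art.status = 2
    art.save()

@shared_task
def parse_pending_articles():
    # the every-two-hours job just fans out one task per unparsed article
    for article_id in Articles.objects.filter(status=0).values_list("id", flat=True):
        parse_article.delay(article_id)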

Issue related to Google App Engine query within a date range

I am concerned about querying entities this way
created_start = datetime.today()
created_start = created_start - timedelta(hours=1)
created_end = datetime.now()
a = Message.all()
a.filter('created >=',created_start)
a.filter('created <',created_end)
This is due to the 1000 query results restriction. So, two questions:
Will this work if .all() returns more than 1000 results? Or, to put it another way, will all() return more than 1000 results in case there are more?
Is there a better way to query for entities within a given date range?
Thank you very much in advance
Your solution is good: since SDK version 1.3.6, query results are no longer capped at 1000.
You can iterate over the entities until exhaustion, or fetch chunks of entities using a cursor.
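A sketch of both approaches with the old db API (reusing created_start and created_end from the question; process() is a hypothetical handler, while cursor() and with_cursor() are Query methods from google.appengine.ext.db):

q = Message.all()
q.filter('created >=', created_start)
q.filter('created <', created_end)

# Option 1: just iterate; the SDK fetches results in batches behind the scenes
for msg in q:
    process(msg)

# Option 2: explicit paging with cursors
batch = q.fetch(500)
while batch:
    for msg in batch:
        process(msg)
    q.with_cursor(q.cursor())  # resume where the previous fetch stopped
    batch = q.fetch(500)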

Row count of a column family in Cassandra

Is there a way to get a row count (key count) of a single column family in Cassandra? get_count can only be used to get the column count.
For instance, if I have a column family containing users and want to get the number of users, how could I do it? Each user is its own row.
If you are working on a large data set and are okay with a pretty good approximation, I highly recommend using the command:
nodetool --host <hostname> cfstats
This will dump out a list for each column family looking like this:
Column Family: widgets
SSTable count: 11
Space used (live): 4295810363
Space used (total): 4295810363
Number of Keys (estimate): 9709824
Memtable Columns Count: 99008
Memtable Data Size: 150297312
Memtable Switch Count: 434
Read Count: 9716802
Read Latency: 0.036 ms.
Write Count: 9716806
Write Latency: 0.024 ms.
Pending Tasks: 0
Bloom Filter False Positives: 10428
Bloom Filter False Ratio: 1.00000
Bloom Filter Space Used: 18216448
Compacted row minimum size: 771
Compacted row maximum size: 263210
Compacted row mean size: 1634
The "Number of Keys (estimate)" row is a good guess across the cluster and the performance is a lot faster than explicit count approaches.
If you are using an order-preserving partitioner, you can do this with get_range_slice or get_key_range.
If you are not, you will need to store your user ids in a special row.
I found an excellent article on this here: http://www.planetcassandra.org/blog/post/counting-keys-in-cassandra
select count(*) from cf limit 1000000
The above statement can be used if an approximate upper bound is known beforehand. I found this useful for my case.
[Edit: This answer is out of date as of Cassandra 0.8.1 -- please see the Counters entry in the Cassandra Wiki for the correct way to handle Counter Columns in Cassandra.]
I'm new to Cassandra, but I have messed around a lot with Google's App Engine. If no other solution presents itself, you may consider keeping a separate counter in a platform that supports atomic increment operations like memcached. I know that Cassandra is working on atomic counter increment/decrement functionality, but it's not yet ready for prime time.
I can only post one hyperlink because I'm new, so for progress on counter support see the link in my comment below.
Note that this thread suggests ZooKeeper, memcached, and redis as possible solutions. My personal preference would be memcached.
http://www.mail-archive.com/user#cassandra.apache.org/msg03965.html
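For reference, with the counter columns mentioned in the edit note above, the idea looks roughly like this in CQL (a sketch; the table and key names are made up):
CREATE TABLE row_counts (name text PRIMARY KEY, value counter);
-- bump the counter every time a user row is inserted
UPDATE row_counts SET value = value + 1 WHERE name = 'users';
SELECT value FROM row_counts WHERE name = 'users';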
There is always map/reduce, but that probably goes without saying. If you have it set up with Hive or Pig, you can do this for any table across the cluster, though I am not sure the task trackers know about Cassandra locality; it may have to stream the whole table across the network, so you get task trackers on Cassandra nodes but the data they receive may come from another Cassandra node :(. I would love to hear if anyone knows for sure, though.
NOTE: We are setting up map/reduce on Cassandra mainly because, if we want an index later, we can map/reduce one into Cassandra.
I have been getting the counts like this after I convert the data into a hash in PHP.
