MongoDB CDC with Kafka: how to configure the connector to read only data matching certain criteria

I am using Debezium + Kafka to consume MongoDB data. I don't need the entire data set, only documents matching certain criteria, but I cannot find a suitable configuration for this.
We have tried the collation and pipeline options:
"collation":"db.collection.find({id:466114})"
or:
"pipeline":"[{\"$match\": {\"$and\": [{\"operationType\":\"insert\"}, { \"collection.id\":NumberLong(466114) }] } }]",
Neither worked; I might be using these two options in the wrong way. Could anyone help me out?
Thanks.
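For reference, here is a rough sketch of how such a server-side filter is usually expressed with recent (2.x) Debezium MongoDB connectors, which accept a change-stream aggregation pipeline. The property names below (cursor.pipeline, topic.prefix, the connection string) vary by connector version and are assumptions here, so check the documentation for your release:
# Sketch of a filtered Debezium MongoDB connector config (2.x property names assumed).
connector.class=io.debezium.connector.mongodb.MongoDbConnector
# Placeholder connection string and topic prefix.
mongodb.connection.string=mongodb://mongo:27017/?replicaSet=rs0
topic.prefix=app
collection.include.list=mydb.mycollection
# Change-stream pipeline: keep only insert events whose new document has id 466114.
# The document sits under "fullDocument", and a plain JSON number replaces NumberLong(...).
cursor.pipeline=[{"$match": {"$and": [{"operationType": "insert"}, {"fullDocument.id": 466114}]}}]
On older connector versions without a pipeline option, the usual fallback is filtering downstream, e.g. with a filter SMT or a stream processor.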

Related

How to modify the projection of a dataset in an ADF Data Flow

I want to optimize my data flow by reading just the data I really need.
I created a dataset that maps a view on my database. This dataset is used by different data flows, so I need a generic projection.
Now I am creating a new data flow and I want to read just a subset of the dataset.
Here is how I created the dataset:
And this is the generic projection:
Here is how I created the data flow. This is the source settings tab:
But now I want just a subset of my dataset:
It works, but I think I am doing it wrong:
I want to read data from my dataset (as you can see in the source settings tab), but when I modify the projection I read from the underlying table (as you can see in the source options). This seems inconsistent. What is the correct way to manage this kind of customization?
Thank you
EDIT
The proposed solution does not solve my problem. If I go to Monitor and analyze the executions, this is what I see...
Before applying the proposed solution, with the workaround I described above, I got this:
As you can see, I read just 8 columns from the database.
With the proposed solution, I get this:
And only then:
Just to be clear, the purpose of my question is:
How can I read only the data I really need, instead of reading all the data and filtering it afterwards?
I found a way (explained in my question), but there is an inconsistency in the configuration of the data flow (I set the dataset as input, but in the source options I write a query that reads from the database).
First, import the data as a Source.
You can use a Select transformation in the Data Flow activity to select CustomerID from the imported dataset.
Here you can remove unwanted columns.
Refer - https://learn.microsoft.com/en-us/azure/data-factory/data-flow-select
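For what it's worth, in data flow script terms that setup would look roughly like the snippet below. The column names are hypothetical; the point is that only the columns listed in the select transformation's mapColumn(...) flow downstream, while the source still reads the dataset's full projection:
source(output(
		CustomerID as string,
		CustomerName as string,
		OrderDate as string
	),
	allowSchemaDrift: true,
	validateSchema: false) ~> GenericSource
GenericSource select(mapColumn(
		CustomerID
	),
	skipDuplicateMapInputs: true,
	skipDuplicateMapOutputs: true) ~> SelectCustomerID
Note that this trims columns after the read; pushing the projection down to the database is what the query in the source options does, which is why the two approaches report different column counts in Monitor.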

Best way to handle a large number of inserts to an Azure SQL database using TypeORM

I have an API created with Azure Functions (TypeScript). These functions receive arrays of JSON data, convert them to TypeORM entities and insert them into an Azure SQL database. I recently ran into an issue where the array had hundreds of items, and I got an error:
The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request
I figured that saving all of the data at once using the entity manager causes the issue:
const connection = await createConnection();
connection.manager.save(ARRAY OF ENTITIES);
What would be the best scalable solution to handle this? I've got a couple of ideas, but I have no idea if they're any good, especially performance-wise.
Begin transaction -> Start saving the entities individually inside forEach loop -> Commit
Split the array into smaller arrays -> Begin transaction -> Save the smaller arrays individually -> Commit
Something else?
Right now the array sizes are in the tens or hundreds, but occasional arrays with 10k+ items are also a possibility.
One way you can massively scale is to let the DB deal with the problem, e.g. by using External Tables. The DB does the parsing; your code only orchestrates.
E.g.
Make the data to be inserted available in ADLS (Data Lake):
Instead of calling your REST API with all the data (in the body or query params as an array), the caller writes the data to an ADLS location as a csv/json/parquet/... file, OR
The caller remains unchanged, and your Azure Function writes the data to a csv/json/parquet/... file in an ADLS location (instead of writing to the DB).
Make the DB read and load the data from ADLS:
First, CREATE EXTERNAL TABLE tmpExtTable ... WITH (LOCATION = '<ADLS-location>', ...);
then, INSERT INTO actualTable SELECT * FROM tmpExtTable; (see the fuller sketch below)
See formats supported by EXTERNAL FILE FORMAT.
You need not delete and re-create the external table each time; whenever you run a SELECT on it, the DB parses the data in ADLS again. But that's a choice.
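A rough T-SQL sketch of those moving parts, with placeholder names and paths; the exact options (e.g. TYPE or CREDENTIAL on the data source) depend on whether you are on Synapse or Azure SQL, so treat this as an illustration rather than copy-paste DDL:
-- Illustrative only: object names and the ADLS path are placeholders; depending on the
-- platform, extra options such as TYPE or CREDENTIAL are required on the data source.
CREATE EXTERNAL DATA SOURCE MyAdls
WITH (LOCATION = 'abfss://container@account.dfs.core.windows.net');

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

CREATE EXTERNAL TABLE tmpExtTable (
    id BIGINT,
    payload NVARCHAR(4000)
)
WITH (LOCATION = '/uploads/batch-001.csv',
      DATA_SOURCE = MyAdls,
      FILE_FORMAT = CsvFormat);

-- One set-based load instead of thousands of parameterized inserts.
INSERT INTO actualTable SELECT * FROM tmpExtTable;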
I ended up doing this the easy way, as TypeORM already provided the ability to save in chunks. It might not be the most optimal way but at least I got away from the "too many parameters" error.
// Save all data in chunks of 100 entities
connection.manager.save(ARRAY OF ENTITIES, { chunk: 100 });
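For completeness, the transaction-based variant from the question's second idea could be sketched like this; the Item entity and the chunk size are placeholders, not part of the original code:
// Sketch of idea #2: chunked saves inside a single transaction.
// "Item" is a placeholder entity type and CHUNK_SIZE is an arbitrary choice.
import { createConnection, EntityManager } from "typeorm";
import { Item } from "./entity/Item";

const CHUNK_SIZE = 100;

async function saveInChunks(items: Item[]): Promise<void> {
  const connection = await createConnection();
  // One transaction: either every chunk is committed or none are.
  await connection.transaction(async (manager: EntityManager) => {
    for (let i = 0; i < items.length; i += CHUNK_SIZE) {
      await manager.save(items.slice(i, i + CHUNK_SIZE));
    }
  });
}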

Retrieve all documents but only specific fields from Cloudant database

I want to return all the documents in a Cloudant database but only include some of the fields. I know that you can make a GET request against https://$USERNAME.cloudant.com/$DATABASE/_all_docs, but there doesn't seem to be a way to select only certain fields.
Alternatively, you can POST to /db/_find and include selector and fields in the JSON body. However, is there a universal selector, similar to SELECT * in SQL databases?
You can use {"_id":{"$gt":0}} as your selector to match all the documents, although you should note that it is not going to be performant on large data sets.

Querying Twitter JSON File in HBase

I have successfully downloaded Twitter data through Flume directly into an HBase table containing one column family, and all of the data is stored in one column like this:
hbase(main):005:0> scan 'tweet'
ROW                                           COLUMN+CELL
 default00fbf898-6f6e-4b41-aee8-646efadfba46  column=data:pCol, timestamp=1454394077534, value={"extended_entities":{"media":[{"display_url":"pic.twitter.com/a7Mjq2daKZ","source_user_id":2987221847,"type":"photo"....
Now I want to access structs and arrays through HBase like we can access them in Hive. I have tried googling the issue but am still clueless. Kindly help.
You can't query display_url, source_user_id or other JSON fields in HBase directly. You should use a document-store NoSQL DB like MongoDB.
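That said, since the question explicitly compares with Hive: a common workaround (different from the answer above; the table and column names below just mirror the scan output) is to map the HBase column into a Hive external table and extract JSON fields with get_json_object, roughly like:
-- Sketch only: maps the 'tweet' table's data:pCol column into Hive; adjust names to your setup.
CREATE EXTERNAL TABLE tweets_raw (rowkey STRING, json STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,data:pCol')
TBLPROPERTIES ('hbase.table.name' = 'tweet');

-- Pull nested JSON fields out of the single value column.
SELECT get_json_object(json, '$.extended_entities.media[0].display_url') AS display_url
FROM tweets_raw;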

Querying a large amount of data processed by Hive

Say I have around 10-20 GB of data in HDFS as a Hive table. This has been obtained after several MapReduce jobs and a JOIN over two separate datasets. I need to make this queryable by the user. What options do I have?
Use Sqoop to transfer the data from HDFS to an RDS like PostgreSQL. But I want to avoid spending so much time on data transfer. I just tested HDFS -> RDS in the same AWS region using Sqoop, and 800 MB of data takes 4-8 minutes. So you can imagine ~60 GB of data would be pretty unmanageable. This would be my last resort.
Query Hive directly from my web server per user request. I haven't ever heard of Hive being used like this, so I'm skeptical. This struck me because I just found out you can query Hive tables remotely after some port forwarding on the EMR cluster. But being new to big(ish) data, I'm not quite sure about the risks associated with this. Is it commonplace to do this?
Some other solution - How do people usually do this kind of thing? Seems like a pretty common task.
Just for completeness sake, my data looks like this:
id time cat1 cat2 cat3 metrics[200]
A123 1234212133 12 ABC 24 4,55,231,34,556,123....(~200)
.
.
.
(time is epoch)
And my Queries look like this:
select cat1, corr(metrics[2],metrics[3]),corr(metrics[2],metrics[4]),corr(metrics[2],metrics[5]),corr(metrics[2],metrics[6]) from tablename group by cat1;
I need the correlation function, which is why I've chosen PostgreSQL over MySQL.
Hive has a correlation function:
corr(col1, col2)
Returns the Pearson coefficient of correlation of a pair of numeric columns in the group.
You can simply connect to a HiveServer port via ODBC and execute queries.
Here is an example:
http://www.cloudera.com/content/cloudera/en/downloads/connectors/hive/odbc/hive-odbc-v2-5-10.html
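If ODBC turns out to be awkward from the web server, the same idea works over JDBC; as a hedged example, running the question's kind of query through beeline (the hostname, port, and database are placeholders) might look like:
# Hostname/port/database are placeholders; HiveServer2 commonly listens on port 10000.
beeline -u "jdbc:hive2://emr-master.example.com:10000/default" \
  -e "select cat1, corr(metrics[2], metrics[3]) from tablename group by cat1;"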
Hue (Hadoop User Experience) has a Beeswax query editor designed specifically for exposing Hive to end users who are comfortable with SQL. This way they can potentially run ad-hoc queries against the data residing in Hive without needing to move it elsewhere. You can see an example of the Beeswax Query Editor here: http://demo.gethue.com/beeswax/#query
Will that work for you?
What I can understand from the question posted above is that you have some data (~20 GB) stored in HDFS and accessed through Hive, and you now want to use that data to perform statistical functions like correlation.
You have functions in Hive that perform correlation.
Otherwise you can directly connect R to Hive using RHive, or even Excel to Hive using a data source connector.
The other solution is installing Hue, which comes with Hive editors where you can query Hive directly.
