I wanted to clarify my understanding of the following.
Use case
Basically, I am running a Flink batch job. My requirements are as follows:
I have 10 tables of raw data in PostgreSQL.
I want to aggregate that data using a 10-minute tumbling window.
I need to store the aggregated data in aggregated PostgreSQL tables.
My pseudocode looks somewhat like this:
initialize StreamExecutionEnvironment, StreamTableEnvironment
load all the configs from file
configs.foreach(
load data from table
aggregate
store data
delete temporary views created
)
streamExecutionEnvironment.execute()
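For illustration only, here is a rough Scala sketch of what one pass of this loop could look like with the Flink Table API / SQL. Every table name, column, and connector option below is a made-up placeholder rather than your actual setup, and the config handling is simplified to a hard-coded list:

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment

object AggregationJob {
  def main(args: Array[String]): Unit = {
    val env  = StreamExecutionEnvironment.getExecutionEnvironment
    val tEnv = StreamTableEnvironment.create(env)

    // Hypothetical config: one (raw table, aggregated table) pair per source.
    val configs = Seq(("raw_events_1", "agg_events_1"), ("raw_events_2", "agg_events_2"))

    configs.foreach { case (source, sink) =>
      // Register the raw PostgreSQL table as a JDBC source (options are placeholders).
      tEnv.executeSql(
        s"""CREATE TEMPORARY TABLE $source (
           |  amount DOUBLE,
           |  event_time TIMESTAMP(3),
           |  WATERMARK FOR event_time AS event_time
           |) WITH (
           |  'connector'  = 'jdbc',
           |  'url'        = 'jdbc:postgresql://host:5432/db',
           |  'table-name' = '$source'
           |)""".stripMargin)

      // Register the aggregated PostgreSQL table as the JDBC sink.
      tEnv.executeSql(
        s"""CREATE TEMPORARY TABLE $sink (
           |  window_start TIMESTAMP(3),
           |  total_amount DOUBLE
           |) WITH (
           |  'connector'  = 'jdbc',
           |  'url'        = 'jdbc:postgresql://host:5432/db',
           |  'table-name' = '$sink'
           |)""".stripMargin)

      // 10-minute tumbling-window aggregation written straight to the sink.
      // Note: each INSERT submitted via executeSql starts its own Flink job.
      tEnv.executeSql(
        s"""INSERT INTO $sink
           |SELECT TUMBLE_START(event_time, INTERVAL '10' MINUTE), SUM(amount)
           |FROM $source
           |GROUP BY TUMBLE(event_time, INTERVAL '10' MINUTE)""".stripMargin)
    }
  }
}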
Everything works fine for now, but I still have one question. I think that with this approach all the load functions would be executed simultaneously, which would put load on Flink since all the data is loaded at once. Or is my understanding wrong, and the data would be loaded, processed, and stored one table at a time? Please guide me.
I'm trying to monitor pipeline runs from ADF in a Snowflake table. I've managed to use a REST API to get the data into Power BI, but I now need to get the data from ADF into Snowflake. Any examples would be of great help. The data I need includes pipeline name, run time, start time, error message, etc.
Please check the approach below.
Take the pipeline name, run time, start time, error message, etc. into pipeline variables.
In the Copy activity's Source tab, point to a dummy file on Blob Storage or Data Lake, then add additional columns for the pipeline name, run time, start time, error message, etc. (see the example expressions below).
In the Copy activity's Sink tab, point to your Snowflake table.
In the Copy activity's Mapping tab, map your source and sink columns accordingly.
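For the additional columns, the values can usually be supplied as dynamic-content expressions built from ADF system variables; for example (the activity name below is only a placeholder, so adjust everything to your own pipeline):
Pipeline name: @pipeline().Pipeline
Run ID: @pipeline().RunId
Start time: @pipeline().TriggerTime
Error message: @activity('Copy data').error.message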
Please check the video below, where the author does the same thing but using a data flow. In your case, you can use the Copy activity as explained above.
https://www.youtube.com/watch?v=-xna7n33lmc
To learn how to access the error message of an activity failure, please check this video: https://www.youtube.com/watch?v=_lSB7jaDnG0
So I'm learning about Spark, and I have a question about how client libraries work.
My goal is to do some sort of data analysis in Spark, telling it where the data sources are (databases, CSV files, etc.) and storing the results in HDFS, S3, or some kind of database like MariaDB or MongoDB.
I thought about having a service (an API application) that "tells" Spark what I want to do. The question is: is it enough to set the master configuration to spark://remote-host:7077 at context creation, or should I send the application to Spark with some sort of spark-submit command?
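(For reference, "setting the master configuration at context creation" refers to something like the sketch below; the host, port, and app name here are placeholders.)

import org.apache.spark.sql.SparkSession

// Create a session that talks to a standalone Spark master directly from application code.
// Host, port, and app name are placeholders.
val spark = SparkSession.builder()
  .appName("analysis-service")
  .master("spark://remote-host:7077")
  .getOrCreate()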
This completely depends on how your environment is set up. If all paths are linked to your account, you should be able to run one of the two commands below to open a shell and run test commands. The reason to use a shell is that it lets you run commands dynamically, validate and learn how to chain commands together, and see what results come out.
Scala
spark-shell
Python
pyspark
Inside the environment, if everything is linked to Hive tables, you can list the tables by running:
spark.sql("show tables").show(100,false)
The above command runs a "show tables" query against the Spark Hive metastore catalogue and returns all active tables you can see (which doesn't mean you can access the underlying data). The 100 means up to 100 rows are displayed, and the false means full strings are shown rather than being truncated to the first N characters.
As a hypothetical example, if one of the tables you see is called Input_Table, you can bring it into the environment with the commands below:
val inputDF = spark.sql("select * from Input_Table")
inputDF.count
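To tie this back to the goal of storing results in a database, a rough sketch of writing that DataFrame out over JDBC is below (MariaDB purely as an example; the URL, table, and credentials are placeholders, and the MariaDB JDBC driver would need to be on the classpath):

// Write the result to a relational database over JDBC.
// URL, table name, and credentials are placeholders.
inputDF.write
  .format("jdbc")
  .option("url", "jdbc:mariadb://db-host:3306/analytics")
  .option("dbtable", "results")
  .option("user", "analytics_user")
  .option("password", "secret")
  .mode("append")
  .save()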
While you're learning, I would strongly advise against running the commands via spark-submit, because you will need to pass in the class and JAR, forcing you to edit and rebuild for every test, which makes it difficult to work out how commands will run without a lot of downtime.
I am trying to save data as it arrives, in a streaming fashion (with the least amount of delay), to my database, which is InfluxDB. Currently I save it in batches.
Current setup - interval based
Currently I have an Airflow instance where I read the data from a REST API every 5 minutes and then save it to InfluxDB.
Desired setup - continuous
Instead of saving data every 5 minutes, I would like to establish a connection via a WebSocket (I guess) and save the data as it arrives. I have never done this before and I am confused about how it is actually done. Some questions I have are:
Once I write the code for it, do I keep it running like a daemon?
Do I need to use something like Telegraf for this, or is that not really the case (example article)?
Instead of Airflow (since it is for batch processing), do I need to use something like Apache Beam or Spark?
As you can see, I am quite lost on where to start, what to read, and how to make sense of all this. Any advice on direction and/or guidance for a setup would be very much appreciated.
If I understand correctly, you want to code a Java service that processes the incoming data, so one solution is to implement a WebSocket endpoint with, for example, Jetty.
From there you receive the data, in JSON format for example, and process it using the influxdb-java library, with which you fill the database. influxdb-java will allow you to create and manage the data.
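As a rough illustration (written in Scala here, though the same influxdb-java calls work from Java), writing one incoming measurement could look like the sketch below; the URL, credentials, database name, measurement, and fields are all placeholders:

import java.util.concurrent.TimeUnit
import org.influxdb.InfluxDBFactory
import org.influxdb.dto.Point

object StreamWriter {
  // Connect once at service startup (URL and credentials are placeholders).
  private val influxDB = InfluxDBFactory.connect("http://localhost:8086", "user", "password")

  // Call this for every message received over the websocket.
  def onMessage(sensor: String, value: Double): Unit = {
    val point = Point.measurement("readings")
      .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
      .tag("sensor", sensor)
      .addField("value", value)
      .build()

    // Write into the "metrics" database, default "autogen" retention policy.
    influxDB.write("metrics", "autogen", point)
  }
}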
I don't know Airflow or how you produce the data, so maybe there are built-in tools (InfluxDB sinks) that could save you some work in your context.
I hope this gives you some guidelines to start digging further.
I have a WinForms application with a SQL Server backend. Some tables contain static data (lookup tables) that I would like to load into my dataset at application start, to be used throughout the application when needed.
Normally in a form I would use something like this: Me.TEMSWBSETableAdapter.Fill(Me.EMS_DS.TEMSWBSE)
But I would have to do that in every form that requires that data. The problem is that it takes a while to load, so I would like to load the data at startup in a background worker so it can be used by any form that needs it; basically, filling the dataset with that data for use throughout the application.
I am not sure how to do this. Can someone point me in the right direction?
Thanks
You first need to use a DataReader object, which can retrieve data without creating a TableAdapter; this loads your data in ADO.NET's forward-only, read-only mode. You can then use the ExecuteNonQuery() method to apply the changes directly to the database.
I am developing a website with AngularJS, and my server gets information from a graph database in Neo4j. At first I used the default Neo4j database (with movies and such), but when I load my own CSV files, Neo4j adds only half of them. I have 117,000 rows; I tried using PERIODIC COMMIT and again it added only 58,000. What is the Cypher command for adding all the data? Is it OK to split it into another CSV file?
EDIT: I've used this command:
USING PERIODIC COMMIT
LOAD CSV FROM 'http://docs.neo4j.org/chunked/2.1.2/csv/artists.csv' AS line
CREATE (:Artist { name: line[1], year: toInt(line[2])})
Another question: I need to show the result of a query using AngularJS, and I couldn't find a clear explanation, algorithm, example, etc. Is there a way to show the result (the result is in JSON)?
EDIT: I need to show the results both as a table and as nodes (like in the Neo4j admin interface).
Yes, please split it up into 3 questions.
For loading data,
see http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/
And check out MERGE:
see: http://docs.neo4j.org/chunked/milestone/query-merge.html
For creating an angular application with Neo4j backend
see: https://github.com/kbastani/neo4j-movies-template