Is it possible to connect to an Oracle database and get a live data stream into Google Cloud Pub/Sub?
The short answer to your question is yes, but the longer, more detailed answer requires some assumptions, such as: when you say stream, do you literally mean stream, or do you mean batch updates every minute?
I ask because there are huge implications depending on the answer. If you require a true streaming solution, the only one is to bolt an Oracle product called Oracle GoldenGate on top of your database. This product is costly, both in dollars and in engineering effort.
If a near-real-time solution is suitable for you, then you can use any of the following:
NiFi
Airflow
Luigi
with either plain SQL or a streaming framework like Beam or Spark.
Or any other orchestration platform that can run queries on a timer. At the end of the day, all you need is something that can run select * from table where last_update > now() - threshold, generate an event for each delta, and then publish all the deltas to Pub/Sub.
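For illustration, here is a minimal polling sketch in Python, assuming the python-oracledb and google-cloud-pubsub client libraries, a hypothetical my_table with a last_update column, and a hypothetical my-topic Pub/Sub topic; the real query, serialization and scheduling would depend on your schema and whichever orchestrator runs it.

# Minimal delta-polling sketch: query rows changed since the last run and
# publish each one to Pub/Sub. Table, column and topic names are hypothetical.
import datetime
import json

import oracledb                      # python-oracledb, thin mode
from google.cloud import pubsub_v1

THRESHOLD = datetime.timedelta(minutes=1)

def publish_deltas(project_id, topic_id):
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)

    conn = oracledb.connect(user="scott", password="tiger",
                            dsn="dbhost.example.com/orclpdb1")
    cursor = conn.cursor()
    since = datetime.datetime.utcnow() - THRESHOLD
    cursor.execute(
        "SELECT id, payload, last_update FROM my_table WHERE last_update > :since",
        since=since,
    )
    columns = [d[0].lower() for d in cursor.description]
    futures = []
    for row in cursor:
        delta = dict(zip(columns, row))
        data = json.dumps(delta, default=str).encode("utf-8")
        futures.append(publisher.publish(topic_path, data=data))
    for f in futures:
        f.result()   # block until Pub/Sub has accepted every message
    conn.close()

if __name__ == "__main__":
    publish_deltas("my-project", "my-topic")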
Yes. There is a provided template at https://cloud.google.com/dataflow/docs/templates/provided-templates#gcstexttocloudpubsub that reads text files from Google Cloud Storage and publishes them to Cloud Pub/Sub. You should be able to change the part of the code that reads from storage so that it reads from your database instead.
Yes, I tried it as part of a POC: use triggers to capture the changed records in Oracle, use a cursor to write them to a .txt file as JSON, then have a batch script read that data and run a publish command to push it to Cloud Pub/Sub. That is the overall flow.
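A minimal sketch of the publish step of that flow, assuming the exported .txt file contains one JSON record per line and that the google-cloud-pubsub Python client is used in place of a shell publish command; the file name, project and topic are hypothetical.

# Publish each JSON line of the exported file to Pub/Sub.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "oracle-changes")  # hypothetical names

with open("changed_records.txt", "r", encoding="utf-8") as f:
    futures = [
        publisher.publish(topic_path, data=line.strip().encode("utf-8"))
        for line in f if line.strip()
    ]

for future in futures:
    future.result()  # raises if any publish failed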
You can consider using Change Data Capture (CDC) tools like Debezium that detect your database changes in real time.
Docs: https://debezium.io/documentation/reference/operations/debezium-server.html
With Spring boot: https://www.baeldung.com/debezium-intro
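As a rough illustration, a Debezium Server configuration along these lines can stream Oracle changes into Pub/Sub. Treat every value as a placeholder: the exact property names vary between Debezium versions, and the Oracle connector also needs LogMiner/archive-log setup that is covered in the docs above.

# conf/application.properties (sketch; check the property names against your Debezium Server version)
debezium.sink.type=pubsub
debezium.sink.pubsub.project.id=my-project
debezium.source.connector.class=io.debezium.connector.oracle.OracleConnector
debezium.source.database.hostname=dbhost.example.com
debezium.source.database.port=1521
debezium.source.database.user=dbzuser
debezium.source.database.password=dbzpass
debezium.source.database.dbname=ORCLCDB
debezium.source.topic.prefix=oracle-server1
debezium.source.offset.storage.file.filename=data/offsets.dat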
I have used Snowpipe to retrieve data from AWS S3 into Snowflake, but in my case it's not working as expected: sometimes the files are not processed into Snowflake.
Are there any alternative methods available for this?
Event delivery from AWS S3 has been said to be unreliable, in the sense that events might arrive several minutes late (this is an AWS issue, but it affects Snowpipe).
The remedy is to schedule a task to periodically (minimum daily) do:
ALTER PIPE my_pipe REFRESH [ PREFIX = '<path>' ];
Please use a prefix to avoid scanning large S3 buckets for unprocessed items. Also watch for announcements from Snowflake about when the S3 event issue is fixed by Amazon, so you can delete any unnecessary REFRESH tasks.
If you have e.g. a YYYY/MM/DD/ bucket structure, this unfortunately means you have to create a Stored Procedure to run the command with a dynamic PREFIX...
I use this combination (PIPE/REFRESH TASK) for my Snowpipes.
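If you prefer to schedule the refresh from outside Snowflake rather than from a Stored Procedure, a small Python job along these lines can build the prefix for the current day and run the REFRESH; it assumes the snowflake-connector-python package, a pipe named my_pipe and placeholder connection details.

# Run ALTER PIPE ... REFRESH for today's YYYY/MM/DD/ prefix (sketch).
import datetime
import snowflake.connector

def refresh_todays_prefix():
    prefix = datetime.date.today().strftime("%Y/%m/%d/")
    conn = snowflake.connector.connect(
        account="my_account",        # placeholder connection details
        user="my_user",
        password="my_password",
        warehouse="my_wh",
        database="my_db",
        schema="my_schema",
    )
    try:
        conn.cursor().execute("ALTER PIPE my_pipe REFRESH PREFIX = '%s'" % prefix)
    finally:
        conn.close()

if __name__ == "__main__":
    refresh_todays_prefix()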
To answer your question: Yes. I've used it in the past on multiple occasions in production (AWS) and it has worked as expected.
I have some files that contain thousands of rows that I need to insert into Google BigQuery. Because the execution time exceeds the 60-second request limit in App Engine, I moved the BQ queries into a task queue.
For now it works very well, but I don't know whether this is the best place to put BQ queries. I am saying this because the requests take up to 3 minutes to complete, which I think is a bit slow. Do you think there is a faster / better place to query BQ?
PS: I am using the Google BigQuery API to send the queries.
There are two options:
Your file is formatted so it can be used with BQ load jobs. In this case you start the load job from the task queue, store the job ID returned by the REST call in the Datastore, and exit the task. As a separate process, you set up an App Engine cron that runs, say, every minute, checks all running job IDs, updates their status if it has changed, and kicks off another process if needed (the cron handler runs as a task queue task, so it stays under the 10-minute limit). I think this will be pretty scalable.
You process the file and insert the rows yourself. In this case the best course of action is to use Pub/Sub, or again to start multiple tasks in the task queue by manually splitting the data into small pieces, and to use the BQ streaming insert API. It depends on the size of your rows, of course, but I found that 1000-5000 records per task works well here (a sketch of both options follows below).
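To make both options concrete, here is a rough sketch with the google-cloud-bigquery Python client; the table ID, source URI, chunk size and row shape are assumptions, and load_table_from_uri and insert_rows_json wrap the underlying jobs.insert and tabledata.insertAll calls.

from google.cloud import bigquery

CHUNK_SIZE = 1000  # 1000-5000 records per call, per the answer above

def start_load_job(source_uri, table_id):
    # Option 1: kick off a load job and return its ID so a cron task can
    # poll client.get_job(job_id).state later.
    client = bigquery.Client()
    job = client.load_table_from_uri(source_uri, table_id)
    return job.job_id

def stream_rows(table_id, rows):
    # Option 2: split the rows into small chunks and stream each chunk.
    client = bigquery.Client()
    for start in range(0, len(rows), CHUNK_SIZE):
        chunk = rows[start:start + CHUNK_SIZE]
        errors = client.insert_rows_json(table_id, chunk)  # insertAll under the hood
        if errors:
            raise RuntimeError("Streaming insert failed: %s" % errors)

# Hypothetical usage:
# start_load_job("gs://my-bucket/data.csv", "my-project.my_dataset.my_table")
# stream_rows("my-project.my_dataset.my_table", [{"id": 1, "value": "a"}])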
Also check out Potens.io (also available on Cloud Launcher).
Magnus - the Workflow Automator that is part of the Potens suite - supports all of BigQuery, Cloud Storage and most Google APIs, as well as multiple simple utility-type Tasks like BigQuery Task, Export to Storage Task, Loop Task and many more.
Disclosure: I am the creator of those tools and the leader of the Potens team.
If you have your text files in Google Cloud Storage, Cloud Dataflow could be a natural solution for your situation {1}.
You can use a Google-Provided Template to save some time in the process of creating a Cloud Dataflow pipeline {2}.
This way you can create a batch pipeline to move (and transform if you want) data from Google Cloud Storage (files) to BigQuery.
{1}: https://cloud.google.com/dataflow/
{2}: https://cloud.google.com/dataflow/docs/templates/provided-templates#cloud-storage-text-to-bigquery
I am investigating what might be the best infrastructure for storing log files from many clients.
Google App Engine offers a nice solution that doesn't make the process an IT nightmare: load balancing, sharding, servers, user authentication - all in one place with almost zero configuration.
However, I wonder whether the Datastore model is the right one for storing logs. Each log entry would be saved as a single document; each client uploads its document on a daily basis, and it can consist of 100K log entries per day.
Plus, there are some limitations and questions that could break the requirements:
60-second timeout on bulk transactions - how many log entries per second will I be able to insert? If 100K won't fit into the 60-second window, this will affect the design and the work that needs to be put into the server.
5 inserts per entity per second - is a transaction considered a single insert?
Post-analysis - text search, searching for similar log entries across clients. How flexible and efficient is Datastore with these queries?
Real-time data fetch - getting all the recent log entries.
The other option is to deploy an Elasticsearch cluster on Google Compute Engine and write the server ourselves, fetching data from ES.
Thanks!
It is a bad idea to use Datastore, and even worse if you use entity groups with parent/child relationships, as a comment mentions, when comparing performance.
Those numbers do not apply, and in any case Datastore is not at all designed for what you want.
BigQuery is what you want. It is designed for this, especially if you later want to analyze the logs in a SQL-like fashion. Any more detail requires that you ask a specific question, as it seems you haven't read much about either service.
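As a toy illustration of the kind of cross-client analysis BigQuery is built for, assuming a hypothetical logs table with client_id, timestamp and message columns and the google-cloud-bigquery Python client:

# Find the most frequent log messages across all clients in the last day (sketch).
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT message,
           COUNT(DISTINCT client_id) AS clients,
           COUNT(*) AS occurrences
    FROM `my-project.logs.entries`   -- hypothetical table
    WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
    GROUP BY message
    ORDER BY occurrences DESC
    LIMIT 20
"""
for row in client.query(query).result():
    print(row.message, row.clients, row.occurrences)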
I do not agree. Datastore is a fully managed NoSQL document store; you can store the logs you want in this type of storage and query them directly in Datastore. The benefit of using it instead of BigQuery is the schemaless part: in BigQuery you have to define the schema before inserting the logs, which is not necessary with Datastore. Think of Datastore as the MongoDB of the log-analysis use case on Google Cloud.
To move data from Datastore to BigQuery tables I currently follow a manual and time-consuming process: backing up to Google Cloud Storage and restoring into BigQuery. There is scant documentation on the restoring part, so this post is handy: http://sookocheff.com/posts/2014-08-04-restoring-an-app-engine-backup/
There is also a seemingly outdated article (with code) for doing it: https://cloud.google.com/bigquery/articles/datastoretobigquery
I have, however, been waiting for access to this experimental tester program that seems to automate the process, but have had no access for months: https://docs.google.com/forms/d/1HpC2B1HmtYv_PuHPsUGz_Odq0Nb43_6ySfaVJufEJTc/viewform?formkey=dHdpeXlmRlZCNWlYSE9BcE5jc2NYOUE6MQ
For some entities, I'd like to push the data to BigQuery as it comes (inserts and possibly updates). For more business-intelligence-type analysis, a daily push is fine.
So, what's the best way to do it?
There are three ways of entering data into bigquery:
through the UI
through the command line
via API
If you choose the API, then you have two different ways: "batch" mode or the streaming API.
If you want to send data "as it comes", then you need to use the streaming API. Every time you detect a change in your Datastore (or maybe once every few minutes, depending on your needs), you call the insertAll method of the API. Please note that you need to have a table created beforehand with the structure of your Datastore kind (this can be done via the API too, if needed).
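A minimal sketch of that flow with the current google-cloud-bigquery Python client (insert_rows_json calls the insertAll endpoint under the hood); the table ID, schema and the shape of the change record are assumptions based on whatever your Datastore kind looks like.

# Create the destination table once, then stream each Datastore change into it.
from google.cloud import bigquery

TABLE_ID = "my-project.my_dataset.my_kind"   # hypothetical

def ensure_table(client):
    schema = [
        bigquery.SchemaField("key", "STRING"),
        bigquery.SchemaField("payload", "STRING"),
        bigquery.SchemaField("updated_at", "TIMESTAMP"),
    ]
    client.create_table(bigquery.Table(TABLE_ID, schema=schema), exists_ok=True)

def push_change(client, change):
    errors = client.insert_rows_json(TABLE_ID, [change])  # insertAll
    if errors:
        raise RuntimeError("insertAll failed: %s" % errors)

client = bigquery.Client()
ensure_table(client)
push_change(client, {"key": "MyKind/123",
                     "payload": "{\"field\": \"value\"}",
                     "updated_at": "2016-01-01T00:00:00Z"})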
For your second requirement, ingesting data once a day, you have the full code in the link you provided. All you need to do is adjust the JSON schema to that of your Datastore and you should be good to go.
I have an App Engine application that processes files from Cloud Storage and inserts them into BigQuery.
Because I would like to know the sanity/performance of the application, now and in the future, I would like to store stats data in either Cloud Datastore or a Cloud SQL instance.
I have two questions I would like to ask:
Cloud Datastore vs Cloud SQL - what would you use and why? What downsides have you experienced so far?
Would you use a task or a direct call to insert data, and why? Would you add a task and then have some consumers insert the data, or would you do a direct insert [regardless of the solution chosen above]? What downsides have you experienced so far?
Thank you.
Cloud SQL is better if you want to perform JOINs or SUMs later; Cloud Datastore will scale better if you have a lot of data to store. Also, in Datastore, if you want to update a stats entity transactionally, you will need to shard or you will be limited to 5 updates per second.
If the data to insert is small (one row to insert in BQ or one entity in the Datastore), then you can do it with a direct call, but you must accept that the call may fail. If you want to retry in case of failure, or if the data to insert is big and will take time, it is better to run it asynchronously in a task. Note that with tasks you must be cautious, because they can run more than once.
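To illustrate the "can run more than once" caveat, here is a sketch for the legacy App Engine Python runtime using the taskqueue and ndb APIs; the Stat model, key naming and handler URL are assumptions. Giving the entity a deterministic key means a retried task rewrites the same record instead of creating a duplicate.

# Enqueue a stats-insert task and make the task handler idempotent (sketch).
import webapp2
from google.appengine.api import taskqueue
from google.appengine.ext import ndb

class Stat(ndb.Model):
    file_name = ndb.StringProperty()
    rows_loaded = ndb.IntegerProperty()

def enqueue_stat(file_name, rows_loaded):
    taskqueue.add(url="/tasks/record-stat",
                  params={"file_name": file_name, "rows_loaded": rows_loaded})

class RecordStatHandler(webapp2.RequestHandler):
    def post(self):
        file_name = self.request.get("file_name")
        # Deterministic key: re-running the task overwrites the same entity.
        Stat(id="stat-%s" % file_name,
             file_name=file_name,
             rows_loaded=int(self.request.get("rows_loaded"))).put()

app = webapp2.WSGIApplication([("/tasks/record-stat", RecordStatHandler)])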