As mentioned in the title, I'd like to know what data sources Snowflake supports. I'm not completely sure how to even approach this question. I know you can create an external stage in the cloud storage of supported cloud providers, but what if I want to load data from an Oracle database, for example? What's the best solution in that case: using the ODBC driver, or something else?
Please feel free to give me any suggestions or advice on where to continue my research. Also, let me know if any part of my question is unclear so that I can rephrase it :)
Snowflake natively supports Avro, Parquet, CSV, JSON, and ORC file formats. These files are landed in a stage for ingestion: your ELT/ETL tool of choice, or even a home-built application, must land the data in a stage, either internal or external.
Those files are then ingested into Snowflake with a COPY command, either run by that tool or automated by something like Snowpipe.
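As a rough illustration of that flow, here is a minimal sketch using the snowflake-connector-python package; the account, credentials, table, stage, and file names are placeholders I made up, not anything specific to your environment.

```python
# pip install snowflake-connector-python
import snowflake.connector

# All connection parameters and object names below are placeholders.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)

cur = conn.cursor()
try:
    # Land a local CSV file in the table's internal stage...
    cur.execute("PUT file:///tmp/orders.csv @%orders AUTO_COMPRESS=TRUE")
    # ...then ingest it with a COPY command.
    cur.execute(
        "COPY INTO orders FROM @%orders "
        "FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)"
    )
finally:
    cur.close()
    conn.close()
```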
We have documentation on Firehose / Kafka pipelines landing data for Snowpipe to ingest, either through AUTO_INGEST notifications (limited to external stages) or by calling our REST API.
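For the auto-ingest route, the pipe itself is plain SQL. A hedged sketch (the external stage, table, and pipe names are placeholders, and the cloud-side event notification setup is omitted) executed through the same Python connector might look like:

```python
# Assumes an existing external stage (e.g. over S3) and a target table;
# the S3 event notification wiring on the cloud side is not shown here.
# `conn` is an open snowflake.connector connection, created as in the
# earlier sketch.
cur = conn.cursor()
cur.execute("""
    CREATE PIPE IF NOT EXISTS raw_events_pipe
      AUTO_INGEST = TRUE
    AS
      COPY INTO raw_events
      FROM @my_s3_stage
      FILE_FORMAT = (TYPE = 'JSON')
""")
cur.close()
```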
All of this is covered in our documentation; simply search for the terms I have mentioned and you will find plenty of material.
Multiple existing ETL tools let you define Snowflake as a destination and support a wide variety of sources.
Native Programmatic Interfaces
Snowflake Ecosystem - Data Integration
Recently I had a chance to get to know the flink-table-store project. I was attracted by the idea behind it at first glance.
After reading the docs, a question has been on my mind for a while. It's about the design of the file storage.
It looks like it could be implemented on top of other popular open-source libraries rather than by creating a totally new, LSM-tree-based component. Hudi or Iceberg look like good choices, since they both support saving and querying change logs.
If it were done that way, there would be no need to create a component for each related computation engine (Spark, Hive, or Trino), since those are already supported by Hudi and Iceberg. That looks like a better solution to me than reinventing the wheel.
So, here are my questions: Is there any issue with writing data as Hudi or Iceberg? Why weren't they chosen in the original design?
I'm looking for an explanation of the design.
Flink Table Store is a new project created to natively support update/delete operations on DFS tables using data snapshots.
These features are already available in Apache Hudi (the first open lakehouse format), Delta Lake (the lakehouse format developed and maintained by Databricks), and Apache Iceberg, all of which evolve quickly.
Tables created with these tools can be queried from different tools/engines (Spark, Flink, Trino, Athena, Spectrum, Dremio, ...), but supporting all of those engines requires design compromises that can affect performance. Flink Table Store, by contrast, is created and optimized for Flink, so it gives you the best performance with Apache Flink compared with the other three projects.
Is there any issue with writing data as Hudi or Iceberg?
Not at all, a lot of companies use Hudi and Iceberg with Spark, Flink and Trino in production, and they have no issues.
Why weren't they chosen in the original design?
If you want to create tables readable by other tools, you should avoid Flink Table Store and choose between the other options. The main idea of Flink Table Store was to create internal tables used to transform your streaming data, similar to KTables in Kafka Streams: you write your streaming data to Flink Table Store tables, transform it across multiple stages, and at the end write the result to a Hudi or Iceberg table so it can be queried by the different tools. A rough sketch of that pattern is shown below.
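A very rough sketch of that staging pattern in PyFlink SQL follows. The catalog and connector option keys, warehouse paths, and table schemas are assumptions that depend on the flink-table-store and Iceberg connector versions you run, so treat this as an illustration of the pattern rather than a drop-in job.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Internal, Flink-managed staging tables (catalog type and warehouse path
# are version-dependent assumptions).
t_env.execute_sql("""
    CREATE CATALOG ts_catalog WITH (
        'type' = 'table-store',
        'warehouse' = 'file:///tmp/table_store'
    )
""")
t_env.execute_sql("USE CATALOG ts_catalog")
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS stage_orders (
        order_id BIGINT,
        amount   DOUBLE,
        PRIMARY KEY (order_id) NOT ENFORCED
    )
""")
# (Populating stage_orders from your streaming sources is omitted here.)

# Final, externally readable table in Iceberg (connector options assumed);
# a temporary table is used so it can live outside the table-store catalog.
t_env.execute_sql("""
    CREATE TEMPORARY TABLE orders_iceberg (
        order_id BIGINT,
        amount   DOUBLE
    ) WITH (
        'connector' = 'iceberg',
        'catalog-name' = 'ice',
        'catalog-type' = 'hadoop',
        'warehouse' = 'file:///tmp/iceberg'
    )
""")

# Transform inside the staging table, then hand the result off to Iceberg
# so Spark, Trino, etc. can query it.
t_env.execute_sql("""
    INSERT INTO orders_iceberg
    SELECT order_id, amount
    FROM stage_orders
    WHERE amount > 0
""")
```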
I have a scenario with Spring Batch where I need to read data from a MS SQL Server database and write it to a Cassandra database.
I am new to batch processing and haven't found many resources on Google to understand more about this.
Could you please share your inputs on this?
Thanks in advance
Your question is very light on detail and a little too open-ended, so I wanted to warn you that there's a chance the community will vote to close it for those reasons.
Based on what you've provided, it sounds like you've got a streaming use case where an app "service" is the source of the data and publishes it to a messaging/event platform, and other systems/services can subscribe to those events.
You can use Kafka or Pulsar as the platform, with Cassandra as one of the sinks. If you're interested in trying it out, Astra Streaming is a streaming-as-a-service backed by Pulsar, with Astra DB (Cassandra-as-a-service) as the sink.
Astra Streaming and DB have free tiers which don't require a credit card so you can quickly do POCs without having to worry about downloading/installing/configuring clusters.
As a side note, Astra DB comes with ready-to-use Stargate.io, a data platform that allows you to connect to Cassandra using REST, GraphQL, and JSON/Doc APIs, so you can easily build applications on top of Cassandra. Cheers!
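If you want to experiment with that flow, a minimal producer sketch using the pulsar-client Python package could look like the following; the broker URL, topic, and payload are placeholders (Astra Streaming would give you its own service URL and token), and the Cassandra sink itself is configured on the platform side rather than in this code.

```python
# pip install pulsar-client
import json
import pulsar

# Placeholder broker URL and topic; for Astra Streaming you would use the
# service URL and token-based authentication it provides instead.
client = pulsar.Client("pulsar://localhost:6650")
producer = client.create_producer("persistent://public/default/device-events")

# Publish one event; a Cassandra sink subscribed to this topic would
# persist it on the other side.
event = {"device_id": "sensor-42", "temperature": 21.5}
producer.send(json.dumps(event).encode("utf-8"))

client.close()
```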
Using a MySQL source connector, I can capture the MySQL changes and post them to ES or another database for backup. But for that I need a separate connector (both source and sink) for each table in my source database.
So my question is:
How can I achieve the same purpose without creating that many source and sink connectors, one per table? Creating so many connectors is cumbersome. If that were possible, backing up the database (replica) and building a faster-responding service for clients would become much easier for me. Or is there no way to do this?
For the source connectors, you can use table.whitelist. For example,
table.whitelist: "User, Address, Email"
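As a hedged sketch, one Debezium MySQL source connector covering several tables could be registered through the Kafka Connect REST API like this. The hosts, credentials, and table names are placeholders, and depending on the connector version the option is table.whitelist (older releases) or table.include.list (newer ones), usually with fully qualified db.table names.

```python
# pip install requests
import requests

# All hosts, credentials, topics, and table names below are placeholders.
connector = {
    "name": "mysql-source",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "184054",
        "database.server.name": "mydb",
        # One source connector, several tables:
        "table.whitelist": "mydb.User,mydb.Address,mydb.Email",
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.mydb",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()
```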
Sink connectors can only be configured for one table at a time.
And I wouldn't say that it is hard to maintain multiple sink/source connectors and topics. From my experience, it is harder to maintain connectors which replicate data from multiple topics/sources. For example, if you want to apply an SMT (Single Message Transform) on a particular topic, you won't be able to achieve it without isolated connectors, as SMTs are applied at the connector level. Furthermore, if you configure a single connector for all of your sources and at some point it fails, all of your target systems will encounter downtime.
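To illustrate the SMT point, a transform is declared inside an individual connector's configuration and therefore applies only to what that connector handles. A small, hedged fragment (the transform choice and field name are arbitrary placeholders) that would be merged into one isolated connector's config:

```python
# Fragment of a single (isolated) connector's config: the SMT below is
# applied at the connector level, so it affects only this connector's data.
smt_config = {
    "transforms": "tsFormat",
    "transforms.tsFormat.type":
        "org.apache.kafka.connect.transforms.TimestampConverter$Value",
    "transforms.tsFormat.field": "created_at",   # placeholder field name
    "transforms.tsFormat.target.type": "string",
    "transforms.tsFormat.format": "yyyy-MM-dd HH:mm:ss",
}
```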
I am just starting to set up a project to keep track of some network-enabled home devices. I have a program that saves this data, and I am putting together a process to upload that data to Snowflake automatically. I would like to know what you would recommend so I can easily access the home device information from anywhere.
The two options I am considering are AWS-based auto-ingest and Snowflake's Snowpipe REST API, which I have tested with only a few devices.
The main factor I am considering is which method I can set up to upload and query data quickly from a mobile app written in Python or Ruby, depending on the device.
Any advice or resources you can point me to on this?
Thank you!
Your question is pretty open-ended, so more details from you would make this answer more detailed as well. In general, though, I would suggest that if your IoT data can be stored directly in blob storage (S3 in the case of AWS), then you should leverage Snowflake's Snowpipe for continuous ingestion. Also, look into Tasks and Streams to automate moving that data through whatever processes you'll set up once the data is in Snowflake; a rough sketch follows below.
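As a hedged sketch of the Tasks and Streams part (credentials, object names, warehouse, and schedule are all placeholders), the idea is roughly:

```python
# pip install snowflake-connector-python
import snowflake.connector

# Placeholder credentials and object names throughout.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="my_db", schema="public",
)
cur = conn.cursor()

# Track rows newly landed by Snowpipe in the raw table.
cur.execute(
    "CREATE STREAM IF NOT EXISTS raw_device_stream ON TABLE raw_device_data"
)

# Periodically move newly arrived rows into a curated table.
cur.execute("""
    CREATE TASK IF NOT EXISTS curate_device_data
      WAREHOUSE = my_wh
      SCHEDULE = '5 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('RAW_DEVICE_STREAM')
    AS
      INSERT INTO curated_device_data
      SELECT device_id, reading, ingested_at FROM raw_device_stream
""")

# Tasks are created suspended; resume to start the schedule.
cur.execute("ALTER TASK curate_device_data RESUME")

cur.close()
conn.close()
```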
A good reference for you:
https://docs.snowflake.net/manuals/user-guide/data-pipelines-intro.html
I want to know the pros and cons of using Fusion instead of regular Solr. Can you give some examples (like problems that can be solved easily with Fusion)?
First of all, I should disclose that I am the Product Manager for Lucidworks Fusion.
You seem to already be aware that Fusion works with Solr (or one or more Solr clusters or instances), using Solr for data storage and querying. The purpose of Fusion is to make it easier to use Solr, integrate Solr, and to build complex solutions that make use of Solr. Some of the things that Fusion provides that many people find helpful for this include:
Connectors and a connector framework. Bare Solr gives you a good API and the ability to push certain types of files at the command line. Fusion comes with several pre-built data source connectors that fetch data from various types of systems, process them as appropriate (including parsing, transformation, and field mapping), and send the results to Solr. These connectors include common document stores (cloud and on-premise), relational databases, NoSQL data stores, HDFS, enterprise applications, and a very powerful and configurable web crawler.
Security integration. Solr does not have any authentication or authorizations (though as of version 5.2 this week, it does have a pluggable API and a basic implementation of Kerberos for authentication). Fusion wraps the Solr APIs with a secured version. Fusion has clean integrations into LDAP, Active Directory, and Kerberos for authentication. It also has a fine-grained authorizations model for managing and configuring Fusion and Solr. And, the Fusion authorizations model can automatically link group memberships from LDAP/AD with access control lists from the Fusion Connectors data sources so that you get document-level access control mirrored from your source systems when you run search queries.
Pipelines processing model. Fusion provides a pipeline model with modular stages (in both API and GUI form) to make it easier to define and edit transformations of data and documents. It is analogous to unix shell pipes. For example, while indexing you can include stages to define mappings of fields, compute new fields, aggregate documents, pull in data from other sources, etc. before writing to Solr. When querying, you could do the same, along with transforming the query, running and returning the results of other analytics, and applying security filtering.
Admin GUI. Fusion has a web UI for viewing and configuring the above (as well as the base Solr config). We think this is convenient for people who want to use Solr, but don't use it regularly enough to remember how to use the APIs, config files, and command line tools.
Sophisticated search-based features: Using the pipelines model described above, Fusion includes (and makes easy to use) some richer search-based components, including natural language processing and entity extraction modules, and real-time signals-driven relevancy adjustment. We intend to provide more of these in the future.
Analytics processing: Fusion includes and integrates Apache Spark for running deep analytics against data stored in Solr (or on its way in to Solr). While Solr implicitly includes certain data analytics capabilities, that is not its main purpose. We use Apache Spark to drive Fusion's signals extraction and relevancy tuning, and expect to expose APIs so users can easily run other processing there.
Other: many useful miscellaneous features like: dashboarding UI; basic search UI with manual relevancy tuning; easier monitoring; job management and scheduling; real-time alerting with email integration, and more.
A lot of the above can of course be built or written against Solr, without Fusion, but we think that providing these kinds of enterprise integrations will be valuable to many people.
Pros:
Connectors: Lucidworks provides a wide range of connectors; with these you can connect to data sources and pull data from them.
Reusability: In Lucidworks Fusion you can create pipelines for data ingestion and data retrieval. You can create pipelines with common logic so they can be reused in other pipelines.
Security: You can apply restrictions over data, i.e., security trimming. Lucidworks provides built-in query-pipeline stages for security trimming, or you can write a custom pipeline stage for your use case.
Troubleshooting: Lucidworks Fusion consists of discrete services, i.e., API, connectors, Solr. You can troubleshoot issues per service, since each service has its own logs. You can also configure JVM properties for each service.
Support: Lucidworks support is available 24/7. You can create a support case according to severity, and they will schedule a call for you.
Cons:
Not many, but it keeps you away from your normal development; you don't get much chance to open your IDE and start coding.