We are planning to implement a project in the Azure cloud where the data store will be Azure Data Lake for now; in the future HDP will be implemented, with ADLS serving as an extended datanode. From ADLS we want to expose data for dashboard creation in Tableau. The initial plan was to use Hive, with Tableau connecting to the data through Hive. But here comes the performance issue:
There will be multiple users (100+) who will access the data through Tableau.
We will also have to expose data to a different portal via API calls.
This means multiple connections will be established at the same time, all of them hitting Hive. My questions are:
Can Hive serve this purpose with minimal response time?
How can I measure the performance?
I don't want my users to run a query in Tableau and then sit back and wait a long time to see the dashboard.
Would you please share your experience with this design issue? Should we use Hive, or some other tool that performs better with Tableau and HDFS storage? Someone suggested using Azure SQL Server and connecting Tableau to SQL Server, but that is again the old-fashioned way, and also a matter of cost, since the price is tied to the execution of each query.
If you have experience with a better solution, please share; it would be greatly appreciated.
Thanks in advance.
Hive LLAP could work, if you can get it installed.
Otherwise, at my work, we've had good experience with PrestoDB and Tableau on S3 data.
Some teams use Spark SQL, and you can set up a Spark Thrift Server, which should be compatible with the Hive JDBC/ODBC drivers; a quick connectivity and latency check against it is sketched below.
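As a rough way to answer the "how can I measure the performance" question, here is a minimal sketch, assuming a Spark Thrift Server (or HiveServer2/LLAP endpoint) is already running and reachable; the host, port, username, and table name are placeholders:

    # Time a query against a Spark Thrift Server / HiveServer2 endpoint.
    # It speaks the same HiveServer2 Thrift protocol that Tableau's Hive
    # ODBC driver uses, so latency here approximates dashboard latency.
    from pyhive import hive  # pip install "pyhive[hive]"
    import time

    conn = hive.connect(host="thrift-host.example.com", port=10000,
                        username="etl")  # placeholders
    cursor = conn.cursor()

    start = time.time()
    cursor.execute("SELECT COUNT(*) FROM sales_facts")  # hypothetical table
    print("rows:", cursor.fetchone()[0])
    print("elapsed: %.2fs" % (time.time() - start))

    cursor.close()
    conn.close()

Running this concurrently from a few dozen processes gives a crude but useful picture of how the endpoint holds up under the 100+ user load you describe.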
Related
I use a SQL Server 2017 database to perform ad-hoc data analytics activities to support my team. In order to source data from various databases, I either restore a backup in my environment (if the target DB holding the data I'm after is also SQL Server) or use linked servers to establish a direct connection (where I need data from Oracle or iSeries).
More recently, though, I'm coming across SaaS-based systems and was wondering if there's any way I can establish a direct connection between my database and the SaaS database. I'm not sure whether SSIS packages will do the trick. Any pointers would be gratefully appreciated, as I'm struggling to find the right, scalable solution for this problem!
Data integration with SaaS solutions is a mixed bag. You have to discover on a case-by-case basis what integration or data export functionality each SaaS application has. Few will allow any kind of direct database access, but you might find an ETL tool that has pretty broad connectivity.
In Azure you can look at Logic Apps connectors and Power Query connectors, or at products like Boomi, which have connectors for many popular SaaS applications. When none of those fit, a small export script like the sketch below is often the fallback.
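A hedged sketch of that fallback: pulling rows from a SaaS REST API and landing them in SQL Server. The endpoint, token, paging scheme, and staging table are all hypothetical; most SaaS systems expose some paged JSON export like this even when direct database access is impossible.

    # Pull paged JSON from a (hypothetical) SaaS API into a SQL Server
    # staging table. Endpoint, token, and column names are placeholders.
    import requests
    import pyodbc  # pip install pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver;DATABASE=analytics;Trusted_Connection=yes"
    )
    cur = conn.cursor()

    page = 1
    while True:
        resp = requests.get("https://api.example-saas.com/v1/invoices",
                            params={"page": page},
                            headers={"Authorization": "Bearer <token>"})
        resp.raise_for_status()
        rows = resp.json().get("data", [])
        if not rows:
            break  # no more pages
        for r in rows:
            cur.execute("INSERT INTO staging.invoices (id, total, issued_on) "
                        "VALUES (?, ?, ?)",
                        r["id"], r["total"], r["issued_on"])
        page += 1

    conn.commit()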
I need to regularly (but incrementally) sync (one way) the contents of a set of SQL Server Azure tables to a PostgreSQL Azure instance.
Here are some of the avenues I've considered:
Linked server from SQL Server. No go. Azure SQL Database doesn't support linked servers.
Foreign Data Wrapper from PostgreSQL. No go. PostgreSQL on Azure only supports the postgres_fdw, not the needed tds_fdw.
Azure Data Factory. No go. The data copy process doesn't work incrementally, and the sink pipeline component doesn't support PostgreSQL.
Commercial replication solutions. Too expensive for a startup and most aren't hosted.
SymmetricDS or ReplicaDB. These might work, but they aren't hosted, so after all the time and effort of configuration and debugging we may or may not save time over building a custom solution.
Am I missing an obvious solution?
Congratulations on solving your problem. It would be better if you could share more detail about your simple replication system.
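For anyone landing here, a minimal sketch of what such a one-way incremental sync could look like, assuming each source table carries a monotonically increasing watermark column (updated_at here) and the target table has a unique key on id; all connection strings, table, and column names are hypothetical placeholders:

    # One-way incremental sync sketch: Azure SQL -> Azure Database for PostgreSQL.
    # Assumes the source table has a monotonically increasing updated_at column
    # and the target has a unique constraint on id. All names are placeholders.
    import pyodbc    # pip install pyodbc
    import psycopg2  # pip install psycopg2-binary

    src = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver.database.windows.net;DATABASE=appdb;UID=sync;PWD=secret"
    )
    dst = psycopg2.connect(
        "host=mypg.postgres.database.azure.com dbname=appdb user=sync password=secret"
    )

    with dst, dst.cursor() as cur:
        # High-water mark: newest row already copied to PostgreSQL.
        cur.execute("SELECT COALESCE(MAX(updated_at), '1900-01-01') FROM orders")
        watermark = cur.fetchone()[0]

        rows = src.cursor().execute(
            "SELECT id, customer, amount, updated_at FROM orders "
            "WHERE updated_at > ? ORDER BY updated_at",
            watermark,
        ).fetchall()

        for row in rows:
            # Upsert, so re-running after a partial failure is safe.
            cur.execute(
                "INSERT INTO orders (id, customer, amount, updated_at) "
                "VALUES (%s, %s, %s, %s) "
                "ON CONFLICT (id) DO UPDATE SET customer = EXCLUDED.customer, "
                "amount = EXCLUDED.amount, updated_at = EXCLUDED.updated_at",
                tuple(row),
            )

Scheduled every few minutes (an Azure Functions timer, or cron on a small VM), this keeps the copy reasonably fresh without any commercial replication product.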
I have a project to ingest metadata from the Snowflake data warehouse into the Azure Data Catalog (ADC). The ADC does not natively support this, so I must use the ADC API or ODBC.
Before I proceed on implementing the API solution, I thought it wise to ask if anyone has suggestions on this. A Google search returned nothing of note.
Unfortunately, ODBC will not work either. I am moving on to a solution that loads via the API.
The API or ODBC is the best choice.
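A hedged sketch of what an API-based registration could look like, one table per call. The endpoint and api-version follow Microsoft's ADC REST documentation, but the payload is simplified, the AAD bearer-token acquisition is assumed to happen elsewhere, and all catalog, server, and table names are placeholders:

    # Register one Snowflake table in Azure Data Catalog via its REST API.
    # Endpoint shape per Microsoft's ADC docs; payload simplified. TOKEN must
    # be an AAD bearer token issued for the Data Catalog resource.
    import requests

    CATALOG = "DefaultCatalog"      # placeholder catalog name
    TOKEN = "<AAD bearer token>"    # obtain via MSAL against the ADC resource

    url = ("https://api.azuredatacatalog.com/catalogs/" + CATALOG +
           "/views/tables?api-version=2016-03-30")

    asset = {
        "properties": {
            "fromSourceSystem": False,
            "name": "ORDERS",                           # Snowflake table
            "dataSource": {"sourceType": "Snowflake",   # free-text source type
                           "objectType": "Table"},
            "dsl": {
                "protocol": "snowflake",                # assumed protocol tag
                "address": {"server": "myacct.snowflakecomputing.com",
                            "database": "SALES",
                            "schema": "PUBLIC",
                            "object": "ORDERS"},
            },
        }
    }

    resp = requests.post(url, json=asset,
                         headers={"Authorization": "Bearer " + TOKEN})
    resp.raise_for_status()
    print("Registered:", resp.headers.get("Location"))

From there it is a loop over the table list pulled from Snowflake's INFORMATION_SCHEMA.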
I am working on a project where I need to display a MSSQL Server database's performance metrics, for example memory consumed/free, free storage space, etc. I researched this, and one thing that came up was DogStatsD.
Datadog provides a library for .NET projects to send custom metrics, but that was not the solution for me, because the metrics only show up on the Datadog website. I have to display all the data received from MSSQL Server myself (in a graph or whatever is suited). There will be multiple servers/instances.
Is there a way to do that, with our web app connected to multiple databases, receiving and displaying the information?
I cannot use already-available tools for these insights.
You can easily get all the needed data by querying DMVs (dynamic management views) and other resources inside SQL Server; a sketch of such a collector follows. A good start is here.
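A minimal collector sketch for the multi-server case, assuming the login has VIEW SERVER STATE permission; the server list is a placeholder, while sys.dm_os_sys_memory and sys.master_files are standard SQL Server views:

    # Pull memory and file-size metrics straight from SQL Server DMVs for
    # several instances, so a web app can chart them itself instead of
    # shipping them to Datadog. Server names are placeholders.
    import pyodbc  # pip install pyodbc

    SERVERS = ["sql01.example.com", "sql02.example.com"]  # hypothetical

    MEMORY_SQL = """
    SELECT total_physical_memory_kb / 1024     AS total_mb,
           available_physical_memory_kb / 1024 AS free_mb
    FROM sys.dm_os_sys_memory;
    """

    FILES_SQL = """
    SELECT DB_NAME(database_id) AS db, name,
           size * 8 / 1024 AS size_mb   -- size is in 8 KB pages
    FROM sys.master_files;
    """

    def collect(server):
        conn = pyodbc.connect(
            f"DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={server};"
            "DATABASE=master;Trusted_Connection=yes"
        )
        cur = conn.cursor()
        total_mb, free_mb = cur.execute(MEMORY_SQL).fetchone()
        files = cur.execute(FILES_SQL).fetchall()
        conn.close()
        return {"server": server, "total_mb": total_mb,
                "free_mb": free_mb, "files": files}

    for s in SERVERS:
        print(collect(s))  # feed this into the web app's charts instead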
I am starting an Elasticsearch 5 project with data that currently lives in SQL Server, so I am starting from scratch:
I am thinking about how to import data from my SQL Server, and especially how to synchronize my data when rows are updated or added.
I saw here that it is advised not to run batches too frequently.
But how do I build synchronization batches? Do I have to write them myself, or are there widely used tools and practices?
Rivers and the JDBC feeder plugin appear to have been widely used, but they don't work with Elasticsearch 5.x.
Any help would be very welcome.
I'd recommend using Logstash:
It's easy to use and set up
You can do your own ETL in Logstash configuration files
You can have multiple JDBC sources in one file
You'll have to figure out how to make incremental (batched) updates to sync your data; it really depends on your data model. A minimal pipeline sketch follows the link below.
This is a nice blog piece to begin with:
https://www.elastic.co/blog/logstash-jdbc-input-plugin
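For example, a hedged pipeline sketch using the jdbc input plugin's built-in :sql_last_value bookkeeping for incremental pulls; the driver path, connection string, schedule, and table/column names are placeholders:

    # logstash-sqlserver.conf -- incremental SQL Server -> Elasticsearch sketch
    input {
      jdbc {
        jdbc_driver_library => "/opt/drivers/mssql-jdbc.jar"
        jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
        jdbc_connection_string => "jdbc:sqlserver://sqlhost:1433;databaseName=appdb"
        jdbc_user => "logstash"
        jdbc_password => "secret"
        schedule => "*/5 * * * *"        # poll every 5 minutes, not continuously
        statement => "SELECT id, name, updated_at FROM products
                      WHERE updated_at > :sql_last_value ORDER BY updated_at"
        use_column_value => true         # track a column, not the run time
        tracking_column => "updated_at"
        tracking_column_type => "timestamp"
      }
    }
    output {
      elasticsearch {
        hosts => ["http://localhost:9200"]
        index => "products"
        document_id => "%{id}"           # stable id so updates overwrite docs
      }
    }

The stable document_id makes re-runs idempotent: updated rows overwrite their existing Elasticsearch documents instead of piling up duplicates.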