MS SQL server stored procedures to Spark

We use MS SQL Server as the primary option for various databases, and we run hundreds of stored procedures on a regular basis.
Now we are moving completely to a big data stack, and we are using Spark for the batch jobs. But we have already invested enormous effort in creating those stored procedures. Is there a way to reuse the stored procedures on top of Spark? Or is there an easy way to migrate them to Spark instead of rewriting them from scratch?
Or does any framework, such as the Cloudera distribution or Impala, address this requirement?

No, there's not as far as I can tell. You may be able to use a very similar logical flow but you're going to need to invest serious time and effort to convert the T-SQL to Spark. I would recommend going straight to Scala and not wasting time with Python/PySpark.
My rule of thumb for the conversion would be to keep anything that is SQL in the stored procs as SQL in Spark (sqlContext.sql("SELECT x FROM y")). Be aware, though, that Spark DataFrames are immutable, so any UPDATE or DELETE has to be rewritten as a step that outputs a new, modified DataFrame.
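As a rough illustration of that rule of thumb, the sketch below rewrites an UPDATE and a DELETE as SELECT statements whose result would become the new DataFrame. The helper names (update_as_select, delete_as_select), the table orders and its columns are all made up for the example, and it only builds the SQL strings; nothing here comes from an existing library.

```python
def update_as_select(table, columns, set_column, new_value_sql, predicate_sql):
    """Emulate `UPDATE table SET set_column = new_value WHERE predicate`
    as a SELECT that returns every row, with the column rewritten only
    where the predicate is true."""
    select_list = []
    for col in columns:
        if col == set_column:
            select_list.append(
                f"CASE WHEN {predicate_sql} THEN {new_value_sql} "
                f"ELSE {col} END AS {col}"
            )
        else:
            select_list.append(col)
    return f"SELECT {', '.join(select_list)} FROM {table}"


def delete_as_select(table, predicate_sql):
    """Emulate `DELETE FROM table WHERE predicate` as a SELECT that keeps
    the rows the DELETE would have left behind. coalesce(..., false) keeps
    rows where the predicate evaluates to NULL, as DELETE would."""
    return f"SELECT * FROM {table} WHERE NOT coalesce({predicate_sql}, false)"


# Hypothetical table and columns, for illustration only.
columns = ["id", "status", "amount"]
update_query = update_as_select(
    "orders", columns, "status", "'closed'", "amount = 0"
)
delete_query = delete_as_select("orders", "status = 'cancelled'")

# With a Spark context available, each query yields a new DataFrame, e.g.:
#   updated = sqlContext.sql(update_query)
print(update_query)
print(delete_query)
```

The original table is never modified; each statement describes the table as it would look after the change, and the next step of the job reads from that result instead.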

Related

Yesod Database Query

I'm new to both Yesod and databases, so please bear with me on this basic question:
I plan to write SQL procedures and store them in .sql files, so that the Yesod web application just calls these .sql files instead of having the queries written directly in Haskell.
Is this common practice?
I assume that writing queries in SQL itself is more suitable than writing them in Haskell.
Any comments are welcome.

"Is this common practice?"
No. The closest you can get to this is to use something like rawSql in persistent, or some other low-level library. I would recommend avoiding raw SQL queries unless you have a valid reason to use them.
"I assume that writing queries in SQL itself is more suitable than writing them in Haskell."
No, the whole point of using persistent is to bring type safety to database queries. If you write them in SQL itself, you lose that benefit.

Why Hive is not supporting Stored Procedure?

Why does Hive not support stored procedures?
If it does not support them, how should we handle stored procedures (SPs) in Hive? Is there an alternative solution?
(We already have a database in MS SQL.)
What about HBase? Does it support SPs?

First of all, neither Hadoop nor Hive is an alternative to your SQL DB. You must never consider either of the two as a replacement for your RDBMS.
Hive was developed just to provide warehousing capabilities on top of an existing Hadoop cluster, keeping in mind the large base of SQL users: expert database designers and administrators, as well as casual users who use SQL to extract information from their data warehouses. Although it provides a SQL-like interface, it is not a SQL DB. Hive is best suited for data warehouse applications, where relatively static data is analyzed, fast response times are not required, and the data is not changing rapidly. Simply put, it is for offline batch processing.
There is nothing like stored procedures in HBase either. But it has something called coprocessors, which resemble stored procedures in an RDBMS. To find more on coprocessors you can go here.
And as @zsxwing has said, Sqoop is just a data migration tool, nothing more. Once you switch to the NoSQL world you need to be flexible and you need to abide by the NoSQL rules.
If you could elaborate on your use case a bit, maybe we can help you better.
In response to your comment:
Yes, Facebook uses Hadoop, Hive and other related tools extensively. In fact, Hive was developed at Facebook. But these are not the only things they use. Wherever they have OLTP and fully transactional needs, they still depend on an RDBMS. One example is their Timeline feature, which uses MySQL. They have a gigantic (and awesome) pipeline which consists of a lot of things and not just Hadoop and Hive. See the picture below.

Hive and HBase do not support stored procedures. However, Hive plans to support them (HIVE-3087) in the future. HBase has no plans to support them, since it focuses only on being a storage layer and is more like NoSQL.
A Hive UDF could implement some of the functionality of a stored procedure, though it's not enough.

Hive does not have stored procedures
Hive indeed does not have any stored procedures as explained in existing answers. However, here are 2 mitigating factors:
Hive has views
Of course it is not a proper substitute for stored procedures, but with smart use of views you can perhaps remove the need for some of your procedures.
You can call hive from another program
The last time I ran into the problem that Hive does not have stored procedures, I realized that the thing I wanted to do (loop over all columns) was something that I could also do in another program. So I used the following workflow:
1. Run a query to get the relevant (meta)data: Python calls hive to get the column names
2. Use the information to build the query: Python takes in all column names and builds the corresponding select statements
3. Run the resulting query: Python does a system call with hive -e
4. Optionally, go to 2 if needed
With views and external calls, I have so far been able to work around the lack of stored procedures.
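The workflow above could look roughly like the following. This is only a sketch under a few assumptions: the hive executable is on the PATH, DESCRIBE prints one column per line with the name as the first field, and the per-column work is counting NULLs (a stand-in for whatever the loop is meant to do). All function names are made up for the example.

```python
import subprocess


def run_hive(query):
    """Run a query through the Hive CLI (`hive -e`) and return its stdout.
    Assumes the `hive` executable is on the PATH."""
    result = subprocess.run(
        ["hive", "-e", query], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_column_names(describe_output):
    """Take the first whitespace-separated field of each line as a column
    name; stop at the first blank or '#' line, where partition details
    start for partitioned tables (assumed output format)."""
    names = []
    for line in describe_output.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            break
        names.append(stripped.split()[0])
    return names


def build_null_count_query(table, columns):
    """Build one SELECT that counts the NULLs in every column."""
    parts = [
        f"SUM(CASE WHEN {c} IS NULL THEN 1 ELSE 0 END) AS {c}_nulls"
        for c in columns
    ]
    return f"SELECT {', '.join(parts)} FROM {table}"


def null_counts(table):
    columns = parse_column_names(run_hive(f"DESCRIBE {table}"))  # step 1
    query = build_null_count_query(table, columns)               # step 2
    return run_hive(query)                                       # step 3
```

Keeping the query builder separate from the hive call means the generated SQL can be printed and checked before anything is run against the cluster.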

Please refer to HPL/SQL. I am looking for the same solution but have not tried it yet.
I believe data warehouse applications need stored procedure support, but I prefer set-based over row-based procedures.
In my personal experience, procedural support is needed when leveraging server-side program templates in structured data warehouse applications. It makes such applications easier to port between SQL/NoSQL platforms like Netezza, MSSQL, Oracle, DB2, and BigInsight.

Have a look at the open-source project PL/HQL at http://www.plhql.org. It allows you to run existing SQL Server, Oracle, Teradata, MySQL etc. stored procedures in Hive.

Can I replicate from Postgres to MS SQL?

I will have a Postgres database in production but want to use MS SQL (whatever edition) for reporting. So, I would like to have replication set up where MS SQL subscribes from postgres. Is this possible?

All heterogeneous replication scenarios are deprecated by Microsoft, and they now recommend building solutions using SSIS and CDC instead.
We load data from PostgreSQL into our SQL Server reporting database using SSIS and it works well, although we had to use a commercial OLE DB provider because of limitations (at that time) in the open-source one.
Actually copying the data is usually the easy part; most of the work comes in gathering requirements, understanding the data, transforming it, implementing logging and error handling etc. SSIS can do some things for you right away (e.g. logging), but my general advice would be to use it primarily as a workflow tool and for simple data copying with minimal transformation logic (e.g. data type conversion). If something seems too difficult or clumsy in SSIS, you can put it into a stored procedure or script and call that from SSIS instead.

I've been using and following PostgreSQL for several years and am not aware of such a solution. If one exists, I'm concerned that it might be complex or fragile. I would recommend regular exports/imports via cron. Between the export and the import, you would need to take care of translating the formats.
If your reporting actually happens in MS Excel or MS Access, I recommend looking into connecting them directly to PostgreSQL via ODBC.

How to build big and complex database in sql - IN EASY WAY?

I have installed Oracle XE. I build a small database every day from the command prompt to practice, but now I want more. I want to have a bigger database with a lot of different data to practice on and do exercises with.
So, is it possible to get a big data file from somewhere and upload it to the XE database?

You can't get 'big' data into Oracle Express Edition, as it is limited to 4GB (10g) or 10GB (11g).
That said, there are public datasets available. Personally, I like the FAA data on registered aircraft owners/operators.

As you are practicing with Oracle, perhaps a good solution (which will also generate exactly the data you need) would be to write your own stored procedures to generate your data in a loop (or similar construct).
You could then generate as much as you like whilst also practicing your handling of large datasets and writing of efficient PL/SQL and SQL code.
This way your data will match your current database structure too without having to build a new database matching whichever dataset you download from the web.

IIRC there are sample schemas, such as HR, that can be enabled. See this.

How to separate programming logic and data in MS SQL Server 2005?

I am developing a data driven website and quite a lot of programming logic resides in database stored procedures and database functions. I found myself changing the stored proc/functions quite a lot in order to fix bugs or add new functionality. The data (tables) have remained mostly untouched.
The issue I am having is keeping track of versions of the stored procs/functions. Currently I increment the version of the whole database when I make a set of changes. As the data is huge (10 GB), I run into issues having to run development and release versions of the database in parallel.
I wish to put all the stored procs and functions in one database and keep data in one database, so that I can better manage the changes.
I am sure others have encountered a similar situation, and I would welcome suggestions on how best to handle it.

I would also recommend using source control keyword expansion in your stored procedures ($Version:$)
That way you can eyeball, grep, search syscomments, etc to see what version you have on your deployed database.
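Checking which version is deployed could be scripted along these lines. The sketch assumes the source control system expands the keyword to something like $Version: 1.4 $ inside a comment in the procedure body; the exact expanded format depends on the tool, and the procedure text below is a made-up example rather than output from a real database.

```python
import re

# Matches an expanded keyword such as "$Version: 1.4 $". An unexpanded
# "$Version:$" does not match because there is no value to capture.
VERSION_KEYWORD = re.compile(r"\$Version:\s*([^\s$]+)\s*\$")


def extract_version(procedure_source):
    """Return the version embedded in a procedure's source text, or None."""
    match = VERSION_KEYWORD.search(procedure_source)
    return match.group(1) if match else None


# Hypothetical procedure text, e.g. as read back from the deployed database.
deployed_source = """
CREATE PROCEDURE dbo.GetCustomer @id INT AS
-- $Version: 1.4 $
SELECT * FROM dbo.Customer WHERE CustomerId = @id
"""

print(extract_version(deployed_source))
```

The same function can be run over every procedure definition pulled from the database to list which versions are actually deployed.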

You can version just the schema dumps. In combination with source control keyword expansion (as suggested by Rawheiser), you just take a look at what version you have in the database, generate a diff and apply it.
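The "generate a diff" step can be sketched with Python's standard difflib module, assuming both schema dumps are available as plain text. The function name, the labels and the two tiny dumps are invented for the example.

```python
import difflib


def schema_diff(old_dump, new_dump, old_label="deployed", new_label="repository"):
    """Return a unified diff between two schema dumps given as strings."""
    return "".join(
        difflib.unified_diff(
            old_dump.splitlines(keepends=True),
            new_dump.splitlines(keepends=True),
            fromfile=old_label,
            tofile=new_label,
        )
    )


# Made-up dumps: the deployed procedure versus the one in source control.
deployed = "CREATE PROCEDURE dbo.Ping AS\nSELECT 1\n"
repository = "CREATE PROCEDURE dbo.Ping AS\nSELECT 2\n"
print(schema_diff(deployed, repository))
```

This is a text diff for review, not an executable migration script; turning it into ALTER statements is still a manual step or a job for a schema comparison tool.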
Also, there are several excellent tools to compare databases and their schemas, generate DDL scripts etc.: SQL Workbench, Power Architect, DDLUtils and Redgate SQL Compare, to name a few. SQL Compare is likely to work best with SQL Server, although all the others are FOSS and provide a higher ROI (in terms of time spent learning and what you can do with them), as they are platform and RDBMS independent.
Finally, I have to say... I understand that the immediate results you get with logic in the DB are tempting, but if you've gone beyond a couple of procedures in the database, you're setting yourself up for quite a lot of pain: sifting through what easily turns into spaghetti code, and locking your application to a single database vendor. You might have your reasons, but I've been there and didn't like it very much. Logic can live very nicely in a different layer.

For source control you have several options:
1. Use a Visual Studio Database project.
2. Use SQL Server 2005's built-in support for source control.
3. Use a third-party tool such as SQL Compare.
IMO option 1 is preferable.
