Querying large amounts of data processed by Hive

Say I have around 10-20 GB of data in HDFS as a Hive table. It was obtained after several MapReduce jobs and a JOIN over two separate datasets. I need to make it queryable by users. What options do I have?
Use Sqoop to transfer the data from HDFS to an RDS instance such as PostgreSQL. But I want to avoid spending so much time on data transfer: I just tested HDFS -> RDS in the same AWS region using Sqoop, and 800 MB of data takes 4-8 minutes, so you can imagine that ~60 GB would be pretty unmanageable. This would be my last resort.
Query Hive directly from my web server on each user request. I haven't ever heard of Hive being used like this, so I'm skeptical. The idea struck me because I just found out you can query Hive tables remotely after some port forwarding on the EMR cluster. But being new to big(ish) data, I'm not quite sure about the risks involved. Is it commonplace to do this?
Some other solution: how do people usually do this kind of thing? It seems like a pretty common task.
Just for completeness' sake, my data looks like this:
id time cat1 cat2 cat3 metrics[200]
A123 1234212133 12 ABC 24 4,55,231,34,556,123....(~200)
...
(time is epoch)
And my queries look like this:
select cat1, corr(metrics[2],metrics[3]),corr(metrics[2],metrics[4]),corr(metrics[2],metrics[5]),corr(metrics[2],metrics[6]) from tablename group by cat1;
I need the correlation function, which is why I've chosen PostgreSQL over MySQL.

You have a correlation function in Hive:
corr(col1, col2)
Returns the Pearson coefficient of correlation of a pair of numeric columns in the group.
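So the query from the question should run in Hive essentially unchanged; just note that Hive arrays are zero-indexed, so metrics[2] is the third metric. For example:
select cat1,
       corr(metrics[2], metrics[3]),
       corr(metrics[2], metrics[4])
from tablename
group by cat1;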

You can simply connect to a HiveServer port via ODBC and execute queries.
Here is an example:
http://www.cloudera.com/content/cloudera/en/downloads/connectors/hive/odbc/hive-odbc-v2-5-10.html

Hue (Hadoop User Experience) has a Beeswax query editor designed specifically for exposing Hive to end users who are comfortable with SQL. This way they can run ad-hoc queries against the data residing in Hive without needing to move it elsewhere. You can see an example of the Beeswax query editor here: http://demo.gethue.com/beeswax/#query
Will that work for you?

From what I understand of the question above, you have some data (~20 GB) stored in HDFS using Hive, and you now want to access that data to run statistical functions such as correlation.
Hive has built-in functions that compute correlation.
Otherwise, you can connect R directly to Hive using RHive, or even Excel to Hive via a data source.
The other solution is installing Hue, which comes with Hive editors where you can query Hive directly.

Related

Azure Data Factory - Referencing Lookup activities in Queries

I'm following a tutorial on Azure Data Factory migration from Azure SQL to Blob storage through pipelines. While most of the concepts make sense, the 'Copy Data' query is a bit confusing. I have a background in writing Oracle SQL, but Azure SQL in ADF is pretty different, and I'm struggling to find specific technical documentation, probably because it's not widely adopted yet.
Query is posted below:
SELECT data_source_table.PersonID,data_source_table.Name,data_source_table.Age,
CT.SYS_CHANGE_VERSION, SYS_CHANGE_OPERATION
FROM data_source_table
RIGHT OUTER JOIN CHANGETABLE(CHANGES data_source_table,
#{activity('LookupLastChangeTrackingVersionActivity').output.firstRow.SYS_CHANGE_VERSION})
AS CT ON data_source_table.PersonID = CT.PersonID
WHERE CT.SYS_CHANGE_VERSION <=
#{activity('LookupCurrentChangeTrackingVersionActivity').output.firstRow.CurrentChangeTrackingVersion}
Output to the sink Blob as a result of the 'Copy Data' query:
2,name2,14,4,U
7,name7,51,3,I
8,name8,31,5,I
9,name9,38,6,I
A couple of questions I had:
There's a lot of external referencing of other activities in the 'Copy Data' query, like #{activity('...').output.firstRow.CurrentChangeTrackingVersion}. Is there a way to learn the appropriate syntax for referencing external activities? I can't find any good documentation on the syntax, like what .firstRow is or what the CHANGETABLE output looks like. I can't replicate this query in SSMS, which makes it a bit of a black box for me.
SYS_CHANGE_OPERATION appears in the SELECT with no table-name prefix. Is this queried directly from the table in SourceDataset? (It points to data_source_table, which has change tracking enabled.) My main confusion stems from how change tracking information is stored for the enabled tables. Is there a way to show all of a table's tracked changes in SSMS? I see some documentation on what the return values are, but it's hard for me to visualize without seeing it on the table, so an example query with output for some return values would be nice.
The LookupLastChangeTracking activity queries all rows from a table (which, when I checked, is just one row), but the LookupCurrentChangeTracking activity uses a CHANGE_TRACKING function to pull the version of the data sink in table_store_ChangeTracking_version. Why does it use a function when the data sink's version is already recorded in table_store_ChangeTracking_version?
Sorry for the many questions, but I can't find any way to make this learning curve less steep. Any guides or resources would be awesome!
There is an article on getting the same thing done from the UI, and it will help you understand it better:
https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-change-tracking-feature-portal
1. These are Lookup activities; they're very straightforward. Please read about them here:
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-lookup-activity
2. SYS_CHANGE_OPERATION is a column returned by CHANGETABLE (the CT alias in the query), which is why the unprefixed reference still resolves. Regarding the details of how change tracking (CT) data is stored: I am not sure whether all the system tables are exposed on Azure SQL, but there were a few tables on the on-prem version of SQL Server that could be queried if needed. For this exercise, though, I think that would be overkill.
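If you want to see a table's tracked changes directly in SSMS, here is a minimal sketch (assuming change tracking is already enabled on the database and on data_source_table, as in the tutorial):
-- The live, database-wide version; this is what the LookupCurrentChangeTracking
-- activity fetches at run time, whereas table_store_ChangeTracking_version only
-- holds the value saved at the end of the previous run, hence the function call.
SELECT CHANGE_TRACKING_CURRENT_VERSION();
-- All changes to data_source_table since version 0;
-- SYS_CHANGE_OPERATION is I, U or D (insert, update, delete).
SELECT CT.PersonID, CT.SYS_CHANGE_VERSION, CT.SYS_CHANGE_OPERATION
FROM CHANGETABLE(CHANGES data_source_table, 0) AS CT;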

Bulk Update with SQL Server based on a list of primary keys

I am processing data from a database with millions of rows. I pull a batch of 1000 items from the database and process them without a problem. I am not loading the whole entity; I am just pulling down a few columns of data for the batch.
What I want to do is mark the 1000 rows as processed with a single SQL command.
Something like:
UPDATE dbo.Clients
SET HasProcessed = 1
WHERE ClientID IN (...)
The ... is just a list of integers.
Context:
Azure SQL Server 2012
Entity Framework
Database.ExecuteSqlCommand
Ideas:
I know I could build the command as a pure string, but this would mean not using any SqlParameters and not benefiting from query plan reuse.
Also, I found some information about table-valued parameters, but these require creating a table type, and that is overhead I would like to avoid. This is just a list of integers, after all.
Question:
Is there an easy (performant) way to do this that I am overlooking, either with Entity Framework or ExecuteSqlCommand?
If not, and table-valued parameters are the best way, could you provide a complete example of how to convert an integer list into the simplest possible type and run it with the above query?
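For what it's worth, the table-valued parameter overhead is one CREATE TYPE, once. A minimal sketch (the type name dbo.IntList is made up; pick your own):
-- One-time setup: a table type that holds just the integer keys.
CREATE TYPE dbo.IntList AS TABLE (ClientID INT PRIMARY KEY);
-- The batch update joins against the parameter instead of an IN (...) list.
DECLARE @ids dbo.IntList;
INSERT INTO @ids (ClientID) VALUES (1), (2), (3); -- bound from the application in practice
UPDATE c
SET HasProcessed = 1
FROM dbo.Clients AS c
INNER JOIN @ids AS i ON c.ClientID = i.ClientID;
From C#, you would fill a DataTable with a single int column, wrap it in a SqlParameter with SqlDbType.Structured and TypeName = "dbo.IntList", and pass that to Database.ExecuteSqlCommand together with the UPDATE ... JOIN statement. Unlike a concatenated string of IDs, the statement text stays constant, so the query plan is reused.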

How to store XML result of WebService into SQL Server database?

We have a .NET client that calls a web service, and we want to store the result in a SQL Server database.
I think we have two options for storing the data, and I am a bit undecided because I can't see the pros and cons clearly. One would be to map the results into database fields. That would require us to have database fields corresponding to each possible result type, e.g. for each "normal" result type as well as for faults.
On the other hand, we could store the resulting XML and query that via the SQL Server built in XML functions.
Personally, I am comfortable with dealing with both SQL and XML, so both look fine to me.
Are there any big pros and cons, and what would I need to consider in terms of database design when storing the resulting XML for quite a few different web service operations? I was thinking about a result table for each operation we call, with different entries for the different possible outcomes/types, and then storing the XML in the right field, e.g. a fault in the fault field, a "normal" return type in the appropriate field, etc.
We use a combination of both: XML for reference and detailed data, and text columns for the fields you might search on. Searchable columns include order number, customer reference, and ticket number. We just add them as we need them, since you can always extract them from the XML column.
I wouldn't recommend just the XML. If you store 10,000 messages a day, a query like:
select * from XmlLogging with (nolock) where Response like '%Order12%'
can become slow and interfere with other queries. You also can't display the logging in a GUI because retrieval is too slow.
I wouldn't recommend just the text columns either. If the XML format changes, you'd get an empty column, and that's hard to troubleshoot without the XML message. In addition, if you ever need to "replay" the message stream, that's a lot easier with the XML messages. Few requirements demand replay, but it's really helpful when repairing the fallout of production problems.
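As a sketch of that hybrid layout (the table reuses the XmlLogging name from above; the OrderNumber element is made up for illustration):
-- XML payload plus extracted, indexable search columns.
CREATE TABLE dbo.XmlLogging (
    Id          INT IDENTITY PRIMARY KEY,
    Response    XML NOT NULL,
    OrderNumber NVARCHAR(50) NULL -- extracted when the row is written; can be indexed
);
-- A targeted XQuery predicate instead of LIKE over the whole document:
SELECT Id, OrderNumber
FROM dbo.XmlLogging WITH (NOLOCK)
WHERE Response.exist('//OrderNumber[text()="Order12"]') = 1;
-- Once OrderNumber is populated, the search is a plain (indexable) lookup:
SELECT Id FROM dbo.XmlLogging WHERE OrderNumber = N'Order12';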

How to roughly and automatically translate the contents of a database once?

Let's say there is a very large SQL database: hundreds of tables, thousands of fields, millions of records. This database is in Dutch. I would like to translate all the values of certain fields into English for test purposes. It doesn't have to be completely correct; it just has to be readable for the testers.
I know that most of the texts are stored in fields named "name" and "description" throughout the database. Alternatively, all fields of type NVARCHAR (of any length) in all tables would be candidates for translation. Enumerating all the tables and fields by hand is really too much work, and I would like to avoid it, so translating only these fields would be enough.
Is there a way to walk through all the tables, retrieve the values of those specific fields from every record, and replace them with their English translations? Can this be done with SQL only?
The DB server doesn't matter: I can mount the database on MSSQL, Oracle, or any server of my choice. Translating the texts using Google or some other automatic tool is good enough for the purpose. Of course, the tool must have an API so it can be used automatically.
Does anybody have an experience with similar operations?
In pseudocode, here is what I would do, using Oracle. Below it's fleshed out as a C# sketch; connectionString is assumed, and GetTranslationFromGoogle is a hypothetical stub for whatever translation API you pick:
using System.Collections.Generic;
using Oracle.ManagedDataAccess.Client;

var connection = new OracleConnection(connectionString);
connection.Open();

// 1. Every table/column pair named "name" or "description".
var tableColumns = new List<(string Table, string Column)>();
var allFieldsQuery = new OracleCommand(
    "SELECT table_name, column_name FROM user_tab_columns " +
    "WHERE LOWER(column_name) IN ('name', 'description')", connection);
using (var reader = allFieldsQuery.ExecuteReader())
    while (reader.Read())
        tableColumns.Add((reader.GetString(0), reader.GetString(1)));
The first query gives you something like this:
TABLE_NAME | COLUMN_NAME
Table1 | Name
Table2 | Name
Table2 | Description
........... | ...........
// 2. For each pair: read the distinct texts, translate them, write them back.
foreach (var (table, column) in tableColumns)
{
    var texts = new List<string>();
    var fieldQuery = new OracleCommand(
        $"SELECT DISTINCT {column} AS ToTranslate FROM {table}", connection);
    using (var reader = fieldQuery.ExecuteReader())
        while (reader.Read())
            if (!reader.IsDBNull(0)) texts.Add(reader.GetString(0));

    var update = new OracleCommand(
        $"UPDATE {table} SET {column} = :Translated WHERE {column} = :ToTranslate",
        connection);
    update.BindByName = true; // ODP.NET binds by position by default
    foreach (var text in texts)
    {
        update.Parameters.Clear();
        update.Parameters.Add(new OracleParameter("Translated",
            GetTranslationFromGoogle(text))); // hypothetical translation call
        update.Parameters.Add(new OracleParameter("ToTranslate", text));
        update.ExecuteNonQuery();
    }
}
The first query gives you all the fields in all the tables you need to update.
The SELECT DISTINCT per column gives you all the texts you need to translate.
The parameterized UPDATE then rewrites each one of those texts.
This may be a very long process; it also depends on how long each translation call takes. I would recommend committing at least once per table, and printing a report that tells you, for each table, whether everything went OK. That way your tool can have an exclude function that filters fields and/or tables out of the first query when they were already translated in an earlier run.
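For the record, if you mount the database on MSSQL instead of Oracle, the equivalent enumeration, including the NVARCHAR variant from the question, is a query against INFORMATION_SCHEMA (a sketch):
SELECT TABLE_NAME, COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE = 'nvarchar'
   OR LOWER(COLUMN_NAME) IN ('name', 'description');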
Enhancement: write a file and execute it at the end of the translation, instead of executing SQL updates one by one.
If the process is too long, and it possibly will be, then rather than executing the UPDATE statements directly, write them all to a file, and once the loop is done, take the file and run it against the database.
Enhancement: a multi-threaded version.
A multi-threaded version of your tool could save some translation time: put all the texts to translate in a synchronized queue, use a pool of threads to translate from that queue, and put the translated messages in another synchronized queue, which is then used to write the SQL output file or to run the updates.

App Engine equivalent of a "NOT IN" query

I am coming to App Engine from a relational database background and was considering how best to accomplish this task.
Basically I have a table of objects and would like to retrieve a pair that a user has never seen.
In MySQL, for example, the most straightforward approach would be something like:
SELECT *
FROM object_pairs
WHERE id NOT IN(
SELECT object_pair_id
FROM user_object_pairs
WHERE user_id = :user_id
)
Ideally I'd also like to be able to retrieve a random (or semi-random) pair from the possible results.
Any suggestions appreciated.
Thanks
There's no easy way to do this. Most SQL databases will execute the query you suggest by first constructing an in-memory list of IDs, then doing a linear scan over the table and returning only the rows that aren't in that list. You can do the same in App Engine: fetch the user's "seen" list, query for all entities, and skip the ones they've seen before.
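For comparison, on the MySQL side the same filter is usually written as an anti-join, which also makes the semi-random pick from the question easy. A sketch (ORDER BY RAND() is fine while the candidate set stays small):
SELECT op.*
FROM object_pairs AS op
LEFT JOIN user_object_pairs AS uop
  ON uop.object_pair_id = op.id
 AND uop.user_id = :user_id
WHERE uop.object_pair_id IS NULL -- pairs this user has never seen
ORDER BY RAND()
LIMIT 1;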
