I am looking to store data in Hive to run analysis on the past months (~100 GB per day).
My rows contain a date (STRING) field that looks like this: 2016-03-06T04:31:59.933012793+08:00
I want to partition based on this field, but only on the date part (2016-03-06) -- I don't care about the timezone. Is there any way to achieve that without changing the original format?
The reason for partitioning is both performance and the ability to delete old days, to keep a rolling window of data.
Thank you
You can achieve this through INSERT OVERWRITE TABLE with dynamic partitioning.
You can apply the substring or regexp_extract function to your datetime column and get the value in the required format.
Below is a sample query where I load a partitioned table by applying a function to the partition column.
CREATE TABLE base2(id int, name string)
PARTITIONED BY (state string);

INSERT OVERWRITE TABLE base2 PARTITION (state)
SELECT id, name, substring(state, 0, 1)
FROM base;

Here I am applying a transformation to the partition column. Hope this helps.
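Applied to the question's timestamp column, the same pattern might look like this (table and column names are assumptions; in Hive, substring(dt, 1, 10) extracts the 2016-03-06 part while the original string is stored untouched):

```sql
-- Hypothetical tables; dt holds the original string, e.g.
-- 2016-03-06T04:31:59.933012793+08:00
CREATE TABLE events_partitioned (dt STRING, payload STRING)
PARTITIONED BY (day STRING);

-- Dynamic partitioning must be enabled for this kind of insert
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The first 10 characters of the timestamp are the date part
INSERT OVERWRITE TABLE events_partitioned PARTITION (day)
SELECT dt, payload, substring(dt, 1, 10) AS day
FROM events_raw;
```

Old days can then be dropped cheaply, e.g. ALTER TABLE events_partitioned DROP PARTITION (day='2016-03-06'), which gives you the rolling window.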
I have three columns of data: incident_ID, date, state.
The incident_IDs are each unique; date is in year format only and ranges from 2013 to 2016; states are in no particular order and repeat if an incident occurs within the same state in the same year.
I'm going to be combining this data w/ another table, but first I need to organize the data to better match the other table's format- which is laid out showing the year, state, and dollars spent per state. So for 2013, I would have 51 rows (each state + DC) and each state would have a dollar amount- then rinse/repeat down the table through to 2016.
I'm pretty new to SSIS/Visual Studio, but from my understanding I should be able to use a Derived Column transformation to accomplish this, but I don't know how to get there.
Is there a way to use Derived Column to 'count' and rearrange the data in order to show how many incidents occurred per state in the given year?
There is no aggregate functionality (like SUM, COUNT, MIN, MAX) in the Derived Column transformation. For a count, you can use the Row Count task, or you can insert the data into a SQL table, aggregate it there, and then use the result in a derived column.
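If you go the SQL-table route, the aggregation itself is a plain GROUP BY (table and column names here are assumptions based on the question):

```sql
-- Hypothetical staging table loaded from the source data:
-- incident_ID (unique), date (year only, 2013-2016), state
SELECT [date] AS incident_year,
       state,
       COUNT(*) AS incident_count
FROM IncidentStaging
GROUP BY [date], state
ORDER BY incident_year, state;
```

This yields one row per state per year, which matches the layout of the other table you want to join against.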
I am modelling a financial stock price storage in Cassandra, where I need to cater for retrospective changes.
An append only database is what came to mind.
CREATE TABLE historical_data (
    ticker text,
    eoddate timestamp,
    price double,
    created timestamp,
    PRIMARY KEY (ticker, eoddate)
) WITH CLUSTERING ORDER BY (eoddate DESC);
eg a record might be:
ticker=AAPL, eoddate=2016-09-28, price=123.4, created=2016-09-28 16:30:00
If a day later there were a retrospective data fix, I'd insert another record:
ticker=AAPL, eoddate=2016-09-28, price=120.9, created=2016-09-29 09:00:00
What is the best way to model/query this data if I'd like to get the latest series for AAPL (i.e. filtering out all but the most recent value)?
In SQL I could write a partition query. How about in CQL?
Or should the filter be applied at the application level?
Thanks.
If I understand your need correctly, your table is good.
With this schema, you can run a query like:
SELECT price
FROM historical_data
WHERE ticker = 'AAPL'
LIMIT 1;
It will return the last price for ticker AAPL.
The CLUSTERING ORDER BY clause orders your data physically in descending order within a given ticker partition; it won't order the whole table. So this query should be enough.
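One caveat worth noting (my addition, not part of the answer above): with PRIMARY KEY (ticker, eoddate), the retrospective insert from the question has the same primary key as the original row, so Cassandra treats it as an upsert and only the newest write survives. If you want true append-only history, the created column has to be part of the clustering key, e.g.:

```sql
-- Variant schema (an assumption, not the original design): keeps every
-- version of a price; the newest correction sorts first within each day.
CREATE TABLE historical_data_versioned (
    ticker text,
    eoddate timestamp,
    created timestamp,
    price double,
    PRIMARY KEY (ticker, eoddate, created)
) WITH CLUSTERING ORDER BY (eoddate DESC, created DESC);
```

With the original table, the LIMIT 1 query above already sees only the latest fix; with this versioned variant, collapsing to the newest created per eoddate would be done at the application level.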
I have a table with
fields: Id, ChId, ChValue, ChLoggingDate
The data is saved every minute into the database. I need a query to check whether data exists for every minute in the table, throughout the year, for a particular weekday. That is, for all Mondays in 2013: if a day has complete data, include it when calculating the arithmetic mean over the year of Mondays.
You will need a table, or a table-valued function, to create a timestamp for every minute; then you can join with that and check where the original table has no data.
That is the only way: any query can otherwise only work with the data it has.
This is a common approach for any kind of reporting.
http://www.kodyaz.com/t-sql/create-date-time-intervals-table-in-sql-server.aspx
explains how; you just need to expand it into a times table. If you separate date and time, you can get away with a time table (00:00:00 to 23:59:59) and a date table separately.
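A minimal sketch of the idea, assuming the column names from the question and a hypothetical table name (a real date/time table, as the linked article builds, scales better than this recursive CTE):

```sql
-- Generate the 1440 minutes of one day and left-join the logged data;
-- rows where no log entry matches are the missing minutes.
DECLARE @day date = '2013-01-07';  -- a Monday, as an example

WITH Minutes AS (
    SELECT CAST(@day AS datetime) AS minute_ts
    UNION ALL
    SELECT DATEADD(MINUTE, 1, minute_ts)
    FROM Minutes
    WHERE minute_ts < DATEADD(MINUTE, 1439, CAST(@day AS datetime))
)
SELECT m.minute_ts
FROM Minutes m
LEFT JOIN ChannelLog t   -- hypothetical name for the question's table
    ON t.ChLoggingDate >= m.minute_ts
   AND t.ChLoggingDate <  DATEADD(MINUTE, 1, m.minute_ts)
WHERE t.Id IS NULL
OPTION (MAXRECURSION 1440);
```

If this returns no rows, the day is complete and its ChValue rows can go into the yearly Monday average.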
I'm working with a user table and want to add an "end of probationary period" date. Basically, each new user has 2 full months from when they join as their probation period. I saw that I can put a formula in the column definition for my user table, but I'm wondering if I should have a script that updates this instead, or if this is an acceptable time to use computed columns. I access this table for various things and will occasionally update a user's row based on performance milestone achievements. The application date will never change/be updated.
My question is: Is using the computed column good practice in this situation, or will it recompute each time I update the row (even though I'm not going to update the application date)? I don't want to create more overhead when I update the row in the future.
Formula I'm using in the column definition for the Probation End Date:
DATEADD(DAY, -1, DATEADD(MONTH, 3, DATEADD(DAY, 1 - DATEPART(DAY, [APP_DT]), [APP_DT])))
Seeing that this date most likely will never change once it's set, it's probably not a good candidate for a computed column.
After all: once you insert a row into that table, you can easily calculate that "end of probation period" date right there and then (e.g. in a trigger), and once set, that date won't ever change.
So while you can definitely do it this way, I would probably prefer to use an AFTER INSERT trigger (or a stored procedure for the INSERT operation) that calculates it just once and then stores the date.
Also, just as a heads-up: a computed column with just the formula is recalculated every time you access it. That is, unless you specify the PERSISTED keyword, in which case the result is stored alongside the other data in the row; that would be a much better fit here, again because the value, once calculated, is not going to change.
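A sketch of the PERSISTED variant, using the question's formula (table and column names here are assumptions):

```sql
-- Hypothetical user table; PROBATION_END_DT is computed once per
-- (re)write of APP_DT and stored with the row, not on every read.
CREATE TABLE dbo.AppUser
(
    USER_ID INT IDENTITY(1,1) PRIMARY KEY,
    APP_DT  DATE NOT NULL,
    PROBATION_END_DT AS
        DATEADD(DAY, -1,
            DATEADD(MONTH, 3,
                DATEADD(DAY, 1 - DATEPART(DAY, APP_DT), APP_DT)))
        PERSISTED
);
```

PERSISTED requires the expression to be deterministic, which DATEADD/DATEPART over a stored column are, so it is allowed here.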
If you want to later extend someone's probation period without having to change their application date, then a computed column is NOT the way to go. Why not just use a DEFAULT constraint for both columns?
USE tempdb;
GO
CREATE TABLE dbo.foo
(
MemberID INT IDENTITY(1,1),
JoinDate DATE NOT NULL DEFAULT SYSDATETIME(),
ProbationEndDate DATE NOT NULL DEFAULT
DATEADD(DAY, -1, DATEADD(MONTH, DATEDIFF(MONTH,0,SYSDATETIME())+3, 0))
);
INSERT dbo.foo DEFAULT VALUES;
SELECT MemberID, JoinDate, ProbationEndDate FROM dbo.foo;
Results:
MemberID JoinDate ProbationEndDate
-------- ---------- ----------------
1 2013-04-05 2013-06-30
(Notice I used a slightly less convoluted approach to get the end of the month two months out.)
There's no overhead when you insert data; the values are computed only when you read the column. So I'd say your approach is correct.
I need to store several date values in a database field. These values will be tied to a "User" such that each user will have their own unique set of these several date values.
I could use a one-to-many relationship here, but each user will have exactly 4 date values tied to them, so I feel a one-to-many table would be overkill (in many ways, e.g. speed). But if I needed to query against them, I would want those 4 values in different fields, e.g. MyDate1, MyDate2, etc.; then the SQL command to fetch them would have to check 4 values each time.
So the one-to-many relationship would probably be the best solution, but is there a better/cleaner/faster/whatever another way around? Am I designing it correctly?
The platform is MS SQL 2005 but solution on any platform will do, I'm mostly looking for proper db designing techniques.
EDIT: The 4 fields represent 4 instances of the same thing.
If you do it as four separate fields, then you don't have to join. To save the query syntax from being too horrible, you could write:
SELECT * FROM MyTable WHERE 'DateLiteral' IN (MyDate1, MyDate2, MyDate3, MyDate4);
As mentioned in the comments, the IN operator is pretty specific when it comes to date fields (down to the last (milli)second). You can always apply date/time functions to the values, but BETWEEN is unusable:
SELECT * FROM MyTable WHERE date_trunc('hour', 'DateLiteral')
IN (date_trunc('hour', MyDate1), date_trunc('hour', MyDate2), date_trunc('hour', MyDate3), date_trunc('hour', MyDate4));
Some databases like Firebird have array datatype, which does exactly what you described. It is declared something like this:
alter table t1 add MyDate date[4];
For what it's worth, the normalized design would be to store the dates as rows in a dependent table.
Storing multiple values in a single column is not a normalized design; first normal form explicitly requires that each column hold exactly one value.
You can make sure no more than four rows are inserted into the dependent table this way:
CREATE TABLE ThisManyDates (n INT PRIMARY KEY);
INSERT INTO ThisManyDates VALUES (1), (2), (3), (4);
CREATE TABLE UserDates (
User_ID INT REFERENCES Users,
n INT REFERENCES ThisManyDates,
Date_Value DATE NOT NULL,
PRIMARY KEY (User_ID, n)
);
However, this design doesn't allow you to make the date values mandatory.
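Querying the normalized design stays simple; for example, using the tables defined above ('DateLiteral' and the user id 42 are placeholders):

```sql
-- Find every user that has a given date among their four dates
SELECT DISTINCT User_ID
FROM UserDates
WHERE Date_Value = 'DateLiteral';

-- Or fetch all of one user's dates, in slot order
SELECT n, Date_Value
FROM UserDates
WHERE User_ID = 42
ORDER BY n;
```

Compare this with the four-column version, where every such query has to name MyDate1 through MyDate4 explicitly.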
How about having 4 fields along with the User ID (if you are sure it won't exceed that)?
Create four date fields and store the dates in the fields. The date fields might be part of your user table, or they might be in some other table joined to the user table in a one-to-one relationship. It's your call.