My system collects a lot of data from different resources (each resource has a text ID) and sends it to clients bundled together in predefined groups. There are some hundreds of different resources, and each might write a record anywhere from every second up to every few hours. There are fewer than a hundred "view groups".
The data collector is single-threaded.
What is the best method to organize the data?
a. Make a different table for each source, where the table name is based on the source ID?
b. Make a single table and add the source ID as a text field (and key, if possible)?
c. Make a table for each predefined display group, with the source ID as a text field?
Each record has a value (float) and a date. The query will be something like select * from ... where date < d1 and date > d2. In the case of a single table it will also have "and sourceId in (...)".
The database type is not decided yet; it might be SQLite, Postgres, MySQL, MSSQL ...
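As a rough sketch of option (b) with a single shared table (all table and column names below are placeholders, and the date column is renamed to avoid clashing with a reserved word):

create table samples (
    source_id   varchar(64) not null,  -- text ID of the resource
    value       float       not null,
    recorded_at timestamp   not null
);
create index ix_samples_source_date on samples (source_id, recorded_at);

-- fetch one view group's data for a time window
select source_id, value, recorded_at
from samples
where recorded_at > :d2
  and recorded_at < :d1
  and source_id in ('res_a', 'res_b', 'res_c');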
Background
I'm using Azure data factory v2 to load data from on-prem databases (for example SQL Server) to Azure data lake gen2. Since I'm going to load thousands of tables, I've created a dynamic ADF pipeline that loads the data as-is in the source based on parameters for schema, table name, modified date (for identifying increments) and so on. This obviously means I can't specify any type of schema or mapping manually in ADF. This is fine since I want the data lake to hold a persistent copy of the source data in the same structure. The data is loaded into ORC files.
Based on these ORC files I want to create external tables in Snowflake with virtual columns. I have already created normal tables in Snowflake with the same column names and data types as in the source tables, which I'm going to use in a later stage. I want to use the information schema for these tables to dynamically create the DDL statement for the external tables.
The issue
Since column names are always UPPER case in Snowflake, and Snowflake is case-sensitive in many ways, it is unable to parse the ORC file with the dynamically generated DDL statement, as the definition of the virtual columns no longer corresponds to the casing of the source column names. For example, it will generate a virtual column such as: ID NUMBER AS (value:ID::NUMBER)
This will return NULL as the column is named "Id" with a lower case D in the source database, and therefore also in the ORC file in the data lake.
This feels like a major drawback with Snowflake. Is there any reasonable way around this issue? The only options I can think of are to:
1. Load the information schema from the source database to Snowflake separately and use that data to build a correct virtual column definition with correct cased column names.
2. Load the records in their entirety into some variant column in Snowflake, converted to UPPER or LOWER.
Both options add a lot of complexity or even mess up the data. Is there any straightforward way to only return the column names from an ORC file? Ultimately I would need to be able to use something like Snowflake's DESCRIBE TABLE on the file in the data lake.
Unless you have set the parameter QUOTED_IDENTIFIERS_IGNORE_CASE = TRUE, you can declare your columns in whatever casing you want:
CREATE TABLE "MyTable" ("Id" NUMBER);
If your dynamic SQL carefully uses "Id" and not just Id you will be fine.
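Applied to the external table from the question, a minimal sketch (the stage, file format and table names are placeholders) would quote both the virtual column name and the path into the VARIANT value, so the field is matched with the exact casing stored in the ORC file:

CREATE EXTERNAL TABLE my_external_table (
    "Id" NUMBER AS (value:"Id"::NUMBER)  -- value:"Id" is case-sensitive and matches the lower-case d
)
LOCATION = @my_stage/my_directory/
FILE_FORMAT = (TYPE = ORC);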
Found an even better way to achieve this, so I'm answering my own question.
With the query below we can get the path/column names directly from the ORC file(s) in the stage, with a hint of the data type from the source. This filters out columns that contain only NULL values. I will most likely create some kind of data-type ranking table for the final data type determination for the virtual columns we're aiming to define dynamically for the external tables.
SELECT f.path as "ColumnName"
     , TYPEOF(f.value) as "DataType"
     , COUNT(1) as NbrOfRecords
FROM (
    -- read the raw ORC rows from the stage as a single VARIANT column
    SELECT $1 as value FROM @<db>.<schema>.<stg>/<directory>/ (FILE_FORMAT => '<fileformat>')
),
lateral flatten(input => value, recursive => true) f
WHERE TYPEOF(f.value) != 'NULL_VALUE'
GROUP BY f.path, TYPEOF(f.value)
ORDER BY 1
Is it possible to get this output from a SELECT query?
I tried the query below:
select monthly + savings as monthly savings from table
but the resulting data is under one column.
Is there any solution to get more than one column under the same heading?
There is no way to retrieve the information from SQL Server with merged headers, at least not with the most widely used clients.
SQL Server is a relational database, and its foundation is sets of data arranged in tables with columns, rows and relationships between them. Suppressing a header would mean breaking the column-value link. If you want to manipulate the headers, you will have to do so after retrieving the data from the database, maybe in your display layer or in a helper process between the database and your presentation, as Tim suggested.
I need to move data between two databases and wanted to see if SSIS would be a good tool. I've pieced together the following solution, but it is much more complex than I was hoping it would be - any insight on a better approach to tackling this problem would be greatly appreciated!
So here is what makes my situation unique: we have a large volume of data, so to keep the system performant we have split our customers across multiple database servers. These servers have databases with the same schema, but each is populated with unique data. Occasionally we need to move a customer's data from one server to another. Because of this, simply recreating the tables and moving the data in place won't work: a table in the database on server A could have 20 records while the same table in the database on server B has 30, so when moving record 20 from A to B it will need to be assigned ID 31. Getting past this wasn't difficult, but the trouble comes when moving the tables which have a foreign key reference to what is now record 31....
An example:
Here's a sample schema for a simple example:
There is a table to track manufacturers, and a table to track products which each reference a manufacturer.
Example of data in the source database:
To handle moving this data while maintaining relational integrity, I've taken the approach of gathering the manufacturer records, looping through them, and for each manufacturer moving the associated products. Here's a high level look at the Control Flow in SSDT:
The first Data Flow grabs the records from the source database and pulls them into a Recordset Destination:
The OLE DB Source pulls all columns from the source database's manufacturer table and places them into a recordset:
Back in the control flow, I then loop through the records in the Manufacturer recordset:
For each record in the manufacturer recordset I then execute a SQL task which determines the next available auto-incrementing ID in the destination database, inserts the record, and then returns the result of a SELECT MAX(ManufacturerID) in the Execute SQL Task result set, so that the newly created ManufacturerID can be used when inserting the related products into the destination database:
The above works, however once you get more than a few layers deep of tables that reference one another, this is no longer very tenable. Is there a better way to do this?
You could always try this:
Populate your manufacturers table.
Get your products data (ensure you have a reference to the manufacturer, such as its name).
Use a lookup to get the ID where the name (or whatever you choose) matches.
Insert into database.
This will keep your FK constraints and not require you to do all that max key selection.
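Expressed as plain T-SQL rather than an SSIS Lookup transform (the database, table and column names are assumptions for illustration, and cross-server access would go through a linked server or a staging copy), the idea is roughly:

-- insert products into the destination and resolve the new ManufacturerID by joining on the name
INSERT INTO DestDB.dbo.Product (ManufacturerID, ProductName)
SELECT dm.ManufacturerID, sp.ProductName
FROM SourceDB.dbo.Product sp
JOIN SourceDB.dbo.Manufacturer sm ON sm.ManufacturerID = sp.ManufacturerID
JOIN DestDB.dbo.Manufacturer dm ON dm.Name = sm.Name;

The destination rows get new auto-incremented IDs, and the foreign key is resolved through the name lookup instead of carrying the old IDs across.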
I'm creating a rather large APEX application which allows managers to go in and record statistics for associates in the company. Currently we have a database in oracle with data from AD which hold all the associates information. Name, Manager, Employee ID, etc.
Now I'm responsible for creating and modeling a table that will house all the stats for each employee. The table I have created has over 90 columns in it. Some contain data such as:
Documents Processed
Calls Received
Amount of Doc 1 Processed
Amount of Doc 2 Processed
and the list goes on for well over 90 attributes. So here is my question:
When creating this table in my application with so many different columns, how would I go about choosing an appropriate primary key? Should I link it to our employee table using the employee identification, which is unique (each has an associate number)?
Secondly, how can I create these tables (and possibly a form) so that I can associate the statistic I am entering with the actual individual?
I have ordered two books from Amazon on data modeling, since I am new to APEX and database design. Not a fresh chicken, but new enough to need some guidance. An additional problem I am running into is that each form can have only 60 fields. So I had thought about splitting my 90+ columns into tables for different functions.
Thanks
4.2 allows for 200 items per page.
oracle apex component limits
A couple of questions come to mind:
Are you sure that the employee IDs are not recyclable? If these IDs are unique and not recycled, you've found yourself a good primary key.
What do you plan on doing when you decide to add a new metric? Seems like you might have to add a new column to your rather large and likely not normalized table.
I'd recommend a vertical table for your metrics; you can use Oracle's PIVOT function to make your data appear more like a horizontal table.
If you went this route you would store your employee ID in one column, your metric key in another, and the value in a third.
I'd recommend that you create a metric table consisting of a primary key, a metric label, an active indicator, creation timestamp, creation user id, modified timestamp, modified user id.
This metric table will allow you to add new metrics, change the name of the metric, deactivate a metric, and determine who changed what and when.
This would be a much more flexible approach in my opinion. You may also want to think about audit logs.
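A minimal Oracle sketch of that vertical layout (all names are illustrative, not a definitive design):

CREATE TABLE metric (
    metric_id     NUMBER        PRIMARY KEY,
    metric_label  VARCHAR2(100) NOT NULL,
    active_ind    CHAR(1)       DEFAULT 'Y' NOT NULL,
    created_ts    TIMESTAMP     DEFAULT SYSTIMESTAMP,
    created_by    VARCHAR2(30),
    modified_ts   TIMESTAMP,
    modified_by   VARCHAR2(30)
);

CREATE TABLE employee_metric (
    employee_id   NUMBER NOT NULL,  -- the associate number from the AD-fed employee table
    metric_id     NUMBER NOT NULL REFERENCES metric (metric_id),
    metric_value  NUMBER,
    CONSTRAINT employee_metric_pk PRIMARY KEY (employee_id, metric_id)
);

-- present it horizontally again with PIVOT when needed
SELECT *
FROM (SELECT employee_id, metric_id, metric_value FROM employee_metric)
PIVOT (MAX(metric_value) FOR metric_id IN (1 AS docs_processed, 2 AS calls_received));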
We are building a set of features for our application. One of which is a list of recent user activities ala on SO. I'm having a little problem finding the best way to design the table for these activities.
Currently we have an Activities table with the following columns
UserId (Id of the user the activity is for)
Type (Type of activity - i.e. PostedInForum, RepliedInForum, WroteOnWall - it's a tinyint with values taken from an enumerator in C#)
TargetObjectId (An id of the target of the activity. For PostedInForum this will be the Post ID, for WroteOnWall this will be the ID of the User whose wall was written on)
CreatedAtUtc (Creationdate)
My problem is that the TargetObjectId column doesn't feel right. It's a soft link: there are no foreign keys, and only knowledge of the Type tells you what this column really contains.
Does any of you have a suggestion for an alternate/better way of storing a list of user activities?
I should also mention that the site will be multilingual, so the activity list must be viewable in a range of languages - that's why we haven't chosen, for instance, to just put the activity text/html in the table.
Thanks
You can place all content in a single table with a discriminator column and then just select top 20 ... from ... order by CreatedAtUtc desc.
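Using the column names from the Activities table in the question, that first option is roughly:

select top 20 UserId, Type, TargetObjectId, CreatedAtUtc
from Activities
order by CreatedAtUtc desc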
Alternatively, if you store different types of content in different tables, you can try something like (not sure about exact syntax):
select top 20 *
from (
    -- take the 20 most recent rows from each content table, tagging each with its type
    select * from (select top 20 ID, CreatedAtUtc, 'PostedToForum' as Type from ForumPosts order by CreatedAtUtc desc) fp
    union all
    select * from (select top 20 ID, CreatedAtUtc, 'WroteOnWall' as Type from WallPosts order by CreatedAtUtc desc) wp
) t
order by t.CreatedAtUtc desc
You might want to check out http://activitystrea.ms/ for inspiration, especially the schema definition. If you look at that spec you'll see that there is also the concept of a "Target" object. I have recently done something very similar but I had to create my own database to encapsulate all of the activity data and feed data into it because I was collecting activity data from multiple applications with disparate data sources in different databases.
Max