I need a fast and reliable solution for manipulating data in a column-based database. The database will store simulation data with 1 million values for each variable, so a table (or tables) will always have 1 million rows.
I need a solution that will allow really fast:
saving of 1 million values into the table (or a column), e.g. Var1, Var2, etc.
update of a table with 1 million new values (existing million values replaced by new values)
calculated columns, e.g. Var1 + Var2, or Var1 * (1 + Var2), etc.
I have tried MS SQL things like:
SELECT * INTO <VarTable> FROM <CLR Table-Valued Function>
And then
UPDATE <VarTable> SET <Var1> = <NewVar> FROM <TempTable> WHERE VarTable.ID = TempTable.ID
But the above runs very slowly, since each column holds a million rows.
I need a solution that allows quick replacement of whole columns and fast, simple mathematical operations between columns (or tables, if the columns live in different tables), given that all columns will always have the same number of rows (1 million).
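For illustration, here is a minimal T-SQL sketch of the two patterns meant above; all table and column names are placeholders, and this is a sketch of the approach rather than a tested solution. A joined UPDATE has to rewrite and log every row, whereas truncating and bulk-reloading can be minimally logged (under the simple or bulk-logged recovery model) and is often much faster for a full-column replacement:

-- Full-column replacement: truncate and bulk re-insert instead of a joined UPDATE.
TRUNCATE TABLE dbo.VarTable;
INSERT INTO dbo.VarTable WITH (TABLOCK) (ID, Var1)
SELECT ID, NewVar FROM dbo.TempTable;

-- Calculated columns across tables, relying on matching ID ranges:
SELECT a.ID,
       a.Var1 + b.Var2       AS SumVar,
       a.Var1 * (1 + b.Var2) AS ScaledVar
FROM dbo.VarTable1 AS a
JOIN dbo.VarTable2 AS b ON b.ID = a.ID;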
Database: MongoDB.
In my dataset, every record will have an attribute that can take on 1 of 3 values.
In terms of query speed, would it be faster to have one table with a column that stores this attribute, or to have 3 separate tables, one for each of the values?
I am trying to understand whether it is worthwhile, from a performance point of view, to create an index on a column of a huge table (about 90 million records in total).
What I am trying to achieve is fast filtering on the indexed column. The column to be indexed can have only 3 possible values, and my requirement is to regularly fetch the rows matching two of those values. That comes out to about 45 million records (half the table).
Does it make any sense to create an index on a column that can have only 3 possible values when you need to retrieve data matching two of them? And will creating this index improve the performance of my query with a WHERE clause on that column?
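For concreteness, the setup in question looks something like this in SQL Server syntax (table and column names are invented; the same reasoning applies in most engines):

-- The plain low-cardinality index being asked about:
CREATE NONCLUSTERED INDEX IX_BigTable_Status ON dbo.BigTable (Status);

-- The recurring query it is meant to speed up:
SELECT * FROM dbo.BigTable WHERE Status IN ('A', 'B');

-- A covering, filtered variant that is more likely to pay off for this
-- access pattern (OtherCol stands in for whatever columns the query needs):
CREATE NONCLUSTERED INDEX IX_BigTable_Status_Filtered
ON dbo.BigTable (Status) INCLUDE (OtherCol)
WHERE Status IN ('A', 'B');

With only 3 distinct values and roughly half the table matching, the optimizer may well ignore the plain index in favor of a scan, since SELECT * would force a key lookup per matching row; the covering/filtered variant is worth testing rather than assuming.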
I am currently using ClickHouse to store a few billion rows each week. We use aggregated tables to fetch data; so far so good. Now there is a need to fetch a single row from this database.
ClickHouse is not meant for such a case: even after applying some of the optimizations ClickHouse recommends, a single-row SELECT is still somewhat slow (a few seconds).
To clarify a little more: this table is indexed by columns a, b, c, and d and is also partitioned monthly (the table has some more columns). A new service has to query this table but only knows a, b, and z (a UUID column). The average response time is between 3 and 10 seconds across 10 billion rows.
I have an opportunity to add an extra data store layer, so I can store the data in an extra database for this need.
Now the actual question: what could be the best database for a case where we only need to read a single row out of billions?
P.S.:
Due to storage and network cost, we can't use Redis
We can't add more columns to the query to optimize it
Cassandra?
You can use an additional table and a materialized view to emulate an inverted index.
This additional table should be sorted by z and contain the primary key columns (a, b, c, d) from the main table.
Then query the main table like:
select ... from main_table
where (a, b, c, d) in (select a, b, c, d from additional_table where z = ...)
and z = ...
additional_table can be filled automatically by a materialized view over main_table.
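A minimal sketch of the supporting objects in ClickHouse SQL, assuming (purely for illustration) that the key columns are UInt64 and z is a UUID:

-- Lookup table sorted by z, holding the main table's key columns.
CREATE TABLE additional_table
(
    z UUID,
    a UInt64,
    b UInt64,
    c UInt64,
    d UInt64
)
ENGINE = MergeTree
ORDER BY z;

-- Keeps additional_table in sync with new inserts into main_table.
CREATE MATERIALIZED VIEW additional_table_mv TO additional_table AS
SELECT z, a, b, c, d FROM main_table;

Note that a ClickHouse materialized view only captures rows inserted after it is created, so existing data needs a one-off backfill, e.g. INSERT INTO additional_table SELECT z, a, b, c, d FROM main_table.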
I have 2 tables: one with 1 million rows (Table 1), and another with 99 million rows (Table 2). They are in separate schemas.
They have similar structures, so no problem there.
My question would be this:
I need a table containing both tables' data in the schema containing Table 1.
Would it be faster to run a query to transfer all 99 million rows of Table 2 into Table 1,
OR
would it be faster to transfer all 1 million rows into Table 2, and then alter Table 2's schema to Table 1's schema?
OR
would everything actually be instantaneous?
My understanding is that you want to insert all records from Table 2 into Table 1. If that is the case, I would suggest dropping the indexes on Table 1, running the insert, and then rebuilding them. Alternatively, you could leave the indexes on, but that would slow things WAY DOWN. Another solution, and my preferred one, is to create a Table 3, insert each of the 2 tables into it, and build the index; then rename Table 1 and Table 2 to TableName_Backup, and rename Table 3 to whatever you want. The second solution should give you optimal results while keeping both original tables in their original state while you validate the data. Once you feel good about it, either move the 2 original tables to an archive location or drop them, depending on your company policy.
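A rough T-SQL sketch of that preferred approach; the schema names, two-column structure, and index are placeholders, so adjust to the real table definition:

-- Build the combined table in Table 1's schema (assumed structure).
CREATE TABLE Schema1.Table3 (ID INT NOT NULL, Val FLOAT NULL);

INSERT INTO Schema1.Table3 WITH (TABLOCK) (ID, Val)
SELECT ID, Val FROM Schema1.Table1;

INSERT INTO Schema1.Table3 WITH (TABLOCK) (ID, Val)
SELECT ID, Val FROM Schema2.Table2;

-- Index once, after both loads.
CREATE CLUSTERED INDEX IX_Table3_ID ON Schema1.Table3 (ID);

-- Swap names, keeping the originals as backups for validation.
EXEC sp_rename 'Schema1.Table1', 'Table1_Backup';
EXEC sp_rename 'Schema2.Table2', 'Table2_Backup';
EXEC sp_rename 'Schema1.Table3', 'Table1';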
I have created a unique, clustered index on a view. The clustered index contains 5 columns (out of the 30 on this view), but a typical select using this view will want all 30 columns.
Doing some testing shows that querying for the 5 columns is far faster than querying for all 30. Is this just the natural overhead of selecting 6x as many columns, or is it because the indexed view is not storing the non-indexed columns, and therefore needs to perform some extra steps to gather the missing columns (joins on the base tables, I guess)?
If the latter, what are some steps to prevent this? Well, even if it's the former... what are some ways around this!
Edit: for comparison purposes, a select of just the 5 columns on the indexed view is about 10x faster than the same query on the base tables. But a select of all 30 columns is basically equivalent in speed to the query on the base tables.
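For reference, the comparison in the edit amounts to something like this (the view and column names are invented for illustration):

-- about 10x faster than the same query against the base tables:
SELECT Col1, Col2, Col3, Col4, Col5 FROM dbo.MyIndexedView;

-- roughly the same speed as querying the base tables directly:
SELECT * FROM dbo.MyIndexedView;  -- all 30 columns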
A clustered index, by definition, contains every field in every row in the table. It basically is a recreation of the table, but with the physical data pages in order by the clustered index, with b-tree sorting to allow quick access to specified values of your clustered key(s).
Are you just pulling values or are you getting aggregate functions like MIN(), MAX(), AVG(), SUM() for the other 25 fields?
An indexed view is a copy of the data, stored (clustered) potentially (and normally) in a different way from the base table. For all intents and purposes:
you now have two copies of the data
SQL Server is smart enough to see that the view and table are aliases of each other
for queries that involve only the columns in the indexed view
if the indexed view contains all columns, it is considered a full alias and can be used (substituted) by the optimizer wherever the table is queried
the indexed view can be used as just another index for the base table
When you select only the 5 columns from tbl (which has an indexed view ivw)
SQL Server completely ignores your table, and just gives you data from ivw
because the data pages are shorter (5 columns only), more records can be grabbed into memory in each page retrieval, so you get a 5x increase in speed
When you select all 30 columns, there is no way for the indexed view to be helpful. The query completely ignores the view and just selects data from the base table.
IF you select data from all 30 columns,
but the query filters on the first 4 columns of the indexed view,
and the filter is very selective (it will result in a very small subset of records),
then SQL Server can use the indexed view (scanning/seeking) to quickly generate a small result set, which it can then JOIN back to the base table to get the rest of the data.
However, just as with regular indexes, an index on (a, b, c, d, e), or in this case a clustered indexed view on (a, b, c, d, e), does NOT help a query that searches on (b, d, e), because those are not the leading columns of the index.
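For completeness, here is a minimal sketch of the kind of setup being discussed; all names are invented and the column list is illustrative only:

-- The view must be schema-bound (and use two-part names) to be indexable.
CREATE VIEW dbo.ivw_Sample
WITH SCHEMABINDING
AS
SELECT a, b, c, d, e, f  -- plus the remaining columns of dbo.tbl
FROM dbo.tbl;
GO

-- The unique clustered index materializes the view on disk, sorted by
-- these keys (this assumes the (a, b, c, d, e) combination is unique).
CREATE UNIQUE CLUSTERED INDEX IX_ivw_Sample
ON dbo.ivw_Sample (a, b, c, d, e);
GO

One related caveat: on non-Enterprise editions of SQL Server, the optimizer will not substitute the indexed view automatically; a query has to reference the view directly WITH (NOEXPAND) to get the indexed behavior.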