I'm building a game where the player has to build different types of buildings and can upgrade them. Some buildings may be upgradable to level 30, whereas some others to level 5 only.
I wonder what the best database layout for that would be. I am using sqlite3, if that makes any difference, but the question applies to other engines as well.
I have thought of two options for my buildings table:
Option one: Make a building_group column to identify which buildings are similar:
id (Integer, Auto increment), building_name, building_group, level, points, cost
1, path, 1, 1, 100, 1000
2, road, 1, 2, 200, 2000
3, highway, 1, 3, 300, 3000
4, village, 2, 1, 1000, 10000
5, town, 2, 2, 2000, 20000
6, city, 2, 3, 3000, 30000
Option two: Have one entry per building group and keep all level information in the same row. This doesn't seem like the best approach to me, but I thought I would mention it anyway.
id (Integer, Auto increment), building_name_1, points_1, cost_1, building_name_2, points_2, cost_2, building_name_3, points_3, cost_3,...
1, path, 100, 1000,road, 200, 2000, highway, 300, 3000
2, village, 1000, 10000, town, 2000, 20000, city, 3000, 30000
I'm sure there are better ways to handle that and I would like to hear your suggestions.
As you noted, the second approach makes very little sense. To effectively query or update it, you'll have to construct your column names dynamically, which is a great source of hard-to-track bugs (not to mention a potential vulnerability to SQL injection if you don't do it properly). Moreover, sooner or later you're going to think of some super-duper special upgrade with a level higher than the columns you originally planned for, and you'll have to change the table definition just to accommodate it, which makes no sense at all.
The first design is the textbook way of doing things, and it will easily let you create complex queries on the buildings and on what a player has or hasn't built. If you ever need to attach some common data to an entire building group, you can create another building_groups table with that information and make the building_group column in buildings a foreign key to its primary key.
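A minimal sketch of that layout in SQLite might look like the following (the group name column and the sample query are illustrative assumptions, not something taken from your post):
CREATE TABLE building_groups (
    id   INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL                 -- e.g. 'roads', 'settlements'
);

CREATE TABLE buildings (
    id             INTEGER PRIMARY KEY AUTOINCREMENT,
    building_name  TEXT    NOT NULL,
    building_group INTEGER NOT NULL REFERENCES building_groups (id),
    level          INTEGER NOT NULL,
    points         INTEGER NOT NULL,
    cost           INTEGER NOT NULL
);

-- e.g. "which building is the level-3 upgrade in group 1?"
SELECT building_name, points, cost
FROM buildings
WHERE building_group = 1
  AND level = 3;
Remember to run PRAGMA foreign_keys = ON; per connection if you want SQLite to actually enforce the foreign key.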
I have 2 large tables in Snowflake (~1 and ~15 TB resp.) that store click events. They live in two different schemas but have the same columns and structure; just different sources.
The data is dumped/appended into these tables on a monthly basis, and both tables have a time_id integer field representing the number of days after 1999-12-31 on which the click event took place.
Question is -- Should I leave it up to Snowflake to optimize the partitioning --OR-- Is this a good candidate for manually assigning a clustering key? And say, I do decide to add a clustering key to it, would re-clustering after next insert be just for the incremental data? --OR-- Would it be just as expensive as the initial clustering?
In case it helps, here is some clustering info on the larger of the 2 tables
select system$clustering_information( 'table_name', '(time_id)')
{
"cluster_by_keys" : "LINEAR(time_id)",
"total_partition_count" : 1151026,
"total_constant_partition_count" : 130556,
"average_overlaps" : 4850.673,
"average_depth" : 3003.3745,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 127148,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0,
"01024" : 984,
"02048" : 234531,
"04096" : 422451,
"08192" : 365912
}
}
A typical query I would run against these tables
select col1, col_2, col3, col4, time_id
from big_table
where time_id between 6000 and 7600;
Should I leave it up to Snowflake to optimize the partitioning? Is
this a good candidate for manually assigning a clustering key?
Yes, it seems like a good candidate for a clustering key (given the table size, the update intervals, and the query filters).
And say, I do decide to add a clustering key to it, would
re-clustering after next insert be just for the incremental data?
After the initial reclustering, as long as you do not insert data belonging to earlier days, the existing partitions will stay in a "constant" state, so reclustering will only process the new data/micro-partitions.
https://docs.snowflake.com/en/user-guide/tables-auto-reclustering.html#optimal-efficiency
Would it be just as expensive as the initial clustering?
Under normal conditions, it should not be.
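For reference, assigning the clustering key is a one-line DDL change (a sketch using the table name from your example query):
-- Define the clustering key; Snowflake's automatic clustering service
-- then maintains it in the background.
ALTER TABLE big_table CLUSTER BY (time_id);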
Mostly a long-winded comment on the question asked under Gokhan's answer:
This is helpful! Just so I have a sense of cost and time, how long do you think it'll take to run the clustering?
I would suggest doing a one-off rebuild of the table with an ORDER BY on time, rather than leaving auto-clustering to incrementally sort a table this large.
I say this because we had a collection of tables with about 3B rows each (there were ~30 of these tables) and would do a GDPR-related PII clean-up every month that deleted one month's worth of data via an UPDATE command. Since the UPDATE has no ORDER BY, the ordering was destroyed for about 1/3 of the table, which auto-clustering would then "fix" over the following day.
Our auto-clustering bill was normally ~100 credits a day, but on those days we were using ~300 credits, which implies ~6 credits per table, whereas a full table re-create with an ORDER BY would take maybe 15 minutes on a medium warehouse, so ~1 credit.
This is not to deride auto-clustering, but when a table gets randomly scrambled, its "a little at a time" approach is too passive/costly, imho.
On the other hand, if you cannot block the insert process for N minutes while you recreate the table, auto-clustering might be your only option. The counterpoint is that if you are always writing to the table, auto-clustering will back off a lot due to failed writes. But that is more of a "general case detail to watch out for", given that you state you do monthly loads.
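If you can pause loads during the rebuild, a minimal sketch of that one-off re-create (using the big_table/time_id names from the question) would be:
-- Recreate the table fully sorted on the clustering column, keeping grants.
-- Afterwards auto-clustering only has to deal with the incremental monthly loads.
CREATE OR REPLACE TABLE big_table COPY GRANTS AS
SELECT *
FROM big_table
ORDER BY time_id;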
So I have this "Google Sheets App" in one row. It has 2 interactive dropdowns (Dropdown B options appear based on Dropdown A), and some fields which change based on the option. I finally got all that to work using ARRAYFORMULA(INDIRECT), VLOOKUP, and so on. And it all works well, for the first row.
However, I need many rows of that, so I selected the entire first row and extended it all the way down. Now, though, the Dropdown B options are based on the Dropdown A option FROM THE FIRST ROW, not the row where I'm currently picking. I understand that my ARRAYFORMULA(INDIRECT) is linked that way, and that I would have to delete the first row if I wanted to pick something else in another row. What I want to know is whether it's possible to get around that: basically new row = new options, keep the first row as simple values, and don't affect anything else. Or at least somehow export the data from a row with a single click, so I can delete it and start all over again with new data?
This would ideally be done in a click since my boss wants me to make a completely functioning enterprise software in Google Sheets!
Google Sheet:
https://drive.google.com/file/d/1HRZsqKyIxD35dqCmCc75ldbtZeGvimKD/view?usp=sharing
try:
=ARRAYFORMULA(IFNA(
IFERROR(VLOOKUP(D2:D, data!A1:B20, 2, 0),
IFERROR(VLOOKUP(D2:D, data!A21:B42, 2, 0),
IFERROR(VLOOKUP(D2:D, data!A43:B54, 2, 0),
IFERROR(VLOOKUP(D2:D, data!A55:B61, 2, 0),
IFERROR(VLOOKUP(D2:D, data!A62:B94, 2, 0),
IFERROR(VLOOKUP(D2:D, data!A95:B101, 2, 0),
VLOOKUP(D2:D, data!A102:B139, 2, 0)))))))))
H2 would be:
=ARRAYFORMULA(IF(F2:F="",,VALUE(TEXT(G2:G-F2:F, "h:mm:ss"))*24*60*60))
and I2 would be:
=ARRAYFORMULA(IF(E2:E="",,IF(E2:E>40, "Paket unijeti rucno", E2:E*H2)))
We have a table user_info with high read-write-update activity containing millions of rows, and another table combinations with low write and high read activity containing thousands of rows. We need to link these two tables in a one-to-many relationship on the unique ID of combinations, where each combination is composed of 3 integers. Each row in user_info will be linked to ~5 rows in combinations.
Example combinations:
id, some_id1, some_id2, some_id3
1, 10, 10, 100
2, 10, 21, 201
3, 20, 21, 201
Example user_info (with array column approach):
id, some_column_1, some_column_2, combination_ids
1, 'Bla bla', 'Smth', {1,3}
2, 'Bla bla', 'Smth Smth', {2,3}
We know that this could be done either with an integer[] column on user_info or with a link table. We are interested in optimizing for performance, specifically for joins between the two tables involving many rows. All tables would be indexed on the relevant columns. Would it be faster to join using array columns or using a link table?
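For reference, the two layouts we are comparing would look roughly like this (PostgreSQL, names taken from the examples above; the indexes shown are only to illustrate what "indexed on the relevant columns" means):
-- Link-table layout: one row per (user, combination) pair.
CREATE TABLE combinations (
    id       integer PRIMARY KEY,
    some_id1 integer NOT NULL,
    some_id2 integer NOT NULL,
    some_id3 integer NOT NULL
);

CREATE TABLE user_info (
    id            integer PRIMARY KEY,
    some_column_1 text,
    some_column_2 text
);

CREATE TABLE user_info_combinations (
    user_info_id   integer NOT NULL REFERENCES user_info (id),
    combination_id integer NOT NULL REFERENCES combinations (id),
    PRIMARY KEY (user_info_id, combination_id)
);

-- Array-column layout: an integer[] directly on user_info, typically
-- indexed with GIN and joined via ANY (or unnest):
-- ALTER TABLE user_info ADD COLUMN combination_ids integer[];
-- CREATE INDEX ON user_info USING gin (combination_ids);
-- SELECT u.id, c.*
-- FROM user_info u
-- JOIN combinations c ON c.id = ANY (u.combination_ids);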
I have a database table with a Value column that contains comma-delimited strings, and I need to order on the Value column with a SQL query:
Column ID   Value
1           ff, yy, bb, ii
2           kk, aa, ee
3           dd
4           cc, zz
If I use a simple query and order by Value ASC, it results in an order of Column ID: 4, 3, 1, 2, but the desired result is Column ID: 2, 1, 4, 3, since Column ID '2' contains aa, '1' contains bb, and so on.
By the same token, order by Value DESC results in an order of Column ID: 2, 1, 3, 4, but the desired result is Column ID: 4, 1, 2, 3.
My initial thought is to add two columns, 'Lowest Value' and 'Highest Value', so the query can order on either 'Lowest Value' or 'Highest Value' depending on the sort direction. But I am not quite sure how to find the highest and lowest value in each row and insert them into the appropriate columns. Or is there another solution that avoids those two additional columns and works within the SQL statement itself? I'm not that proficient with SQL queries, so thanks for your assistance.
The best solution is not to store a single comma-separated value at all. Instead, have a detail table (ValueTable in the query below) that can have multiple rows per ID, with one value each.
If you have the option of altering the data structure (and judging by your own suggestion of adding columns, it seems you do), I would choose that solution.
To sort by lowest value, you could then write a query similar to the one below.
select
    t.ID
from
    YourTable t
    left join ValueTable v on v.ID = t.ID
group by
    t.ID
order by
    min(v.Value)
But this structure also allows you to write other, more advanced queries. For instance, this structure makes it easier and more efficient to check if a row matches a specific value, because you don't have to parse the list of values every time, and separate values can be indexed better.
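A minimal sketch of that detail-table layout (the names are just illustrative, matching the query above):
CREATE TABLE YourTable (
    ID int PRIMARY KEY
    -- the other columns of your current table ...
);

CREATE TABLE ValueTable (
    ID    int NOT NULL REFERENCES YourTable (ID),
    Value varchar(50) NOT NULL
);

-- An index on (ID, Value) lets the min/max per ID be resolved efficiently.
CREATE INDEX IX_ValueTable_ID_Value ON ValueTable (ID, Value);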
String/array splitting (and creation, for that matter) is covered quite extensively. You might want to have a read of this article, one of the best out there comparing the popular methods. Once you have the values, the rest is easy.
http://sqlperformance.com/2012/07/t-sql-queries/split-strings
Funnily enough, I did something like this just the other week with a cross-applied table function to do some data cleansing, improving performance 8-fold over the looped version it replaced.
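For example, if you happen to be on SQL Server 2016 or later, the built-in STRING_SPLIT can be cross-applied to sort by the lowest value per row without restructuring the table (a sketch using the YourTable/Value names from the other answer; the linked article compares this approach with the alternatives):
-- Order rows by the smallest element of the comma-separated Value column.
SELECT t.ID
FROM YourTable t
CROSS APPLY STRING_SPLIT(t.Value, ',') AS s
GROUP BY t.ID
ORDER BY MIN(LTRIM(s.value));   -- use MAX(...) DESC for the descending case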
Has anyone generated a big, really big table (around a million records) using the DBGEN TPC-H tool? Someone recommended it but only gave me the URL.
The software is DBGEN; it is a C program that generates text files that can be imported into a DBMS.
I am only asking about issues or trouble you may have run into...
Or if you could tell me how to generate a 200 x 1,000,000 table using it, I would appreciate that...
If you run dbgen with no parameters, it'll generate 8 tables (made up of 150k Customers, 6 million Line Items, and 1.5 million Orders).
If you are after 200x that, then you need to run dbgen with an appropriate 'scale factor', so for 200x you'd use -s200:
-s -- scale factor. TPC-H runs are only compliant when run against SF's
of 1, 10, 100, 300, 1000, 3000, 10000, 30000, 100000