Need strategy for managing aggregated data during large database table creation

Imagine collecting all of the world's high-school students' grades each month into a single table and in each student's record, you're required to include the final averages for the subject across the student's class, city and country. This can be done in a post-process, but your boss says it has to be done during data collection.
Constraint: the rows are written to a flat file then bulk-inserted into the new table.
What would be a good strategy or design pattern for holding on to the several hundred thousand averages until the table is done, without adding excessive memory/processing overhead to the JVM or RDBMS? Any ideas will be helpful.
Note: Because the table is used as read-only, we add a clustered index to it on completion.

I'd tell my boss to stop micromanaging.
But seriously: sort the data by class, city, and then country. Then compute the running average for each by keeping a running total and count per class, city, and country. When you encounter a different class, write the class name and average to a file. Do the same for cities and countries, using a different file for each. Then you can open the sorted data file and the average files and insert rows into the database one by one.
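For reference, each of those average files holds what a simple GROUP BY aggregate would produce; a sketch in SQL, assuming a hypothetical staging_grades table with class_id, city_id and country_id columns:

-- Illustrative only: the contents of the class-averages file, expressed as SQL
-- over a hypothetical staging table of the raw collected grades.
SELECT class_id, AVG(grade) AS class_average
FROM staging_grades
GROUP BY class_id;
-- ...and likewise GROUP BY city_id and GROUP BY country_id for the other two files.

The streaming pass over the sorted file computes exactly these results in a single scan, without holding more than one group's total and count in memory at a time.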
If you want to use a framework that will handle all the writing to disk, I would look into using Hadoop for the processing.

How to move from Excel to designing a Data Warehouse Model

I just started in Data Warehouse modeling and I need help for the modeling of a problem.
Let me tell you the facts: I work on flight data (aeronautical data),
so I have two Excel (fact) files, linked together: one file 'order' and the other 'services'.
The 'order' file sets out a summary of each flight (orderId, departure date, arrival date, city of departure, city of arrival, total amount collected, etc.).
The 'services' file lists the services provided per flight (orderId, service name, quantity, amount/qty, etc.),
with a 1-n relationship (order-services): each order has n services.
I already see some dimensions (Time, Location, etc ...). However, I would like to know how I could design my Data Warehouse, knowing that I have two fact files linked together by orderId.
I thought about it, and the star and snowflake schemas do not work in my case (since I have two fact tables), and the galaxy schema requires dimensions in common. Where I get stuck is whether I should treat the order table as a dimension rather than a fact table, or instead treat the services table as a dimension, but both are fact tables. I am a little confused.
How can I design my model?
First of all, realize that in a star schema it is not a problem to have multiple fact tables that are connected - see the discussion here.
So the first draft will simply follow your two fact tables with their natively provided dimensions.
Order is in one context a fact table, and in another context a dimension table for the service table.
Depending on your expected queries, you may find it useful to denormalize some dimensions of the order table into the service table. The service fact will then have the departure date, arrival date, etc. dimensions defined directly.
This will be done at the load time in the ETL job.
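A sketch of that ETL step in SQL (all table and column names here are assumptions for illustration, not from the original post):

-- Hypothetical load step: copy the order-level dimension keys onto each
-- service row while populating the service fact table.
INSERT INTO fact_service (order_id, service_name, quantity, amount,
                          departure_date_key, arrival_date_key, departure_city_key)
SELECT s.order_id, s.service_name, s.quantity, s.amount,
       o.departure_date_key, o.arrival_date_key, o.departure_city_key
FROM stg_services s
JOIN stg_orders o ON o.order_id = s.order_id;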
I would be somewhat careful about denormalizing the measures from order to service - that would basically eliminate the whole order table.
There will be no problem with the measure total amount collected if this is a redundant sum of the service amounts - you may safely get rid of it.
But you will surely need the number of flights or the number of people transported - those measures are better defined in the order fact table; you cannot simply replicate them in the N rows for each service.
A workaround is possible, if you define a main service for each order and those measures are defined only in this row - in other rows the value is NULL. This could lead to unexpected results if queried naively, e.g. for number of flights per service.
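To make that pitfall concrete, a sketch with hypothetical names:

-- Safe: flight_count is stored only on each order's "main" service row (NULL elsewhere),
-- so a total at the order grain counts each flight exactly once.
SELECT SUM(flight_count) AS total_flights FROM fact_service;

-- Naive: services that are never the "main" one contribute only NULLs,
-- so "flights per service" comes out misleadingly low or empty.
SELECT service_name, SUM(flight_count) FROM fact_service GROUP BY service_name;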
So basically I'd start with the two fact tables and denormalize some dimensions to the services if this would help to optimize the queries.
I would start with one fact table of Services. This fact would include all of the dimensions you might associate with the Order, including a degenerate dimension of OrderId.
Once this fact is built out and some information products are consuming it, return to the Order and re-evaluate it to see if there are any reporting needs which are not being served, or questions which are difficult to answer with the Services fact.
Joining two facts together is always a bad idea. Performance is terrible. You are always better off bringing the dimensions from, in your case, Order to Services. Don't forget to include the context of the dimension in the column name, along with a corresponding role-playing dimension view for this context, e.g. OrderArrivalCity, OrderDepartureDate, OrderDepartureTime.
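A role-playing dimension view could look like this (a sketch, assuming a single generic DimCity dimension table with city_key and city_name columns):

-- One physical city dimension, exposed once per role it plays for the Services fact.
CREATE VIEW OrderArrivalCity AS
SELECT city_key  AS OrderArrivalCityKey,
       city_name AS OrderArrivalCityName
FROM DimCity;

CREATE VIEW OrderDepartureCity AS
SELECT city_key  AS OrderDepartureCityKey,
       city_name AS OrderDepartureCityName
FROM DimCity;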
You can also get yourself a copy of Ralph Kimball's The Data Warehouse Toolkit.

MS Access - Matching records without single identifier

I need to find a way to match records between two tables. The problem is that a single identifier that would make the match very simple isn't available, so I need to make the match based on some other available information in the records.
In an elementary school all registered/existing students have a Student ID. It is unique and makes a perfect primary key. However, any new students entering the school for the coming year do not get a Student ID until they are officially registered.
Before the next school year starts, the school invites the new incoming students to be part of a pre-registration assessment program to help determine their current level and needs for the coming school year. It is at this point that as much data as possible about each prospective student is gathered. This information is stored in a separate table from the main student information, mostly because there is no official Student ID. The idea is to merge the pre-registration students and their data into the main student information table(s) once they have an official Student ID assigned to them.
My thinking was to assign these new students a temporary ID just to have a unique identifier for them in case there are name duplications.
My question is: how can I match up the temporary IDs with the real IDs once the student is assigned one?
Some information that will be gathered in the pre-registration process includes Last Name, First Name, Middle Name and Grade, with Birthday being another possibility (but it isn't included at this time).
Maybe I’m going about this in the wrong way so any suggestions on offer would be greatly appreciated.
It sounds like you are exporting information from the main Student Information System, running additional processing in Microsoft Access, then ultimately merging it back into the main system. This being the case, you will have to work with the limitations in the export and merge features, and building your matching logic around what is available there.
Plan A: Ideally your Excel export would include some type of primary record identifier from the main system, independent of the Student ID that gets assigned later. (It very likely uses a unique ID internally, even if that is not included in the export file.) You would then use this to match to your records in Microsoft Access.
Plan B: If the primary system does not export a unique identifier, then you will need to come up with your best combination of data to uniquely identify the student. How you do this will depend on how many students you are dealing with, and whether the matched data changes in either system. Full name and birthdate is a fairly common way to do this, if that data is complete in the originating system.
With the unique identifier established, I would use two queries in Access. The first would be an update query to assign the Student ID to your Access system as soon as it becomes available in the main system. (Search for matching students that have a Student ID in Excel, but not yet in Access.)
The second query would be an append query to add the new students from the main system into Access. (Where the student in Excel does not match any existing student in Microsoft Access.)
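In Access SQL the two queries could look roughly like this (a sketch for Plan B, matching on full name and birthdate; tblStudents, tblImport and their columns are hypothetical names for your Access table and the Excel import, and ID is the internal AutoNumber key):

-- Query 1 (update): copy the newly assigned Student ID into Access.
UPDATE tblStudents AS s
INNER JOIN tblImport AS i
  ON s.LastName = i.LastName
 AND s.FirstName = i.FirstName
 AND s.Birthdate = i.Birthdate
SET s.StudentID = i.StudentID
WHERE s.StudentID IS NULL AND i.StudentID IS NOT NULL;

-- Query 2 (append): add imported students that match nothing in Access.
INSERT INTO tblStudents (LastName, FirstName, Birthdate, StudentID)
SELECT i.LastName, i.FirstName, i.Birthdate, i.StudentID
FROM tblImport AS i
LEFT JOIN tblStudents AS s
  ON s.LastName = i.LastName
 AND s.FirstName = i.FirstName
 AND s.Birthdate = i.Birthdate
WHERE s.ID IS NULL;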
Taking this approach, you would pull the Excel export regularly from the main system and run the above queries to keep your Access system updated. Then when you are ready to merge information back into the main system, you could filter on students in Access that have a Student ID assigned. The actual update of data in the main system might be done through an update query, or perhaps an export from Access that includes the Student ID. (Depending on how your main system merges the incoming data.)
The way I would approach this is to merge both tables into a single table of students. This table would have an AutoNumber ID column that refers to the student or prospective student. Then you would have another column in this table for the StudentID which would be assigned at a later point.
Your forms and reports can then filter the data based on the StudentID field to show you either current or prospective students.
Taking this approach means your student data gets entered into one place, and you don't have to worry about trying to repeat information or merge it later. Since a single record represents a single individual, it makes logical sense to me to use a single table.
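In Access DDL, such a merged table might look like this (the columns are illustrative):

CREATE TABLE Students (
    ID         AUTOINCREMENT PRIMARY KEY,  -- internal AutoNumber key for every person
    StudentID  TEXT(20),                   -- stays NULL until the official ID is assigned
    LastName   TEXT(50),
    FirstName  TEXT(50),
    MiddleName TEXT(50),
    Grade      TEXT(10)
);

Current students are then simply the rows WHERE StudentID IS NOT NULL, and prospective students the rows WHERE StudentID IS NULL.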

How are Long ids used in Google Datastore insert/update queries?

Our product is using Google Datastore as the application database. Most of the entities use IDs of type Long and some of type String. I noticed that the IDs of type Long are not in consecutive order.
Now we are exporting some big tables, with around 30-40 million entries, to JSON files for some business purposes. Initially we expected that a simple query like "ofy().load().type(ENTITY.class).startAt(cursor).limit(BATCH_LIMIT).iterator()" would help us iterate through the entire content of that specific table, starting from the first entry and ending with the most recently created one. We are working in batches and storing the cursor after every batch, so that the next task can load the batch and resume.
But after noticing that an entity created some minutes ago can have an ID smaller than the ID of another entity created 1 week ago, we are wondering if we should consider a content freeze during this export period. On one hand it's critical to make a good export and not to miss older data up to a specific date, on the other hand a content freeze longer than 1 day is a problem for our customers.
What do you advise us to do?
Thanks,
Cristian.
I do not think you need to worry about the uniqueness of your IDs. Datastore is built on top of Bigtable with six tables:
the first table stores entities
the second stores entities by kind
the third stores indexes for the property values in ascending order
the fourth stores indexes for the property values in descending order
the fifth stores indexes for multiple properties together
the sixth keeps track of the next unique ID for each kind
The key format is something like this:
[application ID]-[namespace]-[Kind]-[ID]
This guarantees the uniqueness of each entity.
Yes, the format in that table is [Application ID]-[Kind Name] and the value is the next ID. Let's say you have a kind products; that table will look like this: |key(yourapp-products), Next ID(3)|. Now when you create a new entity of kind products, it will be assigned ID(3) and the row in that table will get the new value |key(yourapp-products), Next ID(4)|. Also note that the table has only one row, since we have only one kind, products.
Do you specify the ID yourself or let Datastore generate it? It sounds like you have a "pre-allocating IDs" issue. Just speculating, but for every batch you may need something like Kind.allocate_ids(size=blah); that way you can keep the sequence.

How can I store an indefinite amount of stuff in a field of my database table?

Here's a simple version of the website I'm designing: users can belong to one or more groups, as many groups as they want. When they log in they are presented with the groups they belong to. Ideally, in my Users table I'd like an array or something unbounded to which I can keep adding the IDs of the groups that user joins.
Additionally, although I realize this isn't necessary, I might want a column in my Group table which holds an indefinite number of user IDs belonging to that group. (Side question: would that be more efficient than getting all the users of the group by querying the user table for users belonging to a certain group ID?)
Does my question make sense? Mainly I want to be able to fill a column up with an indefinite list of IDs... The only way I can think of is making it like some super long varchar and having the list JSON encoded in there or something, but ewww
Please and thanks
Oh, and it's a MySQL database (my website is in PHP), but after 2 years of PHP development I've recently decided PHP sucks and I hate it, and ASP.NET web applications are the only way for me, so I guess I'll be implementing this on whatever kind of database I'll need for that.
Your intuition is correct; you don't want to have one column of unbounded length just to hold the user's groups. Instead, create a table such as user_group_membership with the columns:
user_id
group_id
A single user_id could have multiple rows, each with the same user_id but a different group_id. You would represent membership in multiple groups by adding multiple rows to this table.
What you have here is a many-to-many relationship. A "many-to-many" relationship is represented by a third, joining table that contains both primary keys of the related entities. You might also hear this called a bridge table, a junction table, or an associative entity.
You have the following relationships:
A User belongs to many Groups
A Group can have many Users
In database design, this might be represented with a UserGroup junction table sitting between User and Group.
This way, a UserGroup represents any combination of a User and a Group without the problem of having "infinite columns."
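A minimal sketch in MySQL, using the user_group_membership table named above and assuming users and groups tables each keyed by an id column:

CREATE TABLE user_group_membership (
    user_id  INT NOT NULL,
    group_id INT NOT NULL,
    PRIMARY KEY (user_id, group_id),              -- one row per membership, no duplicates
    FOREIGN KEY (user_id)  REFERENCES users(id),
    FOREIGN KEY (group_id) REFERENCES `groups`(id)  -- backticked: GROUPS is reserved in MySQL 8
);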
If you store an indefinite amount of data in one field, your design does not conform to First Normal Form (FNF). FNF is the first step in a design discipline called data normalization. Data normalization is a major aspect of database design. Normalized design is usually good design, although there are some situations where a different design pattern might be better suited.
If your data is not in FNF, you will end up doing sequential scans for some queries where a normalized database would be accessed via a quick lookup. For a table with a billion rows, this could mean delaying an hour rather than a few seconds. FNF guarantees a direct access lookup path for each item of data.
As other responders have indicated, such a design will involve more than one table, to be joined at retrieval time. Joining takes some time, but it's tiny compared to the time wasted in sequential scans, if the data volume is large.
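For example, with the junction table sketched above, fetching one user's groups is an indexed lookup rather than a scan (same hypothetical names):

SELECT g.id, g.name
FROM `groups` AS g
JOIN user_group_membership AS m ON m.group_id = g.id
WHERE m.user_id = 42;   -- resolved via the (user_id, group_id) primary key, not a table scan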

How to design a street parking database?

I am trying to design a database which contains data about street parking. A parking spot has GPS coordinates, time restrictions by day, day-of-week rules (some days are permitted, others restricted), and free or paid status. In the end, I need to run queries that can find parking by these criteria.
As a first rough sketch I tried something like this:
Parking
-------
parkingId
Lat
Long
Days (1234567)
Time -- already here comes trouble
But it's not normalized and will quickly bloat the database. How do I design this data in the best way?
Update: For now I have two approaches.
The first one is:
using restriction tables with many-to-many links (this is an example for days and months). But the queries will be complicated, and I don't know how to link time with day.
The second approach is:
using one restriction table with a Type field that will have priority. But this solution is also not normalized.
Just to be clear, this is the data I have:
ParkingId, Coords, String Description (NO PARKING 11:30AM TO 1PM THURS)
And I want to show user where he can find street parking by area, time and day.
Thanks to all for your help and time.
This seems like a difficult task. Just a few thoughts.
Are you only concerned with street parking? Parking houses have multiple floors so GPS coordinates won't work unless you stay on the streets.
What is the accuracy of the coordinates? Would it be easier to identify each parking space individually by some other standard, like unique identifiers of the painted parking squares? (But what happens if people don't park into the squares? Or the GPS coordinates' accuracy fails/is not exact enough because of illegal parking? Do you intend to keep records of the parking tickets too?)
Some thought for the tables or information you need to take into account:
time: opening hours, days
price: maybe a different price for different time intervals?
exceptions: holidays, maintenance (maybe not so important, you could just make parking space status active/inactive)
parking slot: id (GPS/random id), status
The three or four tables above could be linked by an intermediate table which reveals the properties of a parking space for every possible parking time (like a prototype for all possible combinations). That information could be linked to another table where you keep records of actual parking events (so you can, for example, keep records of people who have or have not paid their bills if you need to).
There is a lot of stuff that affects your implementation, so you really need to list all the rules of the parking space (and event?). The database structure can be done (and redone) later, after you have an understanding of the properties of the events you need to keep records of. And that's the key to everything: understanding what you need to do, so you can design and create the implementation. If the implementation (application) doesn't work, change the implementation. If the design is faulty, redesign. If you don't understand the whole process (what you really need), everything you do is bound to fail. (Unless you are incredibly lucky, but I wouldn't count on luck...)
Try using two tables with an intersection entity between them.
Table parking will have parking_id, lat and long columns. Table restrictions will have all the types of restrictions that you have in your scenario, with something like restriction_id, restriction_day, restriction_time and restriction_status, and maybe restriction_type.
Then you can link the two tables with foreign key constraints in the intersection entity.
For example: a row links a parking_id with a restriction_id.
This way a parking can have more than one restriction and a restriction can be applied to more than one parking.
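A sketch of the three tables (column types, and the split of restriction_time, are assumptions; lng is used because LONG is a reserved word in some engines):

CREATE TABLE parking (
    parking_id INT PRIMARY KEY,
    lat DECIMAL(9,6),
    lng DECIMAL(9,6)
);

CREATE TABLE restrictions (
    restriction_id     INT PRIMARY KEY,
    restriction_day    TINYINT,      -- 1..7, day of week
    restriction_time   VARCHAR(20),  -- e.g. '11:30-13:00'; could be split into start/end TIME columns
    restriction_status VARCHAR(10),  -- e.g. 'free' / 'paid'
    restriction_type   VARCHAR(20)
);

CREATE TABLE parking_restrictions (  -- the intersection entity
    parking_id     INT NOT NULL,
    restriction_id INT NOT NULL,
    PRIMARY KEY (parking_id, restriction_id),
    FOREIGN KEY (parking_id) REFERENCES parking(parking_id),
    FOREIGN KEY (restriction_id) REFERENCES restrictions(restriction_id)
);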
As you seem to have heard of normalization, and following the comment from Damien, you should use different tables to represent different things.
You should then think about how to link those tables together, and in the process define the type of relationship between the two. It could be one-to-one (this is the one where you could be tempted to put everything in the same table, but a simple foreign key in a linked table is cleaner), one-to-many (this is where the trouble would begin if you put everything in one table, because now there will be several lines in the linked table with the same foreign key, and if everything were in the same table you'd have to multiply the fields in that table), or many-to-many (where you would need to add a table only to make the link between the two other tables, with two foreign key fields pointing to records in both tables).
For example, in your case, a Parking table could hold the parking name, coordinates, etc.
A second table TimeTable could hold the opening days/times for each parking, with a foreign key to parkingId (making it a one-to-many relationship: 1 parking can have many opening frames). The fields of this table could for example be DayOfWeek (a number indicating the day), openingTime and closingTime. This would allow you to define several timeframes on the same day, or a single one (if it's always open, for example), giving in this case 7 records in this table for this parking (=> one-to-many relationship).
You could then imagine a third table Price where you put data concerning the price of that parking (probably a one-to-many too, with records for hourly rates/long stay/...), and so on depending on the needs and the different "objects" you would need to represent.
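A rough sketch of that layout, plus the kind of query the original poster wants to run (types and names are assumptions):

CREATE TABLE Parking (
    parkingId INT PRIMARY KEY,
    name VARCHAR(100),
    lat  DECIMAL(9,6),
    lng  DECIMAL(9,6)
);

CREATE TABLE TimeTable (
    parkingId   INT NOT NULL,   -- one-to-many: one parking, many opening frames
    DayOfWeek   TINYINT,        -- 1..7, assuming 1 = Monday
    openingTime TIME,
    closingTime TIME,
    FOREIGN KEY (parkingId) REFERENCES Parking(parkingId)
);

-- "Where can I park on a Thursday at 12:00?"
SELECT p.*
FROM Parking AS p
JOIN TimeTable AS t ON t.parkingId = p.parkingId
WHERE t.DayOfWeek = 4
  AND '12:00:00' BETWEEN t.openingTime AND t.closingTime;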
Please note these are only rough examples. Database design can sometimes be very tricky and it's a matter I'm not a specialist in, but I think this advice can help you go further; come back with another question if you get stuck.
Good luck!
