Hash Tables or BST? - c

Currently I have a problem that I'm trying to figure out, but I'm not sure if my answers are correct.
You have 1 million records. In these records you will frequently need to search by
two criteria: employee ID and salary (but not by both at the same time).
You have the following constraints:
each record is very large and because of that you can only keep one copy of this data.
Your program needs to be reasonably fast. Simply scanning through all the items for each search would be too slow.
What data structure would you use?
My Answer?
I would use a hash table because the worst-case time would be O(1000000) = O(1).
How will you retrieve the record when you search by ID?
How will you retrieve the record when you search by salary?

I'd expect many collision issues for a hash-table based on salary, but one for an ID could work with no collisions quite easily using a little cryptographic theory. It seems odd to want to search by salary rather than sort or get some range, which could be performed much more easily on a BST.
The short of it though is that if you want to search by two independent properties you're going to have to maintain two structures. Fortunately pointers exist, so you don't have to keep multiple copies. Personally I'd keep a hash table of IDs to references, then a BST of salaries to references, but if I'm restricted to one datatype I'd have to do a BST with nodes like this:
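struct Node {
    int id;
    struct Node *idLessThan;        /* left child in the tree ordered by id */
    struct Node *idGreaterThan;     /* right child in the tree ordered by id */
    int salary;
    struct Node *salaryLessThan;    /* left child in the tree ordered by salary */
    struct Node *salaryGreaterThan; /* right child in the tree ordered by salary */
    Data fileInfo;                  /* the single copy of the large record */
};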
Creating essentially two BSTs over the same node set.
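To make the "two BSTs over the same node set" idea concrete, here is a minimal Python sketch. The names (`DualIndex`, `find_by_id`, `find_by_salary`) are made up for illustration, and the trees are not balanced, so this only shows the shape of the structure, not a production implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One record, threaded into two binary search trees at once."""
    emp_id: int
    salary: int
    data: object                       # the single copy of the large record
    id_left: Optional["Node"] = None   # children in the tree ordered by emp_id
    id_right: Optional["Node"] = None
    sal_left: Optional["Node"] = None  # children in the tree ordered by salary
    sal_right: Optional["Node"] = None


class DualIndex:
    def __init__(self) -> None:
        self.id_root: Optional[Node] = None
        self.sal_root: Optional[Node] = None

    def insert(self, emp_id: int, salary: int, data: object) -> None:
        node = Node(emp_id, salary, data)
        self.id_root = self._link(self.id_root, node, "emp_id", "id_left", "id_right")
        self.sal_root = self._link(self.sal_root, node, "salary", "sal_left", "sal_right")

    @staticmethod
    def _link(root, node, key, left, right):
        """Plain (unbalanced) BST insert; equal keys go to the right."""
        if root is None:
            return node
        cur = root
        while True:
            side = left if getattr(node, key) < getattr(cur, key) else right
            child = getattr(cur, side)
            if child is None:
                setattr(cur, side, node)
                return root
            cur = child

    def find_by_id(self, emp_id: int) -> Optional[Node]:
        cur = self.id_root
        while cur is not None and cur.emp_id != emp_id:
            cur = cur.id_left if emp_id < cur.emp_id else cur.id_right
        return cur

    def find_by_salary(self, salary: int) -> List[Node]:
        """Salaries are not unique, so collect every match on the search path."""
        found, cur = [], self.sal_root
        while cur is not None:
            if cur.salary == salary:
                found.append(cur)
            cur = cur.sal_left if salary < cur.salary else cur.sal_right
        return found
```

Both lookups return the same node objects, so the large record is stored once. With 1 million records you would want a self-balancing tree (or a library ordered map) so that lookups stay logarithmic.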

Related

MS Access - Matching records without single identifier

I need to find a way to match records between two tables. The problem is a single identifier that would make the match very simple isn’t available so I need to find a way to make that match based on some other available information in the records.
In an elementary school all registered/existing students have a Student ID. It is unique and makes a perfect primary key. However, any new students entering the school for the coming year do not get a Student ID until they are officially registered.
Before the next school year starts the school invites the new incoming students to be part of a pre-registration assessment program to help determine their current level and needs for the coming school year. It is at this point that as much data about each prospective student is gathered. This information is stored in a separate table from the main student information, mostly because there is no official Student ID. The idea is to merge the pre-registration students and their data into the main student information table(s) once they have an official Student ID assigned to them.
My thinking was to assign these new students a temporary ID just to have a unique identifier for them in case there are name duplications.
My question is: how can I match up the temporary IDs with the real IDs once a student is assigned one?
Some information that will be gathered in the pre-registration process will include Last Name, First Name, Middle Name, Grade, with Birthday being another possibility (but isn’t included at this time).
Maybe I’m going about this in the wrong way so any suggestions on offer would be greatly appreciated.
It sounds like you are exporting information from the main Student Information System, running additional processing in Microsoft Access, then ultimately merging it back into the main system. This being the case, you will have to work within the limitations of the export and merge features, and build your matching logic around what is available there.
Plan A: Ideally your Excel export would include some type of primary record identifier from the main system, independent of the Student ID that gets assigned later. (It very likely uses a unique ID internally, even if that is not included in the export file.) You would then use this to match to your records in Microsoft Access.
Plan B: If the primary system does not export a unique identifier, then you will need to come up with your best combination of data to uniquely identify the student. How you do this will depend on how many students you are dealing with, and whether the matched data changes in either system. Full name and birthdate is a fairly common way to do this, if that data is complete in the originating system.
With the unique identifier established, I would use two queries in Access. The first would be an update query to assign the Student ID to your Access system as soon as it becomes available in the main system. (Search for matching students that have a Student ID in Excel, but not yet in Access.)
The second query would be an append query to add the new students from the main system into Access. (Where the student in Excel does not match any existing student in Microsoft Access.)
Taking this approach, you would pull the Excel export regularly from the main system and run the above queries to keep your Access system updated. Then when you are ready to merge information back into the main system, you could filter on students in Access that have a Student ID assigned. The actual update of data in the main system might be done through an update query, or perhaps an export from Access that includes the Student ID. (Depending on how your main system merges the incoming data.)
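As a rough sketch of the two queries, here is the same logic run against SQLite from Python. The table names (`main_export`, `prereg`), the columns, and the sample rows are all invented for illustration, the match key is the "full name and birthdate" combination from Plan B, and Access SQL syntax differs in places, so treat this as the shape of the queries rather than something to paste into Access.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE main_export (   -- stand-in for the imported Excel export
        last_name TEXT, first_name TEXT, birthdate TEXT, student_id TEXT
    );
    CREATE TABLE prereg (        -- stand-in for the local Access table
        temp_id INTEGER PRIMARY KEY AUTOINCREMENT,
        last_name TEXT, first_name TEXT, birthdate TEXT, student_id TEXT
    );
    INSERT INTO main_export VALUES
        ('Lee',  'Ana', '2015-03-02', 'S100'),
        ('Diaz', 'Tom', '2015-07-19', NULL),
        ('Khan', 'Mia', '2015-01-11', 'S101');
    INSERT INTO prereg (last_name, first_name, birthdate) VALUES
        ('Lee',  'Ana', '2015-03-02'),
        ('Diaz', 'Tom', '2015-07-19');
""")

# Query 1 (update): copy the Student ID across once the main system has one.
con.execute("""
    UPDATE prereg
    SET student_id = (
        SELECT m.student_id FROM main_export AS m
        WHERE m.last_name  = prereg.last_name
          AND m.first_name = prereg.first_name
          AND m.birthdate  = prereg.birthdate
          AND m.student_id IS NOT NULL)
    WHERE student_id IS NULL
""")

# Query 2 (append): add students that exist in the export but not locally.
con.execute("""
    INSERT INTO prereg (last_name, first_name, birthdate, student_id)
    SELECT m.last_name, m.first_name, m.birthdate, m.student_id
    FROM main_export AS m
    WHERE NOT EXISTS (
        SELECT 1 FROM prereg AS p
        WHERE p.last_name  = m.last_name
          AND p.first_name = m.first_name
          AND p.birthdate  = m.birthdate)
""")

rows = con.execute(
    "SELECT last_name, student_id FROM prereg ORDER BY temp_id").fetchall()
```

After both queries, the matched student has picked up a Student ID, the one not yet registered still has none, and the student who only existed in the export has been appended.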
The way I would approach this is to merge both tables into a single table of students. This table would have an AutoNumber ID column that refers to the student or prospective student. Then you would have another column in this table for the StudentID which would be assigned at a later point.
Your forms and reports can then filter the data based on the StudentID field to show you either current or prospective students.
Taking this approach means your student data gets entered into one place, and you don't have to worry about trying to repeat information or merge it later. Since a single record represents a single individual, it makes logical sense to me to use a single table.

Is there any DB server which could support following operations?

I need to store a list of strings as a field along with an ID: listId, <list>.
I need the following operations to run in O(1) time:
Removing a given string from an existing listId.
Adding a new string in an existing listId.
Is there any DB which could support the above operations? Having HashSet as one of its datatypes would help. Note that I need a highly scalable solution where the lists could hold 10 million keys across 1000+ listIds.
I understand that such datatype if exists in any database would have considerable indexing overhead. I believe that chances are really slim for something similar to exist. If not, then I would implement something myself.
What you describe sounds like a textbook case for normalization.
You'd have two tables: one that contains the lists, and another that contains the list elements.
They are linked through the list ID:
Lists table:
    id | name | (+ whatever else you need)
List elements table:
    id | listId (references an id in the lists table) | (+ whatever else you need)
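A minimal sketch of that two-table layout, using SQLite from Python. The names (`lists`, `list_elements`, `add_string`, `remove_string`) are illustrative, and the composite primary key replacing a separate element id is my own assumption. Note that an indexed insert or delete is typically O(log n) on a B-tree index rather than strictly O(1).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
    CREATE TABLE lists (
        id   INTEGER PRIMARY KEY,
        name TEXT
    );
    CREATE TABLE list_elements (
        list_id INTEGER NOT NULL REFERENCES lists(id),
        value   TEXT    NOT NULL,
        PRIMARY KEY (list_id, value)   -- also serves as the lookup index
    );
""")


def add_string(list_id, value):
    """Add a string to a list; adding it twice is a no-op (set semantics)."""
    con.execute("INSERT OR IGNORE INTO list_elements VALUES (?, ?)",
                (list_id, value))


def remove_string(list_id, value):
    """Remove one string from one list via the primary-key index."""
    con.execute("DELETE FROM list_elements WHERE list_id = ? AND value = ?",
                (list_id, value))


def members(list_id):
    return [v for (v,) in con.execute(
        "SELECT value FROM list_elements WHERE list_id = ? ORDER BY value",
        (list_id,))]


con.execute("INSERT INTO lists VALUES (1, 'demo')")
```

Calling `add_string(1, 'a')` and later `remove_string(1, 'a')` each touch a single row, regardless of how many other lists exist.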

Using b trees in a database

I have to implement a database using B-trees for a school project. The database stores audio files (songs), and a number of different queries can be made, such as asking for all the songs of a given artist or of a specific album.
The intuitive idea is to use one B-tree for each field (songs, albums, artists, ...). The problem is that one can be asked to delete any member of any field, and if you delete an artist you have to delete all of that artist's albums and songs from the other B-trees, keeping in mind that, for example, the songs of a given artist don't have to be near each other in the B-tree that corresponds to songs.
My question is: is there a way to do so (delete the songs after an artist has been deleted) without having to iterate over all elements of the other B-trees? I'm not looking for code, just ideas, because all the ones I've come up with are brute-force ones.
This is my understanding and may not be entirely right.
Typically in a database implementation, B-trees are used for indexes, so unless you want to force your user to index every column, defaulting to a B-tree for each field is unnecessary. Although this many indexes will lead to fast reads in virtually every case (with an index on everything, you won't have to do a full table scan), it will also make inserts, updates, and deletes much slower, as the corresponding data has to be updated in each tree. As I'm sure you know, modern databases typically have at least one index per table (the primary key), so you will have at least one B-tree keyed on the primary key, with a pointer to the appropriate node.
Every node in a B Tree index should have a pointer/reference to the full object it represents.
Any further indexes would include the attributes you specify in the index, such as song name, artist, etc., but would still contain the pointer/reference to the corresponding node. Thus when you modify, let's say, the song title, you will want to modify the referenced node that all the indexes point to. If any index includes the modified attribute in its key, you will also have to update the values in that index itself.
Unfortunately, I believe you are correct that you will have to brute-force your way through the other B-trees when deleting or updating; this is one of the downsides of using a lot of indexes (slower updates and deletes). If you just delete the referenced nodes, you will likely end up with pointers to deleted objects, which will (depending on your language) give you some form of NullPointerException. To prevent this, the references will have to be removed from all the trees.
Keep in mind though that doing a full scan of your indexes will still be much better than doing full table scans.
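One possible way to limit the scanning, sketched in Python under some loud assumptions: plain dicts stand in for the B-trees, the class and field names (`SongStore`, `by_artist`, `by_album`) are invented, and every index stores record ids that point at a single shared record. Because the shared record carries its own indexed attributes, the entry to remove from each other index can be found by key lookup instead of by walking the whole tree.

```python
from collections import defaultdict


class SongStore:
    """Dicts stand in for B-trees; each index maps a key to record ids."""

    def __init__(self):
        self.records = {}                   # record id -> full record (one copy)
        self.by_artist = defaultdict(set)   # index: artist -> record ids
        self.by_album = defaultdict(set)    # index: album  -> record ids
        self._next_id = 1

    def add(self, title, artist, album):
        rid = self._next_id
        self._next_id += 1
        self.records[rid] = {"title": title, "artist": artist, "album": album}
        self.by_artist[artist].add(rid)
        self.by_album[album].add(rid)
        return rid

    def delete_artist(self, artist):
        # The artist index hands back exactly the affected records.
        for rid in self.by_artist.pop(artist, set()):
            record = self.records.pop(rid)
            # The record itself says which key to look up in the other index,
            # so the stale reference is removed without scanning that index.
            ids = self.by_album[record["album"]]
            ids.discard(rid)
            if not ids:
                del self.by_album[record["album"]]
```

The cost of a delete is then proportional to the number of songs by that artist times the number of indexes, which is still the "many indexes slow down deletes" trade-off described above, just without a full pass over each tree.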

Finding a suitable data structure for deletion from both lists

This might be deleted, since it involves idea sharing, which is not quite allowed on Stack Overflow, but before that happens, if I could get any ideas from solid programmers, it would be a win for me.
Assume that you have a class Student, stored in the database, and this class has a list property called favoriteTeachers. This list constantly gets updated by the system and involves the id of teachers.
You also have a class Teacher, also stored in database and likewise has a list property favouriteStudents. It is again updated constantly and involves the id's of students.
In our system, when a student calls a function (let's say notMyFavoriteTeacher), our system has to apply the changes below;
Delete the given teacher's id from favouriteTeacher list
Delete the student's id from given teacher's favouriteStudent list
I was concerned that the number of rows updated could exhaust the database, so instead of mapping students to their favorite teachers in a separate table as user_id, teacher_id, I created a column and stored a string containing the teachers' ids separated by commas (e.g. "1,2,14,4,25"). The same applies to the teacher as well.
However, when we call this function we face another problem. To perform the operation, you need to convert the string to a list, find the element by linear search, delete it, then convert the list back to a string and push it back to the database. And you have to do the same for the teacher class as well. Without the string method, deletion would be easier, but since we would be handling deletion and addition operations about 2k times a day, I did not think it would be feasible to use separate tables.
I wanted to ask: to decrease the number of operations, could a data structure be chosen that would increase the efficiency?
Storing a relation as an array in a single column is a violation of first normal form, and should not be done without good reason. Although various forms of denormalization may result in increased efficiency in some cases, I don't see this case being one of those. What's worse, you'll get no help from the database in enforcing referential integrity. And some operations will result in guaranteed row scans: when deleting a teacher, you will have to examine every row of every student to remove the teacher from each student's favorite list. The same goes for deleting a student.
Relational databases are designed and built to link rows to other rows. You need a very good reason to keep them from doing what they're designed to do. You should go ahead and design a proper relational schema, and only if actual measurement shows that it is too slow should you worry about its performance.
First of all, I don't understand your choice of storing ids of favorite teachers/students as comma separated strings, because either in the case of comma separated values or in case of a table with studentId, teacherId structure, you do exactly 2 row updates/deletes (first in the favoriteTeachers table, second in the favoriteStudent table).
But one way of optimizing performance given your current data structure would be to keep the comma-separated strings sorted. That is, from the moment the rows are created, keep the comma-separated ids in order, like "1, 5, 7, 15". This way, once you convert the string to a list, you can find an id by binary search in O(log n) time instead of O(n), although parsing the string and removing the element from the list are still linear.
You are losing all the benefits provided by any RDBMS by storing it as a list of strings. Create a separate table with Student_id and favorite teacher_id. Apply filtering conditions (either for student or for teacher) before joining it to main tables.
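Here is a small sketch of the separate-table design, using SQLite from Python. It assumes the two "favorite" lists describe one symmetric relation, and the table and function names (`favorite`, `not_my_favorite_teacher`) are illustrative only. With a junction table, the whole `notMyFavoriteTeacher` operation becomes a single indexed row delete.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY);
    CREATE TABLE teacher (id INTEGER PRIMARY KEY);
    CREATE TABLE favorite (
        student_id INTEGER NOT NULL REFERENCES student(id),
        teacher_id INTEGER NOT NULL REFERENCES teacher(id),
        PRIMARY KEY (student_id, teacher_id)
    );
    -- Second index so lookups from the teacher side are also fast.
    CREATE INDEX favorite_by_teacher ON favorite (teacher_id);
    INSERT INTO student VALUES (1), (2);
    INSERT INTO teacher VALUES (10), (20);
    INSERT INTO favorite VALUES (1, 10), (1, 20), (2, 10);
""")


def not_my_favorite_teacher(student_id, teacher_id):
    """One row delete updates both 'lists' at once."""
    con.execute(
        "DELETE FROM favorite WHERE student_id = ? AND teacher_id = ?",
        (student_id, teacher_id))


def favorite_teachers(student_id):
    return [t for (t,) in con.execute(
        "SELECT teacher_id FROM favorite WHERE student_id = ? ORDER BY teacher_id",
        (student_id,))]


def favorite_students(teacher_id):
    return [s for (s,) in con.execute(
        "SELECT student_id FROM favorite WHERE teacher_id = ? ORDER BY student_id",
        (teacher_id,))]
```

At a couple of thousand operations a day, single-row deletes like this are well within what any relational database handles comfortably.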

Does it Make Sense to Map a Graph Data-structure into a Relational Database?

Specifically a Multigraph.
Some colleague suggested this and I'm completely baffled.
Any insights on this?
It's pretty straightforward to store a graph in a database: you have a table for nodes, and a table for edges, which acts as a many-to-many relationship table between the nodes table and itself. Like this:
create table node (
id integer primary key
);
create table edge (
start_id integer references node,
end_id integer references node,
primary key (start_id, end_id)
);
However, there are a couple of sticky points about storing a graph this way.
Firstly, the edges in this scheme are naturally directed - the start and end are distinct. If your edges are undirected, then you will either have to be careful in writing queries, or store two entries in the table for each edge, one in either direction (and then be careful writing queries!). If you store a single edge, I would suggest normalising the stored form - perhaps always consider the node with the lowest ID to be the start (and add a check constraint to the table to enforce this). You could have a genuinely unordered representation by not having the edges refer to the nodes, but rather having a join table between them, but that doesn't seem like a great idea to me.
Secondly, the schema above has no way to represent a multigraph. You can extend it easily enough to do so; if edges between a given pair of nodes are indistinguishable, the simplest thing would be to add a count to each edge row, saying how many edges there are between the referred-to nodes. If they are distinguishable, then you will need to add something to the edge table to allow them to be distinguished - an autogenerated edge ID might be the simplest thing.
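As a sketch of the "autogenerated edge ID" variant, here is the schema run against SQLite from Python. The index name and the sample rows are made up for illustration. The (start_id, end_id) pair is no longer the primary key, so parallel edges between the same two nodes can coexist.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE node (
        id INTEGER PRIMARY KEY
    );
    CREATE TABLE edge (
        id       INTEGER PRIMARY KEY,               -- autogenerated edge ID
        start_id INTEGER NOT NULL REFERENCES node(id),
        end_id   INTEGER NOT NULL REFERENCES node(id)
    );
    CREATE INDEX edge_by_start ON edge (start_id);
    INSERT INTO node VALUES (1), (2);
    -- Two distinguishable edges 1 -> 2, plus one edge 2 -> 1.
    INSERT INTO edge (start_id, end_id) VALUES (1, 2), (1, 2), (2, 1);
""")

parallel = con.execute(
    "SELECT COUNT(*) FROM edge WHERE start_id = 1 AND end_id = 2"
).fetchone()[0]
edge_ids = [i for (i,) in con.execute("SELECT id FROM edge ORDER BY id")]
```

Any per-edge attributes (a label, a weight) would become extra columns on the `edge` table.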
However, even having sorted out the storage, you have the problem of working with the graph. If you want to do all of your processing on objects in memory, and the database is purely for storage, then no problem. But if you want to do queries on the graph in the database, then you'll have to figure out how to do them in SQL, which doesn't have any inbuilt support for graphs, and whose basic operations aren't easily adapted to work with graphs. It can be done, especially if you have a database with recursive SQL support (PostgreSQL, Firebird, some of the proprietary databases), but it takes some thought. If you want to do this, my suggestion would be to post further questions about the specific queries.
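For a flavour of what a graph query in recursive SQL looks like, here is a reachability query against an edge table, run in SQLite from Python (SQLite also supports `WITH RECURSIVE`). The function name `reachable_from` and the sample edges are invented for this sketch.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE edge (start_id INTEGER, end_id INTEGER);
    -- A 3-cycle (1 -> 2 -> 3 -> 1) and a separate edge 4 -> 5.
    INSERT INTO edge VALUES (1, 2), (2, 3), (3, 1), (4, 5);
""")


def reachable_from(start):
    """Return every node reachable from `start` by following directed edges."""
    rows = con.execute("""
        WITH RECURSIVE reach(id) AS (
            SELECT ?
            UNION      -- UNION (not UNION ALL) drops repeats, so cycles terminate
            SELECT e.end_id
            FROM edge AS e JOIN reach AS r ON e.start_id = r.id
        )
        SELECT id FROM reach ORDER BY id
    """, (start,))
    return [r[0] for r in rows]
```

Shortest paths, weights, or path reconstruction need more columns in the recursive part, which is where the "it takes some thought" warning applies.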
It's an acceptable approach. You need to consider how that information will be manipulated. More than likely you'll need a language separate from your database to do the kinds of graph-related computations this type of data implies. Skiena's Algorithm Design Manual has an extensive section on graph data structures and their manipulation.
Without considering what types of queries you might execute, start with two tables: vertices and edges. Vertices are simple: an identifier and a name. Edges are more complex given the multigraph. Edges should be uniquely identified by a combination of two vertices (i.e. foreign keys) and some additional information. The additional information depends on the problem you're solving. For instance, for flight information, it would be the departure and arrival times and the airline. Furthermore, you'll need to decide whether the edge is directed (i.e. one way) or not, and keep track of that information as well.
Depending on the computation you may end up with a problem that's better solved with some sort of artificial intelligence / machine learning algorithm. For instance, optimal flights. The book Programming Collective Intelligence has some useful algorithms for this purpose. But where the data is kept doesn't change the algorithm itself.
Well, the information has to be stored somewhere, a relational database isn't a bad idea.
It would just be a many-to-many relationship: a table of nodes and a table of edges/connections.
Consider how Facebook might implement the social graph in their database. They might have a table for people and another table for friendships. The friendships table has at least two columns, each being foreign keys to the table of people.
Since friendship is symmetric (on Facebook) they might ensure that the ID for the first foreign key is always less than the ID for the second foreign key. Twitter has a directed graph for its social network, so it wouldn't use a canonical representation like that.
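The canonical ordering can be enforced by the database itself. The sketch below uses SQLite from Python; the table and function names (`friendship`, `befriend`, `are_friends`) are hypothetical and say nothing about how Facebook actually stores its graph.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY);
    CREATE TABLE friendship (
        low_id  INTEGER NOT NULL REFERENCES person(id),
        high_id INTEGER NOT NULL REFERENCES person(id),
        PRIMARY KEY (low_id, high_id),
        CHECK (low_id < high_id)      -- one canonical row per undirected edge
    );
    INSERT INTO person VALUES (1), (2), (3);
""")


def befriend(a, b):
    """Store the pair smallest-id first; repeating a friendship is a no-op."""
    con.execute("INSERT OR IGNORE INTO friendship VALUES (?, ?)",
                (min(a, b), max(a, b)))


def are_friends(a, b):
    row = con.execute(
        "SELECT 1 FROM friendship WHERE low_id = ? AND high_id = ?",
        (min(a, b), max(a, b))).fetchone()
    return row is not None
```

Every query has to normalise the pair the same way (`min`, `max`) before touching the table, which is the "be careful writing queries" cost of storing only one row per edge.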

Resources