I am working on a research project where we are trying to build a machine learning model for disease prediction. For this we have data from hospitals around the country. The problem is that each hospital has its own database model (RDBMS, NoSQL) and its own column names for patient records. Are there any solutions for integrating all this data so that we can build a machine learning model on top of it?
I'm working on creating a better way to view the history of items held at the public library where I work. I have been using a combination of the built-in functions of SirsiDynix (our library software), Excel, and AutoHotkey to extract, manipulate, and display the data. I am currently stuck on designing a way to view the change in status of an item over time, since the system as it stands only shows the last transaction. For example, if I have the following item:
0000519227318 005.54 WAL 101 EXCEL 2013 TIPS, TRICKS & TIMESAVERS Walkenbach, John, author WE-WH 2013 7 7/13/2013 6/29/2015 35
I can tell you it was created on 7/13/2013, last checked out on 6/29/2015, and has been checked out a total of 7 times. But I am unable to tell you anything about the length of those checkouts, or when they occurred, or if the book had been missing for a year in the middle of that time period.
With AutoHotkey and the SirsiDynix Director's Station I have been able to create "daily snapshot" CSV files that indicate where an item is every day. However, I am having trouble figuring out how to consolidate that information. Originally I was planning to simply append an additional column to the record every day, so that after the general item information you would have a series of numbers listing the changing location. The AHK code I have for this is somewhat slow, and I'm still working out how best to display the result in Excel regardless. It occurred to me, however, that there may be a much better way to handle this that could fully automate the process.
So I'm asking for suggestions for either a simple database system to use or an improvement to my current method. The primary query I plan to run is simple: type in an item number and have a chart display the status of that item over time, ideally with something that also shows whenever the total number of checkouts has increased. I have been looking at stock market charts as examples, but since most people asking about those want open/close/high/low values, the answers they get seem beyond what I need. Additional queries, such as items with the longest time on the shelf relative to total time, would be useful although not initially required.
Any help as to what direction I may want to go would be appreciated. I have a basic understanding of AutoHotkey and Excel, and I briefly used MySQL several years ago, so I have a general feel for how a database can be used.
I'm not too familiar with your specific software or AutoHotkey, but for an efficient, secure, and scalable solution, consider any relational database management system (RDBMS), whether a server-level enterprise system (some open source, some proprietary) such as Oracle, SQL Server, DB2, PostgreSQL, or MySQL, or a file-level database such as SQLite or MS Access. One main thing is to move away from the concept of flat-file spreadsheets and applications. Excel is simply not a database and should only be used as an end-use document for reporting or graphics/analytics on retrieved database content.
With a relational database you can maintain data across normalized, related tables linked together by primary and foreign keys. Essentially, you want to build a Library Management System, which can comprise the following tables (a rough DDL sketch follows the list):
Items - unique list of items: ISBN, Catalog, Title, Author, Publisher, Category (Fiction, Nonfiction, Reference, Media); Primary Key: ItemID (autonumber/auto-increment)
Stock - physical copies of Items: Condition, Missing/Damaged Status, Cost, and Inventory Quantity; one-to-many relationship with the Items table; Primary Key: StockID, Foreign Key: ItemID
Checkouts - full history of checkout records, including Stock item, CheckoutDate, CheckinDate, Notes; one-to-many relationship with Stock; Primary Key: CheckoutID, Foreign Key: StockID
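A minimal DDL sketch of that schema (MySQL syntax, since you mentioned having used MySQL; the column types, the Status column, and the sample item number are illustrative assumptions, not prescriptions):

    -- Minimal sketch of the three-table schema (MySQL syntax).
    -- Column types and the Status column are illustrative assumptions.
    CREATE TABLE Items (
        ItemID    INT AUTO_INCREMENT PRIMARY KEY,
        ISBN      VARCHAR(13),
        Catalog   VARCHAR(30),
        Title     VARCHAR(255) NOT NULL,
        Author    VARCHAR(255),
        Publisher VARCHAR(255),
        Category  VARCHAR(50)            -- Fiction, Nonfiction, Reference, Media
    );

    CREATE TABLE Stock (
        StockID       INT AUTO_INCREMENT PRIMARY KEY,
        ItemID        INT NOT NULL,
        ItemCondition VARCHAR(50),       -- condition of this physical copy
        Status        VARCHAR(20),       -- e.g. On Shelf, Checked Out, Missing
        Cost          DECIMAL(8,2),
        FOREIGN KEY (ItemID) REFERENCES Items(ItemID)
    );

    CREATE TABLE Checkouts (
        CheckoutID   INT AUTO_INCREMENT PRIMARY KEY,
        StockID      INT NOT NULL,
        CheckoutDate DATE NOT NULL,
        CheckinDate  DATE,               -- NULL while still checked out
        Notes        TEXT,
        FOREIGN KEY (StockID) REFERENCES Stock(StockID)
    );

    -- Your "history of one item" question then reduces to a simple join:
    SELECT i.Title, c.CheckoutDate, c.CheckinDate,
           DATEDIFF(c.CheckinDate, c.CheckoutDate) AS DaysOut
    FROM Checkouts c
    JOIN Stock s ON s.StockID = c.StockID
    JOIN Items i ON i.ItemID = s.ItemID
    WHERE s.StockID = 12345              -- hypothetical copy ID
    ORDER BY c.CheckoutDate;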
Now with this schema, you can better manage each item throughout its life cycle (newly arrived, checked out, retired/discarded/basement-stored) with easy queries for real-time reporting. Additionally, you can use any general-purpose language (Java, C#, PHP, Python, Perl, VB) that can connect to any of the aforementioned RDBMSs to build the interface or tool for this backend system. A host of free consoles are also available, including:
phpMyAdmin (PHP/MySQL)
pgAdmin (PostgreSQL)
SQLite Manager (Firefox add-in/SQLite)
Management Studio (SQL Server)
MS Access (Jet/ACE engine and MS Office). It is a common misnomer to call MS Access a database; it is actually a GUI that connects by default to a Windows technology, the Jet/ACE SQL engine, and that default can be swapped out for any of the aforementioned RDBMSs.
Here, Excel can connect via ODBC/OLEDB in VBA for reporting on checked-out item status, history, and current shelf stock. Depending on the RDBMS, you can even build triggers so that as soon as an item is checked out, a record is added to the Checkouts table (see the sketch below), or you can code that into your tool or scripts. Finally, output to txt, csv, xml, or pdf reports, email attachments to co-workers, the board, etc. can be integrated.
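For example, a minimal MySQL sketch of such a trigger, assuming the Status column from the schema sketch above:

    -- Hypothetical trigger: when a copy's status flips to 'Checked Out',
    -- a Checkouts history record is opened automatically.
    DELIMITER //
    CREATE TRIGGER trg_stock_checkout
    AFTER UPDATE ON Stock
    FOR EACH ROW
    BEGIN
        IF NEW.Status = 'Checked Out' AND OLD.Status <> 'Checked Out' THEN
            INSERT INTO Checkouts (StockID, CheckoutDate)
            VALUES (NEW.StockID, CURDATE());
        END IF;
    END//
    DELIMITER ;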
I'm running a classifieds website that has ads and comments on it. Traffic has grown to a considerable amount, and the number of ads in the system has reached over 1.5 million, of which nearly 250K are active.
Now the problem is that the system has been designed to be very dynamic in terms of the categories of ads and the properties each kind of ad can have based on its category or subcategory, so to display an ad I have to join nearly 4 to 5 tables.
To solve this issue I have created a flat table (conceptually, what I call a publishing table) and populate it with a SQL job every 3 to 4 minutes (a simplified sketch of the job step follows). For web requests I then query that table to show ad listings or details.
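Conceptually, the job step does something like this (a simplified T-SQL sketch; the table and column names here are placeholders, not my real schema):

    -- Simplified sketch of the publishing job step: rebuild the flat table
    -- from the normalized tables in one pass. All names are placeholders.
    TRUNCATE TABLE dbo.AdListing;

    INSERT INTO dbo.AdListing (AdId, Title, CategoryName, Price, PostedOn)
    SELECT a.AdId, a.Title, c.Name, p.Price, a.PostedOn
    FROM dbo.Ads AS a
    JOIN dbo.Categories AS c ON c.CategoryId = a.CategoryId
    JOIN dbo.AdProperties AS p ON p.AdId = a.AdId
    WHERE a.IsActive = 1;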
I have also implemented a data cache of around 1 minute for each unique URL combination, both for ad listings and for each ad's details.
I do the same thing for comments on ads (i.e. I cache the comments, and since the comments are hierarchical, I use the same flat-table publishing model for them too, again populated by a SQL job).
My questions are as follows:
Is the publishing model with a background SQL job a good design approach?
What approach would you, or others, take for scenarios like this?
How does a website like Facebook show comments in real time to millions of users while making sure no comment data is lost, if the comments are kept only in a cache and written back in batch updates?
Starting at the end:
3. How does a website like Facebook show comments in real time to millions of users while making sure no comment data is lost, if the comments are kept only in a cache and written back in batch updates?
Three things:
Smarter programming than yours. They can put a large team on solving this problem for months.
Indifference. They really don't care too much about a cache being a little outdated. No one will really notice.
Hardware ;) Many more, and more powerful, servers than yours.
That said, your approach sounds sensible.
I want to create a system that stores books (and some other documents). Users will be able to log into the system, where they can either see a list of all books or enter a search string and get a list of the books containing it. My problem is that I don't know how I should go about storing the books. The books obviously have to be searchable, and the search needs to return the book's ID, name, and preferably the page; anything more, like the text surrounding the search term, would be a nice extra.
Some facts that might help you help me get the best answer.
The database does not have to be free. If SQL Server or an Oracle DB will help me, then I'm all for that.
There will be about ~100 books (2-600 pages each).
There will be about ~1000 documents (10-50 pages each).
Adding books and documents will be a slow process that will happen infrequently so any type of re-indexing of tables does not need to be fast.
I have not decided how to search the documents, but I do need my search results to be ranked by relevance somehow. This might become the subject of another question in the future.
Do not use an RDBMS. RDBMSs are good for storing relational data, and what you are trying to store is a set of documents. Use a document store like CouchDB or MongoDB. However, since you also have to search this data, it is better to index it in Lucene, which is built for exactly such needs.
Provided you don't intend to search the entire text of the book (perhaps consider initial processing to store a serialized hash of unique words?):
SQL Server 2008 R2 has a FILESTREAM feature which will enforce relational integrity using the DB engine but will maintain the files in the file system.
It's the "best of both worlds", and you won't have to worry about how DB backup plans affect your BLOBs (a minimal table sketch follows the link below).
http://msdn.microsoft.com/en-us/library/cc949109(v=sql.100).aspx
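A minimal table sketch (this assumes the database already has a FILESTREAM filegroup enabled; all names are illustrative):

    -- Hypothetical FILESTREAM table: Content is stored in the file system
    -- but participates in transactions and backups like a normal column.
    -- FILESTREAM requires a UNIQUEIDENTIFIER ROWGUIDCOL with a UNIQUE constraint.
    CREATE TABLE dbo.BookFiles (
        BookFileId UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
        BookId     INT NOT NULL,
        FileName   NVARCHAR(260) NOT NULL,
        Content    VARBINARY(MAX) FILESTREAM NULL
    );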
SharePoint Foundation 2010 or 2013 could be a perfect solution for you, and it is free to use. You can store a large number of documents in different document libraries, add and edit their metadata, and search them by metadata such as Title, Author, etc., and even by the text content inside the book.
I am working on a product (an ASP.NET website) developed for educational institutions. There are around 20 educational institutions on my site, and the academic year start and end dates vary for each of them. There are a huge number of records in the database for attendance and results.
Now I need to show all previous years' data (like attendance, results, etc.) whenever a student or teacher logs in. There are also some reports which compare student performance across academic years.
Now my problem is how to maintain that huge amount of data.
I wanted to go with two databases: one for the current academic year, and another for all previous years.
But my current-year DB schema may change for enhancements, so whenever I move the current year's data to the archive database, it creates problems for me. Please suggest a good way to implement this.
Thanks,
seshu.
Have you thought about table partitioning? It allows you to move data rapidly through sliding windows, so at the start of a new year you slide last year's details into an archive partition (a rough sketch follows the MSDN link below). You need to check which SQL Server edition you have to see whether partitioning is enabled.
MSDN details:
http://msdn.microsoft.com/en-us/library/ms345146(SQL.90).aspx
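A rough sketch of the idea in T-SQL; the names, dates, and archive step here are illustrative, and partitioning is an Enterprise edition feature:

    -- Partition the big table by academic-year boundary dates.
    CREATE PARTITION FUNCTION pfAcademicYear (DATETIME)
    AS RANGE RIGHT FOR VALUES ('2007-06-01', '2008-06-01', '2009-06-01');

    CREATE PARTITION SCHEME psAcademicYear
    AS PARTITION pfAcademicYear ALL TO ([PRIMARY]);

    CREATE TABLE dbo.Attendance (
        StudentId      INT      NOT NULL,
        AttendanceDate DATETIME NOT NULL,
        Present        BIT      NOT NULL
    ) ON psAcademicYear (AttendanceDate);

    -- At year end, switch the oldest partition into an archive table with the
    -- same structure on the same filegroup; this is a metadata-only operation:
    -- ALTER TABLE dbo.Attendance SWITCH PARTITION 1 TO dbo.AttendanceArchive;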
If you want to keep two databases in sync, schema-wise, there are plenty of tools available for that: here is mine, here is Red Gate's, and here is Apex's. There are many more available, including one that comes with Visual Studio Team System Database Edition (if you have that already; if you don't, one of the tools I mentioned will be a lot cheaper).