Train a classifier directly from a MySQL database - database

I recently started working as a data scientist doing ML. My question is as follows: is it possible to train an algorithm directly from a MySQL database, and is it similar to the way you train it from a CSV file? Moreover, I would like to know about working with a very unbalanced dataset: when you use, for instance, 20 percent of the data for testing, does the split keep the proportion of negative and positive cases equal in the training and testing sets? Can anyone suggest a good tutorial or documentation?

Sure, you can train your model directly from the database; this is what happens all around in production systems. Your software should be designed so that it does not matter whether your data source is SQL, CSV, or something else. Since you don't mention the programming language, it is hard to say exactly how to do it, but in Python you can take a look here: How do I connect to a MySQL Database in Python?
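As a minimal sketch in Python (assuming pandas, SQLAlchemy, and the PyMySQL driver are installed; the connection string, table, and column names below are placeholders, not from the question), the only thing that changes compared to a CSV workflow is the loading step:
```python
# Read training data straight from MySQL into a pandas DataFrame.
import pandas as pd
from sqlalchemy import create_engine

# "mysql+pymysql" assumes the PyMySQL driver; user/password/db are placeholders
engine = create_engine("mysql+pymysql://user:password@localhost/mydb")

# From here on the workflow is identical to loading a CSV with pd.read_csv
df = pd.read_sql("SELECT * FROM observations", engine)

X = df.drop(columns=["label"])
y = df["label"]
```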
If your data set is unbalanced, as it often is in reality, you can use class weights to make your classifier aware of that; e.g., in Keras/scikit-learn you can just pass the class_weight parameter. Be aware that if your data set is small or heavily skewed, you can run into problems with default measures like accuracy. Better take a look at the confusion matrix or other metrics like the Matthews correlation coefficient.
Another good reference:
How does the class_weight parameter in scikit-learn work?
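Regarding the second part of the question: whether the 80/20 split preserves the class proportions depends on how you split; in scikit-learn it does not unless you ask for stratification. A minimal sketch, assuming X and y were loaded as in the snippet above:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, matthews_corrcoef
from sklearn.model_selection import train_test_split

# stratify=y keeps the positive/negative proportions the same in the train and
# test sets; without it the 80/20 split is purely random.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight="balanced" re-weights samples inversely to class frequency
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

# Evaluate with imbalance-aware metrics rather than plain accuracy
pred = clf.predict(X_test)
print(confusion_matrix(y_test, pred))
print(matthews_corrcoef(y_test, pred))
```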

Related

Ground Truth datasets for Evaluating Open Source NLP tools for Named Entity Recognition

I am working on building a document similarity graph for a collection. I already do all the basic things like tokenization, stemming, stop-word removal, and a bag-of-words representation of the documents, and I compute similarity using the Jaccard coefficient. I am now trying to extract named entities and evaluate whether they would help improve the quality of the document similarity graph. I have been spending much of my time finding ground-truth datasets for my analysis. I have been very disappointed with the Message Understanding Conference (MUC) datasets: they are cryptic to understand and require a fair amount of data cleaning/massaging before they can be used on a different platform (like Scala).
More specifically, my questions are:
Are there tutorials on getting started with the MUC datasets that would make it easier to analyze the results with open-source NLP tools like OpenNLP?
Are there other datasets available?
Tools like OpenNLP and Stanford Core NLP employ approaches that are essentially supervised. Correct?
GATE is a great tool for hand-annotating your own text corpus. Correct?
For a new test dataset (that I hand-create), how can I compute the baseline (Vocabulary Transfer), or what kinds of metrics can I compute?
First of all, I have a few concerns about using the Jaccard coefficient to compute similarity. I'd expect TF-IDF with cosine similarity to give better results.
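As a concrete starting point, here is a minimal sketch of that alternative with scikit-learn; the toy documents are placeholders for your collection:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "the quick brown fox",
    "a quick brown dog",
    "completely unrelated text",
]

tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(documents)   # docs x terms, sparse TF-IDF weights
sim = cosine_similarity(matrix)           # docs x docs pairwise similarity
print(sim)
```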
Some answers to your questions:
See the CoNLL 2003 evaluation campaign: it also provides data, evaluation tools, etc. You may also have a look at ACE.
Yes
GATE is also a pipeline that automatically annotates text, but as far as I know its NER is a rule-based component.
A baseline is most of the time a very simple algorithm (e.g. majority class), so it is not a baseline for comparing corpora, but for comparing approaches.

What is the most effective method for handling large-scale dynamic data for a recommendation system?

We are planning a recommendation system based on large-scale data and are looking for a professional way to keep a dynamic DB structure that is fast to work with. We are considering a few alternative approaches. One is to keep the data in a normal SQL database, but that would be slower compared to using a plain file structure. The second is to use a NoSQL graph database, but that is also not compatible with the algorithms we use, since we continuously pull all the data into a matrix. The final approach we are considering is to keep the data in plain files, but then it is harder to track and watch changes, since there is no query method or editor. Each method has its pros and cons. What would your choice be, and why?
I'm not sure why you mention "files" and "file structure" so many times, so maybe I'm missing something, but for efficient data processing you obviously don't want to store things in files. It is expensive to read/write data to disk, and it's hard to find an efficient, flexible way to query files in a file system.
I suppose I'd start with a product that already does recommendations:
http://mahout.apache.org/
You can pick from various algorithms to run on your data for producing recommendations.
If you want to do it yourself, maybe a hybrid approach would work? You could still use a graph database to represent relationships, but then each node/vertex could be a pointer to a document database or a relational database where a more "full" representation of the data would exist.
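To make the "pull all the data into a matrix" step you describe concrete, here is a minimal sketch that reads interaction rows from whatever SQL backend you choose into a sparse user-item matrix; the table and column names are placeholders:
```python
import pandas as pd
from scipy.sparse import csr_matrix
from sqlalchemy import create_engine

engine = create_engine("sqlite:///ratings.db")  # could be MySQL, Postgres, ...
df = pd.read_sql("SELECT user_id, item_id, rating FROM ratings", engine)

users = df["user_id"].astype("category")
items = df["item_id"].astype("category")

# Rows = users, columns = items; most entries are empty, so keep it sparse
matrix = csr_matrix(
    (df["rating"].to_numpy(),
     (users.cat.codes.to_numpy(), items.cat.codes.to_numpy())),
    shape=(users.cat.categories.size, items.cat.categories.size),
)
```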

Is there a lightweight database system that supports interpolation and gaps in time-series data?

I need to implement a system for storing timestamped sensor data from several devices on an embedded platform. According to this related question, a relational database is the preferred solution for storing that kind of data, and I have therefore been looking at SQLite.
However, I also need the database to be able to answer questions such as "What was the indoor temperature on Sep 12 at 13:15", even if the sensor data was not recorded precisely at that time. In other words, I need the database to be able to handle interpolation. As far as I could tell, SQLite cannot handle this, nor can the usual suspects (MySQL, PostgreSQL).
In addition, I would also need the database to be able to detect gaps in the data.
This related question seems to deal with mainframe-ish databases and not with embedded ones.
Therefore: Is there a database system suitable for embedded platforms that supports the "usual" operations one might want to perform on time-series data?
You shouldn't want or expect a database to do interpolation for you. Just pull out the nearest values to your desired time and write your own interpolation. Only you know what type of interpolation is appropriate: maybe simple linear interpolation across two points, maybe a higher-order polynomial across more points. It really depends on your system and data modeling situation.
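As a minimal sketch of that approach, assuming a hypothetical SQLite table readings(ts, value) with Unix-second timestamps (table and column names are made up for illustration):
```python
import sqlite3

def temperature_at(conn, ts):
    # Nearest reading at or before the requested time...
    before = conn.execute(
        "SELECT ts, value FROM readings WHERE ts <= ? ORDER BY ts DESC LIMIT 1", (ts,)
    ).fetchone()
    # ...and the nearest reading after it.
    after = conn.execute(
        "SELECT ts, value FROM readings WHERE ts > ? ORDER BY ts ASC LIMIT 1", (ts,)
    ).fetchone()
    if before and after:
        (t0, v0), (t1, v1) = before, after
        # Simple linear interpolation between the two surrounding samples
        return v0 + (v1 - v0) * (ts - t0) / (t1 - t0)
    # Fall back to whichever neighbour exists, or None if the table is empty
    return (before or after or (None, None))[1]
```
Gap detection can be handled the same way in application code, e.g. by flagging any pair of consecutive timestamps further apart than your expected sampling interval.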
Consider specialized time-series databases.
Try:
RRDTool (simple utility, may be sufficient for you)
OpenTSDB
InfluxDB
Given your use case, it may also be worth taking a totally different approach: use an "IoT"-targeted data store optimized for simple inserts (Xively, Phant.io) and then post-process for time-series analysis.

Stored Procedure or Code

I am not asking for opinions but rather for documentation.
We have a lot of data files (XML, CSV, plain text, etc...) and need to process them and data mine them.
The lead database person suggested using stored procedures to accomplish the task. Basically, we have a staging table where each file gets serialized and saved into a CLOB or XML column. From there, he suggested further using stored procedures to process the files.
I'm an application developer with a DB background, though more so in application development, and I might be biased, but using this logic in the DB seems like a bad idea, and I am unable to find any documentation to prove or disprove what I refer to as putting a car on a train track to pull a load of freight.
So my questions are:
How well does the DB (Oracle, DB2, MySQL, SQL Server) perform when it comes to regular-expression search, search and replace of data in a CLOB, DOM traversal, and recursion, in comparison to a programming language such as Java, PHP, or C# for the same tasks?
Edit
So what I am looking for is documentation on comparisons / runtime analyses of a particular programming language versus a DBMS, in particular for string search and replace, regular-expression search and replace, XML DOM traversal, and memory usage of recursive method calls; and in particular how well they scale when faced with tens to hundreds of GB of data.
It sounds like you are going to put business logic into the storage layer. For operations like the ones you describe, you should not use the database. You may end up trying to find workarounds for showstoppers or creating quirky solutions because of the inflexibility.
Also keep maintainability in mind. How many people will later be able to maintain the solution?
Speaking about speed: by choosing the right programming language you will be able to process data in multiple threads. In the end, your feeling about the car on the train track is right ;)
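As a rough illustration of that point, here is a minimal sketch in Python (the same idea applies in Java or C#) that runs a regex search-and-replace over a directory of exported files with one worker process per core; the directory, pattern, and replacement are placeholders:
```python
import re
from multiprocessing import Pool
from pathlib import Path

PATTERN = re.compile(r"\bfoo\b")  # placeholder pattern

def process_file(path):
    # Read, transform, and write one file; each call runs in its own process
    text = Path(path).read_text(encoding="utf-8")
    cleaned = PATTERN.sub("bar", text)
    Path(path).with_suffix(".out").write_text(cleaned, encoding="utf-8")
    return path

if __name__ == "__main__":
    files = list(Path("staging").glob("*.xml"))
    with Pool() as pool:          # one worker per CPU core by default
        pool.map(process_file, files)
```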
It is better to pull the processing logic out of the data layer; profiling your implementation inside the database will be difficult.
If the implementation is done in a programming language, you get the freedom to choose between libraries and compare their performance.
Moreover, you can use frameworks like Spring Batch (for Java) to process bulk volumes of data as batch jobs.

Database recommendation

I'm writing a CAD (Computer-Aided Design) application. I'll need to ship a library of 3d objects with this product. These are simple objects made up of nothing more than 3d coordinates and there are going to be no more than about 300 of them.
I'm considering using a relational database for this purpose, but given my simple needs I don't want anything complicated. So far, I'm leaning towards SQLite: it's small, runs within the client process, and is claimed to be fast. Besides, I'm a poor guy and it's free.
But before I commit myself to SQLite, I just wish to ask your opinion whether it is a good choice given my requirements. Also is there any equivalent alternative that I should try as well before making a decision?
Edit:
I failed to mention earlier that the above-said CAD objects that I'll ship are not going to be immutable. I expect the user to edit them (change dimensions, colors etc.) and save back to the library. I also expect users to add their own newly-created objects. Kindly consider this in your answers.
(Thanks for the answers so far.)
The real thing to consider is what your program does with the data. Relational databases are designed to handle complex relationships between sets of data. However, they're not designed to perform complex calculations.
Also, the amount of data and relative simplicity of it suggests to me that you could simply use a flat file to store the coordinates and read them into memory when needed. This way you can design your data structures to more closely reflect how you're going to be using this data, rather than how you're going to store it.
Many languages provide a mechanism to write data structures to a file and read them back in again called serialization. Python's pickle is one such library, and I'm sure you can find one for whatever language you use. Basically, just design your classes or data structures as dictated by how they're used by your program and use one of these serialization libraries to populate the instances of that class or data structure.
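For instance, a minimal pickle sketch, with a made-up Shape3D class standing in for whatever object model your application actually uses (the .NET equivalent is just as straightforward):
```python
import pickle
from dataclasses import dataclass, field

@dataclass
class Shape3D:
    name: str
    color: str
    vertices: list = field(default_factory=list)  # list of (x, y, z) tuples

library = [
    Shape3D("unit_cube", "grey", [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]),
]

# Save the whole library to one file...
with open("library.pkl", "wb") as f:
    pickle.dump(library, f)

# ...and load it back when the application starts.
with open("library.pkl", "rb") as f:
    library = pickle.load(f)
```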
edit: The requirement that the structures be mutable doesn't really affect much with regard to my answer - I still think that serialization and deserialization are the best solution to this problem. The fact that users need to be able to modify and save the structures necessitates a bit of planning to ensure that the files are updated completely and correctly, but ultimately I think you'll end up spending less time and effort with this approach than trying to marshal SQLite or another embedded database into doing this job for you.
The only case in which a database would be better is if you have a system where multiple users are interacting with and updating a central data repository, and for a case like that you'd be looking at a database server like MySQL, PostgreSQL, or SQL Server for both speed and concurrency.
You also commented that you're going to be using C# as your language. .NET has support for serialization built in so you should be good to go.
I suggest you consider using H2; it's really lightweight and fast.
When you say you'll have a library of 300 3D objects, I'll assume you mean objects for your code, not models that users will create.
I've read that object databases are well suited to help with CAD problems, because they're perfect for chasing down long reference chains that are characteristic of complex models. Perhaps something like db4o would be useful in your context.
How many objects are you shipping? Can you define each of these objects and their coordinates in an XML file, i.e. basically use a distinct XML file for each object? You could place these XML files in a directory. This can be a simple structure.
I would not use a SQL database. You can easily describe every 3D object with an XML file. Put these files in a directory and pack (zip) them all. If you need easy access to the objects' metadata, you can generate an index file (containing only the name or description) so that not all objects must be parsed and loaded into memory (nice if you have something like a library manager).
There are quick and easy SAX parsers available, and you can easily write an XML writer (or find some free code you can use for this).
Many similar applications use XML today. It's easy to parse/write, human-readable, and doesn't need much space when zipped.
I have used SQLite; it's easy to use and easy to integrate with your own objects. But I would prefer a SQL database like SQLite more for applications where you need good search tools over a huge number of data records.
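As a small illustration of the one-file-per-object XML idea (shown in Python's standard library here; the element and attribute names are made up, and the same structure is easy to read and write from .NET):
```python
import xml.etree.ElementTree as ET

# Write one object to its own file
obj = ET.Element("object", name="unit_cube", color="grey")
for x, y, z in [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]:
    ET.SubElement(obj, "vertex", x=str(x), y=str(y), z=str(z))
ET.ElementTree(obj).write("unit_cube.xml")

# Read it back into a list of coordinate tuples
root = ET.parse("unit_cube.xml").getroot()
vertices = [(float(v.get("x")), float(v.get("y")), float(v.get("z")))
            for v in root.findall("vertex")]
```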
For the specific requirement, i.e. to provide a library of objects shipped with the application, a database system is probably not the right answer.
The first thing that springs to mind is that you probably want the file to be updatable, i.e. you need to be able to drop an updated file into the application without changing the rest of the application.
The second thing is that the data you're shipping is immutable - for this purpose, therefore, you don't need the capabilities of a relational DB, just the ability to access a particular model with adequate efficiency.
For simplicity (sort of) an XML file would do nicely as you've got good structure. Using that as a basis you can then choose to compress it, encrypt it, embed it as a resource in an assembly (if one were playing in .NET) etc, etc.
Obviously, SQLite stores its data in a single file per database, so if you have other reasons to need the capabilities of a DB in your storage system then yes; but I'd want to think about the utility of the DB to the app as a whole first.
SQL Server CE is free, has a small footprint (no service running), and is SQL Server compatible
