I recently received an assignment in my Java programming class to analyze a dataset (a very small one, I'd guess). I really enjoyed the assignment and the use of a 'tokenizer' etc., which was a new concept to me. The dataset we got to work with was pretty boring, as it only contained dates.
What I'm looking for is:
Public datasets (XML, txt or similar) to practice analysis on
It can be anything, really (preferably pretty simple), as I'm mainly trying to print out statistics, patterns and graphs.
Try the Stack Overflow data dump.
The UC Irvine Machine Learning Repository is a great resource for this kind of thing.
In addition to the raw data dump mentioned by nos, see its companion, the Stack Exchange Data Explorer. There, you can run a SQL query and download the resultset. (Useful if you're looking for something smaller than everything.)
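For example, once you've downloaded a resultset as CSV, a few lines of Python are enough to print some simple frequency statistics (the file name and column name below are just placeholders for whatever your query returns):

    import csv
    from collections import Counter

    # Placeholder name for a CSV exported from the Data Explorer.
    with open("QueryResults.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    # Count how often each value of a (made-up) "Tags" column appears.
    counts = Counter(row["Tags"] for row in rows)

    print("total rows:", len(rows))
    for value, count in counts.most_common(10):
        print(f"{count:6d}  {value}")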
I thought I'd ask before I try to build something from scratch.
Here is the type of problem I need to answer. One of our researchers comes to me and says "How many people in our data have such-and-such SNP genotyped?"
Our genetics data consists of several dozen GWAS files, typically flat, delimited files. Each GWAS file has between 100,000 and 1,000,000 SNPs. There is some overlap in the SNPs, but less than I'd originally thought.
Anyway, what I want is some sort of structured database that links our participant IDs to a particular GWAS study, links each GWAS study to its list of SNPs, and lets me write a query that pulls all IDs that have the data. At no point do I need individual-level genotype data; it is far easier to pull the SNPs/samples I need once I know where they are.
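To make that concrete, here is a minimal sketch in Python/SQLite of roughly the structure I have in mind (the table and column names are just placeholders, not an existing schema):

    import sqlite3

    con = sqlite3.connect("gwas_index.db")
    con.executescript("""
        -- which participants were genotyped in which GWAS study
        CREATE TABLE IF NOT EXISTS study_participants (
            study_id       TEXT,
            participant_id TEXT,
            PRIMARY KEY (study_id, participant_id)
        );
        -- which SNPs each study covers (no individual-level genotypes)
        CREATE TABLE IF NOT EXISTS study_snps (
            study_id TEXT,
            snp_id   TEXT,
            PRIMARY KEY (study_id, snp_id)
        );
    """)

    # "How many people in our data have rs123456 genotyped?"
    (n,) = con.execute("""
        SELECT COUNT(DISTINCT p.participant_id)
        FROM study_participants AS p
        JOIN study_snps AS s ON s.study_id = p.study_id
        WHERE s.snp_id = ?
    """, ("rs123456",)).fetchone()
    print(n)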
So that is my problem and what I'm looking for. For anyone who works with a lot of GWAS data, I'm sure you're familiar with the problem. Is there anything (free or paid) that is built for this type of problem? Or do you have thoughts on what direction I might want to go if I need to build this myself?
Thanks.
After doing some research, I was amazed by the power of Prolog to express queries in a very simple way, almost like telling the machine verbally what to do. This happened because I've become really bored with Propel and PHP at work.
So, I've been wondering if there is a way to translate database table rows (Postgres, for example) into Prolog facts. That way, I could stop writing so many boring joins and using an ORM, and instead write something like this to get what I want:
mantenedora_ies(ID_MANTENEDORA, ID_IES) :-
    papel_pessoa(ID_PAPEL_MANTENEDORA, ID_MANTENEDORA, 1),
    papel_pessoa(ID_PAPEL_IES, ID_IES, 6),
    relacionamento_pessoa(_, ID_PAPEL_IES, ID_PAPEL_MANTENEDORA, 3).
To see why I've become bored, look at this post. The code there would be replaced by the simple lines above, which are much easier to read and understand. I'm just asking out of curiosity, since it would be impossible to actually replace things around here.
It would also be cool if something like that could be done in PHP. Does anyone know something like that?
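Just to make the idea concrete, here is a naive sketch of what I mean by "turning rows into facts", in Python with psycopg2 rather than PHP (the connection string, table and column names are only placeholders from my schema):

    import psycopg2

    # Placeholder connection details and columns; adjust to your own schema.
    con = psycopg2.connect("dbname=mydb user=me")
    cur = con.cursor()
    cur.execute("SELECT id_papel, id_pessoa, tipo FROM papel_pessoa")

    # Dump every row as a Prolog fact: papel_pessoa(IdPapel, IdPessoa, Tipo).
    with open("facts.pl", "w") as out:
        for id_papel, id_pessoa, tipo in cur.fetchall():
            out.write(f"papel_pessoa({id_papel}, {id_pessoa}, {tipo}).\n")

    # facts.pl can then be consulted from Prolog alongside rules such as
    # mantenedora_ies/2 above. Obviously this only works for tables small
    # enough to dump wholesale.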
Check the ODBC interface of SWI-Prolog (there may be something equivalent for other Prolog implementations too):
http://www.swi-prolog.org/pldoc/doc_for?object=section%280,%270%27,swi%28%27/doc/packages/odbc.html%27%29%29
I can think of a few approaches to this -
On initialization, call a method that selects all data from a table and asserts it into the Prolog database. Do this for each table. You will need to declare the shape of each row with a directive such as :- dynamic ies_row/4.
You could hook into load_files by defining user:prolog_load_file/2. From that hook you could do something similar to #1. This has the benefit of looking like a load_files call. http://www.swi-prolog.org/pldoc/man?predicate=prolog_load_file%2F2 ... This documentation mentions library(http_load), but I cannot find it anywhere (I was interested in this recently)!
There is the Draxler Prolog to SQL compiler, which translates a pattern (like the conjunction you wrote) into the more verbose SQL joins. You can find more info in the related post (Prolog to SQL converter).
But beware that Prolog has its weaknesses too, especially regarding aggregates. Without a library, getting sums, counts and the like is not very easy, and such libraries aren't that common or easy to use.
I think you could try to specialize the PHP DB interface for equijoins, using the built-in features that allow you to shorten the query text (when this results in more readable code). Working in SWI-Prolog / ODBC, where (as in PHP) you need to compose SQL, I effectively found myself working that way to handle something very similar to what you showed in the other post.
Another approach I found useful: I wrote a parser for the subset of SQL used by the MySQL backup interface (PHPMyAdmin, really). So I routinely dump my CMS's DB locally, load it into memory, apply whatever task I need, computing and writing (or applying) the insert/update/delete statements, then upload those. This works because the DB is small enough to fit in memory. I've developed, and now maintain, a small e-commerce site with this naive approach.
Writing Prolog from PHP shouldn't be too difficult: I'd try to modify an existing interface, like the awesome Adminer, which already offers a choice among basic serialization formats.
I am trying to create an old-school Text Adventure Game. I'm a bit stuck on creating the World Map and rooms.
Should the room descriptions be part of the source code or should they be separated out? I was thinking of placing all such descriptions and room properties in a MySQL database and then having code to organize the logic of each room; putting each room description in with the actual source code seems a bit untidy.
Is this the preferred method of organising descriptions in an adventure game? I was also thinking it might be preferable since I could then query the database to find common properties of the data.
Any comments would be appreciated.
No, don't include level/room descriptions within the code; it is not dynamic that way.
Many development frameworks now tend to separate code from data. So, for the usual cases, we put the game's room data in files and read those to build the level; that also makes it possible to let the user construct a new level of their own, which ends up as just another file carrying room data.
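As a rough sketch of the "rooms in a data file" approach (the file name and fields here are invented, not any particular engine's format), in Python:

    import json

    # rooms.json might look like:
    # {"hall":   {"description": "A dusty hall.",  "exits": {"down": "cellar"}},
    #  "cellar": {"description": "A damp cellar.", "exits": {"up": "hall"}}}
    with open("rooms.json") as f:
        rooms = json.load(f)

    current = "hall"
    print(rooms[current]["description"])

    # Moving is just a lookup in the data; adding or editing rooms
    # never requires touching the code.
    current = rooms[current]["exits"].get("down", current)
    print(rooms[current]["description"])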
I work at a company that builds games, and they keep the rooms separate from the code, in MySQL. The items that go in each room are also in a table, and there is another table that says which item is in which room at any given moment.
Besides, if you want to expand your game or gather statistics about it, that is much easier to do with a database.
I will address two issues here. First, you are right to keep the data that defines the game away from the engine that will use it. This means you don't have to recompile everything in order to fix a typo or the like in the case of a text-based game.
Secondly though, I would question the use of MySQL. If you are making a DOS-style game that is to be installed on people's systems, you don't want a prerequisite to be 'Install MySQL', hehe. There is a little library out there, written in C and free for all to use, called SQLite that would suit your needs much better. If, on the other hand, the web is the medium for the release of this text-based game, then have at it :)
You could just use a system like ADRIFT, then all you need to worry about are the descriptions and logic.
Should the room descriptions be part of the source code or should they be separated out?
Separated out.
Try the Prolog language.
It has a database similar to SQL (actually logical predicates).
With some skill you may be able to check whether your adventure is still finishable after some change.
You could easily generate the descriptions from logical predicates, if you don't mind them being very "computer-like".
You can find examples of Prolog text adventures with a simple Google search.
I suggest using engines that already have a vibrant community around them. That way, your source code is only that: the source code of the game. I'd go with either TADS 3 or Inform 7.
I would construct such a game as an interpreter which reads in room data and, based on that data, allows a set of valid commands (move, take, drop, change...). For movement you would have a pre-built graph with nodes being rooms and edges being allowed moves.
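A minimal sketch of that interpreter idea (room names and commands are invented for the example):

    # Nodes are rooms, edges are allowed moves.
    world = {
        "cell":     {"desc": "A bare prison cell.", "exits": {"north": "corridor"}},
        "corridor": {"desc": "A dark corridor.",    "exits": {"south": "cell"}},
    }

    def play(start="cell"):
        room = start
        while True:
            print(world[room]["desc"])
            cmd = input("> ").strip().lower()
            if cmd == "quit":
                break
            elif cmd.startswith("go "):
                # Follow an edge of the graph if the move is allowed.
                room = world[room]["exits"].get(cmd[3:], room)
            else:
                print("I don't understand that.")

    if __name__ == "__main__":
        play()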
I would separate the descriptions from the code, having a Room object that owns a Description object which calls a "database" through some Facade, so that you may use a file, a database, or anything you wish. It would also eventually allow you to add some scripting to the room itself, like having objects in your descriptions that have behaviors.
I am trying to do the following:
We are trying to design a fraud detection system for the stock market.
I know the specifications for the frauds (they are like templates),
so I want to know if I can design a template and find all records that match it.
Notice:
I can't use traditional queries because the templates are complex.
For example, one of my frauds is circular trading; it works like this:
A bought from B, B bought from C, and C bought from A (it's a cycle),
and this cycle can include 4 or 5 people.
Is there any good suggestion for this situation?
I don't see why you can't use "traditional queries" as you've stated. SQL can be used to write extraordinarily complex queries. For that matter I'm not sure that this is a hugely challenging question.
Firstly, I'd look at the behavior you have described as very transactional; therefore I'd treat the transactions as the model. I'd likely have a transactions table with some columns like buyer, seller, amount, etc...
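For illustration, here is a minimal sketch of that idea in Python/SQLite, with made-up column names, finding the three-party cycle from the question with a plain self-join:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE transactions (buyer TEXT, seller TEXT, amount REAL)")
    con.executemany("INSERT INTO transactions VALUES (?, ?, ?)", [
        ("A", "B", 100), ("B", "C", 100), ("C", "A", 100),   # circular
        ("D", "E", 50),                                      # not circular
    ])

    # A bought from B, B bought from C, and C bought from A.
    # (Each cycle appears once per rotation; dedupe as needed.
    # A 4- or 5-person cycle just needs one more join per extra party.)
    cycles = con.execute("""
        SELECT t1.buyer, t2.buyer, t3.buyer
        FROM transactions t1
        JOIN transactions t2 ON t2.buyer = t1.seller
        JOIN transactions t3 ON t3.buyer = t2.seller
        WHERE t3.seller = t1.buyer
    """).fetchall()
    print(cycles)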
You could alternatively give the shares their own table and store, say, the previous 100 owners of each share in that table using STI (Single Table Inheritance), by putting all the primary keys of the owners into an "owners" column in your shares table, like 234/823/12334/1234/... That way you can do complex queries to see whether a share was owned by the same person, or look for patterns in the string, really easily and quickly.
-update-
I wouldn't suggest making up a "small language". I don't see why you'd want to do something like that when you have a huge selection of wonderful languages and databases to choose from, all of which have well-refined and tested methods to solve exactly what you are doing.
My best advice is to pop open your IDE (thumbs up for TextMate) and pick your favorite language (Ruby in my case). Find some sample data, create your database, and start writing some code! You can't go wrong experimenting like this; it will expose better ways to go about it than we can dream up here on Stack Overflow.
Definitely Data Mining. But as you point out, you've already got the models (your templates). Look up fraud DETECTION rather than prevention for better search results?
I know some banks use SPSS PASW Modeler for fraud detection. It is very intuitive, and you can see what you are doing as you play around with the data, so you can implement your templates. I agree with Joseph: you need to get playing and make some new data structures.
Maybe a timeseries model?
Theoretically you could develop a "Small Language" first, something with a simple syntax (that makes expressing the domain - in your case fraud patterns - easy) and from it generate one or more SQL queries.
As with most solutions, this could be thought of as a slider: at one extreme there is the "full Fraud Detection Language"; at the other, you could just build stored procedures for the most common cases, and write new stored procedures which use the more "basic" blocks you wrote before to implement the various patterns.
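As a toy example of the "small language" end of that slider, here is a sketch of a generator that turns the circular-trading template into SQL for an n-person cycle (it assumes a transactions(buyer, seller, ...) table like the one sketched in the answer above):

    def circular_trading_sql(n: int) -> str:
        """SQL for a cycle of n parties: t1.seller = t2.buyer, ...,
        tn.seller = t1.buyer."""
        select = ", ".join(f"t{i}.buyer" for i in range(1, n + 1))
        joins = "\n".join(
            f"JOIN transactions t{i} ON t{i}.buyer = t{i - 1}.seller"
            for i in range(2, n + 1)
        )
        return (f"SELECT {select}\n"
                f"FROM transactions t1\n"
                f"{joins}\n"
                f"WHERE t1.buyer = t{n}.seller")

    print(circular_trading_sql(4))  # the 4-person variant of the template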
What you are trying to do falls under the Data Mining umbrella, so you could also try to learn more about it: maybe you can find a Data Mining package for your specific DB (you didn't specify which) and see if it helps you find common patterns in your data.
We are evaluating technologies to store the data that we gather during the analysis of C/C++ code. In the case of C++, the amount of data can be relatively large, ~20 MB per TU.
After reading the following SO answer it made me consider that HDF5 might be a suitable technology for us to use. I was wondering if people here could help me answer a few initial questions that I have:
Performance. The general usage for the data will be write once and read "several" times, similar to the lifetime of a '.o' file generated by a compiler. How does HDF5 compare against using something like an SQLite DB? Is that even a reasonable comparison to make?
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format. After reading the user guide I understand that HDF5 is similar to XML or a DB, in that information is associated with a tag/column and so a tool built to read an older structure will just ignore the fields that it is not concerned with? Is my understanding on this correct?
A significant chunk of the information that we wish to write out will be a tree type of structure: scope hierarchy, type hierarchy etc. Ideally we would model scopes as having parents, children etc. Is it possible to have one HDF5 object "point" to another? If not, is there a standard technique to solve this problem using HDF5? Or, as is required in a DB, do we need a unique key that would "link" one object to another with appropriate lookups when searching for the data?
Many thanks!
How does HDF5 compare against using something like an SQLite DB?
Is that even a reasonable comparison to make?
Sort of similar but not really. They're both structured files. SQLite has features to support database queries using SQL. HDF5 has features to support large scientific datasets.
They're both meant to be high performance.
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format.
If you store data in structured form, the data types of those structures are also stored in the HDF5 file. I'm a bit rusty as to how this works (e.g. if it includes innate backwards compatibility), but I do know that if you design your "reader" correctly it should be able to handle types that are changed in the future.
Is it possible to have one HDF5 object "point" to another?
Absolutely! You'll want to use attributes. Each object has one or more strings describing the path to reach it. HDF5 groups are analogous to folders/directories, except that folders/directories are hierarchical, so a unique path describes each one's location (in filesystems without hard links, at least), whereas groups form a directed graph which can include cycles. I'm not sure whether you can store a "pointer" to an object directly as an attribute, but you can always store an absolute/relative path as a string attribute (or anywhere else as a string; you could have lookup tables galore if you wanted).
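A small sketch of the path-as-string-attribute idea using h5py (the group names are invented; this is just one way to wire it up):

    import h5py

    with h5py.File("scopes.h5", "w") as f:
        parent = f.create_group("scopes/global")              # e.g. a scope
        child  = f.create_group("scopes/global/my_namespace")

        # Store the parent's absolute path as a plain string attribute;
        # you could equally keep lists of child paths, lookup tables, etc.
        child.attrs["parent"] = parent.name                   # "/scopes/global"

    with h5py.File("scopes.h5", "r") as f:
        node = f["scopes/global/my_namespace"]
        print(f[node.attrs["parent"]].name)                   # resolve the "pointer"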
We produce HDF5 data on my project, but I don't directly deal with it usually. I can take a stab at the first two questions:
We use a write once, read many times model and the format seems to handle this well. I know of a project that used to write both to an Oracle database and to HDF5. Eventually they removed the Oracle output since performance suffered and no one was using it. Obviously, SQLite is not Oracle, but the HDF5 format was better suited for the task. Based on that one data point, an RDBMS may be better tuned for multiple inserts and updates.
The readers our customers use are robust when we add new data types. Some of the changes are anticipated, but we don't have to worry about breaking things when adding more data fields. Our DBA recently wrote a Python program to read HDF5 data and populate KMZ files for visualization in Google Earth. Since it was a project he used to learn Python, I'd say it's not hard to build readers.
On the third question, I'll bow to Jason S's superior knowledge.
I'd say HDF5 is a completely reasonable choice, especially if you are already interested in it or plan to produce something for the scientific community.