Finding databases for use in applications

Does anyone have recommendations on how I can find databases of random things that I might want to use in my application? For example, a database of zip code locations, area code cities, car engines, IP address locations, food calorie counts, book lists, or whatever. I'm just asking generally: when you decide you need a bunch of data, where are some good places to start looking other than Google?
EDIT:
Sorry if I was unclear. It doesn't necessarily need to be a database, just something I could dump into a relational database, like a CSV file. I was just listing some examples of things I've needed in the past. I always find myself searching all over Google for these types of things, and am trying to find a few places to look first. Are there companies that just collect tons of data on things and sell it or give it away?

Bureau of Labor Statistics
The Zip Code Database Project
US Census Bureau (You can use their interface to create and download custom CSV files)
Data.gov has tons of stuff

You don't necessarily need a database to get going. A lot of the kinds of information you're listing will exist in delimited files (such as CSVs). Any structured text file will do, since importing with most major database engines is fairly trivial. In fact, the raw data will most likely not exist in a database for exactly that reason: the files can be imported into the RDBMS of your choice, and the provider of the data doesn't need to worry about a bunch of different database formats (see the sketch after the links below).
Zip Code CSV file at SourceForge
List of English words (not a CSV per se, but just as easy to import... for things like spell checkers)
A really good-looking source of a few different datasets
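For a concrete sense of how trivial the import step is, here is a minimal Python sketch that loads a delimited file into SQLite; the file name, table name, and header layout are hypothetical:

```python
import csv
import sqlite3

# Hypothetical input: a "zipcodes.csv" whose first row is a header.
conn = sqlite3.connect("data.db")
with open("zipcodes.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    cols = ", ".join(header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE IF NOT EXISTS zipcodes ({cols})")
    conn.executemany(f"INSERT INTO zipcodes VALUES ({placeholders})", reader)
conn.commit()
conn.close()
```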

If you want to track the user's position, this http://code.google.com/apis/gears/api_geolocation.html might be a better way than a lookup table or a CSV file.

Related

Loading different types of files into a database

At my work, we receive thousands of emails per day with attached files (xlsx, csv, xml, html, pdf, etc.). Those emails get processed by a program, and the files get downloaded and filtered into different folders depending on the sender.
The data in those files is then loaded into our MS SQL Server DB by proprietary software for which we are charged a hefty sum of money each year.
Now, I'm sure this is a super common process for enterprises, so there are probably some open source tools that we could use to replace this software.
How do most people do this? With individual scripts? What would be the correct way to do it?
Thank you very much!
EDIT: all the email attachments get converted to xlsx before being moved to their respective folders.
The 'correct' way would be to not use Excel as the interchange format - its fuzzy notion of data types (e.g. numbers vs. text) creates many issues.
Still, Excel feeds are a fact of life, so it's crucial to have a comprehensive and reliable way of dealing with multiple feeds that sometimes fail. I like this approach; the example uses my company's commercial cross-platform .NET ETL library, but I've also used the same approach with several other ETL tools previously.
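For a sense of what the basic load step looks like without commercial tooling, here is a minimal Python sketch using pandas and SQLAlchemy; the connection string, folder layout, and table naming are all assumptions, not the asker's actual setup:

```python
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string -- swap in your real server and driver.
engine = create_engine(
    "mssql+pyodbc://user:pass@server/OurDb?driver=ODBC+Driver+17+for+SQL+Server"
)
done = Path("processed")
done.mkdir(exist_ok=True)

# Assumes one folder per sender, as described in the question, and that
# the folder name doubles as the target table name.
for path in Path("incoming").rglob("*.xlsx"):
    table = path.parent.name
    df = pd.read_excel(path)          # first sheet, header row assumed
    df.to_sql(table, engine, if_exists="append", index=False)
    path.rename(done / path.name)     # move aside so it isn't re-loaded
```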
Cheers,
Kristian

Developers with C#, VBA, and Python want to send data into a database

I have a question that has me quite confused. I am a recent graduate who started working at a company with lots of old data in text files. I have organised all this data using Python and generated an Excel file, so that part of the situation is good.
The problem is the new incoming sensor data from different developers.
Most of them use VB scripting; they test the data and save it in text files using delimiters (, and |).
Another developer builds projects in C#, and his text data comes with different delimiters (: and ,).
My question: I have to set up the database, so how can I get all these developers' data into the database automatically? For this text data I have already created the column names and table structure.
Do I need to develop an individual interface for each developer, or is there a common interface I should focus on? Where to start is still unclear to me.
I hope people out there understand my situation.
Best regards
Vinod
Since you are trying to take data from different sources and you are using SQL Server, you can look into SQL Server Integration Services (SSIS). You'd basically create SSIS packages, one for each incoming data source, and transfer the data into one homogeneous database.
These types of implementations and design patterns really depend on your company and the development environment you are working in.
As for choosing the right database, this also depends on the type of data, its velocity and volume, and the common questions your company would like to answer using that data.
Some very general guidelines:
Schema-based (relational) databases differ from "NoSQL" databases.
Searching within text is different from searching simpler primitive types, and you have to know what you will be doing!
If you are able to, converting both the VB-script and C# text files to a popular structured format (e.g. JSON or CSV; a comma-delimited text file is the same thing) might help you store the data in almost any database. See the sketch after this list.
There is no right or wrong in implementing a specific interface for each data supplier; it depends on the alternatives at stake and your coding style.
Consult with others & read case studies!
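To illustrate the conversion idea from the list above, here is a minimal Python sketch that normalises differently delimited text files into one CSV; the folder names, delimiters, and column list are hypothetical:

```python
import csv
from pathlib import Path

# Hypothetical layout: one folder per feed, one delimiter per feed.
SOURCES = {
    "vb_feed": "|",       # the VB scripts write pipe-delimited files
    "csharp_feed": ":",   # the C# project writes colon-delimited files
}
COLUMNS = ["sensor_id", "timestamp", "value"]  # assumed target structure

with open("combined.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(COLUMNS)
    for folder, delimiter in SOURCES.items():
        for path in Path(folder).glob("*.txt"):
            with open(path, newline="") as f:
                for row in csv.reader(f, delimiter=delimiter):
                    writer.writerow(row)
```

Once everything is in one well-known format, any bulk-load tool (SSIS, BULK INSERT, etc.) can take it from there.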
Hope that's helpful!
You should consider using Excel's Power Query tool. It can take in almost any file type (including text files) from multiple locations, it lets you cleanse and transform the data easily (splitting on whatever delimiters you please), and storage is essentially unlimited, as opposed to a regular Excel spreadsheet, which maxes out at about one million rows (1,048,576).
I don't know your company's situation, but PDI (the Pentaho Data Integration tool) is good for data transformations, and the Community Edition is free!
However, since you already generated an Excel file, Power Query might be a better tool for this project if your data is relatively small.

Neo4J: Binary File storage and Text Search "stack"

I have a project I would like to work on which I feel is a beautiful case for Neo4j. But there are aspects of implementing this that I do not understand well enough to succinctly list my questions. So instead, I'll let the scenario speak for itself:
Scenario: In short, I want to build an application for users who will receive files of various types, such as Word docs, Excel sheets, images, audio clips and even videos (although not so many videos), and allow them to upload and categorize these.
With each file, they will enter any and all associations. Examples:
If Joe authors a PDF, Joe is associated with the PDF.
If a DOC says that Sally is Mary's mother, Sally is associated with Mary.
If Bill sent an email to Jane, Bill is associated with Jane (and the email).
If company X sends an invoice (Excel grid) to company Y, X is associated with Y.
and so on...
So the basic goal at this point would be to:
Have users load in files as they receive them.
Enter the associations that each file contains.
Review associations holistically, in order to predict or take some action.
Generate a report of the associations of interest, including the files that the associations are based on.
The value of this project is in the associations, which in reality would grow much more complex than the above examples and should produce interesting conclusions. However, if the user is asked "How did you come to that conclusion?", they need to be able to produce a summary of the associations, as well as any files these associations are based on, i.e. the PDF or Excel file or whatever.
Initial thoughts...
I should also add that this application would be hosted internally and used by approximately 50 users, so I probably don't need the fastest, most scalable, highest-availability solution possible. The data being loaded could get rather large though, maybe up to a terabyte in a year? (Not the associations, but the actual files.)
Wouldn't it be great if Neo4j just did all of this! Obviously it should handle the graph aspects of this very nicely, but I figure that the file storage and text search are going to need another player added to the mix.
Some combinations of solutions I know of would be:
Store EVERYTHING, including files as binaries, in Neo4j.
I would be wrestling Neo4j into something it's not built for.
How would I search text?
Store only associations and metadata in Neo4j, and uploaded files on the file system.
How would I do text searches on files that are stored on a file server?
Store only associations and metadata in Neo4j, and uploaded files in Postgres.
Not so confident about having all my files inside a DB; I feel more comfortable having all my files accessible in folders.
Everyone says it's great to put your files in a DB. Everyone says it's not great to put your files in a DB.
Get to the bloody questions..
Can anyone suggest a good "stack" that would suit the above?
Please give a basic outline of how you would implement your suggestion, e.g.:
Have the application store the data into Neo4J, then use triggers to update Postgres.
Or have the files loaded into Postgres and triggers update Neo4J.
Or have the application load data into Neo4j and then load the files into Postgres.
etc
How you would tie these together is probably what I am really trying to grasp.
Thank you very much for any input on this.
Cheers.
p.s. What a ramble! If you feel the need to edit my question or title to simplify, go for it! :)
Here are my recommendations:
Never store binary files in the database. Store them in the filesystem or a service like AWS S3 instead, and reference the file in your data model.
I would store the file first in S3, with a reference to it in your primary database (Neo4j?).
If you want to be able to search for any word in a document, I would recommend using a full-text search engine like Elasticsearch. Elasticsearch can scan multiple document formats, such as PDF, using Tika.
You can probably also use Elastic/Tika to search for relationships in the documents and surface them in order to update your graph.
Suggested Stack:
Neo4j
Elasticsearch
AWS S3 or some other redundant filesystem to avoid data loss
Bonus: See this SO question/answer for best practices on indexing files in multiple formats using ES.
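To make the tie-together concrete, here is a rough Python sketch of the flow the application itself could drive; the bucket, index, URIs, and credentials are placeholders, and it assumes the boto3, neo4j, and elasticsearch client libraries:

```python
import boto3
from elasticsearch import Elasticsearch
from neo4j import GraphDatabase

# Placeholder endpoints and credentials throughout.
s3 = boto3.client("s3")
es = Elasticsearch("http://localhost:9200")
graph = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

def store_document(local_path, doc_id, extracted_text, people):
    # 1. Put the raw binary in S3; only the key goes into the databases.
    s3.upload_file(local_path, "my-doc-bucket", doc_id)

    # 2. Index the text (e.g. extracted via Tika) for full-text search.
    es.index(index="documents", id=doc_id, document={"text": extracted_text})

    # 3. Record the file node and its associations in Neo4j.
    with graph.session() as session:
        session.run("MERGE (f:File {id: $id})", id=doc_id)
        for name in people:
            session.run(
                "MATCH (f:File {id: $id}) "
                "MERGE (p:Person {name: $name}) "
                "MERGE (p)-[:ASSOCIATED_WITH]->(f)",
                id=doc_id, name=name,
            )
```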

Do database systems contain/store pictures, sounds and files, or just a reference to them?

The other day I was watching a tutorial video on YouTube about databases, and in the introduction it said "a DB stores files such as pictures, sound files...".
As far as I know, database systems only store text. This text could be alphabetical, numerical, and other characters, and by using a combination of these, we can create links to the actual files that are stored in a server directory.
Am I wrong?
Most relational databases allow for the storage of binary data via some datatype supporting it. In Oracle, for example, this is the BLOB datatype (Binary Large Object).
Whether or not this is a good idea is highly subjective and depends on the case at hand, but it definitely is possible. See the sketch below.
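As a small illustration, here is how binary storage looks in practice with SQLite's BLOB type (the file name is hypothetical; Oracle's BLOB works the same way in spirit):

```python
import sqlite3

conn = sqlite3.connect("media.db")
conn.execute("CREATE TABLE IF NOT EXISTS images (name TEXT, data BLOB)")

# Read the raw bytes and store them directly in the BLOB column.
with open("photo.jpg", "rb") as f:  # hypothetical file
    conn.execute(
        "INSERT INTO images VALUES (?, ?)", ("photo.jpg", f.read())
    )
conn.commit()
conn.close()
```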

cross-platform frameworks for storage + metadata?

I don't quite know what to use for terminology, so bear with me...
Are there any cross-platform frameworks out there that facilitate a kind of "virtual file storage" to encapsulate adding files along with a database of metadata? I'm thinking about something along the lines of iTunes or iPhoto, where the program manages a whole bunch of files (in those cases audio or image files) and has a database of metadata so you can organize/find those files easily. I'd like to cobble together something along those lines for files in general.
edit: I am hesitant to store files in a database alone, e.g. MySQL, as there would potentially be tens of gigabytes in my application (this issue has been mentioned in several SO posts; see this one, which gives several links to others). I'm looking at CouchDB, though, and maybe it has promise...
Why not a database? You can keep files inside a database.
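For contrast, the iTunes-style library the asker describes (files in a managed folder, metadata in a database) can be cobbled together in a few lines; the schema, folder name, and tags below are hypothetical:

```python
import hashlib
import shutil
import sqlite3
from pathlib import Path

# Files live in a content-addressed folder; metadata lives in SQLite.
LIBRARY = Path("library")
LIBRARY.mkdir(exist_ok=True)
db = sqlite3.connect(str(LIBRARY / "catalog.db"))
db.execute(
    "CREATE TABLE IF NOT EXISTS files "
    "(sha1 TEXT PRIMARY KEY, original_name TEXT, tags TEXT)"
)

def add_file(path, tags=""):
    data = Path(path).read_bytes()
    sha1 = hashlib.sha1(data).hexdigest()
    shutil.copy(path, LIBRARY / sha1)  # store under its content hash
    db.execute(
        "INSERT OR IGNORE INTO files VALUES (?, ?, ?)",
        (sha1, Path(path).name, tags),
    )
    db.commit()

add_file("vacation.jpg", tags="photos")  # hypothetical file
```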
