The project needs to store a large weather data set (http://www1.ncdc.noaa.gov/pub/data/igra/)
on the file system with JPA. By that I mean as files on disk.
How should we store this data? For example, how should the files be organized so that we can retrieve them later?
I had a quick look at the data and the description of what it contains. I don't think it's practical to keep this data in disk files if you want to extract information from it. You'd probably be better off designing a couple of simple database tables to store the data, and querying that database to get the data for your calculations, maybe with JPA.
A library that may help you parse the data files is JFileHelpers. It makes working with fixed-width and delimited files a lot easier.
Hope this helps to get you started.
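To make the "parse the fixed-width files and load them into a table" idea concrete, here is a minimal sketch in Python with SQLite rather than Java with JPA. The column layout, the `observation` table and the function names are all hypothetical; this is NOT the real IGRA record format, so the offsets would have to be taken from the format description that ships with the data.

```python
import sqlite3

# Hypothetical fixed-width layout (NOT the real IGRA format):
# station id, year, month, day, temperature in tenths of a degree C.
FIELDS = {
    "station": slice(0, 11),
    "year": slice(11, 15),
    "month": slice(15, 17),
    "day": slice(17, 19),
    "temp_tenths": slice(19, 25),
}


def parse_line(line):
    """Cut one fixed-width record into typed values."""
    return (
        line[FIELDS["station"]].strip(),
        int(line[FIELDS["year"]]),
        int(line[FIELDS["month"]]),
        int(line[FIELDS["day"]]),
        int(line[FIELDS["temp_tenths"]]) / 10.0,
    )


def load_lines(conn, lines):
    """Create the table if needed and insert every non-blank record."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observation ("
        "station TEXT, year INTEGER, month INTEGER, day INTEGER, temp_c REAL)"
    )
    rows = [parse_line(line) for line in lines if line.strip()]
    with conn:  # one transaction for the whole batch
        conn.executemany("INSERT INTO observation VALUES (?, ?, ?, ?, ?)", rows)
    return len(rows)


def mean_temp(conn, station):
    """Example query: average temperature for one station."""
    row = conn.execute(
        "SELECT AVG(temp_c) FROM observation WHERE station = ?", (station,)
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    sample = [
        "STATION000120090115  -123",
        "STATION000120090116    77",
    ]
    conn = sqlite3.connect(":memory:")
    print(load_lines(conn, sample), "rows loaded")
    print("mean temperature:", mean_temp(conn, "STATION0001"))
```

Once the rows are in a table like this, the same table could be mapped to an entity class and queried through JPA instead of raw SQL.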
Related
I'm developing an application that extracts some information from every image in a dataset and stores this data for future use. The problem I have is how to store the data properly. Is it better to create a single annotation file (I use JSON files) for each image in the dataset, or to create one big file that contains all the extracted data?
The kind of information I'm extracting is similar from image to image but not identical. The dataset can be huge: more than 1 million images.
If relevant, I'm using Python on Linux or MacOS.
I would use a single document (file or in a NoSQL database) per dataset.
If you have > 1 million images, a single file per image will mean > 1 million files/documents.
That is not something that will be easy to manage or manipulate.
A single file/document is much easier to manage and search.
I'd also consider using a NoSQL database to store the JSON documents.
EDIT:
After considering the comments, I'd have to say that you might need to cap each JSON file at a certain amount of data, resulting in a few files per dataset.
As for files getting corrupted, that's a risk you run with any storage, even database files. That's why we have backups and replicas.
You can always run a NoSQL database locally, but again, this will need some computing resources.
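As a rough illustration of the "few files per dataset" idea, the sketch below writes the annotations into numbered shard files and starts a new file after a fixed number of records. It uses JSON Lines (one JSON object per line) instead of one large JSON array, which is my substitution and not something from the answer above; the file naming scheme and the `max_per_shard` default are assumptions too.

```python
import json
import os


def write_shards(records, out_dir, max_per_shard=100_000):
    """Write records as JSON Lines, starting a new file every max_per_shard records."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    shard = None
    try:
        for i, record in enumerate(records):
            if i % max_per_shard == 0:
                if shard is not None:
                    shard.close()
                # Hypothetical naming scheme: annotations-00000.jsonl, ...
                name = f"annotations-{i // max_per_shard:05d}.jsonl"
                path = os.path.join(out_dir, name)
                shard = open(path, "w", encoding="utf-8")
                paths.append(path)
            shard.write(json.dumps(record) + "\n")
    finally:
        if shard is not None:
            shard.close()
    return paths


def read_shards(paths):
    """Stream records back one at a time without loading a whole shard."""
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)
```

Because each line is an independent JSON object, records can differ in shape from image to image, and a damaged line only costs that one record rather than the whole file.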
I've seen some code read large amounts of data from .mat files instead of querying a database. What are the benefits of doing this as opposed to using a database? Is it possible to easily move the .mat file contents into a database and vice versa?
Reading data from a .mat file is also a kind of "database" access: you are still reading your data from a file.
Eventually, you will have to implement queries by yourself, and take care of many other issues.
Also, it is not a scalable solution, which means that for a large amount of data, it won't work well.
Of course, if you have a small amount of data and only basic queries, the fuss of setting up a database and using SQL isn't worth it.
Regarding your second question, it really depends on the data you have there.
I agree with Andrey. It depends on the data and what you want to do with it. I created a small program in Matlab that queries a relatively small .mat database, but as the database and the number of users grew, performance has been going down.
In light of this we decided to use a MySQL database. I created a small Java application that talks to the database and imported that into Matlab to move data between Matlab and MySQL. But I had to create specific queries for my data. If someone can suggest a better solution I would be grateful.
Perhaps it wouldn't be such a bad idea to write a general script that moves data between .mat files in Matlab and a SQL database: store the data in a structure and use that structure to create the tables.
If you want to discuss something like this further via email I would be happy to. Maybe we can learn a thing or two from each other.
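For what it's worth, the "general script" idea could look roughly like the sketch below, written in Python with SQLite rather than Matlab with MySQL. It assumes the data has already been loaded into a dict of equal-length columns (for example something derived from `scipy.io.loadmat`, which is not used here), and the function names are hypothetical. The dict keys become the column names of the table.

```python
import sqlite3


def _check_identifier(name):
    # Table and column names are interpolated into SQL, so restrict them.
    if not name.isidentifier():
        raise ValueError(f"unsafe SQL identifier: {name!r}")


def columns_to_table(conn, table, columns):
    """Create a table whose columns are the dict keys and insert the rows."""
    names = list(columns)
    for name in [table, *names]:
        _check_identifier(name)
    if len({len(values) for values in columns.values()}) != 1:
        raise ValueError("all columns must have the same length")
    conn.execute(f"CREATE TABLE {table} ({', '.join(names)})")
    placeholders = ", ".join("?" for _ in names)
    with conn:  # commit all rows together
        conn.executemany(
            f"INSERT INTO {table} VALUES ({placeholders})",
            zip(*columns.values()),
        )


def table_to_columns(conn, table):
    """Read a table back into the same dict-of-columns shape."""
    _check_identifier(table)
    cursor = conn.execute(f"SELECT * FROM {table}")
    names = [description[0] for description in cursor.description]
    rows = cursor.fetchall()
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}
```

This only handles flat, one-dimensional fields. Nested structures or matrices would need their own mapping, which is probably why the queries above ended up being specific to the data.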
Does anyone have recommendations on how I can find databases for random things that I might want to use in my application? For example, a database of zip code locations, area code cities, car engines, IP address locations, food calorie counts, book lists, or whatever. I'm just asking generally: when you decide you need a bunch of data, where are some good places to start looking other than Google?
EDIT:
Sorry if I was unclear. It doesn't necessarily need to be a database. Just something I could dump into a relational database like a CSV file. I was just listing some examples of things I've needed in the past. I always find myself searching all over Google for these types of things and am trying to find a few places to look first. Are there companies that just collect tons of data on things and sell it or give it away?
Bureau of Labor Statistics
The Zip Code Database Project
US Census Bureau (You can use their interface to create and download custom CSV files)
Data.gov has tons of stuff
You don't necessarily need a database to get going. A lot of the kinds of information you're listing will exist in delimited files (such as CSVs). Any structured text file will do, since importing it with most major database engines is fairly trivial. In fact, the raw data will most likely not be distributed as a database for that very reason: you can import it into the RDBMS of your choice, and the provider of the data does not need to worry about a bunch of different database formats.
Zip Code CSV file at SourceForge
List of English words (not CSV per se, but just as easy to import... for things like spell checkers)
Really good looking source of a few different datasets
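To show how little work the import usually is, here is a minimal sketch that loads a zip code CSV into SQLite. The column names (`zip`, `city`, `state`) and the table name are assumptions for illustration, not the layout of any of the files linked above.

```python
import csv
import io
import sqlite3


def import_zip_codes(conn, csv_file):
    """Load a CSV with a 'zip,city,state' header row into a zip_code table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS zip_code ("
        "zip TEXT PRIMARY KEY, city TEXT, state TEXT)"
    )
    reader = csv.DictReader(csv_file)
    with conn:  # a single transaction keeps the import fast and atomic
        conn.executemany(
            "INSERT OR REPLACE INTO zip_code (zip, city, state) "
            "VALUES (:zip, :city, :state)",
            reader,
        )


def lookup(conn, zip_code):
    """Return (city, state) for a zip code, or None if it is unknown."""
    return conn.execute(
        "SELECT city, state FROM zip_code WHERE zip = ?", (zip_code,)
    ).fetchone()


if __name__ == "__main__":
    sample = io.StringIO("zip,city,state\n02134,Boston,MA\n90210,Beverly Hills,CA\n")
    conn = sqlite3.connect(":memory:")
    import_zip_codes(conn, sample)
    print(lookup(conn, "02134"))
```

The zip code is stored as text on purpose, so that leading zeros survive the import.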
If you want to track the user's position, this http://code.google.com/apis/gears/api_geolocation.html might be a better way than a lookup table or a CSV file.
Recently, my colleagues and I have been discussing how to build a huge storage system that can store billions of pictures, which can be searched and downloaded quickly.
Something like Flickr, but not for an online gallery, which means most of these pictures will never be downloaded.
My colleagues suggest that we should save all these files in a database directly. I really feel that it's not a good idea, and I think databases are not designed for storing a huge number of binary files. But I don't have a very strong argument for why it's not a good idea.
What do you think about it?
When dealing with binary objects, follow a document-centric approach for the architecture and do not store documents like PDFs and images in the database. You will eventually have to refactor them out when you start seeing all kinds of performance issues with your database. Just store the file on the file system and keep the path in a table of your database. There is also a physical limit on the size of the data type that you would use to serialize and save the file in the database.
If you are really talking about billions of images, I would store them in the file system, because retrieval will be faster than serializing and deserializing the images.
The answers above appear to assume the database is an RDBMS. If your database is a document-oriented database with support for binary documents of the size you expect, then it may be perfectly wise to store them in the database.
It's not a good idea. The point of a database is that you can quickly resolve complex queries to retrieve textual data. While binary data can be stored in a database, it can slow transactions. This is especially true when the database is on a separate server from the running application. In the database, store meta-data and the location/filename of the images. Images themselves should be on static server(s).
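A minimal sketch of the "metadata and path in the database, bytes on the file system" layout follows. The two-level directory scheme based on a SHA-256 content hash is my own assumption (it keeps any single directory from growing too large); the answers above only say to store a path. Table and function names are hypothetical as well.

```python
import hashlib
import os
import sqlite3


def create_schema(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS image ("
        "filename TEXT PRIMARY KEY, size_bytes INTEGER, path TEXT)"
    )


def store_image(conn, root, filename, data):
    """Write the bytes under root and record only metadata plus relative path."""
    digest = hashlib.sha256(data).hexdigest()
    # Assumed layout: <root>/ab/cd/<full digest>, fanned out by hash prefix.
    rel_path = os.path.join(digest[:2], digest[2:4], digest)
    abs_path = os.path.join(root, rel_path)
    os.makedirs(os.path.dirname(abs_path), exist_ok=True)
    with open(abs_path, "wb") as f:
        f.write(data)
    with conn:
        conn.execute(
            "INSERT INTO image (filename, size_bytes, path) VALUES (?, ?, ?)",
            (filename, len(data), rel_path),
        )
    return rel_path


def load_image(conn, root, filename):
    """Look the path up in the database, then read the file from disk."""
    row = conn.execute(
        "SELECT path FROM image WHERE filename = ?", (filename,)
    ).fetchone()
    if row is None:
        raise KeyError(filename)
    with open(os.path.join(root, row[0]), "rb") as f:
        return f.read()
```

Storing the path relative to a root makes it possible to move the whole image tree to another disk or server without rewriting the table.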
I have been hired to help write an application that manages certain information for the end user. It is intended to manage a few megabytes of information, but also manage scanned images in full resolution. Should this project use a database, and why or why not?
Any question "Should I use a certain tool?" comes down to asking exactly what you want to do. You should ask yourself - "Do I want to write my own storage for this data?"
Most web based applications are written against a database because most databases support many "free" features - you can have multiple webservers. You can use standard tools to edit, verify and backup your data. You can have a robust storage solution with transactions.
The database won't help you much in dealing with the image data itself, but anything that manages a bunch of images is going to have meta-data about the images that you'll be dealing with. Depending on the meta-data and what you want to do with it, a database can be quite helpful indeed with that.
And just because the database doesn't help you much with the image data, that doesn't mean you can't store the images in the database. You would store them in a BLOB column of a SQL database.
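For comparison, the BLOB approach can be sketched as follows, assuming SQLite and a hypothetical `scan` table. This is only meant to show the mechanics, not to argue that it scales.

```python
import sqlite3


def create_schema(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scan (name TEXT PRIMARY KEY, data BLOB)"
    )


def save_scan(conn, name, data):
    """Store the raw image bytes directly in a BLOB column."""
    with conn:
        conn.execute("INSERT INTO scan (name, data) VALUES (?, ?)", (name, data))


def load_scan(conn, name):
    """Return the stored bytes, or None if there is no such scan."""
    row = conn.execute("SELECT data FROM scan WHERE name = ?", (name,)).fetchone()
    return None if row is None else bytes(row[0])
```

The image then travels with the rest of the data in backups and transactions, at the cost of a larger database file.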
If the amount of data is small, or the application is installed on many client machines, you might not want the overhead of a database.
Is it intended to be installed on many users' machines? Having to ensure that whatever database engine you choose runs on every client installation is not optimal. Since the amount of data is small, I think XML would be adequate here. You could Base64 encode the images and store them as CDATA.
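A small sketch of the Base64-in-XML idea, assuming a made-up `record`/`image` element structure. It stores the encoded image as ordinary element text instead of a CDATA section; the Base64 alphabet contains no characters that XML needs escaped, so the two are equivalent here.

```python
import base64
import xml.etree.ElementTree as ET


def build_document(title, image_bytes):
    """Serialize a title and an image into one XML string."""
    root = ET.Element("record")
    ET.SubElement(root, "title").text = title
    image = ET.SubElement(root, "image", {"encoding": "base64"})
    image.text = base64.b64encode(image_bytes).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def read_image(xml_text):
    """Extract and decode the image bytes from the XML string."""
    root = ET.fromstring(xml_text)
    return base64.b64decode(root.find("image").text)
```

Keep in mind that Base64 grows the data by about a third, which matters for full-resolution scans.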
Will the application be run on a server? If you have concurrent users, then databases have concepts for handling these scenarios (transactions), and that can be helpful. And the scanned image data would be appropriate for a BLOB.
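To illustrate what transactions buy you, here is a minimal sketch with SQLite and a hypothetical `document` table: either every insert in a batch is committed, or none of them is.

```python
import sqlite3


def create_schema(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS document (name TEXT UNIQUE NOT NULL)"
    )


def add_batch(conn, names):
    """Insert all names atomically; return False if the batch was rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for name in names:
                conn.execute("INSERT INTO document (name) VALUES (?)", (name,))
    except sqlite3.IntegrityError:
        return False
    return True


def all_names(conn):
    rows = conn.execute("SELECT name FROM document ORDER BY name").fetchall()
    return [row[0] for row in rows]
```

If the second name of a batch violates the UNIQUE constraint, the first one is rolled back too, so other users never see a half-applied change.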
You shouldn't store images in the database, as is the general consensus here.
The file system is just much better at storing images than your database is.
You should use a database to store meta information about those images, such as a title, description, etc, and just store a URL or path to the images.
When it comes to storing images in a database, I try to avoid it. In your case, from what I can gather from your question, there is a possibility of a substantial number of fairly large images, so I would probably strongly oppose it.
If this is a web application, I would use a database for quick searching and indexing of images using keywords and other parameters. Then have a column pointing to the location of the image in the file system, if possible with some kind of folder structure to help further decrease the image load time.
If you need greater security due to the directory being available (network share) and the application is local then you should probably bite the bullet and store the images in the database.
My gut reaction is "why not?" A database is going to provide a framework for storing information, with all of the input/output/optimization functions provided in a documented format. You can go with a server-side solution, or a local database such as SQLite or the local version of SQL Server. Either way you have a robust, documented data management framework.
This post should give you most of the opinions you need about storing images in the database. Do you also mean 'should I use a database for the other information?' or are you just asking about the images?
A database is meant to manage large volumes of data, and is supposed to give you fast access to read and write that data in spite of its size. Put simply, databases manage scale for data: scale that you don't want to deal with yourself. If you have only a few users (hundreds?), you could just as easily manage the data on disk (say, as XML) and keep it in memory. The images should clearly not go into the database, so the question is: how much data, or for how many users, are you maintaining this database instance?
If you want to have a structured way to store and retrieve information, a database is most definitely the way to go. It makes your application flexible and more powerful, and lets you focus on the actual application rather than incidentals like trying to write your own storage system.
For individual applications, SQLite is great. It fits right in an app as a file; no need for a whole RDBMS juggernaut.
There are a lot of factors to this. But, being a database weenie, I would err on the side of having a database. It just makes life easier when things change, and things will change.
Depending on the images, you might store them on the file system or actually blob them and put them in the database (not supported in all DBMS's). If the files are very small, then I would blob them. If they are big, then I would keep them on the file system and manage them myself.
There are so many free or cheap DBMS's out there that there really is no excuse not to use one. I'm a SQL Server guy, but if your application is that simple, then the free version of MySQL should do the job. In fact, it has some pretty cool stuff in there.
Our CMS stores all of the check images we process. It uses a database for metadata and lets the file system handle the scanned images.
A simple database like SQLite sounds appropriate - it will let you store file metadata in a consistent, transactional way. Then store the path to each image in the database and let the file system do what it does best - manage files.
SQL Server 2008 has a new data type built for in-database files, but before that BLOB was the way to store files inside the database. On a small scale that would work too.