How to store data crawled from a website - database

I want to crawl a website and store the content on my computer for later analysis. However, my OS file system has a limit on the number of subdirectories, so storing the original folder structure is not going to work.
Suggestions?
Should I map each URL to some filename so I can store the pages flatly? Or just shove everything into a database like SQLite to avoid file system limitations?

It all depends on the effective amount of text and/or web pages you intend to crawl. A generic solution is probably to:
use an RDBMS (a SQL server of sorts) to store the metadata associated with the pages.
Such info would be stored in a simple table (maybe with a few support/related tables) containing fields such as URL, file name (where you'll be saving it), offset in the file where the page is stored (the idea being to keep several pages in the same file), date of crawl, size, and a few other fields.
use flat file storage for the text proper.
The file name and path matter little (i.e., the path may be shallow and the name cryptic/automatically generated); this name/path is stored in the metadata. Several crawled pages are stored in the same flat file, to avoid the OS overhead of managing too many files. The text itself may be compressed (ZIP, etc.) on a per-page basis (there's little extra compression gain to be had by compressing bigger chunks), so each page can be handled on its own (no need to decompress all the text before it!). Whether to use compression depends on various factors; the compression/decompression overhead is typically minimal, CPU-wise, and it offers a nice saving in disk space and generally improves disk I/O.
The advantage of this approach is that the DBMS remains small but is available for SQL-driven queries (of an ad-hoc or programmed nature) to search on various criteria. There is typically little gain (and a lot of headache) in storing many/big files within the SQL server itself. Furthermore, as each page gets processed/analyzed, additional metadata (such as, say, title, language, the five most repeated words, whatever) can be added to the database.
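A minimal sketch of this approach in Python, assuming SQLite for the metadata and zlib for per-page compression; the table layout and file names here are illustrative, not prescriptive:

    import sqlite3, time, zlib

    db = sqlite3.connect("crawl_meta.db")
    db.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            id        INTEGER PRIMARY KEY,
            url       TEXT UNIQUE,
            file_name TEXT,     -- flat file holding many pages
            offset    INTEGER,  -- byte offset of this page within the file
            size      INTEGER,  -- compressed size in bytes
            crawled   REAL      -- timestamp of the crawl
        )""")

    def store_page(url, html, file_name="pages_0001.bin"):
        """Append one zlib-compressed page to a shared flat file and record where it went."""
        blob = zlib.compress(html.encode("utf-8"))
        with open(file_name, "ab") as f:
            f.seek(0, 2)        # append mode writes at the end anyway; make the offset explicit
            offset = f.tell()
            f.write(blob)
        db.execute("INSERT OR REPLACE INTO pages (url, file_name, offset, size, crawled) "
                   "VALUES (?, ?, ?, ?, ?)", (url, file_name, offset, len(blob), time.time()))
        db.commit()

    def load_page(url):
        """Fetch a single page by URL without touching any other page in the file."""
        row = db.execute("SELECT file_name, offset, size FROM pages WHERE url = ?",
                         (url,)).fetchone()
        if row is None:
            return None
        file_name, offset, size = row
        with open(file_name, "rb") as f:
            f.seek(offset)
            return zlib.decompress(f.read(size)).decode("utf-8")

Because each page is compressed individually, load_page can seek straight to one page, exactly as described above.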

Having it in a database will help you search through the content and page metadata. You can also try in-memory databases or memcached-like storage to speed things up.

Depending on the processing power of the PC that will do the data mining, you could add the scraped data to a compressed archive like a 7zip, zip, or tarball. You'll be able to keep the directory structure intact and may end up saving a great deal of disk space, if that happens to be a concern.
On the other hand, an RDBMS like SQLite will balloon out really fast but won't mind ridiculously long directory hierarchies.
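If you go the archive route, Python's standard tarfile module is enough for a first pass; the path names below are made up for illustration:

    import tarfile

    # Pack the crawl into a single compressed archive, directory structure intact.
    with tarfile.open("crawl.tar.gz", "w:gz") as tar:
        tar.add("crawled_site/")

    # Individual members can later be listed and extracted without unpacking everything.
    with tarfile.open("crawl.tar.gz", "r:gz") as tar:
        names = tar.getnames()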

Related

Database vs File system storage

A database ultimately stores its data in files, and a file system also stores data in files. So what is the difference between a DB and a file system? Is it in the way data is retrieved, or something else?
A database is generally used for storing related, structured data, with well defined data formats, in an efficient manner for insert, update and/or retrieval (depending on application).
On the other hand, a file system is a more unstructured data store for storing arbitrary, probably unrelated data. The file system is more general, and databases are built on top of the general data storage services provided by file systems. [Quora]
The file system is useful if you are looking for a particular file, as operating systems maintain a sort of index. However, the contents of a txt file won't be indexed, which is one of the main advantages of a database.
For very complex operations, the filesystem is likely to be very slow.
Main RDBMS advantages:
Tables are related to each other
SQL query/data processing language
Transaction-processing extensions to SQL (e.g., Transact-SQL)
Server-client implementation with server-side objects like stored procedures, functions, triggers, views, etc.
Advantages of the file system over a database management system:
When handling small data sets of arbitrary, probably unrelated data, a file is more efficient than a database.
For simple read and write operations, file operations are faster and simpler.
You can find any number of further differences online.
"They're the same"
Yes, storing data is just storing data. At the end of the day, you have files. You can store lots of stuff in lots of files and folders, and there are situations where this will be the way to go. There is a well-known versioning solution (SVN) that eventually ended up using a filesystem-based model to store its data, ditching its BerkeleyDB backend. Rare, but it happens.
"They're quite different"
In a database, you have options you don't have with files. Imagine a text file (something like TSV/CSV) with 99,999 rows. Now try to:
Insert a column. It's painful, you have to alter each row and read+write the whole file.
Find a row. You either scan the whole file or build an index yourself.
Delete a row. Find row, then read+write everything after it.
Reorder columns. Again, full read+write.
Sort rows. Full read, some kind of sort - then do it next time all over.
There are lots of other good points, but these are the first mountains you have to climb when you think of a file-based DB alternative. Those guys programmed all this for you; it's yours to use. Think of the likely (most frequent) scenarios, enumerate all the actions you want to perform on your data, and decide which one works better for you. Think in benefits, not fashion.
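To make the first of those points concrete, here is a small sketch, assuming an existing data.csv and an SQLite table t (both names are made up):

    import csv, sqlite3

    # Flat file: adding a column means reading and rewriting every single row.
    with open("data.csv", newline="") as src:
        rows = [row + ["default"] for row in csv.reader(src)]
    with open("data.csv", "w", newline="") as dst:
        csv.writer(dst).writerows(rows)

    # Database: one statement; the engine handles the storage details.
    db = sqlite3.connect("data.db")
    db.execute("ALTER TABLE t ADD COLUMN extra TEXT DEFAULT 'default'")
    db.commit()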
Again, if you're storing JPG pictures and only ever look them up by one key (their ID, maybe?), a well-thought-out filesystem storage is better. Filesystems, by the way, are close to databases today, as many of them use a balanced-tree approach; on Btrfs you can just put all your pictures in one folder, and the OS will silently run something like an early SQL query each time you access your files.
So, database or files?...
Let's see a few typical examples of when one is better than the other. (These are not complete lists; surely you can add a lot more on both sides.)
DB tables are much better when:
You want to store many rows with the exact same structure (no block waste)
You need lightning-fast lookup / sorting by more than one value (indexed tables)
You need atomic transactions (data safety)
Your users will read/write the same data all the time (better locking)
Filesystem is way better if:
You like to use version control on your data (a nightmare with dbs)
You have big chunks of data that grow frequently (typically, logfiles)
You want other apps to access your data without API (like text editors)
You want to store lots of binary content (pictures or mp3s)
TL;DR
Programming rarely says "never" or "always". Those who say "database always wins" or "files always win" probably just don't know enough. Think of the possible actions (now + future), consider both ways, and choose the fastest / most efficient for the case. That's it.
Something to be aware of is that Unix has what is called an inode limit. If you are storing millions of records, this can be a serious problem. Run df -i to view the percentage used; this is effectively a filesystem limit on the number of files, EVEN IF you have plenty of disk space.
The differences between a file processing system and a database management system are as follows:
A file processing system is a collection of programs that store and manage files on the computer's hard disk, while a database management system is a collection of programs that enables you to create and maintain a database.
A file processing system has more data redundancy; a DBMS has less.
A file processing system provides less flexibility in accessing data, whereas a DBMS offers more.
A file processing system does not provide data consistency, whereas a DBMS provides it through normalization.
A file processing system is less complex, whereas a DBMS is more complex.
Context: I've written a filesystem that has been running in production for 7 years now. [1]
The key difference between a filesystem and a database is that the filesystem API is part of the OS, so filesystem implementations have to implement that API and thus follow certain rules, whereas databases are built by third parties with complete freedom.
Historically, databases were created when the filesystems provided by the OS were not good enough for the problem at hand. Just think about it: if you had special requirements, you couldn't just call Microsoft or Apple to redesign their filesystem API. You would either go ahead and write your own storage software or look around for existing alternatives. So the need created a market for third-party data storage software, which ended up being called databases. That's about it.
While it may seem that filesystems have certain rules, like having files and directories, this is not true. The biggest operating systems work like that, but there are many small OSs that work differently. It's certainly not a hard requirement. (Just remember: to build a new filesystem, you also need to write a new OS, which will make adoption quite a bit harder. Why not focus on just the storage engine and call it a database instead?)
In the end, both databases and filesystems come in all shapes and sizes. Transactional, relational, hierarchical, graph, tabled; whatever you can think of.
[1] I've worked on the Boomla Filesystem which is the storage system behind the Boomla OS & Web Application Platform.
The main differences between database and file system storage are:
A database is a software application used to insert, update, and delete data, while a file system is software used to add, update, and delete files.
Saving and retrieving files is simpler in a file system, while SQL needs to be learned to perform any query on a database to get (SELECT), add (INSERT), or update data.
A database provides a proper data recovery process, while a file system does not.
In terms of security, a database is (usually) more secure than a file system.
Migration is very easy in a file system: just copy and paste into the target. For a database, this task is not as simple.

Is CouchDB good for lots of documents with file attachments across multiple servers?

I would love to hear your thoughts about CouchDB and whether it would handle my use case.
I will have a database where I store documents of about 20KB each, with an attachment of 1-10MB per document.
Will Couch handle a database of 10TB or more per server with my schema? (In a 4U case you can fit 24 2TB drives; is this too much per Couch node? There will be very few reads, so I don't need speed.)
Will Couch be able to replicate all documents with their attachments?
How about splitting all the data across multiple servers (for example, 4 nodes)? Will it handle that many attachments?
What problems do you see here?
If you need more info, please ask :)
I don't think you will hit a physical limitation with a 10TB file; that is, I don't think Couch has some built-in "can't use files bigger than X" with X < 10TB.
However.
The biggest issue is file compaction. In order to reclaim space, Couch wants to compact the file, which effectively means copying it. So, at least for a while, 10TB needs to be 20TB, as the live data is duplicated into the new copy.
If you are mostly appending to the file, that is, simply adding new data and not updating or overwriting old data, then this will be less of a problem, as compaction won't gain you that much. If your data is basically static, I would build the file, compact it one final time, and be done with it.
There are third-party sharding solutions for Couch; Lounge is a popular one.
When I approach a Couch solution, the primary thing to consider is what your query criteria are. Couch is all about the views, really. What kind of views are you looking at? If you're simply storing data by some simple key (file name, date, or whatever), you may well be better off simply using a file system and an appropriate directory structure, frankly.
So I'd like to hear more about the views you plan to use, since you don't intend to do a lot of reading.
Addenda:
You still haven't mentioned what kind of queries you're looking for. The queries are, effectively, THE design component, especially for a Couch DB since it gets more and more difficult to add new queries on large datasets.
When you said attachments, I assumed you meant attachments to the Couch DB payload (since it can handle attachments).
So, all that said, you could easily create a metadata document capturing whatever information you want, and as part of that document store the path name of the actual file kept on the file system. This will reduce the overall size of the Couch file dramatically, which makes maintenance faster and more efficient. You lose some of the "self-contained" quality of having everything in a single document, of course.
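A rough sketch of that split using CouchDB's plain HTTP API from Python; the server URL, database name, and document fields are all assumptions for illustration:

    import os, requests

    COUCH = "http://localhost:5984"
    DB = "docs_meta"

    requests.put(f"{COUCH}/{DB}")  # create the database (returns 409 if it already exists)

    def store(doc_id, payload, directory="blobs"):
        """Keep the big payload on the filesystem; store only small metadata in Couch."""
        os.makedirs(directory, exist_ok=True)
        path = os.path.join(directory, doc_id)
        with open(path, "wb") as f:
            f.write(payload)
        meta = {"path": path, "size": len(payload)}
        requests.put(f"{COUCH}/{DB}/{doc_id}", json=meta)

Views then only ever have to index the small metadata documents, not the multi-megabyte attachments.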

Database storage engine for images

My application uses a database for (among other things) storing scanned documents. These are usually JPEG images, and mostly just text.
I know it's recommended to store images as files and link to them from the database, but it's just easier to retrieve them this way. The application uses a single server and multiple clients, and storing files would waste a lot of hard drive space (overhead, sector allocation, etc.) and require shared folders, mapped drives, and so on. Slow, and less secure.
Anyway, some of our customers have been using our application for a few years, and their databases have grown into the tens of gigabytes, mostly because of these images. At the moment they're stored in an InnoDB table (MySQL).
Question is: is there a better way?
The data itself is write-once, and not normally deleted (but theoretically possible), so something that is slow to store (within reason), unchangeable and fast to access would be perfect.
I'm thinking of making my own storage engine that would include compression, indexing, and caching that only allows two columns (bigint, blob) per table. Does something like this already exist?
SQL Server 2008 introduced FILESTREAM, which lets you store pointers in the database but keep the blobs out of it, in a directory (as they should be). The neat thing is that when you back up the database, the blobs get backed up as well. Best of both worlds. The directories are access-restricted too; typically only SQL Server itself gets access.
Store the images as files, keep the pointers in the database.
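One hedged way to implement that advice, given that the data is write-once: content-addressed files on disk, with only the path (plus a hash for integrity) kept in the database. All names here are illustrative:

    import hashlib, os, sqlite3

    db = sqlite3.connect("images.db")
    db.execute("CREATE TABLE IF NOT EXISTS images "
               "(id INTEGER PRIMARY KEY, sha256 TEXT, path TEXT)")

    def store_image(data, root="images"):
        digest = hashlib.sha256(data).hexdigest()
        # Two-level fan-out keeps any single directory small.
        directory = os.path.join(root, digest[:2], digest[2:4])
        os.makedirs(directory, exist_ok=True)
        path = os.path.join(directory, digest)
        if not os.path.exists(path):   # write-once: identical scans dedupe for free
            with open(path, "wb") as f:
                f.write(data)
        db.execute("INSERT INTO images (sha256, path) VALUES (?, ?)", (digest, path))
        db.commit()
        return path

The fan-out also sidesteps the per-directory limits mentioned earlier in this thread.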

How to efficiently store hundreds of thousands of documents?

I'm working on a system that will need to store a lot of documents (PDFs, Word files, etc.). I'm using Solr/Lucene to search for relevant information extracted from those documents, but I also need a place to store the original files so that they can be opened/downloaded by the users.
I was thinking about several possibilities:
file system - probably not that good an idea for storing 1M documents
SQL database - but I won't need most of its relational features, as I only need to store the binary document and its ID, so this might not be the fastest solution
NoSQL database - I don't have any experience with these, so I'm not sure if they are any good either; there are also many of them, so I don't know which one to pick
The storage I'm looking for should be:
fast
scalable
open-source (not crucial but nice to have)
Can you recommend what, in your opinion, would be the best way of storing those files?
A filesystem -- as the name suggests -- is designed and optimised to store large numbers of files in an efficient and scalable way.
You can follow Facebook's example, as it stores a lot of files (15 billion photos):
They initially started with NFS shares served by commercial storage appliances.
Then they moved to their own HTTP file server implementation, called Haystack.
Here is a Facebook note if you want to learn more: http://www.facebook.com/note.php?note_id=76191543919
Regarding the NFS share: keep in mind that NFS shares usually limit the number of files in one folder for performance reasons. (This can be a bit counterintuitive if you assume that all recent file systems use B-trees to store their structure.) So if you are using commercial NFS appliances (like NetApp) you will likely need to keep your files in multiple folders.
You can do that if you have any kind of ID for your files: just divide its ASCII representation into groups of a few characters and make a folder for each group.
For example, we use integers for IDs, so a file with ID 1234567891 is stored as storage/0012/3456/7891.
Hope that helps.
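That scheme fits in one small function; here it is as a sketch (the 12-digit zero-padding matches the example above, but the widths are assumptions):

    import os

    def id_to_path(file_id, root="storage", width=12, group=4):
        s = str(file_id).zfill(width)                   # 1234567891 -> "001234567891"
        parts = [s[i:i + group] for i in range(0, width, group)]
        return os.path.join(root, *parts)               # -> storage/0012/3456/7891

    print(id_to_path(1234567891))  # storage/0012/3456/7891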
In my opinion...
I would store the files compressed on disk (in the file system) and use a database to keep track of them, possibly SQLite if that's its only job.
File system: looking at the big picture, the DBMS uses the file system underneath anyway, and the file system is dedicated to keeping files, so you get its optimizations for free (as LukeH mentioned).

BLOB Storage - 100+ GB, MySQL, SQLite, or PostgreSQL + Python

I have an idea for a simple application that will monitor a group of folders and index any files it finds. A GUI will allow me to quickly tag new files and move them into a single database for storage, and will also provide an easy mechanism for querying the DB by tag, name, file type, and date. At the moment I have about 100+ GB of files on a couple of removable hard drives, and the database will be at least that big. If possible, I would like to support full-text search of the embedded binary and text documents. This will be a single-user application.
Not trying to start a DB war, but which open-source DB is going to work best for me? I am pretty sure SQLite is off the table, but I could be wrong.
I'm still researching this option for one of my own projects, but CouchDB may be worth a look.
Why store the files in the database at all? Simply store your meta-data and a filename. If you need to copy them to a new location for some reason, just do that as a file system copy.
Once you remove the file contents then any competent database will be able to handle the meta-data for a few hundred thousand files.
My preference would be to store the document with the metadata. One reason is relational integrity: you can't easily move or modify the files without the action being brokered by the DB. I am sure I can handle these problems, but it isn't as clean as I would like, and my experience has been that most vendors can handle huge amounts of binary data in the database these days. I guess I was wondering if PostgreSQL or MySQL have any obvious advantages in these areas; I am primarily familiar with Oracle. Anyway, thanks for the response; if the DB knows where the external file is, it will also be easy to bring the file in at a later date if I want. Another aspect of the question was whether either database is easier to work with from Python. I'm assuming that is a wash.
I always hate to answer "don't", but you'd be better off indexing with something like Lucene (PyLucene). That and storing the paths in the database rather than the file contents is almost always recommended.
To add to that, none of those database engines will store LOBs in a separate dataspace (they'll be embedded in the table's data space), so any of those engines should perform nearly equally well (well, except SQLite). You need to move to Informix, DB2, SQL Server, or others to get that kind of binary object handling.
Pretty much any of them would work (even though SQLite wasn't meant to be used in a concurrent multi-user environment, which could be a problem...) since you don't want to index the actual contents of the files.
The only limiting factor is the maximum "packet" size of the given DB (by packet I mean a single query/response). Usually this limit is around 2MB, meaning that your files must be smaller than 2MB. Of course you could increase the limit, but the whole process is rather inefficient, since, for example, to insert a file you would have to:
Read the entire file into memory
Transform the file into a query (which usually means hex-encoding it, thus doubling its size from the start)
Execute the generated query (which in turn means that the database has to parse it)
I would go with a simple DB and store the associated files using a naming convention that makes them easy to find (for example, based on the primary key). Of course this design is not "pure", but it will perform much better and is also easier to use.
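A minimal sketch of that design, with SQLite as the "simple DB" and the primary key as the file name (schema and paths are illustrative):

    import os, sqlite3

    db = sqlite3.connect("files.db")
    db.execute("CREATE TABLE IF NOT EXISTS docs "
               "(id INTEGER PRIMARY KEY, name TEXT, tags TEXT)")

    def add_document(name, tags, data, root="docs"):
        """Insert the metadata row first, then write the blob under its primary key."""
        cur = db.execute("INSERT INTO docs (name, tags) VALUES (?, ?)", (name, tags))
        db.commit()
        os.makedirs(root, exist_ok=True)
        with open(os.path.join(root, str(cur.lastrowid)), "wb") as f:
            f.write(data)
        return cur.lastrowid

Note that the file bytes never pass through a SQL query here, so the 2MB packet limit above never comes into play.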
Why are you wasting time emulating something the filesystem should be able to handle? More storage + grep is your answer.
