I have the same problem somebody described in another post. My application's log files are huge (~1GB), and using grep to correlate information across them is tedious. Right now I use the less tool, but it is also slower than I would like.
I am thinking of speeding up the search. I can see the following ways to do this. First, generate the logs in XML and use an XML search tool. I am not sure how much speedup XML search would give (not much, I suspect, since searching a non-indexed file will still take ages).
Second, use an XML database. This would be better, but I don't have much background here.
Third, use a (non-XML) database. This would be somewhat tedious since a table schema has to be written (does that apply to the second option too?). I also foresee the schema changing a lot at the start to cover common use cases. Ideally, I would like something lighter than a full-fledged database for storing the logs.
Fourth, use Lucene. It seems to fit the purpose, but is there a simple way to specify the indexes for this use case? For example, I want to say "index whenever you see the word 'iteration'".
What is your opinion?
The problem is that using XML will make your log files even bigger.
I would suggest either splitting up your log files by date or by number of lines;
otherwise, use a file-based database engine such as SQLite.
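For what it's worth, here is a minimal sketch of the SQLite route in Python, assuming the log lines can be split into a timestamp, level and message (the file names, regex and column names are invented for illustration):

    import re
    import sqlite3

    # Invented log format: "2009-03-01 12:00:00 INFO iteration 42 started"
    LINE_RE = re.compile(r"^(\S+ \S+) (\w+) (.*)$")

    conn = sqlite3.connect("logs.db")
    conn.execute("CREATE TABLE IF NOT EXISTS log (ts TEXT, level TEXT, message TEXT)")

    with open("application.log", encoding="utf-8", errors="replace") as f:
        rows = (m.groups() for m in (LINE_RE.match(line) for line in f) if m)
        conn.executemany("INSERT INTO log VALUES (?, ?, ?)", rows)

    # One index on the timestamp makes repeated, narrowed searches much faster
    # than running grep or less over the raw gigabyte file each time.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_log_ts ON log (ts)")
    conn.commit()

    for ts, msg in conn.execute(
            "SELECT ts, message FROM log WHERE message LIKE ? ORDER BY ts",
            ("%iteration%",)):
        print(ts, msg)

Loading takes a while once, but every search after that is against indexed, structured data instead of the raw file.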
A gigabyte isn't that big, really. What kind of "correlation" are you trying to do with these log files? I've often found it's simpler to write a custom program (or script) to handle a log file in a particular way than it is to try to come up with a database schema to handle everything you'll ever want to do with it. Of course, if your log files are hard to parse for whatever reason, it may well be worth trying to fix that aspect.
(I agree with kuoson, by the way - XML is almost certainly not the way to go.)
If you can inspect your logs on Windows, or using Wine, LogParser is a great tool for mining data out of logs. It practically lets you run SQL queries on any log, with no need to change any code or log formats, and it can even generate quick HTML or Excel reports.
Also, a few years ago, when XML was all the hype, I was using XML logs and XSLT stylesheets to produce views. It was actually kind of nice, but it used way too much memory and would choke on large files, so you probably DON'T want to use XML.
The trouble with working directly on log files is that each one has to be queried individually; you'll get a much faster response if you create an index of the log files and search/query that instead. Lucene would be my next port of call, then Solr.
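To make the "index first, then query" idea concrete, here is a toy Python sketch of the inverted index that Lucene and Solr maintain for you (they add tokenization, ranking and compressed storage on top); it maps each word to the (file, line number) positions where it occurs, so looking up "iteration" no longer means scanning whole files. The file names are invented.

    import re
    from collections import defaultdict

    def build_index(paths):
        """Map each lower-cased word to a list of (path, line_number) hits."""
        index = defaultdict(list)
        for path in paths:
            with open(path, encoding="utf-8", errors="replace") as f:
                for lineno, line in enumerate(f, 1):
                    for word in set(re.findall(r"\w+", line.lower())):
                        index[word].append((path, lineno))
        return index

    index = build_index(["app1.log", "app2.log"])  # invented file names
    print(index["iteration"][:10])                 # first ten hits for the word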
Maybe you could load your log into Emacs (provided you have sufficient memory) and use the various Emacs features such as incremental search and M-x occur.
Disclaimer: I haven't tried this on files > 100MB.
We're developing a new product at work and it will require a lightweight database. My coworkers and I, however, got into a debate over conventions for database creation. They were of the mindset that we should just build a quick outline of the database and then go in and indiscriminately add and delete tables and such until it looks like what we want. I told them the proper way to do it was to write a script that follows a format similar to this:
Drop database;
Create Tables;
Insert Initial Data;
I said this was better than randomly changing tables. You should only make changes to the script and re-run the script every time you want to update the design of the database. They said it was pointless and that their way was faster (which holds a bit of weight since the database is kind of small, but I still feel it is a bad way of doing things). Their BIGGEST concern with this was that I was dropping the database; they were upset that I was going to delete the random data they had put in there for testing purposes. That's when I clarified that you include inserts as part of the script to act as initial data. They were still unconvinced. They told me that in all of their time with databases they had NEVER heard of such a thing. The truth is we all need more experience with databases, but I am CERTAIN that this is the proper way to develop a script and create a database. Does anyone have any online resources that clearly explain this method and can back me up? If I am wrong about this, then please feel free to correct me.
Well, I don't know the details of your project, but I think it's pretty safe to assume you're right on this one, for a number of very good reasons.
If you don't have a script that dictates how the database is structured, how will you create new instances of it? What happens when you deploy to production, or it gets accidentally deleted, or the server crashes? Having a script means you don't have to remember all the little details of how it was all set up (which is pretty unlikely to work even for small databases).
It's way faster in the long run. I don't know about you, but in my projects I'm constantly bringing new databases online for things like unit testing, new branches, and deployments. If I had to recreate the database by hand every time, it would take forever. Yes, it takes a little extra time to maintain a database script, but it will almost always save you time over the life of the project.
It's not hard to do. I don't know what database you're using, but many of them support exporting your schema as a DDL script. You can just start with that and modify it from then on. No need to type it all up. If your database won't do that, it's worth a quick search to see if a 3rd-party tool that works with your database will do it for you.
Make sure you check your scripts into your source control system. They're just as important as any other part of your source code.
I think having a data seeding script like you mentioned is a good idea. But keep it as a separate script from the database creation script. This way you can have a developer seed script, a unit-testing seed script, a production seed script, etc.
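To make the script-driven approach concrete, here is a minimal sketch using Python and SQLite (the file name, table and seed rows are invented; a real project would keep the schema and one seed script per environment under source control):

    import sqlite3

    SCHEMA_SQL = """
    DROP TABLE IF EXISTS items;
    CREATE TABLE items (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    """

    DEV_SEED_SQL = """
    INSERT INTO items (name) VALUES ('test item A'), ('test item B');
    """

    def rebuild(db_path, schema_sql, seed_sql):
        """Recreate the database from scratch: drop, create tables, insert seed data."""
        conn = sqlite3.connect(db_path)
        conn.executescript(schema_sql)   # drop + create
        conn.executescript(seed_sql)     # environment-specific seed data
        conn.commit()
        conn.close()

    # Re-run whenever the design changes; the test data lives in the seed script,
    # so nothing is lost by dropping and recreating.
    rebuild("product_dev.db", SCHEMA_SQL, DEV_SEED_SQL)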
Recently I came across a scenario in a question:
There are n websites with n pages each and n users visiting the sites. Each user's visit has to be saved, along with the pages he/she has visited (it is not mentioned whether in a database or in log files, so it's up to the developer).
I decided to go ahead and do something with data structures, but when I discussed it with a friend of mine, he said we could save it in a database, and that sounded logically correct too.
So, in general, we have three ways of storing anything: log files, data structures, or a database.
Now I am really confused: when should one go with data structures, databases, or simply log files, not only for this particular scenario but in general?
What's the real difference?
I understand that this question is primarily opinion-based, but I couldn't find a concrete answer while browsing!
Log files are often / usually output-only - these files will rarely, if ever, get read, possibly only read manually. Some types of files may have random access, allowing you to fairly efficiently find a given record by a single index (through binary search), but you can't (easily) have multiple indices on the data in a single file, which is a trivial task for a database. If you just want to log something for manual processing later, a log file can work fine (even if a database can work too).
Databases are the industry standard, in that they provide you with persistence, efficient reading and writing, a standard interface and redundancy (but of course they need to be set up correctly).
A pure data structure solution typically doesn't consider persistent storage, as in making sure your data is kept when the program stops running for some reason. If you do want to write to and read from persistent storage, this will often come with a fair bit of complexity to do efficiently and regularly. And multiple / complex indices are a bit of a hassle to cater for. That's not to say data structures can't be used with persistent storage - databases are built using data structures and some data structures are specifically made for disk reads and writes. But you don't want to be figuring this out on a low level - it's best to just let a database take care of it if you need persistence.
You could also combine data structures and databases, using the database as persistent storage and using the data structure to cache the results, so you only need to do (slower) writes to the database and you can do (faster) reads from the data structure. This is not uncommon in large systems with external databases. Although anything more complex than a standard map data structure is probably overcomplicating your cache and may indicate a bigger problem with your design.
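A rough sketch of that combination, using Python and SQLite purely for illustration (the table and function names are invented): a plain dict caches per-page visit counts read from the database, so repeated reads skip the slower query while every write still goes to persistent storage.

    import sqlite3

    conn = sqlite3.connect("visits.db")
    conn.execute("CREATE TABLE IF NOT EXISTS visits (user TEXT, page TEXT)")

    _count_cache = {}  # page -> visit count; the in-memory data structure

    def record_visit(user, page):
        """Writes always go to the database (persistent, slower)."""
        conn.execute("INSERT INTO visits (user, page) VALUES (?, ?)", (user, page))
        conn.commit()
        _count_cache.pop(page, None)  # invalidate the cached count for this page

    def visit_count(page):
        """Reads are served from the cache when possible (in memory, faster)."""
        if page not in _count_cache:
            row = conn.execute(
                "SELECT COUNT(*) FROM visits WHERE page = ?", (page,)).fetchone()
            _count_cache[page] = row[0]
        return _count_cache[page]

    record_visit("alice", "/home")
    print(visit_count("/home"))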
What you have there sounds like an interview question, for which they may be expecting a data structure solution and simply saying "use a database" may be frowned upon. However, if it's a system design question, you'd almost certainly need to include some sort of a database in your design instead of concerning yourself with data structures.
I work on a project that has very well-defined lines of responsibility. There are about six to ten of us and we currently do all of our work in Excel, building a single spreadsheet with maintenance requirements for ships. A couple of times during the project process we stop all work and compile all of the individual spreadsheets into one spreadsheet. Since each person has a well-defined area, we don't have to worry about one person overwriting another person's work. It only takes an hour, so it isn't that huge of a deal. Less than optimal, sure, but it gets the job done.
But each person fills out their data differently. I think moving to a database would serve us well by making the data more regimented with validation rules. The problem is, we do not have any type of shared drive or database server where we can host the database, and that won't change. I was wondering if there was a simple solution similar to the way we are handling the Excel spreadsheet. I envisioned a process where I would wipe the old data and then import the new data, but I suspect that would bring up other problems.
I am pretty comfortable building small databases and using VBA and whatnot. This project would probably have about six tables, and probably three that would have the majority of the data for any given project (the others would be reference tables and slow-to-change data). Bottom line is, I am wondering if it is worth it, or should I stick with Excel?
Access 2007 onwards has an option for "Collecting email replies" which can organise flat data, but only a single query can be populated, so it might be a bit limiting.
The only solution I can think of that's easier than what you currently use is to create the DB with some VBA modules that export all new/updated data to an XML/CSV file and attach this to an email. You'd then have to create a VBA module that would import the data from these files into the current table.
It's a fair amount of work to get set up, but once working it might be fairly quick and robust.
Edit: just to add, I have solved a similar problem, but I solved it with VB.NET and XML files rather than Access.
You can link Access databases to other databases (or import from them). So you can distribute a template database for users to add records to and then email back. When you receive them back, you would either import them into or link them to a master database and do whatever you need to do with the combined data.
I would like to store data persistently for my application, but I don't really need a full-blown relational database. I really could get by with a basic "cache"-like persistent store where the structure is just (key, value) pairs.
In lieu of a database what are my best, scalable options?
There's always SQLite, a database that's stored in a file. SQLite already has built-in concurrency, so you don't have to worry about things like file locking, and it's really fast for reads.
If, however, you are doing lots of database changes, it's best to do them all at once inside a transaction. This will only write the changes to the file once, as opposed to every time a change query is issued. This dramatically increases the speed of doing multiple changes.
When a change query is issued, whether it's inside a transaction or not, the whole database is locked until that query finishes. This means that extremely large transactions could adversely affect the performance of other processes, because they must wait for the transaction to finish before they can access the database. In practice, I haven't found this to be that noticeable, but it's always good practice to try to minimize the number of database-modifying queries you issue.
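As a small illustration, here is a sketch using Python's sqlite3 module (the table and file names are made up) of batching many changes into one transaction instead of committing each statement separately:

    import sqlite3

    conn = sqlite3.connect("cache.db")
    conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")

    items = [("key%d" % i, "value%d" % i) for i in range(10000)]

    # Slow: committing after every statement writes the changes to the file each time.
    # for k, v in items:
    #     conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (k, v))
    #     conn.commit()

    # Fast: one transaction, so the changes hit the file once.
    with conn:  # opens a transaction and commits (or rolls back) at the end
        conn.executemany("INSERT OR REPLACE INTO kv VALUES (?, ?)", items)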
If you want a 'persistent cache' and already use memcached, check out MemcacheDB. It's a persistent hash table that speaks the memcached protocol, so there's no need for a new client (but there is a new daemon).
I asked a similar question recently. Here are some choices:
Berkeley DB is groovy (also see the Berkeley DB Wikipedia entry).
If you are on Windows, then you can use the built-in Extensible Storage Engine. This used to be called "Jet Blue".
Microsoft SQL Server Compact Edition is also free (but not built into the OS).
If you need scalability, then an RDBMS is your best bet. At the most basic level, you can serialize data structures into files - however, you'd then need to account for file-locking issues, which would limit concurrency.
SQLite is a file-based SQL database engine that can run without a persistent database daemon (in PHP, for example, it runs as an extension); however, it too has concurrency issues (read the answers to this question, which could help you decide if SQLite is right for you).
Unless you have a really good reason not to use a real RDBMS, I'd suggest you stick with MySQL or other "full-blown" engines.
If you want something really scalable, I wouldn't opt for a flat or XML file. As your data grows, it could kill your performance.
If you will have a lot of data at some stage, I would still opt for a database - I would take a look at something like SQLite with a very simple schema to suit your needs.
I'm really not sure you should, but have you considered just storing the info in an XML doc if it really is that light? And if it's not, have you considered SQLite?
If you're writing a Java program that wants an embedded database, look at HSQLDB, since it is written in Java and works much better than SQLite when called from Java programs.
If you are writing Java, then there are Java database implementations (Jared mentions HSQLDB; there are others) that you can include directly.
SQLite is fine for static inclusion; however, you can also include MySQL within your application if you're using a compatible language such as C.
I think you would appreciate having SQL available as well. XML files just don't cut it any longer - maybe a couple of years ago when writing PDA software, but even the iPhone and Android include SQLite now.
For key=value pairs, you can use the INI file format, with simple load and save procedures that read it into an in-memory hash table and write it back out.
That can later scale up to anything, just by changing the load and save procedures to work with a database.
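A minimal sketch of that in Python, using the standard-library configparser (the file name and section name are arbitrary):

    import configparser

    INI_PATH = "settings.ini"
    SECTION = "data"

    def load():
        """Read the INI file into a plain in-memory hash table (dict)."""
        parser = configparser.ConfigParser()
        parser.read(INI_PATH)
        return dict(parser[SECTION]) if parser.has_section(SECTION) else {}

    def save(table):
        """Write the hash table back out as key = value lines."""
        parser = configparser.ConfigParser()
        parser[SECTION] = {k: str(v) for k, v in table.items()}
        with open(INI_PATH, "w") as f:
            parser.write(f)

    data = load()
    data["last_run"] = "2009-01-01"
    save(data)

Swapping load and save for database-backed versions later leaves the rest of the program untouched, which is the scaling point being made above.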
You could try CouchDB, a very flexible document-oriented database that doesn't force you to define a schema up-front. It was written in Erlang and, thanks to this, it is considered a very scalable solution. It can be easily queried through a REST interface.
HSQLDB in in-memory mode will give you much better performance than flat-file-based databases. It's pretty easy to use, too, and if a table gets too big, there is an option to cache it on disk. Check out this performance comparison:
(performance comparison chart; source: hsqldb.org)
I have a file containing 250 million website URLs, each with an IP address, page title, country name, server banner (e.g. "Apache"), response time (in ms), number of images and so on. At the moment, these records are in a 25GB flat file.
I'm interested in generating various statistics from this file, such as:
number of IP addresses represented per country
average response time per country
number of images vs. response time
etc etc.
My question is, how would you achieve this type and scale of processing, and what platform and tools would you use (in a reasonable time)?
I am open to all suggestions, from MS SQL on Windows to Ruby on Solaris - all suggestions :-) Bonus points for DRY (don't repeat yourself); I'd prefer not to write a new program each time a different cut is required.
Any comments on what works and what's to be avoided would be greatly appreciated.
Step 1: get the data into a DBMS that can handle the volume of data. Index appropriately.
Step 2: use SQL queries to determine the values of interest.
You'll still need to write a new query for each separate question you want answered. However, I think that is unavoidable. It should save you replicating the rest of the work.
Edited:
Note that although you probably can do a simple upload into a single table, you might well get better performance out of the queries if you normalize the data after loading it into the single table. This isn't completely trivial, but will likely reduce the volume of data. Making sure you have a good procedure (which will probably not be a stored procedure) for normalizing the data will help.
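To give a flavour of step 2, here is a sketch of the kinds of queries involved, run through Python's sqlite3 against a hypothetical sites table with ip, country, response_ms and image_count columns (any mainstream DBMS would accept essentially the same SQL):

    import sqlite3

    conn = sqlite3.connect("crawl.db")  # assumes the flat file has already been loaded

    # Number of distinct IP addresses per country.
    for country, ips in conn.execute(
            "SELECT country, COUNT(DISTINCT ip) FROM sites GROUP BY country"):
        print(country, ips)

    # Average response time per country.
    for country, avg_ms in conn.execute(
            "SELECT country, AVG(response_ms) FROM sites GROUP BY country"):
        print(country, round(avg_ms, 1))

    # Number of images versus response time (grouped by image count).
    for images, avg_ms in conn.execute(
            "SELECT image_count, AVG(response_ms) FROM sites GROUP BY image_count"):
        print(images, round(avg_ms, 1))

Each new "cut" is then just another query, not another program.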
Load the data into a table in a SQL Server (or any other mainstream db) database, and then write queries to generate the statistics you need. You would not need any tools other than the database itself and whatever UI is used to interact with the data (e.g. SQL Server Management Studio for SQL Server, TOAD or SqlDeveloper for Oracle, etc.).
If you happen to use Windows, take a look at Log Parser. It can be found as a standalone download and is also included as part of the IIS Resource Kit.
Log Parser can read your logs and load them into the database.
Database Considerations:
For your database server you will want something that is fast (Microsoft SQL Server, IBM's DB2, PostgreSQL or Oracle). MySQL might be useful too, but I have no experience with large databases on it.
You will want all the memory you can afford. If you will be using the database with regularity, I'd say 4 GB at least. It can be done with less, but you WILL notice a big difference in performance.
Also, go for multicore/multi-CPU servers if you can afford it and, again, if you will be using this database with regularity.
Another recommendation is to analyze the kind of queries you will be doing and plan the indexes accordingly. Remember: every index you create will require additional storage space.
Of course, turn off the indexing, or even drop the indexes, before massive data load operations. That will make the load much faster. Re-index or re-create the indexes after the data load operation.
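A sketch of that drop-and-recreate pattern, again using Python and SQLite with invented table/index/file names purely for illustration (the same idea applies to any of the servers above, assuming the sites table already exists):

    import csv
    import sqlite3

    conn = sqlite3.connect("crawl.db")

    # 1. Drop the index so the bulk load doesn't have to maintain it row by row.
    conn.execute("DROP INDEX IF EXISTS idx_sites_country")

    # 2. Bulk load inside a single transaction.
    with conn, open("records.csv", newline="") as f:
        conn.executemany(
            "INSERT INTO sites (url, ip, country, response_ms) VALUES (?, ?, ?, ?)",
            csv.reader(f))

    # 3. Re-create the index once, after the data is in place.
    conn.execute("CREATE INDEX idx_sites_country ON sites (country)")
    conn.commit()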
Now, if this database will be an ongoing operation (i.e. it is not just to investigate/analyze something and then discard it), you may want to design a database schema with catalog and detail tables. This is called database normalization, and the exact amount of normalization you will want depends on the usage pattern (data-load operations versus query operations). An experienced DBA is a must if this database will be used on an ongoing basis and has performance requirements.
P.S.
I will take the risk of including something obvious here, but...
I think you may be interested in a log analyzer. These are programs that generate statistics from web server log files (some can also analyze FTP, SFTP and mail server log files).
Web log analyzers generate reports with the statistics. Usually the reports are generated as HTML files and include graphics. There is a fair variety in depth of analysis and options. Some are very customizable and some are not. You will find both commercial products and open-source ones.
For the amount of data you will be managing, double-check each candidate product and take a closer look at its speed and ability to handle it.
One thing to keep in mind when you're importing the data is to try to create indexes that will allow you to do the kinds of queries you want to do. Think about what sort of fields you will be querying on and what those queries might look like. That should help you decide what indexing you will need.
25GB of flat file. I don't think writing any component on your own to read this file will be a good idea.
I would suggest that you go for a SQL import and load all the data into SQL Server. I agree that it would take ages to get this data into SQL Server, but once it is there you can do anything you want with it.
I would hope that once you put this data in the DB, from then on you will only receive deltas of information, not another 25 GB flat file.
You haven't said how the data in your flat file is organised. The RDBMS suggestions are sensible, but presume that your flat file is formatted in some delimited way and a db import is a relatively simple task. If that is not the case then you first have the daunting task of decompiling the data cleanly into a set of fields on which you can do your analysis.
I'm going to presume that your data is not a nice CSV or TXT file, since you haven't said either way and nobody else has answered this part of the possible problem.
If the data have a regular structure, even without nice clean field delimiters, you may be able to turn an ETL tool, such as Informatica, onto the job. Since you are a techy and this is a one-off job, you should definitely consider writing some code of your own which does some regex comparisons to extract the parts that you want and spits out a file which you can then load into a database. Either way you are going to have to invest some significant effort in parsing and cleansing your data, so don't think of this as an easy task.
If you do write your own code then I would suggest you choose a compiled language and make sure you process the data a single row at a time (or in a way that buffers the reads into manageable chunks).
Either way you are going to have a pretty big job making sure that the results of any process you apply to the data have been consistently executed; you don't want IP addresses turning up as decimal numbers in your calculations. On data of that scale it can be hard to detect a fault like that.
Once you have parsed it then I think that an RDBMS is the right choice to store and analyse your data.
Is this a one-time thing, or will you be processing things on a daily or weekly basis? Either way, check out vmarquez's answer - I've heard great things about LogParser. Also check out http://awstats.sourceforge.net/ - it's a full-fledged web stats application.
SQL Server Analysis Services is designed for doing exactly that type of data analysis. The learning curve is a bit steep, but once you set up your schema you will be able to do any kind of cross-cutting queries that you want very quickly.
If you have more than one computer at your disposal, this is a perfect job for MapReduce.
Sounds like a job for Perl to me. Just keep counts of the stats you want. Use a regex to parse each line. It would probably take less than 10 minutes to parse a file of that size. My computer reads through a 2 GB file (13 million lines) in about 45 seconds with Perl.
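In the same spirit, here is a rough one-pass sketch, written in Python rather than Perl purely for illustration; the tab-delimited layout and field positions are invented, since the question doesn't say how the flat file is formatted:

    import re
    from collections import Counter, defaultdict

    ips_per_country = defaultdict(set)
    response_totals = Counter()
    response_counts = Counter()

    # Invented layout: url, ip, title, country, banner, response_ms, images (tab-separated)
    LINE_RE = re.compile(r"^\S+\t(\S+)\t[^\t]*\t([^\t]+)\t[^\t]*\t(\d+)\t")

    with open("crawl.txt", encoding="utf-8", errors="replace") as f:
        for line in f:  # one row at a time, so the 25GB file is never held in memory
            m = LINE_RE.match(line)
            if not m:
                continue
            ip, country, ms = m.groups()
            ips_per_country[country].add(ip)
            response_totals[country] += int(ms)
            response_counts[country] += 1

    for country in sorted(ips_per_country):
        avg_ms = response_totals[country] / response_counts[country]
        print(country, len(ips_per_country[country]), round(avg_ms, 1))

The drawback, as noted in other answers, is that each new cut of the data means another script, which is where loading everything into a database wins on the DRY front.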