I've heard of the H2 database, but how does it work? I'm now scraping data from many websites and don't want to visit the same URL that I've already scraped. Is it possible for me to use the H2 database in that case?
H2 is an open-source SQL database written in Java. It can run in both embedded and server mode, and it is widely used as an in-memory database.
An in-memory database relies on system memory, as opposed to disk space, for storing data, because memory access is faster than disk access. We use an in-memory database when we do not need to persist the data. In-memory databases are volatile by default, and all stored data is lost when we restart the application.
If you want to persist the data in the H2 database, you should store it in a file. To do that, change the datasource URL property in your application properties file so that it points to a file-based database (a `jdbc:h2:file:` URL) instead of an in-memory one (a `jdbc:h2:mem:` URL).
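For the scraping case in the question, here is a minimal sketch of the idea: keep every visited URL in a table with a primary key and skip any URL that is already there. It uses Python's built-in sqlite3 module as a stand-in for H2 (H2 itself is normally accessed from Java over JDBC), and the names `open_store`, `mark_visited` and the `visited` table are made up for illustration. Passing ":memory:" gives a volatile in-memory database, while a file path keeps the data across restarts, which mirrors the in-memory versus file choice described above.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the URL store; ":memory:" is volatile, a file path persists."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)")
    return conn


def mark_visited(conn, url):
    """Record url; return True if it was new, False if already scraped."""
    cur = conn.execute("INSERT OR IGNORE INTO visited (url) VALUES (?)", (url,))
    conn.commit()
    return cur.rowcount == 1


conn = open_store()  # assumption: use open_store("visited.db") to persist
results = [
    mark_visited(conn, url)
    for url in [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/a",  # duplicate, should be skipped
    ]
]
print(results)
```

In a scraper you would call `mark_visited` before fetching a page and only fetch when it returns True.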
Learn more about H2 here
Related
I need to design a scalable database architecture to store all the data coming from flat files (CSV, HTML, etc.). These files come from Elasticsearch, and most of the scripts are written in Python. The architecture should automate most of the daily manual processing currently done with Excel, CSV and HTML files, and all data should be retrieved from this database instead of being populated into CSV and HTML files.
Database requirements:
The database must perform well for day-to-day data retrieval, and it will be queried by multiple teams.
An ER model and schema will be developed for the data, with logical relationships.
The database can be hosted in the cloud.
The database must be highly available and should be able to retrieve data quickly.
This database will be utilized to create multiple dashboards.
The ETL jobs will be responsible for storing data in the database.
There will be many reads from the database and multiple writes each day, with lots of data coming from Elasticsearch and some of the cloud tools.
I am considering RDS, Azure SQL, DynamoDB, Postgres, or Google Cloud. I would like to know which database engine would be the better solution given these requirements. I also want to know how the ETL process should be designed: lambda or kappa architecture.
To store tabular data such as CSV and Excel files, you can use a relational database. For flat files like HTML, which do not need to be queried, you can simply use a storage account with any cloud service provider, for example Azure.
Azure SQL Database is a fully managed platform as a service (PaaS) database engine that handles most of the database management functions such as upgrading, patching, backups, and monitoring without user involvement. Azure SQL Database always runs on the latest stable version of the SQL Server database engine and a patched OS, with 99.99% availability. You can restore the database to an earlier point in time. This should be the best choice for storing relational data and running SQL queries.
Azure Blob Storage is Microsoft's object storage solution for the cloud. Blob storage is optimized for storing massive amounts of unstructured data. Your HTML files can be stored here.
The ETL jobs can be performed using Azure Data Factory (ADF). It lets you connect to almost any data source (including ones outside Azure), transform the stored dataset, and store it in the desired destination. Data flow transformations in ADF are capable of performing all the ETL-related tasks.
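To make the extract, transform, load shape concrete, here is a small sketch in plain Python. It is not ADF and does not use any Azure API; it only illustrates the three stages with the standard library, using sqlite3 as a stand-in for the relational target. The sample CSV, the `checks` table and the function names are all invented for the example.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file input; real data would come from exported files.
RAW_CSV = """host,status,latency_ms
web-1, ok ,120
web-2,FAIL,
web-3,ok,95
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, normalise case, cast numbers."""
    cleaned = []
    for row in rows:
        latency = row["latency_ms"].strip()
        cleaned.append((
            row["host"].strip(),
            row["status"].strip().lower(),
            int(latency) if latency else None,  # empty value becomes NULL
        ))
    return cleaned


def load(conn, records):
    """Load: write the cleaned records into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checks "
        "(host TEXT, status TEXT, latency_ms INTEGER)"
    )
    conn.executemany("INSERT INTO checks VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
ok_hosts = [
    row[0]
    for row in conn.execute(
        "SELECT host FROM checks WHERE status = 'ok' ORDER BY host"
    )
]
print(ok_hosts)
```

Once the data is in a table, dashboards and other teams can query it with SQL instead of re-reading the flat files.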
I want to build a program that needs a database. Is it possible to use a database without requiring a pre-installed program or internet access on the client computer?
The only way to use a database without requiring internet access is if the database is on the same computer.
You can bundle a database with your application but then you wouldn't have a central database for everyone to update. If that fits your requirements, great. Otherwise you are out of luck.
Regardless, you will need some program to perform persistent storage. While XML is one option, if you want any database-like behavior, search the internet for open source databases you could use if you don't want to pay for Oracle or SQL Server.
I want to get an idea on how to achieve this;
I have an application that runs at 5 different geographical locations, e.g. Texas, NY, California, Boston, Washington.
This application saves data to a local database, which is located at that location.
I want to do data warehousing. Is it a must to have just one database (where all 5 applications save their data in a single database, without local DBs)?
Or is it possible to keep the 5 local databases and do data warehousing by retrieving data from those local DBs into a central DB and then performing data warehousing there?
Please give me your thoughts and references.
You have three options for this:
You use a single, centrally hosted database server. Typical relational database servers can be accessed directly over the network these days: MySQL, PostgreSQL, Oracle, ... This means you can implement an application that opens a network connection to the database server and uses that remote server to store and retrieve data as required. Multiple connections are possible at the same time.
You use a single, central database server but put a wrapper around it, that is, a small network application layer acting as a broker. This way you can address the central instance over the network via standard protocols such as HTTP.
You use a decentralized approach and install a database instance at each location. Then you need an additional tool to perform synchronization. Such tools exist for most modern database servers (see above), but the setup is not trivial.
If in doubt, and if the load is not that high, go with the first alternative.
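As a rough illustration of the third option, the sketch below copies rows from several local databases into one central table, tagging each row with its site so that identifiers from different locations do not collide. It uses in-memory sqlite3 databases as stand-ins for the real servers; the `sales` and `sales_all` tables, the sample amounts and the function names are assumptions made up for the example, and real synchronization tools handle far more (deletes, conflicts, incremental loads).

```python
import sqlite3

LOCATIONS = ["Texas", "NY", "California", "Boston", "Washington"]


def make_local_db(amounts):
    """Create a stand-in local database holding a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany(
        "INSERT INTO sales (amount) VALUES (?)", [(a,) for a in amounts]
    )
    conn.commit()
    return conn


def consolidate(central, locals_by_site):
    """Copy every local row into the central table, keyed by (site, id)."""
    central.execute(
        "CREATE TABLE IF NOT EXISTS sales_all ("
        "site TEXT, local_id INTEGER, amount REAL, "
        "PRIMARY KEY (site, local_id))"
    )
    for site, local in locals_by_site.items():
        rows = local.execute("SELECT id, amount FROM sales").fetchall()
        central.executemany(
            "INSERT OR REPLACE INTO sales_all VALUES (?, ?, ?)",
            [(site, row_id, amount) for row_id, amount in rows],
        )
    central.commit()


local_dbs = {
    site: make_local_db([10.0 * (i + 1), 5.0])
    for i, site in enumerate(LOCATIONS)
}
central = sqlite3.connect(":memory:")
consolidate(central, local_dbs)
consolidate(central, local_dbs)  # re-running does not duplicate rows
total_rows = central.execute("SELECT COUNT(*) FROM sales_all").fetchone()[0]
print(total_rows)
```

The composite key (site, local_id) is what makes the load repeatable: a second run replaces rows instead of adding new ones.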
What would be the best way to insert metadata into a database when that metadata needs to be logically connected to files stored locally on a web server?
In general, databases control their own storage. The proper procedure is to load data into tables in the database. This is important, because databases manage storage and memory. In a typical configuration, you don't want to be accessing files being updated by another application. And, you typically don't want to be storing database data over the network.
The general answer to the question is that you want to load data into the database.
That said, many database engines allow you to remotely access data in other databases or through a technology such as ODBC. You can get drivers for flat files, even those stored remotely on the network. However, this is not an optimal setup for querying. Alternatively, databases can be used to manage metadata for remote files, such as image files stored on disk. The purpose is to allow searches through the metadata which, in essence, retrieve file names that are then resolved (either on the client side or server side, depending on the architecture).
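The metadata approach can be sketched as follows: the files stay on disk, the database holds one row per file, and a search over the metadata returns paths that the caller then opens. This is only an illustration with sqlite3 and a temporary directory; the `file_meta` table, its columns, the `tag` field and the helper names are hypothetical.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def index_file(conn, path, tag):
    """Store metadata about a file on disk; the content stays in the file."""
    data = path.read_bytes()
    conn.execute(
        "INSERT OR REPLACE INTO file_meta (path, size_bytes, sha256, tag) "
        "VALUES (?, ?, ?, ?)",
        (str(path), len(data), hashlib.sha256(data).hexdigest(), tag),
    )
    conn.commit()


def find_by_tag(conn, tag):
    """Search the metadata and return the matching file paths."""
    rows = conn.execute(
        "SELECT path FROM file_meta WHERE tag = ? ORDER BY path", (tag,)
    )
    return [Path(row[0]) for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_meta "
    "(path TEXT PRIMARY KEY, size_bytes INTEGER, sha256 TEXT, tag TEXT)"
)

# Create two sample files in a temporary directory (stand-in for the
# web server's local storage).
root = Path(tempfile.mkdtemp())
(root / "logo.png").write_bytes(b"fake image bytes")
(root / "notes.txt").write_bytes(b"hello")
index_file(conn, root / "logo.png", "image")
index_file(conn, root / "notes.txt", "text")

matches = find_by_tag(conn, "image")
contents = matches[0].read_bytes()  # resolve the path returned by the query
print([p.name for p in matches])
```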
You should, perhaps, ask another question with a lot more detail about what you are trying to accomplish and about which database you are using.
I have a standard WinForms application that connects to a SQL Server. The application allows users to upload documents which are currently stored in the database, in a table using an image column.
I need to change this approach so the documents are stored as files and a link to the file is stored in the database table.
With the current approach, users are shielded from how a document is stored when they upload it. Because they have a connection to the database, they do not need to know anything about where the files are stored, and no special directory permissions are required. If I set up a network share for the documents, I want to avoid any IT issues such as users needing access to that directory in order to upload or open existing documents.
What are the options available to do this? I thought of having a temporary database where the documents are uploaded to in the same way as the current approach and then a process running on the server to save these to the file store. This database could then be deleted and recreated to reclaim any space. Are there any better approaches?
ADDITIONAL INFO: There is no web server element to my application, so I do not think a WCF service is possible.
Is there a reason why you want to get the files out of the database in the first place?
How about still saving them in SQL Server, but using a FILESTREAM column instead of IMAGE?
Quote from the link:
FILESTREAM enables SQL Server-based applications to store unstructured
data, such as documents and images, on the file system. Applications
can leverage the rich streaming APIs and performance of the file
system and at the same time maintain transactional consistency between
the unstructured data and corresponding structured data.
FILESTREAM integrates the SQL Server Database Engine with an NTFS file
system by storing varbinary(max) binary large object (BLOB) data as
files on the file system. Transact-SQL statements can insert, update,
query, search, and back up FILESTREAM data. Win32 file system
interfaces provide streaming access to the data.
FILESTREAM uses the NT system cache for caching file data. This helps
reduce any effect that FILESTREAM data might have on Database Engine
performance. The SQL Server buffer pool is not used; therefore, this
memory is available for query processing.
So you would get the best out of both worlds:
The files would be stored as files on the hard disk (probably faster than storing them in the database), but you don't have to care about file shares, permissions etc.
Note that you need at least SQL Server 2008 to use FILESTREAM.
I can tell you how I implemented this task. I wrote a WCF service which is used to send archived files. So, if I were you, I would create such a service, which should be able to save files and send them back. This is easy, but you must also make sure that the user account the WCF service runs under has permission to read and write the files.
You could just have your application pass the object to a procedure (CLR maybe) in the database, which then writes the data out to the location of your choosing without storing the file contents. That way you still have a layer of abstraction between the file store and the application, but you don't need a process that cleans up after you.
Alternatively, a WCF/web service could be created which the application connects to. A web method could accept the file contents and write them to the correct place, then return the path to the file or some file identifier.
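To show what the server-side part of such a service might do, here is a minimal sketch in Python rather than WCF, with the network layer left out entirely. It writes the uploaded bytes to a directory that only the service needs access to, records the link in a table, and hands back an identifier. The `DocumentStore` class, the `documents` table and the sample file name are hypothetical, and sqlite3 stands in for SQL Server.

```python
import sqlite3
import tempfile
import uuid
from pathlib import Path


class DocumentStore:
    """Saves file contents to disk and keeps only a link in the database."""

    def __init__(self, conn, root):
        self.conn = conn
        self.root = Path(root)  # only the service account needs access here
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS documents "
            "(doc_id TEXT PRIMARY KEY, name TEXT, path TEXT)"
        )

    def save(self, name, data):
        """Write the bytes to the file store and return a file identifier."""
        doc_id = uuid.uuid4().hex
        path = self.root / doc_id
        path.write_bytes(data)
        self.conn.execute(
            "INSERT INTO documents VALUES (?, ?, ?)", (doc_id, name, str(path))
        )
        self.conn.commit()
        return doc_id

    def load(self, doc_id):
        """Look up the link and return the original name and contents."""
        row = self.conn.execute(
            "SELECT name, path FROM documents WHERE doc_id = ?", (doc_id,)
        ).fetchone()
        if row is None:
            raise KeyError(doc_id)
        return row[0], Path(row[1]).read_bytes()


store = DocumentStore(sqlite3.connect(":memory:"), tempfile.mkdtemp())
doc_id = store.save("report.pdf", b"%PDF-1.4 example")
name, data = store.load(doc_id)
print(name, len(data))
```

The client only ever sees the identifier, so users never need permissions on the file share itself.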