Reactive Indexing from SQL Server to Elasticsearch with Spring

I want to use reactive Spring to index data from a SQL Server database (using R2DBC) into Elasticsearch (using the reactive Elasticsearch client). I have an entity class corresponding to the table in SQL Server and a model class corresponding to the documents that will be indexed by Elasticsearch.
For both databases (SQL Server & Elasticsearch), I have created repositories:
Repository for SQL Server:
@Repository
public interface ProductRepository extends ReactiveCrudRepository<ProductTbl, Long> {
}
Repository for Elasticsearch:
@Repository
public interface ProductElasticRepository extends ReactiveElasticsearchRepository<Product, String> {
}
Shouldn't I be able to index all documents by calling productElasticRepository.saveAll(productRepository.findAll())?
It doesn't quite work: either it exceeds the DataBufferLimit and throws an exception, or there is a ReadTimeoutException. When executing, I can see R2DBC creating RowTokens for each row of the SQL Server table, but the POST to Elasticsearch only happens once all rows have been fetched, which doesn't seem to be how reactive code should work. I am definitely missing something here and hope someone can help.

I cannot figure out what exactly is the problem in your case, but I did something similar in the past and remember a few problems I encountered.
It depends on how much data you need to migrate. If there are millions of rows, reading them all in one go will definitely fail. In that case you can use a window of, say, 5000 rows: read one window, write it to Elasticsearch with a bulk insert, and repeat until there are no more rows to read (see the sketch below).
Another problem I encountered is that the Elasticsearch WebClient wasn't configured to support the amount of data I was sending in the request body.
Another: Elasticsearch has a queue capacity of 200 by default; if you exceed that, you will get an error. This happens if you somehow try to insert the data in parallel.
Another: the connection to the relational database will be interrupted at some point if it is kept open for a very long time.
Remember that Elasticsearch is not reactive by default. There is a reactive driver at this point, but it is not official.
Another: when doing the migration, try to write to a single node with not too many shards.
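A minimal sketch of that windowed approach using Reactor (not the asker's original code) could look like the following. It assumes the two repositories from the question plus a hypothetical toDocument method mapping ProductTbl to Product, and uses concatMap so only one bulk write is in flight at a time:
productRepository.findAll()                                    // Flux<ProductTbl> streamed via R2DBC
        .map(this::toDocument)                                 // ProductTbl -> Product, hypothetical mapper
        .buffer(5000)                                          // collect windows of 5000 documents
        .concatMap(batch -> productElasticRepository.saveAll(batch)) // one bulk save at a time, no parallel inserts
        .then()
        .subscribe();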

You should do something like:
productRepository.findAll()
        .buffer(1000)
        .onBackpressureBuffer()
        .flatMap(list -> productElasticRepository.saveAll(list));
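Note that flatMap subscribes to several saveAll calls concurrently; if that runs into the Elasticsearch queue-capacity limit mentioned above, concatMap (or flatMap with a small concurrency argument) keeps the bulk writes sequential.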
Also, if you're getting a ReadTimeoutException, increase the socket timeout:
spring:
  data:
    elasticsearch:
      client:
        reactive:
          socket-timeout: 5m
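Depending on your Spring Data Elasticsearch version, the reactive client may instead be configured programmatically. A sketch of the equivalent setting for Spring Data Elasticsearch 4.x follows; the host and port are assumptions:
import java.time.Duration;

import org.springframework.context.annotation.Configuration;
import org.springframework.data.elasticsearch.client.ClientConfiguration;
import org.springframework.data.elasticsearch.client.reactive.ReactiveElasticsearchClient;
import org.springframework.data.elasticsearch.client.reactive.ReactiveRestClients;
import org.springframework.data.elasticsearch.config.AbstractReactiveElasticsearchConfiguration;

// Sketch only: programmatic equivalent of the socket-timeout property above,
// for Spring Data Elasticsearch 4.x. Host and port are assumptions.
@Configuration
public class ReactiveElasticsearchConfig extends AbstractReactiveElasticsearchConfiguration {

    @Override
    public ReactiveElasticsearchClient reactiveElasticsearchClient() {
        ClientConfiguration clientConfiguration = ClientConfiguration.builder()
                .connectedTo("localhost:9200")              // assumed host:port
                .withSocketTimeout(Duration.ofMinutes(5))   // mirrors socket-timeout: 5m
                .build();
        return ReactiveRestClients.create(clientConfiguration);
    }
}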

Related

Can I use SignalR with a SQL Server backplane on an existing Entity Framework Code First database?

According to Scaleout with SQL Server, you can use SignalR.SqlServer to keep SignalR synced in a load-balanced setup. I have an existing MVC 5 website with its own database, created using Entity Framework Code First. The article seems to use a dedicated database with Service Broker enabled and says not to modify the database.
So do I need a separate database for this? I know Entity Framework can be picky if the database schema doesn't match, and I worry that if I use the SignalR SQL Server package with the existing database, the tables it creates will cause a context-changed error.
Also, can someone provide more information about using the Microsoft.AspNet.SignalR.SqlServer package? The article I linked doesn't give a ton of detail, and I don't know whether I need to change anything in my hubs and groups or whether it is all handled automatically.
You should be able to, though you'd likely want to keep your Entity Framework definitions separate from SignalR. You can either put SignalR in a separate database or give each of the two its own schema.
In terms of configuration, you'll need to make an addition to the Startup class of your web project:
public class Startup
{
    public void Configuration(IAppBuilder app)
    {
        var sqlConnectionString = "connection string here";
        GlobalHost.DependencyResolver.UseSqlServer(sqlConnectionString);

        this.ConfigureAuth(app);
        app.MapSignalR();
    }
}

Accessing a non JDBC DB using mapreduce

I have a database which is not JDBC-enabled, but against which I can fire a query and get the result via an input stream. I want to access it from a MapReduce program.
For a JDBC-enabled database, Hadoop provides DBInputFormat.java and DBConfiguration.java, which take care of accessing the database and delivering the result in a user-defined class that implements the DBWritable and Writable interfaces.
Is there a way to access the above-mentioned non-JDBC database in the same fashion?
I am not sure whether your DB supports ODBC. If it does, you can try the JDBC-ODBC bridge driver with DBInputFormat; I am not sure this works, as I have never tried it.
Another option, which should be your last resort, is to implement your own InputFormat (see the sketch below).
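A bare-bones sketch of such a custom InputFormat follows, assuming the whole query result is read through a single stream (so one split) and using a hypothetical openQueryStream helper for the proprietary client. This illustrates the shape only, not a drop-in implementation:
import java.io.BufferedReader;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Sketch only: a minimal custom InputFormat for a database that is queried
// through an InputStream rather than JDBC. One split for the whole result.
public class QueryStreamInputFormat extends InputFormat<LongWritable, Text> {

    // Trivial split: no length or location information, nothing to serialize.
    public static class QuerySplit extends InputSplit implements Writable {
        @Override public long getLength() { return 0; }
        @Override public String[] getLocations() { return new String[0]; }
        @Override public void write(DataOutput out) { }
        @Override public void readFields(DataInput in) { }
    }

    @Override
    public List<InputSplit> getSplits(JobContext context) {
        return Collections.<InputSplit>singletonList(new QuerySplit());
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new RecordReader<LongWritable, Text>() {
            private BufferedReader reader;
            private long lineNo = -1;
            private String line;

            @Override
            public void initialize(InputSplit split, TaskAttemptContext ctx) throws IOException {
                // openQueryStream is a placeholder for the DB's proprietary client call.
                InputStream resultStream = openQueryStream(ctx.getConfiguration());
                reader = new BufferedReader(new InputStreamReader(resultStream));
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                line = reader.readLine();
                if (line == null) {
                    return false;
                }
                lineNo++;
                return true;
            }

            @Override public LongWritable getCurrentKey() { return new LongWritable(lineNo); }
            @Override public Text getCurrentValue() { return new Text(line); }
            @Override public float getProgress() { return 0.0f; }
            @Override public void close() throws IOException { reader.close(); }
        };
    }

    // Placeholder for the non-JDBC client; replace with the real query call.
    private static InputStream openQueryStream(Configuration conf) {
        throw new UnsupportedOperationException("wire in the proprietary DB client here");
    }
}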

Scrapy / Python and SQL Server

Is it possible to take data scraped from websites using Scrapy and save it in a Microsoft SQL Server database?
If yes, are there any examples of this being done? Is it mainly a Python issue, i.e. if I find some Python code that saves to a SQL Server database, will Scrapy be able to do the same?
Yes, but you'd have to write the code to do it yourself since scrapy does not provide an item pipeline that writes to a database.
Have a read of the Item Pipeline page from the scrapy documentation which describes the process in more detail (here's a JSONWriterPipeline as an example). Basically, find some code that writes to a SQL Server database (using something like PyODBC) and you should be able to adapt that to create a custom item pipeline that outputs items directly to a SQL Server database.
Super late and complete self-promotion here, but I think this could help someone. I just wrote a little Scrapy extension, scrapy-sqlitem, to save scraped items to a database.
It is super easy to use.
pip install scrapy_sqlitem
Define Scrapy items using SQLAlchemy tables:
from sqlalchemy import Table, Column, Integer, String, MetaData
from scrapy_sqlitem import SqlItem

metadata = MetaData()

class MyItem(SqlItem):
    sqlmodel = Table('mytable', metadata,
                     Column('id', Integer, primary_key=True),
                     Column('name', String, nullable=False))
Add the following pipeline:
from sqlalchemy import create_engine

class CommitSqlPipeline(object):

    def __init__(self):
        self.engine = create_engine("sqlite:///")

    def process_item(self, item, spider):
        item.commit_item(engine=self.engine)
        return item
Don't forget to add the pipeline to the settings file and to create the database tables if they do not exist.
http://doc.scrapy.org/en/1.0/topics/item-pipeline.html#activating-an-item-pipeline-component
http://docs.sqlalchemy.org/en/rel_1_1/core/tutorial.html#define-and-create-tables

h2 in-memory tables, remote connection

I am having problems creating an in-memory table, using the H2 database, and accessing it from outside the JVM it was created in and is running in.
The documentation structures the URL as jdbc:h2:tcp://<host>/mem:<databasename>
I've tried many combinations but simply cannot get the remote connection to work. Is this feature working? Can anyone give me the details of how they used it?
None of the solutions mentioned so far worked for me; the remote part just couldn't connect.
According to H2's official documentation:
To access an in-memory database from another process or from another computer, you need to start a TCP server in the same process as the in-memory database was created. The other processes then need to access the database over TCP/IP or TLS, using a database URL such as: jdbc:h2:tcp://localhost/mem:db1.
The crucial part is that the TCP server must be started in the same process as the in-memory database.
And I found a working solution at this guy's blog:
The first process is going to create the DB, with the following URL:
jdbc:h2:mem:db1
and it is going to need to start a TCP server:
org.h2.tools.Server server = org.h2.tools.Server.createTcpServer().start();
The other processes can then access your DB by using the following URL:
"jdbc:h2:tcp://localhost/mem:db1"
And that is it! Worked like a charm!
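Putting the two quoted fragments together, a minimal sketch of the first process could look like this (the database name db1, user sa and an empty password are assumptions):
// Sketch: create the in-memory database and start a TCP server in the same JVM.
// DB_CLOSE_DELAY=-1 keeps the database alive even when this connection is closed.
org.h2.tools.Server server = org.h2.tools.Server.createTcpServer("-tcpAllowOthers").start();
Connection conn = DriverManager.getConnection("jdbc:h2:mem:db1;DB_CLOSE_DELAY=-1", "sa", "");
// ... create tables and data, and keep this process running ...
// Other processes can now connect with: jdbc:h2:tcp://<host>/mem:db1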
You might look at In-Memory Databases. For a network connection, you need a host and database name. It looks like you want one of these:
jdbc:h2:tcp://localhost/mem:db1
jdbc:h2:tcp://127.0.0.1/mem:db1
Having just faced this problem, I found I needed to append DB_CLOSE_DELAY=-1 to the JDBC URL for the TCP connection. So my URLs were:
In Memory : jdbc:h2:mem:dbname
TCP Connection : jdbc:h2:tcp://localhost:9092/dbname;DB_CLOSE_DELAY=-1
From the H2 docs:
By default, closing the last connection to a database closes the database. For an in-memory database, this means the content is lost. To keep the database open, add ;DB_CLOSE_DELAY=-1 to the database URL.
Without DB_CLOSE_DELAY=-1 I could not connect to the correct database via TCP: the connection was made, but it went to a different database from the one created in-memory (verified by using the IFEXISTS=true parameter).
In Spring Boot: https://www.baeldung.com/spring-boot-access-h2-database-multiple-apps
@Bean(initMethod = "start", destroyMethod = "stop")
public Server inMemoryH2DatabaseServer() throws SQLException {
    return Server.createTcpServer("-tcp", "-tcpAllowOthers", "-tcpPort", "9090");
}
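A second application can then reach that server through the TCP port configured above; a sketch, assuming the in-memory database is named db1 and was created by the first application:
// Hypothetical client-side connection from another JVM to the TCP server above.
Connection conn = DriverManager.getConnection(
        "jdbc:h2:tcp://localhost:9090/mem:db1;DB_CLOSE_DELAY=-1", "sa", "");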

How to check if an JPA/hibernate database is up with second-level caching

I have a JSP/Spring application using Hibernate/JPA, connected to a database. I have an external program that checks every 5 minutes whether the web server is up.
The program calls a specific URL to check whether the web server is still running; the server returns "SUCCESS". Obviously, if the server is down, nothing is returned: the request times out and an alert is raised to inform the sysadmin that something is wrong.
I would like to add another layer to this process: the server should return "ERROR" if the database server is down. Is there a way, using Hibernate, to check whether the database server is alive and well?
What I thought of doing was to take an object and try to save it. This would work, but it is probably too much for what I want. I could also read (load) an object from the database, but since we use second-level caching for all our objects, the object would be loaded from the cache, not from the database.
What I'm looking for is something like:
HibernateUtils.checkDatabase()
Does such a function exist in Hibernate?
You could use a native query, e.g.
Query query = sess.createNativeQuery("select count(*) from mytable").setCacheable(false);
BigDecimal val = (BigDecimal) query.getSingleResult();
That should force Hibernate to use a connection from the pool. If the database is down, I don't know exactly which error Hibernate will return when it cannot obtain a connection; if the database is up, you get the result of your query.
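Wrapped into the kind of check URL described in the question, a sketch might look like the following; the endpoint path, the probe query and the exception handling are assumptions, and there is no existing HibernateUtils.checkDatabase() API behind it:
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Sketch of a health-check endpoint that bypasses the second-level cache by
// issuing a native query. Path and probe query are assumptions.
@RestController
public class HealthCheckController {

    @PersistenceContext
    private EntityManager entityManager;

    @GetMapping("/health")
    public String health() {
        try {
            entityManager.createNativeQuery("select 1").getSingleResult();
            return "SUCCESS";
        } catch (RuntimeException e) {
            // The exact exception type depends on the JPA provider and connection pool.
            return "ERROR";
        }
    }
}
A plain native query never touches the entity cache, so a failure here really means the database connection could not be used.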
