We have an ad search website, and all searches go through Entity Framework querying the SQL Server database directly.
It worked very well when the database had around 1,000 ads, but it is now reaching 300k and there are lots of users searching. Searches are now very slow (using raw SQL didn't help much), and I was instructed to consider Elasticsearch.
I've been through some tutorials and I get the idea of how it works now, but what I don't know is:
Should I stop using SQL Server to store the ads and start using Elasticsearch instead? What about all the other related data? Is Elasticsearch an alternative to SQL Server?
Each ad has some related data stored in different tables; how would I load it into Elasticsearch? As a single JSON document?
I read a lot about Elasticsearch handling "billions of records", so I don't think I would have performance problems with 300k rows in it, correct?
Could anybody help me understand these questions better?
1- You can keep using SQL Server; you don't want to search over the complete database, right? Just over the ads. Elasticsearch works with a NoSQL document model, so it is very scalable, and it works with JSON, which makes the data easy to access.
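For example, once the ads are indexed, searching them is a single small query against the ads index. A minimal sketch with the official Python client (URL, index and field names are all assumptions; the .NET clients, e.g. NEST, expose the same concepts):

from elasticsearch import Elasticsearch

# connect to the cluster (address is an assumption)
es = Elasticsearch("http://localhost:9200")

# full-text search over a hypothetical "ads" index
response = es.search(
    index="ads",
    query={"match": {"title": "mountain bike"}},
    size=20,
)

for hit in response["hits"]["hits"]:
    print(hit["_id"], hit["_source"]["title"])

SQL Server remains the system of record; Elasticsearch only holds the searchable copy of the ads.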
2- When indexing data, you should try to put all the necessary data into the same document (the equivalent of a SQL row), which is a single JSON object, within reason. Storage is cheap, but computing time isn't.
To index your data, you could either use Filebeat (a program somewhat similar to Logstash), or create your own solution, e.g. a program that reads data from your DB and sends it to Elasticsearch in bulk.
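As a rough sketch of the roll-your-own option (everything here is hypothetical: connection string, table, column and index names), the program only has to read rows, build one denormalized JSON document per ad, and send them in bulk:

import pyodbc
from elasticsearch import Elasticsearch, helpers

# hypothetical connection details
sql = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=AdsDb;Trusted_Connection=yes"
)
es = Elasticsearch("http://localhost:9200")

cursor = sql.cursor()
cursor.execute(
    "SELECT a.Id, a.Title, a.Description, c.Name AS Category "
    "FROM Ads a JOIN Categories c ON c.Id = a.CategoryId"
)

def actions():
    # one self-contained document per ad, including the joined data
    for row in cursor:
        yield {
            "_index": "ads",
            "_id": row.Id,
            "_source": {
                "title": row.Title,
                "description": row.Description,
                "category": row.Category,
            },
        }

# index in bulk instead of one request per ad
helpers.bulk(es, actions())

Run it once for the initial load, then re-run it (or a variant that only picks up changed rows) on a schedule to keep the index in sync.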
3- Correct, 300k rows is a small quantity, but performance also depends on the memory of the machine hosting Elasticsearch.
Hope this helps.
Firstly, I'm very new to DynamoDB and AWS services in general, so I'm finding it hard when bombarded with all the details.
My problem is that I have an excel file with my data in CSV format, and I'm looking to add said data to a DynamoDB table, for easy access for the Alexa function I'm looking to build. The format of the table is as follows:
ID, Name, Email, Number, Room
1534234, Dr Neesh Patel, Patel.Neesh#work.com, +44 (0)3424 111111, HW101
Some of the rows have empty fields.
But everywhere I look online, there doesn't appear to be an easy way to actually achieve this, and I can't find any official means either. So, with my limited knowledge of this area, I am questioning whether I'm going about this entirely the wrong way. So firstly, am I thinking about this wrong? Should I be looking at a completely different solution for a backend database? I would have thought this would be a common task, but with the lack of support or easy solutions, am I wrong?
Secondly, if I'm going about this fine, how can it be done? I understand that DynamoDB requires a specific JSON format, and again there doesn't appear to be a straightforward way to convert my CSV into said format.
Thanks, guys.
I had the same problem when I started using DynamoDB. When you move to distributed, big-data systems, you really need to plan how data moves across systems. This is where to start.
It is clearly documented here:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SampleData.LoadData.html
Adding more details to help you understand the process.
Step 1: Convert your CSV to a JSON file.
If you have a small amount of data, you can use online tools such as:
http://www.convertcsv.com/csv-to-json.htm
{
"ID": 1534234,
"Name": "Dr Neesh Patel",
"Email": "Patel.Neesh#work.com",
"Number": "+44 (0)3424 111111",
"Room": "HW101"
}
You can see how nicely it is formatted: spaces removed, and so on. Choose the right options and perform your conversion.
If your data is huge, then you need big-data tools to process and convert it in parallel.
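For this particular file, a short script can also produce the exact request format that the batch-write-item call in step 2 expects (table name, file names and the empty-field handling are assumptions; note that batch-write-item accepts at most 25 items per request, so the output is chunked):

import csv
import json

TABLE = "Employees"  # hypothetical table name

items = []
with open("data.csv", newline="") as f:
    # skipinitialspace handles the spaces after the commas in the header row
    for row in csv.DictReader(f, skipinitialspace=True):
        # DynamoDB JSON needs a type descriptor per attribute: N for numbers, S for strings
        attrs = {"ID": {"N": row["ID"]}}
        for col in ("Name", "Email", "Number", "Room"):
            if row[col]:  # skip empty fields instead of writing empty attributes
                attrs[col] = {"S": row[col]}
        items.append({"PutRequest": {"Item": attrs}})

# batch-write-item takes at most 25 items per request, so split the output
for i in range(0, len(items), 25):
    with open(f"data-{i // 25}.json", "w") as out:
        json.dump({TABLE: items[i:i + 25]}, out, indent=2)

Each resulting file can then be uploaded with the command from step 2 (e.g. --request-items file://data-0.json).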
Step 2: Upload using the CLI (for a small, one-time upload)
aws dynamodb batch-write-item --request-items file://data.json
If you want to regularly upload the file, you need to create a data pipeline or a different process.
Hope it helps.
DynamoDB is cool. However, before you use it you have to know your data usage patterns. For your case, if you're only ever going to query the DynamoDB table by ID, then it is great. If you need to query by any one column, or a combination of columns, then there are solutions for that:
Elasticsearch in conjunction with DynamoDB (which can be expensive),
secondary indexes on the DynamoDB table (understand that each secondary index creates a full copy of your DynamoDB table with the columns you choose to store in the index; see the query sketch below),
ElastiCache in conjunction with DynamoDB (for tying searches back to the ID column),
RDS instead of DynamoDB ('cause a SQL-ish DB is better when you don't know your data usage patterns and you just don't want to think about it),
etc.
How much data you have and how you'll query it should define your architecture. For me it would come down to weighing the cost and performance of each of the options available.
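To make the secondary-index option above concrete, querying by a non-key column through a hypothetical global secondary index looks roughly like this with boto3 (table, index and attribute names are invented):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Employees")  # hypothetical table

# query a hypothetical GSI whose partition key is Email;
# without such an index this lookup would need a full table scan
response = table.query(
    IndexName="Email-index",
    KeyConditionExpression=Key("Email").eq("Patel.Neesh#work.com"),
)

for item in response["Items"]:
    print(item["ID"], item["Name"])

Each extra GSI you add costs storage and write capacity, which is exactly the trade-off described above.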
In terms of getting the data into your DynamoDB or RDS table:
AWS Glue may be able to work for you
AWS Lambda to programmatically get the data into your data store(s) (a rough sketch follows this list)
perhaps others
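As a sketch of the programmatic route (the same code could sit inside a Lambda handler reading the file from S3), boto3's batch writer handles the batching and retries for you; table, file and column names are assumptions:

import csv
import boto3

table = boto3.resource("dynamodb").Table("Employees")  # hypothetical table name

with open("data.csv", newline="") as f, table.batch_writer() as batch:
    for row in csv.DictReader(f, skipinitialspace=True):
        # drop empty fields rather than writing empty attributes
        item = {k: v for k, v in row.items() if v}
        item["ID"] = int(item["ID"])  # numeric partition key
        batch.put_item(Item=item)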
I need a document-based database where I can update single documents, but when I read them I'll get all documents at once.
In my app I have no need to search, and there will be up to 100k documents.
The DB should be hosted on a dedicated machine or VM and have a web API.
What is the best choice?
This is probably not so smart to do. If you only update one doc, why would you want to retrieve all the other 99,999 documents when you know they didn't change?
Unless you are making a computation based on that, in which case you probably want the DB to do that for you and only fetch the result?
Ok, so I'm working on an ASP MVC web application that queries a fairly large amount of data from a SQL Server 2008 database. When the application starts, the user is presented with a search mask that includes several fields. Using the search mask, the user can search for data in the database and also filter the search by specifying parameters in the mask. In order to speed up searching, I'm storing the result set returned by the database query in the server session. During subsequent searches I can then search the data I have in the session, thus avoiding unnecessary trips to the DB.
Since the amount of data that can be returned by a database query can be quite large, the scalability of the web application is severely limited. If there are, let's say, 100 users using the application at the same time, the server will keep search results in its session for each separate user. This will eventually eat up quite a bit of memory.

My question now is, what's the best alternative to storing the data in session? The query in the DB can take quite a while at times, so, at the moment, I would like to avoid having to run the query on subsequent searches if the data I already retrieved earlier contains the data that is now being searched for. One option I've considered is creating a temp table in the DB in my search query that stores the retrieved data and can be used for subsequent searches. The problem with that is, I don't have all too much experience with SQL Server, so I don't know whether SQL Server would create temp tables for each user if there are multiple users performing the search. Are there any other possibilities? Could the idea with the temp table in SQL Server work, or would it only lead to memory issues on the SQL Server? Thanks for the help! :)
Edit: Thanks a lot for the helpful and insightful answers, guys! However, I failed to mention a detail that's kind of important. When I query the database, the format of the result set can vary from user to user. This is because the user can decide which columns the result table has by selecting columns from a predefined multiselect box in the search mask. If user A wants ColA, ColB and ColC to be displayed in his result table, he selects those values from the multiselect box in the search mask. User B, however, might select ColA and ColC only. Therefore, caching the results in a single table for all users might be a bit tricky, since the table columns are not necessarily going to be the same for all users. Therefore, I'm thinking I'll almost have to use an alternative that saves each user's cached table separately. The HTML5 Local Storage option mentioned below sounds interesting. Since this is an intranet application, it might be fair to assume (or require) that users have an up-to-date browser that supports HTML5. What do you guys think? Again, thanks for the help :)
If you want to cache query results, they'll have to be either on the web server or client in some form or another. All options will require memory, and since search results are user-specific, that memory usage will increase as a linear function of the number of current users.
My suggestions are to limit the number of rows returned from SQL (with TOP) and/or to look into optimizing your query on the SQL end. If your DB query takes a noticeable amount of time there's a good chance it can be optimized in SQL.
Have you already thought about NoSQL databases?
The idea of a NoSQL database is to store information that is optimized for reading or writing and is accessed with 'easy queries' (for example, a look-up on search terms). They scale horizontally with ease and would allow you to search through a whole lot of data very fast (think of Google's BigData, for example!)
If HTML5 is a possibility, you could use Local Storage.
You could try turning on SQL session state.
http://support.microsoft.com/kb/317604
Plus: effortless; you will find out whether this fixes the memory pressure and has acceptable performance (i.e. reading and writing the cache to the DB). If the perf is okay, then you may want to implement the sort of thing that SQL session state does yourself, because there is a ...
Downside: if you aren't aggressive about removing it from session, it will be serialized and deserialized on each request, even on unrelated pages.
Where do I find a how-to for setting up Elasticsearch with Postgres?
My field sizes will be about 350 MB each (yes, MB). I have a text output of all of the US Code and all decisions from all the courts, the Statutes at Large, pretty much everything you would find in a library, and I need to be able to do full-text searches and return the exact point in the field to the app so it can return the exact page in PDF form. Postgres can easily handle the datastore, but I've never used Elasticsearch and have no idea of how it integrates into the indexing, etc.
As of 2015, there's ZomboDB (https://github.com/zombodb/zombodb). As the author, I'm a bit biased, but it's quite powerful. ;)
It's a Postgres extension and Elasticsearch plugin that allows you to create Postgres indexes ("CREATE INDEX") backed by a remote Elasticsearch cluster, and it exposes a fairly powerful query language for performing full-text searches.
Because it's an actual index in Postgres, the ES cluster is automatically synchronized as you INSERT/UPDATE/DELETE records. As such, there's no need for asynchronous synchronization processes.
Additionally, because it's an actual index, it is transaction-safe, which means concurrent Postgres sessions will only see results that are consistent with their current transaction.
Here's a link to ZomboDB's tutorial. It should give you an idea of how easy ZomboDB is to use.
There is an application that you can use to import SQL Server, Oracle, PostgreSQL, MySQL, etc. into an Elasticsearch index.
http://code.google.com/p/ogr2elasticsearch/
Please let me know if you have any trouble building or using it. ~Adam
You can explore using pgsync.
PGSync is an open-source middleware (written in python) for syncing data from Postgres to Elasticsearch effortlessly. It allows you to keep Postgres as your source of truth and expose structured denormalized documents in Elasticsearch.
GitHub link: https://github.com/toluaina/pgsync
It's possible to insert/update/delete Postgres data in Elasticsearch with no middleware other than the pgsql-http extension. Using triggers, you can get pretty much real-time index updates.
You can also query elasticsearch and use the results within postgres to do joins etc with other tables/data in your database.
See the elasticsearch examples: https://github.com/sysadminmike/pgsql-http_examples
Has anyone used Lucene.NET rather than the full-text search that comes with SQL Server?
If so, I would be interested in how you implemented it.
Did you, for example, write a Windows service that queried the database every hour and then saved the results to the Lucene.NET index?
Yes, I've used it for exactly what you are describing. We had two services - one for read, and one for write, but only because we had multiple readers. I'm sure we could have done it with just one service (the writer) and embedded the reader in the web app and services.
I've used Lucene.NET as a general database indexer, so what I got back was basically DB IDs (to indexed email messages), and I've also used it to get back enough info to populate search results and such without touching the database. It's worked great in both cases, though the SQL can get a little slow, as you pretty much have to get an ID, select by ID, etc. We got around this by making a temp table (with just the ID column in it) and bulk-inserting from a file (which was the output from Lucene), then joining to the message table. It was a lot quicker.
Lucene isn't perfect, and you do have to think a little outside the relational database box, because it TOTALLY isn't one, but it's very very good at what it does. Worth a look, and, I'm told, doesn't have the "oops, sorry, you need to rebuild your index again" problems that MS SQL's FTI does.
BTW, we were dealing with 20-50million emails (and around 1 million unique attachments), totaling about 20GB of lucene index I think, and 250+GB of SQL database + attachments.
Performance was fantastic, to say the least - just make sure you think about, and tweak, your merge factors (when it merges index segments). There is no issue in having more than one segment, but there can be a BIG problem if you try to merge two segments which have 1 million items each, and you have a watcher thread which kills the process if it takes too long... (yes, that kicked our arse for a while). So keep the max number of documents per merge LOW (i.e., don't set it to maxint like we did!)
EDIT: Corey Trager documented how to use Lucene.NET in BugTracker.NET here.
I have not done it against a database yet; your question is kinda open.
If you want to search a DB and can choose to use Lucene, I also guess that you can control when data is inserted into the database.
If so, there is little reason to poll the DB to find out whether you need to reindex; just index as you insert, or create a queue table which can be used to tell Lucene what to index.
I think we don't need another indexer that is ignorant of what it is doing, reindexes every time, or wastes resources.
I have also used Lucene.NET as a storage engine, because it's easier to distribute and set up alternate machines with an index than with a database: it's just a filesystem copy. You can index on one machine and simply copy the new files to the other machines to distribute the index. All the searches and details are served from the Lucene index, and the database is just used for editing. This setup has proven to be a very scalable solution for our needs.
Regarding the differences between SQL Server and Lucene, the principal problem with SQL Server 2005 full-text search is that the service is decoupled from the relational engine, so joins, ordering, aggregates and filters between the full-text results and the relational columns are very expensive in performance terms. Microsoft claims that these issues have been addressed in SQL Server 2008 by integrating the full-text search inside the relational engine, but I haven't tested it. They also made the whole full-text search much more transparent; in previous versions the stemmers, stopwords, and several other parts of the indexing were like a black box and difficult to understand, whereas in the new version it is easier to see how they work.
In my experience, if SQL Server meets your requirements, it will be the easiest way; if you expect a lot of growth, complex queries, or need fine-grained control over the full-text search, you might consider working with Lucene from the start, because it will be easier to scale and customise.
I used Lucene.NET along with MySQL. My approach was to store the primary key of the DB record in the Lucene document along with the indexed text. In pseudo-code it looks like this:
Store record:
insert text, other data to the table
get latest inserted ID
create lucene document
put (ID, text) into lucene document
update lucene index
Querying
search lucene index
for each lucene doc in result set load data from DB by stored record's ID
Just to note, I switched from Lucene to Sphinx due to its superb performance.