How to populate DynamoDB tables

Firstly, I'm very new to DynamoDB and AWS services in general, so I'm finding it hard when bombarded with all the details.
My problem is that I have an Excel file with my data in CSV format, and I'm looking to add that data to a DynamoDB table for easy access by the Alexa function I'm looking to build. The format of the table is as follows:
ID, Name, Email, Number, Room
1534234, Dr Neesh Patel, Patel.Neesh#work.com, +44 (0)3424 111111, HW101
Some of the rows have empty fields.
But everywhere I look online, there doesn't appear to be an easy way to actually achieve this, and I can't find any official means either. So with my limited knowledge of this area, I'm questioning whether I'm going about this entirely the wrong way. Firstly, am I thinking about this wrong? Should I be looking at a completely different solution for a backend database? I would have thought this would be a common task, but given the lack of support or easy solutions, am I wrong?
Secondly, if I'm going about this fine, how can it be done? I understand that DynamoDB requires a specific JSON format, and again there doesn't appear to be a straightforward way to convert my CSV into that format.
Thanks, guys.

I had the same problem when I started using DynamoDB. When you come to a distributed, big-data system, you really need to think about how to move data across systems. This is where you start.
This is clearly documented here:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SampleData.LoadData.html
Adding more details to help understand the process.
Step 1: Convert your CSV to a JSON file.
If you have a small amount of data, you can use an online tool, for example:
http://www.convertcsv.com/csv-to-json.htm
{
  "ID": 1534234,
  "Name": "Dr Neesh Patel",
  "Email": "Patel.Neesh#work.com",
  "Number": "+44 (0)3424 111111",
  "Room": "HW101"
}
You can see how nicely it is formatted, with extra spaces removed, etc. Choose the right options and perform your conversion.
If your data is huge, you will need big-data tools to process it in parallel for the conversion.
Step 2: Upload using the CLI (for small, one-time uploads)
aws dynamodb batch-write-item --request-items file://data.json
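One caveat: batch-write-item does not accept the plain JSON from the converter directly. The --request-items file has to be in DynamoDB's request format, keyed by table name, with each attribute value typed (S, N, etc.) and at most 25 items per call. A minimal sketch, assuming your table is called Employees (the table name is an assumption):
{
  "Employees": [
    {
      "PutRequest": {
        "Item": {
          "ID": {"N": "1534234"},
          "Name": {"S": "Dr Neesh Patel"},
          "Email": {"S": "Patel.Neesh#work.com"},
          "Number": {"S": "+44 (0)3424 111111"},
          "Room": {"S": "HW101"}
        }
      }
    }
  ]
}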
If you want to regularly upload the file, you need to create a data pipeline or a different process.
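If you prefer a script (and want to skip hand-crafting the DynamoDB JSON altogether), here is a minimal, hedged boto3 sketch; the table name Employees and the assumption that ID is a numeric partition key are illustrative only:

import csv
import boto3

table = boto3.resource("dynamodb").Table("Employees")  # assumed table name

with open("data.csv", newline="") as f, table.batch_writer() as batch:
    for row in csv.DictReader(f):
        item = {k: v for k, v in row.items() if v}  # skip the empty fields mentioned above
        item["ID"] = int(item["ID"])                # keep the key numeric to match the table schema
        batch.put_item(Item=item)

batch_writer() takes care of grouping the writes into batches of 25 and retrying any unprocessed items, so the same script also works for repeated uploads.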
Hope it helps.

DynamoDB is cool. However, before you use it you have to know your data usage patterns. For your case, if you're only ever going to query the DynamoDB table by ID, then it is great. If you need to query by any one column or combination of columns, then there are solutions for that:
Elasticsearch in conjunction with DynamoDB (which can be expensive),
secondary indexes on the DynamoDB table (understand that each secondary index creates a full copy of your DynamoDB table with the columns you choose to project into the index),
ElastiCache in conjunction with DynamoDB (for tying searches back to the ID column),
RDS instead of DynamoDB ('cause a SQL-ish database is better when you don't know your data usage patterns and you just don't want to think about it),
etc.
How much data you have and how you'll query it should define your architecture. For me it would come down to weighing the cost and performance of each of the available options.
In terms of getting the data into your DynamoDB or RDS table:
AWS Glue may be able to work for you
AWS Lambda to programmatically get the data into your data store(s)
perhaps others

Related

AWS DynamoDB and Storing User Data with Transactional Data in one table

I am looking at ways to store user data alongside transactional data like orders and invoices. Normally I would use a relational database like PostgreSQL, but I wanted to know if it would be a good idea to store the user data along with their transactional data in one NoSQL table like DynamoDB.
I would assume that if you did that, you would structure your data to use either objects or arrays to store the orders or invoices, but I'm not sure if that is the best way to go about it.
EDIT
So after doing some more research and trying to understand how to fit everything into a single-table design, I found this article in the AWS documentation. I decided to organise my data into collections using a combination of the primary key and the sort key. The sort key is used to determine collections (i.e., orders, customer-data, etc.). This solution is perfect for my use case because I can keep all the user data (including transactions like orders) in one DynamoDB table.
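To make the collection idea concrete, here is a small sketch of that pattern using boto3; the table name AppData, the generic key names pk/sk, and the sample values are all made up for illustration:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("AppData")  # hypothetical single table

# The user profile and an order live in the same partition, separated by the sort key.
table.put_item(Item={"pk": "USER#42", "sk": "PROFILE", "email": "jane@example.com"})
table.put_item(Item={"pk": "USER#42", "sk": "ORDER#2021-001", "total": 1999})

# Everything for the user in one query...
everything = table.query(KeyConditionExpression=Key("pk").eq("USER#42"))

# ...or just the orders collection, by prefix on the sort key.
orders = table.query(
    KeyConditionExpression=Key("pk").eq("USER#42") & Key("sk").begins_with("ORDER#")
)

Querying on the partition key alone returns the whole item collection for that user, while constraining the sort key narrows it to a single collection such as orders.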
In short, don't do that. DynamoDB is a great tool, but you need to understand it first. It's not just a NoSQL database; it's also a distributed one. It gives great performance, scalability and pricing, but modeling is trickier. You cannot build queries however you please; they have to be taken into consideration when you design your model. Read about queries vs. scans and global vs. local indexes. When you get that, you might try reading about single-table design. It should give you an idea of the limitations of DynamoDB.

How to integrate Elasticsearch in my search

We have an ad search website and all the searches are done through Entity Framework directly querying the SQL Server database.
It was working very well when the database had around 1,000 ads, but now it is reaching 300k and there are lots of users searching. The searches are now very slow (using raw SQL didn't help much) and I was instructed to consider Elasticsearch.
I've been through some tutorials and I get the idea of how it works now, but what I don't know is:
Should I stop using SQL Server to store the ads and start using Elasticsearch instead? What about all the other related data? Is Elasticsearch an alternative to SQL Server?
Each ad has some related data stored in different tables; how would I load it into Elasticsearch? As a single JSON document?
I read a lot about "billions of documents" being handled by Elasticsearch, so I don't think I would have performance problems with 300k rows in it, correct?
Would anybody explain these questions to me in more detail?
1- You could still use it; you don't want to search over the complete database, right? Just over the ads. Elasticsearch works with a NoSQL format, so it is very scalable, and it works with JSON, so you have an easy way to access it.
2- When indexing data, you should try to put all the necessary data into the same document (one SQL row becomes one JSON document), within reason. Storage is cheap, but computing time isn't.
To index your data, you could either use Filebeat, a program somewhat similar to Logstash, or create your own solution, e.g. a program that reads data from your database and passes it to Elasticsearch in bulk (a rough sketch follows below).
3- Correct, 300k rows is a small quantity, but it also depends on how much memory the machine hosting Elasticsearch has.
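For the "create your own solution" route mentioned in point 2, a rough sketch (in Python purely for brevity, since the question's stack is Entity Framework; all table, column and index names here are invented) could look like this:

import pyodbc
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=AdsDb;Trusted_Connection=yes"
)

def ad_docs():
    # Join whatever related tables you need so each ad becomes one self-contained document.
    cursor = conn.cursor()
    cursor.execute(
        "SELECT a.Id, a.Title, a.Description, c.Name AS Category "
        "FROM Ads a JOIN Categories c ON c.Id = a.CategoryId"
    )
    for row in cursor:
        yield {
            "_index": "ads",
            "_id": row.Id,
            "_source": {
                "title": row.Title,
                "description": row.Description,
                "category": row.Category,
            },
        }

# Bulk indexing is much faster than indexing documents one at a time.
helpers.bulk(es, ad_docs())

You would run something like this on a schedule (or on change events) to keep the index in sync with the database.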
Hope this helps.

What is a good web application SQL Server data mart implementation in ElasticSearch?

Coming from an RDBMS background and trying to wrap my head around ElasticSearch data storage patterns...
Currently in SQL Server, we have a star schema data mart, RecordData. Rows are organized by user ID, geographic location that pertains to the rest of the searchable record, title and description (which are free text search fields).
I would like to move this over to ElasticSearch, and have read about creating a separate index per user. If I understand this correctly, with this suggestion, I would be creating a RecordData type in each user index, correct? What is a recommended naming convention for user indices that will be simple for Kibana analysis?
One issue I have with this recommendation is, how would you organize multiple web applications on the ES server? You wouldn't want to have all those user indices all over the place?
Is it so bad to have one index per application, and type per SQL Server table?
Since in SQL Server, we have other tables for user configuration, based on user ID's, I take it that I could then create new ES types in user indices for configuration. Is this a recommended pattern? I would rather not have two data base systems for this web application.
Suggestions welcome, thank you.
I went through the same thing, and there are a few things to take into account.
Data Modeling
You say you use a star schema today. Elasticsearch is typically appropriate for denormalized data, where the totality of the information resides in each document, unlike with a star schema. If you can live with denormalized data, that is fine, but I assume that since you already have a star schema, denormalized data is not an option, because you don't want to go and update millions of documents each time the location name changes, for example (if I understand the use case). At least in my use case that wasn't an option.
What are Elasticsearch options for normalized data?
This leads us to think about how to put star-schema-like data into a system like Elasticsearch. There are a few options in the documentation; the main ones I focused on were:
Nested Objects - more details at https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-objects.html . With nested objects the entire information is kept in a single document, meaning one location and its related users would be in a single document. That may make it sub-optimal, because the document will be huge and, again, a change in the location name will require updating the entire document. So this is better, but still not optimal (a small mapping sketch follows after this list).
Parent - Child Relationship - more details at https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child.html . In this case the location and the user records would be kept as separate documents, similar to separate tables in a relational database. This seems to be the right modeling for what we need. The only major issue with this option is the fact that Kibana 4 does not provide ways to manipulate/aggregate documents based on a parent/child relationship as of this writing. So if your main driver for using Elasticsearch is Kibana (it was mine), that pretty much eliminates the option. If you want to benefit from Elasticsearch's speed as an engine, this seems to be the desired option for your use case.
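As a small illustration of the nested-objects option, here is a hedged mapping sketch (Python client, recent Elasticsearch versions; older versions also require a document type in the mapping, and all index/field names here are invented):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="recorddata",
    mappings={
        "properties": {
            "title": {"type": "text"},
            "description": {"type": "text"},
            "location": {"type": "keyword"},
            "users": {
                # "nested" keeps each embedded user object queryable on its own,
                # at the cost of reindexing the whole document whenever it changes.
                "type": "nested",
                "properties": {
                    "user_id": {"type": "keyword"},
                    "name": {"type": "text"},
                },
            },
        }
    },
)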
In my opinion, once you get the data modeling right, all of your questions will be easier to answer.
Regarding the organization of the servers themselves, the way we organize it is by having a separate cluster of three Elasticsearch nodes behind a load balancer (all of it hosted in a cloud) and then having all the web applications connect to that cluster using the Elasticsearch API.
Hope that helps.

Designing a generic unstructured data store

The project I have been given is to store and retrieve unstructured data from a third-party. This could be HR information – User, Pictures, CV, Voice mail etc or factory related stuff – Work items, parts lists, time sheets etc. Basically almost any type of data.
Some of these items may be linked, so a User may have a picture, for example. I don't need to examine the content of the data, as my storage solution will receive the data as XML and send it out as XML. It's down to the recipient to convert the XML back into a picture or sound file, etc. The recipient may request all Users, so I need to be able to find User records and their related "child" items such as pictures, or the recipient may just want pictures, etc.
My database is MS SQL and I have to stick with that. My question is: are there any patterns or existing solutions for handling unstructured data in this way?
I've done a bit of Googling and have found some sites that talk about this kind of problem, but they are more interested in drilling into the data to allow searches on its content. I don't need to know the content, just what type it is (picture, User, Job Sheet, etc.).
To those who have given their comments:
The problem I face is linking objects together. A User object may be added to the data store, then at a later date the user's picture may be added. When the User is requested, I will need to return both the User object and its associated Picture. The user may update their picture, so you can see I need to keep relationships between objects. That is what I was trying to get across in the second paragraph. The problem I have is that my solution must be very generic, as I should be able to store anything and link these objects according to the end users' requirements, e.g. User, Pictures and emails, or Work items, Parts lists, etc. I see that Microsoft has developed Zentity, which looks like it may be useful, but I don't need to drill into the data contents, so it's probably overkill for what I need.
I have been using Microsoft Zentity since version 1, and whilst it is excellent at storing huge amounts of structured data and allowing (relatively) simple access to the data, if your data structure is likely to change then recreating the 'data model' (and the regression testing) would probably remove the benefits of using such a system.
Another point worth noting is that Zentity requires filestream storage so you would need to have the correct version of SQL Server installed (2008 I think) and filestream storage enabled.
Since you deal with XML, it's not unstructured data. Microsoft SQL Server 2005 or later has an XML column type that you can use.
Now, if you don't need to access XML nodes and you think you never will, go with plain varbinary(max). For your information, storing XML content in an XML-typed column lets you not only retrieve XML nodes directly through database queries, but also validate XML data against schemas, which may be useful to ensure that the content you store is valid.
Don't forget to use FILESTREAM (SQL Server 2008 or later) if your XML data grows in size (2 MB+). This is probably your case, since voice mail or pictures can easily be larger than 2 MB, especially when they are Base64-encoded inside an XML file.
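As a hedged illustration of the XML-column suggestion (shown through Python/pyodbc only for brevity; the table, columns and the self-referencing ParentId link are invented to match the "linked objects" requirement, not a prescribed design):

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=Store;Trusted_Connection=yes"
)
cur = conn.cursor()

# One generic table: the kind of object, an optional link to a parent object, and the XML payload.
cur.execute("""
    IF OBJECT_ID('dbo.StoredObjects') IS NULL
    CREATE TABLE dbo.StoredObjects (
        Id         INT IDENTITY PRIMARY KEY,
        ObjectType NVARCHAR(50) NOT NULL,                       -- 'User', 'Picture', 'WorkItem', ...
        ParentId   INT NULL REFERENCES dbo.StoredObjects(Id),   -- e.g. a Picture linked to its User
        Content    XML NOT NULL
    )
""")

cur.execute(
    "INSERT INTO dbo.StoredObjects (ObjectType, ParentId, Content) VALUES (?, ?, ?)",
    "User", None, "<User><Name>Jane</Name></User>",
)
conn.commit()

Typing Content as XML rather than varbinary(max) is what leaves the door open to schema validation and node-level queries later, should you ever need them.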
Since your data is quite freeform and changeable, your best bet is to put it on a plain old file system, not a relational database. By all means store some meta-information in SQL where it makes sense to search through structured data relationships, but if your main data content is not structured with data relationships then you're doing yourself a disservice using a SQL database.
The filesystem is blindingly fast to lookup files and stream them, especially if this is an intranet application. All you need to do is share a folder and apply sensible file permissions and a large chunk of unnecessary development disappears. If you need to deliver this over the web, consider using WebDAV with IIS.
A reasonably clever file and directory naming convention, with a small piece of software you write to help people get to the right path, will hands down beat any SQL database for both access speed and sequential data streaming. Filesystem paths and file names will always beat any clever SQL index for data location speed. And plain old files are the ultimate unstructured, flexible data store.
Use SQL for what it's good for. Use files for what they are good for. Best tools for the job and all that...
You don't really need any pattern for this implementation. Store all your data in a BLOB entry, read from it when required, and then send it out again.
You would probably need to investigate other infrastructure aspects, like periodically cleaning up the database to remove expired entries.
Maybe I'm not understanding the problem clearly.
So am I right in saying that all you need to store is a blob of XML with whatever binary information is contained within? Why can't you have a Users table and then a linked (foreign key) table with UserObjects in it, linked by UserId?

Data Correlation in large Databases

We're trying to identify the locations of certain information stored across our enterprise in order to bring it into compliance with our data policies. On the file end, we're using Nessus to search through differing files, but I'm wondering about on the database end.
Using Nessus would seem largely pointless because it would output the raw data and wouldn't tell us what table or row it was in, or give us much useful information, especially considering these databases are quite large (hundreds of gigabytes).
Also worth noting, this system needs to be able to do pattern-based matching (such as using regular expressions). Not just a "dumb search" engine.
I've investigated the use of Data Mining and Data Warehousing in order to find this data but it seems like they're more for analysis of data than actually just finding data.
Is there a better method of searching through large amounts of data in a database to try to find this information? We're using both Oracle 11g and SQL Server 2008 and need to perform the searches on both, so I'd like to stay away from server-specific paradigms (although if I have to rewrite some code to translate from T-SQL to PL/SQL, and vice versa, I don't mind).
On SQL Server, for searching through large amounts of text, you can look into Full-Text Search.
Read more here http://msdn.microsoft.com/en-us/library/ms142559.aspx
But if I am reading right, you want to spider your database in a similar fashion to how a web search engine spiders web sites and web pages.
You could use a set of full text queries that bring back the results spanning multiple tables.
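For example, a single full-text query (shown via Python/pyodbc; the Tickets table and Comments column are hypothetical, and a full-text index must already exist on that column) looks like this:

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=Enterprise;Trusted_Connection=yes"
)
# CONTAINS uses the full-text index rather than a table scan; the search term is just a sample.
rows = conn.execute(
    "SELECT TicketId, Comments FROM dbo.Tickets WHERE CONTAINS(Comments, ?)",
    '"credit card"',
).fetchall()

Bear in mind that full-text search matches words and phrases, not regular expressions, so it only partially covers the pattern-matching requirement.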
Oracle supports regular expressions with the REGEXP_LIKE() function, and it ought to be fairly straightforward to automate the generation of the code you need based on system metadata (to find all text columns over a certain length, for example, and include them in a predicate against that table to find the rows and values that match your regexp). Doesn't sound too challenging really. In theory you could add check constraints to columns to prevent the insertion of values that match a regexp, but that might be overkill.
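A hedged sketch of that metadata-driven approach, using the python-oracledb driver (the schema name, connection details, and the sample pattern are all placeholders):

import oracledb

pattern = r"\d{3}-\d{2}-\d{4}"   # e.g. something shaped like a US SSN
conn = oracledb.connect(user="scott", password="tiger", dsn="localhost/orclpdb1")
cur = conn.cursor()

# Use the data dictionary to find candidate text columns, then test each one with REGEXP_LIKE.
cur.execute(
    """
    SELECT table_name, column_name
    FROM   all_tab_columns
    WHERE  owner = :owner
    AND    data_type IN ('VARCHAR2', 'CHAR', 'CLOB')
    """,
    owner="APPDATA",
)

for table, column in cur.fetchall():
    q = f'SELECT COUNT(*) FROM APPDATA."{table}" WHERE REGEXP_LIKE("{column}", :p)'
    hits = conn.cursor().execute(q, p=pattern).fetchone()[0]
    if hits:
        print(f"{table}.{column}: {hits} matching rows")

The same idea translates to SQL Server by swapping all_tab_columns for INFORMATION_SCHEMA.COLUMNS, although T-SQL has no built-in regular-expression predicate, which is where full-text search or CLR functions come in.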
Oracle Text is suited to searching for words/phrases in larg(ish) bits of text (e.g. PDFs, HTML, TXT or DOC files) held in the database. There is some limited fuzzy searching, but not regular expressions per se.
You don't really go into what sort of data you are looking for or what you have in your databases. Nessus indicates you are looking for security issues, but the title of "Data Correlation" suggests something completely different.
Really the data structures should provide the information about what to look for and where. That's what databases are about - structuring data for accessibility. A database backing a CMS, forum software or similar would be a different kettle of fish.
