As part of our use case, we generate invoice documents for shipping containers and cargo. Each document contains details of a container and its contents. We need to store these invoice documents for 15 years and be able to retrieve them.
Here are the details -
Container Name | Origin Port | Destination Port -> Invoice Name
We need to be able to retrieve the invoice name using the container name, origin port, destination port, or a combination of these columns (similar to SQL).
Each invoice will be at least 40 to 70 MB.
Any suggestions on building this? We use AWS as our cloud provider. I just need some pointers to help me get started.
One approach is to use Redshift + Athena backed by Spark jobs.
Invoices of that size likely include images and don't fit well in a database, but your retrieval needs sound very much like a database problem. I've worked with clients with similar needs in the past and used Redshift for analytics on the relational data and S3 for storage of the large non-relational data (images). The data table in Redshift can have a JSON (SUPER type) column that contains pointers to the S3 objects and descriptors for what each S3 object is. Any number of S3 objects (up to the maximum size of the JSON value, which is large) can be referenced by a single row in the data table.
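To make that pattern concrete, here is a minimal Python sketch of the split, assuming a hypothetical S3 bucket (`invoice-archive`) and a Redshift table (`invoices`) whose searchable columns mirror the question plus a SUPER column of S3 pointers; none of these names come from the question.

```python
# Sketch only: bucket, key, table and column names are assumptions.
import json
import boto3

s3 = boto3.client("s3")

BUCKET = "invoice-archive"             # hypothetical bucket for the 40-70 MB documents
KEY = "invoices/2024/INV-0001.pdf"     # hypothetical object key

# 1. The large invoice document itself goes to S3.
with open("INV-0001.pdf", "rb") as f:
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=f)

# 2. The searchable columns go to Redshift, with a SUPER column of S3 pointers.
pointers = [{"s3_bucket": BUCKET, "s3_key": KEY, "content": "invoice_pdf"}]

insert_sql = """
    INSERT INTO invoices (container_name, origin_port, destination_port, invoice_name, s3_objects)
    VALUES (%s, %s, %s, %s, JSON_PARSE(%s))
"""
params = ("CONT-1234", "Rotterdam", "Singapore", "INV-0001", json.dumps(pointers))

# Run insert_sql/params with your Redshift driver of choice
# (e.g. redshift_connector or the Redshift Data API); placeholder
# syntax may differ by driver.
```

Retrieval by container name, origin port, or destination port is then an ordinary SQL query against `invoices`, and the returned S3 keys are used to fetch the documents themselves.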
I'm trying to migrate a SQL Server table to a DynamoDB target using AWS DMS.
The table structure is as follows:
|SourceTableID|Title |Status|Display|LongDescription|
|-------------|-----------|------|-------|---------------|
|VARCHAR(100) |VARCHAR(50)|INT |BIT |NVARCHAR(MAX) |
Every field is migrated without errors and is present in my target DynamoDB table except for the LongDescription column, because it is an NVARCHAR(MAX) column.
According to the documentation:
The following limitations apply when using DynamoDB as a target:
AWS DMS doesn't support LOB data unless it is a CLOB. AWS DMS converts
CLOB data into a DynamoDB string when migrating data.
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.DynamoDB.html
Source Data Types for SQL Server
|SQL Server Data Types|AWS DMS Data Types|
|---------------------|------------------|
|NVARCHAR(MAX)        |NCLOB, TEXT       |
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.SQLServer.html
Depending on my task configuration, one of the following two scenarios occurs:
Limited LOB mode: Information for the LongDescription column is migrated to DynamoDB, but the text is truncated.
Full LOB mode: Information for the LongDescription column is NOT migrated to DynamoDB.
How can I correctly migrate an NVARCHAR(MAX) column to DynamoDb without losing any data?
Thanks!
Progress Report
I have already tried migrating to an S3 target. However, it looks like the S3 target doesn't support full LOB mode:
Limitations to Using Amazon S3 as a Target
Full LOB mode is not supported.
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.S3.html
I cannot use the COMPRESS T-SQL function to store the LongDescription column as binary, since my SQL Server version is 2014.
I tried running the migration task in limited LOB mode with the maximum byte size as the limit. My maximum byte size is 45,155,996, so I set 46,000 KB as the limit. This results in the following error:
Failed to put item into table 'TestMigration_4' with data record with source PK column 'SourceTableID' and value '123456'
You might want to check AWS's best practices page for storing large items/attributes in DynamoDB:
If your application needs to store more data in an item than the DynamoDB size limit permits, you can try compressing one or more large attributes or breaking the item into multiple items (efficiently indexed by sort keys). You can also store the item as an object in Amazon Simple Storage Service (Amazon S3) and store the Amazon S3 object identifier in your DynamoDB item.
I actually like the idea of saving your LongDescription in S3 and referencing its identifier in DynamoDB. I haven't tried it, but one idea would be to use DMS's ability to create multiple migration tasks to accomplish this, or even to build some kind of ETL solution as a last resort, making use of DMS's CDC capability. You might want to get in touch with their support team to make sure it works.
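DynamoDB items are capped at 400 KB, which is why a ~45 MB LongDescription can't go into the item directly no matter how the LOB settings are tuned. Here is a rough Python sketch of the S3 pointer pattern; the bucket and the LongDescriptionS3Key attribute are made-up names, and the table name is just taken from the error message above:

```python
# Sketch only: bucket and attribute names are hypothetical.
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("TestMigration_4")

def put_row(source_table_id, title, status, display, long_description):
    # 1. Offload the large NVARCHAR(MAX) value to S3, where there is no 400 KB limit.
    key = f"long-descriptions/{source_table_id}.txt"
    s3.put_object(Bucket="my-longdescription-bucket",
                  Key=key,
                  Body=long_description.encode("utf-8"))

    # 2. Store only a pointer to the S3 object in the DynamoDB item.
    table.put_item(Item={
        "SourceTableID": source_table_id,
        "Title": title,
        "Status": status,
        "Display": display,
        "LongDescriptionS3Key": key,
    })
```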
Hope it helps!
I have a blob (storage account) hosted on Azure. I also have a SQL Server table hosted on Azure. I have a couple of questions:
Is it possible to create a join between the blob and the table?
Is it possible to store all of the information in the blob?
The table has address information in it, and I want to be able to pull that information from the table and associate it (or join it) to the proper image by the ID in the SQL table (if that is the best way).
Is it possible to create a join between the blob and the table?
No.
Is it possible to store all of the information in the blob?
You possibly could (by storing the address information as blob metadata), but it is not recommended because you would lose searching capability. Blob storage is simply an object store; you won't be able to query on address information.
The table has address information on it and I wanted to be able to
pull that information from the table and associate it or join it to
the proper image by the ID in the sql table (if that is the best way)
The recommended way of doing this is to store the images in blob storage. Each blob in blob storage gets a unique URL (e.g. https://account.blob.core.windows.net/container/blob.png) that you can store in your database along with the other address fields (e.g. create a column called ImageUrl and store the URL there).
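A minimal sketch of that flow in Python, assuming the azure-storage-blob and pyodbc packages and a hypothetical Addresses table keyed by ID:

```python
# Sketch only: connection strings, container and table names are hypothetical.
import pyodbc
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob_client = blob_service.get_blob_client(container="images", blob="property-123.png")

# 1. Upload the image to blob storage.
with open("property-123.png", "rb") as data:
    blob_client.upload_blob(data, overwrite=True)

# 2. Store the blob's URL alongside the address row in SQL.
conn = pyodbc.connect("<sql-connection-string>")
conn.cursor().execute(
    "UPDATE Addresses SET ImageUrl = ? WHERE ID = ?",
    blob_client.url, 123,
)
conn.commit()
```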
Azure Storage (blob, in your case) and SQL Server are completely separate, independent data stores. You cannot do joins, transactions, or really any type of query, across both at the same time.
What you store in each is totally up to you. Typically, people store searchable/indexable metadata within a database engine (such as SQL Server in your case), and non-searchable (binary etc) content in bulk storage (such as blobs).
As far as the "best way" goes: not sure what you're looking for, but there is no single best way. Like I said, some people will store anything searchable in their database and, on top of that, store a URL to the specific blobs related to that metadata. There's no specific rule about doing it this way, of course. Whatever works for you and your app...
Note: Blobs have metadata as well, but that metadata is not indexable; it would require searching through all blobs (or all blobs in a container) to perform specific searches.
I need to build a chat application. I also need to store all chat transcripts into some storage - SQL database, Table storage or other Azure mechanism.
It should store about 50 MB of characters per day. Each chunk of text should be attached to a specific customer.
My question:
What is the best way to store such amount of text in Azure?
Thanks.
I would store them in Azure Tables, using conversationId as the partition key and messageID as the row key. That way, you can easily aggregate your statistics based on those two and quickly retrieve conversations.
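A minimal sketch with the azure-data-tables package; the table name, the CustomerId field, and the zero-padded RowKey scheme are my own assumptions:

```python
# Sketch only: names and the RowKey scheme are assumptions.
from datetime import datetime, timezone
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<storage-connection-string>")
table = service.create_table_if_not_exists("ChatTranscripts")

# One entity per message: partition by conversation, row key = message id.
table.create_entity({
    "PartitionKey": "conversation-42",   # conversationId
    "RowKey": "000017",                  # messageID, zero-padded so messages sort in order
    "CustomerId": "customer-7",
    "Text": "Hello, how can I help you?",
    "SentUtc": datetime.now(timezone.utc).isoformat(),
})

# Retrieving a whole conversation is a single-partition query.
messages = table.query_entities("PartitionKey eq 'conversation-42'")
```

Because all messages of a conversation share one partition key, pulling a full transcript stays a cheap single-partition query.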
[Background]
I am creating a WCF service for storing and retrieving articles for our university.
I need to save the files and their metadata.
My WCF service needs to serve about 1,000 people a day.
The storage will contain about 60,000 articles.
I see three different ways to do it:
I can save the metadata (file name, file type) in SQL Server (to generate a unique ID) and save the files in Azure Blob storage.
I can save both metadata and data in SQL Server.
I can save both metadata and data in Azure Blob storage.
Which way would you choose, and why?
If you can suggest a solution of your own, that would be wonderful.
P.S. All of the options use Azure.
I would recommend going with option 1: save the metadata in a database but save the files in blob storage. Here are my reasons:
Blob storage is meant for exactly this purpose. As of today, an account can hold 500 TB of data and each blob can be up to 200 GB in size, so space is not a limitation.
Compared to SQL Server, storing data in blob storage is extremely cheap.
The reason I recommend storing the metadata in a database is that blob storage is a simple object store without any querying capabilities. If you want to search for files, you query your database to find them and then return the file URLs to your users.
However, please keep in mind that because these (database server and blob storage) are two distinct data stores, you won't be able to achieve transactional consistency. When creating files, I would recommend uploading the file to blob storage first and then creating the record in the database. Likewise, when deleting files, I would recommend deleting the record from the database first and then removing the blob. If you're concerned about having orphaned blobs (i.e. blobs without a matching record in the database), I would recommend running a background task that finds the orphaned blobs and deletes them.
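Here is a rough sketch of option 1 with that ordering, using azure-storage-blob and pyodbc; the Articles table, its columns, and the container name are hypothetical:

```python
# Sketch only: container, table and connection details are hypothetical.
import uuid
import pyodbc
from azure.storage.blob import BlobServiceClient

blobs = BlobServiceClient.from_connection_string("<storage-connection-string>")
sql = pyodbc.connect("<sql-connection-string>")

def create_article(file_path, file_name, file_type):
    # 1. Upload the file to blob storage first...
    blob_name = f"{uuid.uuid4()}-{file_name}"
    blob = blobs.get_blob_client(container="articles", blob=blob_name)
    with open(file_path, "rb") as data:
        blob.upload_blob(data)

    # 2. ...then create the metadata record (with the blob URL) in SQL Server.
    sql.cursor().execute(
        "INSERT INTO Articles (FileName, FileType, BlobUrl) VALUES (?, ?, ?)",
        file_name, file_type, blob.url,
    )
    sql.commit()

def delete_article(article_id):
    # 1. Delete the database record first...
    cur = sql.cursor()
    cur.execute("SELECT BlobUrl FROM Articles WHERE ArticleId = ?", article_id)
    row = cur.fetchone()
    cur.execute("DELETE FROM Articles WHERE ArticleId = ?", article_id)
    sql.commit()

    # 2. ...then remove the blob; a background orphan-cleanup job catches failures here.
    if row:
        blob_name = row.BlobUrl.rsplit("/", 1)[-1]
        blobs.get_blob_client(container="articles", blob=blob_name).delete_blob()
```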
I'm currently putting together a basic POC of the architecture for an application that will store a large number of records, where each record has one field containing a few thousand characters.
e.g.
TableID int
Field1 nvarchar(50)
Field2 nvarchar(50)
Field3 nvarchar(MAX)
This is all hosted in Azure. We have one WebJob that obtains the data and populates the data store, and another WebJob that comes through periodically and processes the data.
Currently the data is simply stored in an Azure SQL database. I'm worried that once the record count reaches the many millions, it's going to be incredibly inefficient to store/process/retrieve the data this way.
I'd like advice on the best way to store this in Azure. I wanted to start with an approach where we keep the rows in Azure SQL but push the large field's data into another repository (e.g. Data Lake, DocumentDB) with a reference back to the SQL record, so the SQL calls stay lean and the big data is stored somewhere else. Is this a clean manner of doing it, or am I totally missing something?
Azure Table Storage can help with this solution: it is a NoSQL key-value store, and each entity can be up to 1 MB in size. You could also use individual blobs. There is a design guide with a full description of how to design Table Storage solutions for scale, including patterns for using Table Storage alongside other repositories; see the Table Design Guide:
https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/
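Here is a small sketch of the "lean row in Azure SQL, large field in Table Storage" pattern, using the azure-data-tables and pyodbc packages; the table and column names are assumptions, not anything from the question:

```python
# Sketch only: table and column names are hypothetical.
import pyodbc
from azure.data.tables import TableServiceClient

payloads = TableServiceClient.from_connection_string(
    "<storage-connection-string>").create_table_if_not_exists("RecordPayloads")
sql = pyodbc.connect("<azure-sql-connection-string>")

def save_record(table_id, field1, field2, big_text):
    # 1. The large text field goes to Table Storage (entities can be up to 1 MB).
    payloads.upsert_entity({
        "PartitionKey": "payload",
        "RowKey": str(table_id),   # the SQL key doubles as the back-reference
        "Body": big_text,
    })

    # 2. Azure SQL keeps only the lean columns; reads look up the payload via TableID.
    sql.cursor().execute(
        "INSERT INTO Records (TableID, Field1, Field2) VALUES (?, ?, ?)",
        table_id, field1, field2,
    )
    sql.commit()
```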