"No entities to label" has any cost in AWS Ground truth labeling? - amazon-sagemaker

I'm using Amazon SageMaker Ground Truth to label texts. During the process I noticed there is an option "No entities to label", and I was wondering: if I select this option, does the object still incur a cost to process?

Related

Should I use a NoSQL database for a transactional app?

So I am building a points system app for my high school (50% because I thought it was a great idea, 50% because I want to educate myself about databases).
Essentially, my cafeteria uses a coupon system (physical paper) to provide a currency to buy food. Customers give cash in exchange for these coupons. I want to create a system and an app in which the customer gives cash, and the staff could log that 'customer x has y points in the system, therefore they can buy food worth y amount of points'. For this, you obviously need a database which is very durable, and preferably ACID compliant, as transactions such as these are very important and data loss could be devastating in these situations.
Except, for the data model I have, I don't have any relational data. If I opted for NoSQL, I could create a JSON doc like this for every student:
{"id": 0000000, "name": "DarthKitten", "balance": 30, "account_frozen": 0}
Whenever a customer deposits or withdraws points, it's a very simple write function to 'Balance'. There could be a single JSON doc which would contain the food names and prices (as NoSQL is schemaless) which the cafeteria could modify to their desires using the app. The app would pull this data and subtract it from the balance. I am also planning to add a Redis Cache over whichever database to increase checkout speed.
People here on Stack Overflow say you NEED to use an RDBMS for these situations because they are secure and ACID compliant, but when I look at a NoSQL database like CouchDB, which sports high levels of data durability, I see no point in using an RDBMS at all. Thoughts from more experienced devs? And no, I don't want comments about the viability of my school actually using this; whether they do or not, I'm still learning something about programming. Thank you :).
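For concreteness, here is a minimal sketch of the kind of atomic check-and-deduct this question is really about, using Python's built-in sqlite3 as a stand-in relational store; the table, column, and function names are invented for illustration, not taken from the question.

```python
import sqlite3

# Stand-in relational store; any ACID-compliant RDBMS behaves similarly here.
conn = sqlite3.connect("cafeteria.db")
conn.execute("""CREATE TABLE IF NOT EXISTS students (
                    id TEXT PRIMARY KEY,
                    name TEXT,
                    balance INTEGER NOT NULL,
                    account_frozen INTEGER NOT NULL DEFAULT 0)""")

def spend_points(student_id: str, cost: int) -> bool:
    """Atomically deduct `cost` points, refusing to overdraw or touch frozen accounts."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on exception
            updated = conn.execute(
                "UPDATE students SET balance = balance - ? "
                "WHERE id = ? AND account_frozen = 0 AND balance >= ?",
                (cost, student_id, cost),
            ).rowcount
        return updated == 1  # False if unknown, frozen, or insufficient balance
    except sqlite3.Error:
        return False
```

The point is that the balance check and the deduction happen as one atomic statement inside a transaction, which is exactly what ACID buys you here; some document stores can do the same with conditional updates, but you have to verify that the one you pick actually offers that guarantee.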

Is a graph database really good for monitoring an IT enterprise?

I have a simple question about how useful graph databases are for monitoring an IT enterprise.
Many graph database vendors suggest that graph databases are optimal for monitoring IT resources, such as in Ref 1 or Ref 2.
To monitor IT resources, you need to manage the log data coming from the IT enterprise effectively.
However, many books on graph database modeling lead us to model this with the time-tree method.
I think the time-tree method will generate too many edges and will hurt the performance of the graph database.
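To make the "too many edges" worry concrete, here is a rough, invented comparison (not taken from any vendor's material) that counts edges for a year/month/day/hour time tree versus simply storing the timestamp as a property on each event node, using networkx only as a convenient way to count:

```python
import datetime
import random

import networkx as nx

# 10,000 synthetic log events spread over one year (hypothetical data).
start = datetime.datetime(2023, 1, 1)
events = [start + datetime.timedelta(minutes=random.randrange(365 * 24 * 60))
          for _ in range(10_000)]

# Model A: time tree. Every event hangs off a year -> month -> day -> hour hierarchy.
tree = nx.DiGraph()
for i, ts in enumerate(events):
    year, month, day, hour = f"{ts:%Y}", f"{ts:%Y-%m}", f"{ts:%Y-%m-%d}", f"{ts:%Y-%m-%d %H}"
    tree.add_edges_from([(year, month), (month, day), (day, hour), (hour, f"event-{i}")])

# Model B: the timestamp is just a property on the event node; no time edges at all.
flat = nx.DiGraph()
for i, ts in enumerate(events):
    flat.add_node(f"event-{i}", timestamp=ts.isoformat())

print("time-tree edges:     ", tree.number_of_edges())  # hierarchy edges plus one per event
print("property-model edges:", flat.number_of_edges())  # 0
```

With millions of log entries per day, model A's edge count grows with the data volume, which is the performance concern raised above; model B avoids it, but then time-range queries are no longer graph traversals.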
So... is a graph database really good for monitoring an IT enterprise?
In my opinion, the questioner is conflating "monitoring IT assets, as an example of tracing complex relationships" with "monitoring log data for device history management".
For an IT enterprise that provides numerous services, if one logical or physical system fails, the impact can spread to other assets associated with it.
(This is similar to a disease spreading through a network.)
A graph database is very well suited to building a system that tracks this kind of impact.
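As a toy illustration of that kind of impact traversal (invented asset names, plain networkx rather than a real graph database): finding everything downstream of a failed component is a single reachability query.

```python
import networkx as nx

# Hypothetical asset-dependency graph: an edge A -> B means "B depends on A".
deps = nx.DiGraph([
    ("switch-core-1", "hypervisor-3"),
    ("hypervisor-3", "vm-db-primary"),
    ("hypervisor-3", "vm-app-1"),
    ("vm-db-primary", "billing-service"),
    ("vm-app-1", "billing-service"),
    ("vm-app-1", "customer-portal"),
])

failed = "hypervisor-3"
impacted = nx.descendants(deps, failed)  # everything reachable downstream of the failure
print(f"Failure of {failed} potentially impacts: {sorted(impacted)}")
```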
Next, the asker's question is a little more fundamental: to identify the cause of a specific malfunction in IT asset management, the asset's log data must also be analyzed.
I understand this part to be about how to model time-series data on a graph database.
In conclusion, an RDB is better than a GDB for analyzing log data for history management (to be specific, a time-series database is best).
The reason is the same one the questioner mentioned.
(It is true that a graph DB beats an RDB at edge traversal, but paradoxically, blind faith in that leads to models that generate far too many edges, which actually hurts performance. This is called the dense-network problem.)
Therefore, it is recommended to implement the log-analysis function in a time-series database and link it in via a DB link or FDW, or to use a multi-model GDB + RDB such as AgensGraph, Oracle, or OrientDB.

Advice on an appropriate database for click-logging reporting

I am about to build a service that logs clicks and transactions from an e-commerce website. I expect to log millions of clicks every month.
I will use this to run reports to evaluate marketing efforts and site usage (similar to Google Analytics*). I need to be able to make queries, such as best selling product, most clicked category, average margin, etc.
*As some actions occur at later times and offline, GA doesn't fulfill all our needs.
The reporting system will not have a heavy load and it will only be used internally.
My plan is to place loggable actions in a queue and have a separate system store them in a database.
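A minimal sketch of that decoupling, using only the Python standard library (an in-process queue and SQLite stand in for whatever queue service and database end up being chosen; names are invented):

```python
import queue
import sqlite3
import threading

click_queue: "queue.Queue[tuple]" = queue.Queue()

def writer() -> None:
    """Drain the queue and persist clicks; runs outside the request path."""
    db = sqlite3.connect("clicks.db")
    db.execute("CREATE TABLE IF NOT EXISTS clicks (ts TEXT, session_id TEXT, url TEXT)")
    while True:
        item = click_queue.get()
        if item is None:   # sentinel: flush and stop
            break
        with db:           # one transaction per click; batch these in a real system
            db.execute("INSERT INTO clicks VALUES (?, ?, ?)", item)

t = threading.Thread(target=writer, daemon=True)
t.start()

# The web-facing code only ever enqueues, so logging never blocks a page view.
click_queue.put(("2024-01-01T12:00:00", "session-123", "/products/42"))

click_queue.put(None)
t.join()
```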
My question is what database I should use for this. Due to corporate IT policy I only have these options: SimpleDB (AWS), DynamoDB (AWS), or MS SQL/MySQL.
Thanks in advance!
Best regards,
Fredrik
Have you checked this excellent Amazon documentation page? http://aws.amazon.com/running_databases/ It helps to pick the best database from their products.
From my experience, I would advise that you do not use DynamoDB for this purpose. There is no real select equivalent and you will have a hard time modeling your data. It is feasible, but not trivial.
On the other hand, SimpleDB provides a select operation that would considerably simplify the model. Nonetheless, it is advised against for volumes over 10 GB: http://aws.amazon.com/running_databases/
As for the last option, RDS, I think you can do pretty much everything with it.
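For the kind of reports mentioned (best-selling product, average margin), the relational option comes down to plain aggregate SQL. A sketch against a hypothetical transactions table, shown with SQLite only so it runs self-contained; the same query works on MySQL or MS SQL:

```python
import sqlite3

db = sqlite3.connect("reporting.db")

# Hypothetical schema; in practice this is populated by the queue consumer.
db.execute("""CREATE TABLE IF NOT EXISTS transactions (
                  product_id TEXT, quantity INTEGER, margin REAL, created_at TEXT)""")

# Best-selling products for the previous calendar month, with average margin.
rows = db.execute("""
    SELECT product_id,
           SUM(quantity) AS units_sold,
           AVG(margin)   AS avg_margin
    FROM transactions
    WHERE created_at >= date('now', 'start of month', '-1 month')
      AND created_at <  date('now', 'start of month')
    GROUP BY product_id
    ORDER BY units_sold DESC
    LIMIT 10
""").fetchall()

for product_id, units_sold, avg_margin in rows:
    print(product_id, units_sold, avg_margin)
```

DynamoDB can hold the same data, but as noted above there is no equivalent of an ad-hoc GROUP BY, so you end up maintaining pre-aggregated counters or scanning the table yourself.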

Why do relational databases have scalability issues?

Recently I read some articles online indicating that relational databases have scaling issues and are not good to use when it comes to big data, especially in cloud computing where the data is big. But I could not find good, solid reasons for why they aren't very scalable by googling. Can you please explain the limitations of relational databases when it comes to scalability?
Thanks.
Imagine two different kinds of crossroads.
One has traffic lights or police officers regulating traffic, motion on the crossroad is at limited speed, and there's a watchdog registering precisely which car drove onto the crossroad at what time and in which direction it went.
The other has none of that, and everyone who arrives at the crossroad, at whatever speed they're driving, just dives in and wants to get through as quickly as possible.
The former is any traditional database engine. The crossroad is the data itself. The cars are the transactions that want to access the data. The traffic lights or police officer is the DBMS. The watchdog keeps the logs and journals.
The latter is a NOACID type of engine.
Both have a saturation point, at which point arriving cars are forced to start queueing up at the entry points. Both have a maximal throughput. That threshold lies at a lower value for the former type of crossroad, and the reason should be obvious.
The advantage of the former type of crossroad should however also be obvious: way less opportunity for accidents to happen. On the second type of crossroad, you can expect accidents not to happen only if traffic density stays far below the theoretical maximal throughput of the crossroad. Translated to data management engines, that means a guarantee of consistent and coherent results, which only the former type of crossroad (the classical database engine, whether relational or networked or hierarchical) can deliver.
The analogy can be stretched further. Imagine what happens if an accident DOES happen. On the second type of crossroad, the primary concern will probably be to clear the road as quickly as possible so traffic can resume, and when that is done, what info is still available to investigate who caused the accident and how? Nothing at all. It won't be known. The crossroad is open, just waiting for the next accident to happen. On the regulated crossroad, there's the police officer regulating the traffic who saw what happened and can testify. There are the logs saying which car entered at precisely what time, at precisely which entry point, at precisely what speed; a lot of material is available for inspection to determine the root cause of the accident. But of course none of that comes for free.
Colourful enough as an explanation?
Relational databases provide solid, mature services according to the ACID properties. We get transaction handling, efficient logging to enable recovery, etc. These are core services of relational databases, and the ones they are good at. They are hard to customize, and might be considered a bottleneck, especially if you don't need them in a given application (e.g. serving website content of low importance; in this case, for example, the widely used MySQL does not provide transaction handling with its default storage engine, and therefore does not satisfy ACID). Lots of "big data" problems don't require these strict constraints, for example web analytics, web search, or processing moving-object trajectories, as they already include uncertainty by nature.
When reaching the limits of a given computer (memory, CPU, disk: the data is too big, or data processing is too complex and costly), distributing the service is a good idea. Lots of relational and NoSQL databases offer distributed storage. In this case, however, ACID turns out to be difficult to satisfy: the CAP theorem states something similar, that availability, consistency, and partition tolerance cannot all be achieved at the same time. If we give up ACID (satisfying BASE for example), scalability might be increased.
See this post, for example, for a categorization of storage methods according to CAP.
Another bottleneck might be the flexible and cleverly typed relational model itself, with its SQL operations: in lots of cases a simpler model with simpler operations would be sufficient and more efficient (like untyped key-value stores). The common row-wise physical storage model might also be limiting: for example, it isn't optimal for data compression.
There are however fast and scalable ACID compliant relational databases, including new ones like VoltDB, as the technology of relational databases is mature, well-researched and widespread. We just have to select an appropriate solution for the given problem.
Take the simplest example: inserting a row with a generated ID. Since IDs must be unique within the table, the database must somehow lock some sort of persistent counter so that no other INSERT uses the same value. So you have two choices: either allow only one instance to write data or have a distributed lock. Both solutions are a major bottleneck, and this is the simplest example!
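That serialization can be shown in miniature with a toy in-process sketch (invented class, not a real database or distributed lock): every writer has to take the same lock around the counter, so adding writers doesn't add throughput.

```python
import threading

class IdAllocator:
    """Toy stand-in for a database's persistent ID counter."""
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._next = 0

    def allocate(self) -> int:
        with self._lock:          # every writer in the system contends here
            self._next += 1
            return self._next

allocator = IdAllocator()
rows = []

def writer(n: int) -> None:
    for _ in range(n):
        rows.append(allocator.allocate())  # each "insert" is gated by the shared lock

threads = [threading.Thread(target=writer, args=(10_000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(rows), "ids issued,", len(set(rows)), "unique")  # unique only because of the lock
```

Spread across machines, that same lock has to be taken over the network, which is the bottleneck being described.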

Alternative to Apatar for scripted data migration

I'm looking for the fastest-to-success alternative solution for related data migration between Salesforce environments with some specific technical requirements. We were using Apatar, which worked fine in testing, but late in the game it has started throwing the dreaded socket "connection reset" errors and we have not been able to resolve it - it has some other issues that are leading me to ditch it.
I need to move a modest amount of data (about 10k rows total) between several sandboxes and ultimately to a production environment. The data is spread across eight custom objects. There is a four-level-deep master-detail relationship, which obviously must be preserved.
The target environment tables are 100% empty.
The trickiest object has a master-detail and two lookup fields.
Ideally, the data from one table near the top of the hierarchy should be filtered by a simple WHERE clause, with children that don't fall under matching rows left unmigrated, but I'll settle for a solution that migrates all the data wholesale.
My fallback in this situation is going to be good old Data Loader, but it's not ideal because our schema is locked down and does not contain external ID fields, so scripting a solution that preserves all the M-D and lookups will take a while and be more error prone than I'd like.
It's been a long time since I've done a survey of the tools available, and don't have much time to do one now, so I'm appealing to the crowd. I need an app that will be simple (able to configure and test very quickly), repeatable, and rock-solid.
I've always pictured an SFDC data migration app that you can just check off eight checkboxes from a source environment, point it to a destination environment, and it just works, preserving all your relationships. That would be perfect. Free would be nice too. Does such a shiny thing exist?
Sesame Relational Junction seems to best match what you're looking for. I haven't used it, though; so, I can't comment on its effectiveness for what you're attempting.
The other route you may want to look into is using the Bulk API or using the Data Loader CLI with Task Scheduling.
You may find this information (below), from an answer to a different question, helpful.
Here is a list of integration services (other than Apatar):
Informatica Cloud
Cast Iron
SnapLogic
Boomi
JitterBit
Sesame Relational Junction
Information on other tools to integrate Salesforce with other databases is available here:
Salesforce Web Services API
Salesforce Bulk API
Relational Junction has a unique feature set that supports cloning, splitting, and merging of Salesforce orgs, and will keep the relationships intact in a one-pass load. It works like this:
Download source org to an empty database schema (any relationship DBMS)
Download target org to a second empty database schema
Run some scripts to condition the data; this varies by object. Sesame provides guidance and sample scripts, but essentially you have to set a control field to tell Relational Junction whether to create or update in Salesforce. This is also where you may need to replace source IDs with target IDs if some objects have been pre-populated during sandbox creation (a sketch of that remapping follows below)
Replicate the second database to the target org
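The ID replacement mentioned above is the fiddly part. Conceptually it is just a lookup table from source IDs to target IDs applied to every relationship field before loading each level of the hierarchy; a hypothetical sketch with invented object and field names:

```python
# Hypothetical child records exported from the source org; the relationship
# fields (Parent__c, Lookup_A__c) still hold source-org IDs at this point.
source_children = [
    {"Id": "a01SRC000001", "Name": "Child 1", "Parent__c": "a00SRC000009", "Lookup_A__c": None},
    {"Id": "a01SRC000002", "Name": "Child 2", "Parent__c": "a00SRC000010", "Lookup_A__c": "a02SRC000003"},
]

# Built up as each parent level is loaded into the target org: source ID -> new target ID.
id_map = {
    "a00SRC000009": "a00TGT000101",
    "a00SRC000010": "a00TGT000102",
    "a02SRC000003": "a02TGT000205",
}

RELATIONSHIP_FIELDS = ("Parent__c", "Lookup_A__c")

def remap(record: dict, id_map: dict) -> dict:
    """Return a copy ready to insert into the target org, with source IDs swapped out."""
    out = {k: v for k, v in record.items() if k != "Id"}  # the target org assigns its own Id
    for field in RELATIONSHIP_FIELDS:
        if out.get(field):
            # A KeyError here means the referenced parent was never loaded: fail loudly.
            out[field] = id_map[out[field]]
    return out

ready_to_load = [remap(r, id_map) for r in source_children]
print(ready_to_load)
```

Repeat this level by level down the master-detail hierarchy, capturing the new IDs returned by each load before touching the next level; without external ID fields, every level depends on getting that bookkeeping right, which is why the hand-rolled Data Loader route is more error prone.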
Relational Junction handles the socket disconnects, timeouts, and whatever havoc happens during the unload/reload process gracefully and without creating duplicates.
This process was developed for a proof of concept at a large Silicon Valley network vendor in 2007, who became a customer. The entire down and up of 15 GB of data took 46 hours, plus about 2 days of preparation.
