Using the API to analyze a Twitter stream, I am getting very similar openness results for pretty much everybody. How can I train a corpus to generate different output?
Unfortunately, you can't. Also, I am afraid Twitter is not the best source for this kind of analysis, since each tweet contains only a small piece of text. Watson Personality Insights works better with large text samples, and Twitter sentences are most likely too short to provide enough information for this kind of analysis (even if you concatenate several tweets into one text sample).
But if you are getting meaningful results for the other dimensions, what I'd suggest is to ignore the openness information and calculate it with another algorithm (your own?), or even to check whether simply removing this dimension gives results that are good enough for you.
There are some nice tips here -- https://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/personality-insights/science.shtml and some references to papers that can help you understand the algorithm internals.
You cannot train Watson Personality Insights in the current version. But there may be alternatives.
From your message it is not clear to me whether you are getting overly similar results for individual tweets or for entire Twitter streams. In the first case, as Leo pointed out in a different answer, please note that you should provide enough text for the analysis to be meaningful (that is 3,000+ words, not just a tweet). In the second case, I would be a bit surprised if your scores are still so similar with that much text (how many tweets per user?), but this may still happen depending on the domain.
If you are analyzing individual tweets, you may also benefit from using Tone Analyzer (in beta as of today). Its "social tone" is basically the same model as Personality Insights, and it gives raw scores even for small texts. (And by the way, you get other measures such as emotions and writing style.)
In any case (small or large inputs), we encourage users to take a look at the raw scores in their own data corpus. For example, say you are analyzing a set of IT support calls (I am making this up): you will likely find some traits come out nearly identical because the jargon and writing style are similar across all of them. However, within your domain there may be smaller differences worth focusing on, i.e. there is still a top 10% and a bottom 10% in each trait. So you might want to do some data analysis on the Personality Insights raw_score (API reference), or just the score in Tone Analyzer (API reference), and draw your own conclusions.
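As a rough sketch of that kind of corpus-level analysis (the scores below are made-up placeholders, and the 10%/90% cutoffs are just the thresholds mentioned above, not anything prescribed by the API):

```python
import statistics

# Hypothetical raw openness scores, one per Twitter author in your corpus.
raw_openness = [0.71, 0.74, 0.69, 0.83, 0.72, 0.77, 0.66, 0.80, 0.73, 0.75]

# Even if the absolute values cluster together, the spread within your own
# domain can still be meaningful: rank authors against each other.
sorted_scores = sorted(raw_openness)
p10 = sorted_scores[int(0.10 * (len(sorted_scores) - 1))]
p90 = sorted_scores[int(0.90 * (len(sorted_scores) - 1))]

print(f"mean={statistics.mean(raw_openness):.3f} stdev={statistics.pstdev(raw_openness):.3f}")
print(f"bottom 10% cutoff={p10:.2f}, top 10% cutoff={p90:.2f}")

# Flag the authors who stand out within this particular corpus.
outliers = [i for i, s in enumerate(raw_openness) if s <= p10 or s >= p90]
print("authors at the extremes:", outliers)
```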
I've pretty much understood what Raft is and implemented it in MIT 6.824 Distributed Systems. I also know what basic Paxos is, but I've not implemented it yet, so I can't grasp all of its details. Multi-Paxos confuses me even more, i.e., WHY can it eliminate lots of Prepare RPCs? I know the answer should be that Multi-Paxos can have a fixed leader, along with the noMoreAccepted response from other peers, to determine whether Prepare RPCs can be skipped. But I can't get it at a detailed level: why and how does that work?
I want to get some more recommendations: articles, sample code, or anything else that can help with Multi-Paxos.
I've read Paxos Made Live and Paxos Made Simple; those two papers gave me a basic idea of what Paxos is and how it works.
I've also watched https://www.youtube.com/watch?v=YbZ3zDzDnrw&t=3035s&ab_channel=DiegoOngaro several times. It's a great talk, but it does not go into many details.
To answer your specific question:
WHY can it eliminate lots of Prepare RPCs?
On page 10 of the paper Paxos Made Simple it says:
A newly chosen leader executes phase 1 for infinitely many instances of the consensus algorithm—in the scenario above, for instances 135–137 and all instances greater than 139.
That is saying that if a leader broadcasts Prepare(135, n), a prepare for instance 135 using ballot number n, then it is valid to define this prepare as applying to all instances >= 135 that are not yet fixed. We can reason that it is safe for any node to "spam" out prepare messages for an infinite number of the unfixed positions in our log stream, because for each position each acceptor applies the rules of Paxos to that position. We can therefore compress that infinite set of prepare messages down to a single one that applies to all higher unfixed positions, eliminating all but one prepare message for the term of a stable leader. It is a fantastic optimisation.
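To make that concrete, here is a minimal sketch of an acceptor treating one Prepare as covering every log slot at or above the slot named in the message. It is only illustrative (names such as `promised_ballot` and the message tuples are made up, not taken from any real implementation):

```python
class Acceptor:
    """Sketch: one promise covers all open slots, per Paxos Made Simple p10."""

    def __init__(self):
        self.promised_ballot = 0   # highest ballot promised; covers all open slots
        self.accepted = {}         # slot -> (ballot, value) accepted so far

    def on_prepare(self, first_open_slot, ballot):
        # A single Prepare(first_open_slot, ballot) stands in for an infinite
        # set of per-slot prepares: we promise it for every slot >= first_open_slot.
        if ballot <= self.promised_ballot:
            return ("nack", self.promised_ballot)
        self.promised_ballot = ballot
        # Phase 1b: report anything already accepted in the covered range so
        # the new leader can re-propose those values.
        already = {s: va for s, va in self.accepted.items() if s >= first_open_slot}
        return ("promise", ballot, already)

    def on_accept(self, slot, ballot, value):
        # Once a stable leader holds promises for all higher slots, it skips
        # straight to Accept messages for each new slot under the same ballot.
        if ballot < self.promised_ballot:
            return ("nack", self.promised_ballot)
        self.accepted[slot] = (ballot, value)
        return ("accepted", slot, ballot)
```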
You asked about any example code. I wrote an implementation of multi-paxos using functional programming in Scala that aims to be true to the paper Paxos Made Simple over at https://github.com/trex-paxos/trex. The core state is PaxosData, the message protocol is at the bottom of PaxosProtcol and the algorithm is a set of message matching functions in PaxosAlgorithm. The algorithm takes the current immutable state and an immutable message as input and outputs the next immutable state for the node. Common behaviours are written as partial functions that have full unit tests. These partial functions are composed into complete functions used by leaders, followers and candidate leaders. There is a write up at this blog.
It adds additional messages to the basic set as optimisations that speed up log replication. Those involve some implementation details that Lamport does not get into in his paper. An example is that negative acknowledgements are used to pass information between nodes, to try to avoid interrupting a stable leader when only one network link between a node and the leader has failed. TRex tries to keep those features to a minimum to create a basic but complete solution.
An answer you might find helpful about Multi-Paxos is this one, which discusses why Multi-Paxos is called that: https://stackoverflow.com/a/26619261/329496
There is also this one about how the original Part-Time Parliament paper uses a leader and also describes a stable leader running Multi-Paxos: https://stackoverflow.com/a/46012211/329496
Finally, you might enjoy my defence of Paxos post The Trial Of Paxos Algorithm.
I am preparing for the system design interview, and since I have little experience with this topic, I bought the "Grokking the system design interview" course from educative.io, which was recommended by several websites.
However, having read it, I think I did not manage to understand several things, so if someone could answer my questions, that would be helpful.
Since I have no experience with NoSQL, I find it difficult to choose the proper DB system. Several times the course just does not give any reasoning for why it chose one DB over another. For example, in the chapter "Designing Youtube or Netflix" the editors chose MySQL for the DB role with no explanation. In the same chapter we have the following non-functional requirements:
"The system should be highly available. Consistency can take a hit (in
the interest of availability); if a user doesn’t see a video for a
while, it should be fine."
Following the above hint, taking into account the size of the system, and applying the material in the "CAP theorem" chapter, it seems to me that Cassandra or CouchDB would be a better choice. What am I missing here?
The same question goes for "Designing Facebook’s Newsfeed".
Is the CAP theorem still applicable?
What I mean is: according to the "CAP theorem" chapter, HBase is good at consistency and partition tolerance, but according to the HBase documentation it also supports high availability since version 2.x. So it seems to be a one-size-fits-all / universal solution for DB storage, which goes against the CAP theorem, unless they sacrificed something for HA. What am I missing here?
The numbers around the course for how much RAM/storage/bandwidth a computer can handle are somewhat inconsistent; I guess they are outdated. What are the current numbers for (a) regular computers and (b) modern servers?
Almost every chapter has a part called "Capacity Estimation and Constraints", but what is calculated there changes from chapter to chapter. Sometimes only storage is calculated, often bandwidth too, sometimes QPS is added, and sometimes there are task-specific metrics. How do I know what I should calculate for a specific task?
Thanks in advance!
Each database is different and fulfills different requirements. I recommend you read the Dynamo paper, familiarize yourself with the rest of the terminology used in it (two-phase locking, leader/follower, multi-leader, async/sync replication, quorums), and learn what guarantees the different databases provide. Now to the questions:
MySQL can be configured to prioritize Availability at the cost of Consistency with its asynchronous replication model (the leader doesn't wait for acknowledgement from its followers before committing a write; if a leader crashes before the data gets propagated to the followers, the data is lost), so it can be one of the suitable solutions here.
According to the HBase documentation, HBase guarantees strong consistency, even at the cost of availability.
The promise of high availability is for reads, not for writes i.e. for reading stale data while the rest of the system recovers from failure and can accept additional writes.
because of this single homing of the reads to a single location, if the server becomes unavailable, the regions of the table that were hosted in the region server become unavailable for some time.
Since all writes still have to go through the primary region, the writes are not highly-available (meaning they might block for some time if the region becomes unavailable).
The numbers used are estimates made by the candidate, i.e. you decide what the specs of a single hypothetical server are and how many servers you would need in order to scale and accommodate the storage/throughput requirements.
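As an illustration of the kind of back-of-the-envelope arithmetic this usually involves (every input below is a made-up assumption, not a number from the course):

```python
# Hypothetical video-sharing service; all inputs are assumptions to be stated out loud.
uploads_per_min   = 500      # videos uploaded per minute
avg_video_minutes = 5
mb_per_minute     = 50       # average encoded size per minute of video
retention_years   = 5
views_per_upload  = 200      # assumed read:write ratio of 200:1

storage_per_day_gb = uploads_per_min * 60 * 24 * avg_video_minutes * mb_per_minute / 1024
total_storage_pb   = storage_per_day_gb * 365 * retention_years / (1024 * 1024)

# Average egress if each view eventually downloads the whole video.
egress_gbps = (uploads_per_min / 60) * views_per_upload * avg_video_minutes * mb_per_minute * 8 / 1024

print(f"storage per day        : ~{storage_per_day_gb:,.0f} GB")
print(f"storage for {retention_years} years    : ~{total_storage_pb:.0f} PB")
print(f"average read bandwidth : ~{egress_gbps:,.0f} Gbit/s")
```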
You don't know in advance (although you can make a guess based on the requirements, e.g. whether it's a data storage system, a streaming service, etc.; I still wouldn't recommend it). Instead, ask the interviewer what area they are interested in and make estimates for that. The interview, especially the system design part, is a discussion; don't follow a template to the letter. Recognize the different areas of the system you can tackle, and approach them based on the interviewer's interest.
I want to improve IBM's Watson Assistant results.
So I want to know the algorithm that determines a dialog in Watson Assistant's conversations.
Is it an SVM algorithm?
A paper is welcome.
There are a number of ML/NLP technologies under the covers of Watson Assistant, so it's not just a single algorithm, and knowing them is not going to help you improve your results.
I want to improve IBM's Watson Assistant results.
There are a number of ways.
Representative questions.
Focus on getting truly representative questions from the end users, not only in the language they use but, if possible, from the same medium you plan to use WA on (e.g. mobile device, web, audio).
Manufactured questions are the first factor that reduces accuracy. Manufacturing an intent can mean you build an intent that a customer may never ask about (even if you think they will). Second, you will tend to use language/terms with similar patterns, which makes it harder for WA to train.
Total training questions
It's possible to train an intent with one question, but for best results use 10-20 example questions. Where intents are close together, more examples are needed.
Testing
The current process is to run what is called a k-fold cross-validation (sample script). If your questions are representative, then the results should give you an accurate indicator of how well the system is performing.
However, it is possible to overfit the training, so you should also use a blind set: 10-20% of all questions (a random sample) that are never used to train WA. Run them against the system; your blind and k-fold results should fall within 5% of each other.
You can look at the k-fold results to fix issues, but you should not do that with the blind set. Blind sets can also go stale, so try to create a new blind set after 2-3 training cycles. (A sketch of this split follows below.)
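A minimal sketch of that split, assuming your labelled questions are a list of (text, intent) pairs. `train_workspace` and `classify` are placeholders for whatever Watson Assistant tooling you use to train and test, not real API calls:

```python
import random

def evaluate(examples, k=5, blind_fraction=0.15, seed=42):
    """examples: list of (question_text, intent) pairs."""
    rng = random.Random(seed)
    data = examples[:]
    rng.shuffle(data)

    # Hold out a blind set that is never used for training or error analysis.
    blind_size = int(len(data) * blind_fraction)
    blind, pool = data[:blind_size], data[blind_size:]

    # k-fold cross-validation over the remaining pool.
    fold_scores = []
    fold_len = len(pool) // k
    for i in range(k):
        test = pool[i * fold_len:(i + 1) * fold_len]
        train = pool[:i * fold_len] + pool[(i + 1) * fold_len:]
        model = train_workspace(train)   # placeholder: train a workspace on these examples
        hits = sum(classify(model, q) == intent for q, intent in test)
        fold_scores.append(hits / len(test))
    kfold_acc = sum(fold_scores) / k

    # Score the blind set once, against a model trained on the full pool.
    final_model = train_workspace(pool)
    blind_acc = sum(classify(final_model, q) == intent for q, intent in blind) / len(blind)

    # Per the advice above, the two numbers should be within ~5% of each other.
    return kfold_acc, blind_acc
```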
End user testing.
No matter how well your system is trained, I can guarantee that new things will pop up when it is put in front of end users, so you should plan to have users test before you put it into production.
When getting users to test, ensure they understand the general areas it has been trained on. You can do this with user stories, but try not to prime the user into asking a narrowly scoped question.
Example:
"Your phone is not working and you need to get it fixed" - Good. They will ask questions you will never have seen before.
"The wifi on your phone is not working. Ask how you would fix it". - Bad. Very narrow scope and people will mention "wifi" even if they don't know what it means.
I have an AWS RDS (Aurora DB) instance and I want to mask the data in the DB. Does Amazon provide any service for data masking?
I have seen RDS encryption, but I am looking for data masking because the database contains sensitive data. So I want to know whether there is any service they provide for data masking, or any other tool which can be used to mask the data and then load it manually into the DB.
A list of tools which can be used for data masking would be most appreciated, if there are any for my case. I need to mask the data for testing, as the original DB contains sensitive information like PII (personally identifiable information). I also have to transfer this data to my co-workers, so I consider data masking an important factor.
Thanks.
This is a fantastic question, and I think your proactive approach to securing the most valuable asset of your business is something a lot of people should heed, especially if you're sharing the data with your co-workers. Letting people see only what they need to see is an undeniably good way to reduce your attack surface. Standard cyber security methods are no longer enough, in my opinion, as demonstrated by numerous attacks and by people losing laptops/USBs with sensitive data on them. We are just humans, after all. With the GDPR coming into force in May next year, any company with customers in the EU will have to demonstrate privacy by design, and anonymisation techniques such as masking have been cited as a way to show this.
NOTE: I have a vested interest in this answer because I am working on such a service you're talking about.
We've found that your masking method will depend on your exact use case, the size of the data set, and its contents. If your data set has minimal fields and you know where the PII is, you can run standard queries to replace sensitive values, e.g. John -> XXXX. If you want to maintain some human readability, there are libraries such as Python's Faker that generate random, locale-based PII you can replace your sensitive values with (PHP Faker, Perl Faker and Ruby Faker also exist).
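For instance, a minimal sketch of that kind of replacement with Python's Faker (the column names and the consistency mapping are illustrative; adapt them to your own schema):

```python
from faker import Faker

fake = Faker()      # e.g. Faker("de_DE") for locale-specific values
replacements = {}   # keep masking consistent: the same original always maps to the same fake value

def mask_value(original, kind):
    generators = {
        "name": fake.name,
        "email": fake.email,
        "address": fake.address,
        "phone": fake.phone_number,
    }
    if original not in replacements:
        replacements[original] = generators[kind]()
    return replacements[original]

row = {"name": "John Smith", "email": "john@example.com",
       "address": "123 ABC Street POBox 456", "balance": 1042.17}

masked = {
    "name": mask_value(row["name"], "name"),
    "email": mask_value(row["email"], "email"),
    "address": mask_value(row["address"], "address"),
    "balance": row["balance"],   # non-PII columns pass through untouched
}
print(masked)
```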
DISCLAIMER: Straightforward masking doesn't guarantee total privacy. Think of someone identifying individuals in a masked Netflix data set by cross-referencing it with time-stamped IMDB data, or Guardian reporters identifying a judge's porn preferences from masked ISP data.
Masking also gets tedious as your data set grows in fields/tables and you perhaps want to set up different levels of access for different co-workers, e.g. data science gets lightly anonymised data while marketing gets access to heavily anonymised data. PII in free-text fields is annoying, and generally understanding what data is available in the world that attackers could use for cross-referencing is a big task.
The service I'm working on aims to alleviate all of these issues by automating the process with NLP techniques and a good understanding of the mathematics of anonymisation. We're bundling this up into a web service and are keen to launch on the AWS Marketplace. So I would love to hear more about your use case, and if you want early access, we're in private beta at the moment, so let me know.
If you are exporting or importing data using CSV or JSON files (e.g. to share with your co-workers), then you could use FileMasker. It can be run as an AWS Lambda function reading/writing CSV/JSON files on S3.
It's still in development but if you would like to try a beta now then contact me.
Disclaimer: I work for DataVeil, the developer of FileMasker.
I am interested in learning how a database engine works (i.e. its internals). I know most of the basic data structures taught in CS (trees, hash tables, lists, etc.) and have a pretty good understanding of compiler theory (I have implemented a very simple interpreter), but I don't understand how to go about writing a database engine. I have searched for tutorials on the subject and couldn't find any, so I am hoping someone can point me in the right direction. Basically, I would like information on the following:
How the data is stored internally (i.e. how tables are represented, etc.)
How the engine finds data that it needs (e.g. run a SELECT query)
How data is inserted in a way that is fast and efficient
And any other topics that may be relevant. It doesn't have to be an on-disk database; even an in-memory database is fine (if that is easier), because I just want to learn the principles behind it.
Many thanks for your help.
If you're good at reading code, studying SQLite will teach you a whole boatload about database design. It's small, so it's easier to wrap your head around. But it's also professionally written.
SQLite 2.5.0 for Code Reading
http://sqlite.org/
The answer to this question is a huge one; expect a PhD thesis to answer it 100% ;)
But we can think about the problems one by one:
How to store the data internally:
You should have a data file containing your database objects, and a caching mechanism that loads the data in focus, plus some data around it, into RAM.
Assume you have a table with some data. We would create a data format to convert this table into a binary file, by agreeing on the definition of a column delimiter and a row delimiter, and making sure this delimiter pattern is never used in the data itself, i.e. if you have selected <*> to separate columns, you should validate that the data you place in this table never contains this pattern. You could also use a row header and a column header, specifying the size of each row and some internal indexing number to speed up your search, and storing the length of each column at its start.
like "Adam", 1, 11.1, "123 ABC Street POBox 456"
you can have it like
<&RowHeader, 1><&Col1,CHR, 4>Adam<&Col2, num,1,0>1<&Col3, Num,2,1>111<&Col4, CHR, 24>123 ABC Street POBox 456<&RowTrailer>
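As a rough illustration of that idea (the header format below is made up for the example, not any real engine's layout), a row could be serialized like this:

```python
def encode_row(row_id, values):
    """Serialize one row with a row header and per-column length headers,
    so a reader can skip across columns without scanning for delimiters."""
    parts = [f"<&RowHeader,{row_id}>"]
    for value in values:
        text = str(value)
        kind = "NUM" if isinstance(value, (int, float)) else "CHR"
        parts.append(f"<&Col,{kind},{len(text)}>{text}")
    parts.append("<&RowTrailer>")
    return "".join(parts)

row = encode_row(1, ["Adam", 1, 11.1, "123 ABC Street POBox 456"])
print(row)  # <&RowHeader,1><&Col,CHR,4>Adam<&Col,NUM,1>1<&Col,NUM,4>11.1<&Col,CHR,24>...<&RowTrailer>
```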
How to find items quickly
Try using hashing and indexing to point at the data you have stored and cached, based on different criteria.
Taking the same example above, you could sort the values of the first column and store them in a separate object that points at the row IDs of the items sorted alphabetically, and so on; a small sketch follows below.
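Continuing the illustrative sketch, a simple secondary index is just a sorted mapping from column value to row id that can be binary-searched instead of scanning the data file:

```python
import bisect

rows = {1: ("Adam", 1, 11.1, "123 ABC Street POBox 456"),
        2: ("Zoe", 2, 20.5, "9 Elm Road"),
        3: ("Bob", 3, 5.0, "77 Oak Avenue")}

# Index on the first column: a list of (value, row_id) pairs kept sorted by value.
name_index = sorted((values[0], row_id) for row_id, values in rows.items())

def lookup_by_name(name):
    i = bisect.bisect_left(name_index, (name, 0))
    if i < len(name_index) and name_index[i][0] == name:
        return rows[name_index[i][1]]
    return None

print(lookup_by_name("Bob"))   # ('Bob', 3, 5.0, '77 Oak Avenue')
```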
How to speed up data insertion
What I know from Oracle is that it inserts data into a temporary place, both in RAM and on disk, and does housekeeping periodically. The database engine is busy all the time optimizing its structure, but at the same time we do not want to lose data in case of a power failure or something like that.
So try to keep data in this temporary place with no sorting, append it to your original storage, and later, when the system is free, re-sort your indexes and clear the temp area; a toy version is sketched below.
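This is only an illustration of that temp-area idea, not how Oracle actually does it: append new rows to an unsorted write buffer, and fold them into the sorted storage during housekeeping. A real engine would also append buffered rows to a log file on disk before acknowledging the insert, so they survive a power failure.

```python
import bisect

class TinyTable:
    def __init__(self):
        self.storage = []       # "main" storage, kept sorted by key
        self.write_buffer = []  # unsorted temp area for fast appends

    def insert(self, key, value):
        # Fast path: just append, no sorting, no index maintenance.
        self.write_buffer.append((key, value))

    def housekeeping(self):
        # Periodic background work: merge the buffer into sorted storage,
        # then clear the temp area.
        self.storage = sorted(self.storage + self.write_buffer)
        self.write_buffer.clear()

    def get(self, key):
        # Reads must consult both areas until housekeeping has run.
        for k, v in reversed(self.write_buffer):
            if k == key:
                return v
        i = bisect.bisect_left(self.storage, (key,))
        if i < len(self.storage) and self.storage[i][0] == key:
            return self.storage[i][1]
        return None
```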
Good luck, great project.
There are books on the topic; a good place to start would be Database Systems: The Complete Book by Garcia-Molina, Ullman, and Widom.
SQLite was mentioned before, but I want to add something.
I personally learned a lot by studying SQLite. The interesting thing is that I did not go into the source code (though I did have a short look). I learned a great deal by reading the technical material and especially by looking at the internal commands it generates. It has its own stack-based interpreter inside, and you can read the P-code it generates internally just by using EXPLAIN. Thus you can see how various constructs are translated into the low-level engine (which is surprisingly simple -- but that is also the secret of its stability and efficiency).
I would suggest focusing on www.sqlite.org
It's recent, small (source code 1MB), open source (so you can figure it out for yourself)...
Books have been written about how it is implemented:
http://www.sqlite.org/books.html
It runs on a variety of operating systems for both desktop computers and mobile phones so experimenting is easy and learning about it will be useful right now and in the future.
It even has a decent community here: https://stackoverflow.com/questions/tagged/sqlite
Okay, I have found a site which has some information on SQL and implementation. It is a bit hard to link to the page which lists all the tutorials, so I will link them one by one:
http://c2.com/cgi/wiki?CategoryPattern
http://c2.com/cgi/wiki?SliceResultVertically
http://c2.com/cgi/wiki?SqlMyopia
http://c2.com/cgi/wiki?SqlPattern
http://c2.com/cgi/wiki?StructuredQueryLanguage
http://c2.com/cgi/wiki?TemplateTables
http://c2.com/cgi/wiki?ThinkSqlAsConstraintSatisfaction
Maybe you can learn from HSQLDB. I think it offers a small and simple database for learning, and you can look at the code since it is open source.
If MySQL interests you, I would also suggest this wiki page, which has some information about how MySQL works. You might also want to take a look at Understanding MySQL Internals.
You might also consider looking at a non-SQL interface for your database engine. Take a look at Apache CouchDB; it's what you would call a document-oriented database system.
Good Luck!
I am not sure whether it fits your requirements, but I implemented a simple file-oriented database with support for simple statements (SELECT, INSERT, UPDATE) using Perl.
What I did was store each table as a file on disk, with entries following a well-defined pattern, and manipulate the data using built-in Linux tools like awk and sed. To improve efficiency, frequently accessed data were cached.