I have seen a lot of keyword research/analysis applications, such as Market Samurai: Keyword Analysis Tool and the SEMRush keyword tool.
My question is: how do they get stats about those keywords? Are they using a Google API to achieve that?
I fail to see how software that is not connected to Google's search database can get information about monthly searches, competition, etc.
Thanks.
For Search Volume, Paid Competition, and CPC data, most of these tools get it in one of three ways.
They can get it directly from Google via the AdWords API (this requires "Standard Access" and meeting the RMF requirements); a hedged sketch of such a call follows below.
Another way is to get it from a third party that pulls data from Google, updated monthly, such as GrepWords.
A third way is to build their own models from various third-party data sources, possibly mixed with Google's statistics and other clickstream data, and to apply machine learning algorithms to make predictions that can even rival Google's own data.
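For the first route, a rough sketch of what pulling keyword metrics through Google's Ads API can look like with the official Python client (the google-ads package) is below. The customer ID, credentials file and exact field names are placeholders and may differ between API versions, so treat this as an outline rather than working production code.

    # Hedged sketch: fetching average monthly searches and competition from the
    # Google Ads (formerly AdWords) API. Requires Standard Access; the customer
    # ID and credentials file below are placeholders.
    from google.ads.googleads.client import GoogleAdsClient

    client = GoogleAdsClient.load_from_storage("google-ads.yaml")  # your own credentials file
    service = client.get_service("KeywordPlanIdeaService")

    request = client.get_type("GenerateKeywordIdeasRequest")
    request.customer_id = "1234567890"  # placeholder account ID
    request.keyword_seed.keywords.extend(["keyword research tools"])

    for idea in service.generate_keyword_ideas(request=request):
        metrics = idea.keyword_idea_metrics
        print(idea.text, metrics.avg_monthly_searches, metrics.competition)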
For Keyword Difficulty (KD) or Organic Competition scores, all tools provide an estimate of how difficult it might be to rank highly in organic results for a specific keyword. Tools will typically use a combination of techniques. Below is a short list of what they may include (a rough sketch of how such signals might be combined appears after the list):
Search Engine Result Pages (SERP) analysis
Each keyword's SERP density analysis
Analysis of competitors for each keyword
Word difficulty and frequency
Backlink and domain authority analysis of competitors
and many other indicators
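As an illustration only, here is a rough sketch of how such signals could be combined into a 0-100 difficulty score. The inputs, weights and normalisation are invented for this example and do not reflect any particular vendor's formula.

    # Toy keyword-difficulty score: average the "strength" of the top organic
    # results. All weights and field names are illustrative assumptions.
    def keyword_difficulty(serp_results):
        """serp_results: dicts describing the top organic results, e.g.
        {"domain_authority": 0-100, "page_authority": 0-100,
         "referring_domains": int, "exact_match_title": bool}"""
        if not serp_results:
            return 0.0
        total = 0.0
        for r in serp_results:
            total += (
                0.4 * r["domain_authority"]
                + 0.3 * r["page_authority"]
                + 0.2 * min(r["referring_domains"], 1000) / 10.0  # cap, scale to 0-100
                + 0.1 * (100 if r["exact_match_title"] else 0)
            )
        return round(total / len(serp_results), 1)  # 0 (easy) .. 100 (hard)

    print(keyword_difficulty([
        {"domain_authority": 85, "page_authority": 70,
         "referring_domains": 450, "exact_match_title": True},
        {"domain_authority": 60, "page_authority": 55,
         "referring_domains": 120, "exact_match_title": False},
    ]))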
A few tools and where they get their Search Volume and CPC data:
SEMRush uses an algorithm to estimate their traffic (source: spoke with them at a conference in 2016).
ahrefs uses a third party to get clickstream data and pairs it with data from Google.
MOZ uses a third party to get their Google data and click stream data (source: spoke with team).
KWFinder reports that their data is the same as Google Keyword Planner.
Twinword Ideas actually gets their data directly from Google (source: I work there).
I have an AWS RDS (AuroraDB) instance and I want to mask the data in the DB. Does Amazon provide any service for data masking?
I have seen RDS encryption, but I am looking for data masking because the database contains sensitive data. So I want to know: is there any service they provide for data masking, or is there any other tool which can be used to mask the data and load it manually into the DB?
A list of tools which can be used for data masking in my case would be most appreciated. I need to mask the data for testing, as the original DB contains sensitive information like PII (Personally Identifiable Information). I also have to transfer this data to my co-workers, so I consider data masking an important factor.
Thanks.
This is a fantastic question, and I think your pro-active approach to securing the most valuable asset of your business is something that a lot of people should heed, especially if you're sharing the data with your co-workers. Letting people see only what they need to see is an undeniably good way to reduce your attack surfaces. Standard cyber security methods are no longer enough, in my opinion, as demonstrated by numerous attacks and by people losing laptops/USBs with sensitive data on them. We are just humans, after all. With the GDPR coming into force in May next year, any company with customers in the EU will have to demonstrate privacy by design, and anonymisation techniques such as masking have been cited as a way to show this.
NOTE: I have a vested interest in this answer because I am working on such a service you're talking about.
We've found that the right masking method depends on your exact use case, the size of the data set, and its contents. If your data set has minimal fields and you know where the PII is, you can run standard queries to replace sensitive values, i.e. John -> XXXX. If you want to maintain some human readability, there are libraries such as Python's Faker that generate random, locale-based PII you can replace your sensitive values with (PHP Faker, Perl Faker and Ruby Faker also exist).
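For example, a minimal sketch of the Faker approach in Python could look like this; the record layout and column names are made up for illustration.

    # Replace assumed-sensitive fields with realistic but fake values while
    # keeping the rest of the record untouched.
    from faker import Faker

    fake = Faker()

    def mask_row(row):
        return {
            **row,
            "name": fake.name(),
            "email": fake.email(),
            "phone": fake.phone_number(),
            "address": fake.address().replace("\n", ", "),
        }

    rows = [{"id": 1, "name": "John Smith", "email": "john@example.com",
             "phone": "555-0100", "address": "1 Main St", "plan": "gold"}]
    print([mask_row(r) for r in rows])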
DISCLAIMER: Straightforward masking doesn't guarantee total privacy. Think of someone identifying individuals in a masked Netflix data set by cross-referencing it with time-stamped IMDB data, or Guardian reporters identifying a judge's porn preferences from masked ISP data.
Masking does get tedious as your data set grows in fields/tables, and you may want to set up different levels of access for different co-workers, e.g. data science gets lightly anonymised data while marketing gets access to heavily anonymised data. PII in free-text fields is annoying, and understanding what data is available in the world that attackers could use for cross-referencing is generally a big task.
The service I'm working on aims to alleviate all of these issues by automating the process with NLP techniques and a good understanding of the maths behind anonymisation. We're bundling this up into a web service and we're keen to launch on the AWS Marketplace. I would love to hear more about your use case, and if you want early access we're in private beta at the moment, so let me know.
If you are exporting or importing data using CSV or JSON files (i.e. to share with your co-workers), then you could use FileMasker. It can be run as an AWS Lambda function reading/writing CSV/JSON files on S3.
It's still in development but if you would like to try a beta now then contact me.
Disclaimer: I work for DataVeil, the developer of FileMasker.
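For anyone who would rather roll this pattern themselves than use a product, here is a generic sketch of a Lambda handler that reads a CSV from S3, blanks out a few assumed-sensitive columns, and writes a masked copy back. This is not FileMasker's code; the bucket, key, column names and event shape are all assumptions.

    # Generic "mask a CSV on S3 from Lambda" sketch (not FileMasker).
    import csv
    import io
    import boto3

    s3 = boto3.client("s3")
    SENSITIVE_COLUMNS = {"name", "email", "phone"}  # assumed column names

    def handler(event, context):
        bucket = event["bucket"]   # e.g. "my-input-bucket" (assumed event shape)
        key = event["key"]         # e.g. "exports/customers.csv"
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        reader = csv.DictReader(io.StringIO(body))
        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            for col in SENSITIVE_COLUMNS & set(row):
                row[col] = "XXXX"  # crude redaction; swap in Faker for realistic values
            writer.writerow(row)

        s3.put_object(Bucket=bucket, Key="masked/" + key, Body=out.getvalue().encode("utf-8"))
        return {"masked_key": "masked/" + key}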
Using the API to analyze a Twitter stream, I am getting very similar results for openness for pretty much everybody. How can I train a corpus to generate different output?
Unfortunately, you can't. Also, I am afraid Twitter is not the best source for this kind of analysis, since each tweet contains just a little piece of text. Watson Personality Insights works better with large text samples, and most probably Twitter sentences are too short to provide enough information for this kind of analysis (even if you concatenate several tweets into the same text sample).
But if you're getting meaningful results for the other dimensions, what I'd suggest you do is ignore the openness information and try to calculate it using another algorithm (your own?), or even check whether simply removing this dimension gives you good enough results.
There are some nice tips here -- https://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/personality-insights/science.shtml and some references to papers that can help you understand the algorithm internals.
You cannot train Watson Personality Insights in the current version. But there may be alternatives.
From your message it is not clear to me whether you are getting too-similar results for individual tweets or for entire Twitter streams. In the first case, as Leo pointed out in a different answer, please note that you should aim to provide enough information for the analysis to be meaningful (that is 3,000+ words, not just a tweet). In the second case, I would be a bit surprised if your scores were still so similar with that much text (how many tweets per user?), but this may still happen depending on the domain.
If you are analyzing individual tweets, you may also benefit from using Tone Analyzer (in beta as of today). Its "social tone" is basically the same model as Personality Insights, and it gives some raw scores even for small texts. (And, by the way, you get other measures such as emotions and writing style.)
And in any case (small or large inputs), we encourage users to take a look at the raw scores in their own data corpus. For example, say you are analyzing a set of IT support calls (I am making this up); you will likely find some traits tend to be all the same, because the jargon and writing style is similar in all of them. However, within your domain there may be small differences you want to focus on, i.e. there is still a top 10% and a bottom 10% in each trait. So you might want to do some data analysis on the Personality Insights raw_score (see the API reference), or just the score in Tone Analyzer (API reference), and draw your own conclusions.
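To illustrate the raw_scores suggestion, here is a hedged sketch using the ibm-watson Python SDK; the API key, service URL and version date are placeholders, and the exact SDK surface may differ from the version you have installed (IBM has since deprecated the service).

    # Request raw scores alongside percentiles so you can compare users within
    # your own corpus. Credentials and URL below are placeholders.
    from ibm_watson import PersonalityInsightsV3
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    service = PersonalityInsightsV3(
        version="2017-10-13",
        authenticator=IAMAuthenticator("YOUR_API_KEY"),
    )
    service.set_service_url("https://api.us-south.personality-insights.watson.cloud.ibm.com")

    # Concatenate enough tweets per user to get well past ~3,000 words.
    tweets_for_user = ["first tweet text ...", "second tweet text ..."]  # your own data
    text = " ".join(tweets_for_user)

    profile = service.profile(
        text,
        accept="application/json",
        content_type="text/plain",
        raw_scores=True,
    ).get_result()

    for trait in profile["personality"]:
        print(trait["name"], trait["percentile"], trait.get("raw_score"))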
I want to see whether Microsoft provides a service similar to Google BigQuery.
I want to run some queries on a database of around 15 GB, and I want the service to be in the cloud.
P.S.: Yes, I have Googled already but did not find anything similar.
The answer to your question is NO: Microsoft does not (yet) offer a real-time big data query service where you pay as you perform queries. That does not mean you won't find a solution to your problem in Azure.
Depending on your needs, you may have two options on Azure:
SQL Data Warehouse: a new Azure-based columnar database service, currently in preview (http://azure.microsoft.com/fr-fr/documentation/services/sql-data-warehouse/), which according to Microsoft can scale up to petabytes. Assuming that your data is structured (relational) and that you need sub-second response times, it should do the job you expect.
HDInsight: a managed Hadoop service (https://azure.microsoft.com/en-us/documentation/articles/hdinsight-component-versioning/) which deals better with semi-structured data but is more oriented towards batch processing. It includes Hive, which is also SQL-like, but you won't get instant query response times. You could go for this option if you expect to do calculations in batch mode and store the aggregated result set somewhere else.
The main difference between these products and BigQuery is the pricing model: in BigQuery you pay as you perform queries, but with the Microsoft options you pay for the resources you allocate, which can be very expensive if your data is really big.
I think that for occasional usage BigQuery will be much cheaper, while the Microsoft options will be better for intensive use, but of course you will need to do a detailed price comparison to be sure.
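As a back-of-envelope illustration of why the two pricing models diverge, here is a tiny sketch; every rate and volume in it is a made-up placeholder, so plug in the real price lists before drawing any conclusions.

    # Hypothetical numbers only - not real prices.
    def pay_per_query_cost(tb_scanned_per_month, price_per_tb):
        return tb_scanned_per_month * price_per_tb

    def provisioned_cost(hours_running_per_month, price_per_hour):
        return hours_running_per_month * price_per_hour

    # Occasional use: a handful of ad-hoc queries per month.
    print(pay_per_query_cost(0.5, 5.0))    # 2.5  - pay-per-query stays tiny
    print(provisioned_cost(730, 1.5))      # 1095 - a cluster left running all month

    # Intense use: the provisioned cluster is busy around the clock anyway,
    # while per-query charges keep growing with every TB scanned.
    print(pay_per_query_cost(500, 5.0))    # 2500
    print(provisioned_cost(730, 2.0))      # 1460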
To get an idea of what BigQuery really is, and how it compares to a relational database (or Hadoop for that matter), take a look at this doc:
https://cloud.google.com/files/BigQueryTechnicalWP.pdf
Take a look at this:
http://azure.microsoft.com/en-in/solutions/big-data/.
Reveal new insights and drive better decision making with Azure HDInsight, a Big Data solution powered by Apache Hadoop. Surface those insights from all types of data to business users through Microsoft Excel.
I am looking for a non-SQL database.
My requirements are as follows:
Should be able to store >10 billion records
Should consume at most 1 GB of memory.
User requests should take less than 10 ms (including processing time).
Java-based would be great (I need to access it from Java, and it would also help if I ever need to modify the database code).
The database will hold e-commerce search records such as the number of searches, sales, product buckets, product filters, and many more. The database is currently a flat file, and I show some specific data to users. The data to be shown is configured in advance, and according to that configuration users can send HTTP requests to view the data. I want to make things more dynamic so that people can view data without prior configuration.
In other words, I want to build a fast analyzer which can show users what they request.
The best place to find names of non-relational databases is the NoSQL site. Their home page has a pretty comprehensive list, split into various categories - Wide Column Store, Key-Value Pair, Object, XML, etc.
You don't really give enough information about your requirements. But it sounds like kdb+ meets all of the requirements that you've stated. But only if you want to get to grips with the rather exotic (and very powerful) Q language.
Does anybody know how data in Google Analytics is organized? They perform difficult selections from large amounts of data very, very fast - what database structure do they use?
AFAIK Google Analytics is derived from Urchin. As has been said, it is possible that, now that Analytics is part of the Google family, it uses MapReduce/BigTable. I can assume that Google has integrated the old Urchin DB format with the new BigTable/MapReduce.
I found these links, which talk about the Urchin DB. Probably some of it is still in use at the moment.
http://www.advanced-web-metrics.com/blog/2007/10/16/what-is-urchin/
this says:
[snip] ...still use a proprietary database to store reporting data, which makes ad-hoc queries a bit more limited, since you have to use Urchin-developed tools rather than the more flexible SQL tools.
http://www.urchinexperts.com/software/faq/#ques45
What type of database does Urchin use?
Urchin uses a proprietary flat-file database for report data storage. The high-performance database architecture handles very high-traffic sites efficiently. Some of the benefits of the database architecture include:
* Small database footprint approximately 5-10% of raw logfile size
* Small number of database files required per profile (9 per month of historical reporting)
* Support for parallel processing of load-balanced webserver logs for increased performance
* Databases are standard files that are easy to back up and restore using native operating system utilities
More info about Urchin
http://www.google.com/support/urchin45/bin/answer.py?answer=28737
A long time ago I used to have a tracker, and on their site they discussed data normalization: http://www.2enetworx.com/dev/articles/statisticus5.asp
There you can find a bit of info on how to reduce the data in the DB; maybe it is a good starting point for research.
BigTable
Google Publication: Chang, Fay, et al. "Bigtable: A distributed storage system for structured data." ACM Transactions on Computer Systems (TOCS) 26.2 (2008):
Bigtable is used by more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth.
I'd assume they use their 'Big Table'
I can't know exactly how they implement it, but because I've made a product that extracts non-sampled, non-aggregated data from Google Analytics, I have learned a thing or two about the structure.
It makes sense that the data is populated via BigTable.
BigTable offers data-locality awareness and map/reduce querying across n nodes.
Distinct counts
(Whether a data service can provide distinct counts or not is a simple measure of the flexibility of its data model - but it's typically also a measure of cost and performance.)
Google Analytics is not built to do distinct counts, even though GA can count users across almost any dimension - yet it can't count, for example, sessions per ga:pagePath.
How so?
Well, they only register a session on the first pageview of that session.
This means that we can only count how many landing pages have had a session.
We have no count for all the other 99% of pages on your site. :/
The reason for this is that Google made the choice NOT to do distinct counts at all. It simply doesn't scale well economically when serving millions of sites for free.
They needed an approach that avoids counting distinct values. A distinct count is all about sorting and grouping lists of IDs for every cell in the data intersection.
But...
Isn't it simple to count the distinct number of sessions for a ga:pagePath value?
I'll answer this in a bit.
The User and data partitioning
The choice they made was to partition the data on users (clientIds or userIds).
Because when they know that clientId/userId X is only present in one particular table in BigTable, they can run a map/reduce function that counts users without having to worry that the same user is present in another dataset - which would force them to store all clientIds/userIds in a list, group them, and then count them distinctly.
Since the current GA tracking script is called Universal Analytics, they have to be able to count users correctly, especially when focusing on cross-device tracking.
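As a toy illustration of that idea (not Google's actual code): if hits are partitioned by clientId, each partition can count its own distinct users and the partial counts can simply be summed, because a given clientId never lands in two partitions.

    hits = [
        {"clientId": "A", "pagePath": "/home"},
        {"clientId": "A", "pagePath": "/pricing"},
        {"clientId": "B", "pagePath": "/home"},
        {"clientId": "C", "pagePath": "/home"},
    ]

    NUM_PARTITIONS = 2
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for hit in hits:
        # Same clientId always hashes to the same partition.
        partitions[hash(hit["clientId"]) % NUM_PARTITIONS].append(hit)

    # Each partition counts users locally; no global list of IDs is ever needed.
    total_users = sum(len({h["clientId"] for h in part}) for part in partitions)
    print(total_users)  # 3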
OK, but how does this affect session count?
You have a set of users, each having multiple sessions, each having a list of page hits.
When counting pagePaths within a specific session, you will find the same page multiple times, but you should not count the page more than once.
You need to record that you've already seen this page before.
When you have traversed all pages within that session, you only count the session once per page. This procedure requires state/memory, and since the counting is probably done in parallel on the same server, you can't be sure that a specific session is handled by the same process, which makes the counting even more memory-consuming.
Google decided not to chase that rabbit any longer and to just ignore that the session count is wrong for pagePath and other hit-scoped dimensions.
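A small sketch (again purely illustrative) of why "sessions per pagePath" needs per-session state that plain user counting does not:

    sessions = {
        "s1": ["/home", "/pricing", "/home"],   # /home viewed twice in s1
        "s2": ["/home"],
        "s3": ["/pricing", "/pricing"],
    }

    sessions_per_page = {}
    for session_id, page_hits in sessions.items():
        seen_in_this_session = set()            # the state Google wants to avoid keeping
        for page in page_hits:
            if page not in seen_in_this_session:
                seen_in_this_session.add(page)
                sessions_per_page[page] = sessions_per_page.get(page, 0) + 1

    print(sessions_per_page)  # {'/home': 2, '/pricing': 2}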
"Cube" storage
The reason I write "cube" is that I don't know exactly if they use traditional a OLAP cube structure, but I know they have up to 100 cubes populated for answering different dimension/metric combinations.
By isolation/grouping dimensions in smaller cubes, data won't explode exponentially like it would if they put all data in a single cube.
The drawback is that not all data combinations are allowed. Which we know is true.
E.g. ga:transactionId and ga:eventCategory can't be queried together.
By choosing this structure, the dataset can scale well, both economically and performance-wise.
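A toy sketch of the "many small cubes" idea (the cube contents here are invented) shows why some dimension combinations simply have no cube that can answer them:

    # Each cube pre-aggregates a fixed set of dimensions; a query is only
    # answerable if some cube contains every requested dimension.
    CUBES = {
        frozenset({"ga:pagePath", "ga:date"}):      "cube_01",
        frozenset({"ga:eventCategory", "ga:date"}): "cube_02",
        frozenset({"ga:transactionId", "ga:date"}): "cube_03",
    }

    def find_cube(requested_dimensions):
        wanted = set(requested_dimensions)
        for dims, cube in CUBES.items():
            if wanted <= dims:
                return cube
        return None

    print(find_cube(["ga:pagePath", "ga:date"]))                # cube_01
    print(find_cube(["ga:transactionId", "ga:eventCategory"]))  # None -> not a valid combination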
Many places and applications in the Google portfolio use the MapReduce algorithm for storage and processing of large quantities of data.
See the Google Research Publications on MapReduce for further information and also have a look at page 4 and page 5 of this Baseline article.
Google Analytics runs on 'Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing'.
https://storage.googleapis.com/pub-tools-public-publication-data/pdf/42851.pdf
"Mesa is a highly scalable analytic data warehousing systemthat stores critical measurement data related to Google’sInternet advertising business."