Articles About Data Acquisition from Existing Sources - database

I'm writing a report about best practices for data acquisition from existing sources. However, I'm having trouble finding many articles or papers about what I need. Most of what I find either covers data acquisition at the source, e.g. a weather station, or glosses over the part where you talk to a different team about getting access to their data.
Essentially, I'm looking for articles that list some good questions to ask when you talk to other teams in your organization about getting access to their data so the BI analysts can use it. Even a search term that would better describe what I'm looking for would help. At first I thought data wrangling might be it, but all my searches yield articles about manipulating the data into the shape you need.
To recap, I'm after pointers to articles, or search terms I could use to find them, that cover the best practices to follow when talking to other departments about getting access to their databases for BI purposes.

First of all, I recommend paying attention to your choice of search engine; there are many more of them than just Google. In my experience, there are often cases where one search engine cannot find a rare piece of information but another can.
Here are the results of the query "how the analyst communicates with other departments" from Google search:
A set of communication tips for a BI analyst -
https://inside.getyourguide.com/blog/2020/7/28/10-tips-on-effective-communication-in-business-intelligence
A set of tips for an analyst on interdepartmental communication -
https://www.phocassoftware.com/business-intelligence-blog/four-ways-to-promote-interdepartmental-communication
An article about communication-oriented modeling to improve the efficiency of an analyst -
https://www.crystalloids.com/news/fully-communication-oriented-information-modelling
And here are the results of the same query but in DuckDuckGo:
A set of general interdepartmental communication tips -
https://www.themuse.com/advice/how-to-communicate-better-with-other-departments
An article on effective communication for an analyst - https://www.northeastern.edu/graduate/blog/communicating-with-data/
A set of tips for an analyst to improve communication skills -
https://www.modernanalyst.com/Resources/Articles/tabid/115/ID/1994/10-Ways-to-Hone-Your-Communication-Skills-as-a-Business-Analyst.aspx
And these are only two search engines; there are also Bing, Ecosia, Yandex, and Startpage. With the right query and a variety of search tools, you can usually find the information you're after.

Related

Are RDF / Triple Stores suited for storing application data? (as opposed to the graph Metadata)

I'm trying to create a small web application for a "personal information manager" / wiki kind of tool where I can take notes in the form of HTML snippets (or maybe Markdown), annotate them with some https://schema.org/ microdata and store both the snippet and the metadata somewhere for querying.
My understanding so far is that most semantic data stores (triple/quad stores, or databases supporting RDF) are better suited for storing and querying mainly the metadata. So I'll probably also want some traditional store of some sort (relational, document store, key-value, or even a non-rdf graph db) where I can store the full text of each note and maybe some other bits like time of last access, user-id that owns the note, etc, and also perform traditional (non-semantic) fulltext queries.
I started looking for stores that would allow me to store both data and metadata in a single place. I found a few: Ontotext GraphDB, Stardog, MarkLogic, etc. All of these seem to do exactly what I want, but have some pretty limiting free license terms that really discourage me from studying them in depth: I prefer to study open technologies that I could potentially use on a real product.
Before digging deeper, I was wondering:
If my assumption is correct: that I'll need to use one store for the data and another for the metadata.
If there's any setup involving free/open-source software that developers with experience in RDF/SPARQL can recommend, given the problem I describe.
Right now I'm just leaning towards using Apache Jena for the RDF store and SPARQL queries, and something completely independent for the rest of the data (PostgreSQL most likely).
If my assumption is correct: that I'll need to use one store for the data and another for the metadata.
Not necessarily, no, though there certainly are some cases in which that distinction may be useful. But most RDF databases offer scalable storage for both data and metadata. The only requirement is that your (meta)data is represented as RDF. If you're worried about performance of things like text queries, most of them offer support for full-text indexing through Lucene, Solr, or Elasticsearch.
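For a concrete sense of what "both in one store" looks like, here is a minimal sketch using Python's rdflib (chosen only for brevity; Apache Jena or RDF4J follow the same pattern). The namespace, note identifier, and property choices are illustrative assumptions, not a recommended schema.
```python
# Minimal sketch: note text ("data") and annotations ("metadata") live in the
# same RDF graph, so one SPARQL query can reach both. rdflib is used purely
# for illustration; the EX namespace and property choices are made up.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

SCHEMA = Namespace("https://schema.org/")
EX = Namespace("http://example.org/notes/")  # hypothetical namespace

g = Graph()
note = EX["note1"]

# The full text of the note, stored as an ordinary literal.
g.add((note, RDF.type, SCHEMA.CreativeWork))
g.add((note, SCHEMA.text, Literal("Buy milk; review the RDF chapter.")))

# Metadata about the note.
g.add((note, SCHEMA.dateCreated,
       Literal("2023-01-15T10:00:00", datatype=XSD.dateTime)))
g.add((note, SCHEMA.author, URIRef("http://example.org/users/alice")))

# One SPARQL query returns text and metadata together.
results = g.query("""
    PREFIX schema: <https://schema.org/>
    SELECT ?text ?created WHERE {
        ?note a schema:CreativeWork ;
              schema:text ?text ;
              schema:dateCreated ?created .
    }
""")
for text, created in results:
    print(text, created)
```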
If there's any setup involving free/open-source software that developers with experience in RDF/SPARQL can recommend, given the problem I describe.
This is really not the right place to ask this question: tool recommendations are considered off-topic on Stack Overflow since they attract biased answers. But as said, there are plenty of tools, both open-source/free and commercial, that you can look into. I suggest you pick one you like the look of, experiment a bit, and perhaps talk to that particular tool's community to explain what you're trying to do. Apache Jena and Eclipse RDF4J are two popular open-source projects, but there are plenty of others.

Cassandra data modeling for social network with follower and following actions

I want to know how Cassandra data modeling should be done for a social network that allows its users to follow each other, and has a timeline and the other common features found in social networks like Twitter.
I found twissandra on Github but that was confusing for me.
Please describe how the following and follower tables should be modeled in Cassandra, or provide links to tutorials.
Unlike relational database schema design, where the queries you will run matter mainly in the context of optimization, schema design for Cassandra is query-oriented: you first need to figure out what kind of information you will ask for in order to design an effective Cassandra instance. A wrong schema design can kill Cassandra's performance.
Therefore, regarding your question, you should first have a complete picture of your context, and then go back to the design phase.
I have personally found the material provided by DataStax Academy really useful. The courses are free; you only need to register. I would suggest first taking a look at the Cassandra system architecture, if you are not familiar with it, in order to fully understand the schema design choices, and then looking at the main design principles.
Regarding the methodology to be used, I don't think there's an established one right now. I would suggest using Chebotko diagrams, which are well explained in this article.
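To make the query-first approach concrete, here is a minimal sketch of follower/following tables driven from the DataStax Python driver. The keyspace, table, and column names are my own illustrative choices, not a canonical schema; the point is that each table is shaped around exactly one query, and a follow action is denormalized into both.
```python
# Minimal query-first sketch: one table answers "whom does this user follow?",
# the other "who follows this user?". Names are illustrative, not canonical.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS social
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("social")

session.execute("""
    CREATE TABLE IF NOT EXISTS following (
        user_id  text,
        followed text,
        since    timestamp,
        PRIMARY KEY (user_id, followed)
    )
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS followers (
        user_id  text,
        follower text,
        since    timestamp,
        PRIMARY KEY (user_id, follower)
    )
""")

def follow(follower_id: str, followed_id: str) -> None:
    """A follow action writes to both tables (denormalization instead of joins)."""
    session.execute(
        "INSERT INTO following (user_id, followed, since) "
        "VALUES (%s, %s, toTimestamp(now()))",
        (follower_id, followed_id),
    )
    session.execute(
        "INSERT INTO followers (user_id, follower, since) "
        "VALUES (%s, %s, toTimestamp(now()))",
        (followed_id, follower_id),
    )
```
A timeline would typically follow the same pattern: a table partitioned by the reading user, with posts written into it at post time rather than assembled with joins at read time.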

Freebase: Is it worth it to base my company's entire database on it?

I'm with a company that is building a venue / artist database for live music and recently came across Freebase. It looks very compelling, even if the data isn't there for new, up-and-coming bands. For those of you who have worked with Freebase, I have a couple questions:
Are there downsides to integrating all of the data entry with Freebase? We are not looking to sell or privatize this information.
What are the weaknesses of Freebase, with regards to usability?
Disclosure: I work on Freebase at Google.
The music data in Freebase is one of our strongest areas and is going to continue to get broader and richer as we continue to load more datasets. For example, we import data from MusicBrainz, clean it up and match the topics against existing topics in Freebase to avoid duplicates.
In terms of downsides, you should be prepared to work with a lot of data. For example, Freebase currently has 4 musical artists named "John Smith", which may or may not be useful for your application, but you'll still need to figure out which one(s) map to the John Smith that your users are interested in. We call this "reconciliation", and it's necessary so that your app knows precisely which topics to query the API for.
Since you mentioned music venues I should also point out that while Freebase has a lot of data about places, we don't yet have a geosearch API so you'd need to roll your own if that's something you need.
Since anyone can edit Freebase, you should also consider using as_of_time to protect your site against vandalism.
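As a rough sketch of what reconciliation and as_of_time look like against the MQL read service, the snippet below lists every /music/artist named "John Smith" while pinning the result to a point in time. The endpoint URL, parameter names, and key handling are assumptions based on how the API was documented, so verify them against the current docs.
```python
# Rough sketch: MQL read for artists named "John Smith", pinned with
# as_of_time. Endpoint, parameters, and key handling are assumptions
# to verify against the Freebase API documentation.
import json
import urllib.parse
import urllib.request

MQLREAD = "https://www.googleapis.com/freebase/v1/mqlread"  # assumed endpoint
API_KEY = "YOUR_API_KEY"  # placeholder

# Reconciliation: list every candidate so the app can decide which id
# its users actually mean.
query = [{"type": "/music/artist", "name": "John Smith", "id": None}]

params = urllib.parse.urlencode({
    "query": json.dumps(query),
    "as_of_time": "2012-01-01",  # read the graph as of this date
    "key": API_KEY,
})

with urllib.request.urlopen(f"{MQLREAD}?{params}") as resp:
    for artist in json.load(resp)["result"]:
        print(artist["id"], artist["name"])
```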
Freebase is great for developers because you can easily jump in and clean up bad data or add missing topics. However, one area that has always been a challenge is loading large amounts of data from outside of Google. We've built OpenRefine, which allows folks to upload datasets, but these datasets must pass a QA process that takes some time to complete. It's necessary to have these QA processes to maintain the level of quality in Freebase, but it does slow down the process of loading large datasets.
I really hope that you choose to make use of Freebase music data to build your company. I know that there are already a number of music startups happily using our data.

How to get book metadata?

My application needs to retrieve information about any published book based on a provided ISBN, title, or author. This is hardly a unique requirement---sites like Amazon.com, Chegg.com, and even software like Book Collector seem to be able to do this easily. But I have not been able to replicate it.
To clarify, I do not need to search the entire database of books---only a limited subset which have been inputted, as in a book collection. The database would simply allow me to tag the inputted books with the necessary metadata to enable search on that subset of books. So scale is not the issue here---getting the metadata is.
The options I have tried are:
Scrape Amazon. Scraping the regular Amazon pages was not very robust to things like missing authors, and while scraping the smaller mobile pages was faster, they shared the same issues with robustness of extraction. Plus, building this into an application is a clear violation of Amazon's Terms of Service.
Scrape the Library of Congress. While this seems to have fewer legal ramifications, ease and robustness were again issues.
ISBNdb.com API. While the service is free up to a point, and does a good job of returning the necessary metadata, I need to do this for over 500 books on a daily basis, at which point this service costs money proportional to use. I'd prefer a free or one-time payment solution that allows me to do the same.
Google Book Data API. While this seems to provide the information I need, their terms of service require me to display the book preview, which I cannot do.
Buy a license to a database of books. For example, companies like Ingram or Baker & Taylor provide these catalogs to retailers and libraries. This solution is obviously expensive, so I'm hoping that there's a more elegant solution I've missed. But if not, and someone on SO has had a good experience with a particular database, I'm willing to go with that.
I've tried to describe my approach in detail so others with fewer books can take advantage of the above solutions. But given my requirements, I'm at my wits' end for retrieving book metadata.
Since it is unlikely that you have to retrieve the same 500 books every day: store the data retrieved from isbndb.com in a database and fill it up book by book.
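A minimal sketch of that caching idea, with SQLite as the local store and a placeholder fetch function standing in for whichever metadata service you call; the function and table layout are illustrative, not any particular provider's API.
```python
# Minimal sketch: cache book metadata locally so each ISBN hits the
# rate-limited/paid API at most once. fetch_metadata() and the table
# layout are placeholders, not a specific provider's API.
import sqlite3

conn = sqlite3.connect("books.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS books (
        isbn   TEXT PRIMARY KEY,
        title  TEXT,
        author TEXT
    )
""")

def fetch_metadata(isbn: str) -> dict:
    """Placeholder: call isbndb.com (or another service) here."""
    raise NotImplementedError

def get_book(isbn: str) -> dict:
    row = conn.execute(
        "SELECT isbn, title, author FROM books WHERE isbn = ?", (isbn,)
    ).fetchone()
    if row:  # cache hit: no API call
        return {"isbn": row[0], "title": row[1], "author": row[2]}
    meta = fetch_metadata(isbn)  # cache miss: fetch once, then store
    conn.execute(
        "INSERT INTO books (isbn, title, author) VALUES (?, ?, ?)",
        (isbn, meta["title"], meta["author"]),
    )
    conn.commit()
    return {"isbn": isbn, "title": meta["title"], "author": meta["author"]}
```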
Instead of scraping Amazon, you can use the API they expose for their affiliate program: https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html
It allows about 3k requests per hour and returns well-formed XML. It requires you to include a link to the book whose information you display, and you must state that you are an affiliate partner.
This might be what you're looking for. They even offer a complete download!
https://openlibrary.org/data
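A single-ISBN lookup against Open Library's public Books API might look like the sketch below; the response fields vary by record, so treat the keys accessed here as assumptions to check against their documentation.
```python
# Minimal sketch: look up one ISBN via Open Library's Books API.
# The fields accessed ("title", "authors") vary by record.
import json
import urllib.request

def lookup_isbn(isbn: str) -> dict:
    url = ("https://openlibrary.org/api/books"
           f"?bibkeys=ISBN:{isbn}&format=json&jscmd=data")
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return data.get(f"ISBN:{isbn}", {})

book = lookup_isbn("9780140328721")
print(book.get("title"))
print([a.get("name") for a in book.get("authors", [])])
```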
It seems a lot of libraries and other organisations make bibliographic information such as ISBNs available through MAchine-Readable Cataloging, aka MARC; you can find more information about it here as well.
Now, knowing the "right" term to search for, I discovered WorldCat.org.
Maybe this whole MARC thing gives you a new kind of idea :)

Google App Engine - Using managed relations

I am experimenting with App Engine. One of my stumbling blocks has been the support for managed relations, or lack thereof; this is further compounded by the lack of join support.
Without getting into the details of the issues I have run into (which I will post under a different topic), I would like to ask two things.
1. Has anyone here used managed relations in something substantial? If so, can you share some best practices that will help?
2. Are there any good, comprehensive examples you have come across that you can point me to?
Thanks in advance.
I think this answer might disappoint you, but before you develop on the app engine you should read it anyway, and confirm this in the docs.
No. No one on the app engine has used managed relations for anything 'substantial', simply because Bigtable is not built for managed relations. It is a sharded and sorted array, and as such is a very different kind of data structure than what you would normally use.
Now there are attempts to set up managed relationships - the GAE/Java team is pushing JDO features that come close to this, and there's more info on this blog, but this simply isn't the natural state of things on the app engine, and you'll very quickly run into problems if you decide to spend too much time wrapping yourself in a leaky abstraction.
It's a lot easier to actually look at what Bigtable really is - there are a ton of videos on the Google I/O pages for 2010 and 2009 that do a fantastic job of explaining that - and then figure out ways to map your problem to the capabilities of the datastore. It may sound unreasonable, but think about it... GAE is a tool that can do certain things exceedingly well, and if you can frame your problem in terms of ideas like object stores, sets, merge joins, task queues, pre-computation, and caching, then you can use this tool to kick ass.
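For what that looks like in practice, here is a minimal sketch in the Python runtime's ndb API (the Java/JDO code in the question targets the same underlying datastore): relationships are plain keys you store and filter on yourself, plus whatever denormalization you need, rather than managed relations. The model and property names are made up for illustration.
```python
# Minimal sketch of the "no managed relations" style on the App Engine
# datastore (Python ndb API). Relationships are plain keys plus
# denormalized copies; model names are illustrative.
from google.appengine.ext import ndb

class Author(ndb.Model):
    name = ndb.StringProperty(required=True)

class Note(ndb.Model):
    title = ndb.StringProperty(required=True)
    body = ndb.TextProperty()
    # Not a managed relation: just a key you maintain yourself.
    author_key = ndb.KeyProperty(kind=Author)
    # Denormalized copy so listings never need a join.
    author_name = ndb.StringProperty()

def create_note(author: Author, title: str, body: str) -> Note:
    note = Note(title=title, body=body,
                author_key=author.key,
                author_name=author.name)
    note.put()
    return note

def notes_by_author(author: Author):
    # The "join" is an ordinary indexed filter on the stored key.
    return Note.query(Note.author_key == author.key).fetch()
```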
