need sample data for e-commerce master research - dataset

In my MSc research, I have to build an eCommerce app (like Amazon, eBay, etc.), but a location-based one. I need freely available sample data for the store. Is there a freely available dataset that represents a set of products, like groceries, movies, books, cars, apps, electronics, weapons, library, etc.? It needs to be large enough for analysis: I need a database with at least 200 customers, 1000 products of different categories, and 1000 orders. The customer data has to include information such as age, sex, location, and education.

You could try the Microsoft Contoso BI Demo Dataset. I have not used it, so I do not know if it will meet all of your requirements. However, given that it is used to demonstrate BI functionality across all of MS' BI products, one would hope it was fairly comprehensive.

Related

How do big tech companies share databases across multiple teams?

How do multiple teams (which own different system components/microservices) in a big tech company share their databases?
I can think of multiple use cases where this would be required. For example, in an e-commerce firm, the same product is shared among multiple teams: it is first part of the product onboarding service, then maybe the catalog service (which stores all products and categories), then the search service, cart service, order placement service, recommendation service, cancellation and return service, and so on.
If they don't share any DB, then:
Do they all keep a redundant copy of the products with the same product ID, and
wouldn't it be a challenge to achieve consistency among multiple teams?
I have multiple related doubts in both cases, whether they share a DB or not.
I have been through multiple tech blogs and videos on software design and still haven't found a satisfying answer. Please share some resources that give a complete workflow of how things work end-to-end in a big tech firm.
Thank you
In a microservice architecture, each microservice exposes endpoints through which other microservices can access the information it shares. So a service stores as little information as possible about a record that is managed by another microservice.
For example, if a user service wants to fetch the orders for a particular user in an e-commerce case, the order service would expose an endpoint that, given a user id, returns all orders related to that user id, and so on. So essentially the only user-related field that the order service needs to store is the user id; the rest of the user details are irrelevant to it.
To further improve cohesion and understanding between teams, data discovery APIs and documentation are also built to share database metadata with other teams, explaining what each table and field means so that a microservice can be planned efficiently. You can read more about how such companies build data discovery tools here.
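The order-service example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `OrderService` and `UserService`, the method `get_orders_for_user`, and the sample data are all hypothetical, and a plain method call stands in for what would be an HTTP or RPC endpoint in a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    order_id: int
    user_id: int       # the only user-related field the order service keeps
    total_cents: int


class OrderService:
    """Hypothetical order service that owns the order records."""

    def __init__(self) -> None:
        self._orders: list[Order] = []

    def place_order(self, order: Order) -> None:
        self._orders.append(order)

    def get_orders_for_user(self, user_id: int) -> list[Order]:
        # Stand-in for an exposed endpoint that takes a user id.
        return [o for o in self._orders if o.user_id == user_id]


class UserService:
    """Hypothetical user service that owns names and other user details."""

    def __init__(self, order_service: OrderService) -> None:
        self._users = {1: "Alice", 2: "Bob"}
        self._order_service = order_service

    def user_with_orders(self, user_id: int) -> dict:
        # User details come from this service; orders are fetched by id
        # from the service that owns them.
        return {
            "name": self._users[user_id],
            "orders": self._order_service.get_orders_for_user(user_id),
        }


orders = OrderService()
orders.place_order(Order(100, 1, 2599))
orders.place_order(Order(101, 2, 999))
orders.place_order(Order(102, 1, 450))

users = UserService(orders)
profile = users.user_with_orders(1)
```

Here `profile` combines the name owned by the user service with the two orders owned by the order service, and neither service stores a copy of the other's records.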
If I understand you correctly, you are unsure how different departments receive data in a company?
The idea is that you create reusable and effective APIs to solve this problem.
Let's generically say the company we're looking at is Walmart. Walmart has millions of items in one or more databases. Each item has a unique ID, etc.
If Walmart is selling items online via walmart.com, they need a way to get those items, so they create APIs and use them to grab items based on certain query conditions.
Now, let's say Walmart has decided to build an app... well, they need those exact same items! Good thing we already created those APIs; we will use the exact same ones to grab the data.
Now, how does Walmart manage which items are available at which store, and at what price? They would usually link this metadata through additional database tables, tying them all together with primary and foreign keys.
This essentially allows Walmart to grab from their CORE database ONLY the item, with just the details that are intrinsic to the item (e.g. name, size, color, SKU, details, etc.), and link it to another database, say that of YOUR local Walmart, that contains information relevant only to your Walmart location with regard to that item (e.g. price, stock, aisle number, etc.).
So yes, in a sense, they use multiple databases.
Perhaps this may drive you down some more roads: https://learnsql.com/blog/why-use-primary-key-foreign-key/
https://towardsdatascience.com/designing-a-relational-database-and-creating-an-entity-relationship-diagram-89c1c19320b2
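The core-item and per-store split described above can be sketched with Python's built-in `sqlite3` module. The table and column names (`items`, `store_items`, `price_cents`, and so on) and the sample rows are assumptions made for illustration, and both tables live in one in-memory database here for simplicity, whereas the answer describes them as potentially separate databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE items (              -- core catalog: details intrinsic to the item
    item_id INTEGER PRIMARY KEY,
    sku     TEXT NOT NULL UNIQUE,
    name    TEXT NOT NULL
);
CREATE TABLE store_items (        -- per-store details for the same item
    store_id    INTEGER NOT NULL,
    item_id     INTEGER NOT NULL REFERENCES items(item_id),
    price_cents INTEGER NOT NULL,
    stock       INTEGER NOT NULL,
    aisle       TEXT,
    PRIMARY KEY (store_id, item_id)
);
""")
conn.execute("INSERT INTO items VALUES (1, 'SKU-001', 'Blue T-shirt')")
conn.executemany(
    "INSERT INTO store_items VALUES (?, ?, ?, ?, ?)",
    [(10, 1, 1299, 40, "A7"), (20, 1, 1399, 0, "C2")],
)


def item_at_store(store_id, sku):
    """Join the core item with the row for one store via the foreign key."""
    return conn.execute(
        """SELECT i.name, s.price_cents, s.stock, s.aisle
           FROM items AS i
           JOIN store_items AS s ON s.item_id = i.item_id
           WHERE s.store_id = ? AND i.sku = ?""",
        (store_id, sku),
    ).fetchone()


row = item_at_store(10, "SKU-001")
```

The same item row is reused for every store, while price, stock, and aisle vary per location through the `store_items` rows that reference it.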
There's a substantial diversity of approaches used between and even within big tech companies, driven by different company/org cultures and different requirements around consistency and availability.
Any time you have an explicit "query another service/another DB" dependency, you have a coupling that tends to turn a problem in one service into a problem in both services. This isn't necessarily a one-way thing: it's quite possible for the querying service to encounter a problem that cascades into the queried service. This is especially likely when a cache becomes load-bearing, which has led to major outages at at least one FANMAG in the not-that-distant past.
This has led some companies that could be fairly called big tech to eschew that approach in their service design, typically by having services publish events describing what has changed to a durable log (append-only storage). Other services subscribe to that log and use the events to construct their own eventually consistent view of the data owned by the other service (i.e. there's some level of data duplication, with services storing exactly the data they need to function).
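A minimal sketch of that publish/subscribe pattern, assuming an in-memory list as a stand-in for the durable log. The names `ProductEvent`, `EventLog`, and `SearchView`, and the event kinds "upserted" and "deleted", are hypothetical and not taken from any particular company's system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProductEvent:
    product_id: int
    kind: str            # "upserted" or "deleted" (hypothetical event kinds)
    name: str = ""
    price_cents: int = 0


@dataclass
class EventLog:
    """In-memory stand-in for a durable, append-only log."""
    _events: list = field(default_factory=list)

    def append(self, event: ProductEvent) -> None:
        self._events.append(event)

    def read_from(self, offset: int) -> list:
        return self._events[offset:]


class SearchView:
    """Hypothetical subscriber that keeps only the product names it needs."""

    def __init__(self, log: EventLog) -> None:
        self._log = log
        self._offset = 0     # position of the next unread event
        self.names = {}

    def catch_up(self) -> None:
        # Apply every event published since the last call, in order.
        for event in self._log.read_from(self._offset):
            if event.kind == "deleted":
                self.names.pop(event.product_id, None)
            else:
                self.names[event.product_id] = event.name
            self._offset += 1


log = EventLog()
view = SearchView(log)

log.append(ProductEvent(1, "upserted", "Kettle", 2500))
log.append(ProductEvent(2, "upserted", "Toaster", 3100))
stale = dict(view.names)     # the view has not consumed anything yet

view.catch_up()
log.append(ProductEvent(2, "deleted"))
view.catch_up()
```

Between a publish and the next `catch_up` call the view lags behind the owning service, which is the eventual consistency the answer refers to. The view also ignores `price_cents` entirely, since it stores only the data it needs.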

Which database to use for highly-connected data model with frequent schema changes?

I am currently working on a project where we use web mining (web crawlers) to build up a company database. At the moment we employ PostgreSQL as our main database; however, I feel it will cause a lot of problems in the future, because as our crawlers develop and extract more data we'll see many schema changes and additions.
Some examples:
At the moment we store one address per company, but at one point we might want to store multiple addresses. (1 - 1 relationship transforms to 1 - n, or even n - n)
Companies in different industries have very different attributes, which is why we have a lot of NULL fields in our relational schema at the moment.
Different degrees of information available, e.g. for some companies we only know the CEO's name, which could be stored in a single attribute of the company. For other companies we might want to use a relationship to a Person relation, because we have a photo, birthdate, CV etc... (the schema is not fixed)
What kind of database would be suited for such a task? I've looked into MongoDB, Neo4j, and OrientDB. Some requirements that are important for us:
No license fee for non-open-source commercial projects
Should scale to storing 100GB - 1000TB while executing OLAP queries in the millisecond range to drive a web interface (a company query interface)

the approach to building a web application

This is the scenario I have:
I'm developing a web app that will list all the details of a car that the user picks from a list. I have a database of all car models, makes, sizes, prices, etc. I also have the price trend for the past 5 years. You may assume that I have a few such tables and that the data volume is in the tens of thousands of records.
My online application should let the user pick one car model and optionally provide their address. With just this user input, I want to be able to generate a PDF report with the following information:
Comparison of the selected car model with other cars manufactured in the same country (e.g. if the user selected Honda, I want to compare it with Toyota, which comes from the same country)
Comparison of the selected car with other cars of a similar type (e.g. sedan vs sedan)
Price trend of the car for the last 5 years
Nearest car workshops in the user's neighbourhood within a radius of 10 km (if the user has given me their address)
I will be pulling several other pieces of data from my database.
I would like to present this report quickly, say within 3 minutes, to the user. So the question is: what software, tools, programs, database, etc. should I be using, taking into consideration the huge amount of data and the need to present it as a PDF report in the fastest possible time?
There are a whole lot of possibilities. You can use PHP, Java, .NET, and so on for the web application, and MySQL, SQL Server, Oracle, etc. as the database (if the data is really big and grows rapidly every day, you may also consider HBase). It depends on how soon you want your product on the market, how scalable it should be, and how comfortable you are with each of those technologies.
Some technologies offer a nice user interface; others may not, but are strong in other areas of web application development.
How much money and time you have for development and licensing also plays a role in answering this question.
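Whichever stack is chosen, the "workshops within a radius of 10 km" requirement from the question comes down to a distance filter. A minimal sketch in Python, assuming the user's address has already been geocoded to latitude and longitude (geocoding is not shown) and using the haversine great-circle formula; the function names and sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def workshops_within(user, workshops, radius_km=10.0):
    """Return (distance_km, name) pairs inside the radius, nearest first."""
    lat, lon = user
    nearby = []
    for name, w_lat, w_lon in workshops:
        distance = haversine_km(lat, lon, w_lat, w_lon)
        if distance <= radius_km:
            nearby.append((distance, name))
    return sorted(nearby)


# Made-up coordinates near the equator, for illustration only.
user_location = (0.0, 0.0)
workshops = [
    ("Workshop A", 0.0, 0.05),   # roughly 5.6 km east
    ("Workshop B", 0.0, 1.0),    # roughly 111 km east
    ("Workshop C", 0.03, 0.0),   # roughly 3.3 km north
]
nearby = workshops_within(user_location, workshops)
```

With tens of thousands of records a linear scan like this is typically fast enough; a spatial index in the database would be an alternative if the workshop table grows much larger.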

Good Product Management Software

Currently, we have information about our products in a variety of places:
the ERP, various databases, etc. Generally we're using SQL Server to store most of the data.
We want to create a centralized place where we can store all the information related to our products and start replacing these disjoint databases (with the exception of the ERP).
What are some good software packages that will handle thousands of products and have an HTML interface and perhaps a Win32 client as well? They should be able to handle web-friendly data and internal-only data, and be customizable. (If I want to add a product features section, I can do that; if I want the attributes displayed in a specific order, I'd like to be able to order them.)
Our ERP has some information, but it's not normalized data, it's not standardized and won't hold everything about the product.
Are there any good software packages that will maintain a database of products that we can hook into ( for product information, reporting, showing on the web etc. ) or do I have to roll my own?
The problem isn't in rolling the database, it's all about the interface and the ability to add attributes to products, add different things in different languages etc.
Any ideas?
http://www.pimcore.org/
PHP & MySQL based
handle thousands of products without any problems
incl. DAM
Versioning, Scheduling, Permissions, ....
Easy interfaces to connect ERP, ...
Web-Client but no Win32
The WCMS component can be turned off.
Cheers
I think the best way to manage products in an organization is to use ERP software, because it saves a lot of time and you can easily handle your database in a single location.

need sample data for e-commerce class project

In my CS course project this fall, we have to build a little eCommerce app (like Amazon, eBay, etc.). We are free to build any type of eCommerce/store app. Since I don't have a preference for what app to build, it may be easier to decide based on freely available sample data for the store. So is there a freely available dataset that represents a set of products, like groceries, movies, books, cars, apps, electronics, weapons, library, etc.? It doesn't have to be real, but as long as it can save me a few hours of entering data, it will be worthwhile. An open data format for the dataset would be useful; a MySQL database would be great.
Perhaps I should use the Northwind database from MSSQL?
Is there a "Northwind" type database available for MySQL?
I haven't looked at all the references in this post, but it looks promising: Where can I find sample databases with common formatted data that I can use in multiple database engines?
Any suggestions for eCommerce sample datasets?
At this link you can find some e-commerce datasets in Comma-Separated Values format, namely some snapshots of Amazon, Google Products, ABT and Buy.
http://dbs.uni-leipzig.de/en/research/projects/object_matching/fever/benchmark_datasets_for_entity_resolution
You can try the nopCommerce sample data. Download from http://nopcommerce.codeplex.com. During installation, tick the "create sample data" box. You can then use the data within your own application. The schema is pretty easy to understand.
