I have an interesting problem. We receive feed files from our customers which contain products along with their information. We log each feed request received from our customers in a database.
The problem is that, given a feed file, we need to get all the feed requests which have the same list of products as the given feed file, and every feed request has nearly 2 million candidate feeds for matching.
Let me summarize the problem, just to make sure that we are on the same page.
The application may get a Feed Request (FR), which contains a list of products. Every time it happens, you log the FR in the db, and in addition you want to check for all FRs in the past which contained the same product set. Is that right?
If so, an idea is to generate a hash key for the list of products within a FR. That way every FR in the db has its own hash, which corresponds to the list of products the FR contained.
E.g. a Feed Request comes to the app, and it contains products 2, 1, 3. The app sorts the product identities, [1, 2, 3], and then generates a hash: h([1, 2, 3]) = abc. Then all you need to look for previous FRs with the same product set is a query: "get all records from feed requests where hash is equal to 'abc'".
Such a comparison is not very expensive if you index the data in the right way, even if there are millions of records.
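For illustration, a minimal Python sketch of the idea (the function name and the products_hash column are hypothetical):

import hashlib

def product_set_hash(product_ids):
    # Sort first so [2, 1, 3] and [1, 2, 3] produce the same key.
    canonical = ",".join(str(pid) for pid in sorted(product_ids))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Store the hash in an indexed column when logging the FR; matching is
# then a single indexed equality lookup, e.g.:
#   SELECT * FROM feed_requests WHERE products_hash = :hash
assert product_set_hash([2, 1, 3]) == product_set_hash([1, 2, 3])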
I'm building a side project and I'm currently developing the frontend using React. I have a question regarding paginated API responses and Redux (this is my first time using Redux).
Let's say that I have an /invoices API endpoint which provides a paginated list of invoices. I can also append query params to the endpoint such as ?is_paid=false to filter those invoices. Should I have a store for invoices and another store for filtered invoices?
My idea was that for non-filtered invoices, I would just get the first page and add it to the store, then if the user wants the 2nd page, I'll append those to the store as well, etc. So if a user wants to go back, no requests would be made. However, I'm unsure if I should do the same for filtered invoices as well, because there are several filters that can be applied.
Your terminology is a little bit off. You definitely only want to have one store instance. But you can have lots of different properties in your store.
Your store should contain a property which is a dictionary of all loaded invoices keyed by a unique invoice id. You only need to have one record per invoice even though that invoice might appear in many different filtered lists. These are your "entities".
You also need to know which invoices are in each list, so you should have another property in your store with that data. These are your "collections". You just need to store the ids of the invoices here. You'll get the complete invoice from the entities property.
Usually the key that I use for the collections is the path of the URL. So your store structure would look something like:
{
  invoices: {
    entities: {
      1: {/* complete invoice record */},
      2: {/* complete invoice record */},
      ...
    },
    collections: {
      "/": [99, 98, 97, ...],
      "/page/2/": [89, 88, 87, ...],
      "/?is_paid=false": [99, 92, 87, ...],
      "/?is_paid=false&someFilter=x": [92, 35, 21, ...],
    }
  }
}
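Reading a list back out is then just mapping the collection's ids onto the entities dictionary. The store itself is JavaScript, but the lookup is language-agnostic; a minimal sketch of the idea in Python, with made-up data:

store = {
    "invoices": {
        "entities": {
            99: {"id": 99, "is_paid": False},
            98: {"id": 98, "is_paid": True},
        },
        "collections": {"/": [99, 98], "/?is_paid=false": [99]},
    }
}

def select_invoices(store, collection_key):
    # Denormalize: turn a list of ids back into full invoice records.
    invoices = store["invoices"]
    return [invoices["entities"][i] for i in invoices["collections"][collection_key]]

print(select_invoices(store, "/?is_paid=false"))  # [{'id': 99, 'is_paid': False}]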
I am trying to figure out the best way to store trip itinerary data into DynamoDB. Just for your info, my code is written in Python3 and I am using Boto3 to interact with DynamoDB.
After researching this resource - https://schema.org/Trip - this is what I think the data classes of the objects would be:
from marshmallow_dataclass import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Itinerary:
    id: str
    startTime: int
    endTime: int
    dayTripId: str
    placeName: str
    placeCategory: str
    estimatedCost: float

@dataclass(frozen=True)
class DayTrip:
    id: str
    day: str
    parentTripId: str
    date: Optional[str]
    itinerary: List[Itinerary]

@dataclass(frozen=True)
class UserTrip:
    tripId: str
    userId: str
    tripName: str
    subTrip: List[DayTrip]
Essentially, the structure is as follows:
A person can have many UserTrips
A UserTrip can consist of one or multiple DayTrips, e.g. Day 1, Day 2, Day 3
A DayTrip can have one or multiple places to visit (Itinerary)
An Itinerary is the lowest level that describes the place to visit
It wouldn't be good to store the UserTrip as is, with a nested JSON structure consisting of DayTrip, then Itinerary, right? It would mean that the subTrip attribute of a particular UserTrip will be a huge chunk of JSON. So I think everyone here would agree this is a no-no. Is that correct?
Another alternative that I could think of was to store only the id of each entity. What I mean by this is, for example, a UserTrip will have its subTrip attribute containing a list of the DayTrip id. This means there will be another table to store DayTrip items and we can connect it to the corresponding UserTrip via the parentTripId attribute. And so on for the list of Itinerary.
Using this approach, I will have three tables as follows:
user-trip-table to store UserTrip where subTrip will contain the list of DayTrip.ids
user-day-trip-table to store DayTrip where itinerary will contain the list of Itinerary.ids. The parentTripId will enable the mapping back to the original UserTrip
user-itinerary-table to store Itinerary where it can be mapped back to the original DayTrip via dayTripId attribute.
I am not sure if this is good practice, as there will be a lot of lookups happening and the operations cannot run in parallel. This is because, to fetch the Itinerary, I need to wait for the completion of the GetItem operation that returns the UserTrip; only then do I have the ids of the DayTrips, so I do another GetItem to fetch each DayTrip, and finally another GetItem to fetch each Itinerary.
Could the community here suggest a better, simpler solution?
Thanks!
Regarding the data structure, I don't see an absolute need for DayTrip, as you can get all that data from Itinerary. So in UserTrip I would keep a list of Itineraries instead of a list of DayTrips.
It wouldn't be good to store the UserTrip as is, with a nested JSON structure consisting of DayTrip, then Itinerary, right? It would mean that the subTrip attribute of a particular UserTrip will be a huge chunk of JSON. So I think everyone here would agree this is a no-no. Is that correct?
Actually this is recommended in NoSQL databases, to have all data denormalised/embedded in the object. You use more storage, but avoid joins/processing. But keep in mind DynamoDB's item size limitation (currently 400KB).
In general, in NoSQL, you need to create your schema based on the queries you will need. For example, in your case you want to fetch all Itineraries of a UserTrip. Simply add userTripId to the Itinerary table and create a GSI on Itinerary with userTripId as the hash key so you can query it efficiently. This way you will get all itinerary objects of a user trip.
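Since you are already on Python3/Boto3, a rough sketch of that query (the table name and index name below are assumptions; use whatever you actually create, and note that pagination via LastEvaluatedKey is omitted):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user-itinerary-table")  # hypothetical table name

def itineraries_for_trip(user_trip_id):
    # Query the GSI keyed on userTripId to get every Itinerary of one UserTrip.
    response = table.query(
        IndexName="userTripId-index",  # hypothetical GSI name
        KeyConditionExpression=Key("userTripId").eq(user_trip_id),
    )
    return response["Items"]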
My page URL has 3 variables: id=N, num=N and item=N, where N ranges from 1 to 100. They are integers, and an example URL is given:
page.php?id=1&num=24&item=12
I want to count visits with no duplicates. Here is how I think it should work:
If the cookie exists, don't add a new value to the database; otherwise set the cookie and increment the value in the database.
I used the $_COOKIE[''] array to identify whether the page was visited:
$_COOKIE['product[id]'] = $_GET['id'];
$_COOKIE['product[num]'] = $_GET['num'];
$_COOKIE['product[item]'] = $_GET['item'];
The problem appeared when the path is different:
page.php?id=1&num=24&item=12&page=0#topView
I can't query the database every time a person accesses the page, because there are 1000+ unique visits.
My question is: how can I count each page visit uniquely?
Note:
page.php?id=1&num=24&item=12
or
page.php?id=1&num=24&item=15
or
page.php?id=2&num=24&item=12
each of these links gives me unique product info, depending on the variables.
Thank you!
Since my comment is the solution to your problem, I am converting it into an answer.
"I can't query the data base each time when a person access the page, it's because there are, 1000+ unique visits." - I doubt you can. In my opinion - you should. Count all the page accesses and when you want to grab the final results, do grouping by ip, id, num, item. Putting all the data into the database will also give you a brief view of the most popular pages. Even further, you will be able to see what pages are being accessed more times by one unique user and identify the reasons. The more data is better. It won't take much of your database.
There is a mistake in the algorithm you finally decided on. Imagine id is 11, num is 5 and item is 7, and another request has id = 1, num = 15, item = 7: concatenated without a separator, both produce "1157". Hope you see what I mean. :P Put it like this instead:
md5($_GET['id'].'-'.$_GET['num'].'-'.$_GET['item']);
so it is really unique.
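To illustrate the log-everything-and-group-at-read-time idea from the answer above, here is a rough sketch using SQLite in Python (your stack is PHP/MySQL, so treat the table and column names as placeholders):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (ip TEXT, id INTEGER, num INTEGER, item INTEGER)")

def log_hit(ip, id_, num, item):
    # Log every access; no dedup check on the write path.
    conn.execute("INSERT INTO hits VALUES (?, ?, ?, ?)", (ip, id_, num, item))

log_hit("1.2.3.4", 1, 24, 12)
log_hit("1.2.3.4", 1, 24, 12)  # repeat visit by the same user
log_hit("5.6.7.8", 1, 24, 12)

# Unique visits per product: group at read time, not on every page hit.
rows = conn.execute(
    "SELECT id, num, item, COUNT(DISTINCT ip) FROM hits GROUP BY id, num, item"
).fetchall()
print(rows)  # [(1, 24, 12, 2)]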
For example, you make a search for a hotel in London and get 250 hotels, of which 25 are shown on the first page. On each page the user has an option to sort the hotels by price, name, user reviews, etc. Now the intelligent thing to do would be to only get the first 25 hotels for the first page from the database. When the user moves to page 2, make another database query for the next 25 hotels and keep the previous results in cache.
Now consider this: the user is on page 1 and sees 25 hotels sorted by price, and now he sorts them by user rating. In this case, we should keep the hotels we already got in cache and only request the additional ones. How is that implemented? Is there something built into any language (preferably PHP), or do we have to implement it from scratch using multiple queries?
This is usually done as follows:
The query is executed with order by the required field, and with a top (in some databases limit) set to (page_index + 1) * entries_per_page results. The query returns a random-access rowset (you might also hear of this referred to as a resultset or a recordset depending on the database library you are using) which supports methods such as MoveTo( row_index ) and MoveNext(). So, we execute MoveTo( page_index * entries_per_page ) and then we read and display entries_per_page results. The rowset generally also offers a Count property which we invoke to get the total number of rows that would be fetched by the query if we ever let it run to the end (which of course we don't) so that we can compute and show the user how many pages exist.
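A rough sketch of this pattern in Python with SQLite, where LIMIT/OFFSET plays the role of the rowset's MoveTo and a separate COUNT supplies the total used to compute the page count (all names here are made up):

import sqlite3

ENTRIES_PER_PAGE = 25

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hotels (name TEXT, price REAL, rating REAL)")
conn.executemany("INSERT INTO hotels VALUES (?, ?, ?)",
                 [(f"Hotel {i}", 50 + i, i % 5) for i in range(250)])

def fetch_page(page_index, order_by="price"):
    # Whitelist the sort column; never interpolate raw user input into SQL.
    assert order_by in ("price", "name", "rating")
    rows = conn.execute(
        f"SELECT name, price FROM hotels ORDER BY {order_by} LIMIT ? OFFSET ?",
        (ENTRIES_PER_PAGE, page_index * ENTRIES_PER_PAGE),
    ).fetchall()
    total = conn.execute("SELECT COUNT(*) FROM hotels").fetchone()[0]
    return rows, total

rows, total = fetch_page(1)                 # hotels 26-50, sorted by price
page_count = -(-total // ENTRIES_PER_PAGE)  # ceiling division: 10 pages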
I would like to store some information as follows (note, I'm not wedded to this data structure at all, but this shows you the underlying information I want to store):
{ user_id: 12345, page_id: 2, country: 'DE' }
In these records, user_id is a unique field, but the page_id is not.
I would like to translate this into a Redis data structure, and I would like to be able to run efficient searches as follows:
For user_id 12345, find the related country.
For page_id 2, find all related user_ids and their countries.
Is it actually possible to do this in Redis? If so, what data structures should I use, and how should I avoid the possibility of duplicating records when I insert them?
It sounds like you need two key types: a HASH key to store your user's data, and a LIST for each page that contains a list of related users. Below is an example of how this could work.
Load Data:
> RPUSH page:2:users 12345
> HMSET user:12345 country DE key2 value2
Pull Data:
# All users for page 2
> LRANGE page:2:users 0 -1
# All users for page 2 and their countries
> SORT page:2:users BY nosort GET # GET user:*->country GET user:*->key2
Remove User From Page:
> LREM page:2:users 0 12345
Repeat GETs in the SORT to retrieve additional values for the user.
I hope this helps, let me know if there's anything you'd like clarified or if you need further assistance. I also recommend reading the commands list and documentation available at the redis web site, especially concerning the SORT operation.
Since user_id is unique and each user has a single country, keep them in a simple key-value pair; querying for a user is O(1) in that case. Then keep some Redis sets, with the page_id as key and all the user_ids as members.
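A rough sketch of that layout using redis-py (key names are illustrative, and a locally running Redis server is assumed). Using a SET rather than a LIST also addresses the duplicates concern, since SADD is idempotent:

import redis

r = redis.Redis()

def add_record(user_id, page_id, country):
    # user -> country as a plain key-value pair: O(1) lookup by user_id.
    r.set(f"user:{user_id}:country", country)
    # page -> users as a set; re-adding the same user is a no-op.
    r.sadd(f"page:{page_id}:users", user_id)

add_record(12345, 2, "DE")
add_record(12345, 2, "DE")  # duplicate insert, set is unchanged

country = r.get("user:12345:country")  # b'DE'
users = r.smembers("page:2:users")     # {b'12345'}
pairs = {u.decode(): r.get(f"user:{u.decode()}:country") for u in users}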