I have a scraper which collects news articles throughout the day from different sources.
I want to display data like 'most common words in the last 30 days (in source X)' on my page.
For now I have saved the articles to my database, each consisting of the timestamp the article was released and a string of the content.
With a few datasets this works fine, but I don't understand how to balance the load so that the front end has the most flexibility but not too much data to count.
I thought I could run a script which takes all the articles from one day and creates a new table containing each word with its count. I ran into two issues here:
1 - How do I create a table for this? Since every article has a different length and a different set of words, I would need a table with as many fields as the number of words in the longest article. I could say I will only save the first 20, but I don't really like the idea.
2 - If the script takes all the articles from one day and calculates the word counts, I have a minimum resolution of one day, so I won't be able to differentiate any further. I chose to run the script for each day to reduce the data that I will need to send to the front end on demand.
Don't create a table with a separate column for each of the first 20 words. Please. I beg you. Just don't.
Two possible approaches:
Use a fulltext search feature in your DBMS. You didn't tell us which one you use, so it's hard to be more specific.
Preprocess: Create a table with columns article_id, word_number, and word. This table will have a large number of rows, one for each word in each article. But that's OK. SQL databases are made for handling vast tables of simple rows.
The unique key on the table contains two columns: article_id and word_number. A non-unique key for searching should contain word, article_id, word_number.
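A minimal sketch of that table (MySQL-style syntax assumed, since you didn't say which DBMS you use; names and sizes are illustrative):
CREATE TABLE words (
    article_id  INT         NOT NULL,  -- id of the article the word came from
    word_number INT         NOT NULL,  -- position of the word within the article
    word        VARCHAR(64) NOT NULL,  -- the word itself
    PRIMARY KEY (article_id, word_number),           -- the unique key
    KEY words_word (word, article_id, word_number)   -- the non-unique search key
);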
When you receive an incoming article, assign it an article_id number. Then break it up into words and insert each word into the table.
When you search for a word do SELECT article_id FROM words WHERE word=?. Fast. And you can use SQL set manipulation to do more complex searches.
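For example, a sketch of finding articles that contain both of two words, assuming your DBMS supports INTERSECT:
-- articles containing both 'economy' and 'growth' (illustrative words)
SELECT article_id FROM words WHERE word = 'economy'
INTERSECT
SELECT article_id FROM words WHERE word = 'growth';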
When you remove an article from your archive, DELETE the rows with that article_id value.
To get frequencies do SELECT COUNT(*) frequency, word FROM words GROUP BY word ORDER BY 1 DESC LIMIT 50.
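And because you keep the release timestamp on the articles themselves, the one-day-resolution problem disappears: you aggregate at query time over whatever window you want. A sketch of "most common words in the last 30 days", assuming MySQL syntax and an articles table with a released timestamp column:
-- top 50 words across articles released in the last 30 days
SELECT COUNT(*) AS frequency, w.word
FROM words w
JOIN articles a ON a.article_id = w.article_id
WHERE a.released >= NOW() - INTERVAL 30 DAY
GROUP BY w.word
ORDER BY frequency DESC
LIMIT 50;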
Our product is using Google Datastore as the application database. Most of the entities use IDs of type Long and some of type String. I noticed that the IDs of type Long are not in consecutive order.
Now we are exporting some big tables, with around 30-40 million entries, to JSON files for some business purposes. Initially we expected that a simple query like "ofy().load().type(ENTITY.class).startAt(cursor).limit(BATCH_LIMIT).iterator()" would help us iterate through the entire content of that specific table, starting from the first entry and ending with the most recently created one. We are working in batches and storing the cursor after every batch, so that the next task can load the batch and resume.
But after noticing that an entity created some minutes ago can have an ID smaller than the ID of another entity created a week ago, we are wondering if we should consider a content freeze during this export period. On one hand, it's critical to make a good export and not miss older data up to a specific date; on the other hand, a content freeze longer than one day is a problem for our customers.
What do you advise us to do?
Thanks,
Cristian.
I do not think you need to worry about the uniqueness of your IDs. Datastore is built on top of Bigtable, using six tables:
the first table stores entities
the second stores entities by kind
the third stores indexes for the property values in ascending order
the fourth stores indexes for the property values in descending order
the fifth stores indexes for multiple properties together
the sixth keeps track of the next unique ID for each kind
The key format is something like this:
[application ID]-[namespace]-[Kind]-[ID]
This guarantees the uniqueness of each entity.
Yes, the key format in that table is [Application ID]-[Kind Name], and the value is the next ID. Let's say you have a kind products; that table will look like this: |key(yourapp-products), Next ID(3)|. Now, when you create a new entity of kind products, it will be assigned ID(3), and the row in that table will get a new value: |key(yourapp-products), Next ID(4)|. Also worth mentioning: that table has only one row here, since we have only one kind, products.
Do you specify the IDs yourself, or let Datastore generate them? It sounds like you have a "pre-allocating IDs" issue. Just speculating, but for every batch you need some sort of Kind.allocate_ids(size=blah) call; that way you can keep the sequence.
Say we have a fruits table that has a high number of reads and also inserts, though almost no updates or deletes.
We have two columns that store values with a small number of options. Let's say category [banana, apple, orange or pear] and status [finished, ongoing, spoiled, destroyed or ok].
Finally, we have a column for the owner's last name.
Notes:
I am going to search sometimes by category and other times by status.
In all cases, last_name will be used for the search.
I will always perform an exact match on category/status, but a starts-with match on last_name.
Examples of common queries:
SELECT * FROM fruit_table WHERE category='BANANA' and last_name LIKE 'Cool%'
SELECT * FROM fruit_table WHERE status='Spoiled' and last_name LIKE 'Co%'
SELECT * FROM fruit_table WHERE category='BANANA' and last_name LIKE 'smith%'
How can I prepare it so we get low response times? Will an index help (taking into account that the values in the columns are not dispersed at all)? Might a bitmap index help here? What about partitioning?
Finally, apologies about the title; I did not know how to formulate it properly.
Bitmap indexes help immensely with items that have a limited number of available choices.
A standard b-tree index (or non-clustered index in SQL Server) will work well for the last_name column.
I would do those two first, as they are easy and then see how things work.
It is generally a bad practice to prematurely optimize. However, adding indices is a quick way to increase speed without much effort. For more information on indices in Oracle, read this question.
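A sketch of those indexes, assuming Oracle syntax and the table/column names from the question (the index names are illustrative):
-- bitmap indexes for the two low-cardinality columns
CREATE BITMAP INDEX fruit_category_bix ON fruit_table (category);
CREATE BITMAP INDEX fruit_status_bix ON fruit_table (status);
-- standard b-tree index for the starts-with searches
CREATE INDEX fruit_lastname_ix ON fruit_table (last_name);
Note that the LIKE 'Cool%' predicates can use the b-tree index because the patterns are anchored at the start of last_name.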
For my website, I want to make something that works a bit like the tags on Stackoverflow - so some fields will have an autocompleter, and the autocompleter will display the number of times that other users have selected each suggested value. I suppose I'd have a database structure like this:
Articles
ArticleID
Content
TagId
Tags
TagId
TagName
Occurances
With the idea being that Occurances represents the number of times each TagId is referenced from the Articles table.
What is the best way to implement this? I could add/subtract from the Occurances column in each of the stored procedures that update the article table, but I might miss one, and anyway, there are some difficulties with this if a user removes a tag from something (as it's easy to add 1 to the field for the newly added tag, but harder to work out which tag is being replaced).
There is a lot I don't understand about SQL Server. Is there a more robust way of counting occurrences like this, one that the database system will deal with itself? It would be OK if the data was cached once a day or something.
To be able to have more than one tag attached to an article, you will have to add another table that connects the article table to the tag table. This is called a "many-to-many" relation.
article
article_id
content
article_tag
article_id
tag_id
tag
tag_id
tagname
Doing it like this, article 1 can be attached to tag 2, and the next row can be 1 and 3 and so on, so one article points to many tags. To count a certain tag, you join the article_tag and tag tables, and count the rows in article_tag where tag.tagname = 'mysql', for example.
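A sketch of that count, using the table names above:
-- number of articles tagged 'mysql' (example tag)
SELECT COUNT(*) AS occurrences
FROM article_tag
JOIN tag ON tag.tag_id = article_tag.tag_id
WHERE tag.tagname = 'mysql';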
You can create an indexed view that aggregates all the counts you need and is automatically maintained:
create view TagCounts
with schemabinding
as select TagId, count_big(*) as Occurances
from dbo.ArticleTags
group by TagId;
go
create unique clustered index cdxTagCounts on TagCounts (TagId);
go
Now the TagCounts.Occurances field is automatically maintained by SQL Server whenever you insert/delete/update the ArticleTags table. You can query it like:
select Occurances from dbo.TagCounts with (noexpand) where TagId = ...;
And you can cache the result with LinqToCache, as such a query matches the restrictions of Query Notifications.
The trade-off of using a pre-aggregated indexed view is scalability: since an update of any article updates the Occurances count for that article's tags, an exclusive lock is required to update the count. This implies that only one transaction can use a TagId at any moment. Depending on your traffic and on other elements of your design, this restriction may or may not be acceptable.
The other alternative is a table of counts. Front ends (your ASP.NET farm) read these counts, and then update an in-memory count for each operation, keeping track of the delta from the counts in the table. Periodically the front ends merge their deltas into the table (e.g. every 5 minutes) and refresh the in-memory copy. This way front ends see a stale version of the truth, but a user sees immediate feedback from his own actions: because of session stickiness, his HTTP requests are processed by the same front end, so he immediately sees his own article updates triggering modifications to the tag counts. Users do not, however, immediately see the updates from other users that are load-balanced to another front end. Because a crash of a front end (or a process recycle...) will lose the deltas accumulated so far, the count table will drift away from the truth over time and will have to be periodically corrected to the true count from the database.
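A minimal sketch of that counts table and the periodic merge, assuming SQL Server syntax; the table name and the example delta values are illustrative:
-- hypothetical counts table (named to avoid clashing with the TagCounts view above)
CREATE TABLE dbo.TagCountCache (TagId int PRIMARY KEY, Occurances bigint NOT NULL);

DECLARE @TagId int = 42, @Delta bigint = 3;  -- one accumulated in-memory delta (example values)
-- merge the delta into the stored count, inserting the row if the tag is new
MERGE dbo.TagCountCache AS t
USING (SELECT @TagId AS TagId, @Delta AS Delta) AS s
    ON t.TagId = s.TagId
WHEN MATCHED THEN
    UPDATE SET Occurances = t.Occurances + s.Delta
WHEN NOT MATCHED THEN
    INSERT (TagId, Occurances) VALUES (s.TagId, s.Delta);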
If you wish for even more accuracy (all users see the true count immediately), then you can do something based on fast in-memory key-value stores, which would be basically the same as my first proposal but with much higher throughput and lower latency, perhaps something based on memcached + Redis. I'm not acquainted with SO's architecture, but I believe they may be doing something similar.
You could use this query to get the number of occurrences per tag:
SELECT Tags.TagId, COUNT(Articles.TagId) AS Occurances
FROM Tags
LEFT JOIN Articles ON Articles.TagId = Tags.TagId
GROUP BY Tags.TagId
It could be used in a view or stored procedure, and you can set up your website's cache to requery it as often as required.
If you are using a relational database, the correct way to handle this problem is to NOT store the occurrences on the table itself, but rather dynamically query the number of occurrences on the articles table.
If you don't do it this way, you're stuck coding update queries every time you add/delete a row... generally not nice. If you query dynamically, you won't have an occurrences column in the table, but will instead get that information in, e.g., your presentation/model layer code.
Use:
SELECT COUNT(*) FROM ARTICLES WHERE TagId = 'xxx' ;
This query would run as part of your iterating code.
I need to store multiple 4-letter strings for each database row, but the number of 4-letter strings could be different every time.
So would it be easier to set up a new table and add a new row for each 4-letter string, with the ID of the related row in the other table?
For normalisation and performance reasons, as well as being able to perform efficient queries later, you would want to store them in a related table:
Main : ID, other columns
Related : Main_ID, 4-letter-string
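A sketch of those two tables in SQL, with assumed types (the Code column name is an assumption):
CREATE TABLE Main (
    ID INT PRIMARY KEY
    -- other columns ...
);

CREATE TABLE Related (
    Main_ID INT NOT NULL,      -- relates back to Main
    Code CHAR(4) NOT NULL,     -- one 4-letter string per row
    FOREIGN KEY (Main_ID) REFERENCES Main (ID)
);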
If there is nothing else you will store in the Main table, then just store them as multiple rows, and relate via a common ID.
You could store them all in one record and still search efficiently if FULLTEXT searching is turned on, but I doubt your 4-letter strings are natural-language words, so it may not suit as well.
Probably a noob question, but I'll go for it nevertheless.
For sake of example, I have a Person table, a Tag table and a ContactMethod table. A Person will have multiple Tag records and multiple ContactMethod records associated with them.
I'd like to have a forgiving search which will search among several fields from each table. So I can find a person by their email (via ContactMethod), their name (via Person) or a tag assigned to them.
As a complete noob to FTS, two approaches come to mind:
Build some complex query which addresses each field individually
Build some sort of lookup table which concatenates the fields I want to index and just do a full-text query on that derived table.
(Feel free to edit for clarity; I'm not in it for the rep points.)
If your SQL Server supports it, you can create an indexed view and full-text search that; you can use containstable(*, '"chris"') to search all the columns.
If it doesn't support it, then since the fields are all coming from different tables, I think that for scalability, if you can easily populate the fields into a single row per record in a separate table, I would full-text search that rather than the individual records. You will end up with a less complex FTS catalog, and your queries will not need to do four full-text searches at a time. Running lots of separate FTS queries over different tables at the same time is a ticket to query performance issues, in my experience. The downside of doing this is that you lose the ability to search on surname alone; if that is something you need, you might have to look at an alternative.
In our app we found that the single table was quicker (we can't rely on customers having Enterprise-edition SQL Server at hand), so we populate the data, separated with spaces, into an FTS table through an update SP, and then our main contact lookup runs a search over that list. We have two separate searches: one to handle finding things with precision (i.e. names or phone numbers), and one for just free text. The other nice thing about the table is that it is relatively easy and low-cost to add further columns to the lookup (we have been asked for social security number, for example; to do it we just added the column to the update SP and we were away, with little or no impact).
One possibility is to make a view which has these columns: PersonID, ContentType, Content. ContentType would be something like "Email", "PhoneNumber", etc... and Content would hold that. You'd be searching on the Content column, and you'd be able to see what the person's ID is. I'm not 100% sure how full text search works though, so I'm not sure if you could use that on a view.
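A sketch of such a view, with assumed table and column names (Name, Email, TagName, and a PersonID on Tag and ContactMethod are all assumptions):
CREATE VIEW PersonSearch AS
SELECT PersonID, 'Name' AS ContentType, Name AS Content FROM Person
UNION ALL
SELECT PersonID, 'Email' AS ContentType, Email AS Content FROM ContactMethod
UNION ALL
SELECT PersonID, 'Tag' AS ContentType, TagName AS Content FROM Tag;
One caveat on the full-text question: in SQL Server, a full-text index can only be created on a base table or an indexed view, and indexed views cannot contain UNION ALL, so you would likely have to materialise this into a real table to full-text index it.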
FTS can search multiple fields out of the box. The CONTAINS predicate accepts a list of columns to search, and so does CONTAINSTABLE.
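For example (column names assumed):
-- search two Person columns at once for 'chris'
SELECT PersonID
FROM Person
WHERE CONTAINS((FirstName, LastName), 'chris');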