Datastream for historical intraday stock/index data - dataset

I am currently about to write my master's thesis, and I need historical intraday stock and index data for the USA and Europe, i.e. stock prices and index prices at 5-minute intervals from Jan. 1, 2017.
I can use Eikon through my university, but I couldn't find the data I wanted. Maybe I did something wrong or I am missing something. So I want to ask whether Eikon offers such historical intraday data or not. If not, could you recommend another datastream for that?
Many thanks

Related

I'd like to hear thoughts about using time-series databases for this project:

The project is to collect longitudinal data on inmates in a state prison system with the goals of recognizing time-based patterns and empowering prison justice advocates. The question is what time-series DB should I use?
My starting point is this article:
https://medium.com/schkn/4-best-time-series-databases-to-watch-in-2019-ef1e89a72377
and it's looking like the first 3 (InfluxDB, TimescaleDB, OpenTSDB) are on the table, but not so much the last one (I'm dealing with much more than strictly numerical data)
Project details:
Currently I'm using Postgres and plan to update the schema to look like (in broad strokes):
low-volatility fields like: id number, name, race, gender, date-of-birth
higher-volatility fields like: current facility, release date, parole eligibility date, etc
time-series admin data: begin current, end current, period checked. These show the time period during which the above 2 groups of fields were current and how often they were checked for changes.
I'm thinking it would be better to move to a time-series db and keep track of each individual update instead of having some descriptive info associated with a start date, end date, and period checked field. (like valid 2020-01-01 to 2021-08-25, checked every 14 days)
What I want to prioritize is speed of pulling reports (like what percentage of inmates grouped by certain demographics have exited the system before serving 90% of their sentence?) over insert throughput and storage space. I'm also interested in hearing opinions on ease of learning, prominence in the industry, etc.
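To make the "keep track of each individual update" idea concrete, here is a minimal sketch in plain SQL, run through Python's built-in sqlite3 purely for illustration. It does not favour any of the databases listed above. The table names (inmate, observation), the column names and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inmate (              -- low-volatility fields (hypothetical subset)
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    gender TEXT
);
CREATE TABLE observation (         -- one row per check, appended and never updated
    inmate_id    INTEGER NOT NULL REFERENCES inmate(id),
    checked_on   TEXT NOT NULL,    -- ISO date of the check
    facility     TEXT,
    release_date TEXT,
    PRIMARY KEY (inmate_id, checked_on)
);
""")
conn.execute("INSERT INTO inmate VALUES (1, 'A. Example', 'F')")
conn.executemany(
    "INSERT INTO observation VALUES (?, ?, ?, ?)",
    [
        (1, "2020-01-01", "North", "2022-06-01"),
        (1, "2020-01-15", "North", "2022-06-01"),
        (1, "2020-01-29", "South", "2022-03-01"),
    ],
)

# Summarise the checks per distinct combination of values. Note that this
# merges non-adjacent periods with identical values; separating those
# would need a gaps-and-islands style query.
periods = conn.execute("""
    SELECT inmate_id, facility, release_date,
           MIN(checked_on) AS first_seen,
           MAX(checked_on) AS last_seen,
           COUNT(*)        AS checks
    FROM observation
    GROUP BY inmate_id, facility, release_date
    ORDER BY first_seen
""").fetchall()
```

With this layout the "valid from / valid to / checked every N days" summary is derived by a query instead of being stored, so either form can be reported from the same data.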
My background:
I'm a bootcamp-educated generalist in data science with a background in CS. I've worked with SQL (Postgres, SQLite) and NoSQL (Mongo) databases in the past, and my DB-modeling ability is from an undergrad databases class. I'm most familiar with Java and Python (and many of the data science python packages), but learning a new language isn't a huge hurdle.
Thanks for your time!

What storage mechanism can I use to store the data related to user interaction of my website for a day?

I store information about which items were accessed. That's it initially. I will store the id and type of the items that were accessed. For example, in a relational table it would be:
id  type           view
1   dairy product  100
2   meat           88
Later on, at the end of the day, I will transfer this data to the actual table of the product.
products
id  name             view
1   Cheesy paradise  100
This is a website, and I don't want to update the table every time a user visits a product, because the products are in a relational database and it would be very unprofessional. I want to make a service in Node.js so that when the user visits a product, stays for 5 seconds and scrolls the page to the bottom, I increment a counter in a high-speed storage, and at the end of the day I update the related products in "one go".
I will handle only 300 visits to different products a day. But, of course, I want my system to grow, and it should eventually handle tracking a thousand products per minute, for example. When I thought about this feature, I thought about using Mongo. But I don't know, it seems like too much for this simple task. What technology would fit this situation better?
I would recommend MongoDB, since you are mostly "dumping" data into a database. That also allows you to dump more information in the future than you do now, no matter what kind of documents you dump now. Mongo is totally fine for a "dump" database structure.
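Whatever store is chosen, the pattern described in the question (count visits in a fast buffer, then update the products "in one go") can be sketched roughly as follows. This is an illustration only: an in-process Counter stands in for the high-speed storage, sqlite3 stands in for the relational database, and the column name view_count and the function names are hypothetical.

```python
import sqlite3
from collections import Counter

# In-process stand-in for the "high speed storage". A real service would
# use an external store (for example the MongoDB collection recommended
# above) so that counts survive a restart.
pending_views = Counter()

def record_view(product_id):
    """Count one qualifying visit without touching the products table."""
    pending_views[product_id] += 1

def flush_views(conn):
    """Apply all buffered counts in one transaction, then clear the buffer."""
    rows = [(count, product_id) for product_id, count in pending_views.items()]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "UPDATE products SET view_count = view_count + ? WHERE id = ?", rows
        )
    pending_views.clear()
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, view_count INTEGER)"
)
conn.execute("INSERT INTO products VALUES (1, 'Cheesy paradise', 100)")
conn.commit()

for _ in range(3):       # three qualifying visits during the day
    record_view(1)
updated = flush_views(conn)  # end-of-day batch update
views = conn.execute("SELECT view_count FROM products WHERE id = 1").fetchone()[0]
```

The relational table is written once per flush, not once per visit, which is the property the question asks for.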

Making a table with fixed columns versus key-valued pairs of metadata?

I was asked to create a table to store paid-hours data from multiple attendance systems from multiple geographies from multiple sub-companies. This table would be used for high level reporting so basically it is skipping the steps of creating tables for each system (which might exist) and moving directly to what the final product would be.
The request was to have a dimension for each type of hours or pay like this:
date       | employee_id | type          | hours | amount
2016-04-22 | abc123      | regular       | 80    | 3500
2016-04-22 | abc123      | overtime      | 6     | 200
2016-04-22 | abc123      | adjustment    | 1     | 13
2016-04-22 | abc123      | paid time off | 24    | 100
2016-04-22 | abc123      | commission    |       | 600
2016-04-22 | abc123      | gross total   |       | 4413
There are multiple rows per employee, but the thought process is that this will allow us to capture new dimensions if they are added.
The data is coming from several sources and I was told not to worry about the ETL, but just design the ultimate table and make it work for any system. We would provide this format to other people for them to fill in.
I have only seen the raw data from one system, and it looks like this:
date | employee_id | gross_total_amount | regular_hours | regular_amount | OT_hours | OT_amount | classification | amount | hours
It is pretty messy. Multiple rows for employees and values like gross_total repeat each row. There is a classification column which has items like PTO (paid time off), adjustments, empty values, commission, etc. Because of repeating values, it is impossible to just simply sum the data up to make it equal the gross_total_amount.
Anyway, I would prefer a column-based approach where each row describes the employee's paid hours for a cut-off date. One problem is that I won't know all of the possible types of hours, so I can't necessarily make a table like:
date | employee_id | gross_total_amount | commission_amount | regular_hours | regular_amount | overtime_hours | overtime_amount | paid_time_off_hours | paid_time_off_amount | holiday_hours | holiday_amount
I am more used to data formatted that way, though. The concern is that you might not capture all of the necessary columns, or that something new is added later. (For example, I know there is maternity leave, paternity leave and bereavement leave, and in other geographies there are labor laws about working at night, etc.)
Any advice? Is the table which was suggested to me by my superior a viable solution?
TAM makes lots of good points, and I have only two additional suggestions.
First, I would generate some fake data in the table as described above, and see if it can generate the required reports. Show your manager each of the reports based on the fake data, to check that they're OK. (It appears that the reports are the ultimate objective, so work back from there.)
Second, I would suggest that you get sample data from as many of the input systems as you can. This is to double check that what you're being asked to do is possible for all systems. It's not so you can design the ETL, or gather new requirements, just testing it all out on paper (do the ETL in your head). Use this to update the fake data, and generate fresh fake reports, and check the reports again.
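A minimal sketch of that fake-data check, using Python's built-in sqlite3 as a stand-in for whatever database the reporting table will live in. The table name paid_hours and the report columns are assumptions; the rows are the sample rows from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE paid_hours (
        date        TEXT NOT NULL,
        employee_id TEXT NOT NULL,
        type        TEXT NOT NULL,
        hours       REAL,          -- NULL where a type has no hours
        amount      REAL
    )
""")
fake_rows = [
    ("2016-04-22", "abc123", "regular", 80, 3500),
    ("2016-04-22", "abc123", "overtime", 6, 200),
    ("2016-04-22", "abc123", "adjustment", 1, 13),
    ("2016-04-22", "abc123", "paid time off", 24, 100),
    ("2016-04-22", "abc123", "commission", None, 600),
    ("2016-04-22", "abc123", "gross total", None, 4413),
]
conn.executemany("INSERT INTO paid_hours VALUES (?, ?, ?, ?, ?)", fake_rows)

# Pivot to one row per employee and date. The 'gross total' row is kept
# apart so it can be compared with the sum of the detail rows.
report = conn.execute("""
    SELECT date, employee_id,
           SUM(CASE WHEN type = 'regular'  THEN hours END) AS regular_hours,
           SUM(CASE WHEN type = 'overtime' THEN hours END) AS overtime_hours,
           SUM(CASE WHEN type <> 'gross total' THEN amount END) AS detail_amount,
           SUM(CASE WHEN type =  'gross total' THEN amount END) AS gross_total
    FROM paid_hours
    GROUP BY date, employee_id
""").fetchall()
```

If a required report cannot be produced from the fake rows with a query of this kind, that is a sign the proposed table needs to change before any real data is loaded.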
Let me recapitulate what I understand to be the basic task.
You get data from different sources, having different structures. Your task is to consolidate them in a single database to be able to answer questions about all these data. I understand the hint about "not to worry about the ETL, but just design the ultimate table" to mean that your consolidated database doesn't need to contain all the detail information that might be present in the original data, but just enough information to fulfill the specific requirements of the consolidated database.
This sounds sensible as long as your superior is certain enough about these requirements. In that case, you will reduce the information coming from each source to the consolidated structure.
In any case, you'll have to capture the domain semantics of the data coming in from each source. Lacking access to your domain semantics, I can't clarify the mess of repeating values etc. for you. E.g., if there are detail records and gross total records, as in your example, it would be wrong to add the hours of all records, as this would always yield twice the hours actually worked. So someone will have to worry about ETL, namely interpreting each set of records, probably consisting of all entries for an employee and one working day, finding out what they mean, and transforming them to the consolidated structure.
I understand another part of the question to be about the usage of metadata. You can have different columns for notions like holiday leave and maternity leave, or you have a metadata table containing these notions as a key-value pair, and refer to the key from your main table. The metadata way is sometimes praised as being more flexible, as you can introduce a new type (like paternity leave) without redesigning your database. However, you will need to redesign the software filling and probably also querying your tables to make use of the new type. So you'll have to develop and deploy a new software release anyway, and adding a few columns to a table will just be part of that development effort.
There is one major difference between a broad table containing all notions as attributes and the metadata approach. If you want to make sure that, for a time period, either all or none of the values are present, that's easy with the broad table: just make all attributes `not null`, and you're done. Ensuring this for the metadata solution would mean some rather complicated constraint that may or may not be available depending on the database system you use.
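The all-or-nothing guarantee of the broad table can be illustrated with a small sketch. It uses Python's built-in sqlite3 and a hypothetical two-attribute table (paid_hours_wide); the point is only that the database itself rejects an incomplete row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE paid_hours_wide (
        date           TEXT NOT NULL,
        employee_id    TEXT NOT NULL,
        regular_hours  REAL NOT NULL,
        overtime_hours REAL NOT NULL,
        PRIMARY KEY (date, employee_id)
    )
""")

# A complete row is accepted.
conn.execute("INSERT INTO paid_hours_wide VALUES ('2016-04-22', 'abc123', 80, 6)")

# A row with a missing value violates the NOT NULL constraint.
try:
    conn.execute(
        "INSERT INTO paid_hours_wide (date, employee_id, regular_hours) "
        "VALUES ('2016-04-22', 'xyz789', 40)"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM paid_hours_wide").fetchone()[0]
```

In the key-value layout the same rule ("every employee and date has a row for every type") cannot be stated as a column constraint, which is the difficulty described above.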
If that's not a main requirement, I would go a pragmatic way and use different columns if I expect only a handful of those types, and a separate key-value table otherwise.
All these considerations relied on your superior's assertion (as I understand it) that your consolidated table will only need to fulfill the requirements known today, so you are free to throw original detail information away if it's not needed due to these requirements. I'm wary of that kind of assertion. Let's assume some of your information sources deliver additional information. Then it's quite probable that someday someone asks for a report also containing this information, where present. This won't be possible if your data structure only contains what's needed today.
There are two ways to handle this, i.e. to provide for future needs. You can, after knowing the data coming from each additional source, extend your consolidated database to cover all data structures coming from there. This requires some effort, as different sources might express the same concept using different data, and you would have to consolidate those to make the data comparable. Also, there is some probability that not all of your effort will be worth the trouble, as not all of the detail information you get will actually be needed for your consolidated database. Another more elegant way would therefore be to keep the original data that you import for each source, and only in case of a concrete new requirement, extend your database and reimport the data from the sources to cover the additional details. Prices of storage being low as they are, this might yield an optimal cost-benefit ratio.

What is the best way to store historical data in Solr?

I'm keeping a record of the contents of 100 refrigerators (named fridgeA, fridgeB, etc.) on a daily basis. Every day the contents change, so one may have Milk, Eggs, and Soda today, but tomorrow might have Milk, Soda, Celery, and Apples.
I need to be able to perform a search for all fridges that contain Eggs on a specific date or date range.
What is the best way to store this data in Solr?
Make sure you store the current date in your data.
Then you can use a date range when you query:
creationDate:[2013-11-25T23:59:59.999Z/DAY TO 2013-11-26T23:59:59.999Z+1YEAR/DAY]
Or you can query a specific day:
creationDate:[2013-11-29T23:59:59.999Z/DAY TO 2013-11-29T23:59:59.999Z/DAY+1DAY}
See the wiki or this other helpful answer
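To make the suggestion concrete, here is a hedged sketch of one possible layout: one Solr document per fridge per day, with a multi-valued contents field. Only creationDate comes from the answer; the other field names (fridge, contents), the core name fridges and the URL are assumptions. The code only builds the request URL and does not contact a server.

```python
from urllib.parse import urlencode

# One document per fridge per day (field names other than creationDate
# are hypothetical).
docs = [
    {"id": "fridgeA_2013-11-25", "fridge": "fridgeA",
     "creationDate": "2013-11-25T00:00:00Z",
     "contents": ["Milk", "Eggs", "Soda"]},
    {"id": "fridgeA_2013-11-26", "fridge": "fridgeA",
     "creationDate": "2013-11-26T00:00:00Z",
     "contents": ["Milk", "Soda", "Celery", "Apples"]},
]

def item_query(item, start, end):
    """Build select parameters: match the item, filter on an inclusive date range."""
    return {
        "q": f"contents:{item}",
        "fq": f"creationDate:[{start} TO {end}]",
        "fl": "fridge,creationDate",
    }

params = item_query("Eggs", "2013-11-25T00:00:00Z", "2013-11-26T00:00:00Z")
# Assumed local core named "fridges"; adjust to the real Solr URL.
url = "http://localhost:8983/solr/fridges/select?" + urlencode(params)
```

Putting the date restriction in fq keeps it separate from the item match in q, so the same item query can be reused with different date ranges.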

Suggestion for designing an Access DB to store chronological snapshots of Excel sheet with ~100 rows and ~100 columns?

I'm an Excel user trying to solve this one problem, and the only efficient way I can think of is to do it with a database. I use arrays when programming in VBA/Python and I've queried databases before, but I've never really designed one. So I'm here to look for suggestions on how to structure this db in Access.
Anyway, I currently maintain a sheet of ~50 economics indicators for ~100 countries. It's a very straightforward sheet, with
Column headers: GDP , Unemployment , Interest Rate, ... ... ... Population
And Rows:
Argentina
Australia
...
...
Yemen
Zambia
etc.
I want to take snapshots of this sheet so I can see trends and run some analysis in the future. I thought of just duplicating the worksheet in Excel each time, but that feels inefficient.
I've never designed a database before. My question is: what's the most efficient way to store this data as chronological snapshots? In the future I will probably do these things:
Query a snapshot for a day mm-dd-yy in the past.
Take two different data points of a metric, for however many countries, and track the change, rate of change, etc.
Once I can query them well enough, I'll probably do some kind of statistical analysis, which just requires getting the right data set.
I feel like I need to create an individual table for each country and add a row to every country table every time I take a snapshot. I'll try to play with VBA to automate this.
I can't think of any other way to do this with fewer tables. What would you suggest? Is it recommended practice to use more than a dozen tables for this task?
There are a couple of ways of doing this.
Option 1
I'd suggest you probably only need a single table, something akin to:
Country, date_of_snapshot, columns 1-50 (GDP etc.)
Effectively, you would add a new row for each day and each country.
Option 2
You could also use a table structured as below, though this would require more complex queries, which may be too much for Access:
Country, date_of_snapshot, factor, value
with each factor (GDP etc.) getting a row for each date and country.
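As a rough sketch of Option 2, with Python's built-in sqlite3 standing in for Access and made-up figures, each snapshot adds one row per country and factor, and the change between two snapshot dates is a self-join. The table name snapshot is an assumption; the columns follow the layout listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE snapshot (
        country          TEXT NOT NULL,
        date_of_snapshot TEXT NOT NULL,   -- ISO date, sorts chronologically
        factor           TEXT NOT NULL,   -- e.g. 'GDP', 'Unemployment'
        value            REAL,
        PRIMARY KEY (country, date_of_snapshot, factor)
    )
""")
conn.executemany(
    "INSERT INTO snapshot VALUES (?, ?, ?, ?)",
    [   # illustrative numbers only, not real statistics
        ("Argentina", "2020-01-01", "Unemployment", 9.0),
        ("Argentina", "2020-02-01", "Unemployment", 10.5),
        ("Zambia",    "2020-01-01", "Unemployment", 12.0),
        ("Zambia",    "2020-02-01", "Unemployment", 11.0),
    ],
)

# Change in one factor between two snapshot dates, per country.
changes = conn.execute("""
    SELECT a.country, b.value - a.value AS change
    FROM snapshot AS a
    JOIN snapshot AS b
      ON b.country = a.country AND b.factor = a.factor
    WHERE a.factor = 'Unemployment'
      AND a.date_of_snapshot = '2020-01-01'
      AND b.date_of_snapshot = '2020-02-01'
    ORDER BY a.country
""").fetchall()
```

Either option needs only one table. Under Option 1 the same comparison would subtract two columns of the same name from two rows, at the cost of a schema change whenever an indicator is added.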

Resources