We aggregate cinema data. Our primary dataset includes movie showtimes, ticket prices and admissions. We combine this data with all sorts of supporting data, including data that we get from YouTube, Twitter and weather reports. The end result is a comprehensive time-series dataset describing the entire theatrical movie release window.
The goal is to predict movie performance far into the future. Every time a person reserves or purchases a ticket from either of these cinemas, we capture a snapshot describing attributes of every seat in the auditorium. This adds up to 1.
We went through several providers, starting with Google and then Amazon. We got USD k in startup credits from Google, which was the primary deciding factor for choosing their services. This was a known bug that is fixed in newer PostgreSQL versions. The lack of a response from support acknowledging the issue was a big enough red flag to move on. I am glad we did, because it has been 8 months since we raised the issue and the version of PostgreSQL has still not been updated.
As Amazon announced Timestream (their own time-series database), it became clear that this requirement would not be addressed in the foreseeable future (the issue has already been open for 2 years).
Then we moved to Aiven. It had all the extensions that I needed (including TimescaleDB), and it did not lock us in with a particular server provider, meaning we could host our Kubernetes cluster on any of the providers Aiven supports. However, what I overlooked is that you do not get superuser access.
This resulted in numerous issues e. When this happened, support offered to upgrade the instance to one with a larger volume. While this is a fine solution, it caused a longer than necessary outage. Someone with SSH access could have diagnosed and fixed this issue in a couple of minutes. And when we started to experience continuous outages due to what later turned out to be a bug in the TimescaleDB extension used by Aiven. Despite me giving shit to Aiven. Tolerating my questions that are already covered in the documentation and aiding with troubleshooting issues.
All this time I was trying to avoid the unavoidable: managing the database ourselves. Now we rent our own hardware and maintain the database ourselves.
Therefore, you must plan for what features you will require in the future. For a simple database that will not grow into billions of records and does not require custom extensions, I would pick either without a second thought (the near-instant ability to scale the instance, migrate servers to different territories, point-in-time recovery, built-in monitoring tools and managed replication save a lot of time). If your business is all about the data and you know that you will require custom hardware configuration and whatnot, then your best bet is hosting and managing the database yourself.
If I were to start over and had spent the time to estimate how quickly and how large we were going to grow, I would have used a bare-metal setup and hired a freelance DBA from the first day. My primary criterion for choosing managed services was the reduced management overhead.
I assumed that the cost and hardware would be about the same. I thought that materialized views were a good enough feature on their own to justify learning PostgreSQL. In contrast, I thought I would never run scripts in the database (MySQL teaches you that the database is only for storing data and all logic must be implemented in the application code). Two years later, we have gotten rid of most materialized views and are using hundreds of custom procedures. But before that, there were multiple botched attempts at using materialized views. There were only two rules to adhere to:
There is nothing wrong with the above query. This approach worked for a long time. However, as the number of records grew to millions and then billions, the time it took to refresh materialized views grew from a couple of seconds to hours. If you are not familiar with materialized views, it is worth noting that you can only refresh the entire materialized view; there is no way to refresh a subset of a view based on a condition.
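To make that cost concrete, here is a minimal Python sketch. It is not our production code: it emulates a materialized view with a plain SQLite table (SQLite has no materialized views), and the names `showtime`, `showtime_summary` and `refresh_summary` are made up for illustration. What it shows is that a full refresh rewrites every row of the view even when a single source row has changed. In PostgreSQL the corresponding operation is `REFRESH MATERIALIZED VIEW`.

```python
import sqlite3


def refresh_summary(conn: sqlite3.Connection) -> int:
    """Rebuild the whole emulated view; return how many rows were rewritten."""
    with conn:  # a single transaction, like one full refresh
        conn.execute("DELETE FROM showtime_summary")
        conn.execute(
            "INSERT INTO showtime_summary (cinema_id, showtime_count) "
            "SELECT cinema_id, COUNT(*) FROM showtime GROUP BY cinema_id"
        )
    # Every row was deleted and re-inserted, so the row count of the
    # summary table equals the number of rows rewritten by this refresh.
    return conn.execute("SELECT COUNT(*) FROM showtime_summary").fetchone()[0]


# Hypothetical schema and data: 10,000 showtimes spread over 100 cinemas.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE showtime (id INTEGER PRIMARY KEY, cinema_id INTEGER NOT NULL)"
)
conn.execute(
    "CREATE TABLE showtime_summary ("
    "cinema_id INTEGER PRIMARY KEY, showtime_count INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO showtime (cinema_id) VALUES (?)",
    [(i % 100,) for i in range(10_000)],
)
conn.commit()

rows_first = refresh_summary(conn)

# Only one source row changes...
conn.execute("INSERT INTO showtime (cinema_id) VALUES (7)")
conn.commit()

# ...yet the refresh still rewrites all summary rows, not just cinema 7.
rows_second = refresh_summary(conn)
count_for_cinema_7 = conn.execute(
    "SELECT showtime_count FROM showtime_summary WHERE cinema_id = 7"
).fetchone()[0]
```

The work done by each refresh is proportional to the size of the source table rather than to the size of the change, which is why refresh times grow with the dataset.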
I tried to solve the issue by breaking down MVs into multiple smaller MVs, e. The benefit of this approach is that: We broke down one long transaction into many shorter transactions. We are able to use indexes to speed up the JOINs.
We are able to refresh individual materialized views (some data changes more often than other data). The downside of this approach is that it multiplied the number of materialized views that we use and required us to develop a custom solution to orchestrate the refreshing of the materialized views.
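The orchestration problem can be sketched as follows. This is only an illustration of the idea, not the solution we built: the view names and the `refresh_statements` helper are hypothetical, and it assumes that each view declares which other views it reads from. Python's standard `graphlib.TopologicalSorter` then yields an order in which every view is refreshed after the views it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each view lists the views it reads from.
DEPENDENCIES = {
    "venue_summary": set(),
    "movie_summary": set(),
    "showtime_summary": {"venue_summary", "movie_summary"},
}


def refresh_statements(dependencies, concurrently=False):
    """Return REFRESH statements ordered so that dependencies come first.

    Note: in PostgreSQL, CONCURRENTLY requires a unique index on the
    materialized view, so it is off by default in this sketch.
    """
    keyword = " CONCURRENTLY" if concurrently else ""
    order = TopologicalSorter(dependencies).static_order()
    return [f'REFRESH MATERIALIZED VIEW{keyword} "{name}"' for name in order]


statements = refresh_statements(DEPENDENCIES)
for statement in statements:
    print(statement)
```

Running each statement in its own transaction keeps the individual transactions short, and a subset of the map can be passed in when only the frequently changing views need a refresh.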
At the time, it seemed reasonable and I went with it. A separate program was written to perform materialization using these instructions. In general, this approach worked well. However, we soon outgrew it. A view that requires scanning an entire table was not feasible for large tables with billions of records. So far I have described how we used MVs to effectively extend a table. This approach did not scale with large tables. Thus the third iteration was born: instead of using materialized views to extend the base table, create materialized views that abstract a data domain. This approach works and we continue to use several such materialized views.
While the latter approach covered all our day-to-day operations, we still needed to run queries on the historical data. Running these queries without materialized views would take a lot of index planning for individual queries. Furthermore, running long transactions against the master instance would have prevented autovacuum from cleaning up and caused table bloat.
I could have set up logical replication and allowed analysts to run whatever queries they wanted on the replica without blocking autovacuum.
If a fire or a labor strike disables one node in a supply network, another outfit can just as easily slot in, without the company that commissioned the goods ever becoming aware of it. Instead, each node has to talk only to its neighboring node, passing goods through a system that, considered in its entirety, is staggeringly complex.
In this way, these physical infrastructures distributed all over the world are very much like the invisible network that makes them possible: the internet. By the time goods surface as commodities to be handed through the chain, purchasing at scale demands that information about their origin and manufacture be stripped away.
Ethan Jewett explained the problem to me in terms of a theoretical purchase of gold: In some sense all gold is the same, so you just buy the cheapest gold you can get. But if you look at it in another way, it matters how it was mined and transported. And then all of the sudden, every piece of gold is a little bit different. And so it becomes very difficult to compare these things that, in terms of your actual manufacturing process, are almost exactly the same. As Jewett described this state of affairs, I felt a jolt of recognition.
A programmer need only know about the module with which she is working, because managing the complexity of the entire system would be too much to ask of any single individual. From there, the notion of modularity proliferated wildly, as a way of thinking about and structuring everything from organizations to economics to knitting.