
Foundations of Data Systems

Reliable, Scalable, and Maintainable Systems

The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. When was the last time a technology with a scale like that was so error-free? ~ Alan Kay

  • Applications today are data-intensive, as opposed to compute-intensive.

    • CPU power is rarely a limiting factor.
    • The bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing.
  • We think of databases, queues, and caches as different tools:

    • A database and a message queue have a superficial similarity - both store data for some time.
    • But they have very different access patterns, resulting in different performance characteristics and different implementations.
  • Tricky questions that come up when you are designing a data system or service:

    • How do you ensure that the data remains correct and complete, even when things go wrong internally?
    • How do you provide consistently good performance to clients, even when parts of the system are degraded?
    • How do you scale to handle an increase in load?
    • What does a good API for the service look like?

Reliability

The system should work correctly, even in the face of adversity.

  • For software, reliability roughly means the following:

    • It performs functions as per user expectations
    • It tolerates the user making mistakes or using the software in unexpected ways.
    • Its performance is good enough for the required use case, under the expected load and data volume.
    • The system prevents any unauthorised access and abuse.
  • There might be times when we choose to sacrifice reliability in order to reduce development or operational cost, but be very conscious when cutting those corners.

Fault

  • The things that can go wrong are faults.
  • A fault is different from a #Failure.

    Fault - A component of the system deviating from its spec

Human Errors

A study found that the majority of outages were caused by configuration errors made by human operators, and only 10~25% were hardware faults.

How do we make our systems more reliable, in spite of unreliable humans?
  • Designing systems with fewer opportunities for human error.
    • Well-designed abstractions, APIs, and admin interfaces make the right thing easy and the wrong thing difficult to do.
  • Decoupling the places where people tend to make mistakes from places where these mistakes could cause failures.
    • Provide fully-featured sandbox environments for people to experiment in.
  • Test thoroughly at all levels, from unit tests to system wide integration tests.
  • Ensure ways for quick & easy recovery from human errors.
    • Rollbacks, tools to recompute data, etc.
  • Set up clear and detailed monitoring, for performance metrics and error rates.
    • This can give us a heads-up on what is happening and helps us understand failures.
Hardware Faults

Hard disks are reported as having a mean time to failure (MTTF) of 10~50 years. Thus, in a storage cluster with 10,000 disks, we should expect on average one disk to fail every day (a rough back-of-the-envelope check is sketched below).
  • The first response is to add redundancy to individual hardware components, e.g. RAM chips, hard disks, etc., to reduce the failure rate of the system.
  • When one component dies, the redundant component can take its place while the broken one is being replaced.
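
A rough back-of-the-envelope check of that claim, assuming independent failures and taking a midpoint of the quoted MTTF range (the numbers below are assumptions, purely for illustration):

```python
# Back-of-the-envelope: expected disk failures per day in a large cluster.
# Assumes independent failures and an MTTF of 30 years (roughly the midpoint of 10~50).
disks = 10_000
mttf_years = 30
mttf_days = mttf_years * 365

failures_per_day = disks / mttf_days
print(f"expected failures per day ~ {failures_per_day:.2f}")  # ~ 0.91, i.e. roughly one a day
```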

Multi-Machine Redundancy
  • With increasing data volumes and computing demands, applications use larger numbers of machines, which proportionally increases the rate of hardware faults. This pushes more applications towards multi-machine redundancy to stay fault tolerant.
    • These setups provide a way to perform a hot-swap: if one pod goes down, it is replaced by another one that was watching the first pod's health.
  • A system that can tolerate machine failure can be patched one node at a time, without downtime of the entire system ( #rollingUpgrade)
Software Faults
Cascading Failures
  • Systematic errors are hard to anticipate, and because they are correlated across nodes, they cause many more system failures than uncorrelated hardware faults.
  • A small fault in one component triggers a fault in another component triggering further faults.

Failure

The System as a whole stops providing value (required service) to the user.

Fault Tolerance

  • Systems that can anticipate faults and can cope with them are fault-tolerant or resilient.
  • We can only tolerate certain kinds of faults; we can never make a system 100% fault tolerant.
    • What if Earth goes boom? You would need the budget to set up a server on a different planet.
  • Counter-intuitively, it can make sense to increase the rate of faults by triggering them deliberately. Reference -> The Netflix Chaos Monkey

Scalability

The ability of a system to cope with increased load.
  • Increased load is one common reason for software degradation.
  • Discussing scalability doesn't mean saying "X is scalable" or "Y doesn't scale".
  • It means considering questions such as "If the system grows in a particular way, what are our options for coping with the growth?" or "How can we add computing resources to handle the additional load?"

What is Load?
  • A few load parameters (PS. the best choice of parameters still depends on your system architecture); a sketch of computing some of these from a request log follows after this list:
    • Requests per second
    • Ratio of reads to writes in DB
    • Number of simultaneously active users
    • Hit rate on a cache
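
A minimal sketch of deriving a few of these load parameters from a hypothetical in-memory request log (the log format and field order are assumptions, purely for illustration):

```python
from collections import Counter

# Hypothetical request log: (timestamp_seconds, user_id, operation, cache_hit)
log = [
    (0.0, "u1", "read", True),
    (0.4, "u2", "write", False),
    (0.9, "u1", "read", True),
    (1.2, "u3", "read", False),
]

window = (log[-1][0] - log[0][0]) or 1.0          # observation window in seconds
requests_per_second = len(log) / window

ops = Counter(op for _, _, op, _ in log)
read_write_ratio = ops["read"] / max(ops["write"], 1)

read_hits = [hit for _, _, op, hit in log if op == "read"]
cache_hit_rate = sum(read_hits) / len(read_hits)

active_users = len({user for _, user, _, _ in log})

print(requests_per_second, read_write_ratio, cache_hit_rate, active_users)
```
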
Fan Out
  • A term borrowed from electrical engineering, where it describes the number of logic-gate inputs that are attached to another gate's output.
  • In transaction processing systems, we use fan out to describe the number of requests to other services that we need to make in order to serve one incoming request.
  • E.g. Twitter's Scaling Problem - each user follows many people and each user is followed by many people.

    • Notifying each of a user's followers about every new tweet (and building each user's feed from everyone they follow) can lead to scaling problems as these numbers grow.
    • Twitter had two options:
      • Approach 1: Posting a tweet simply inserts the new tweet into a global collection of tweets.
        • When a user requests their home feed, look up all the people they follow, find all the tweets from each of those users, and merge them (sorted by time).
      • Approach 2: Maintain a cache for each user's home timeline -- e.g. a mailbox of tweets for each recipient user.
        • When a user posts a tweet, look up the people who follow that user and insert the new tweet into each of their home timeline caches (a personal mailbox per user).
        • The request to read the home timeline is now cheap, as it has already been pre-computed.
  • Twitter started with approach 1, but now mostly uses approach 2 for normal users and approach 1 for highly followed users (to avoid fanning tweets from celebrity accounts out to millions of personal mailboxes).

  • Twitter is thus moving to a hybrid of both the above approaches (a minimal sketch of both approaches follows below).
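
A minimal sketch of the two approaches using in-memory data structures; the names and data model here are illustrative assumptions, not Twitter's actual implementation:

```python
from collections import defaultdict

follows = defaultdict(set)          # user -> users they follow
followers = defaultdict(set)        # user -> users who follow them
global_tweets = []                  # approach 1: one global collection of (author, text)
home_timelines = defaultdict(list)  # approach 2: pre-computed mailbox per user

def follow(user, target):
    follows[user].add(target)
    followers[target].add(user)

# Approach 1: cheap write, expensive read (merge at read time).
def post_tweet_v1(author, text):
    global_tweets.append((author, text))

def home_timeline_v1(user):
    return [(a, t) for (a, t) in global_tweets if a in follows[user]]

# Approach 2: expensive write (fan-out to every follower), cheap read.
def post_tweet_v2(author, text):
    for follower in followers[author]:
        home_timelines[follower].append((author, text))

def home_timeline_v2(user):
    return home_timelines[user]
```

The trade-off is which side pays the fan-out cost: approach 1 makes writes cheap and reads expensive, approach 2 inverts that. The hybrid routes tweets from highly followed accounts through the approach-1 path and everyone else through the approach-2 path, merging the two at read time.
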
Throughput
  • The number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size.
  • In an ideal world, the running time of a batch job is the size of the dataset divided by the throughput (see the sketch below).
    • In practice, the running time is often higher, due to skew (uneven spreading of data), needing to wait for the slowest task to complete, and merge time for the end result.
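
A quick illustration of that ideal lower bound; the dataset size and throughput below are made-up numbers:

```python
# Ideal running time of a batch job = dataset size / throughput.
# This is a lower bound; skew, stragglers, and merging push the real number higher.
dataset_bytes = 10 * 10**12      # assumed 10 TB dataset
throughput = 100 * 10**6         # assumed 100 MB/s aggregate throughput

ideal_seconds = dataset_bytes / throughput
print(f"ideal running time ~ {ideal_seconds / 3600:.1f} hours")  # ~ 27.8 hours
```
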
Response Time
  • The time between a client sending a request and receiving a response.
  • Usually more important than throughput in the setting of online applications.
Latency v/s Response Time
  • Latency is the duration that a request is waiting to be handled -- during which it is latent, i.e. awaiting service.
  • Response time is what the client sees: besides the actual time to process the request (service time), it includes network delays and queueing delays (see the sketch below).
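
A small sketch to make the distinction concrete; the request shape and the arrived_at field are hypothetical, just for illustration:

```python
import time

def handle(request, process):
    """Server-side view of one request, assuming request["arrived_at"] is a
    time.monotonic() timestamp recorded when the request entered the queue."""
    started = time.monotonic()
    queueing_delay = started - request["arrived_at"]   # latency: time spent waiting, latent
    result = process(request)
    service_time = time.monotonic() - started          # time spent actually processing

    # What the client measures is the response time:
    #   response_time = network delays + queueing_delay + service_time
    return result, queueing_delay, service_time
```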

Maintainability

Keywords ( #todo add definition for the following)
  • Datastore
  • message queue