Foundations of Data Systems¶
Reliable, Scalable, and Maintainable Systems¶
The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. When was the last time a technology with a scale like that was so error-free? ~ Alan Kay
-
Applications today are data-intensive, as opposed to compute-intensive.
- CPU power is rarely a limiting factor
- The bigger problems are usually the amount of data, its complexity, and the speed at which it is changing.
-
We think about databases, queues, caches as different tools;
- A database and a message queue have a superficial, high-level similarity: both store data for some time.
- But they have very different access patterns, resulting in different performance characteristics and different implementations.
-
Tricky questions that come up when you are designing a data system or service:
- How do you ensure that the data remains correct and complete, even when things go wrong internally?
- How do you provide consistently good performance to clients, even when parts of the system are degraded?
- How do you scale to handle an increase in load?
- What does a good API for the service look like?
Reliability¶
The system should work correctly, even in the face of adversity.
-
For software, reliability roughly means the following:
- It performs its functions as per user expectations
- It tolerates the user making mistakes or using the software in unexpected ways.
- Its performance is good enough for the required use case, under the expected load and data volume.
- The system prevents any unauthorised access and abuse.
-
There are times when we may choose to sacrifice reliability to reduce development or operational cost, but we should be very conscious when cutting those corners.
Fault¶
- The things that can go wrong are called faults.
- A fault is different from a #Failure.
Fault - A component of the system deviating from its spec.
Human Errors¶
A study found that the majority of outages were caused by configuration errors made by human operators, while hardware faults played a role in only 10~25% of them.
How do we make our systems more reliable, in spite of unreliable humans?¶
- Designing systems with fewer opportunities for human error.
- Well-designed abstractions, APIs, and admin interfaces make the right thing easy and the wrong thing difficult to do.
- Decoupling the places where people tend to make mistakes from the places where those mistakes could cause failures.
- Provide fully-featured sandbox environments for people to experiment in.
- Test thoroughly at all levels, from unit tests to system wide integration tests.
- Ensure ways for quick & easy recovery from human errors.
- Rollbacks, tools to recompute data, etc.
- Set up clear and detailed monitoring of performance metrics and error rates.
- This gives us an early heads-up on what is happening and helps in understanding failures; a minimal error-rate check is sketched below.
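As a rough illustration of the error-rate part of such monitoring, here is a minimal sketch; the request-record shape, the 1% threshold, and the print-based alerting are assumptions for the example, not anything prescribed in these notes.

```python
# Minimal sketch: flag when the recent error rate crosses a threshold.
# The record shape and the 1% threshold are assumptions for illustration.
ERROR_RATE_THRESHOLD = 0.01

def error_rate(requests):
    """requests: list of dicts such as {"status": 200, "duration_ms": 12}."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if r["status"] >= 500)
    return errors / len(requests)

def check(requests):
    rate = error_rate(requests)
    if rate > ERROR_RATE_THRESHOLD:
        print(f"ALERT: error rate {rate:.2%} exceeds {ERROR_RATE_THRESHOLD:.2%}")
    else:
        print(f"OK: error rate {rate:.2%}")

check([{"status": 200}, {"status": 500}, {"status": 200}, {"status": 200}])
```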
Hardware Faults¶
Hard disks are reported as having a mean time to failure (MTTF) of 10~50 years. Thus, in a storage cluster with 10,000 disks, we should expect on average one disk to fail every day.
- The first response is to add redundancy to individual hardware components, e.g. RAM chips, hard disks, etc., to reduce the failure rate of the system.
- When one component dies, the redundant component can take its place while the broken one is replaced.
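The back-of-the-envelope arithmetic behind that expectation, assuming disk failures are independent and spread evenly over the MTTF (a simplification):

```python
# Expected disk failures per day in a 10,000-disk cluster, assuming failures
# are independent and spread evenly over the mean time to failure (MTTF).
disks = 10_000

for mttf_years in (10, 50):
    mttf_days = mttf_years * 365
    print(f"MTTF {mttf_years} years: ~{disks / mttf_days:.1f} failures/day")

# MTTF 10 years: ~2.7 failures/day
# MTTF 50 years: ~0.5 failures/day
# i.e. roughly one failed disk per day somewhere in that range.
```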
Multi-Machine Redundancy¶
- As data volumes and computing demands have grown, applications have started using larger numbers of machines, which proportionally increases the rate of hardware faults; this pushes systems towards tolerating the loss of entire machines, not just individual components.
- Multi-machine redundancy enables a hot-swap: if one node (e.g. a pod) goes down, a standby that was watching its health takes its place.
- A system that can tolerate machine failure can be patched one node at a time, without downtime of the entire system ( #rollingUpgrade); a sketch follows.
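A minimal sketch of that rolling-upgrade idea; `drain`, `patch`, and `is_healthy` are hypothetical placeholders for whatever the orchestration layer actually provides.

```python
import time

# Hypothetical node operations; a real system would call its orchestrator's API.
def drain(node): print(f"draining traffic from {node}")
def patch(node): print(f"patching {node}")
def is_healthy(node): return True  # stand-in health check

def rolling_upgrade(nodes):
    """Patch nodes one at a time so the system as a whole stays available."""
    for node in nodes:
        drain(node)                  # stop routing new requests to this node
        patch(node)                  # apply the security patch / new version
        while not is_healthy(node):  # wait until the node serves traffic again
            time.sleep(1)
        print(f"{node} upgraded, moving on")

rolling_upgrade(["node-1", "node-2", "node-3"])
```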
Software Faults¶
Cascading Failures¶
- Systematic errors are hard to anticipate, and since they are correlated across nodes, they cause many more system failures than uncorrelated hardware faults.
- A small fault in one component triggers a fault in another component triggering further faults.
Failure¶
The system as a whole stops providing the required service (value) to the user.
Fault Tolerance¶
- Systems that can anticipate faults and can cope with them are fault-tolerant or resilient.
- We can only tolerate certain kinds of faults; we can never make a system 100% fault-tolerant.
- What if Earth goes boom? You would need the budget to set up a server on a different planet.
- Counter-intuitively, it can make sense to increase the rate of faults by triggering them deliberately. Reference -> The Netflix Chaos Monkey (a toy sketch follows).
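A toy sketch of that idea: deliberately terminate a random instance so fault-handling paths are exercised regularly. The instance names and the `terminate` behaviour are invented for the example; a real Chaos Monkey would call the cloud provider's APIs.

```python
import random

# Toy chaos experiment: randomly pick one instance and "terminate" it.
instances = ["api-1", "api-2", "worker-1", "worker-2"]

def terminate(instance):
    print(f"chaos: terminating {instance}; the system should recover on its own")

def chaos_monkey(instances, kill_probability=0.1):
    if instances and random.random() < kill_probability:
        terminate(random.choice(instances))

chaos_monkey(instances, kill_probability=1.0)  # force a kill for the demo
```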
Scalability¶
The ability of a system to cope with increased load.
- Increased load is one common reason for software degradation.
- Discussing scalability doesn't mean saying "X is scalable" and "Y doesn't scale".
- It means considering questions such as "If the system grows in a particular way, what are our options for coping with the growth?" or "How can we add computing resources to handle the additional load?"
What is Load?¶
- A few load parameters (the best choice of parameters still depends on your system's architecture; a small example of computing them follows the list):
- Requests per second
- Ratio of reads to writes in DB
- Number of simultaneously active users
- Hit rate on a cache
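As a small illustration, these parameters can be derived from a request log; the log format and numbers below are made up for the example.

```python
from collections import Counter

# Hypothetical request log: (timestamp_seconds, user_id, kind, cache_hit)
log = [
    (0, "u1", "read", True), (0, "u2", "write", False),
    (1, "u1", "read", True), (1, "u3", "read", False),
]

duration_s = log[-1][0] - log[0][0] + 1
requests_per_second = len(log) / duration_s
kinds = Counter(kind for _, _, kind, _ in log)
read_write_ratio = kinds["read"] / max(kinds["write"], 1)
active_users = len({user for _, user, _, _ in log})
cache_hit_rate = sum(1 for *_, hit in log if hit) / len(log)

print(requests_per_second, read_write_ratio, active_users, cache_hit_rate)
```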
Fan Out¶
- Term borrowed from electrical engineering, where it describes the number of logic gate inputs that are attached to another gate's output.
- In transaction processing systems, we use fan-out to describe the number of requests to other services that we need to make in order to serve one incoming request.
-
E.g. Twitter's scaling problem: each user follows many people, and each user is followed by many people.
- Delivering each new tweet to every follower of its author can lead to scaling problems as these numbers grow.
- Twitter had two options
- Approach 1: Posting a tweet simply inserts the new tweet into a global collection of tweets.
- When a user requests their home feed, look up all the people they follow, find all the tweets from each of those users, and merge them (sorted by time).
- Approach 2: Maintain a cache for each user's home timeline, e.g. a mailbox of tweets for each recipient user.
- When a user posts a tweet, look up the people who follow that user and insert the new tweet into each of their home timeline caches (a personal mailbox per user).
- The request to read the home timeline is now cheap, as it has already been pre-computed.
-
Twitter started with approach 1, but now mostly uses approach 2 for regular users and approach 1 for highly followed accounts (to avoid posting to millions of personal mailboxes when a celebrity tweets).
- Twitter has thus moved to a hybrid of the two approaches (see the sketch below).
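A minimal sketch contrasting the two approaches, using in-memory dicts as stand-ins for the global tweet collection and the per-user timeline caches; names such as `followers`, `post_tweet_fanout`, and `home_timeline_on_read` are illustrative, not Twitter's actual API.

```python
from collections import defaultdict

followers = {"alice": ["bob", "carol"]}          # author -> people who follow them
follows = defaultdict(list)                      # user -> people they follow
for author, fans in followers.items():
    for fan in fans:
        follows[fan].append(author)

# Approach 1: fan-out on read. Store every tweet in one global collection and
# merge the relevant ones when a user requests their home timeline.
global_tweets = []                               # list of (timestamp, author, text)

def post_tweet_global(ts, author, text):
    global_tweets.append((ts, author, text))

def home_timeline_on_read(user):
    return sorted(t for t in global_tweets if t[1] in follows[user])

# Approach 2: fan-out on write. Push each new tweet into the cached home
# timeline (personal mailbox) of every follower at posting time.
timeline_cache = defaultdict(list)               # user -> precomputed home timeline

def post_tweet_fanout(ts, author, text):
    for follower in followers.get(author, []):
        timeline_cache[follower].append((ts, author, text))

post_tweet_global(1, "alice", "hello")
post_tweet_fanout(1, "alice", "hello")
print(home_timeline_on_read("bob"))              # merged at read time
print(timeline_cache["bob"])                     # already precomputed at write time
```

In the hybrid scheme, tweets from very highly followed accounts would be left out of the write-time fan-out and merged in at read time instead.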
Throughput¶
- The number of records we can process per second, or
- the total time it takes to run a job on a dataset of a certain size.
- In an ideal world, the running time of a batch job is the size of the dataset divided by the throughput.
- In practice, the running time is often higher, due to skew (data not spread evenly), the need to wait for the slowest task to complete, and the time to merge the end result (a worked example follows).
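A toy worked example of that relationship; all the numbers are invented for illustration.

```python
# Ideal batch running time = dataset size / throughput.
dataset_gb = 500
throughput_gb_per_hour = 100
ideal_hours = dataset_gb / throughput_gb_per_hour   # 5.0 hours

# In practice, skew and waiting for the slowest task push the real time higher.
slowest_task_factor = 1.4   # assumed: the slowest task takes 40% longer
actual_hours = ideal_hours * slowest_task_factor
print(ideal_hours, actual_hours)                    # 5.0 7.0
```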
Response Time¶
- The time between a client sending a request and receiving a response.
- Usually more important than throughput in an online, interactive setting.
Latency v/s Response Time¶
- Latency is the duration that a request is waiting to be handled -- during which it is latent, i.e. awaiting service.
- Response time is what the client sees: besides the actual time to process the request (service time), it includes network delays and queueing delays (a small sketch follows).
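A small sketch of that decomposition; the individual delay numbers are invented for illustration.

```python
# Response time as seen by the client = network delays + queueing delay
# (the request's latency) + actual service time. Numbers are made up.
network_delay_ms = 5 + 5        # request + response on the wire
queueing_delay_ms = 30          # time the request sat waiting to be handled
service_time_ms = 20            # time actually spent processing the request

response_time_ms = network_delay_ms + queueing_delay_ms + service_time_ms
print(response_time_ms)         # 60
```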
Maintainability¶
Keywords ( #todo add definition for the following)¶
- Datastore
- message queue