
Foundations of Data Systems

Reliable, Scalable, and Maintainable Systems

The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. When was the last time a technology with a scale like that was so error-free? ~ Alan Kay

  • Applications today are data-intensive, as opposed to compute-intensive.

    • CPU power is rarely a limiting factor.
    • The bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing.
  • We think of databases, queues, and caches as different tools:

    • A database and a message queue have a superficial similarity - both store data for some time.
    • But they have very different access patterns, resulting in different performance characteristics and different implementations.
  • Tricky questions that come up when you are designing a data system or service:

    • How do you ensure that the data remains correct and complete, even when things go wrong internally?
    • How do you provide consistently good performance to clients, even when parts of the system are degraded?
    • How do you scale to handle an increase in load?
    • What does a good API for the service look like?

Reliability

The system should work correctly, even in the face of adversity.

  • For software, reliability roughly means the following:

    • It performs functions as per user expectations
    • It tolerates the user making mistakes or using the software in unexpected ways.
    • Its performance is good enough for the required use case, under the expected load and data volume.
    • The system prevents any unauthorised access and abuse.
  • There might be times when we choose to sacrifice reliability in order to reduce development or operational cost, but be very conscious when cutting those corners.

Fault

  • The things that can go wrong are faults.
  • A fault is different from a #Failure.

    Fault - A component of the system deviating from its spec

Human Errors

A study found that the majority of outages were caused by configuration errors made by human operators, and only 10~25% were hardware faults.

How do we make our systems more reliable, in spite of unreliable humans?
  • Designing systems with fewer opportunities for human error.
    • Well-designed abstractions, APIs, and admin interfaces make the right thing easy and the wrong thing difficult to do.
  • Decoupling the places where people tend to make mistakes from places where these mistakes could cause failures.
    • Provide fully-featured sandbox environments for people to experiment in.
  • Test thoroughly at all levels, from unit tests to system wide integration tests.
  • Ensure ways for quick & easy recovery from human errors.
    • Rollbacks, tools to recompute data, etc.
  • Set up clear and detailed monitoring, for performance metrics and error rates.
    • This can give us a heads-up on what is happening and helps us understand failures.
Hardware Faults

Hard disks are reported as having a mean time to failure (MTTF) of 10~50 years. Thus, in a storage cluster with 10,000 disks, we should expect on average one disk to fail every day (a rough back-of-the-envelope check is sketched below).
  • The first response is to add redundancy to individual hardware components, e.g. RAM chips, hard disks, etc., to reduce the failure rate of the system.
  • When one component dies, the redundant component can take its place while the broken one is being replaced.
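
A rough back-of-the-envelope check of that claim, assuming independent failures and taking a midpoint of the quoted MTTF range (the numbers below are assumptions, purely for illustration):

```python
# Back-of-the-envelope: expected disk failures per day in a large cluster.
# Assumes independent failures and an MTTF of 30 years (roughly the midpoint of 10~50).
disks = 10_000
mttf_years = 30
mttf_days = mttf_years * 365

failures_per_day = disks / mttf_days
print(f"expected failures per day ~ {failures_per_day:.2f}")  # ~ 0.91, i.e. roughly one a day
```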

Multi-Machine Redundancy
  • With increasing data volumes and computing demands, applications use larger numbers of machines, which proportionally increases the rate of hardware faults. This pushes more applications towards multi-machine redundancy to stay fault tolerant.
    • These setups provide a way to perform a hot-swap: if one pod goes down, it is replaced by another one that was watching the first pod's health.
  • A system that can tolerate machine failure can be patched one node at a time, without downtime of the entire system ( #rollingUpgrade)
Software Faults
Cascading Failures
  • Systematic errors are hard to anticipate, and because they are correlated across nodes, they cause many more system failures than uncorrelated hardware faults.
  • A small fault in one component triggers a fault in another component triggering further faults.

Failure

The System as a whole stops providing value (required service) to the user.

Fault Tolerance

  • Systems that can anticipate faults and can cope with them are fault-tolerant or resilient.
  • We can only tolerate certain kinds of faults; we can never make a system 100% fault tolerant.
    • What if Earth goes boom? You would need the budget to set up a server on a different planet.
  • Counter-intuitively, it can make sense to increase the rate of faults by triggering them deliberately. Reference -> The Netflix Chaos Monkey

Scalability

The ability of a system to cope with increased load.
  • Increased load is one common reason for software degradation.
  • Discussing scalability doesn't mean saying "X is scalable" or "Y doesn't scale".
  • It means considering questions such as "If the system grows in a particular way, what are our options for coping with the growth?" or "How can we add computing resources to handle the additional load?"

What is Load?
  • A few load parameters (PS. the best choice of parameters still depends on your system architecture); a sketch of computing some of these from a request log follows after this list:
    • Requests per second
    • Ratio of reads to writes in DB
    • Number of simultaneously active users
    • Hit rate on a cache
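
A minimal sketch of deriving a few of these load parameters from a hypothetical in-memory request log (the log format and field order are assumptions, purely for illustration):

```python
from collections import Counter

# Hypothetical request log: (timestamp_seconds, user_id, operation, cache_hit)
log = [
    (0.0, "u1", "read", True),
    (0.4, "u2", "write", False),
    (0.9, "u1", "read", True),
    (1.2, "u3", "read", False),
]

window = (log[-1][0] - log[0][0]) or 1.0          # observation window in seconds
requests_per_second = len(log) / window

ops = Counter(op for _, _, op, _ in log)
read_write_ratio = ops["read"] / max(ops["write"], 1)

read_hits = [hit for _, _, op, hit in log if op == "read"]
cache_hit_rate = sum(read_hits) / len(read_hits)

active_users = len({user for _, user, _, _ in log})

print(requests_per_second, read_write_ratio, cache_hit_rate, active_users)
```
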
Fan Out
  • A term borrowed from electrical engineering, where it describes the number of logic-gate inputs that are attached to another gate's output.
  • In transaction processing systems, we use fan out to describe the number of requests to other services that we need to make in order to serve one incoming request.
  • E.g. Twitter's Scaling Problem - each user follows many people and each user is followed by many people.

    • Notifying each of a user's followers about every new tweet (and building each user's feed from everyone they follow) can lead to scaling problems as these numbers grow.
    • Twitter had two options:
      • Approach 1: Posting a tweet simply inserts the new tweet into a global collection of tweets.
        • When a user requests their home feed, look up all the people they follow, find all the tweets from each of those users, and merge them (sorted by time).
      • Approach 2: Maintain a cache for each user's home timeline -- e.g. a mailbox of tweets for each recipient user.
        • When a user posts a tweet, look up the people who follow that user and insert the new tweet into each of their home timeline caches (a personal mailbox per user).
        • The request to read the home timeline is now cheap, as it has already been pre-computed.
  • Twitter started with approach 1, but now mostly uses approach 2 for normal users and approach 1 for highly followed users (to avoid fanning tweets from celebrity accounts out to millions of personal mailboxes).

  • Twitter is thus moving to a hybrid of both the above approaches (a minimal sketch of both approaches follows below).
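
A minimal sketch of the two approaches using in-memory data structures; the names and data model here are illustrative assumptions, not Twitter's actual implementation:

```python
from collections import defaultdict

follows = defaultdict(set)          # user -> users they follow
followers = defaultdict(set)        # user -> users who follow them
global_tweets = []                  # approach 1: one global collection of (author, text)
home_timelines = defaultdict(list)  # approach 2: pre-computed mailbox per user

def follow(user, target):
    follows[user].add(target)
    followers[target].add(user)

# Approach 1: cheap write, expensive read (merge at read time).
def post_tweet_v1(author, text):
    global_tweets.append((author, text))

def home_timeline_v1(user):
    return [(a, t) for (a, t) in global_tweets if a in follows[user]]

# Approach 2: expensive write (fan-out to every follower), cheap read.
def post_tweet_v2(author, text):
    for follower in followers[author]:
        home_timelines[follower].append((author, text))

def home_timeline_v2(user):
    return home_timelines[user]
```

The trade-off is which side pays the fan-out cost: approach 1 makes writes cheap and reads expensive, approach 2 inverts that. The hybrid routes tweets from highly followed accounts through the approach-1 path and everyone else through the approach-2 path, merging the two at read time.
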
Throughput
  • The number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size.
  • In an ideal world, the running time of a batch job is the size of the dataset divided by the throughput (see the sketch below).
    • In practice, the running time is often higher, due to skew (uneven spreading of data), needing to wait for the slowest task to complete, and merge time for the end result.
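
A quick illustration of that ideal lower bound; the dataset size and throughput below are made-up numbers:

```python
# Ideal running time of a batch job = dataset size / throughput.
# This is a lower bound; skew, stragglers, and merging push the real number higher.
dataset_bytes = 10 * 10**12      # assumed 10 TB dataset
throughput = 100 * 10**6         # assumed 100 MB/s aggregate throughput

ideal_seconds = dataset_bytes / throughput
print(f"ideal running time ~ {ideal_seconds / 3600:.1f} hours")  # ~ 27.8 hours
```
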
Response Time
  • The time between a client sending a request and receiving a response.
  • Usually more important than throughput in the setting of online applications.
Latency v/s Response Time
  • Latency is the duration that a request is waiting to be handled -- during which it is latent, i.e. awaiting service.
  • Response time is what the client sees: besides the actual time to process the request (service time), it includes network delays and queueing delays (see the sketch below).
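
A small sketch to make the distinction concrete; the request shape and the arrived_at field are hypothetical, just for illustration:

```python
import time

def handle(request, process):
    """Server-side view of one request, assuming request["arrived_at"] is a
    time.monotonic() timestamp recorded when the request entered the queue."""
    started = time.monotonic()
    queueing_delay = started - request["arrived_at"]   # latency: time spent waiting, latent
    result = process(request)
    service_time = time.monotonic() - started          # time spent actually processing

    # What the client measures is the response time:
    #   response_time = network delays + queueing_delay + service_time
    return result, queueing_delay, service_time
```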

Maintainability

Keywords ( #todo add definition for the following)
  • Datastore
  • message queue