Designing Data-Intensive Applications - Part I
References
- Book "Designing Data-Intensive Applications"
- ZH Ver. :《 数据密集型应用系统设计 》
Reliability
- Tolerating hardware & software faults, human errors
- The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity.
Scalability
- Measuring load & performance : latency percentiles, throughput
- As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.
Maintainability
- Operability, simplicity & evolvability
- Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively.
Application Types
- Data-intensive
- Compute-intensive
Data Systems
- Database
- Store data so that they, or another application, can find it again later
- Cache
- Remember the result of an expensive operation, to speed up reads
- Search Index
- Allow users to search data by keyword or filter it in various ways
- Stream processing
- Send a message to another process, to be handled asynchronously
- Batch processing
- Periodically crunch a large amount of accumulated data
Others
- Full-text search server
- e.g.: Elasticsearch / Solr ( both built on Lucene )
- Rolling upgrade
Expectations
- Continuing to work correctly, even when things go wrong.
Concepts
- Faults
- The things that can go wrong
- Fault-tolerant or Resilient
- Systems that anticipate faults and can cope with them
Fault / Failure
- Differences
- Fault : one component of the system deviating from its spec
- Failure : the system as a whole stops providing the required service to the user
- Targets
- It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures.
- Build reliable systems from unreliable parts.
- Load Parameters
- Web server : requests per second ( RPS ) / queries per second ( QPS )
- Database : ratio of reads to writes
- Chat room : number of simultaneously active users
- Cache : hit rate
- Example : Twitter
- Main operations
- Post tweet : avg rps 4.6k, peak rps 12k ( Nov 2012 )
- Home timeline : avg rps 300k
- Challenge : fan-out
- Implementation
- Pull
- Push
- Push & pull combined
- omitted here … ( important : see the worked example in the book )
Look at it in two ways:
- When you increase a load parameter and keep the system resources (CPU, memory, network bandwidth, etc.) unchanged, how is the performance of your system affected?
- When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?
Performance Metrics
- Throughput : the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size
- Response time : the time between a client sending a request and receiving a response
Latency vs. Response Time
- Response time = service time + network delays + queueing delays
- Service time : the actual time to process the request
- Latency : the duration that a request is waiting to be handled, during which it is latent, awaiting service
Random Additional Latency ( the response time varies from one request to the next, due to many factors )
- a context switch to a background process
- the loss of a network packet and TCP retransmission
- a garbage collection pause
- a page fault forcing a read from disk
- mechanical vibrations in the server rack
- …
Response Time
- The mean is not a very good metric if you want to know your "typical" response time,
- because it doesn't tell you how many users actually experienced that delay.
- Usually it is better to use percentiles ( 百分位数 ) .
- Median response time : half your requests return in less than the median, and half your requests take longer than that.
- This makes the median a good metric if you want to know how long users typically have to wait…
- The median is also known as the 50th percentile, and sometimes abbreviated as p50.
- * Strictly speaking, the term "average" doesn't refer to any particular formula,
- but in practice it is usually understood as the arithmetic mean: given n values, add up all the values, and divide by n.
Percentiles
- Response time thresholds
- p95 : e.g., if the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more.
- p99 / p999 / etc.
- median = p50
- High percentiles of response times, also known as tail latencies, are important because they directly affect users' experience of the service.
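To make the definitions concrete, here is a minimal Python sketch ( not from the book ) that computes p50 / p95 / p99 from a list of response times using the nearest-rank method; the sample values are made up for illustration.

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * p // 100))  # ceil(n * p / 100), 1-based
    return ordered[rank - 1]

response_times_ms = [32, 45, 47, 51, 63, 88, 95, 120, 310, 1480]
for p in (50, 95, 99):
    print(f"p{p} = {percentile(response_times_ms, p)} ms")
# p50 = 63 ms, p95 = 1480 ms, p99 = 1480 ms: one slow outlier dominates the tail
```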
Service Level Objectives ( SLOs ) and Service Level Agreements ( SLAs )
- Percentiles are often used in service level objectives (SLOs) and service level agreements (SLAs), contracts that define the expected performance and availability of a service.
- e.g.: An SLA may state that the service is considered to be up if it has a median response time of less than 200 ms and a 99th percentile under 1 s ( if the response time is longer, it might as well be down ), and the service may be required to be up at least 99.9% of the time.
Queueing delays
- Queueing delays often account for a large part of the response time at high percentiles.
- As a server can only process a small number of things in parallel ( limited, for example, by its number of CPU cores ),
- it only takes a small number of slow requests to hold up the processing of subsequent requests -- an effect sometimes known as head-of-line blocking.
- Even if those subsequent requests are fast to process on the server, the client will see a slow overall response time due to the time waiting for the prior request to complete.
- Due to this effect, it is important to measure response times on the client side.
Percentiles in Practice
- Even if you make the calls in parallel, the end-user request still needs to wait for the slowest of the parallel calls to complete.
- Even if only a small percentage of backend calls are slow, the chance of getting a slow call increases if an end-user request requires multiple backend calls, and so a higher proportion of end-user requests end up being slow ( an effect known as tail latency amplification ).
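A back-of-the-envelope illustration ( not from the book; the 1% figure is an assumption ) of why fan-out amplifies the tail : if each backend call is slow with probability p, a request that must wait for n such calls is slow with probability 1 - (1 - p)^n.

```python
p = 0.01  # assume 1% of individual backend calls are slow
for n in (1, 10, 100):
    slow = 1 - (1 - p) ** n  # chance that at least one of the n calls is slow
    print(f"{n:>3} backend calls per user request -> {slow:.1%} of requests slow")
# 1 -> 1.0%, 10 -> 9.6%, 100 -> 63.4%
```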
Approaches for Coping with Load
Scale Up & Scale Out
- An architecture that is appropriate for one level of load is unlikely to cope with 10 times that load.
- People often talk of a dichotomy between
- scaling up ( vertical scaling : moving to a more powerful machine ) and
- scaling out ( horizontal scaling : distributing the load across multiple smaller machines ).
- Distributing load across multiple machines is also known as a shared-nothing architecture.
- A system that can run on a single machine is often simpler, but high-end machines can become very expensive, so very intensive workloads often can't avoid scaling out.
- In reality, good architectures usually involve a pragmatic mixture of approaches: for example, using several fairly powerful machines can still be simpler and cheaper than a large number of small virtual machines.
Elastic
- Some systems are elastic, meaning that they can automatically add computing resources when they detect a load increase,
- whereas other systems are scaled manually (a human analyzes the capacity and decides to add more machines to the system).
- An elastic system can be useful if load is highly unpredictable, but manually scaled systems are simpler and may have fewer operational surprises.
Others
- Ease of use
- In an early-stage startup or an unproven product it's usually more important to be able to iterate quickly on product features than it is to scale to some hypothetical future load.
- It is well known that the majority of the cost of software is not in its initial development, but in its ongoing maintenance :
- fixing bugs,
- keeping its systems operational,
- investigating failures,
- adapting it to new platforms,
- modifying it for new use cases,
- repaying technical debt,
- and adding new features.
- We can and should design software in such a way that it will hopefully minimize pain during maintenance, and thus avoid creating legacy software ourselves.
- To this end, we will pay particular attention to three design principles for software systems…
Operability : Making Life Easy for Operations
- Make it easy for operations teams to keep the system running smoothly.
Simplicity : Managing Complexity
- Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system.
- ( Note this is not the same as simplicity of the user interface. )
Evolvability : Making Change Easy
- Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change.
- aka. extensibility, modifiability, or plasticity.
Nonfunctional Requirements
- security
- reliability
- compliance
- scalability
- compatibility
- maintainability
- Relational Model
- Document Model
- NoSQL : Not Only SQL
- polyglot persistence
- use both relational and nonrelational datastores
Driving forces behind the adoption of NoSQL databases
- A need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput
- A widespread preference for free and open source software over commercial database products
- Specialized query operations that are not well supported by the relational model
- Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model
The Object-Relational Mismatch
- ORM - Object-Relational Mapping
Schema flexibility in the document model
- Most document databases, and the JSON support in relational databases, do not enforce any schema on the data in documents.
- XML support in relational databases usually comes with optional schema validation.
- No schema means that arbitrary keys and values can be added to a document,
- and when reading, clients have no guarantees as to what fields the documents may contain.
Many-to-One and Many-to-Many Relationships
- If your application has mostly one-to-many relationships (tree-structured data) or no relationships between records, the document model is appropriate.
- The relational model can handle simple cases of many-to-many relationships, …
Data locality for queries
- A document is usually stored as a single continuous string, encoded as JSON, XML, or a binary variant thereof (such as MongoDB's BSON).
- If your application often needs to access the entire document (for example, to render it on a web page), there is a performance advantage to this storage locality.
- If data is split across multiple tables, multiple index lookups are required to retrieve it all, which may require more disk seeks and take more time.
Query Languages for Data
- Imperative : IMS, CODASYL ( Conference on Data Systems Languages )
- Tells the computer to perform certain operations in a certain order.
- You can imagine stepping through the code line by line, evaluating conditions, updating variables, and deciding whether to go around the loop one more time.
- Declarative : SQL, CSS, XSL ( XPath expressions )
- Just specify the pattern of the data you want -- what conditions the results must meet, and how you want the data to be transformed (e.g., sorted, grouped, and aggregated) -- but not how to achieve that goal.
- It is up to the database system's query optimizer to decide which indexes and which join methods to use, and in which order to execute various parts of the query.
- It is attractive because it is typically more concise and easier to work with than an imperative API.
- But more importantly, it also hides implementation details of the database engine, which makes it possible for the database system to introduce performance improvements without requiring any changes to queries.
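As a minimal illustration ( adapted from the book's sharks example, which uses JavaScript; rendered here in Python, with the equivalent declarative SQL in a comment ) :

```python
animals = [
    {"name": "Great white", "family": "Sharks"},
    {"name": "Dolphin",     "family": "Mammals"},
    {"name": "Hammerhead",  "family": "Sharks"},
]

# Imperative : spell out *how* to get the result : loop, test, accumulate.
def get_sharks(records):
    sharks = []
    for animal in records:
        if animal["family"] == "Sharks":
            sharks.append(animal)
    return sharks

# Declarative : state *what* you want; the query optimizer decides how.
#   SELECT * FROM animals WHERE family = 'Sharks';
print(get_sharks(animals))
```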
MapReduce Querying
- MapReduce is a programming model for processing large amounts of data in bulk across many machines, popularized by Google
- MapReduce is neither a declarative query language nor a fully imperative query API, but somewhere in between:
- the logic of the query is expressed with snippets of code, which are called repeatedly by the processing framework.
- It is based on the map ( aka. collect ) and reduce ( aka. fold or inject ) functions that exist in many functional programming languages.
- MapReduce is a fairly low-level programming model for distributed execution on a cluster of machines.
- Higher-level query languages like SQL can be implemented as a pipeline of MapReduce operations, but there are also many distributed implementations of SQL that don't use MapReduce.
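A sketch of the model's shape in plain Python ( illustration only : real MapReduce runs the map and reduce functions distributed across many machines, and the book's example uses MongoDB's JavaScript API ). Here we total the animals observed per month :

```python
from collections import defaultdict
from functools import reduce

observations = [
    {"month": "1995-12", "num_animals": 3},
    {"month": "1995-12", "num_animals": 4},
    {"month": "1996-01", "num_animals": 1},
]

# map : emit a (key, value) pair for every record
mapped = [(obs["month"], obs["num_animals"]) for obs in observations]

# shuffle : the framework groups all values that share a key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# reduce : fold each group's values down to one result per key
totals = {key: reduce(lambda a, b: a + b, values)
          for key, values in groups.items()}
print(totals)  # {'1995-12': 7, '1996-01': 1}
```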
Graph-Like Data Models
- The relational model can handle simple cases of many-to-many relationships,
- but as the connections within your data become more complex, it becomes more natural to start modeling your data as a graph.
- A graph consists of two kinds of objects: vertices ( aka. nodes or entities ) and edges ( aka. relationships or arcs ).
Property Graphs
- In the property graph model, each vertex consists of:
- A unique identifier
- A set of outgoing edges
- A set of incoming edges
- A collection of properties (key-value pairs)
- Each edge consists of:
- A unique identifier
- The vertex at which the edge starts (the tail vertex)
- The vertex at which the edge ends (the head vertex)
- A label to describe the kind of relationship between the two vertices
- A collection of properties (key-value pairs)
- Some important aspects of this model are:
- 1. Any vertex can have an edge connecting it with any other vertex. There is no schema that restricts which kinds of things can or cannot be associated.
- 2. Given any vertex, you can efficiently find both its incoming and its outgoing edges, and thus traverse the graph -- i.e., follow a path through a chain of vertices -- both forward and backward. (That's why Example 2-2 has indexes on both the tail_vertex and head_vertex columns.)
- 3. By using different labels for different kinds of relationships, you can store several different kinds of information in a single graph, while still maintaining a clean data model.
- The Cypher Query Language
- Cypher is a declarative query language for property graphs, created for the Neo4j graph database.
- Graph Queries in SQL
- ( Graph queries can be expressed in a relational database, but the SQL is much longer and harder to understand; this suggests the relational model is an awkward fit for graph-shaped data )
Triple-Stores and SPARQL
- The triple-store model is mostly equivalent to the property graph model, using different words to describe the same ideas.
- It is nevertheless worth discussing, because there are various tools and languages for triple-stores that can be valuable additions to your toolbox for building applications.
- Semantic web
- The semantic web is fundamentally a simple and reasonable idea :
- websites already publish information as text and pictures for humans to read, so why don't they also publish information as machine-readable data for computers to read?
- RDF data model
- RDF : Resource Description Framework
- SPARQL query language
- SPARQL is a query language for triple-stores using the RDF data model.
- ( It is an acronym for SPARQL Protocol and RDF Query Language, pronounced "sparkle." )
- It predates Cypher, and since Cypher's pattern matching is borrowed from SPARQL, they look quite similar.
The Foundation: Datalog
- omitted …
Storage and Retrieval
OLTP - Two families of storage engines :
- log-structured storage engines
- page-oriented storage engines ( such as B-trees )
Log
- The word log is often used to refer to application logs, where an application outputs text that describes what's happening.
- In this book, log is used in the more general sense: an append-only sequence of records.
- It doesn't have to be human-readable; it might be binary and intended only for other programs to read.
Possible Indexing Strategy
- Let's say our data storage consists only of appending to a file.
- Then the simplest possible indexing strategy is this: keep an in-memory hash map where every key is mapped to a byte offset in the data file -- the location at which the value can be found.
- Whenever you append a new key-value pair to the file, you also update the hash map to reflect the offset of the data you just wrote (this works both for inserting new keys and for updating existing keys).
- When you want to look up a value, use the hash map to find the offset in the data file, seek to that location, and read the value.
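A minimal sketch of this strategy in Python ( illustration only, in the spirit of Bitcask : one CSV-like data file, no segments, no compaction, no crash recovery; the class name and file path are made up ) :

```python
import os

class HashIndexedLog:
    """Append-only data file plus an in-memory hash map of
    key -> byte offset of the most recent record for that key."""

    def __init__(self, path):
        self.f = open(path, "a+b")  # append-only writes, random-access reads
        self.index = {}             # key -> offset in the data file

    def put(self, key: str, value: str) -> None:
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(f"{key},{value}\n".encode())
        self.f.flush()
        self.index[key] = offset    # newest write wins

    def get(self, key: str) -> str:
        self.f.seek(self.index[key])  # raises KeyError if the key is absent
        record = self.f.readline().decode().rstrip("\n")
        _, value = record.split(",", 1)
        return value

db = HashIndexedLog("/tmp/kv.log")
db.put("42", "{'name':'San Francisco'}")
db.put("42", "{'name':'SF, CA'}")   # an update appends; the index offset moves
print(db.get("42"))                 # -> {'name':'SF, CA'}
```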
Disk Space
- … we only ever append to a file -- so how do we avoid eventually running out of disk space?
- A good solution is to break the log into segments of a certain size by closing a segment file when it reaches a certain size, and making subsequent writes to a new segment file.
Compaction
- … since compaction often makes segments much smaller (assuming that a key is overwritten several times on average within one segment), we can also merge several segments together at the same time as performing the compaction.
- Segments are never modified after they have been written, so the merged segment is written to a new file.
- The merging and compaction of frozen segments can be done in a background thread, and while it is going on, we can still continue to serve read and write requests as normal, using the old segment files.
- After the merging process is complete, we switch read requests to using the new merged segment instead of the old segments -- and then the old segment files can simply be deleted.
Each segment has its own in-memory hash table, mapping keys to file offsets.
- In order to find the value for a key, we first check the most recent segment's hash map; if the key is not present we check the second-most-recent segment, and so on.
- The merging process keeps the number of segments small, so lookups don't need to check many hash maps.
Some of the issues that are important in a real implementation are :
- File format : CSV is not the best format for a log. It's faster and simpler to use a binary format that first encodes the length of a string in bytes, followed by the raw string (without need for escaping).
- Deleting records : If you want to delete a key and its associated value, you have to append a special deletion record to the data file (sometimes called a tombstone). When log segments are merged, the tombstone tells the merging process to discard any previous values for the deleted key.
- Crash recovery : If the database is restarted, the in-memory hash maps are lost. In principle, you can restore each segment's hash map by reading the entire segment file from beginning to end and noting the offset of the most recent value for every key as you go along. However, that might take a long time if the segment files are large, which would make server restarts painful. Bitcask speeds up recovery by storing a snapshot of each segment's hash map on disk, which can be loaded into memory more quickly.
- Partially written records : The database may crash at any time, including halfway through appending a record to the log. Bitcask files include checksums, allowing such corrupted parts of the log to be detected and ignored.
- Concurrency control : As writes are appended to the log in a strictly sequential order, a common implementation choice is to have only one writer thread. Data file segments are append-only and otherwise immutable, so they can be read concurrently by multiple threads.
Why not update the file in place, overwriting the old value with the new one? An append-only design turns out to be good for several reasons :
- Appending and segment merging are sequential write operations, which are generally much faster than random writes, especially on magnetic spinning-disk hard drives.
- Concurrency and crash recovery are much simpler if segment files are append-only or immutable.
- For example, you don't have to worry about the case where a crash happened while a value was being overwritten, leaving you with a file containing part of the old and part of the new value spliced together.
- Merging old segments avoids the problem of data files getting fragmented over time.
Limitations of Hash Table Index
- The hash table must fit in memory, so if you have a very large number of keys, you're out of luck.
- In principle, you could maintain a hash map on disk, but unfortunately it is difficult to make an on-disk hash map perform well.
- It requires a lot of random access I/O, it is expensive to grow when it becomes full, and hash collisions require fiddly logic.
- Range queries are not efficient : for example, you cannot easily scan over all keys between kitty00000 and kitty99999 without looking up each key individually.
SSTable : Sorted String Table
- We also require that each key only appears once within each merged segment file ( the compaction process already ensures that ).
SSTables have several big advantages over log segments with hash indexes :
- Merging segments is simple and efficient, even if the files are bigger than the available memory.
- ( The approach is like the one used in the mergesort algorithm : read the input segments side by side and merge them into a sorted output segment )
- In order to find a particular key in the file, you no longer need to keep an index of all the keys in memory.
- ( Because the keys are sorted, a sparse in-memory index with one key every few kilobytes is enough; sorted order also makes range queries efficient )
- Since read requests need to scan over several key-value pairs in the requested range anyway, it is possible to group those records into a block and compress it before writing it to disk.
- Each entry of the sparse in-memory index then points at the start of a compressed block.
- Besides saving disk space, compression also reduces the I/O bandwidth use.
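A Python sketch of the mergesort-style segment merge ( illustration only; the example keys echo the book's figures, and passing segments newest-first implements the newest-value-wins rule ) :

```python
import heapq

def merge_segments(*segments):
    """Merge sorted (key, value) segments into one sorted segment.
    Pass segments newest first: for duplicate keys, the value from
    the newest segment wins, as in SSTable compaction."""
    # Tag every entry with its segment's age so that, for equal keys,
    # heapq.merge yields the entry from the newest segment first.
    tagged = [
        [(key, age, value) for key, value in seg]
        for age, seg in enumerate(segments)
    ]
    merged, last_key = [], object()  # sentinel never equals a real key
    for key, age, value in heapq.merge(*tagged):
        if key != last_key:          # first occurrence is from the newest segment
            merged.append((key, value))
            last_key = key
    return merged

newer = [("handbag", 8786), ("handprinted", 3993)]
older = [("handcuffs", 2729), ("handprinted", 452)]
print(merge_segments(newer, older))
# [('handbag', 8786), ('handcuffs', 2729), ('handprinted', 3993)]
```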
Making an LSM-tree out of SSTables
- Lucene, an indexing engine for full-text search used by Elasticsearch and Solr, uses a similar method for storing its term dictionary.
- A full-text index is much more complex than a key-value index but is based on a similar idea : given a word in a search query, find all the documents (web pages, product descriptions, etc.) that mention the word.
- This is implemented with a key-value structure where the key is a word (a term) and the value is the list of IDs of all the documents that contain the word (the postings list).
- In Lucene, this mapping from term to postings list is kept in SSTable-like sorted files, which are merged in the background as needed.
Performance optimizations
- The LSM-tree algorithm can be slow when looking up keys that do not exist in the database : you have to check the memtable, then the segments all the way back to the oldest ( possibly having to read from disk for each one ) before you can be sure that the key does not exist.
- In order to optimize this kind of access, storage engines often use additional Bloom filters ( a toy sketch follows at the end of this list ).
- ( A Bloom filter is a memory-efficient data structure for approximating the contents of a set. It can tell you if a key does not appear in the database, and thus saves many unnecessary disk reads for nonexistent keys. )
- There are also different strategies to determine the order and timing of how SSTables are compacted and merged.
- The most common options are size-tiered and leveled compaction.
- LevelDB and RocksDB use leveled compaction ( hence the name of LevelDB ), HBase uses size-tiered, and Cassandra supports both.
- In size-tiered compaction, newer and smaller SSTables are successively merged into older and larger SSTables.
- In leveled compaction, the key range is split up into smaller SSTables and older data is moved into separate "levels", which allows the compaction to proceed more incrementally and use less disk space.
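A toy Bloom filter in Python ( a minimal sketch, not how production engines implement it; the bit-array size and hashing scheme here are arbitrary ) :

```python
import hashlib

class BloomFilter:
    """k hash functions set bits in a bit array. Lookups may return
    false positives, but never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = 0  # a big int used as a bit array

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("handbag")
print(bf.might_contain("handbag"))  # True
print(bf.might_contain("missing"))  # almost certainly False: skip the disk reads
```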
B-Trees
- B-trees break the database down into fixed-size blocks or pages, traditionally 4 KB in size ( sometimes bigger ), and read or write one page at a time.
- This design corresponds more closely to the underlying hardware, as disks are also arranged in fixed-size blocks.
- Each page can be identified using an address or location, which allows one page to refer to another -- similar to a pointer, but on disk instead of in memory.
- We can use these page references to construct a tree of pages
- The number of references to child pages in one page of the B-tree is called the branching factor.
- In practice, the branching factor depends on the amount of space required to store the page references and the range boundaries, but typically it is several hundred.
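- To make that concrete ( a worked example; the book quotes the same 256 TB figure ) : with 4 KB pages and a branching factor of 500, a four-level tree can reference 500^4 ≈ 6.25 × 10^10 pages, i.e. about 6.25 × 10^10 × 4 KB ≈ 256 TB.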
Advantages of LSM-trees
- Log-structured indexes also rewrite data multiple times due to repeated compaction and merging of SSTables.
- This effect -- one write to the database resulting in multiple writes to the disk over the course of the database's lifetime -- is known as write amplification.
- It is of particular concern on SSDs, which can only overwrite blocks a limited number of times before wearing out.
- In write-heavy applications, the performance bottleneck might be the rate at which the database can write to disk.
- LSM-trees are typically able to sustain higher write throughput than B-trees, partly because they sometimes have lower write amplification ( although this depends on the storage engine configuration and workload ), and partly because they sequentially write compact SSTable files rather than having to overwrite several pages in the tree.
Downsides of LSM-trees
- The compaction process can sometimes interfere with the performance of ongoing reads and writes.
- Even though storage engines try to perform compaction incrementally and without affecting concurrent access, disks have limited resources, so it can easily happen that a request needs to wait while the disk finishes an expensive compaction operation.
- The impact on throughput and average response time is usually small, but at higher percentiles the response time of queries to log-structured storage engines can sometimes be quite high, and B-trees can be more predictable.
- The disk's finite write bandwidth needs to be shared between the initial write ( logging and flushing a memtable to disk ) and the compaction threads running in the background.
- When writing to an empty database, the full disk bandwidth can be used for the initial write, but the bigger the database gets, the more disk bandwidth is required for compaction.
- If write throughput is high and compaction is not configured carefully, it can happen that compaction cannot keep up with the rate of incoming writes.
- An advantage of B-trees is that each key exists in exactly one place in the index, whereas a log-structured storage engine may have multiple copies of the same key in different segments.
- This aspect makes B-trees attractive in databases that want to offer strong transactional semantics : in many relational databases, transaction isolation is implemented using locks on ranges of keys, and in a B-tree index, those locks can be directly attached to the tree.
Storing values within the index
- The key in an index is the thing that queries search for, but the value can be one of two things : it could be the actual row (document, vertex) in question, or it could be a reference to the row stored elsewhere.
- In the latter case, the place where rows are stored is known as a heap file, and it stores data in no particular order ( it may be append-only, or it may keep track of deleted rows in order to overwrite them with new data later ).
- The heap file approach is common because it avoids duplicating data when multiple secondary indexes are present : each index just references a location in the heap file, and the actual data is kept in one place.
- When updating a value without changing the key, the heap file approach can be quite efficient: the record can be overwritten in place, provided that the new value is not larger than the old value.
- The situation is more complicated if the new value is larger, as it probably needs to be moved to a new location in the heap where there is enough space.
- In that case, either all indexes need to be updated to point at the new heap location of the record, or a forwarding pointer is left behind in the old heap location.
Clustered Index
- In some situations, the extra hop from the index to the heap file is too much of a performance penalty for reads, so it can be desirable to store the indexed row directly within an index.
- For example, in MySQL's InnoDB storage engine, the primary key of a table is always a clustered index, and secondary indexes refer to the primary key ( rather than a heap file location ).
Covering Index
- A compromise between a clustered index ( storing all row data within the index ) and a nonclustered index ( storing only references to the data within the index ) is known as a covering index or index with included columns, which stores some of a table's columns within the index.
- This allows some queries to be answered by using the index alone ( in which case, the index is said to cover the query ) .
Multi-column indexes
- omitted…
Keeping everything in memory
- Products such as VoltDB, MemSQL, and Oracle TimesTen are in-memory databases with a relational model, and the vendors claim that they can offer big performance improvements by removing all the overheads associated with managing on-disk data structures.
- RAMCloud is an open source, in-memory key-value store with durability ( using a log-structured approach for the data in memory as well as the data on disk ) .
- Redis and Couchbase provide weak durability by writing to disk asynchronously.
- Counterintuitively, the performance advantage of in-memory databases is not due to the fact that they don't need to read from disk.
- Rather, they can be faster because they can avoid the overheads of encoding in-memory data structures in a form that can be written to disk.
Transaction : referring to a group of reads and writes that form a logical unit.
- A transaction needn't necessarily have ACID (atomicity, consistency, isolation, and durability) properties.
- Transaction processing just means allowing clients to make low-latency reads and writes -- as opposed to batch processing jobs, which only run periodically (for example, once per day).
OLTP
- An application typically looks up a small number of records by some key, using an index. Records are inserted or updated based on the user's input.
- Because these applications are interactive, the access pattern became known as online transaction processing ( OLTP ) .
OLAP
- Queries for data analytics are often written by business analysts, and feed into reports that help the management of a company make better decisions (business intelligence).
- In order to differentiate this pattern of using databases from transaction processing, it has been called online analytic processing ( OLAP ) .
| Property | Transaction processing systems (OLTP) | Analytic systems (OLAP) |
|---|---|---|
| Main read pattern | Small number of records per query, fetched by key | Aggregate over large number of records |
| Main write pattern | Random-access, low-latency writes from user input | Bulk import (ETL) or event stream |
| Primarily used by | End user/customer, via web application | Internal analyst, for decision support |
| What data represents | Latest state of data (current point in time) | History of events that happened over time |
| Dataset size | Gigabytes to terabytes (GB~TB) | Terabytes to petabytes (TB~PB) |
- These OLTP systems are usually expected to be highly available and to process transactions with low latency, since they are often critical to the operation of the business.
- They are usually reluctant to let business analysts run ad hoc analytic queries on an OLTP database, since those queries are often expensive, scanning large parts of the dataset, which can harm the performance of concurrently executing transactions.
- A data warehouse, by contrast, is a separate database that analysts can query to their hearts' content, without affecting OLTP operations.
- The data warehouse contains a read-only copy of the data in all the various OLTP systems in the company.
- Data is extracted from OLTP databases (using either a periodic data dump or a continuous stream of updates), transformed into an analysis-friendly schema, cleaned up, and then loaded into the data warehouse.
- This process of getting data into the warehouse is known as Extract–Transform–Load (ETL).
The divergence between OLTP databases and data warehouses
- The data model of a data warehouse is most commonly relational, because SQL is generally a good fit for analytic queries.
- There are many graphical data analysis tools that generate SQL queries, visualize the results, and allow analysts to explore the data ( through operations such as drill-down and slicing and dicing ).
- On the surface, a data warehouse and a relational OLTP database look similar, because they both have a SQL query interface.
- However, the internals of the systems can look quite different, because they are optimized for very different query patterns.
- Many database vendors now focus on supporting either transaction processing or analytics workloads, but not both.
Stars and Snowflakes: Schemas for Analytics
- Many data warehouses are used in a fairly formulaic style, known as a star schema ( also known as dimensional modeling ).
- At the center of the schema is a so-called fact table.
- Each row of the fact table represents an event that occurred at a particular time.
- Usually, facts are captured as individual events, because this allows maximum flexibility of analysis later.
- However, this means that the fact table can become extremely large.
- Some of the columns in the fact table are attributes.
- Other columns in the fact table are foreign key references to other tables, called dimension tables.
- As each row in the fact table represents an event, the dimensions represent the who, what, where, when, how, and why of the event.
- A variation of this template is known as the snowflake schema, where dimensions are further broken down into subdimensions.
- Snowflake schemas are more normalized than star schemas, but star schemas are often preferred because they are simpler for analysts to work with.
- In a typical data warehouse, tables are often very wide : fact tables often have over 100 columns, sometimes several hundred.
- Dimension tables can also be very wide, as they include all the metadata that may be relevant for analysis.
Column-Oriented Storage
- If you have trillions of rows and petabytes of data in your fact tables, storing and querying them efficiently becomes a challenging problem.
- Dimension tables are usually much smaller ( millions of rows ), so in this section we will concentrate primarily on storage of facts.
- Although fact tables are often over 100 columns wide, a typical data warehouse query only accesses 4 or 5 of them at one time ("SELECT *" queries are rarely needed for analytics).
- ( icehe : though some queries, e.g. computing "delivery-rider cohorts", may well need more than 4 or 5 columns )
How can we execute the query efficiently?
- In most OLTP databases, storage is laid out in a row-oriented fashion:
- all the values from one row of a table are stored next to each other.
- Document databases are similar:
- an entire document is typically stored as one contiguous sequence of bytes.
- The idea behind column-oriented storage is simple:
- don't store all the values from one row together, but store all the values from each column together instead.
- If each column is stored in a separate file, a query only needs to read and parse those columns that are used in that query, which can save a lot of work.
- The column-oriented storage layout relies on each column file containing the rows in the same order.
Column Compression
- Besides only loading those columns from disk that are required for a query, we can further reduce the demands on disk throughput by compressing data.
- Fortunately, column-oriented storage often lends itself very well to compression.
- One technique that is particularly effective in data warehouses is bitmap encoding.
- Often, the number of distinct values in a column is small compared to the number of rows.
- ( For example, a retailer may have billions of sales transactions, but only 100,000 distinct products ).
- We can now take a column with n distinct values and turn it into n separate bitmaps :
- one bitmap for each distinct value, with one bit for each row.
- The bit is 1 if the row has that value, and 0 if not.
- If n is very small ( for example, a country column may have approximately 200 distinct values ), those bitmaps can be stored with one bit per row.
- But if n is bigger, there will be a lot of zeros in most of the bitmaps ( we say that they are sparse ). In that case, the bitmaps can additionally be run-length encoded.
- This can make the encoding of a column remarkably compact.
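A small Python sketch of both ideas ( illustration only; the product_sk values echo the book's example figure ) : build one bitmap per distinct value, then run-length encode a sparse bitmap as alternating run lengths, starting with zeros.

```python
def to_bitmaps(column):
    """One bitmap per distinct value; bit i is 1 iff row i has that value."""
    return {
        v: [1 if cell == v else 0 for cell in column]
        for v in sorted(set(column))
    }

def run_length_encode(bitmap):
    """Alternating run lengths, by convention starting with a run of zeros."""
    runs, current, run = [], 0, 0
    for bit in bitmap:
        if bit == current:
            run += 1
        else:
            runs.append(run)
            current, run = bit, 1
    runs.append(run)
    return runs

product_sk = [69, 69, 69, 69, 74, 31, 31, 31]
bitmaps = to_bitmaps(product_sk)
print(bitmaps[31])                     # [0, 0, 0, 0, 0, 1, 1, 1]
print(run_length_encode(bitmaps[31]))  # [5, 3] : five 0s, then three 1s
```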
Column-oriented storage and column families
- Cassandra and HBase have a concept of column families, which they inherited from Bigtable.
- However, it is very misleading to call them column-oriented :
- within each column family, they store all columns from a row together, along with a row key, and they do not use column compression.
- Thus, the Bigtable model is still mostly row-oriented.
Sort Order in Column Storage
- Note that it wouldn't make sense to sort each column independently, because then we would no longer know which items in the columns belong to the same row.
- We can only reconstruct a row because we know that the kth item in one column belongs to the same row as the kth item in another column.
- …
Writing to Column-Oriented Storage
- Most of the load consists of large read-only queries run by analysts.
- Column-oriented storage, compression, and sorting all help to make those read queries faster.
- However, they have the downside of making writes more difficult.
- An update-in-place approach, like B-trees use, is not possible with compressed columns.
- If you wanted to insert a row in the middle of a sorted table, you would most likely have to rewrite all the column files.
- As rows are identified by their position within a column, the insertion has to update all columns consistently.
Encoding and Evolution
- REST : Representational State Transfer
- RPC : Remote Procedure Call
Old and new versions of the code, and old and new data formats, may potentially all coexist in the system at the same time. In order for the system to continue running smoothly, we need to maintain compatibility in both directions:
- Backward compatibility
- Newer code can read data that was written by older code.
- Forward compatibility
- Older code can read data that was written by newer code.
Programs usually work with data in (at least) two different representations:
- 1. In memory, data is kept in objects, structs, lists, arrays, hash tables, trees, and so on.
- These data structures are optimized for efficient access and manipulation by the CPU ( typically using pointers ).
- 2. When you want to write data to a file or send it over the network, you have to encode it as some kind of self-contained sequence of bytes ( for example, a JSON document ).
Representation Translation
- The translation from the in-memory representation to a byte sequence is called encoding
- ( aka. serialization or marshalling ),
- and the reverse is called decoding
- ( aka. parsing, deserialization, unmarshalling ).
Terminology Clash
- Serialization is unfortunately also used in the context of transactions, with a completely different meaning.
- To avoid overloading the word we'll stick with encoding ( in this book ) , even though serialization is perhaps a more common term.
Many programming languages come with built-in support for encoding in-memory objects into byte sequences. For example :
- Java : java.io.Serializable
- Third-party : Kryo
- Ruby : Marshal
- Python : pickle
- …
Some deep problems with using a programming language's built-in encoding libraries : ( see the book for details; summarized below )
- The encoding is tied to a particular programming language : data encoded in one language is very difficult to read from another.
- In order to restore data in the same object types, the decoding process needs to be able to instantiate arbitrary classes.
- This is frequently a source of security problems : if an attacker can get your application to decode an arbitrary byte sequence, they can instantiate arbitrary classes, which in turn often allows them to do terrible things such as remotely executing arbitrary code.
- Versioning data is often an afterthought in these libraries :
- as they are intended for quick and easy encoding of data,
- they often neglect the inconvenient problems of forward and backward compatibility.
- Efficiency ( CPU time taken to encode or decode, and the size of the encoded structure ) is also often an afterthought.
- For example, Java's built-in serialization is notorious for its bad performance and bloated encoding.
JSON, XML, and CSV are textual formats, and thus somewhat human-readable (although the syntax is a popular topic of debate). Besides the superficial syntactic issues, they also have some subtle problems : ( see the book for details; summarized below )
- Ambiguity around numbers and loss of precision : e.g., JavaScript's Number cannot exactly represent integers beyond 2^53.
- Unicode strings are supported, but binary data has to be Base64-encoded as ASCII text, which increases the data size by 33%.
- Schema support is optional or inconvenient, so the correct interpretation of the data often has to be hard-coded into the encoding/decoding logic.
- CSV has no schema at all and is a quite weak format : not all parsers handle escaping ( commas, newlines ) correctly.
Binary encoding
- JSON is less verbose than XML, but both still use a lot of space compared to binary formats.
- This observation led to the development of a profusion of binary encodings for JSON and for XML.
- JSON : MessagePack, BSON, BJSON, UBJSON, BISON, Smile / …
- XML : WBXML / Fast Infoset / …
- MessagePack format example ( see the worked example in the book )
Thrift and Protocol Buffers
- Apache Thrift and Protocol Buffers (protobuf) are binary encoding libraries that are based on the same principle.
- Protocol Buffers from Google
- Thrift from Facebook
- Both Thrift and Protocol Buffers require a schema for any data that is encoded.
- Thrift : You would describe the schema in the Thrift interface definition language ( IDL )
```thrift
# Thrift interface definition language (IDL)
struct Person {
  1: required string       userName,
  2: optional i64          favoriteNumber,
  3: optional list<string> interests
}
```

```protobuf
// The same record as a Protocol Buffers schema
message Person {
    required string user_name       = 1;
    optional int64  favorite_number = 2;
    repeated string interests       = 3;
}
```
Confusingly, Thrift has two different binary encoding formats :
- BinaryProtocol
- CompactProtocol
- ( There is also DenseProtocol, but it is only supported by the C++ implementation, so it doesn't count as cross-language ) …
Avro
- Apache Avro is another binary encoding format that is interestingly different from Protocol Buffers and Thrift.
- It was started in 2009 as a subproject of Hadoop, as a result of Thrift not being a good fit for Hadoop's use cases.
- Avro also uses a schema to specify the structure of the data being encoded.
- It has two schema languages : one (Avro IDL) intended for human editing, and one (based on JSON) that is more easily machine-readable.
- The writer's schema and the reader's schema
- The key idea with Avro is that the writer's schema and the reader's schema don't have to be the same -- they only need to be compatible.
- When data is decoded (read), the Avro library resolves the differences by looking at the writer's schema and the reader's schema side by side and translating the data from the writer's schema into the reader's schema.
- It is no problem if the writer's schema and the reader's schema have their fields in a different order, because the schema resolution matches up the fields by field name.
- If the code reading the data encounters a field that appears in the writer's schema but not in the reader's schema, it is ignored.
- If the code reading the data expects some field, but the writer's schema does not contain a field of that name, it is filled in with a default value declared in the reader's schema.
- ( see the book for the remaining rules )
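To make the resolution rules concrete, here is a sketch of a compatible writer/reader pair ( the schemas are written as Python dicts in Avro's JSON schema language; the userName / favoriteNumber / interests field names follow the book's Person example ) :

```python
# Writer's schema : what the encoding application used
# (an older version, say, that never had favoriteNumber).
writer_schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName",  "type": "string"},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
    ],
}

# Reader's schema : what the decoding application expects. Fields are
# matched by name, so ordering differences don't matter; "interests" is
# unknown to this reader and is simply ignored, while the missing
# "favoriteNumber" is filled in from the default declared here.
reader_schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName",       "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
    ],
}
```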
The most common ways how data flows between processes:
- Via databases
- Via service calls ( REST and RPC )
- Via asynchronous message passing
- This approach is often used to decompose a large application into smaller services by area of functionality, such that one service makes a request to another when it requires some functionality or data from that other service.
- This way of building applications has traditionally been called a service-oriented architecture (SOA), more recently refined and rebranded as microservices architecture.
- A key design goal of a service-oriented/microservices architecture is
- to make the application easier to change and maintain by making services independently deployable and evolvable.
- For example, each service should be owned by one team, and that team should be able to release new versions of the service frequently, without having to coordinate with other teams.
- In other words, we should expect old and new versions of servers and clients to be running at the same time, and so the data encoding used by servers and clients must be compatible across versions of the service API, precisely what we've been talking about in this chapter. ( icehe : take care not to break compatibility )
The problems with remote procedure calls (RPCs)
- The RPC model tries to make a request to a remote network service look the same as calling a function or method in your programming language, within the same process ( this abstraction is called location transparency ).
- Although RPC seems convenient at first, the approach is fundamentally flawed.
A network request is very different from a local function call:
- A local function call is predictable and either succeeds or fails, depending only on parameters that are under your control.
- A network request is unpredictable : the request or response may be lost due to a network problem, or the remote machine may be slow or unavailable, and such problems are entirely outside of your control.
- Network problems are common, so you have to anticipate them, for example by retrying a failed request.
- A local function call either returns a result, or throws an exception, or never returns (because it goes into an infinite loop or the process crashes). ( icehe : unexpected conditions add more kinds of outcomes, e.g. failure by timeout )
- A network request has another possible outcome : it may return without a result, due to a timeout.
- In that case, you simply don't know what happened : if you don't get a response from the remote service, you have no way of knowing whether the request got through or not.
- If you retry a failed network request, it could happen that the requests are actually getting through, and only the responses are getting lost.
- In that case, retrying will cause the action to be performed multiple times, unless you build a mechanism for deduplication ( idempotence ) into the protocol ( see the sketch after this list ).
- Local function calls don't have this problem.
- Every time you call a local function, it normally takes about the same time to execute.
- A network request is much slower than a function call, and its latency is also wildly variable : at good times it may complete in less than a millisecond, but when the network is congested or the remote service is overloaded it may take many seconds to do exactly the same thing.
- When you call a local function, you can efficiently pass it references (pointers) to objects in local memory.
- When you make a network request, all those parameters need to be encoded into a sequence of bytes that can be sent over the network.
- That's okay if the parameters are primitives like numbers or strings, but quickly becomes problematic with larger objects.
- The client and the service may be implemented in different programming languages, so the RPC framework must translate datatypes from one language into another.
- This can end up ugly, since not all languages have the same types -- recall JavaScript's problems with numbers greater than 2^53, for example.
- This problem doesn't exist in a single process written in a single language.
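A minimal sketch of such a deduplication mechanism ( illustration only; names like transfer_money are made up ) : the client attaches a unique ID to each logical request, and the server remembers which IDs it has already processed, so a retry replays the stored result instead of repeating the side effect.

```python
import uuid

processed = {}  # request_id -> result, kept on the server side

def transfer_money(request_id: str, amount: int) -> str:
    """Idempotent handler: replaying the same request_id has no
    further effect, so client retries are safe."""
    if request_id in processed:
        return processed[request_id]   # duplicate delivery: replay the result
    result = f"transferred {amount}"   # the actual side effect happens once
    processed[request_id] = result
    return result

request_id = str(uuid.uuid4())          # one ID per logical request
print(transfer_money(request_id, 100))  # first attempt
print(transfer_money(request_id, 100))  # retry after a lost response: no double charge
```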
Asynchronous message-passing systems are somewhere between RPC and databases
- They are similar to RPC in that a client's request ( usually called a message ) is delivered to another process with low latency.
- They are similar to databases in that the message is not sent via a direct network connection, but goes via an intermediary called a message broker ( also called a message queue or message-oriented middleware ), which stores the message temporarily.
Using a message broker has several advantages compared to direct RPC :
- It can act as a buffer if the recipient is unavailable or overloaded,
- and thus improve system reliability.
- It can automatically redeliver messages to a process that has crashed,
- and thus prevent messages from being lost.
- It avoids the sender needing to know the IP address and port number of the recipient
- ( which is particularly useful in a cloud deployment where virtual machines often come and go ).
- It allows one message to be sent to several recipients.
- It logically decouples the sender from the recipient
- ( the sender just publishes messages and doesn't care who consumes them ).
Message brokers
- omitted…
Distributed actor frameworks
- The actor model is a programming model for concurrency in a single process.
- Each actor typically represents one client or entity, it may have some local state ( which is not shared with any other actor ), and it communicates with other actors by sending and receiving asynchronous messages.
- Message delivery is not guaranteed : in certain error scenarios, messages will be lost.
- Since each actor processes only one message at a time, it doesn't need to worry about threads, and each actor can be scheduled independently by the framework.
- Location transparency works better in the actor model than in RPC, because the actor model already assumes that messages may be lost, even within a single process.
- omitted…
Part II. Distributed Data
There are various reasons why you might want to distribute a database across multiple machines:
- Scalability
- If your data volume, read load, or write load grows bigger than a single machine can handle, you can potentially spread the load across multiple machines.
- Fault tolerance / high availability
- If your application needs to continue working even if one machine (or several machines, or the network, or an entire datacenter) goes down, you can use multiple machines to give you redundancy.
- When one fails, another one can take over.
- Latency
- If you have users around the world, you might want to have servers at various locations worldwide so that each user can be served from a datacenter that is geographically close to them.
- That avoids the users having to wait for network packets to travel halfway around the world.
Scaling to Higher Load
- Shared-memory architecture
- If all you need is to scale to higher load, the simplest approach is to buy a more powerful machine ( sometimes called vertical scaling or scaling up ).
- Many CPUs, many RAM chips, and many disks can be joined together under one operating system, and a fast interconnect allows any CPU to access any part of the memory or disk.
- In this kind of shared-memory architecture, all the components can be treated as a single machine.
- The problem with a shared-memory approach is that the cost grows faster than linearly.
- ( And due to bottlenecks, a machine twice the size cannot necessarily handle twice the load )
- A shared-memory architecture may offer limited fault tolerance.
- It is definitely limited to a single geographic location.
- Shared-disk architecture
- It uses several machines with independent CPUs and RAM,
- but stores data on an array of disks that is shared between the machines, which are connected via a fast network.
- This architecture is used for some data warehousing workloads,
- but contention and the overhead of locking limit the scalability of the shared-disk approach.
- Shared-nothing architecture
- It's sometimes called horizontal scaling or scaling out.
- In this approach, each machine or virtual machine running the database software is called a node.
- Each node uses its CPUs, RAM, and disks independently.
- Any coordination between nodes is done at the software level, using a conventional network.
- No special hardware is required by a shared-nothing system, so you can use whatever machines have the best price/performance ratio.
- You can potentially distribute data across multiple geographic regions, and thus reduce latency for users and potentially be able to survive the loss of an entire datacenter.
- ( icehe : in practice, for ease of operations, providers usually offer only a few standard node instance types : storage-optimized / compute-optimized / memory-optimized / … )
- While a distributed shared-nothing architecture has many advantages,
- it usually also incurs additional complexity for applications and sometimes limits the expressiveness of the data models you can use.
Replication Versus Partitioning
- Replication
- Keeping a copy of the same data on several different nodes, potentially in different locations.
- Replication provides redundancy :
- if some nodes are unavailable, the data can still be served from the remaining nodes.
- Replication can also help improve performance.
- Partitioning
- Splitting a big database into smaller subsets called partitions,
- so that different partitions can be assigned to different nodes ( also known as sharding ).