HBase

Basics

  • HBase is a column-oriented key-value store modeled on Google’s Bigtable. It runs on top of HDFS (the Hadoop Distributed File System).
  • HBase has high write-throughput and scales well horizontally.
  • HBase works well for flexible data models with sparse records, i.e. records where most fields are empty/null.
  • It also supports features such as automatic failover, versioning and compression out of the box.
  • Altering a column family is expensive, as HBase has to create a new one and copy over all the data.
  • Each column family can be tuned independently to trade off read performance, write performance, and space consumption. You miss out on this fine-grained tuning if you de-normalize everything into fewer column families.
  • Column families are stored in separate directories, which lets you read only the columns you are interested in. This is great for read-heavy workloads.
  • Operations are atomic at the row level, so data across multiple column families in a record is always consistent.
  • Rows are sorted by the row-key.
  • Regions are chunks of rows stored on a region server. Each region server typically stores multiple regions.
  • The architecture comprises a Master server and multiple region servers. The Master uses ZooKeeper to manage configuration and synchronization. Client read/write requests go directly to the region servers.
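
The sorted row keys, column families, and row-level atomicity described above can be sketched with a toy table model. This is an illustration only, not the real HBase client API; all class and method names here are invented:

```python
class ToyTable:
    """Toy HBase-style table: rows sorted by row-key, values grouped
    into column families. Not the real HBase API."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = {}  # row_key -> {family: {qualifier: value}}

    def put(self, row_key, family, qualifier, value):
        # A put touches a single row, so all column families of that
        # row are updated together (single-threaded sketch of
        # HBase's row-level atomicity).
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        row = self.rows.setdefault(row_key, {f: {} for f in self.families})
        row[family][qualifier] = value

    def scan(self, start=None, stop=None):
        # Rows come back in sorted row-key order, like an HBase scan.
        for key in sorted(self.rows):
            if start is not None and key < start:
                continue
            if stop is not None and key >= stop:
                break
            yield key, self.rows[key]

t = ToyTable(["info", "stats"])
t.put("row2", "info", "name", "bob")
t.put("row1", "info", "name", "alice")
t.put("row1", "stats", "visits", 3)
print([k for k, _ in t.scan()])  # → ['row1', 'row2']
```

Note that rows come back sorted regardless of insertion order; in real HBase this sort order is what makes range scans over contiguous row keys cheap.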

Internals

  • HBase uses Log-Structured Merge trees (LSM trees) to store data.
  • Rows of data in HBase are split into shards called regions. Every region is stored on a region server. A region server may host one or more regions.
  • When data is written to HBase, it is first recorded in a WAL (Write-Ahead Log) called HLog. The data is then stored in an in-memory MemStore, which is flushed to disk when its size reaches a specific threshold. The file written to disk is an immutable index-organized data file called an HFile.
  • HBase performs compaction to reduce read overhead by merging several smaller files into fewer larger ones. During this process, two or more sorted HFiles are merge-sorted into a single file. This helps co-locate all data for a given key on disk, as well as eliminate deleted data.
  • Automatic failover is supported, enabling all regions on a failed region server to migrate to other region servers. This is done by replaying the WAL content associated with each region on a different region server, which makes failovers expensive.
  • HDFS supports HBase with replication, end-to-end checksums and automatic rebalancing.
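
The write path and compaction steps above can be sketched as a toy LSM store: writes hit a WAL first, then an in-memory MemStore; a full MemStore is flushed to an immutable sorted file; compaction merges the sorted files and drops deleted entries. All names here are invented for illustration and the real HBase internals are far more involved:

```python
TOMBSTONE = object()  # marker for a deleted key

class ToyLSM:
    def __init__(self, flush_threshold=3):
        self.wal = []        # stand-in for the HLog (write-ahead log)
        self.memstore = {}   # mutable, in-memory
        self.hfiles = []     # immutable sorted files, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))  # 1. record in the WAL first
        self.memstore[key] = value     # 2. then update the MemStore
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)       # deletes are tombstone writes

    def flush(self):
        # 3. MemStore is written out as an immutable, sorted file.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

    def compact(self):
        # Merge all sorted files into one: the newest value for each
        # key wins, and tombstoned (deleted) keys are dropped.
        merged = {}
        for hfile in self.hfiles:      # oldest first, newest overwrites
            for key, value in hfile:
                merged[key] = value
        self.hfiles = [sorted((k, v) for k, v in merged.items()
                              if v is not TOMBSTONE)]

    def get(self, key):
        if key in self.memstore:
            v = self.memstore[key]
            return None if v is TOMBSTONE else v
        # Check newest file first, since later flushes shadow earlier ones.
        for hfile in reversed(self.hfiles):
            for k, v in hfile:
                if k == key:
                    return None if v is TOMBSTONE else v
        return None

lsm = ToyLSM(flush_threshold=2)
lsm.put("a", 1)
lsm.put("b", 2)    # second write triggers a flush
lsm.put("a", 10)
lsm.delete("b")    # triggers another flush
lsm.compact()      # one sorted file remains; "b" is gone
```

Before compaction a read may have to consult several files per key; after compaction all live data for a key sits in one sorted file, which is exactly the read-overhead reduction the notes describe.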