HBase
Basics
- HBase is a column-oriented key-value store modeled on Google’s BigTable. It runs on top of HDFS (Hadoop Distributed File System).
- HBase has high write throughput and scales well horizontally.
- HBase works well for flexible data models with sparse records, i.e. records in which many of the fields are empty/null.
- It also supports features such as automatic failover, versioning and compression out of the box.
- Altering a column family is expensive, as HBase has to create a new column family and copy over all the data.
- Each column family can be tuned individually to affect read and write performance and space consumption. You miss out on this fine-grained tuning if you denormalize everything into fewer column families.
- Column families are stored in separate directories, so a read only has to touch the column families it is interested in. This is great for read-heavy workloads.
- Operations are atomic at the row level, so data across multiple column families in a record is always consistent.
- Rows are sorted by the row-key.
- Regions are chunks of rows stored on a region server. Each region server typically stores multiple regions.
- The architecture comprises a Master server and multiple region servers. The Master server uses ZooKeeper to manage configuration and synchronization. Read/write requests from clients go directly to the region servers.
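Because rows are sorted by row-key and each region holds a contiguous chunk of rows, finding the region for a key amounts to a search over region start keys. The following is a minimal sketch of that idea only, not HBase's actual client code; the start keys, the server names and the `locate_region` helper are all hypothetical.

```python
from bisect import bisect_right
from typing import Tuple

# Hypothetical layout: region i covers row keys in
# [REGION_START_KEYS[i], REGION_START_KEYS[i + 1]).
# The first region starts at the empty key so it covers the lowest rows.
REGION_START_KEYS = ["", "g", "n", "t"]

# Hypothetical assignment of regions to region servers.
# One region server ("rs1") hosts more than one region.
REGION_SERVERS = ["rs1", "rs2", "rs1", "rs3"]


def locate_region(row_key: str) -> Tuple[int, str]:
    """Return (region index, region server) responsible for row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before that position contains the key.
    idx = bisect_right(REGION_START_KEYS, row_key) - 1
    return idx, REGION_SERVERS[idx]
```

With this layout, `locate_region("nope")` resolves to region 2 on `rs1`, and a client would then send its read or write straight to that region server.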
Internals
- HBase uses Log-Structured Merge trees (LSM trees) to store data.
- Rows of data in HBase are split into shards called regions. Every region is stored on a region server. A region server may host one or more regions.
- When data is written to HBase, it is first recorded in a WAL (Write-Ahead Log) called HLog. The data is then stored in an in-memory MemStore, which is flushed to disk when its size reaches a specific threshold. The file written to disk is an immutable index-organized data file called an HFile.
- HBase performs compaction to reduce read overhead by merging several smaller files into fewer, larger ones. During this process, two or more sorted HFiles are merge-sorted into a single file. This co-locates all data for a given key on disk and eliminates deleted data.
- Automatic failover is supported: all regions on a failed region server migrate to another region server. This is done by replaying the WAL entries associated with each region on its new region server, which makes failovers expensive.
- HDFS supports HBase with replication, end-to-end checksums and automatic rebalancing.
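The write path and compaction described above can be sketched as a toy in-memory model. This is an illustration under simplifying assumptions, not how HBase is implemented: `TinyRegion`, `TOMBSTONE`, the sequence numbers and the entry-count flush threshold are all hypothetical, and the "HFiles" here are plain sorted Python lists rather than files on HDFS.

```python
from heapq import merge

TOMBSTONE = None  # hypothetical marker meaning "this key was deleted"


class TinyRegion:
    """Toy model of one region: a WAL, a MemStore and immutable sorted files."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # append-only log of (seq, key, value)
        self.memstore = {}   # key -> (seq, value), held in memory
        self.hfiles = []     # each "HFile" is a sorted list of (key, seq, value)
        self.seq = 0         # increasing sequence number, newer writes win
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.seq += 1
        self.wal.append((self.seq, key, value))   # 1. record in the WAL first
        self.memstore[key] = (self.seq, value)    # 2. then buffer in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                          # 3. flush once the threshold is hit

    def delete(self, key):
        # A delete is just another write that stores a tombstone.
        self.put(key, TOMBSTONE)

    def flush(self):
        """Write the MemStore out as one immutable, key-sorted file."""
        if not self.memstore:
            return
        self.hfiles.append(sorted((k, s, v) for k, (s, v) in self.memstore.items()))
        self.memstore.clear()

    def get(self, key):
        """Check the MemStore first, then every file (the read overhead)."""
        if key in self.memstore:
            return self.memstore[key][1]
        best = None
        for hfile in self.hfiles:
            for k, s, v in hfile:
                if k == key and (best is None or s > best[0]):
                    best = (s, v)
        return best[1] if best else None

    def compact(self):
        """Merge-sort all files into one, keeping the newest cell per key
        and dropping tombstones (safe here because every file is merged)."""
        latest = {}
        for k, s, v in merge(*self.hfiles):
            if k not in latest or s > latest[k][0]:
                latest[k] = (s, v)
        merged = [(k, s, v) for k, (s, v) in latest.items() if v is not TOMBSTONE]
        self.hfiles = [merged] if merged else []
```

Before compaction, a `get` may have to look in every file; afterwards there is at most one file and deleted keys no longer occupy space. The sketch never truncates its WAL and omits crash recovery, which in this model would mean replaying `wal` into a fresh `memstore`.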