Persistent Database
We want the data to never be lost
A persistent database makes sure data is not lost even when the database is stopped, whether on purpose or unexpectedly, such as during a power outage.
MySQL and Postgres are common choices for persistent databases. They are classic Relational Databases (RDBMS) and use SQL as their query language.
In the Relational Model, we break an entity down into several tables and relationships. This process of structuring the database is called Normalization, and its main goals are to reduce Data Redundancy and improve Data Integrity.
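As a rough sketch of normalization (using SQLite purely for illustration; the table and column names are made up), a customer's details live in one place and each order only references them, instead of repeating the name and email on every order row:

```python
import sqlite3

# Illustrative only: customer details are stored once in `customers`,
# and each order references them by id instead of duplicating them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    email TEXT NOT NULL
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    total       REAL NOT NULL
);
""")
conn.execute("INSERT INTO customers (id, name, email) VALUES (1, 'Alice', 'alice@example.com')")
conn.execute("INSERT INTO orders (customer_id, total) VALUES (1, 42.0)")

# Joining the tables reassembles the full entity when we need it.
row = conn.execute("""
    SELECT o.id, c.name, o.total
    FROM orders o JOIN customers c ON c.id = o.customer_id
""").fetchone()
print(row)  # (1, 'Alice', 42.0)
```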
Unlike an RDBMS, a Document Database like MongoDB keeps an entity in a single collection and is flexible about its structure. This can offer better query performance and a simpler data design. MongoDB is also a good alternative as a Persistent Database.
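In the document model, the same entity can be stored as one nested document. A minimal sketch with the pymongo driver (the connection string, database, and field names are assumptions for illustration):

```python
from pymongo import MongoClient

# Assumed local MongoDB instance; database and collection names are illustrative.
client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# The customer details are embedded in the order document itself,
# so a single read returns the whole entity without a join.
orders.insert_one({
    "order_id": 1,
    "total": 42.0,
    "customer": {"name": "Alice", "email": "alice@example.com"},
})
print(orders.find_one({"order_id": 1}))
```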
Low-Latency Database
We want to get the data a.s.a.p
While a persistent database stores data on disk, a low-latency database usually keeps it in RAM (it is sometimes called an In-Memory Database). The main use case for this kind of database is caching.
It usually doesn't keep data for long and relies on a Time-To-Live (TTL) and an Eviction Policy. The data is flushed on every restart, although it can be configured to persist to disk.
Low-latency databases like Redis and Memcached are mostly simple key-value stores and don't support complex data structures.
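A minimal cache-aside sketch with the redis-py client (the host, key format, and the load_user_from_db placeholder are assumptions for illustration):

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed local Redis instance

def load_user_from_db(user_id):
    # Placeholder for the real (slower) persistent-database query.
    return {"id": user_id, "name": "Alice"}

def get_user(user_id):
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached is not None:                    # cache hit: serve straight from RAM
        return json.loads(cached)
    user = load_user_from_db(user_id)
    r.set(key, json.dumps(user), ex=60)       # cache miss: store with a 60-second TTL
    return user

print(get_user(1))
```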
Message Broker
We want to process data asynchronously
Similar to real life: we drop someone a message so we can do other things while waiting for them to respond (asynchronous/non-blocking). For urgent matters, we make a phone call and cannot do anything else until the conversation is finished (synchronous/blocking).
A Message Broker like Kafka or RabbitMQ is commonly used to handle processes asynchronously in distributed computing. A message can be data sent to other services (inter-service communication) or a job request sent to a worker (job queue).
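A minimal sketch of a job queue with the kafka-python client (the broker address, topic name, and message shape are assumptions); the producer drops the job and moves on, while a worker picks it up whenever it is ready:

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: fire-and-forget, the caller is not blocked by the worker.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("email-jobs", {"to": "alice@example.com", "template": "welcome"})
producer.flush()

# Worker side (typically a separate process): consumes jobs asynchronously.
consumer = KafkaConsumer(
    "email-jobs",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print("processing job:", message.value)
```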
Search Engine Database
We want to make a “smart” search query
Users don't know exactly what they are looking for (that's why they search), yet the application is expected to return what they want. So we want a “smart” search that can handle fuzzy search (typos are okay), popularity search (what is searched for most), full-text search (matching text within the text), and so on.
While a mainstream database manages the data in a straightforward way, implementing a “smart” search on top of it can be really CPU- and memory-intensive. Elasticsearch is a popular search engine for getting good search results.
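As a rough sketch with the official Elasticsearch Python client (8.x-style keyword arguments; the index name and documents are made up), a fuzzy full-text match still finds the product even when the user mistypes it:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Index a couple of illustrative documents.
es.index(index="products", id=1, document={"name": "iPhone 13 case"})
es.index(index="products", id=2, document={"name": "Android charger"})
es.indices.refresh(index="products")

# Full-text match with fuzziness: "ipone" (a typo) still hits "iPhone".
result = es.search(
    index="products",
    query={"match": {"name": {"query": "ipone", "fuzziness": "AUTO"}}},
)
for hit in result["hits"]["hits"]:
    print(hit["_source"]["name"], hit["_score"])
```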
Implementing a search engine for basic search needs is not encouraged, since it can be costly: the data has to be kept in sync with the persistent database, and the infrastructure itself takes effort to maintain (a bunch of configuration).
Another common use case for a search engine in tech companies is logging. The ELK Stack (later called the Elastic Stack) combines Elasticsearch with Logstash and Kibana to manage application logs.
Reporting Database
We want ready-to-use data for reporting
Reporting is important for the business to continually improve and innovate. Usually we don't put reporting and the application into one basket; we treat them as separate problems. Not to mention that in this big data era, reporting has become even more vital and complicated, both in terms of volume and computation.
The application database keeps changing (in structure/format) along with development and incidents. Overall, the data shape is sometimes not good and full of war scars. On the other hand, management needs clean and insightful reports. Bad data eventually produces bad reports (Garbage In, Garbage Out).
Another concern is performance. Reporting needs to fetch and analyze data over weeks or months with specific criteria, and such queries can eat up resources (CPU/RAM) that are supposed to serve the customer.
Since the data was designed for the application in the first place, processing it for reporting can be expensive. Vice versa, data designed for reporting is not optimal for the application use case.
So in order to produce fast and clean reports, data engineers build a Data Pipeline/ETL to move the data into a database that is formatted and designed for reporting, as in the sketch below. The concept is called the Single Source of Truth, and the database is called a Data Mart or Data Warehouse.
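A toy ETL sketch (using SQLite to stand in for both the application database and the warehouse; all table names are made up) that extracts raw orders, transforms them into daily aggregates, and loads the result into a reporting table:

```python
import sqlite3

app_db = sqlite3.connect(":memory:")      # stand-in for the application database
warehouse = sqlite3.connect(":memory:")   # stand-in for the data warehouse

# Extract: raw, application-shaped data.
app_db.executescript("""
CREATE TABLE orders (id INTEGER, created_at TEXT, total REAL);
INSERT INTO orders VALUES (1, '2023-01-01', 10.0), (2, '2023-01-01', 15.0), (3, '2023-01-02', 7.5);
""")
rows = app_db.execute("SELECT created_at, total FROM orders").fetchall()

# Transform: aggregate per day, a shape that is convenient for reporting.
daily = {}
for created_at, total in rows:
    daily[created_at] = daily.get(created_at, 0.0) + total

# Load: write the report-friendly table into the warehouse.
warehouse.execute("CREATE TABLE daily_revenue (day TEXT, revenue REAL)")
warehouse.executemany("INSERT INTO daily_revenue VALUES (?, ?)", daily.items())
print(warehouse.execute("SELECT * FROM daily_revenue ORDER BY day").fetchall())
```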
Google BigQuery can process huge amounts of data and is relatively cost-efficient for this purpose. You mostly pay for query computation rather than for how much data is stored, which is relevant for data warehousing, where we keep years of considerably large data.
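A minimal sketch with the google-cloud-bigquery client (the project, dataset, and table names are assumptions, and valid GCP credentials are required):

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses the default GCP credentials and project

# Scan months of warehouse data without touching the application database.
query = """
    SELECT DATE(created_at) AS day, SUM(total) AS revenue
    FROM `my-project.sales.orders`
    WHERE created_at >= '2023-01-01'
    GROUP BY day
    ORDER BY day
"""
for row in client.query(query).result():
    print(row.day, row.revenue)
```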
Config Store
We want to change the configuration remotely
The Google Play Store and Apple App Store have a review process before a new version of a mobile app is published. It is vital to have remote configuration, such as Google Firebase, so we can update important variables and change behavior without producing a new release.
It is common for frontend/mobile teams to adopt Trunk Based Development to manage application version branches, which relies on Feature Toggles as part of the configuration. Feature Toggles also allow stakeholders to experiment through A/B Testing to provide a better user experience.
The backend also needs to centralize configuration, especially with an increasing number of running services, which is common in a Microservice Architecture.
Consul offers a Key-Value Store feature for distributed configuration to ensure all service instances use the same configuration. To store and secure sensitive data such as tokens and passwords, we can use something like Vault.
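A minimal feature-toggle sketch with the python-consul client (the Consul agent address and the key name are assumptions for illustration):

```python
import consul

c = consul.Consul(host="127.0.0.1", port=8500)  # assumed local Consul agent

# Flip the toggle centrally; every service instance reads the same value.
c.kv.put("features/new-checkout", "true")

index, data = c.kv.get("features/new-checkout")
new_checkout_enabled = data is not None and data["Value"] == b"true"

if new_checkout_enabled:
    print("serving the new checkout flow")
else:
    print("serving the old checkout flow")
```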
Time-Series Database
We want to watch the data over time
Data like CPU/memory usage, IoT device readings, and stock movements are examples of time series.
Time-series databases like InfluxDB are dedicated to managing this kind of data. Time series are really helpful for monitoring, alerting, and prediction. Combined with other tools (for visualization or data ingestion), the database is easily set up to serve the need.
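A minimal sketch of writing a CPU-usage point with the influxdb-client library (InfluxDB 2.x; the URL, token, org, and bucket names are assumptions):

```python
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# One time-series point: measurement "cpu", tagged by host, with a usage field.
point = Point("cpu").tag("host", "server-01").field("usage_percent", 64.2)
write_api.write(bucket="metrics", record=point)
```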
DevOps can use the TIG Stack for infrastructure monitoring by putting InfluxDB together with Telegraf (data collection) and Grafana (data visualization).
Graph Database
We want to understand the relationship
Graph theory is important in computer science, with many applications such as finding the shortest path or analyzing social networks.
A Graph Database like Neo4j is helpful when we need to analyze a large amount of data as a graph, as data scientists often do. In my previous company, we used it to detect anomalous/fraudulent relationships between customers and vendors.
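As an illustration with the official neo4j Python driver (the connection details and the graph model of customers sharing a phone number are made up), a Cypher query can surface suspicious relationships directly:

```python
from neo4j import GraphDatabase

# Assumed local Neo4j instance and credentials.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Find pairs of customers that share the same phone number -- a simple,
# illustrative signal that two accounts might belong to the same actor.
query = """
MATCH (a:Customer)-[:HAS_PHONE]->(p:Phone)<-[:HAS_PHONE]-(b:Customer)
WHERE a <> b
RETURN a.name AS customer_a, b.name AS customer_b, p.number AS shared_phone
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["customer_a"], record["customer_b"], record["shared_phone"])
driver.close()
```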
Do you know any other types of databases? Can you share your experience dealing with the databases mentioned in this article? Let us know your thoughts in the comments below.