Do you need to know about MongoDB? Then this post might help you.
We will start with NoSQL databases and the CAP theorem, and then look at MongoDB.
- A NoSQL database, also called “Not only SQL”, is an approach to data management and data design that’s useful for very large sets of distributed data.
- NoSQL, which encompasses a wide range of technologies and architectures, seeks to solve the scalability and big data performance issues that relational databases weren’t designed to address. NoSQL is especially useful when an enterprise needs to access and analyze massive amounts of unstructured data or data that’s stored remotely on multiple virtual servers in the cloud.
- Useful for low latency queries.
- Document-based database: A document-oriented database works with ‘documents’ rather than strictly defined tables of information. Example: MongoDB.
- Key-value store: The main idea here is using a hash table where there is a unique key and a pointer to a particular item of data. The key-value model is the simplest and easiest to implement. Examples: Redis, Memcached.
- Graph-based store: A graph database, also called a graph-oriented database, is a type of NoSQL database that uses graph theory to store, map and query relationships. Examples: Neo4j, HyperGraphDB.
- Column-based databases: Databases such as Cassandra and HBase are optimized for queries over large datasets and store columns of data together, instead of rows.
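To make the four models above more concrete, here is a minimal sketch that expresses a tiny data set in each style using plain Python structures. No real database is involved, and every name and value here (`key_value`, `document`, `columns`, `edges`, `followers_of`) is made up for illustration.

```python
import json

# Key-value: an opaque value looked up by a unique key.
key_value = {"user:1": json.dumps({"name": "Ann", "city": "Oslo"})}

# Document: a self-describing record that can nest other records and arrays.
document = {"_id": 1, "name": "Ann", "address": {"city": "Oslo"}, "tags": ["admin"]}

# Column-oriented: the values of each column are stored together.
columns = {"name": ["Ann", "Bob"], "city": ["Oslo", "Rome"]}

# Graph: nodes plus explicit relationships (edges) between them.
edges = [("Ann", "FOLLOWS", "Bob")]


def followers_of(person, edge_list):
    """Return everyone with a FOLLOWS edge pointing at `person`."""
    return [src for src, rel, dst in edge_list if rel == "FOLLOWS" and dst == person]
```

Note how the key-value store has to decode the whole value before it can read one field, while the document and column layouts expose the fields directly.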
- Consistency: A read is guaranteed to return the most recent write for a given client.
- Availability: A non-failing node will return a reasonable response within a reasonable amount of time (no error or timeout).
- Partition Tolerance: This means that the system continues to function even if the communication among the servers is unreliable, i.e. the servers may be partitioned into multiple groups that cannot communicate with one another.
- It is impossible to fulfill all three requirements at once. The CAP theorem states that a distributed system can guarantee at most two of the three, so current NoSQL databases choose different combinations of C, A and P. Here is a brief description of the three combinations CA, CP and AP:
- CA: Single site cluster, therefore all nodes are always in contact. When a partition occurs, the system blocks.
- CP: Some data may not be accessible, but the rest is still consistent/accurate.
- AP: System is still available under partitioning, but some of the data returned may be inaccurate.
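The difference between CP and AP can be sketched with a toy model: one value replicated on two nodes, plus a flag that simulates a network partition. This is only an illustration under simplifying assumptions (the class `TinyCluster`, the node names and the `demo` helper are all hypothetical), not how any real database implements either mode.

```python
class TinyCluster:
    """Toy model: one value replicated on nodes "a" and "b".

    `mode` is "CP" or "AP"; the class and its names are hypothetical.
    """

    def __init__(self, mode):
        self.mode = mode
        self.values = {"a": None, "b": None}
        self.partitioned = False

    def write(self, node, value):
        if self.partitioned and self.mode == "CP":
            # CP: refuse the write rather than let the replicas diverge.
            raise RuntimeError("unavailable during partition")
        self.values[node] = value
        if not self.partitioned:
            # Healthy network: replicate the value to every node.
            for other in self.values:
                self.values[other] = value

    def read(self, node):
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable during partition")
        return self.values[node]


def demo(mode):
    """Write, partition the network, write again on "a", then read from "b"."""
    cluster = TinyCluster(mode)
    cluster.write("a", "v1")
    cluster.partitioned = True
    try:
        cluster.write("a", "v2")
        return cluster.read("b")
    except RuntimeError as exc:
        return str(exc)
```

In this sketch `demo("AP")` keeps answering but returns the stale `"v1"` from node `b`, while `demo("CP")` gives up availability and reports the error instead of returning inaccurate data.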
What is MongoDB?
- Scalable, high-performance, open-source, document-oriented database (CP system).
- Rich Document based queries for Easy readability.
- Full Index Support for High Performance.
- Replication and Failover for High Availability.
- Auto Sharding for Easy Scalability.
- Map / Reduce for Aggregation.
- MongoDB stores documents (or objects).
- Nowadays, everyone works with objects (Python/Ruby/Java/etc.).
- And we need databases to persist our objects. Then why not store objects directly?
- Embedded documents and arrays reduce need for joins.
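- No relational-style joins (since MongoDB 3.2, `$lookup` offers a limited left outer join) and, before MongoDB 4.0, no multi-document transactions.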
- Map Reduce Aggregations
- Pipeline Aggregations
- Geo Spatial Queries
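To illustrate why embedded documents and arrays reduce the need for joins, here is a minimal sketch that uses plain Python dictionaries as stand-ins for MongoDB documents. The `posts` data and the helper `comments_per_author` are invented for this example; the helper only mimics the kind of grouping an aggregation would do and is not MongoDB's API.

```python
from collections import defaultdict

# Each post embeds its comments, so reading a post with its comments
# needs no second table and no join.
posts = [
    {"_id": 1, "author": "ann", "title": "Intro",
     "comments": [{"by": "bob", "text": "Nice"}, {"by": "cy", "text": "+1"}]},
    {"_id": 2, "author": "ann", "title": "Sharding", "comments": []},
    {"_id": 3, "author": "bob", "title": "Indexes",
     "comments": [{"by": "ann", "text": "Thanks"}]},
]


def comments_per_author(docs):
    """Group posts by author and sum the number of embedded comments."""
    totals = defaultdict(int)
    for doc in docs:
        totals[doc["author"]] += len(doc["comments"])
    return dict(totals)
```

Calling `comments_per_author(posts)` gives `{"ann": 2, "bob": 1}`. In a relational design the same question would typically need a join between a posts table and a comments table.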
- Replication is the process of synchronizing data across multiple servers.
- A replica set in MongoDB is a group of mongod processes that maintain the same data set.
- With multiple copies of data on different database servers, replication provides a level of fault tolerance against the loss of a single database server.
- In some cases, replication can provide increased read capacity as clients can send read operations to different servers.
- Maintaining copies of data in different data centers can increase data locality and availability for distributed applications.
- You can also maintain additional copies for dedicated purposes, such as disaster recovery, reporting, or backup.
- The primary node receives all write operations.
- The primary records all changes to its data sets in its operation log (the oplog).
- The secondaries replicate the primary’s oplog and apply the operations to their own data sets, so that the secondaries’ data sets reflect the primary’s data set.
- When a primary does not communicate with the other members of the set for more than 10 seconds, an eligible secondary will hold an election to elect itself the new primary.
- The first secondary to hold an election and receive a majority of the members’ votes becomes primary.
- Although non-voting members do not vote in elections, these members hold copies of the replica set’s data and can accept read operations from client applications.
- Because a replica set can have up to 50 members, but only 7 voting members, non-voting members allow a replica set to have more than seven members.
- The purpose of an arbiter is to maintain a quorum in a replica set by responding to heartbeat and election requests by other replica set members.
- If your replica set has an even number of members, add an arbiter so that the number of voting members is odd and an election for primary can reach a clear majority.
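The oplog mechanism and the majority rule described above can be sketched in a few lines. This is a simplified model under stated assumptions: the classes `Member` and `Primary` and the functions `sync` and `votes_needed` are hypothetical names, the oplog only knows a single "set" operation, and real replication is asynchronous and far more involved.

```python
class Member:
    """A replica set member holding a copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.applied = 0  # number of oplog entries already applied


class Primary(Member):
    """The member that receives writes and records them in its oplog."""

    def __init__(self, name):
        super().__init__(name)
        self.oplog = []

    def write(self, key, value):
        self.oplog.append(("set", key, value))  # record the change first
        self.data[key] = value
        self.applied = len(self.oplog)


def sync(secondary, primary):
    """Replay the oplog entries the secondary has not applied yet."""
    for op, key, value in primary.oplog[secondary.applied:]:
        if op == "set":
            secondary.data[key] = value
    secondary.applied = len(primary.oplog)


def votes_needed(voting_members):
    """Smallest number of votes that is a strict majority."""
    return voting_members // 2 + 1
```

With this arithmetic, three voting members need 2 votes and four need 3, so both sets tolerate only one unreachable voter. Adding an arbiter to the four-member set gives five voters that still need 3 votes, which is why an odd number of voting members is preferred.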
As the data grows, a single machine may not be sufficient to store it or to provide
acceptable read and write throughput.
MongoDB uses sharding to support deployments with very large data sets and high-throughput operations.
Sharding, or horizontal scaling, divides the data set and distributes the data over
multiple servers, or shards.
Each shard is an independent database, and collectively, the shards make up a single logical
database.
- Sharding addresses the challenge of scaling to support high throughput and large data sets.
- Sharding reduces the number of operations each shard handles. Each shard processes fewer operations as the cluster grows. As a result, a cluster can increase capacity and throughput horizontally.
- Sharding reduces the amount of data that each server needs to store. Each shard stores less data as the cluster grows.
- Shard Servers
- Shards are responsible for the actual data storage operations. To provide high availability and data consistency, in a production sharded cluster, each shard is a replica set.
- Query Routers
- The query routers are the machines that your application actually connects to. They communicate with the config servers to figure out where the requested data is stored, then access the appropriate shard(s) and return the data.
- Config Servers
- Config servers store the cluster’s metadata. This data contains a mapping of the cluster’s data set to the shards. The query router uses this metadata to target operations to specific shards.
- Config servers for sharded clusters can be deployed as a replica set.
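The interplay of the three components can be sketched as follows: the "config metadata" maps ranges of a shard key to shards, and the "query router" consults it to pick the one shard that holds a document. This is a minimal sketch assuming range-based partitioning on a string key; the boundaries, the shard names and the functions `route`, `insert` and `find` are all hypothetical.

```python
import bisect

# Hypothetical config metadata: keys below "h" live on shardA,
# keys from "h" up to (not including) "p" on shardB, the rest on shardC.
CHUNK_BOUNDARIES = ["h", "p"]
CHUNK_TO_SHARD = ["shardA", "shardB", "shardC"]

# Each shard stores only its own slice of the data.
SHARDS = {name: {} for name in CHUNK_TO_SHARD}


def route(shard_key):
    """Query router: use the metadata to find the shard for a key."""
    index = bisect.bisect_right(CHUNK_BOUNDARIES, shard_key)
    return CHUNK_TO_SHARD[index]


def insert(doc):
    """Store the document on the shard that owns its shard key."""
    SHARDS[route(doc["user"])][doc["user"]] = doc


def find(user):
    """Look the document up on the single shard that can contain it."""
    return SHARDS[route(user)].get(user)


for name in ["alice", "mike", "zed"]:
    insert({"user": name})
```

After the loop, each of the three shards holds exactly one document, and `find("mike")` only touches `shardB`. That targeting is what lets each shard handle fewer operations as the cluster grows.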