BILIBILI Like System Architecture


This post was translated by an LLM.

System Pressure

Traffic Pressure

Global Traffic Pressure
  • Write traffic
    • Aggregate writes. For example, batch like counts within a 10-second window and flush them in one shot to reduce I/O.
    • Make DB writes asynchronous through MQ or similar infrastructure.
  • Before applying any like-state update, read the stored state first: an unlike is valid only if a like already exists, and a like is valid only if one does not.
Per-Item Traffic Pressure
  • Viral content events
    • Hotspots can appear in both the DB and cache. The system needs a hot-key detection mechanism so hot data can be cached locally with a reasonable TTL.

Storage Pressure

  • Store data in a KV-oriented form.

Disaster Recovery Pressure

  • DB outage
  • Redis cluster instability
  • Datacenter outage
  • Network failure

System Architecture


Three-Tier Storage Layer

DB - TiDB

  • TiDB is distributed, so there is no need for manual sharding.
  • Like record table
  • Like count table
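As an illustration of what the two tables might hold, here is a hypothetical sketch; the field names are guesses, not the real schema:

```python
from dataclasses import dataclass

@dataclass
class LikeRecord:
    """One row in the like record table (illustrative fields)."""
    mid: int          # user ID
    business_id: int  # which product surface (video, comment, ...)
    message_id: int   # the liked object
    like_time: int    # unix timestamp of the like
    state: int        # 1 = liked, 0 = unliked (soft delete)

@dataclass
class LikeCount:
    """One row in the like count table (illustrative fields)."""
    business_id: int
    message_id: int
    like_count: int
```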

Cache

  • Use the cache-aside pattern.
  • key = user:likes:pattern:{mid}:{business_id}; each zset member is a message ID (messageID), scored by the like timestamp (likeTimestamp).
  • Use a zset directly: cap its length and evict the earliest-liked messages once the cap is exceeded.
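An in-memory model of this capped zset, as a sketch; against Redis itself this would be a ZADD followed by ZREMRANGEBYRANK key 0 -(maxlen+1), which keeps only the maxlen highest-scored members:

```python
class CappedZSet:
    """Pure-Python stand-in for the capped Redis sorted set."""

    def __init__(self, maxlen):
        self.maxlen = maxlen
        self.scores = {}  # member (message ID) -> score (like timestamp)

    def add(self, member, score):
        self.scores[member] = score
        overflow = len(self.scores) - self.maxlen
        if overflow > 0:
            # evict the earliest-liked messages (lowest timestamps)
            for m in sorted(self.scores, key=self.scores.get)[:overflow]:
                del self.scores[m]

    def members(self):
        """Members ordered from earliest to latest like."""
        return sorted(self.scores, key=self.scores.get)
```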

Local Cache

  • Handle cache hotspots.
  • Use a min-heap algorithm to count the most frequently accessed cache keys within a configurable time window, then keep hot keys and values in local memory with a business-acceptable TTL.
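The counting step might look like the following sketch. It covers only the size-k min-heap selection; the real system would also maintain the sliding window and promote/demote keys over time:

```python
import heapq
from collections import Counter

def top_hot_keys(access_log, k):
    """Return the k most-accessed keys from one window of accesses,
    using a min-heap capped at k entries."""
    counts = Counter(access_log)
    heap = []  # min-heap of (count, key); root is the weakest candidate
    for key, count in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, key))
        elif count > heap[0][0]:
            # this key is hotter than the weakest kept key: replace it
            heapq.heapreplace(heap, (count, key))
    return {key for _, key in heap}
```

Keys returned here would then be cached in local memory with a business-acceptable TTL.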

Data Migration and Archiving

  • Migrate data from TiDB to a KV database (Taishan) to reduce cost.

Like Service Layer

Storage Disaster Recovery (DB, Redis)

  • Two datacenters serve as disaster-recovery peers.
    • Datacenter A handles all writes plus part of the reads.
    • Datacenter B handles part of the reads.
  • If the DB fails, switch read/write traffic to the standby datacenter through a db-proxy sidecar.
  • Cross-datacenter cache consistency is maintained by asynchronously consuming TiDB binlogs. When needed, traffic can be switched between datacenters without causing a large amount of cold data to fall back to the database. Cold data here means data missing from Redis and therefore requiring a DB lookup.

Service Disaster Recovery

  • Multiple storage layers back each other up.
  • Redis -> KV -> DB
  • Retry failed like operations at each layer until they succeed.
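A sketch of the fallback read path, with each storage layer represented as a hypothetical get-callable tried in order:

```python
def tiered_get(key, layers):
    """Try each storage layer in order (e.g. Redis -> KV -> DB).
    A layer that raises is treated as unavailable; a layer that
    returns None is treated as a miss. Either way, fall through."""
    for get in layers:
        try:
            value = get(key)
            if value is not None:
                return value
        except Exception:
            continue  # layer down, try the next one
    return None
```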

Async Task Layer

  • Write like data, refresh cache, and send like and like-count messages to downstream services such as the recommendation system.
  • Disaster recovery for binlog interruption
    • Monitor the binlog stream first.
    • Let the business service emit standby messages ahead of time. When the job detects a binlog anomaly, it automatically switches to the fallback consumer stream.
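The switch-over logic can be sketched as a watchdog; class and field names here are illustrative, and a real detector would also watch lag metrics rather than only event gaps:

```python
import time

class BinlogWatchdog:
    """Switches the job from the binlog stream to the fallback
    (business-emitted standby) stream when no binlog event has
    arrived within `timeout_seconds`."""

    def __init__(self, timeout_seconds=30):
        self.timeout = timeout_seconds
        self.last_event = time.monotonic()
        self.source = "binlog"

    def on_binlog_event(self):
        self.last_event = time.monotonic()

    def check(self, now=None):
        """Return which stream the job should currently consume."""
        now = time.monotonic() if now is None else now
        if self.source == "binlog" and now - self.last_event > self.timeout:
            self.source = "fallback"  # binlog looks stalled: switch over
        return self.source
```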

TODO:

  • Cache strategy, plus updates for my own like and favorite operations.
