When working with massive streams of data like hashtags from tweets, storing and processing every individual hashtag becomes impractical due to high memory usage and real-time constraints.
👉 The goal:
Find the top-K trending hashtags in real time, from a continuous stream of tweets.
✅ Problem Statement
A large-scale tweet stream sends thousands of hashtags per second.
Your task: Continuously identify the most popular hashtags trending right now.
✅ Challenges:
- Huge volume of hashtags makes exact counting infeasible.
- Limited memory and need for fast processing.
- Real-time approximate results with good accuracy.
🧐 Two Practical Approaches
1️⃣ Multiple CMS Time Window Approach
✔️ What It Solves:
Enables answering questions like:
- “What are the top trending hashtags in the last 15 minutes?”
- “What are the top trending hashtags in the last 30 minutes?”
- “What are the top trending hashtags in the last 1 hour?”
✔️ How It Works:
- Maintain multiple CMS instances, one per fixed time window (e.g., one CMS per minute).
- Keep only the latest N CMS instances corresponding to the desired time window (sliding window).
- At query time, sum counts across the relevant CMS instances.
✅ Java Implementation Example:
public class SlidingWindowCMS {
private final int windowSize;
private final LinkedList cmsList;
public SlidingWindowCMS(int windowSize, int rows, int cols) {
this.windowSize = windowSize;
this.cmsList = new LinkedList<>();
}
public void addNewMinuteCMS(CountMinSketch newCms) {
if (cmsList.size() >= windowSize) {
cmsList.removeFirst();
}
cmsList.addLast(newCms);
}
public int query(String key) {
int totalCount = 0;
for (CountMinSketch cms : cmsList) {
totalCount += cms.count(key);
}
return totalCount;
}
}
2️⃣ Decaying Count Approach
✔️ What It Solves:
Answers questions such as:
- “What are the currently trending hashtags, giving more weight to recent data?”
✔️ How It Works:
- Use a single CMS instance.
- Periodically apply a decay factor to all counters (e.g., multiply by 0.99 every minute).
- Recent hashtags remain significant while older counts fade automatically.
- Eliminates the need for storing multiple CMS instances.
✅ Java Implementation Example:
public class DecayingCMS {
private final CountMinSketch cms;
private final double decayFactor;
public DecayingCMS(int rows, int cols, double decayFactor) {
this.cms = new CountMinSketch(rows, cols);
this.decayFactor = decayFactor;
}
public void add(String key) {
cms.add(key);
}
public int query(String key) {
return cms.count(key);
}
public void applyDecay() {
cms.applyDecay(decayFactor);
}
}
✅ Count-Min Sketch Implementation Example:
public class CountMinSketch {
private final int[][] table;
private final int[] seeds;
private final int rows;
private final int cols;
private final Random rand = new Random();
public CountMinSketch(int rows, int cols) {
this.rows = rows;
this.cols = cols;
this.table = new int[rows][cols];
this.seeds = new int[rows];
for (int i = 0; i < rows; i++) seeds[i] = rand.nextInt();
}
public void add(String key) {
for (int i = 0; i < rows; i++) {
int index = Math.abs(hash(key, seeds[i]) % cols);
table[i][index]++;
}
}
public int count(String key) {
int min = Integer.MAX_VALUE;
for (int i = 0; i < rows; i++) {
int index = Math.abs(hash(key, seeds[i]) % cols);
min = Math.min(min, table[i][index]);
}
return min;
}
public void applyDecay(double factor) {
for (int i = 0; i < rows; i++)
for (int j = 0; j < cols; j++)
table[i][j] *= factor;
}
private int hash(String key, int seed) {
int hash = 0;
for (char c : key.toCharArray()) {
hash = hash * seed + c;
}
return hash;
}
}
✅ Conclusion
When processing streams of hashtags in real time:
- ✅ Use Multiple CMS Time Windows if you need exact top-K in fixed time windows (e.g., last 15 min, 1 hour).
- ✅ Use Decaying Counts CMS for smooth, approximate real-time trending insights without separate windows.
Both approaches solve real-world problems depending on your requirements.
👉 For a deeper overview of how Count-Min Sketch works in general, check out our detailed CMS overview 👉 Count-Min Sketch Overview.