How to Find Top-K Trending Hashtags from a Stream of Tweets Using Count-Min Sketch


When working with massive streams of data like hashtags from tweets, storing and processing every individual hashtag becomes impractical due to high memory usage and real-time constraints.

👉 The goal:

Find the top-K trending hashtags in real time, from a continuous stream of tweets.




✅ Problem Statement

A large-scale tweet stream sends thousands of hashtags per second.

Your task: Continuously identify the most popular hashtags trending right now.



✅ Challenges:

  • Huge volume of hashtags makes exact counting infeasible.
  • Limited memory and need for fast processing.
  • Real-time approximate results with good accuracy.



🧐 Two Practical Approaches



1️⃣ Multiple CMS Time Window Approach



✔️ What It Solves:

Enables answering questions like:

  • “What are the top trending hashtags in the last 15 minutes?”
  • “What are the top trending hashtags in the last 30 minutes?”
  • “What are the top trending hashtags in the last 1 hour?”



✔️ How It Works:

  • Maintain multiple CMS instances, one per fixed time window (e.g., one CMS per minute).
  • Keep only the latest N CMS instances corresponding to the desired time window (sliding window).
  • At query time, sum counts across the relevant CMS instances.



✅ Java Implementation Example:

public class SlidingWindowCMS {
    private final int windowSize;
    private final LinkedList cmsList;

    public SlidingWindowCMS(int windowSize, int rows, int cols) {
        this.windowSize = windowSize;
        this.cmsList = new LinkedList<>();
    }

    public void addNewMinuteCMS(CountMinSketch newCms) {
        if (cmsList.size() >= windowSize) {
            cmsList.removeFirst();
        }
        cmsList.addLast(newCms);
    }

    public int query(String key) {
        int totalCount = 0;
        for (CountMinSketch cms : cmsList) {
            totalCount += cms.count(key);
        }
        return totalCount;
    }
}

Enter fullscreen mode

Exit fullscreen mode




2️⃣ Decaying Count Approach



✔️ What It Solves:

Answers questions such as:

  • “What are the currently trending hashtags, giving more weight to recent data?”



✔️ How It Works:

  • Use a single CMS instance.
  • Periodically apply a decay factor to all counters (e.g., multiply by 0.99 every minute).
  • Recent hashtags remain significant while older counts fade automatically.
  • Eliminates the need for storing multiple CMS instances.



✅ Java Implementation Example:

public class DecayingCMS {
    private final CountMinSketch cms;
    private final double decayFactor;

    public DecayingCMS(int rows, int cols, double decayFactor) {
        this.cms = new CountMinSketch(rows, cols);
        this.decayFactor = decayFactor;
    }

    public void add(String key) {
        cms.add(key);
    }

    public int query(String key) {
        return cms.count(key);
    }

    public void applyDecay() {
        cms.applyDecay(decayFactor);
    }
}

Enter fullscreen mode

Exit fullscreen mode




✅ Count-Min Sketch Implementation Example:

public class CountMinSketch {
    private final int[][] table;
    private final int[] seeds;
    private final int rows;
    private final int cols;
    private final Random rand = new Random();

    public CountMinSketch(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.table = new int[rows][cols];
        this.seeds = new int[rows];
        for (int i = 0; i < rows; i++) seeds[i] = rand.nextInt();
    }

    public void add(String key) {
        for (int i = 0; i < rows; i++) {
            int index = Math.abs(hash(key, seeds[i]) % cols);
            table[i][index]++;
        }
    }

    public int count(String key) {
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < rows; i++) {
            int index = Math.abs(hash(key, seeds[i]) % cols);
            min = Math.min(min, table[i][index]);
        }
        return min;
    }

    public void applyDecay(double factor) {
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                table[i][j] *= factor;
    }

    private int hash(String key, int seed) {
        int hash = 0;
        for (char c : key.toCharArray()) {
            hash = hash * seed + c;
        }
        return hash;
    }
}

Enter fullscreen mode

Exit fullscreen mode




✅ Conclusion

When processing streams of hashtags in real time:

  • ✅ Use Multiple CMS Time Windows if you need exact top-K in fixed time windows (e.g., last 15 min, 1 hour).
  • ✅ Use Decaying Counts CMS for smooth, approximate real-time trending insights without separate windows.

Both approaches solve real-world problems depending on your requirements.

👉 For a deeper overview of how Count-Min Sketch works in general, check out our detailed CMS overview 👉 Count-Min Sketch Overview.



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *