Gavin Cheng
辛眉筆記
COMPSCI 753 Notes

Course outline sorted out by myself for quick revision
University of Auckland COMPSCI 753 - Algorithms for Massive Data


Data Stream


Find similar items

  • Jaccard Similarity
  • Shingling -> MinHash -> LSH
  • MinHash
    • Random permutations
    • One-pass MinHash

Sampling data stream

  • Fixed proportion
  • Fixed size
    • Reservoir Sampling

Find frequent items

  • [Deterministic] Misra-Gries
    • (m-m')/(k+1) decrement steps at most
  • [Randomized] CountMin Sketch

Filtering data stream

  • Bloom Filter
    • 1 hash function
    • k hash functions

Locality Sensitive Search (LSH)

  • (r, c, p1, p2)-sensitive
  • Jaccard similarity/distance
    • MinHash
  • Cosine similarity/distance
    • SimHash

Graph


Link Analysis

  • TF.IDF
  • Term Spam
  • PageRank
    • Dead ends
      • Recursively remove
      • Taxation
  • Biased PageRank
  • Link Spam (Spam Farm)
    • Trust Rank
    • Spam Mass

Social Network Analysis

  • Small world property
    • Power law degree distribution
  • Core-Periphery structure
  • Strength of ties
    • Triadic closure
  • Clustering Coefficient
    • Triangle enumeration
  • Community Detection
    • K-core decomposition
  • Influence Maximization
    • Greedy-based
    • Sketch-based

Join Analysis

  • Natural join
  • Semijoin
  • Multiway join
  • Acyclic join
    • Yannakakis Algorithm