Data Streams

Stream Data Model
- Any number of streams can enter the system - Each stream can provide elements at its own schedule. - No requirement of having same data rates or data types. - Streams can be huge in size, and thus may be archived in a large archival storage
- Working store -> summaries of parts of streams are placed. Queries can be run here

Stream Sources
- Sensor Data - Image Data - Internet and Web Traffic

Stream Queries
- Standing queries are permanently executing queries and produce outputs at appropriate times
- Example, to output an alert whenever the temperature exceeds 25\(\degree\) centigrade.
- Ad-hoc queries is a question asked once about the current state of a stream(s)
- common approach is to store a sliding window of each stream in the working store

Issues in Stream Processing
- Given enough memory, many problems in streaming data would be very easy to solve
- It is much more efficient to get an approximate answer to our problem than exact solution
- Hashing techniques can be used to produce approximate answer

Sampling Data in Stream
¶

selecting a subset of a stream so that we can ask queries about the selected subset and have the answers be statistically representative of the stream as a whole.
We assume the stream consists of tuples (user, query, time) – of which only user is in the key.

Filtering Streams
- Another common process on streams is selection, or filtering. -
Bloom Filtering
- It is a way to check whether the element \(\omega\) is present in the set \(S\).
- We cannot remove an item from the bloom filter, it never forgets
- Uses the main memory as a bit array.

Bloom filter consists of:
1. An array of n bits, initially all set to 0
2. Collection of hash functions \(h_1, h_2,...h_k\), each function maps 'key' values to n buckets.
3. set S of m key values.
Purpose -> Allow through all stream elements whose keys are in \(S\), rejecting most of the stream elements whose keys are not in \(S\).
How to check
- Begin with all bits set to 0
- Take each key value in \(S\) and hash it using each of the \(k\) hash functions, update each bit at the \(h_i(K_j)\) to 1. ... \(h_i(Key_j)\) -> return value from hash function
- To test a key arriving in stream, check that all of \(h_1(K),h_2(K),..h_k(K)\) are 1s in the bit array.
- If all are 1's then it is a probable yes. If even a single bit turns out 0 then it's a definite NO.
To calculate probability of false positive
- Example -> Throwing darts | x targets | y darts
- \(P\)(given dart will not hit a given target) = \((x-1)/x\)
- \(P\)(not of the \(y\) darts hit a given target) = \(((x-1)/x)^y\)
  - The above equation can be written as \(\((1-1/x)^{x(\frac{y}{x})}\)\)
  - approximation -> \((1 - \epsilon)^{1/\epsilon} \ = \ 1/e\)
  - Thus, we conclude, P(no darts hit a given target) = \(e^{-y/x}\)
  - and, P(a given target is hit) = 1 - \(e^{-y/x}\)
- False Positive rate = (\((1-e^{-km/n})^k\)\)Where
  - k -> number of hash functions
  - m-> number of set members
  - n -> total number of bits

Data Streams

Sampling Data in Stream¶

Sampling Data in Stream
¶