# cuckoo-filter

Cuckoo filters are a probabilistic data structure used to answer questions like “Have I already seen this user” or “Is this word in the English language?”. They’re *probabilistic* because each membership operation has a false positive probability. It guarnatees that there will never be a false negative, but may have a low chance of false positives.

Bloom filters are the cannonical probabilistic filter structure, and cuckoo filters are a simlar but different tool. As a bloom filter’s load factor increases, the chance of false positive trends towards 100%, but the inserts will never fail. On the other hand, a Cuckoo filter retains a relatively stable false positive probability under load, but as load approahes 95% inserts will begin to fail. In either case you probably want to resize your filter…

This implementation has the following properties:

- Buckets of 4 elements
- 8 bit fingerprints
- Cycle termination during item kicking occurs after (0.1 * size) buckets have been checked.
- Size may be any non-zero natural number (not limited to powers of 2)

For more details about how Cuckoo filters work, I recommend you read Fan et. al.’s 2016 paper https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf.

### Usage

Cuckoo filters support three operations: `insert`

, `member`

, and `delete`

. See the haddocks for details.

### Performance

As you’ll find in the criterion results, the pure version of the filter can handle ~1.6 million insertions/s. From memory profiles, the vast majority of the memory is taken up by the underlying implementation of `Filter`

, so this is an obvious area for improvement.

The current implementation avoids pre-allocating memory for the filter, so the heap usage will incrase linearly with `insert`

calls. This obviously helps keep heap usage low for sparse filters, but also means inserts are slower than they would be in a mutable implementation.

#### Loading a SpellChecker test

The following test was run on a laptop, so the absolute numbers are going to vary a ton. The important thing is the relationship between the pure & immutable filter implementations.

The test consists of:

- Load the
`/usr/share/dict/words`

file into memory
- Create a filter containing all of the words
- Lookup each word in the filter

Pure

```
500000 cells
235886 words
0.078749ss to count words
0.933969ss to construct filter
745 insert failures
0.80465ss to query every element
```

Mutable

```
500000 cells
235886 words
0.082926ss to count words
0.29735ss to construct filter
582 insert failures
0.52605ss to query every element
```

Incredibly unscientific comparison to `bloom-filter`

using a vanilla filter

```
235886 words
0.087499ss to count words
Bloom { 4194304 bits }
0.464982ss to construct filter
0.506902ss to query every element
```

*** Cuckoo Filters report the number of failures, while the Bloom Filter reports how many bits it contains. I’ll start capturing size for the mutable Cuckoo Filter soon.