# Histogram

Computing a histogram of a bunch of bytes of data means counting the occurrence of each value of each byte - creating a 256 element array containing the number of times each value of a byte is seen in the data.

## Naïve Example

```
void make_histogram_naive(unsigned histogram[256], unsigned char *in, size_t num_in) {
memset(histogram, 0, sizeof(histogram[0]) * 256);
unsigned char *in_end = in + num_in;
while(in < in_end) {
++histogram[*in++];
}
}
```

This version of the code has a speed on my 3970x of 1829MB/s

## Loop Unrolling

```
void make_histogram_unroll(unsigned *histogram, unsigned char *in, size_t num_in) {
memset(histogram, 0, sizeof(histogram[0]) * 256);
unsigned char *in_end = in + num_in;
// Do 4 per iteration
while(in < in_end-3) {
++histogram[*in++];
++histogram[*in++];
++histogram[*in++];
++histogram[*in++];
}
// Do whatever is left-over
while(in < in_end) {
++histogram[*in++];
}
}
```

This version of the code has a speed on my 3970x of 2059MB/s

## Load Once

```
void make_histogram_unroll(unsigned *histogram, unsigned char *in, size_t num_in) {
memset(histogram, 0, sizeof(histogram[0]) * 256);
unsigned char *in_end = in + num_in;
// Do 4 per iteration
while(in < in_end-3) {
unsigned v = *(unsigned*)in;
in += 4;
++histogram[(unsigned char)v];
++histogram[(unsigned char)(v>>8)];
++histogram[(unsigned char)(v>>16)];
++histogram[v>>24];
}
// Do whatever is left-over
while(in < in_end) {
++histogram[*in++];
}
}
```

This version of the code has a speed on my 3970x of 2195MB/s

## Pipelining

This version of the code performs well on Out-of-Order Execution modern CPUs.

Essentially, a modern CPU can do multiple independent things in parallel, so if your code is such that different parts can be executed in parallel, the CPU will generally look ahead and go and do that for you automatically.

The following code has 4 independent histograms which are summed together at the end. This allows the CPU to do 4 sets of histogram computations in parallel with eachother - significantly increasing the performance to 3553MB/s!

```
void make_histogram_pipeline(unsigned *histogram, unsigned char *in, size_t num_in) {
memset(histogram, 0, sizeof(histogram[0]) * 256);
unsigned tmp_histogram[3][256] = {{0}};
unsigned char *in_end = in + num_in;
while(in < in_end-15) {
// This can run in parallel with the others as its independent.
unsigned v = *(unsigned*)(in+0);
++histogram[(unsigned char)v];
++histogram[(unsigned char)(v>>8)];
++histogram[(unsigned char)(v>>16)];
++histogram[v>>24];
// This can run in parallel with the others as its independent.
unsigned v2 = *(unsigned*)(in+4);
++tmp_histogram[0][(unsigned char)v2];
++tmp_histogram[0][(unsigned char)(v2>>8)];
++tmp_histogram[0][(unsigned char)(v2>>16)];
++tmp_histogram[0][v2>>24];
// This can run in parallel with the others as its independent.
unsigned v3 = *(unsigned*)(in+8);
++tmp_histogram[1][(unsigned char)v3];
++tmp_histogram[1][(unsigned char)(v3>>8)];
++tmp_histogram[1][(unsigned char)(v3>>16)];
++tmp_histogram[1][v3>>24];
// This can run in parallel with the others as its independent.
unsigned v4 = *(unsigned*)(in+12);
++tmp_histogram[2][(unsigned char)v4];
++tmp_histogram[2][(unsigned char)(v4>>8)];
++tmp_histogram[2][(unsigned char)(v4>>16)];
++tmp_histogram[2][v4>>24];
in+=16;
}
while(in < in_end) {
++histogram[*in++];
}
for(int i = 0; i < 256; ++i) {
histogram[i] += tmp_histogram[0][i] + tmp_histogram[1][i] + tmp_histogram[2][i];
}
}
```

## SIMD

Not a lot can be SIMD'ified here with SSE2. The loop at the end though which sums the histograms is a prime candidate.

Performance of this version is 3765MB/s -- though that depends on your chunk size!

```
void make_histogram_pipeline_simd(unsigned *histogram, unsigned char *in, size_t num_in) {
// on some achitectures, unaligned data access is slower than aligned data access.
// so lets align these temporary buffers...
unsigned __declspec(align(16)) tmp_histogram[4][256] = {{0}};
unsigned char *in_end = in + num_in;
while(in < in_end-15) {
unsigned v = *(unsigned*)(in+0);
++tmp_histogram[0][(unsigned char)v];
++tmp_histogram[0][(unsigned char)(v>>8)];
++tmp_histogram[0][(unsigned char)(v>>16)];
++tmp_histogram[0][v>>24];
unsigned v2 = *(unsigned*)(in+4);
++tmp_histogram[1][(unsigned char)v2];
++tmp_histogram[1][(unsigned char)(v2>>8)];
++tmp_histogram[1][(unsigned char)(v2>>16)];
++tmp_histogram[1][v2>>24];
unsigned v3 = *(unsigned*)(in+8);
++tmp_histogram[2][(unsigned char)v3];
++tmp_histogram[2][(unsigned char)(v3>>8)];
++tmp_histogram[2][(unsigned char)(v3>>16)];
++tmp_histogram[2][v3>>24];
unsigned v4 = *(unsigned*)(in+12);
++tmp_histogram[3][(unsigned char)v4];
++tmp_histogram[3][(unsigned char)(v4>>8)];
++tmp_histogram[3][(unsigned char)(v4>>16)];
++tmp_histogram[3][v4>>24];
in+=16;
}
while(in < in_end) {
++histogram[*in++];
}
for(int i = 0; i < 256; i+=4) {
__m128i total = _mm_load_si128((__m128i*)&tmp_histogram[0][i]);
__m128i total2 = _mm_load_si128((__m128i*)&tmp_histogram[2][i]);
total = _mm_add_epi32(total, _mm_load_si128((__m128i*)&tmp_histogram[1][i]));
total2 = _mm_add_epi32(total2, _mm_load_si128((__m128i*)&tmp_histogram[3][i]));
_mm_storeu_si128((__m128i *)&histogram[i], _mm_add_epi32(total,total2));
}
}
```