# A hands-on demo of analyzing big data with Spark

Written by Matt Sosna on June 16, 2021

#### 2. The analytics framework for big data

I had to go down quite a rabbit hole to understand what hardware, exactly, Spark allocates tasks to. Your computer has several physical CPU cores with independent processing power. This is how you can type in a Word doc and flip through photos while YouTube is playing, for example.

We can take it a step further, though: a physical CPU core can handle multiple tasks at once by hyper-threading between two or more “logical” cores. Logical cores act as independent cores each handling their own tasks, but they’re really just the same physical core. The trick is that the physical core can switch between logical cores incredibly quickly, taking advantage of task downtime (e.g. waiting for YouTube to send back data after you enter a search term) to squeeze in more computations.
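If you want to compare the two counts on your own machine, the third-party `psutil` package can report both; a minimal sketch (the numbers in the comments are just examples and will vary by machine):

```python
import psutil

# Physical cores: independent processing units on the chip
print(psutil.cpu_count(logical=False))  # e.g. 4

# Logical cores: what the OS (and Spark) sees after hyper-threading
print(psutil.cpu_count(logical=True))   # e.g. 8
```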

When running on your local machine, Spark allocates tasks to all logical cores on your computer unless you specify otherwise. By default, Spark sets aside 512 MB of memory for each core and partitions the data equally across them. You can check the number of logical cores Spark sees with `sc.defaultParallelism`.
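As a quick sanity check, here's roughly how you could inspect those defaults in PySpark, assuming a local `SparkContext` (the counts will depend on your machine):

```python
from pyspark import SparkContext

# Reuse an existing context or create one on all local cores
sc = SparkContext.getOrCreate()

# Default number of partitions = number of logical cores (locally)
print(sc.defaultParallelism)   # e.g. 8

# An RDD created without an explicit partition count inherits it
rdd = sc.parallelize(range(1000))
print(rdd.getNumPartitions())  # e.g. 8
```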

#### 3. Counting letter frequencies in a novel

In earlier drafts of this post, I toyed around with generating the text for a “novel” myself. I looked at some lorem ipsum Python packages, but they were a little inconsistent; I found the very funny Bacon Ipsum API but didn’t want to drown it with a request for thousands of paragraphs. The code below instead uses random strings to generate a “novel” 100,000 paragraphs long, or 8.9x War and Peace’s measly 11,186 paragraphs. Turns out writing a novel is way easier than I thought!

```python
import numpy as np
from string import ascii_lowercase

# Create set of characters to sample from
ALPHABET = list(ascii_lowercase) + [' ', ', ', '. ', '! ']

# Set parameters
N_PARAGRAPHS = 100000
MIN_PAR_LEN = 100
MAX_PAR_LEN = 2000

# Set random state
np.random.seed(42)

# Generate novel
novel = []
for _ in range(N_PARAGRAPHS):

    # Generate a paragraph of random length
    n_char = np.random.randint(MIN_PAR_LEN, MAX_PAR_LEN)
    paragraph = np.random.choice(ALPHABET, n_char)

    novel.append(''.join(paragraph))

# Visualize first "paragraph"
print(novel[0])  # t. okh. ugzswkkxudhxcvubxl! fb, ualzv....
```
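With the novel generated, the counting step itself might look something like the sketch below. This assumes a `SparkContext` named `sc` like in the earlier snippet; the pipeline is the standard RDD word-count pattern adapted to single characters.

```python
from operator import add

# Distribute the paragraphs across the logical cores
novel_rdd = sc.parallelize(novel)

letter_counts = (
    novel_rdd
    .flatMap(list)                  # one element per character
    .filter(lambda c: c.isalpha())  # drop spaces and punctuation
    .map(lambda c: (c, 1))          # pair each letter with a 1
    .reduceByKey(add)               # sum the 1s per letter
    .collect()
)

print(sorted(letter_counts)[:3])  # [('a', ...), ('b', ...), ('c', ...)]
```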

#### 4. Reducing RDDs

You can easily reduce RDDs with more complex functions, or ones you’ve defined ahead of time. But for much big data processing, the operations are usually pretty simple (adding elements together, filtering by some threshold), so it’s a little overkill to define these functions explicitly outside the one or two times you use them. That’s where inline lambda functions come in handy, as sketched below.
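A minimal sketch of that inline style, again assuming a `SparkContext` named `sc` (the numbers are just illustrative):

```python
rdd = sc.parallelize([1, 5, 3, 8, 2])

# Sum all elements without defining a named function
total = rdd.reduce(lambda x, y: x + y)
print(total)  # 19

# Keep only elements above a threshold
print(rdd.filter(lambda x: x > 2).collect())  # [5, 3, 8]
```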
