Of Coward Flaws & Power Laws

import pandas as pd
import numpy as np
from tqdm import tqdm
import os
import sys
sys.path.append("../python")
import general
import visualizations
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
params = {'OUTPUT' : {'path' : os.path.join('output_html', 'power_law'),
                      'name' : 'power_law_20200917'},
          
          # Simulated heights of American women, in inches
          'GAUSSIAN': {'mean' : 65, 
                       'stdev' : 3.5,
                       'n' : 10000},
          
          # Simulating batting averages (baseball)
          'BINOMIAL' : {'at_bats_per_game' : 4,
                        'batting_average' : 0.3,
                        'number_of_games' : 100000},
          
          # Simulated value of a founded company
          'POWERLAW' : {'shape' : 0.8,
                        'n' : 50000000,
                        'iteration_size' : 100000} # Recompute the running mean and median after every 100,000 samples
         }

Statistics, in a nutshell, is a tool for comprehending
The known and the uncertain, via samples never-ending
Ideally, with sufficient size, we soon achieve convergence,
From which, debates are settled via wisdom’s swift emergence

random_sample = np.random.normal(params['GAUSSIAN']['mean'], params['GAUSSIAN']['stdev'], params['GAUSSIAN']['n'])
visualizations.gaussianHistogram(random_sample)
gaussian_means = []
for i in tqdm(range(1,params['GAUSSIAN']['n']+1)):
    gaussian_means.append(np.mean(random_sample[:i]))
100%|██████████| 10000/10000 [00:00<00:00, 35367.48it/s]
visualizations.gaussianConvergenceLine(gaussian_means, params['GAUSSIAN']['mean'])
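
The running means plotted above can also be computed in a single pass. The following is a minimal sketch, not the notebook's own method: `running_mean` is a hypothetical helper, and the seed is chosen only for reproducibility. It relies on `np.cumsum` so that each prefix mean comes from one cumulative sum rather than a fresh `np.mean` over every slice.

```python
import numpy as np

def running_mean(samples):
    """Return the mean of samples[:i] for every i from 1 to len(samples)."""
    samples = np.asarray(samples, dtype=float)
    counts = np.arange(1, samples.size + 1)  # 1, 2, ..., n
    return np.cumsum(samples) / counts

# Assumed inputs: the same mean, stdev, and n as params['GAUSSIAN'].
rng = np.random.default_rng(0)
heights = rng.normal(65, 3.5, 10_000)

means = running_mean(heights)
print(means[-1])  # mean of the full sample; it should sit close to 65
```

The result has one entry per sample, so it can stand in for a list built by an explicit loop. The same helper applies to the binomial draws, after dividing by the number of at-bats per game.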

Canonical examples, which a lecturer deploys
Discuss the dull dimensions of two samples - girls and boys
And thus cliches perpetuate and students doze from boredom
“Deliver better content!” thus the teacher’s class implored him…

random_sample = np.random.binomial(params['BINOMIAL']['at_bats_per_game'], params['BINOMIAL']['batting_average'], 
                                   params['BINOMIAL']['number_of_games'])
visualizations.binomialHistogram(random_sample)
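binomial_means = []
for i in tqdm(range(1,params['BINOMIAL']['number_of_games']+1)):
    # Mean hits per game over the first i games, divided by at-bats per game, gives the running batting average
    binomial_means.append(np.mean(random_sample[:i])/(params['BINOMIAL']['at_bats_per_game']))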
100%|██████████| 100000/100000 [00:11<00:00, 9048.29it/s]

And though, for some, athletic stats are somewhat more compelling,
The fallacies of gamesmen oversimplify foretelling.
Because, alas, such cases of predictable behavior,
Mislead the intuition that large samples serve as savior.

visualizations.binomialConvergenceLine(binomial_means, params['BINOMIAL']['batting_average'])
random_sample = np.random.pareto(params['POWERLAW']['shape'], params['POWERLAW']['n'])
visualizations.powerLawHistogram(random_sample)
power_law_means, power_law_medians, n_samples = [], [], []
step = params['POWERLAW']['iteration_size']
for i in tqdm(range(1, params['POWERLAW']['n'] // step + 1)):
    prefix = random_sample[:i * step]  # first i * step draws
    power_law_means.append(np.mean(prefix))
    power_law_medians.append(np.median(prefix))
    n_samples.append(i * step)
100%|██████████| 500/500 [03:44<00:00,  2.23it/s]

For losses may be well-defined,
and gains may be obscene.
A model may hold miracles,
and still may lack a mean!

visualizations.powerLawMedianLine(power_law_medians, n_samples)

The median may still exist, remarkably consistent,
But mean and even variance may still be non-existent!
For though most efforts end in loss as torturous examples,
We find expected values truly rise with larger samples!
And even after sample sizes climb into the millions
Results defy the certainty so prized by mere civilians
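
The claim that the median exists while the mean does not can be checked against a closed form. This is a minimal sketch under one stated fact: NumPy's `pareto` draws from the Lomax (Pareto II) distribution, whose survival function is (1 + x)^(-a). Solving (1 + m)^(-a) = 1/2 gives the median m = 2^(1/a) - 1, while the mean is finite only for a > 1 and the variance only for a > 2. With the shape of 0.8 used here, neither exists. The sample size and seed below are illustrative choices, not the notebook's.

```python
import numpy as np

shape = 0.8  # same value as params['POWERLAW']['shape']

# Illustrative sample size and seed (assumptions, smaller than the notebook's n).
rng = np.random.default_rng(0)
sample = rng.pareto(shape, 200_000)

# Closed-form median of the Lomax distribution with this shape.
theoretical_median = 2 ** (1 / shape) - 1
sample_median = np.median(sample)

print("theoretical median:", theoretical_median)
print("sample median:     ", sample_median)
print("sample mean:       ", sample.mean())  # no finite value to converge to when shape <= 1
```

The two medians should agree closely, whereas the sample mean is dominated by the largest draws and changes markedly with the seed or the sample size.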

visualizations.powerLawConvergenceLine(power_law_means, n_samples)

So which world is inhabited when forced to make decisions?
Convergence and simplicity or magnitude revisions?
For wisdom is available if first one ascertains
If losses fit a power law…or if the tail yields gains

os.makedirs(params['OUTPUT']['path'], exist_ok=True)  # create the output directory if it is missing
general.publish('powerLaw', params['OUTPUT']['path'], params['OUTPUT']['name'])