SciPy
SciPy is a package that provides a collection of mathematical algorithms and functions built on top of NumPy. In this lesson we will focus on statistical computations.
As always, first things first: let's import the packages we are going to use:
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
This module contains a large number of probability distributions, both continuous and discrete, as well as a growing number of statistical functions.
# Import the whole module
import scipy.stats as st
# Documentation
np.info(st)
(output abridged) The `scipy.stats` docstring describes the module: each continuous distribution is an instance of `rv_continuous` (with methods such as `pdf`, `logpdf`, `cdf`, `ppf`, `sf`, `moment`, `stats`, `fit`, `expect`, ...) and each discrete one of `rv_discrete` (`pmf`, `cdf`, `ppf`, `rvs`, ...); calling an instance as a function returns a frozen distribution whose shape, location, and scale parameters are fixed. It then lists some 80 continuous distributions (`alpha`, `beta`, `cauchy`, `chi2`, `expon`, `gamma`, `lognorm`, `norm`, `t`, `uniform`, `weibull_min`, ...), the multivariate and discrete ones (`multivariate_normal`; `bernoulli`, `binom`, `geom`, `hypergeom`, `poisson`, `zipf`, ...), and the statistical functions: descriptive statistics (`describe`, `gmean`, `hmean`, `kurtosis`, `skew`, ...), binned statistics, correlation and regression (`pearsonr`, `spearmanr`, `kendalltau`, `linregress`, ...), hypothesis tests (`ttest_1samp`, `ttest_ind`, `kstest`, `chisquare`, `normaltest`, `shapiro`, `mannwhitneyu`, ...), contingency-table functions, and kernel density estimation (`gaussian_kde`).
Let's load some data, for example some course grades, and see how we can take advantage of the functions in `scipy.stats`.
# this line does not work on Windows
!head ../data/notas.csv
# Read the file
datos = np.loadtxt("../data/notas.csv", skiprows=1)
datos
array([ 2.9, 4.3, 3.9, 0. , 4.1, 7.3, 2.3, 5.6, 2.9, 3.9, 4.6, 6.3, 2.1, 2.1, 6.5, 1.9, 0. , 6.5, 2.5, 5.1, 5.3, 6.3, 5.4, 5.3, 5.3, 2. , 3.5, 4.4, 5.5, 3.6, 3.9, 2.5, 4.1, 3. , 4.6, 4. , 6.3, 0.6, 2.4, 6.5, 2.3, 4.6, 6.9, 5.1, 5.4, 5.3, 4.5, 6.5, 2.1, 5.5, 3.4, 8.1, 4. , 1.9, 1.6, 4.3, 4.6, 5.4, 1. , 6.5, 5.5, 4.9, 4. , 5.3, 3.5, 4.4, 2.8, 5.4, 3.5, 2.3, 4.8, 2.1, 6.6, 0.5, 2.1, 3.1, 3.4, 5.9, 3.4, 4.3, 1.5, 5.5, 4.4, 1.9, 4.4, 2.9, 3.9, 5.8, 2.8, 3. , 1.5, 2.6, 2.9, 3.4, 5.4, 3.6, 4.6, 5. , 1.4, 4.3, 4.6, 3.1, 2. , 3.6, 4. , 2.5, 3. , 5.1, 6.4, 3.5, 5.8, 4.1, 5.9, 4. , 6.4, 2.3, 7. , 1.4, 3.5, 4.4, 2.9, 5.1, 3.4, 4.8, 4.6, 4.3, 6.9, 5.4, 4. , 3.3, 1.4, 1.9, 3.8, 3.4, 3.6, 3.8, 6.3, 4.8, 4. , 6.8, 4. , 3.6, 4.4, 4.1, 6. , 4.1, 5.6, 3.9, 4.6, 5. , 6.5, 3.5, 5.5, 4.6, 4.8, 4.6, 6.5, 4.1, 4.4, 5.3, 3.6, 7.1, 4.6, 2.1, 3.3, 3.9, 4. , 4.4, 0.9, 4.3, 2.4, 2.9, 3.6, 1.4, 2.8, 2.5, 6.6, 0. , 5.1, 0. , 5.1, 4. , 2.6, 4.1, 4.6, 3.1, 4.4, 2.8, 2.8, 5.6, 3.9, 4.4, 4.1, 0.3, 2.4, 3.3, 2.5, 4.3, 2.5, 4.5, 4.8, 4.3, 3.3, 6. , 2.3, 4.5, 3.4, 4.5, 2.3, 4.5, 2.5, 6.4, 7. , 5.8, 3.4, 5.1, 4.9, 8.5, 3. , 3.3, 3.3, 2.6, 1.8, 2.9, 4.3, 2.1, 4.8, 5.3, 3.1, 5. , 4.6, 3.1, 5.6, 5.5, 4.3, 4.5, 5.4, 4.3, 1.5, 3.9, 5.4, 1.8, 3.5, 5.3, 2.6, 3.5, 3. , 4.9, 3.5, 3.4, 2. , 4.5, 5.3, 3. , 1.6, 4.8, 3.8, 0.6, 4.1, 5.4, 7.5, 0.6, 2.6, 2.1, 3.8, 3.3, 5.4, 3.6, 2.4, 4.6, 0.9, 5.8, 4.4, 1. , 3.6, 6.3, 4.4, 7.5, 5.9, 0.5, 4.3, 2.4, 6. , 4.6, 5.1, 6.1, 4. , 4.5, 7.9, 3.5, 3.1, 5. , 6.3, 4.9, 5.9, 4.6, 5.8, 5.4, 0.1, 1.8, 5.1, 4. , 2.4, 4.6, 4.9, 3.1, 1.4, 4. , 3.6, 4.3, 3.8, 4.4, 4.8, 5.1, 4. , 0. , 2.1, 5.9, 6.3, 3.1, 6. , 3.4, 1.9, 5.6, 5.3, 4.8, 2.6, 5.6, 4.8, 5.4, 3.4, 5.3, 4.1, 3.8, 3.6, 3.9, 2. , 3.5, 4. , 3.6, 0.6, 2.4, 3.9, 4.1, 2.8, 3. , 5.1, 5.4, 3.9, 3.3, 3.8, 2.5, 0.6, 2.8, 2.9, 7. , 6. , 2.8, 4. , 4.9, 4.8, 2.8, 2. , 4. , 2.6, 3.1, 2.9, 6.5, 4.3, 2.1, 3.9, 4.3, 0. , 7.4, 3.9])
# Quick summary of the data
st.describe(datos)
(375, (0.0, 8.5), 3.9706666666666668, 2.5927736185383243, -0.13203546994646295, -0.06611485627230884)
# Histogram with st
st.histogram(datos, numbins=10, defaultlimits=(0,10))
(array([ 17., 20., 61., 78., 98., 61., 29., 9., 2., 0.]), 0, 1.0, 0)
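Note that `st.histogram` only existed in older SciPy releases; it was deprecated and later removed. A minimal sketch of the modern equivalent with `np.histogram` (the `datos` array here is a synthetic stand-in for the grades, not the real file):

```python
import numpy as np

# Hypothetical stand-in for the grades loaded above
rng = np.random.default_rng(0)
datos = rng.uniform(0, 8.5, size=375)

# Modern equivalent of st.histogram(datos, numbins=10, defaultlimits=(0, 10))
counts, edges = np.histogram(datos, bins=10, range=(0, 10))
print(counts)   # occurrences in each of the 10 bins
print(edges)    # the 11 bin edges: 0, 1, ..., 10
```

`np.histogram` returns the counts together with the bin edges, which is also exactly what `plt.hist` computes internally.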
# Let's plot a histogram with plt
plt.hist(datos, range(0, 11))
plt.xticks(range(0,11))
plt.grid(True)
plt.vlines(5, 0, 100, lw=5, colors='red', alpha=0.8)
plt.fill_between([0, 5], [100, 100], color='red', alpha=0.5)
<matplotlib.collections.PolyCollection at 0x7f3f07891358>
# Let's plot a cumulative histogram with plt
plt.hist(datos, range(0,11), cumulative=True)
plt.xticks(range(0,11))
plt.vlines(5, 0, 400, lw=5, colors='red', alpha=0.8)
plt.fill_between([0, 5], [400, 400], color='red', alpha=0.5)
plt.grid(True)
# Percentile rank of a score
st.percentileofscore(datos, 5)
73.733333333333334
# Score at a given percentile
st.scoreatpercentile(datos, 50)
4.0
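These two functions are (approximately) inverses of each other, and `np.percentile` computes the same quantile. A small sketch with an illustrative sample, not the real grades:

```python
import numpy as np
import scipy.stats as st

# Illustrative sample
datos = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])

# scoreatpercentile(p) gives the value below which p% of the data fall;
# percentileofscore(x) goes in the opposite direction
mediana = st.scoreatpercentile(datos, 50)
print(mediana)                               # 5.5
print(st.percentileofscore(datos, mediana))  # 50.0

# np.percentile computes the same quantile
print(np.percentile(datos, 50))              # 5.5
```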
Do these grades look normal to you? No, I don't mean whether you like them or not... I mean: do you think these grades follow a Gaussian distribution?
# Parameters
med = st.nanmean(datos)
des_tip = st.nanstd(datos)
# Normal distribution (frozen, with fixed loc and scale)
dist_normal = st.norm(loc=med, scale=des_tip)
Now we can evaluate:
pdf
cdf
This way we save ourselves from having to define functions such as:
$$N(\mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

$$\phi(x)=\int_{-\infty}^{x}\frac{1}{\sqrt{2\pi}}\,e^{-\frac{t^2}{2}}\,dt$$

And to compute probabilities: $P[a\leq X\leq b]=\int_{a}^{b}f(x)\,dx$
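In practice that integral never has to be evaluated by hand: with a frozen distribution, $P[a\leq X\leq b]$ is just the difference of two `cdf` calls. A sketch, assuming the approximate parameters of the grades ($\mu\approx3.97$, $\sigma\approx1.61$):

```python
import scipy.stats as st

# Frozen normal with illustrative mean and standard deviation
dist = st.norm(loc=3.97, scale=1.61)

# P[a <= X <= b] = F(b) - F(a): no explicit integration needed
a, b = 0.0, 5.0
prob = dist.cdf(b) - dist.cdf(a)
print(prob)   # probability of a grade between 0 and 5 under this model
```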
# Compute the pdf
x = np.linspace(0, 10, 100)
y1 = dist_normal.pdf(x)
y1
array([ 1.18466112e-02, 1.38013272e-02, 1.60154276e-02, 1.85117381e-02, 2.13131100e-02, 2.44420387e-02, 2.79202305e-02, 3.17681221e-02, 3.60043569e-02, 4.06452262e-02, 4.57040834e-02, 5.11907430e-02, 5.71108776e-02, 6.34654270e-02, 7.02500358e-02, 7.74545375e-02, 8.50625018e-02, 9.30508640e-02, 1.01389654e-01, 1.10041839e-01, 1.18963303e-01, 1.28102958e-01, 1.37403018e-01, 1.46799426e-01, 1.56222440e-01, 1.65597374e-01, 1.74845495e-01, 1.83885049e-01, 1.92632412e-01, 2.01003342e-01, 2.08914298e-01, 2.16283816e-01, 2.23033887e-01, 2.29091330e-01, 2.34389108e-01, 2.38867558e-01, 2.42475512e-01, 2.45171267e-01, 2.46923389e-01, 2.47711323e-01, 2.47525793e-01, 2.46368984e-01, 2.44254502e-01, 2.41207102e-01, 2.37262214e-01, 2.32465247e-01, 2.26870729e-01, 2.20541268e-01, 2.13546393e-01, 2.05961280e-01, 1.97865419e-01, 1.89341226e-01, 1.80472670e-01, 1.71343910e-01, 1.62038000e-01, 1.52635673e-01, 1.43214236e-01, 1.33846588e-01, 1.24600387e-01, 1.15537363e-01, 1.06712791e-01, 9.81751290e-02, 8.99658014e-02, 8.21191408e-02, 7.46624629e-02, 6.76162683e-02, 6.09945536e-02, 5.48052151e-02, 4.90505281e-02, 4.37276824e-02, 3.88293574e-02, 3.43443198e-02, 3.02580270e-02, 2.65532243e-02, 2.32105212e-02, 2.02089381e-02, 1.75264140e-02, 1.51402697e-02, 1.30276213e-02, 1.11657426e-02, 9.53237395e-03, 8.10597949e-03, 6.86595419e-03, 5.79278319e-03, 4.86815747e-03, 4.07504992e-03, 3.39775625e-03, 2.82190558e-03, 2.33444533e-03, 1.92360505e-03, 1.57884349e-03, 1.29078296e-03, 1.05113462e-03, 8.52617863e-04, 6.88876683e-04, 5.54395268e-04, 4.44414802e-04, 3.54853006e-04, 2.82227567e-04, 2.23584322e-04])
# Plot it
plt.plot(x, y1)
plt.grid(True)
# Compute the cdf
y2 = dist_normal.cdf(x)
y2
array([ 0.00683286, 0.00812612, 0.00962974, 0.01137106, 0.01337976, 0.01568777, 0.0183293 , 0.02134067, 0.02476018, 0.02862791, 0.03298542, 0.03787547, 0.04334158, 0.04942765, 0.05617737, 0.06363372, 0.07183834, 0.08083088, 0.0906483 , 0.1013242 , 0.11288805, 0.12536454, 0.13877285, 0.15312598, 0.1684302 , 0.18468445, 0.20187994, 0.21999975, 0.2390186 , 0.25890272, 0.27960985, 0.30108939, 0.32328269, 0.34612342, 0.36953818, 0.39344715, 0.41776493, 0.44240138, 0.46726269, 0.49225242, 0.5172726 , 0.54222491, 0.56701185, 0.59153783, 0.61571033, 0.63944089, 0.66264614, 0.68524861, 0.70717752, 0.72836942, 0.74876866, 0.76832781, 0.78700782, 0.80477817, 0.82161679, 0.83750988, 0.85245164, 0.86644386, 0.87949545, 0.89162183, 0.90284437, 0.91318964, 0.92268879, 0.93137677, 0.93929165, 0.94647392, 0.95296581, 0.95881062, 0.96405221, 0.96873436, 0.97290038, 0.9765926 , 0.97985205, 0.98271818, 0.98522854, 0.98741866, 0.98932189, 0.99096933, 0.99238974, 0.99360961, 0.99465314, 0.99554231, 0.99629699, 0.99693499, 0.99747225, 0.99792288, 0.99829939, 0.99861271, 0.99887244, 0.9990869 , 0.99926327, 0.99940776, 0.99952567, 0.9996215 , 0.99969908, 0.99976165, 0.99981191, 0.99985212, 0.99988416, 0.9999096 ])
# Plot it
plt.plot(x, y2)
plt.grid(True)
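Besides `pdf` and `cdf`, a frozen distribution exposes other handy methods, such as `ppf` (the inverse of `cdf`) and `rvs` (random sampling). A minimal sketch with the same illustrative parameters:

```python
import scipy.stats as st

# Frozen normal (parameters are illustrative)
dist = st.norm(loc=3.97, scale=1.61)

print(dist.mean(), dist.std())  # recover loc and scale
print(dist.ppf(0.5))            # ppf inverts the cdf: the median, equal to loc
muestras = dist.rvs(size=5)     # draw random samples from the model
print(muestras)
```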
In the same way you can use other continuous or discrete distributions, and even define your own. But let's get back to the grades...
Now that we have visualised the distribution of the grades and we know how to generate normal distributions, why not run a Kolmogorov-Smirnov test?
The idea is to check how well (or how badly) the data fit a normal distribution with $\mu=3.97$ and $\sigma^2=2.59$.
bars = st.histogram(datos, numbins=10, defaultlimits=(0,10))[0]
bars /= 375
plt.bar(np.arange(0,10), bars, alpha=0.5, width=1)
plt.plot(x, y1, c='black', lw=2)
plt.grid(True)
plt.hist(datos, np.linspace(0,10,51), cumulative=True, alpha=0.5)
plt.plot(x, y2 * 375, lw=2, c='black')
plt.xticks(range(0,11))
plt.grid(True)
st.kstest(datos, dist_normal.cdf)
(0.04783071674813294, 0.34838712365988389)
The null hypothesis is rejected when the p-value associated with the observed result is less than or equal to the chosen significance level, conventionally 0.05 or 0.01. In other words, the p-value is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. (Wikipedia)
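That decision rule can be sketched in a few lines. The sample below is synthetic, drawn from a normal distribution on purpose, so the test will typically not reject:

```python
import numpy as np
import scipy.stats as st

# Synthetic sample standing in for the grades
rng = np.random.default_rng(0)
muestra = rng.normal(loc=4.0, scale=1.6, size=375)
dist = st.norm(loc=muestra.mean(), scale=muestra.std())

# Decision rule: reject H0 when the p-value falls below alpha
estadistico, p_valor = st.kstest(muestra, dist.cdf)
alpha = 0.05
if p_valor <= alpha:
    print("Reject H0: the sample does not look normal")
else:
    print("Cannot reject H0: compatible with a normal distribution")
```

Strictly speaking, using parameters estimated from the same sample biases the KS p-value upwards (the Lilliefors correction addresses this), but the sketch mirrors what the notebook does above.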
If we try `st.normaltest`, which also checks goodness of fit, we get an even higher p-value:
st.normaltest(datos)
(1.1315306710610515, 0.56792532695583819)
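Another common check is the Shapiro-Wilk test, `st.shapiro`, which likewise returns a statistic and a p-value. A sketch on a synthetic normal sample:

```python
import numpy as np
import scipy.stats as st

# Synthetic normal sample (stand-in for the grades)
rng = np.random.default_rng(0)
muestra = rng.normal(loc=4.0, scale=1.6, size=375)

# Shapiro-Wilk normality test: again, a high p-value means
# we cannot reject normality
w, p = st.shapiro(muestra)
print(w, p)
```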
All in all, it seems that this time the grades did follow a normal distribution with $\mu=3.97$ and $\sigma^2=2.59$.
The following cells contain the Notebook configuration.
To display and use the Twitter links, the notebook must be run as trusted:
File > Trusted Notebook
# This cell styles the notebook
from IPython.core.display import HTML
css_file = '../styles/aeropython.css'
HTML(open(css_file, "r").read())