Note
Go to the end to download the full example code.
Distributions Fitting
This example shows how to use the best_fit_distribution
function
to determine which statistical distribution best first a given data set, and plot the
results together.
In this example, we’ll go further and see how by using the probability density function of a fit to a chi-square distrubtion, one can get knowledge of the underlying normal distributions it originated from.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy
import scipy.stats as st
from pyhdtoolkit.maths import stats_fitting as fitting
from pyhdtoolkit.plotting.styles import _SPHINX_GALLERY_PARAMS
from pyhdtoolkit.plotting.utils import set_arrow_label
from pyhdtoolkit.utils import logging
logging.config_logger(level="error")
plt.rcParams.update(_SPHINX_GALLERY_PARAMS) # for readability of this tutorial
We will for this example create a chi-square distribution with known parameters. As the chisquare distribution is based on deviations of independent, normal distribution variables, we will create a normal distribution with K degrees of freedom, and compute the associated
By repeating so a few thousand times, we get a chi-square distribution.
def chi_square_dist(num: int, meas_used: int) -> np.ndarray:
"""
Generates a chi-square distribution of *num* elements, each computed
from a normal distribution of *meas_used* elements.
Args:
num (int): number of elements for the generated distribution.
meas_used (int): number of values for each.
Returns:
The resulting distribution as a `numpy.ndarray`.
"""
result = []
for _ in range(num):
norm_dist = np.random.default_rng().normal(loc=0, scale=1, size=meas_used)
result.append(np.sum(np.square([i - np.mean(norm_dist) for i in norm_dist])))
return np.array(result)
def chi_dist(num: int, meas_used: int) -> np.ndarray:
"""
Generates a chi distribution of *num* elements, each computed from a
normal distribution of *meas_used* elements.
Args:
num (int): number of elements for the generated distribution.
meas_used (int): number of values for each.
Returns:
The resulting distribution as a `numpy.ndarray`.
"""
result = []
for _ in range(num):
norm_dist = np.random.default_rng().normal(loc=0, scale=1, size=meas_used)
result.append(np.sqrt(np.sum(np.square([i - np.mean(norm_dist) for i in norm_dist]))))
return np.array(result)
Now, from this chi-squared distribution of values, how do we find back the standard deviation \(\sigma_{0}\) of the normal (gaussian) distributions it is computed from? Let’s first start by fitting a statistical chisquared model to the data!
From there, we can plot the Probability Density Function of our best candidate and see how well it fits our data. It should fit very well, and then a good guess of this :math`sigma_{0}`` is based on the index at which to find the peak of the best candidate’s PDF.
Since this peak happens at \(K - 2\), then we have:
Note
In the case where we use \(N\) measurements, we have \(K = N - 1`\) degrees of freedom. This means that \(N\) measurements lead to a Chisquare(N-1) distribution. As a consequence, it is important to replace \(K`\) by \(N - 1`\) in all the equations above, and the normal distribution’s stdev estimator becomes:
Enough talk, see below an example with \(N = 5\) measurements.
measurements_used = 5
degrees_of_freedom = measurements_used - 1
# Chi-square distribution
sigs = chi_square_dist(50_000, meas_used=measurements_used)
data = pd.Series(sigs)
# Chi distribution
chis = chi_dist(num=50_000, meas_used=measurements_used)
chi_data = pd.Series(chis)
For the sake of not over-crowding our plot, let’s reduce the amount of statistical distributions we will try to fit to the data:
DISTRIBUTIONS = {
st.chi: "Chi",
st.chi2: "ChiSquared",
st.norm: "Normal",
}
fitting.set_distributions_dict(DISTRIBUTIONS)
Now we plot our data, and try to fit these three distributions to it, by calling
the best_fit_distribution
function:
ax = data.plot(
kind="hist",
bins=100,
density=True,
alpha=0.6,
label="Generated Distribution",
figsize=(20, 12),
)
data_y_lim = ax.get_ylim()
# Find best fit candidate
best_fit_func, best_fit_params = fitting.best_fit_distribution(data, 200, ax)
ax.set_ylim(data_y_lim)
ax.set_title("All Fitted Distributions")
ax.set_ylabel("Normed Hist Counts")
plt.legend()
plt.show()
That’s nice. We can already see that only the chi-squared distrubtion seems to
be a good fit. Let’s do a new plot with only the best fit’s probability density
function (which we get with the make_pdf
function) added
onto the data, and determine its mode:
# Let's get the PDF of the best fit
pdf: pd.Series = fitting.make_pdf(best_fit_func, best_fit_params)
plt.figure(figsize=(20, 12))
ax = pdf.plot(lw=2, label=f"{fitting.DISTRIBUTIONS[best_fit_func]} fit PDF", legend=True)
data.plot(
kind="hist",
bins=100,
density=True,
alpha=0.5,
label=f"Generated ({measurements_used - 1} degrees of freedom)",
legend=True,
ax=ax,
)
param_names = (
(best_fit_func.shapes + ", loc, scale").split(", ")
if best_fit_func.shapes
else ["loc", "scale"]
)
param_str = ", ".join(
[f"{k}={v:0.2f}" for k, v in zip(param_names, best_fit_params, strict=False)]
)
dist_str = f"{fitting.DISTRIBUTIONS[best_fit_func]}({param_str})"
# Let's add to the plot some info on the fit's peak
set_arrow_label(
axis=ax,
label=f"Measured Mode: {pdf.idxmax():.3f}",
arrow_position=(pdf.idxmax(), pdf.max()),
label_position=(1.1 * np.mean(ax.get_xlim()), 0.75 * max(pdf.to_numpy())),
color="indianred",
arrow_arc_rad=0.3,
fontsize=25,
)
plt.vlines(pdf.idxmax(), ymin=0, ymax=max(pdf.to_numpy()), linestyles="--", color="indianred")
ax.set_title("Best fit to distribution:\n" + dist_str)
ax.set_ylabel("Normed Hist Counts")
ax.set_xlabel("x")
ax.legend()
We expect a mode of 2, we can check that this is indeed the case. Here we leave a nice tolerance to ensure the documentation build does not fail.
assert np.isclose(pdf.idxmax(), 2, rtol=2e-2)
With this knowledge, we can now determine the standard deviation of the normal
distribution we’ve used to create this chi-square distribution. Remember that
in the chi_square_dist
function we have called np.random.default_rng().normal
with loc=0
and scale=1
, so we expect here to find a standard deviation of
one.
factor = (
np.sqrt(2)
* scipy.special.gamma((degrees_of_freedom + 1) / 2)
/ scipy.special.gamma(degrees_of_freedom / 2)
)
determined_stdev = chi_data.mean() / factor
assert np.isclose(determined_stdev, 1, rtol=1e-2) # nice tolerance here too
Which we do :)
Bonus: here’s a plot of trying to fit all basic distributions to the chi distribution generated above.
# Let's reset the distributions dict
DISTRIBUTIONS = {
st.chi: "Chi",
st.chi2: "Chi-Square",
st.expon: "Exponential",
st.laplace: "Laplace",
st.lognorm: "LogNorm",
st.norm: "Normal",
}
fitting.set_distributions_dict(DISTRIBUTIONS)
ac = chi_data.plot(
kind="hist",
bins=100,
density=True,
alpha=0.6,
label="Generated Chi Distribution",
figsize=(20, 12),
)
data_ylim = ac.get_ylim()
# Find best fit candidate
best_fit_func, best_fit_params = fitting.best_fit_distribution(chi_data, 200, ac)
ac.set_ylim(data_ylim)
ac.set_title("All Fitted Distributions")
ac.set_ylabel("Normed Hist Counts")
plt.legend()
References
The use of the following functions, methods, classes and modules is shown in this example:
stats_fitting
:best_fit_distribution
,make_pdf
utils
:set_arrow_label
Total running time of the script: (0 minutes 12.872 seconds)