一日一万字の感謝の写経

一日一万字の感謝の写経をして強くなります。そのうち本を置き去りにして何かを作り始める予定です。

Pythonで体験するベイズ推論 PyMCによるMCMC入門 第2章 (2.2.6から2章の終わりまで) 写経

目次

始めに

 この記事では、Pythonで体験するベイズ推論 PyMCによるMCMC入門

の第2章の2.2.4までを写経した記録をまとめます。前回はこちらPythonで体験するベイズ推論 PyMCによるMCMC入門 第2章 (2.2.4まで) 写経。 実行はJupyter Labにて実行しています。

 写経は以下のようにまとめました。

  • 写経ではありますが、コードを実行しやすいように関数を作成し、コードの再実行がやりやすいようにした部分があります。

  • また、より良いと思われるコードを考えた場合は書き換えてコメントを添えるようにし、変更点をなるべく明示するようにしてあります。

  • 個人的に気になった点のメモを残すようにしてあります。同じような疑問を持った方の助けになれば幸いです。

  • 以前書いたコードと同じようなコード(例えばグラフの描写等)は効率化のために飛ばしているところもあります。

 Pythonで体験するベイズ推論 PyMCによるMCMC入門を読んでいている際のちょっとした正誤表代わりになればと思います。

この記事で使用する主なモジュール、設定

 この記事では、主に以下のモジュールや設定を使用しています。

import scipy.stats as stats
from IPython.core.pylabtools import figsize
from matplotlib import pyplot as plt
%matplotlib inline
import numpy as np
plt.rcParams['font.size'] = 17
import pymc as pm
from scipy.stats.mstats import mquantiles
from urllib.request import urlretrieve
from os import makedirs

2 PyMCについてもう少し

2.2 モデリングのアプローチ

2.2.6 二項分布

def plot_bin_dist_pmf(N=10, means=[.4, .9]):
    figsize(25, 8)
    binomial = stats.binom
    for p in means:
        _x = np.arange(N + 1)
        plt.bar(_x, binomial.pmf(_x, N, p), 
               alpha=0.6, linewidth=3, 
               label=f"$N$: {N}, $p$: {p}")
    plt.legend(loc="upper left")
    plt.xlim(-.5, N + 0.5)
    plt.xlabel("$k$")
    plt.ylabel("$P(X = k)$")
    plt.title("Probability mass distributions of "
             "binomial random variables")
plot_bin_dist_pmf()

Output:

f:id:koheitsutsumi223:20190323201952p:plain

以下の図は、二項分布が正規分布に収束するのが実感できますね。

plot_bin_dist_pmf(N=1000, means=[.5, .8])

Output:

f:id:koheitsutsumi223:20190323201956p:plain

2.2.7 例題:カンニングをした学生の割合

N = 100
p = pm.Uniform("freq_cheating", 0, 1)

true_answers = pm.Bernoulli("truths", p, size=N)

first_coin_flips = pm.Bernoulli("first_flips", 0.5, size=N)
print(first_coin_flips.value)

Output:

    [False  True  True False False  True False  True  True False False False
      True  True False  True False False  True False  True  True False False
      True False False False  True False  True  True False  True  True  True
     False  True  True  True  True False False False False False False False
     False  True  True  True False  True False False False False False False
      True False False  True  True False  True False  True False  True False
     False  True  True  True False  True  True  True  True  True False  True
     False  True  True  True False  True False  True  True  True  True False
      True  True False  True]
second_coin_flips = pm.Bernoulli("second_flips", 0.5, size=N)

@pm.deterministic
def observed_proportion(t_a=true_answers, 
                       fc=first_coin_flips, 
                       sc=second_coin_flips):
    observed = fc * t_a + (1 - fc) * sc
    return observed.sum()/float(N)

print(observed_proportion.value)

Output:

    0.41
X = 35
observations = pm.Binomial("obs", N, observed_proportion, 
                          observed=True, value=X)

model = pm.Model([p, true_answers, 
                  first_coin_flips, second_coin_flips, 
                  observed_proportion, observations])
mcmc = pm.MCMC(model)
mcmc.sample(40000, 15000)

Output:

     [-----------------100%-----------------] 40000 of 40000 complete in 13.5 sec
figsize(25, 6)

p_trace = mcmc.trace("freq_cheating")[:]
plt.hist(p_trace, histtype="stepfilled", 
         density=True, alpha=0.85, bins=30, color="#348ABD", 
         label="posterior distribution")
plt.vlines([.05, .35], [0, 0], [5, 5], alpha=0.3)
plt.xlim(0, 1)
plt.xlabel("Value of $p$")
plt.ylabel("Density")
plt.title("Posterior distribution of parameter $p$")
plt.legend()

Output:

    <matplotlib.legend.Legend at 0x1fae1706320>

f:id:koheitsutsumi223:20190323202000p:plain

2.2.8 もう一つのPyMCモデル

p = pm.Uniform("freq_cheating", 0, 1)

@pm.deterministic
def p_skewed(p=p):
    return 0.5 * p + 0.25

yes_responses = pm.Binomial("number_cheaters", 
                            100, p_skewed, value=35, observed=True)

model = pm.Model([yes_responses, p_skewed, p])

mcmc = pm.MCMC(model)
mcmc.sample(25000, 2500)

Output:

     [-----------------100%-----------------] 25000 of 25000 complete in 1.2 sec
figsize(25, 6)

p_trace = mcmc.trace("freq_cheating")[:]
plt.hist(p_trace, histtype="stepfilled", 
         density=True, alpha=0.85, bins=30, color="#348ABD", 
         label="posterior distribution")
plt.vlines([.05, .35], [0, 0], [5, 5], alpha=0.3)
plt.xlim(0, 1)
plt.xlabel("Value of $p$")
plt.ylabel("Density")
plt.title("Posterior distribution of parameter $p$"
          "from alternate model")
plt.legend()

Output:

<matplotlib.legend.Legend at 0x1fadfc56eb8>

f:id:koheitsutsumi223:20190323202000p:plain

2.2.9 PyMCの使い方をもう少し

上級者向け:Lambdaクラスを使ってお手軽にdeterministic変数を扱う

beta = pm.Normal("coefficients", 0, 1, size=(N, 1))

x = np.random.randn(N, 1)
メモ

 np.random.randnは標準正規分布からのサンプルを返す関数のようです。 引数の整数は返されるarrayオブジェクトの次元を指定するそうです。

np.random.randn??(クリックすると展開されます)

Docstring:
randn(d0, d1, ..., dn)

Return a sample (or samples) from the "standard normal" distribution.

If positive, int_like or int-convertible arguments are provided,
`randn` generates an array of shape ``(d0, d1, ..., dn)``, filled
with random floats sampled from a univariate "normal" (Gaussian)
distribution of mean 0 and variance 1 (if any of the :math:`d_i` are
floats, they are first converted to integers by truncation). A single
float randomly sampled from the distribution is returned if no
argument is provided.

This is a convenience function.  If you want an interface that takes a
tuple as the first argument, use `numpy.random.standard_normal` instead.

Parameters
----------
d0, d1, ..., dn : int, optional
    The dimensions of the returned array, should be all positive.
    If no argument is given a single Python float is returned.

Returns
-------
Z : ndarray or float
    A ``(d0, d1, ..., dn)``-shaped array of floating-point samples from
    the standard normal distribution, or a single such float if
    no parameters were supplied.

See Also
--------
random.standard_normal : Similar, but takes a tuple as its argument.

Notes
-----
For random samples from :math:`N(\mu, \sigma^2)`, use:

``sigma * np.random.randn(...) + mu``

Examples
--------
>>> np.random.randn()
2.1923875335537315 #random

Two-by-four array of samples from N(3, 6.25):

>>> 2.5 * np.random.randn(2, 4) + 3
array([[-4.49401501,  4.00950034, -1.81814867,  7.29718677],  #random
       [ 0.39924804,  4.68456316,  4.99394529,  4.84057254]]) #random
Type:      builtin_function_or_method

linear_combination = pm.Lambda("comb", lambda x=x, 
                               beta=beta: np.dot(x.T, beta))

xbeta内積を計算したdeterministic変数が作成されます。

linear_combination.value

Output:

    array([[ 3.54607155]])
メモ

 x.Tはxの転置行列(ベクトル?)を返します。

x.T

Output:

    array([[-0.69877594, -0.30683497, -1.04543408,  0.92360732, -0.84359431,
            -0.8104658 ,  1.033511  ,  1.06662984, -0.87370605, -1.61567741,
             0.277752  ,  0.06679146,  1.05671416, -0.04508463, -0.87050541,
            -0.29798197,  1.41862678, -0.18613036, -1.14649344, -3.38937616,
            -0.34949099,  0.26877773,  1.61307278,  0.11562269, -0.18933638,
             0.0502833 , -0.46785645,  1.40923847, -0.50753077,  0.18046069,
             1.46627307,  1.14276522, -0.31624792, -0.13193391,  1.00854897,
             0.53726617,  0.75892404, -0.49679437, -0.49542055, -0.29550763,
             0.38086154, -0.22335623,  0.42129412, -0.27474526, -1.06314766,
            -0.25977146, -0.12190971,  1.61424116, -0.7144213 , -0.11599007,
             1.78002587, -0.34226   , -0.02305097, -0.22823814,  0.26100848,
            -0.39558157, -1.78600444,  0.38610165, -1.14085662, -0.9320066 ,
            -0.55027475, -0.1132584 , -0.26211372, -0.05095378,  1.59816836,
             0.61105632, -1.26027571,  1.28556484, -1.61263487, -0.62419771,
            -0.01766886,  2.28490072, -0.82740689, -0.03267927, -0.45830663,
             1.51543338, -0.39441128, -0.13035409, -0.4335249 , -0.70002868,
             1.36377425, -1.83376874, -0.44147943,  0.00362854, -1.01908356,
             2.54583705, -0.12141319,  0.27333106,  2.22923404,  0.1929837 ,
            -2.02996548, -0.29450739,  0.22175473,  2.27184055,  0.39794602,
             0.53896599, -0.46472929, -1.58689283,  1.6319995 ,  0.86209208]])

 np.dotはベクトル同士の内積や行列の積を計算します。以下では5行1列の行列と1行5列の行列を掛け合わせ、5行5列の行列がアウトプットされています。

np.dot(x[:5], x[:5].T)

Output:

    array([[ 0.48828781,  0.2144089 ,  0.73052418, -0.64539457,  0.58948341],
           [ 0.2144089 ,  0.0941477 ,  0.32077574, -0.28339503,  0.25884424],
           [ 0.73052418,  0.32077574,  1.09293241, -0.96557056,  0.88192224],
           [-0.64539457, -0.28339503, -0.96557056,  0.85305048, -0.77914988],
           [ 0.58948341,  0.25884424,  0.88192224, -0.77914988,  0.71165136]])

 np.dotの説明はこんな感じ。

np.dot??(クリックすると展開されます)

Docstring:
dot(a, b, out=None)

Dot product of two arrays.

For 2-D arrays it is equivalent to matrix multiplication, and for 1-D
arrays to inner product of vectors (without complex conjugation). For
N dimensions it is a sum product over the last axis of `a` and
the second-to-last of `b`::

    dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])

Parameters
----------
a : array_like
    First argument.
b : array_like
    Second argument.
out : ndarray, optional
    Output argument. This must have the exact kind that would be returned
    if it was not used. In particular, it must have the right type, must be
    C-contiguous, and its dtype must be the dtype that would be returned
    for `dot(a,b)`. This is a performance feature. Therefore, if these
    conditions are not met, an exception is raised, instead of attempting
    to be flexible.

Returns
-------
output : ndarray
    Returns the dot product of `a` and `b`.  If `a` and `b` are both
    scalars or both 1-D arrays then a scalar is returned; otherwise
    an array is returned.
    If `out` is given, then it is returned.

Raises
------
ValueError
    If the last dimension of `a` is not the same size as
    the second-to-last dimension of `b`.

See Also
--------
vdot : Complex-conjugating dot product.
tensordot : Sum products over arbitrary axes.
einsum : Einstein summation convention.
matmul : '@' operator as method with out parameter.

Examples
--------
>>> np.dot(3, 4)
12

Neither argument is complex-conjugated:

>>> np.dot([2j, 3j], [2j, 3j])
(-13+0j)

For 2-D arrays it is the matrix product:

>>> a = [[1, 0], [0, 1]]
>>> b = [[4, 1], [2, 2]]
>>> np.dot(a, b)
array([[4, 1],
       [2, 2]])

>>> a = np.arange(3*4*5*6).reshape((3,4,5,6))
>>> b = np.arange(3*4*5*6)[::-1].reshape((5,4,6,3))
>>> np.dot(a, b)[2,3,2,1,2,2]
499128
>>> sum(a[2,3,2,:] * b[1,2,:,2])
499128
Type:      builtin_function_or_method

上級者向け:PyMC変数のarray

N = 10
x = np.empty(N, dtype=object)
for i in range(0, N):
    x[i] = pm.Exponential(f"x_{i}", (i + 1)**2)

x

Output:

    array([ <pymc.distributions.new_dist_class.<locals>.new_class 'x_0' at 0x000001FADFA052B0>,
           <pymc.distributions.new_dist_class.<locals>.new_class 'x_1' at 0x000001FADFA05240>,
           <pymc.distributions.new_dist_class.<locals>.new_class 'x_2' at 0x000001FADFA051D0>,
           <pymc.distributions.new_dist_class.<locals>.new_class 'x_3' at 0x000001FADFA05160>,
           <pymc.distributions.new_dist_class.<locals>.new_class 'x_4' at 0x000001FADFA050F0>,
           <pymc.distributions.new_dist_class.<locals>.new_class 'x_5' at 0x000001FADFA05080>,
           <pymc.distributions.new_dist_class.<locals>.new_class 'x_6' at 0x000001FADFA05FD0>,
           <pymc.distributions.new_dist_class.<locals>.new_class 'x_7' at 0x000001FADFA05A20>,
           <pymc.distributions.new_dist_class.<locals>.new_class 'x_8' at 0x000001FADFA05EF0>,
           <pymc.distributions.new_dist_class.<locals>.new_class 'x_9' at 0x000001FADFA05D30>], dtype=object)

xは各成分がPyMC変数となっているarrayオブジェクトになっています。

print(x[0].value)
x[0].random()
print(x[0].value)

Output:

    1.347759419712658
    0.6845365343616834
メモ

 ちなみに、dtype=objectを指定しないと、指数分布からサンプリングされた値がxに保存されるようです。

N = 10
x = np.empty(N)
for i in range(0, N):
    x[i] = pm.Exponential(f"x_{i}", (i + 1)**2)

x

Output:

    array([ 0.76906025,  0.14474833,  0.11372349,  0.04600305,  0.038043  ,
            0.01263401,  0.02176758,  0.00217072,  0.00591296,  0.0207582 ])

np.emptyの説明はこんな感じ。

np.empty??(クリックすると展開されます)

Docstring:
empty(shape, dtype=float, order='C')

Return a new array of given shape and type, without initializing entries.

Parameters
----------
shape : int or tuple of int
    Shape of the empty array
dtype : data-type, optional
    Desired output data-type.
order : {'C', 'F'}, optional
    Whether to store multi-dimensional data in row-major
    (C-style) or column-major (Fortran-style) order in
    memory.

Returns
-------
out : ndarray
    Array of uninitialized (arbitrary) data of the given shape, dtype, and
    order.  Object arrays will be initialized to None.

See Also
--------
empty_like, zeros, ones

Notes
-----
`empty`, unlike `zeros`, does not set the array values to zero,
and may therefore be marginally faster.  On the other hand, it requires
the user to manually set all the values in the array, and should be
used with caution.

Examples
--------
>>> np.empty([2, 2])
array([[ -9.74499359e+001,   6.69583040e-309],
       [  2.13182611e-314,   3.06959433e-309]])         #random

>>> np.empty([2, 2], dtype=int)
array([[-1073741821, -1067949133],
       [  496041986,    19249760]])                     #random
Type:      builtin_function_or_method

 dtypearrayに格納するデータの型を指定するようです。例えば、以下のようにするとstr型のデータを所持するarrayを作ることができます。

x = np.empty([2, 2], dtype=str)
x[0][0] = 'a'
x[0][1] = 100 # 暗黙の型変換も行われます。

x

Output:

    array([['a', '1'],
           ['', '']], 
          dtype='<U1')

dtype='<U1'は以下を参考に。ユニコードの1文字のデータ型のリトルエンディアンを表しているようです。リトル/ビッグエンディアンを何となくしか知らないので、これ以上の言及を避けておきます・・・

参考:Data type objects (dtype)

x = np.empty([2, 2])

デフォルトではdtype=floatなので、文字列を代入させることは出来ません。

x[0][0] = 'a'

Error:

    ---------------------------------------------------------------------------

    ValueError                                Traceback (most recent call last)

    <ipython-input-36-163567913498> in <module>
    ----> 1 x[0][0] = 'a'
    

    ValueError: could not convert string to float: 'a'

dtype=objectの指定により、複数のオブジェクトを所持するarrayを作成することができます。

x = np.empty([2, 2], dtype=object)
x[0][0] ='a'
x[0][1] = 'b'
x[1][0] = 3

x

Output:

    array([['a', 'b'],
           [3, None]], dtype=object)

例題:スペースシャトル「チャレンジャー号」の悲劇

makedirs("data", exist_ok=True)

urlretrieve("https://git.io/vXknD", "data/challenger_data.csv")

Output:

    ('data/challenger_data.csv', <http.client.HTTPMessage at 0x1fadfa06198>)
figsize(25, 7)
np.set_printoptions(precision=3, suppress=True)

challenger_data = np.genfromtxt("data/challenger_data.csv",
                                skip_header=1, usecols=[1, 2],
                                missing_values="NA",
                                delimiter=",")

challenger_data

Output:

    array([[ 66.,   0.],
           [ 70.,   1.],
           [ 69.,   0.],
           [ 80.,  nan],
           [ 68.,   0.],
           [ 67.,   0.],
           [ 72.,   0.],
           [ 73.,   0.],
           [ 70.,   0.],
           [ 57.,   1.],
           [ 63.,   1.],
           [ 70.,   1.],
           [ 78.,   0.],
           [ 67.,   0.],
           [ 53.,   1.],
           [ 67.,   0.],
           [ 75.,   0.],
           [ 70.,   0.],
           [ 81.,   0.],
           [ 76.,   0.],
           [ 79.,   0.],
           [ 75.,   1.],
           [ 76.,   0.],
           [ 58.,   1.],
           [ 31.,  nan]])
challenger_data = challenger_data[~np.isnan(challenger_data[:, 1])]

challenger_data

Output:

    array([[ 66.,   0.],
           [ 70.,   1.],
           [ 69.,   0.],
           [ 68.,   0.],
           [ 67.,   0.],
           [ 72.,   0.],
           [ 73.,   0.],
           [ 70.,   0.],
           [ 57.,   1.],
           [ 63.,   1.],
           [ 70.,   1.],
           [ 78.,   0.],
           [ 67.,   0.],
           [ 53.,   1.],
           [ 67.,   0.],
           [ 75.,   0.],
           [ 70.,   0.],
           [ 81.,   0.],
           [ 76.,   0.],
           [ 79.,   0.],
           [ 75.,   1.],
           [ 76.,   0.],
           [ 58.,   1.]])

メモ

 challenger_data[~np.isnan(challenger_data[:, 1])]の補足をしておきます。

challenger_data = np.genfromtxt("data/challenger_data.csv",
                                skip_header=1, usecols=[1, 2],
                                missing_values="NA",
                                delimiter=",")

 まず、2列目の成分を[:, 1]により取り出します。

challenger_data[:, 1]

Output

    array([  0.,   1.,   0.,  nan,   0.,   0.,   0.,   0.,   0.,   1.,   1.,
             1.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   0.,   0.,   1.,
             0.,   1.,  nan])

 np.isnanは入力されたarrayオブジェクトに対して、入力されたarrayオブジェクトの成分がnanならばTrueを、nan以外ならばFalseを所持するarrayオブジェクトを返します。

np.isnan??(クリックすると展開されます)

Call signature:  np.isnan(*args, **kwargs)
Type:            ufunc
String form:     <ufunc 'isnan'>
File:            c:\users\hoge\anaconda3\lib\site-packages\numpy\__init__.py
Docstring:      
isnan(x[, out])

Test element-wise for NaN and return result as a boolean array.

Parameters
----------
x : array_like
    Input array.

Returns
-------
y : ndarray or bool
    For scalar input, the result is a new boolean with value True if
    the input is NaN; otherwise the value is False.

    For array input, the result is a boolean array of the same
    dimensions as the input and the values are True if the
    corresponding element of the input is NaN; otherwise the values are
    False.

See Also
--------
isinf, isneginf, isposinf, isfinite

Notes
-----
Numpy uses the IEEE Standard for Binary Floating-Point for Arithmetic
(IEEE 754). This means that Not a Number is not equivalent to infinity.

Examples
--------
>>> np.isnan(np.nan)
True
>>> np.isnan(np.inf)
False
>>> np.isnan([np.log(-1.),1.,np.log(0)])
array([ True, False, False], dtype=bool)
Class docstring:
Functions that operate element by element on whole arrays.

To see the documentation for a specific ufunc, use np.info().  For
example, np.info(np.sin).  Because ufuncs are written in C
(for speed) and linked into Python with NumPy's ufunc facility,
Python's help() function finds this page whenever help() is called
on a ufunc.

A detailed explanation of ufuncs can be found in the "ufuncs.rst"
file in the NumPy reference guide.

Unary ufuncs:
=============

op(X, out=None)
Apply op to X elementwise

Parameters
----------
X : array_like
    Input array.
out : array_like
    An array to store the output. Must be the same shape as `X`.

Returns
-------
r : array_like
    `r` will have the same shape as `X`; if out is provided, `r`
    will be equal to out.

Binary ufuncs:
==============

op(X, Y, out=None)
Apply `op` to `X` and `Y` elementwise. May "broadcast" to make
the shapes of `X` and `Y` congruent.

The broadcasting rules are:

* Dimensions of length 1 may be prepended to either array.
* Arrays may be repeated along dimensions of length 1.

Parameters
----------
X : array_like
    First input array.
Y : array_like
    Second input array.
out : array_like
    An array to store the output. Must be the same shape as the
    output would have.

Returns
-------
r : array_like
    The return value; if out is provided, `r` will be equal to out.

例えば、先ほどの例では、以下のようなarrayオブジェクトが返されます。

np.isnan(challenger_data[:, 1])

Output:

    array([False, False, False,  True, False, False, False, False, False,
           False, False, False, False, False, False, False, False, False,
           False, False, False, False, False, False,  True], dtype=bool)

そして、~によりブール値を反転させています。~は整数型のビットを反転させる演算子です。

参考:整数型におけるビット単位演算符号付数値表現 の「2の補数」

~np.isnan(challenger_data[:, 1])

Output:

    array([ True,  True,  True, False,  True,  True,  True,  True,  True,
            True,  True,  True,  True,  True,  True,  True,  True,  True,
            True,  True,  True,  True,  True,  True, False], dtype=bool)

 最後に、~np.isnan(challenger_data[:, 1])の成分がTrueの行だけをスライスにより取り出します。

challenger_data = challenger_data[~np.isnan(challenger_data[:, 1])]
challenger_data

Output:

    array([[ 66.,   0.],
           [ 70.,   1.],
           [ 69.,   0.],
           [ 68.,   0.],
           [ 67.,   0.],
           [ 72.,   0.],
           [ 73.,   0.],
           [ 70.,   0.],
           [ 57.,   1.],
           [ 63.,   1.],
           [ 70.,   1.],
           [ 78.,   0.],
           [ 67.,   0.],
           [ 53.,   1.],
           [ 67.,   0.],
           [ 75.,   0.],
           [ 70.,   0.],
           [ 81.,   0.],
           [ 76.,   0.],
           [ 79.,   0.],
           [ 75.,   1.],
           [ 76.,   0.],
           [ 58.,   1.]])

 補足は以上です。

plt.scatter(challenger_data[:, 0], 
            challenger_data[:, 1], 
            s=75, color="k", alpha=0.5)
plt.yticks([0, 1])
plt.ylabel("Damage incident?")
plt.xlabel("Outside temperature (Fahrenheit)")
plt.title("Defects of the space shuttle O-rings versus temperature")

Output:

    Text(0.5, 1.0, 'Defects of the space shuttle O-rings versus temperature')

f:id:koheitsutsumi223:20190323202004p:plain

figsize(24, 6)

def logistic(x, beta, alpha=0):
    return 1.0 / (1.0 + np.exp(np.dot(beta, x) + alpha))

x = np.linspace(-4, 4, 100)
def plot_logistics(x=x, beta=[1, 3, 5], alpha=[1, -2, 7]):
    for b in beta:
        plt.plot(x, logistic(x, b), label=fr"$\beta = {b}$", ls="--", lw=1)
    for b, a in zip(beta, alpha):
        plt.plot(x, logistic(x, b, a),
                 label=fr"$\beta = {b}, \alpha = {a}$")
    plt.title("Logistic functions for "
              r"different $\beta$ and $\alpha$ values")
    plt.xlabel("$x$")
    plt.ylabel("Logistic function at $x$")
    plt.legend(loc="lower left")
plot_logistics()

f:id:koheitsutsumi223:20190323202008p:plain

2.2.11 正規分布

temperature = challenger_data[:, 0]
D = challenger_data[:, 1]

beta = pm.Normal("beta", 0, 0.001, value=0)
alpha = pm.Normal("alpha", 0, 0.001, value=0)

@pm.deterministic
def p(t=temperature, alpha=alpha, beta=beta):
    return 1.0 / (1. + np.exp(beta * t + alpha))

p.value # betaとalphaを定義したときにvalueを0に指定したのでpの各成分は0.5になっている。

Output:

    array([ 0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,
            0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,
            0.5])
observed = pm.Bernoulli("bernoulli_obs", p, value=D, observed=True)

model = pm.Model([observed, beta, alpha])

map_ = pm.MAP(model)
map_.fit()
mcmc=pm.MCMC(model)
mcmc.sample(120000, 100000, 2)

Output:

     [-----------------100%-----------------] 120000 of 120000 complete in 11.2 sec
figsize(25, 12)

alpha_samples = mcmc.trace('alpha')[:, None]
beta_samples = mcmc.trace('beta')[:, None]

plt.subplot(211)
plt.title("Posterior distributions of "
          r"the model parameters $\alpha, \beta$")
plt.hist(beta_samples, histtype='stepfilled', 
         bins=35, alpha=0.85, color="#7A68A6", density=True,
         label=r"posterior of $\beta$")
plt.legend()

plt.subplot(212)
plt.hist(alpha_samples, histtype='stepfilled',
         bins=35, alpha=0.85, color="#A60628", density=True,
         label=r"posterior of $\alpha$")

plt.xlabel("Value of parameter")
plt.ylabel("Density")
plt.legend()

Output:

    <matplotlib.legend.Legend at 0x1fae1ba67b8>

f:id:koheitsutsumi223:20190323202012p:plain

figsize(25, 8)

t = np.linspace(temperature.min() - 5,
                temperature.max() + 5, 50)[:, None]

p_t = logistic(t.T, beta_samples, alpha_samples)
p_t

Output:

    array([[ 0.998,  0.998,  0.997, ...,  0.   ,  0.   ,  0.   ],
           [ 0.998,  0.998,  0.997, ...,  0.   ,  0.   ,  0.   ],
           [ 1.   ,  0.999,  0.999, ...,  0.001,  0.001,  0.001],
           ..., 
           [ 0.989,  0.986,  0.983, ...,  0.003,  0.002,  0.002],
           [ 0.989,  0.986,  0.983, ...,  0.003,  0.002,  0.002],
           [ 0.987,  0.984,  0.98 , ...,  0.002,  0.002,  0.002]])

 10000のサンプルが得られました。

p_t.shape

Output:

    (10000, 50)
def plot_post_exp_val_of_prob_of_def(plot_real_list=[0, -2], fig_size_y=8, legend_flag=True, mean_lw=1):
    figsize(25, fig_size_y)
    mean_prob_t = p_t.mean(axis=0)

    for i in plot_real_list:
        plt.plot(t, p_t[i, :], ls="--",
                 label=f"realization from posterior: {i}")
        
    plt.plot(t, mean_prob_t, lw=mean_lw, # plotするデータが多くなっても目立つようにlwを変更できるようにした。
             color='r',
             label="average posterior \n"
                   "probability of defect")
                 
    plt.scatter(temperature, D, color="K", s=50, alpha=0.5)

    plt.title("Posterior expected value of the probability of defect, "
              f"including {len(plot_real_list)} realizations")
    if legend_flag:
        plt.legend(loc="lower left")
    plt.ylim(-0.1, 1.1)
    plt.xlim(t.min(), t.max())
    plt.ylabel("Probability")
    plt.xlabel("Temperature")
plot_post_exp_val_of_prob_of_def()

f:id:koheitsutsumi223:20190323202018p:plain

 10000のデータから100個のデータを100間隔でプロットしてみました。いろんな値がサンプリングされているということがわかりますね。

plot_post_exp_val_of_prob_of_def(plot_real_list=[i for i in range(0, 10000, 100)], fig_size_y=8, legend_flag=False)

f:id:koheitsutsumi223:20190323202023p:plain

qs = mquantiles(p_t, [0.025, 0.975], axis=0)
figsize(25, 8)
mean_prob_t = p_t.mean(axis=0)
plt.fill_between(t[:, 0], *qs, alpha=0.7, color="#7A68A6")
plt.plot(t[:, 0], qs[0], label="95% CI", color="#7A68A6", alpha=0.7)
plt.plot(t, mean_prob_t, lw=1, ls="--", color="k", label="average posterior \nprobability of defect")
plt.xlim(t.min(), t.max())
plt.ylim(-0.02, 1.02)
plt.legend(loc="lower left")
plt.scatter(temperature, D, color="k", s=50, alpha=0.5)
plt.xlabel("Temprature, $t$")
plt.ylabel("Probability estimate")
plt.title("Posterior probability of estimates, given temperature $t$")

Output:

    Text(0.5, 1.0, 'Posterior probability of estimates, given temperature $t$')

f:id:koheitsutsumi223:20190323202027p:plain

2.2.12 チャレンジャー号の悲劇の日に何が起こった?

figsize(25, 5)

prob_31 = logistic(31, beta_samples, alpha_samples)
plt.hist(prob_31, bins=1000, density=True, histtype='stepfilled')

plt.xlim(0.995, 1)
plt.ylabel("Density")
plt.xlabel("Probability of defect occurring in O-ring")

plt.title("Posterior distribution of probability of defect, "
          "given $t = 31")

Output:

    Text(0.5, 1.0, 'Posterior distribution of probability of defect, given $t = 31')

f:id:koheitsutsumi223:20190323202031p:plain

2.3 このモデルは適切か?

simulated = pm.Bernoulli("bernoulli_sim", p)
N = 10000

mcmc = pm.MCMC([simulated, alpha, beta, observed])
mcmc.sample(N)

Output:

     [-----------------100%-----------------] 10000 of 10000 complete in 1.9 sec
simulations = mcmc.trace("bernoulli_sim")[:].astype(int)

メモ

.astype(int)の補足。mcmc.trace("bernoulli_sim")[:]dtypebool値となっています。

mcmc.trace("bernoulli_sim")[:]

Output:

    array([[False,  True, False, ..., False,  True,  True],
           [False, False, False, ..., False, False,  True],
           [False, False, False, ..., False, False,  True],
           ..., 
           [ True, False, False, ..., False, False,  True],
           [ True, False, False, ..., False, False,  True],
           [False, False, False, ..., False, False,  True]], dtype=bool)

そこで、astypeメソッドを使って、整数型に変換しています。

mcmc.trace("bernoulli_sim")[:].astype(int)

Output:

    array([[0, 1, 0, ..., 0, 1, 1],
           [0, 0, 0, ..., 0, 0, 1],
           [0, 0, 0, ..., 0, 0, 1],
           ..., 
           [1, 0, 0, ..., 0, 0, 1],
           [1, 0, 0, ..., 0, 0, 1],
           [0, 0, 0, ..., 0, 0, 1]])

以上、メモでした。

simulations.shape

Output:

    (10000, 23)
plt.suptitle("Simulated datasets using posterior parameters")

figsize(25, 20)
for i in range(4):
    ax = plt.subplot(4, 1, i + 1)
    plt.scatter(temperature, simulations[1000 * i, :], 
                color="k", s=50, alpha=0.5)

Output: f:id:koheitsutsumi223:20190323202038p:plain

2.3.1 セパレーションプロット

posterior_probability = simulations.mean(axis=0)

print("Obs. | Arraya of Simulated Defects | posteriror    | Realized")
print("                                     Probability     Defect  ")
print("                                     of Defect               ")

for i in range(len(D)):
    print(f'{str(i).zfill(2)}   | {str(simulations[:10, i])[:-1] + "...]".ljust(6)}  | {round(posterior_probability[i], 2)}          | {int(D[i])}')

Output:

    Obs. | Arraya of Simulated Defects | posteriror    | Realized
                                         Probability     Defect  
                                         of Defect               
    00   | [0 0 0 0 0 1 1 0 0 0...]    | 0.39          | 0
    01   | [1 0 0 0 0 0 0 0 0 0...]    | 0.15          | 1
    02   | [0 0 0 0 0 0 0 0 0 0...]    | 0.19          | 0
    03   | [1 0 1 0 0 1 0 0 0 1...]    | 0.26          | 0
    04   | [0 0 0 0 1 0 0 0 1 0...]    | 0.31          | 0
    05   | [0 0 0 0 0 0 0 0 1 0...]    | 0.08          | 0
    06   | [0 0 0 0 1 0 0 0 0 0...]    | 0.06          | 0
    07   | [0 0 0 0 1 0 0 0 0 0...]    | 0.14          | 0
    08   | [1 1 1 1 0 1 1 1 1 1...]    | 0.92          | 1
    09   | [0 0 0 1 0 1 1 1 1 1...]    | 0.63          | 1
    10   | [0 0 0 0 0 0 0 0 0 0...]    | 0.14          | 1
    11   | [0 0 0 0 1 0 0 0 0 0...]    | 0.01          | 0
    12   | [1 0 0 0 0 0 0 0 0 0...]    | 0.32          | 0
    13   | [0 1 1 0 1 1 1 1 1 1...]    | 0.97          | 1
    14   | [0 1 1 0 0 0 1 1 0 0...]    | 0.31          | 0
    15   | [0 0 0 0 0 0 0 0 0 0...]    | 0.03          | 0
    16   | [0 0 0 0 0 0 0 0 0 0...]    | 0.14          | 0
    17   | [0 0 0 0 0 0 0 0 0 0...]    | 0.01          | 0
    18   | [0 0 0 0 0 0 0 0 0 0...]    | 0.03          | 0
    19   | [0 0 0 0 0 0 0 0 0 0...]    | 0.01          | 0
    20   | [0 0 0 0 0 0 0 0 0 0...]    | 0.03          | 1
    21   | [1 0 0 0 0 0 0 0 0 0...]    | 0.03          | 0
    22   | [1 1 1 0 1 1 1 1 1 1...]    | 0.89          | 1
ix = np.argsort(posterior_probability)

print("Posterior Probability of Defect | Realized Defect")

for i in range(len(D)):
    print(f"{round(posterior_probability[ix[i]], 2)}                            |   {int(D[ix[i]])}")

Output:

    Posterior Probability of Defect | Realized Defect
    0.01                            |   0
    0.01                            |   0
    0.01                            |   0
    0.03                            |   0
    0.03                            |   0
    0.03                            |   1
    0.03                            |   0
    0.06                            |   0
    0.08                            |   0
    0.14                            |   0
    0.14                            |   0
    0.14                            |   1
    0.15                            |   1
    0.19                            |   0
    0.26                            |   0
    0.31                            |   0
    0.31                            |   0
    0.32                            |   0
    0.39                            |   0
    0.63                            |   1
    0.89                            |   1
    0.92                            |   1
    0.97                            |   1
urlretrieve("https://git.io/vXtye", 'separation_plot.py')

Output:

    ('separation_plot.py', <http.client.HTTPMessage at 0x1fae3e2ca58>)
from separation_plot import separation_plot

figsize(22, 3)
separation_plot(posterior_probability, D)

f:id:koheitsutsumi223:20190323202047p:plain

posterior_probability.shape[0]

Output:

    23
figsize(22, 2.5)

p = D
separation_plot(p, D)
plt.title("Perfect model")

p = np.random.rand(23)
separation_plot(p, D)
plt.title("Random model")

constant_prob = 7. / 23 * np.ones(23)
separation_plot(constant_prob, D)
plt.title("Constant-prediction model")

Output:

    Text(0.5, 1.0, 'Constant-prediction model')

f:id:koheitsutsumi223:20190323202051p:plain

f:id:koheitsutsumi223:20190323202055p:plain

f:id:koheitsutsumi223:20190323202058p:plain

 今回の写経は以上です。 ここまで読んでいただきありがとうございました。