
A short summary of NumPy's genfromtxt

Overview

This post summarizes the basic usage of NumPy's genfromtxt (documentation: Importing data with genfromtxt). The running example is the genfromtxt call used in Chapter 2, Section 2.2.10, "Example: The Space Shuttle Challenger Disaster", of Pythonで体験するベイズ推論 PyMCによるMCMC入門 (the Japanese edition of Bayesian Methods for Hackers).

Explanation

First, here is the code as it was used in the book.

import numpy as np

challenger_data = np.genfromtxt("data/challenger_data.csv",
                                skip_header=1, usecols=[1, 2],
                                missing_values="NA",
                                delimiter=",")

The data/challenger_data.csv file contains the following:

Date,Temperature,Damage Incident
04/12/1981,66,0
11/12/1981,70,1
3/22/82,69,0
6/27/82,80,NA
01/11/1982,68,0
04/04/1983,67,0
6/18/83,72,0
8/30/83,73,0
11/28/83,70,0
02/03/1984,57,1
04/06/1984,63,1
8/30/84,70,1
10/05/1984,78,0
11/08/1984,67,0
1/24/85,53,1
04/12/1985,67,0
4/29/85,75,0
6/17/85,70,0
7/29/85,81,0
8/27/85,76,0
10/03/1985,79,0
10/30/85,75,1
11/26/85,76,0
01/12/1986,58,1
1/28/86,31,Challenger Accident

The challenger_data generated by this code is the following array object.

challenger_data

Output:

    array([[ 66.,   0.],
           [ 70.,   1.],
           [ 69.,   0.],
           [ 80.,  nan],
           [ 68.,   0.],
           [ 67.,   0.],
           [ 72.,   0.],
           [ 73.,   0.],
           [ 70.,   0.],
           [ 57.,   1.],
           [ 63.,   1.],
           [ 70.,   1.],
           [ 78.,   0.],
           [ 67.,   0.],
           [ 53.,   1.],
           [ 67.,   0.],
           [ 75.,   0.],
           [ 70.,   0.],
           [ 81.,   0.],
           [ 76.,   0.],
           [ 79.,   0.],
           [ 75.,   1.],
           [ 76.,   0.],
           [ 58.,   1.],
           [ 31.,  nan]])

The help for np.genfromtxt

First, let's take a look at the help for np.genfromtxt.

np.genfromtxt?

Output:

Signature:
np.genfromtxt(
    fname,
    dtype=<class 'float'>,
    comments='#',
    delimiter=None,
    skip_header=0,
    skip_footer=0,
    converters=None,
    missing_values=None,
    filling_values=None,
    usecols=None,
    names=None,
    excludelist=None,
    deletechars=None,
    replace_space='_',
    autostrip=False,
    case_sensitive=True,
    defaultfmt='f%i',
    unpack=None,
    usemask=False,
    loose=True,
    invalid_raise=True,
    max_rows=None,
)
Docstring:
Load data from a text file, with missing values handled as specified.

Each line past the first `skip_header` lines is split at the `delimiter`
character, and characters following the `comments` character are discarded.

Parameters
----------
fname : file, str, list of str, generator
    File, filename, list, or generator to read.  If the filename
    extension is `.gz` or `.bz2`, the file is first decompressed. Note
    that generators must return byte strings in Python 3k.  The strings
    in a list or produced by a generator are treated as lines.
dtype : dtype, optional
    Data type of the resulting array.
    If None, the dtypes will be determined by the contents of each
    column, individually.
comments : str, optional
    The character used to indicate the start of a comment.
    All the characters occurring on a line after a comment are discarded
delimiter : str, int, or sequence, optional
    The string used to separate values.  By default, any consecutive
    whitespaces act as delimiter.  An integer or sequence of integers
    can also be provided as width(s) of each field.
skiprows : int, optional
    `skiprows` was removed in numpy 1.10. Please use `skip_header` instead.
skip_header : int, optional
    The number of lines to skip at the beginning of the file.
skip_footer : int, optional
    The number of lines to skip at the end of the file.
converters : variable, optional
    The set of functions that convert the data of a column to a value.
    The converters can also be used to provide a default value
    for missing data: ``converters = {3: lambda s: float(s or 0)}``.
missing : variable, optional
    `missing` was removed in numpy 1.10. Please use `missing_values`
    instead.
missing_values : variable, optional
    The set of strings corresponding to missing data.
filling_values : variable, optional
    The set of values to be used as default when the data are missing.
usecols : sequence, optional
    Which columns to read, with 0 being the first.  For example,
    ``usecols = (1, 4, 5)`` will extract the 2nd, 5th and 6th columns.
names : {None, True, str, sequence}, optional
    If `names` is True, the field names are read from the first valid line
    after the first `skip_header` lines.
    If `names` is a sequence or a single-string of comma-separated names,
    the names will be used to define the field names in a structured dtype.
    If `names` is None, the names of the dtype fields will be used, if any.
excludelist : sequence, optional
    A list of names to exclude. This list is appended to the default list
    ['return','file','print']. Excluded names are appended an underscore:
    for example, `file` would become `file_`.
deletechars : str, optional
    A string combining invalid characters that must be deleted from the
    names.
defaultfmt : str, optional
    A format used to define default field names, such as "f%i" or "f_%02i".
autostrip : bool, optional
    Whether to automatically strip white spaces from the variables.
replace_space : char, optional
    Character(s) used in replacement of white spaces in the variables
    names. By default, use a '_'.
case_sensitive : {True, False, 'upper', 'lower'}, optional
    If True, field names are case sensitive.
    If False or 'upper', field names are converted to upper case.
    If 'lower', field names are converted to lower case.
unpack : bool, optional
    If True, the returned array is transposed, so that arguments may be
    unpacked using ``x, y, z = loadtxt(...)``
usemask : bool, optional
    If True, return a masked array.
    If False, return a regular array.
loose : bool, optional
    If True, do not raise errors for invalid values.
invalid_raise : bool, optional
    If True, an exception is raised if an inconsistency is detected in the
    number of columns.
    If False, a warning is emitted and the offending lines are skipped.
max_rows : int,  optional
    The maximum number of rows to read. Must not be used with skip_footer
    at the same time.  If given, the value must be at least 1. Default is
    to read the entire file.

    .. versionadded:: 1.10.0

Returns
-------
out : ndarray
    Data read from the text file. If `usemask` is True, this is a
    masked array.

See Also
--------
numpy.loadtxt : equivalent function when no data is missing.

Notes
-----
* When spaces are used as delimiters, or when no delimiter has been given
  as input, there should not be any missing data between two fields.
* When the variables are named (either by a flexible dtype or with `names`,
  there must not be any header in the file (else a ValueError
  exception is raised).
* Individual values are not stripped of spaces by default.
  When using a custom converter, make sure the function does remove spaces.

References
----------
.. [1] Numpy User Guide, section `I/O with Numpy
       <http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html>`_.

Examples
---------
>>> from io import StringIO
>>> import numpy as np

Comma delimited file with mixed dtype

>>> s = StringIO("1,1.3,abcde")
>>> data = np.genfromtxt(s, dtype=[('myint','i8'),('myfloat','f8'),
... ('mystring','S5')], delimiter=",")
>>> data
array((1, 1.3, 'abcde'),
      dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])

Using dtype = None

>>> s.seek(0) # needed for StringIO example only
>>> data = np.genfromtxt(s, dtype=None,
... names = ['myint','myfloat','mystring'], delimiter=",")
>>> data
array((1, 1.3, 'abcde'),
      dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])

Specifying dtype and names

>>> s.seek(0)
>>> data = np.genfromtxt(s, dtype="i8,f8,S5",
... names=['myint','myfloat','mystring'], delimiter=",")
>>> data
array((1, 1.3, 'abcde'),
      dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])

An example with fixed-width columns

>>> s = StringIO("11.3abcde")
>>> data = np.genfromtxt(s, dtype=None, names=['intvar','fltvar','strvar'],
...     delimiter=[1,3,5])
>>> data
array((1, 1.3, 'abcde'),
      dtype=[('intvar', '<i8'), ('fltvar', '<f8'), ('strvar', '|S5')])
File:      c:\users\hoge\anaconda3\lib\site-packages\numpy\lib\npyio.py
Type:      function

It would be too much to explain every argument, so below I cover only the arguments used in the code above.

The fname argument

First, the fname argument. As you can see from the example code, fname specifies the file to read.

You can also pass an object that returns strings from a read method, such as a file object.

Reference: Importing data with genfromtxt

For example, the following produces the same result, although probably few people would bother doing it this way. Incidentally, unless the file was opened in binary mode (with the 'b' flag), genfromtxt raised an error.

with open('data/challenger_data.csv', 'rb') as cin:
    challenger_data_2 = np.genfromtxt(cin,
                                      skip_header=1, usecols=[1, 2],
                                      missing_values="NA",
                                      delimiter=",")
challenger_data_2

Output:

    array([[ 66.,   0.],
           [ 70.,   1.],
           [ 69.,   0.],
           [ 80.,  nan],
           [ 68.,   0.],
           [ 67.,   0.],
           [ 72.,   0.],
           [ 73.,   0.],
           [ 70.,   0.],
           [ 57.,   1.],
           [ 63.,   1.],
           [ 70.,   1.],
           [ 78.,   0.],
           [ 67.,   0.],
           [ 53.,   1.],
           [ 67.,   0.],
           [ 75.,   0.],
           [ 70.,   0.],
           [ 81.,   0.],
           [ 76.,   0.],
           [ 79.,   0.],
           [ 75.,   1.],
           [ 76.,   0.],
           [ 58.,   1.],
           [ 31.,  nan]])
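
As a small aside (not from the book): anything with a read method should work as fname, so something like io.StringIO can also be passed. The following is a minimal sketch with toy data; depending on the NumPy version, byte strings (io.BytesIO) may be needed instead, as with the 'rb' mode above.

from io import StringIO

# Hedged sketch: an in-memory text buffer passed as fname.
# "NA" should come out as nan, as in the challenger example.
s = StringIO("1,2,NA\n3,4,5")
np.genfromtxt(s, delimiter=",", missing_values="NA")
# Expected: a 2x3 float array with nan in place of the "NA" cell.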

The skip_header argument

The skip_header argument specifies how many lines to skip from the top of the file. In the example below, the first 10 lines are skipped, so challenger_data_3 holds fewer records than challenger_data, which was read with skip_header=1.

challenger_data_3 = np.genfromtxt("data/challenger_data.csv",
                                  skip_header=10, usecols=[1, 2],
                                  missing_values="NA",
                                  delimiter=",")
challenger_data_3

Output:

    array([[ 57.,   1.],
           [ 63.,   1.],
           [ 70.,   1.],
           [ 78.,   0.],
           [ 67.,   0.],
           [ 53.,   1.],
           [ 67.,   0.],
           [ 75.,   0.],
           [ 70.,   0.],
           [ 81.,   0.],
           [ 76.,   0.],
           [ 79.,   0.],
           [ 75.,   1.],
           [ 76.,   0.],
           [ 58.,   1.],
           [ 31.,  nan]])
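
As another aside (not part of the book's code): the help above also lists a skip_footer argument, which works the same way but from the end of the file. A hedged sketch: it could be used to drop the final "Challenger Accident" line instead of letting it become nan.

# Hedged sketch: skip_footer=1 should drop the last data line
# ("1/28/86,31,Challenger Accident").
challenger_data_no_tail = np.genfromtxt("data/challenger_data.csv",
                                        skip_header=1, skip_footer=1,
                                        usecols=[1, 2],
                                        missing_values="NA",
                                        delimiter=",")
challenger_data_no_tail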

The usecols argument

The usecols argument specifies which columns to use. The example passes usecols=[1, 2]. Looking at challenger_data.csv, the first column contains the date, which is ignored here.

Incidentally, the data appear to be returned in the order the columns are listed, so you can also swap the column order as follows.

challenger_data_4 = np.genfromtxt("data/challenger_data.csv",
                                  skip_header=1, usecols=[2, 1],
                                  missing_values="NA",
                                  delimiter=",")
challenger_data_4

Output:

    array([[  0.,  66.],
           [  1.,  70.],
           [  0.,  69.],
           [ nan,  80.],
           [  0.,  68.],
           [  0.,  67.],
           [  0.,  72.],
           [  0.,  73.],
           [  0.,  70.],
           [  1.,  57.],
           [  1.,  63.],
           [  1.,  70.],
           [  0.,  78.],
           [  0.,  67.],
           [  1.,  53.],
           [  0.,  67.],
           [  0.,  75.],
           [  0.,  70.],
           [  0.,  81.],
           [  0.,  76.],
           [  0.,  79.],
           [  1.,  75.],
           [  0.,  76.],
           [  1.,  58.],
           [ nan,  31.]])
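
According to the NumPy user guide, usecols also accepts negative indices, counted from the right. A hedged sketch (not from the book):

# Hedged sketch: -1 should refer to the last column (Damage Incident),
# and selecting a single column should give a 1-D float array.
damage_only = np.genfromtxt("data/challenger_data.csv",
                            skip_header=1, usecols=[-1],
                            missing_values="NA",
                            delimiter=",")
damage_only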

The missing_values and filling_values arguments

missing_values specifies which strings represent missing data in the input. In challenger_data.csv, missing values are written as the string NA, hence missing_values="NA". However, the last line of challenger_data.csv contains the string Challenger Accident, and that value was also converted to nan. I then ran the same call without specifying missing_values and got the same result, so there seems to be some implicit handling that performs a certain amount of conversion on its own. (Presumably this is because, with the default dtype of float, any cell that cannot be parsed as a float ends up as nan.)

challenger_data_5 = np.genfromtxt("data/challenger_data.csv",
                                  skip_header=1, usecols=[1, 2],
                                  delimiter=",")
challenger_data_5

Output:

    array([[ 66.,   0.],
           [ 70.,   1.],
           [ 69.,   0.],
           [ 80.,  nan],
           [ 68.,   0.],
           [ 67.,   0.],
           [ 72.,   0.],
           [ 73.,   0.],
           [ 70.,   0.],
           [ 57.,   1.],
           [ 63.,   1.],
           [ 70.,   1.],
           [ 78.,   0.],
           [ 67.,   0.],
           [ 53.,   1.],
           [ 67.,   0.],
           [ 75.,   0.],
           [ 70.,   0.],
           [ 81.,   0.],
           [ 76.,   0.],
           [ 79.,   0.],
           [ 75.,   1.],
           [ 76.,   0.],
           [ 58.,   1.],
           [ 31.,  nan]])
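
One way to check which cells genfromtxt actually treated as missing is the usemask argument listed in the help above. A hedged sketch (not from the book): with usemask=True a masked array is returned, and its mask should flag the cells matched by missing_values.

# Hedged sketch: usemask=True should return a numpy masked array.
# Entries matched by missing_values ("NA") should be masked, while other
# unparseable strings (e.g. "Challenger Accident") may simply end up as
# nan without being masked. Behaviour can vary between NumPy versions.
challenger_masked = np.genfromtxt("data/challenger_data.csv",
                                  skip_header=1, usecols=[1, 2],
                                  missing_values="NA",
                                  delimiter=",",
                                  usemask=True)
challenger_masked.mask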

Data judged to be missing by the missing_values argument appear to be replaced with the value given by the filling_values argument. It is not specified in this example, and here too some implicit processing seems to run behind the scenes.

It seems the fill value must be compatible with the column's data type; passing a value of an invalid type to filling_values raises an error.

challenger_data_6 = np.genfromtxt("data/challenger_data.csv",
                                  skip_header=1, usecols=[1, 2],
                                  missing_values='NA', filling_values='missing_value',
                                  delimiter=",")
challenger_data_6

Error:

    ---------------------------------------------------------------------------

    ValueError                                Traceback (most recent call last)

    <ipython-input-9-3069a5bf63ec> in <module>
          2                                   skip_header=1, usecols=[1, 2],
          3                                   missing_values='NA', filling_values='missing_value',
    ----> 4                                   delimiter=",")
          5 challenger_data_6
    

    ~\Anaconda3\lib\site-packages\numpy\lib\npyio.py in genfromtxt(fname, dtype, comments, delimiter, skip_header, skip_footer, converters, missing_values, filling_values, usecols, names, excludelist, deletechars, replace_space, autostrip, case_sensitive, defaultfmt, unpack, usemask, loose, invalid_raise, max_rows)
       1868                         dtype = np.dtype(ttype)
       1869             #
    -> 1870             output = np.array(data, dtype)
       1871             if usemask:
       1872                 if dtype.names:
    

    ValueError: could not convert string to float: 'missing_value'

A float value can be set without any problem.

challenger_data_6 = np.genfromtxt("data/challenger_data.csv",
                                  skip_header=1, usecols=[1, 2],
                                  missing_values='NA', filling_values=1000.,
                                  delimiter=",")
challenger_data_6

Output:

    array([[   66.,     0.],
           [   70.,     1.],
           [   69.,     0.],
           [   80.,  1000.],
           [   68.,     0.],
           [   67.,     0.],
           [   72.,     0.],
           [   73.,     0.],
           [   70.,     0.],
           [   57.,     1.],
           [   63.,     1.],
           [   70.,     1.],
           [   78.,     0.],
           [   67.,     0.],
           [   53.,     1.],
           [   67.,     0.],
           [   75.,     0.],
           [   70.,     0.],
           [   81.,     0.],
           [   76.,     0.],
           [   79.,     0.],
           [   75.,     1.],
           [   76.,     0.],
           [   58.,     1.],
           [   31.,  1000.]])
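
According to the NumPy user guide (Importing data with genfromtxt), missing_values and filling_values can also be specified per column, for example as a dict keyed by column index. A small sketch with my own toy data (not from the book); the exact behaviour may depend on the NumPy version:

from io import StringIO

# Hedged sketch: per-column fill values via a dict keyed by column index.
# Wherever "NA" appears, column 0 should be filled with -1.0 and
# column 1 with -999.0.
s = StringIO("1,NA\nNA,4")
np.genfromtxt(s, delimiter=",",
              missing_values="NA",
              filling_values={0: -1.0, 1: -999.0})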

The delimiter argument

This specifies the string that separates the values in the data. In the example, the values are separated by commas (,).

Alternatively, passing a positive integer, or a list of positive integers, splits the data into fields of those fixed widths. I am not sure how useful that is, though.

challenger_data_7 = np.genfromtxt("data/challenger_data.csv",
                                  skip_header=1,
                                  missing_values="NA", dtype=str,
                                  delimiter=[8, 8, 8, 8])
challenger_data_7

Output:

    array([['04/12/19', '81,66,0\n', '', ''],
           ['11/12/19', '81,70,1\n', '', ''],
           ['3/22/82,', '69,0\n', '', ''],
           ['6/27/82,', '80,NA\n', '', ''],
           ['01/11/19', '82,68,0\n', '', ''],
           ['04/04/19', '83,67,0\n', '', ''],
           ['6/18/83,', '72,0\n', '', ''],
           ['8/30/83,', '73,0\n', '', ''],
           ['11/28/83', ',70,0\n', '', ''],
           ['02/03/19', '84,57,1\n', '', ''],
           ['04/06/19', '84,63,1\n', '', ''],
           ['8/30/84,', '70,1\n', '', ''],
           ['10/05/19', '84,78,0\n', '', ''],
           ['11/08/19', '84,67,0\n', '', ''],
           ['1/24/85,', '53,1\n', '', ''],
           ['04/12/19', '85,67,0\n', '', ''],
           ['4/29/85,', '75,0\n', '', ''],
           ['6/17/85,', '70,0\n', '', ''],
           ['7/29/85,', '81,0\n', '', ''],
           ['8/27/85,', '76,0\n', '', ''],
           ['10/03/19', '85,79,0\n', '', ''],
           ['10/30/85', ',75,1\n', '', ''],
           ['11/26/85', ',76,0\n', '', ''],
           ['01/12/19', '86,58,1\n', '', ''],
           ['1/28/86,', '31,Chall', 'enger Ac', 'cident\n']], 
          dtype='<U8')

That concludes the explanation. A lot of things I did not understand came up while writing this post, so I feel there is still plenty I need to learn.

While writing, I also realized that I need to know more about NumPy's data types. They are difficult.

Reference: Data type objects (dtype)