What Python's len returns - 記録（e_c_e

# Mac OS X
# The locale here is ja_JP.UTF-8
# The str literal holds UTF-8 bytes, so len returns the byte count, 3
>>> len("あ")
3
>>> "あ"
'\xe3\x81\x82'
# Adding the u prefix makes it a unicode string
# In that case, len returns the number of characters
>>> len(u"あ")
1
>>> u"あ"
u'\u3042'
# encode returns a str (a byte string) in the specified encoding
# In that case, len returns the byte count
>>> len(u"あ".encode("cp932"))
2
>>> u"あ".encode("cp932")
'\x82\xa0'
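
The three cases above can be condensed into a small script. This is only a sketch: it writes the character as the escape `\u3042` so the result does not depend on the terminal or source encoding, and the variable names are made up for illustration. It should behave the same on Python 2.7 and Python 3.3+.

```python
# Sketch: character count vs. byte count for the same text.
text = u"\u3042"                     # HIRAGANA LETTER A as text (unicode in Python 2)
utf8_bytes = text.encode("utf-8")    # the same character as UTF-8 bytes
cp932_bytes = text.encode("cp932")   # the same character as cp932 bytes

char_count = len(text)               # counts characters
utf8_size = len(utf8_bytes)          # counts bytes
cp932_size = len(cp932_bytes)        # counts bytes

print(char_count, utf8_size, cp932_size)
```

The point is that len never looks at the encoding: it counts whatever units the object holds, characters for text and bytes for an encoded string.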

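I forgot this today and got stuck.
All I wanted was to take a cp932 file, do a replacement in cp932, and save it as cp932,
and yet...
Before I forget again, here is a quick review of the difference between Python's str and unicode.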
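
For reference, one way to do that kind of cp932 round trip is to decode on read, replace on the decoded text, and encode again on write. The sketch below is not the script I actually used; `replace_in_cp932_file` and the demo file contents are hypothetical. It relies on `io.open`, which exists in Python 2.6+ and Python 3, and in Python 2 the `old` and `new` arguments need to be unicode objects.

```python
import io
import os
import tempfile


def replace_in_cp932_file(path, old, new, encoding="cp932"):
    """Hypothetical helper: decode on read, replace on text, encode on write."""
    # newline="" leaves the file's original line endings untouched
    with io.open(path, "r", encoding=encoding, newline="") as f:
        text = f.read()              # decoded text, not bytes
    replaced = text.replace(old, new)
    with io.open(path, "w", encoding=encoding, newline="") as f:
        f.write(replaced)            # encoded back to cp932 on write
    return replaced


# Small demo on a temporary file with made-up contents.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(demo_path, "wb") as f:
    f.write(u"\u3042\u3044\r\n".encode("cp932"))

replace_in_cp932_file(demo_path, u"\u3042", u"\u3046")

with io.open(demo_path, "rb") as f:
    data = f.read()
os.remove(demo_path)
```

Doing the replacement on decoded text avoids matching a byte sequence that straddles two multibyte characters, which can happen when replacing directly on cp932 bytes.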

>>> import sys
# The Python version in use
>>> print sys.version
2.7.2 (default, Oct 11 2012, 20:14:37) 
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)]
# The encoding of standard input is UTF-8
>>> print sys.stdin.encoding
UTF-8

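# The left side is a str of UTF-8 bytes and the right side is a unicode string.
# The comparison tries to decode the str implicitly as ASCII, which fails, so Python warns and treats the two as unequal.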
>>> "あ" == u"あ"
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False

# Check the types (entering text as UTF-8 does not make it a unicode object)
>>> type("あ")
<type 'str'>
>>> type(u"あ")
<type 'unicode'>

# Converting to unicode without naming an encoding raises an error (the default codec is ASCII)
>>> unicode("あ") == u"あ"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
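
The same failure can be reproduced without the interactive prompt by decoding the UTF-8 bytes as ASCII explicitly, which is what the bare unicode() call does with Python 2's default settings. A minimal sketch, with illustrative variable names, that runs on both Python 2.7 and Python 3:

```python
# Sketch: decoding UTF-8 bytes with the ASCII codec fails on the first byte.
raw = b"\xe3\x81\x82"                # UTF-8 bytes of U+3042

try:
    raw.decode("ascii")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    bad_position = exc.start         # index of the first undecodable byte
    reason = exc.reason              # the codec's explanation

print(failed)
```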

# To convert to unicode, you have to state that the str is encoded as UTF-8
>>> unicode("あ", "UTF-8") == u"あ"
True
# These are equivalent (every operand is a unicode object)
>>> unicode("あ", "UTF-8") == "あ".decode("UTF-8") == u"あ"
True
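
To avoid repeating the explicit decode everywhere, the conversion can be wrapped in a small function. This is a sketch, and `to_text` is a hypothetical helper rather than anything from the standard library. It uses `bytes`, which is just another name for str in Python 2.7, so the same code also runs on Python 3.

```python
def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return text, decoding byte strings explicitly."""
    if isinstance(value, bytes):     # bytes is an alias of str in Python 2
        return value.decode(encoding)
    return value                     # already text, leave it alone


utf8_raw = b"\xe3\x81\x82"           # U+3042 encoded as UTF-8
cp932_raw = b"\x82\xa0"              # U+3042 encoded as cp932

same_utf8 = to_text(utf8_raw) == u"\u3042"
same_cp932 = to_text(cp932_raw, "cp932") == u"\u3042"

print(same_utf8, same_cp932)
```

Once both sides are text, the comparison no longer depends on which encoding the bytes arrived in.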

To be continued tomorrow.