You can also use a library such as chardet to detect a file's encoding.

June 8, 2024

That's right. A library such as chardet can detect a file's encoding automatically. Here is an example using chardet.

First, install chardet.

bash
pip install chardet

Then the code can be modified as follows.

python
import chardet
import pandas as pd

# ... (other imports, etc.)
# `files` (pathlib.Path objects) and `data_list` are assumed to come from the existing code.

for file in files:
    with file.open('rb') as f:
        # Read the first part of the file and guess its encoding from that sample
        result = chardet.detect(f.read(10000))

    detected_encoding = result['encoding']  # may be None if nothing could be detected
    confidence = result['confidence']

    print(f"Detected encoding: {detected_encoding}, Confidence: {confidence}")

    try:
        # Retry the read with the detected encoding
        data_list.append(pd.read_csv(file, encoding=detected_encoding))
    except UnicodeDecodeError as e:
        print(f"UnicodeDecodeError: {e}")
        print("Failed to read CSV file even with detected encoding.")
    except pd.errors.EmptyDataError:
        print("CSV file is empty.")

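This code uses chardet.detect to read the first part of the file (up to 10,000 bytes) and guess the encoding from that sample. It then retries pd.read_csv with the detected encoding. Keep in mind that the confidence value may be low, that the encoding may not be uniquely determined, and that result['encoding'] can be None when nothing is detected, so those cases need handling too.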
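One way to handle a low-confidence or missing result is to fall back to a short list of likely encodings. The sketch below is only an illustration: the helper names `candidate_encodings` and `decode_with_fallback`, the 0.8 threshold, and the fallback list (`utf-8`, `cp932`, `euc_jp`, chosen on the assumption that the files are Japanese CSVs) are all hypothetical and not part of chardet or pandas. It takes a result dict shaped like the one chardet.detect returns and uses only the standard library.

```python
def candidate_encodings(result, min_confidence=0.8,
                        fallbacks=("utf-8", "cp932", "euc_jp")):
    """Return encodings to try, in order, for a chardet.detect-style result dict.

    The threshold and fallback list are illustrative assumptions.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0

    ordered = []
    if encoding and confidence >= min_confidence:
        ordered.append(encoding)   # trust a confident guess first
    ordered.extend(fallbacks)      # then the assumed common encodings
    if encoding and confidence < min_confidence:
        ordered.append(encoding)   # a weak guess is tried last

    # Drop duplicates (case-insensitively) while keeping the order
    seen = set()
    unique = []
    for name in ordered:
        if name.lower() not in seen:
            seen.add(name.lower())
            unique.append(name)
    return unique


def decode_with_fallback(raw, result):
    """Decode raw bytes with the first candidate encoding that works."""
    for name in candidate_encodings(result):
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding: try the next one
    raise ValueError("None of the candidate encodings could decode the data.")
```

The decoded text could then be handed to pandas through `pd.read_csv(io.StringIO(text))`. Note that a decode that succeeds is not guaranteed to be the right one, so the order of the fallback list matters and may need adjusting for your data.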


Posted by ぼっち