Real-time speech recognition with faster-whisper (MacBook) | Python

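Introduction

This article covers faster-whisper, a library for running speech-to-text locally.
The focus this time is real-time speech recognition.

It presents a faster-whisper version of the program from my earlier article, "Real-time speech recognition with whisper (MacBook) | Python".
For a walkthrough of the program, see that earlier article.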

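What is faster-whisper?

GitHub repository: https://github.com/SYSTRAN/faster-whisper

In short, it is a reimplementation of OpenAI's whisper (built on CTranslate2) that uses less memory and runs up to 4 times faster at the same accuracy.

This article does not benchmark it against the original whisper, so I cannot say how much faster it is in practice.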
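For anyone who wants to measure the speed-up themselves, a minimal timing sketch could look like the following. `time_transcription` and the stand-in `fake_transcribe` are hypothetical names, not part of either library. One thing to watch: faster-whisper's transcribe() returns a lazy generator, so the segments have to be consumed inside the timed region, otherwise the measurement covers only the setup.

```python
import time


def time_transcription(transcribe_fn, audio):
    """Return (texts, elapsed_seconds) for one transcription call.

    transcribe_fn is assumed to return an iterable of objects with a
    .text attribute (a hypothetical wrapper around the real library call).
    """
    start = time.perf_counter()
    # Consume the iterable inside the timed region so that lazy
    # generators are actually executed before the clock stops.
    texts = [segment.text for segment in transcribe_fn(audio)]
    elapsed = time.perf_counter() - start
    return texts, elapsed


if __name__ == "__main__":
    from types import SimpleNamespace

    def fake_transcribe(audio):
        # Stand-in for a real model call; yields one dummy segment.
        time.sleep(0.01)
        yield SimpleNamespace(text="dummy")

    texts, elapsed = time_transcription(fake_transcribe, audio=None)
    print(texts, f"{elapsed:.3f}s")
```

To compare the two libraries, the same audio array would be passed through one wrapper per library and the elapsed times compared over several runs.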

Installation

$ pip3 install faster-whisper

According to the official repository, FFmpeg does not need to be installed, unlike with the original whisper.
Python 3.9 or later is required.
(My environment is 3.12.7.)
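Before running the program, the environment can be checked with a short script like the one below. This is a sketch using only the standard library; `python_ok` and `missing_modules` are hypothetical helper names, and the module list assumes the three packages this article's program imports (faster_whisper, numpy, pyaudio).

```python
import importlib.util
import sys

REQUIRED_PYTHON = (3, 9)
# Import names of the packages this article's program uses.
REQUIRED_MODULES = ("faster_whisper", "numpy", "pyaudio")


def python_ok(version_info=None):
    """True if the interpreter version is at least REQUIRED_PYTHON."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= REQUIRED_PYTHON


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found on this system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python OK:", python_ok())
    print("Missing modules:", missing_modules())
```

An empty list from `missing_modules()` means all three packages can be imported.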

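Source code

The program is explained in my earlier article, "Real-time speech recognition with whisper (MacBook) | Python", so please refer to that.
Besides faster-whisper, numpy and pyaudio also need to be installed.

The main changes are:

WhisperModel is imported on line 1
The language is specified with the code "ja" (the original-whisper version used "japanese")
The arguments and return value of the WhisperModel.transcribe() method
How the inference results are printed

The full source code is shown below.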

from faster_whisper import WhisperModel
import pyaudio
import numpy as np
import threading
import queue
import time


class WhisperTranscriber:
    def __init__(self, model_size='base', language='ja'):
        """
        Whisper speech recognition class
        """
        self.model = WhisperModel(model_size, device="cpu")

        # Audio input settings
        self.pyaudio = pyaudio.PyAudio()
        self.sample_rate = 16000
        self.chunk_duration = 3
        self.chunks_per_inference = int(self.sample_rate * self.chunk_duration)
        
        # Queue for passing audio between threads
        self.audio_queue = queue.Queue()
        self.stop_event = threading.Event()
        
        self.language = language

    def transcribe_audio(self):
        """
        Real-time speech transcription
        """
        if self.model is None:
            print("Failed to initialize the model")
            return

        audio_data = []
        while not self.stop_event.is_set():
            try:
                chunk = self.audio_queue.get(timeout=1)
                audio_data.extend(chunk)

                if len(audio_data) >= self.chunks_per_inference:
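                    # Convert the buffered samples to a float32 NumPy array
                    audio_np = np.array(
                        audio_data[:self.chunks_per_inference], dtype=np.float32
                    )

                    # Run inference (returns a lazy segment generator and info)
                    results, info = self.model.transcribe(
                        audio_np,
                        language=self.language
                    )

                    # Iterating the generator is what performs the decoding
                    for result in results:
                        print(f"Transcription: {result.text}")
                    audio_data = audio_data[self.chunks_per_inference:]

            except queue.Empty:
                continue
            except Exception as e:
                print(f"Error during inference: {e}")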

    def audio_callback(self, in_data, frame_count, time_info, status):
        """
        Callback that receives audio data from PyAudio
        """
        audio_chunk = np.frombuffer(in_data, dtype=np.float32)
        self.audio_queue.put(audio_chunk)
        return (None, pyaudio.paContinue)

    def start_recording(self):
        """
        Start audio recording
        """
        self.stream = self.pyaudio.open(
            format=pyaudio.paFloat32,
            channels=1,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.chunks_per_inference,
            stream_callback=self.audio_callback,
        )
        print("Recording started")

    def run(self):
        """
        Run the speech recognition process
        """
        self.start_recording()
        transcribe_thread = threading.Thread(target=self.transcribe_audio)
        transcribe_thread.start()

        try:
            transcribe_thread.join()
        except KeyboardInterrupt:
            print("\nStopping speech recognition")
            self.stop_event.set()
            self.stream.stop_stream()
            self.stream.close()
            self.pyaudio.terminate()

def main():
    transcriber = WhisperTranscriber(
        model_size='small',  
        language='ja'
    )
    transcriber.run()

if __name__ == "__main__":
    main()

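When run, the program prints a recognition result to standard output every 3 seconds.
This example uses the small model.
On the first run the model is downloaded, which takes a few minutes.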
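The 3-second cadence comes from buffering: the callback delivers chunks, and inference runs once sample_rate × chunk_duration = 16000 × 3 = 48000 samples have accumulated. The sketch below isolates that logic in a hypothetical helper, `SampleBuffer` (not part of the program above), using plain Python lists so it can be tried without a microphone.

```python
class SampleBuffer:
    """Accumulate audio samples and hand them out in fixed-size windows."""

    def __init__(self, sample_rate=16000, window_seconds=3):
        self.window_size = int(sample_rate * window_seconds)
        self._samples = []

    def add(self, chunk):
        """Append a chunk (any iterable of samples) to the buffer."""
        self._samples.extend(chunk)

    def pop_windows(self):
        """Yield every complete window, leaving the remainder buffered."""
        while len(self._samples) >= self.window_size:
            window = self._samples[:self.window_size]
            self._samples = self._samples[self.window_size:]
            yield window


if __name__ == "__main__":
    buf = SampleBuffer(sample_rate=16000, window_seconds=3)
    buf.add([0.0] * 50000)  # slightly more than one window
    windows = list(buf.pop_windows())
    print(len(windows), len(windows[0]))
```

Unlike the program above, which checks the buffer once per incoming chunk, this sketch drains every complete window in a loop, so it would not fall behind if several windows arrived at once.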

How transcribe() differs from the original whisper

The original whisper's transcribe() takes an fp16 argument, but faster-whisper's does not, so be careful.
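In faster-whisper, numeric precision is chosen when the model is constructed, through the compute_type argument of WhisperModel, rather than on each transcribe() call. A minimal sketch follows. `choose_compute_type` and `load_model` are hypothetical helpers, and the mapping of int8 on CPU and float16 on GPU is an assumption taken from the examples in the project's README, not something tested in this article.

```python
def choose_compute_type(device):
    """Pick a compute_type string for the given device (assumed mapping)."""
    if device == "cuda":
        return "float16"
    return "int8"


def load_model(model_size="small", device="cpu"):
    """Build a WhisperModel with an explicit compute_type."""
    # Imported lazily so the helper above can be used without the package.
    from faster_whisper import WhisperModel

    return WhisperModel(
        model_size,
        device=device,
        compute_type=choose_compute_type(device),
    )


if __name__ == "__main__":
    print(choose_compute_type("cpu"))
```

The program in this article passes only device="cpu" and leaves compute_type at its default.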

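Note that faster-whisper's transcribe() returns a tuple.
This program receives it in the variables results and info.

Of the two, results is a generator object, so take care with how you output it.
Here, a for loop assigns each item to the variable result and prints it.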
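Because the first element of the tuple is a generator, it can be iterated only once, and the actual decoding happens as it is consumed. The sketch below shows one way to collect the text, using stand-in segment objects so that it runs without a model. `join_segments` and `fake_segments` are hypothetical names. Real segments expose start and end times alongside text.

```python
from types import SimpleNamespace


def join_segments(segments):
    """Consume a segment iterable once and return the concatenated text."""
    return "".join(segment.text for segment in segments)


def fake_segments():
    """Stand-in for the generator returned by transcribe()."""
    yield SimpleNamespace(start=0.0, end=1.5, text="こんにちは")
    yield SimpleNamespace(start=1.5, end=3.0, text="世界")


if __name__ == "__main__":
    segments = fake_segments()
    print(join_segments(segments))
    # A second pass finds the generator exhausted and yields nothing.
    print(repr(join_segments(segments)))
```

If the segments are needed more than once, converting them with list(segments) first keeps them available for later passes.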

faster-whisper has many other arguments, such as vad_filter, which controls whether silent segments are filtered out. I plan to look into these later.
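As a starting point for exploring vad_filter, the sketch below wraps a transcribe() call with voice activity detection enabled. The vad_filter and vad_parameters arguments (with min_silence_duration_ms) appear in the project's README. The wrapper `transcribe_with_vad`, its 500 ms default, and the `model` argument are assumptions for illustration.

```python
def transcribe_with_vad(model, audio, language="ja", min_silence_ms=500):
    """Transcribe with silence filtering and return the segment texts.

    model is assumed to be a faster_whisper.WhisperModel instance.
    """
    segments, info = model.transcribe(
        audio,
        language=language,
        vad_filter=True,
        vad_parameters={"min_silence_duration_ms": min_silence_ms},
    )
    # The segments generator is lazy; consuming it runs the inference.
    return [segment.text for segment in segments]
```

In the program above, this could stand in for the direct self.model.transcribe() call, with audio_np passed as the audio argument.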

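Conclusion

This article presented a program for real-time speech recognition with faster-whisper.

In use, I could not feel a difference from the original whisper.
The difference may be noticeable with longer audio.
I have also not measured or compared memory usage, so I may do that at some point.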

コバヤシ・ノート
