Back to articles
Tutorial

Python Tutorial: Batch Convert Text to Speech with Lovo AI API

Step-by-step guide to building a Python script that converts multiple text files to professional voiceovers using Lovo AI. Perfect for audiobooks, podcasts, and video narration.

10 min read

Why Batch Text-to-Speech?

If you need to convert large amounts of text to audio, doing it manually is painfully slow:

  • Audiobook chapters
  • Course narration
  • Video voiceovers
  • Podcast scripts
  • Accessibility audio versions of blog posts
  • Lovo AI offers one of the most natural-sounding TTS APIs available. In this tutorial, we will build a Python script that batch processes text files into professional audio.

    What We Are Building

    A command-line tool that:

    1. Takes a folder of `.txt` files

    2. Converts each to speech using Lovo AI

    3. Saves the audio files with proper naming

    4. Optionally merges them into a single audio file

    5. Generates a JSON manifest with timestamps

    Prerequisites

  • Python 3.9+
  • Lovo AI API key (get from genny.lovo.ai)
  • Basic Python knowledge
  • Step 1: Project Setup

    mkdir lovo-batch-tts
    cd lovo-batch-tts
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    pip install requests pydub python-dotenv

    Create a `.env` file:

    LOVO_API_KEY=your_api_key_here

    Step 2: The Lovo API Client

    Create `lovo_client.py`:

    import os
    import time
    import requests
    from typing import Optional, List, Dict
    from dataclasses import dataclass
    from dotenv import load_dotenv
    
    load_dotenv()
    
    API_BASE = "https://api.genny.lovo.ai/api/v1"
    API_KEY = os.getenv("LOVO_API_KEY")
    
    @dataclass
    class Voice:
        id: str
        display_name: str
        locale: str
        gender: str
    
    @dataclass
    class TTSResult:
        audio_url: str
        duration_ms: int
        word_timestamps: List[Dict]
    
    def get_headers() -> dict:
        return {
            "X-API-KEY": API_KEY,
            "Content-Type": "application/json"
        }
    
    def list_voices(locale: Optional[str] = None) -> List[Voice]:
        """Get available voices, optionally filtered by locale."""
        response = requests.get(
            f"{API_BASE}/speakers",
            headers=get_headers()
        )
        response.raise_for_status()
        
        voices = []
        for v in response.json()["data"]:
            if locale is None or v["locale"].startswith(locale):
                voices.append(Voice(
                    id=v["id"],
                    display_name=v["displayName"],
                    locale=v["locale"],
                    gender=v.get("gender", "unknown")
                ))
        return voices
    
    def generate_speech(
        text: str,
        voice_id: str,
        speed: float = 1.0,
        pitch: float = 1.0
    ) -> TTSResult:
        """Generate speech from text using specified voice."""
        
        # Step 1: Create TTS job
        create_response = requests.post(
            f"{API_BASE}/tts",
            headers=get_headers(),
            json={
                "speaker": voice_id,
                "text": text,
                "speed": speed,
                "pitch": pitch
            }
        )
        create_response.raise_for_status()
        job_id = create_response.json()["id"]
        
        # Step 2: Poll for completion
        max_attempts = 60  # 60 seconds max wait
        for _ in range(max_attempts):
            status_response = requests.get(
                f"{API_BASE}/tts/{job_id}",
                headers=get_headers()
            )
            status_response.raise_for_status()
            data = status_response.json()
            
            if data["status"] == "succeeded":
                return TTSResult(
                    audio_url=data["urls"][0],
                    duration_ms=data.get("duration", 0),
                    word_timestamps=data.get("wordTimestamps", [])
                )
            elif data["status"] == "failed":
                raise Exception(f"TTS generation failed: {data.get('error', 'Unknown error')}")
            
            time.sleep(1)
        
        raise Exception("TTS generation timed out")
    
    def download_audio(url: str, output_path: str) -> str:
        """Download audio file from URL."""
        response = requests.get(url)
        response.raise_for_status()
        
        with open(output_path, "wb") as f:
            f.write(response.content)
        
        return output_path

    Step 3: The Batch Processor

    Create `batch_tts.py`:

    import os
    import json
    import argparse
    from pathlib import Path
    from typing import List, Dict
    from dataclasses import dataclass, asdict
    from concurrent.futures import ThreadPoolExecutor, as_completed
    
    from lovo_client import list_voices, generate_speech, download_audio
    
    @dataclass
    class ProcessedFile:
        source_file: str
        audio_file: str
        duration_ms: int
        word_count: int
        success: bool
        error: Optional[str] = None
    
    def get_text_files(input_dir: str) -> List[Path]:
        """Get all .txt files from directory, sorted by name."""
        input_path = Path(input_dir)
        files = list(input_path.glob("*.txt"))
        return sorted(files, key=lambda f: f.name)
    
    def process_file(
        file_path: Path,
        voice_id: str,
        output_dir: Path,
        speed: float = 1.0
    ) -> ProcessedFile:
        """Process a single text file to audio."""
        try:
            # Read text content
            with open(file_path, "r", encoding="utf-8") as f:
                text = f.read().strip()
            
            if not text:
                return ProcessedFile(
                    source_file=str(file_path),
                    audio_file="",
                    duration_ms=0,
                    word_count=0,
                    success=False,
                    error="Empty file"
                )
            
            # Generate speech
            result = generate_speech(text, voice_id, speed=speed)
            
            # Download audio
            output_filename = file_path.stem + ".mp3"
            output_path = output_dir / output_filename
            download_audio(result.audio_url, str(output_path))
            
            return ProcessedFile(
                source_file=str(file_path),
                audio_file=str(output_path),
                duration_ms=result.duration_ms,
                word_count=len(text.split()),
                success=True
            )
        
        except Exception as e:
            return ProcessedFile(
                source_file=str(file_path),
                audio_file="",
                duration_ms=0,
                word_count=0,
                success=False,
                error=str(e)
            )
    
    def batch_process(
        input_dir: str,
        output_dir: str,
        voice_id: str,
        speed: float = 1.0,
        parallel: int = 3
    ) -> List[ProcessedFile]:
        """Process all text files in a directory."""
        
        input_path = Path(input_dir)
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)
        
        text_files = get_text_files(input_dir)
        print(f"Found {len(text_files)} text files to process")
        
        results: List[ProcessedFile] = []
        
        # Process files in parallel
        with ThreadPoolExecutor(max_workers=parallel) as executor:
            futures = {
                executor.submit(
                    process_file, f, voice_id, output_path, speed
                ): f for f in text_files
            }
            
            for future in as_completed(futures):
                file_path = futures[future]
                result = future.result()
                results.append(result)
                
                if result.success:
                    print(f"Processed: {file_path.name} ({result.duration_ms}ms)")
                else:
                    print(f"Failed: {file_path.name} - {result.error}")
        
        # Sort results by original file order
        results.sort(key=lambda r: r.source_file)
        
        return results
    
    def merge_audio_files(
        results: List[ProcessedFile],
        output_path: str
    ) -> str:
        """Merge all audio files into a single file."""
        from pydub import AudioSegment
        
        combined = AudioSegment.empty()
        
        # Add 500ms silence between segments
        silence = AudioSegment.silent(duration=500)
        
        for result in results:
            if result.success and result.audio_file:
                audio = AudioSegment.from_mp3(result.audio_file)
                combined += audio + silence
        
        combined.export(output_path, format="mp3")
        print(f"Merged audio saved to: {output_path}")
        return output_path
    
    def save_manifest(results: List[ProcessedFile], output_path: str):
        """Save processing results as JSON manifest."""
        manifest = {
            "total_files": len(results),
            "successful": sum(1 for r in results if r.success),
            "failed": sum(1 for r in results if not r.success),
            "total_duration_ms": sum(r.duration_ms for r in results),
            "total_words": sum(r.word_count for r in results),
            "files": [asdict(r) for r in results]
        }
        
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(manifest, f, indent=2)
        
        print(f"Manifest saved to: {output_path}")
    
    def main():
        parser = argparse.ArgumentParser(
            description="Batch convert text files to speech using Lovo AI"
        )
        parser.add_argument("input_dir", help="Directory containing .txt files")
        parser.add_argument("output_dir", help="Directory for output audio files")
        parser.add_argument("--voice", help="Voice ID to use (default: first en-US voice)")
        parser.add_argument("--speed", type=float, default=1.0, help="Speech speed (0.5-2.0)")
        parser.add_argument("--parallel", type=int, default=3, help="Parallel workers")
        parser.add_argument("--merge", action="store_true", help="Merge all audio into one file")
        parser.add_argument("--list-voices", action="store_true", help="List available voices and exit")
        
        args = parser.parse_args()
        
        # List voices mode
        if args.list_voices:
            print("Available voices:")
            print("-" * 60)
            for voice in list_voices():
                print(f"{voice.id}: {voice.display_name} ({voice.locale}, {voice.gender})")
            return
        
        # Get voice ID
        voice_id = args.voice
        if not voice_id:
            voices = list_voices("en-US")
            if not voices:
                voices = list_voices()
            voice_id = voices[0].id
            print(f"Using voice: {voices[0].display_name}")
        
        # Process files
        results = batch_process(
            args.input_dir,
            args.output_dir,
            voice_id,
            speed=args.speed,
            parallel=args.parallel
        )
        
        # Save manifest
        manifest_path = Path(args.output_dir) / "manifest.json"
        save_manifest(results, str(manifest_path))
        
        # Optionally merge
        if args.merge:
            merged_path = Path(args.output_dir) / "merged.mp3"
            merge_audio_files(results, str(merged_path))
        
        # Summary
        print("\n" + "=" * 60)
        print(f"Processed: {len(results)} files")
        print(f"Successful: {sum(1 for r in results if r.success)}")
        print(f"Failed: {sum(1 for r in results if not r.success)}")
        total_ms = sum(r.duration_ms for r in results)
        print(f"Total duration: {total_ms // 1000 // 60}m {(total_ms // 1000) % 60}s")
    
    if __name__ == "__main__":
        main()

    Step 4: Usage Examples

    List Available Voices

    python batch_tts.py --list-voices

    Output:

    Available voices:
    ------------------------------------------------------------
    voice_abc123: James (en-US, male)
    voice_def456: Sarah (en-US, female)
    voice_ghi789: Emma (en-GB, female)
    ...

    Convert a Folder of Text Files

    python batch_tts.py ./chapters ./audiobook --voice voice_abc123

    Convert with Custom Speed and Merge

    python batch_tts.py ./scripts ./audio --speed 0.9 --merge --parallel 5

    Example Input Structure

    chapters/
      01-introduction.txt
      02-getting-started.txt
      03-advanced-topics.txt
      04-conclusion.txt

    Example Output

    audiobook/
      01-introduction.mp3
      02-getting-started.mp3
      03-advanced-topics.mp3
      04-conclusion.mp3
      merged.mp3
      manifest.json

    Step 5: The Manifest File

    The script generates a `manifest.json` with metadata:

    {
      "total_files": 4,
      "successful": 4,
      "failed": 0,
      "total_duration_ms": 342500,
      "total_words": 2847,
      "files": [
        {
          "source_file": "chapters/01-introduction.txt",
          "audio_file": "audiobook/01-introduction.mp3",
          "duration_ms": 85200,
          "word_count": 712,
          "success": true,
          "error": null
        }
      ]
    }

    This is useful for:

  • Tracking processing status
  • Generating chapter markers
  • Creating subtitles/transcripts
  • Debugging failures
  • Advanced: Adding SSML Support

    For more control over pronunciation, add SSML support:

    def text_to_ssml(text: str) -> str:
        """Convert plain text to SSML with basic enhancements."""
        # Add pauses after periods
        text = text.replace(". ", '. <break time="300ms"/> ')
        
        # Add emphasis to quoted text
        text = re.sub(
            r'"([^"]+)"',
            r'<emphasis level="moderate">"\1"</emphasis>',
            text
        )
        
        # Wrap in SSML tags
        return f'<speak>{text}</speak>'

    Cost Estimation

    Lovo AI pricing is based on character count:

    ContentCharactersEstimated Cost
    Blog post (1000 words)~6,000~$0.30
    Book chapter (5000 words)~30,000~$1.50
    Full audiobook (50,000 words)~300,000~$15.00

    Compare to professional voice actors: $200-500 per finished hour.

    Troubleshooting

    Rate Limiting

    If you hit rate limits, reduce parallel workers:

    python batch_tts.py ./input ./output --parallel 1

    Large Files

    For very long text files, split them first:

    def split_text(text: str, max_chars: int = 5000) -> List[str]:
        """Split text into chunks at sentence boundaries."""
        sentences = text.replace('\n', ' ').split('. ')
        chunks = []
        current_chunk = ""
        
        for sentence in sentences:
            if len(current_chunk) + len(sentence) < max_chars:
                current_chunk += sentence + ". "
            else:
                chunks.append(current_chunk.strip())
                current_chunk = sentence + ". "
        
        if current_chunk:
            chunks.append(current_chunk.strip())
        
        return chunks

    Conclusion

    You now have a production-ready Python tool for batch text-to-speech conversion. The script handles:

  • Multiple input files
  • Parallel processing for speed
  • Error handling and retries
  • Audio merging
  • Detailed logging and manifests
  • Use it for audiobooks, course content, video narration, or any project that needs natural-sounding voiceovers at scale.

    Get your Lovo AI API key: genny.lovo.ai

    Found this helpful?Share this article with your network to help others discover useful AI insights.