GztExtractor

BETA: Research work in progress

The GztExtractor is an LLM-based tool for extracting structured data from Sri Lankan gazette PDFs.

Features

  • Ministry/Department Initial Gazette Extraction: Extract the complete structure of ministries and their departments from initial gazettes
  • Amendment Gazette Extraction: Extract changes (additions, omissions) from amendment gazettes
  • Person Gazette Extraction: Extract personnel assignments including Ministers, State Ministers, Deputy Ministers, and Secretaries

Supported Gazette Types

Type                Description
ministry-initial    Initial gazette defining ministry structure
ministry-amendment  Amendment gazette with structural changes
persons             Personnel appointment gazettes

Setup

Prerequisites

  • Python 3.8+
  • OpenAI API key

Installation

  1. Navigate to the extractor directory:

    cd gazettes/extractor
  2. Create a virtual environment:

    python -m venv venv
  3. Activate the environment:

    # Linux/Mac
    source venv/bin/activate

    # Windows (PowerShell)
    venv\Scripts\Activate.ps1
  4. Install dependencies:

    pip install -r requirements.txt

Usage

Set API Key

# Linux/Mac
export OPENAI_API_KEY="<YOUR_API_KEY>"

# Windows (PowerShell)
$env:OPENAI_API_KEY="<YOUR_API_KEY>"
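
A script that wraps the extractor can fail early when the key is missing, rather than partway through a run. The sketch below shows one way to do that check; `require_api_key` is a hypothetical helper introduced for illustration and is not part of the extractor.

```python
import os


def require_api_key(env=None) -> str:
    """Return the value of OPENAI_API_KEY, or raise a clear error.

    Hypothetical helper: `env` defaults to os.environ and can be
    replaced with a plain dict when testing.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return key
```

Calling `require_api_key()` at the top of a wrapper script turns a missing key into an immediate, readable error.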

Run Extraction

python cli.py --type <gazette_type> --pdf <path_to_pdf> --output <output_directory>

Parameters

Parameter  Required  Description
--type     Yes       One of: ministry-initial, ministry-amendment, persons
--pdf      Yes       Path to the gazette PDF file
--output   No        Output directory (default: ./outputs)
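
For readers who want to see how these flags fit together, here is a minimal `argparse` sketch that mirrors the table above. It is an assumption about how such a parser could be written, not the actual contents of `cli.py`; `build_parser` and the `gazette_type` attribute name are hypothetical.

```python
import argparse

# The three gazette types listed under "Supported Gazette Types".
GAZETTE_TYPES = ("ministry-initial", "ministry-amendment", "persons")


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (not the real cli.py)."""
    parser = argparse.ArgumentParser(
        prog="cli.py",
        description="Extract structured data from a gazette PDF.",
    )
    parser.add_argument(
        "--type",
        dest="gazette_type",
        required=True,
        choices=GAZETTE_TYPES,
        help="Gazette type to extract.",
    )
    parser.add_argument("--pdf", required=True, help="Path to the gazette PDF file.")
    parser.add_argument(
        "--output",
        default="./outputs",
        help="Output directory (default: ./outputs).",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--type", "persons", "--pdf", "./person_gazette.pdf"])
print(args.gazette_type, args.pdf, args.output)
```

Using `choices` makes the parser reject an unknown `--type` value before any PDF is loaded.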

Examples

Extract ministry structure from initial gazette:

python cli.py --type ministry-initial --pdf ./sample_gazette.pdf

Extract amendments:

python cli.py --type ministry-amendment --pdf ./amendment_gazette.pdf --output ./results

Extract personnel data:

python cli.py --type persons --pdf ./person_gazette.pdf

Output Format

Ministry Initial Output

{
  "ministers": [
    {
      "name": "Ministry of Finance",
      "departments": ["Department of Treasury", "Inland Revenue", "Customs"]
    }
  ]
}
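
As an illustration of consuming this output, the sketch below turns it into a mapping from ministry name to department list. The key names come from the sample above; `departments_by_ministry` is a hypothetical helper, and the inline `SAMPLE` string stands in for a file the extractor would write.

```python
import json

# Inline copy of the sample output above; a real run would read a file instead.
SAMPLE = """
{
  "ministers": [
    {
      "name": "Ministry of Finance",
      "departments": ["Department of Treasury", "Inland Revenue", "Customs"]
    }
  ]
}
"""


def departments_by_ministry(output: dict) -> dict:
    """Map each ministry name to a list of its departments (hypothetical helper)."""
    return {
        entry["name"]: list(entry.get("departments", []))
        for entry in output.get("ministers", [])
    }


structure = departments_by_ministry(json.loads(SAMPLE))
for ministry, departments in structure.items():
    print(f"{ministry}: {len(departments)} departments")
```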

Amendment Output

{
  "ADD": [
    { "ministry": "Ministry of Finance", "department": "New Department" }
  ],
  "OMIT": [
    { "ministry": "Ministry of Finance", "department": "Old Department" }
  ]
}
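
One way to read an amendment is as a set of edits to a ministry-to-departments mapping. The sketch below applies `OMIT` entries first and then `ADD` entries. Both the ordering and the `apply_amendment` helper are assumptions made for illustration; the modules under `mergers/` may behave differently.

```python
def apply_amendment(structure: dict, amendment: dict) -> dict:
    """Return a copy of `structure` with OMIT then ADD changes applied.

    `structure` maps ministry name -> list of department names.
    Hypothetical helper; the OMIT-before-ADD ordering is an assumption.
    """
    # Copy the lists so the caller's structure is left untouched.
    updated = {ministry: list(depts) for ministry, depts in structure.items()}

    for change in amendment.get("OMIT", []):
        departments = updated.get(change["ministry"], [])
        if change["department"] in departments:
            departments.remove(change["department"])

    for change in amendment.get("ADD", []):
        departments = updated.setdefault(change["ministry"], [])
        if change["department"] not in departments:
            departments.append(change["department"])

    return updated


before = {"Ministry of Finance": ["Old Department", "Customs"]}
amendment = {
    "ADD": [{"ministry": "Ministry of Finance", "department": "New Department"}],
    "OMIT": [{"ministry": "Ministry of Finance", "department": "Old Department"}],
}
after = apply_amendment(before, amendment)
print(after)
```

Omitting a department that is not present, or adding one that already exists, is treated as a no-op here so that the same amendment can be applied twice without changing the result.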

Person Output

{
  "ADD": [
    { "person": "Hon. John Doe", "ministry": "Ministry of Finance", "position": "Minister" }
  ],
  "TERMINATE": [
    { "person": "Hon. Jane Doe", "ministry": "Ministry of Finance", "position": "Minister" }
  ]
}
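
The person output can be applied to a set of current appointments in the same spirit. In this sketch an appointment is a `(person, ministry, position)` tuple, `TERMINATE` entries are removed before `ADD` entries are inserted, and `apply_person_changes` is a hypothetical helper. None of this is confirmed to match the extractor's own merging logic.

```python
def apply_person_changes(appointments: set, changes: dict) -> set:
    """Return a new set of appointments with TERMINATE then ADD applied.

    Each appointment is a (person, ministry, position) tuple.
    Hypothetical helper; the TERMINATE-before-ADD ordering is an assumption.
    """
    def as_key(record: dict) -> tuple:
        return (record["person"], record["ministry"], record["position"])

    updated = set(appointments)
    for record in changes.get("TERMINATE", []):
        updated.discard(as_key(record))  # no error if the appointment is absent
    for record in changes.get("ADD", []):
        updated.add(as_key(record))
    return updated


current = {("Hon. Jane Doe", "Ministry of Finance", "Minister")}
changes = {
    "ADD": [
        {"person": "Hon. John Doe", "ministry": "Ministry of Finance", "position": "Minister"}
    ],
    "TERMINATE": [
        {"person": "Hon. Jane Doe", "ministry": "Ministry of Finance", "position": "Minister"}
    ],
}
result = apply_person_changes(current, changes)
print(sorted(result))
```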

Project Structure

extractor/
├── cli.py             # Command-line interface
├── main.py            # Main extraction logic
├── extractors/        # Extraction modules
├── loaders/           # PDF loading utilities
├── mergers/           # Data merging utilities
├── prompts/           # LLM prompts for each gazette type
├── assets/            # Static assets
└── requirements.txt   # Python dependencies