What is gztarchiver?
gztarchiver is an automated Python-based library and command-line tool designed to extract, download, and archive official gazettes from the Sri Lankan government's documents.gov.lk website.
It simplifies the process of maintaining a well-organized, up-to-date archive of gazettes by handling downloading, validation, and cloud backup operations — all through a single configurable tool.
Why Use gztarchiver?
Manually collecting gazettes from official sources is time-consuming, error-prone, and repetitive. gztarchiver automates this entire process, providing:
- Intelligent automation: runs daily without manual intervention
- Flexible configuration: manage everything via a simple YAML file
- Reliable logging: track successful and failed downloads
- Document flagging: flag documents using an LLM
- Resume support: continue downloads from where you left off
- Organized structure: files stored by year, month, and date for easy access
Whether you're a researcher, data archivist, or government organization, gztarchiver helps ensure that no gazette is missed.
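To make the date-based layout and resume behaviour concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not gztarchiver's actual code: the function names `archive_path` and `pending_downloads`, the `.pdf` extension, and the gazette ID format are all assumptions made for this example.

```python
from datetime import date
from pathlib import Path


def archive_path(root: Path, published: date, gazette_id: str) -> Path:
    """Return root/YYYY/MM/DD/<gazette_id>.pdf (hypothetical layout)."""
    return (
        root
        / f"{published.year:04d}"
        / f"{published.month:02d}"
        / f"{published.day:02d}"
        / f"{gazette_id}.pdf"  # assumes gazettes are stored as PDF files
    )


def pending_downloads(
    root: Path, gazettes: list[tuple[date, str]]
) -> list[tuple[date, str]]:
    """Keep only gazettes whose target file is not on disk yet.

    Skipping files that already exist is one simple way to resume
    an interrupted run without downloading everything again.
    """
    return [
        (published, gazette_id)
        for published, gazette_id in gazettes
        if not archive_path(root, published, gazette_id).exists()
    ]
```

With this layout, a gazette with the made-up ID `2374-01` published on 5 March 2024 would be stored at `archive/2024/03/05/2374-01.pdf` when `root` is `archive`.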
Technologies Used
| Technology | Purpose |
|---|---|
| Python | Core programming language for the entire tool |
| Scrapy | Web scraping framework used to extract gazette metadata and links from the source website |
| DeepSeek | LLM used to classify and process the extracted gazettes |
| YAML Configuration | Provides customizable options for runtime behavior and file paths |
Resource Website
All gazette data is scraped and archived from the official Sri Lankan Government Gazette Portal:
🔗 https://documents.gov.lk/
This is the official source of all public gazette publications in English, Sinhala, and Tamil.