What is gztarchiver?
gztarchiver is an automated Python-based library and command-line tool designed to extract, download, and archive official gazettes from the Sri Lankan government's documents.gov.lk website.
It simplifies the process of maintaining a well-organized, up-to-date archive of gazettes by handling downloading, validation, and cloud backup operations, all through a single configurable tool.
Why Use gztarchiver?
Manually collecting gazettes from official resources can be time-consuming, error-prone, and repetitive. gztarchiver automates this entire process, providing:
- Intelligent automation: runs daily without manual intervention
- Flexible configuration: manage everything via a simple YAML file
- Reliable logging: track successful and failed downloads
- Document flagging: flag documents using an LLM
- Resume support: continue downloads from where you left off
- Organized structure: files stored by year, month, and date for easy access
Whether you're a researcher, data archivist, or government organization, gztarchiver is designed so that no gazette is missed.
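The year/month/date layout and the resume behaviour described above can be illustrated with a minimal sketch. This is not gztarchiver's actual code: the function names `archive_path` and `needs_download`, the file-naming scheme, and the sample gazette ID are all assumptions made for illustration.

```python
from datetime import date
from pathlib import Path


def archive_path(root: Path, published: date, gazette_id: str, language: str) -> Path:
    """Build root/YYYY/MM/DD/<gazette_id>_<language>.pdf (hypothetical layout)."""
    return (
        root
        / f"{published:%Y}"   # four-digit year
        / f"{published:%m}"   # zero-padded month
        / f"{published:%d}"   # zero-padded day
        / f"{gazette_id}_{language}.pdf"
    )


def needs_download(target: Path) -> bool:
    """Resume support (assumed rule): skip files that already exist and are non-empty."""
    return not (target.exists() and target.stat().st_size > 0)
```

For example, `archive_path(Path("archive"), date(2024, 3, 5), "2374-15", "en")` yields `archive/2024/03/05/2374-15_en.pdf`, and a rerun would call `needs_download` on that path before fetching the file again.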
Technologies Used
| Technology | Purpose |
|---|---|
| Python | Core programming language for the entire tool |
| Scrapy | Web scraping framework used to extract gazette metadata and links from the resource website |
| DeepSeek | Used for intelligent classification and data processing of extracted gazettes |
| YAML Configuration | Provides customizable options for runtime behavior and file paths |
Resource Website
All gazette data is scraped and archived from the official Sri Lankan Government Gazette Portal:
https://documents.gov.lk/
This is the official source of all public gazette publications in English, Sinhala, and Tamil.
Daily Execution Schedule
The gztarchiver tool is configured to run automatically every day at 20:00 (8:00 PM) local time. During this scheduled run, it:
- Checks for new gazettes published that day
- Downloads all available files in all supported languages
- Validates the downloaded PDFs
- Flags documents using the DeepSeek LLM
- Updates the metadata and log files
- Syncs results to the configured archive
This ensures the archive remains fresh and synchronized daily.
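Two pieces of the daily run lend themselves to a short sketch: working out when the next 20:00 run falls, and a cheap sanity check on a downloaded PDF. The document does not say how gztarchiver schedules itself or validates files, so both helpers (`next_run` and `looks_like_pdf`) are hypothetical and use only the Python standard library.

```python
from datetime import datetime, time, timedelta

RUN_AT = time(20, 0)  # daily run time stated in this document


def next_run(now: datetime) -> datetime:
    """Return the next 20:00 strictly after `now` (naive local time)."""
    candidate = datetime.combine(now.date(), RUN_AT)
    if candidate <= now:
        # Today's slot has already passed (or is exactly now), so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate


def looks_like_pdf(data: bytes) -> bool:
    """Assumed validation rule: PDF header at the start, EOF marker near the end."""
    return data.startswith(b"%PDF-") and b"%%EOF" in data[-1024:]
```

A check like `looks_like_pdf` only catches obvious failures, such as an HTML error page saved with a `.pdf` extension or a truncated download. It does not prove that the file can be opened or that its content is complete.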
Summary
| Category | Description |
|---|---|
| Tool Name | gztarchiver |
| Purpose | Automate gazette extraction and archiving |
| Language | Python |
| Frameworks and Tools | Scrapy, DeepSeek, YAML |
| Data Source | documents.gov.lk |
| Execution Time | Daily at 20:00 |
| Status | Under active development |