Configuration Guide

This guide explains the dataset configuration schema and how to configure datasets for auditing.

Configuration File

The main configuration file is config/datasets.json. It defines:

  • Application URL for UI testing
  • Years to audit
  • Dataset definitions with validation rules and navigation paths

Schema Overview

{
  "platform": "string",
  "name": "string",
  "description": "string",
  "app_url": "string",
  "years": [2020, 2021, 2023, 2024],
  "datasets": [
    {
      "name": "string",
      "description": "string",
      "category": "string",
      "github_repo": "string",
      "file_path": "string",
      "branch": "string",
      "data_url": "string",
      "data_path": "string",
      "expected_columns": ["string"],
      "data_transform": "string",
      "validations": [...],
      "navigation": {...},
      "selectors": {...}
    }
  ]
}

Top-Level Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| platform | string | Yes | Platform name (e.g., "OpenGINXplore") |
| name | string | Yes | Human-readable audit name |
| description | string | No | Description of the audit purpose |
| app_url | string | Yes | Base URL of the web application |
| years | array[int] | Yes | Years to audit by default |
| datasets | array[object] | Yes | Dataset definitions |

Example

{
  "platform": "OpenGINXplore",
  "name": "Sri Lanka Open Data Audit",
  "description": "Audit of datasets from the OpenGINXplore open data platform",
  "app_url": "https://openginxplore.opendata.lk/data?startDate=2020-01-01&endDate=2025-12-31",
  "years": [2020, 2021, 2023, 2024],
  "datasets": [...]
}
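
As an illustration of the required/optional split in the table above, here is a minimal Python sketch that reports which required top-level fields are missing from a loaded configuration. The helper name `missing_top_level_fields` is hypothetical and is not part of the framework; the framework's own validation may behave differently.

```python
import json

# Required top-level fields, taken from the "Top-Level Fields" table.
REQUIRED_TOP_LEVEL = ("platform", "name", "app_url", "years", "datasets")


def missing_top_level_fields(config: dict) -> list[str]:
    """Return the required top-level fields that are absent from config."""
    return [field for field in REQUIRED_TOP_LEVEL if field not in config]


# Illustrative config: "description" is optional, "datasets" is left out on purpose.
example = json.loads("""
{
  "platform": "OpenGINXplore",
  "name": "Sri Lanka Open Data Audit",
  "app_url": "https://openginxplore.opendata.lk/data",
  "years": [2020, 2021, 2023, 2024]
}
""")

print(missing_top_level_fields(example))  # ['datasets']
```

In practice the dictionary would come from reading config/datasets.json with `json.load` rather than from an inline string.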

Dataset Fields

Basic Information

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Unique dataset name |
| description | string | No | Human-readable description |
| category | string | Yes | UI category (e.g., "Tourism") |

GitHub Source

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| github_repo | string | Yes | Repository in format owner/repo |
| file_path | string | Yes | Path to data file (supports {year} placeholder) |
| branch | string | No | Git branch (default: "main") |
| data_url | string | Yes | Raw URL to data (supports {year} placeholder) |

Data Structure

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| data_path | string | No | JSON path to data rows (e.g., "rows", "data") |
| expected_columns | array[string] | Yes | Required column names |
| data_transform | string | No | Transform to apply (e.g., "aggregate_monthly") |

Example Dataset

{
  "name": "Top 10 Source Markets",
  "description": "Top 10 tourism source markets with arrivals and market share",
  "category": "Tourism",
  "github_repo": "LDFLK/datasets",
  "file_path": "data/statistics/{year}/datasets/Top 10 source markets/data.json",
  "branch": "main",
  "data_url": "https://raw.githubusercontent.com/LDFLK/datasets/main/data/statistics/{year}/datasets/Top%2010%20source%20markets/data.json",
  "data_path": "rows",
  "expected_columns": ["Country", "Arrivals", "Share"]
}
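
The {year} placeholder in file_path and data_url is filled in once per audited year. A minimal Python sketch of that substitution follows; the function name `expand_year` is an assumption made for illustration, and the framework may resolve the placeholder differently.

```python
def expand_year(template: str, year: int) -> str:
    """Substitute every {year} placeholder in a file_path or data_url."""
    # str.replace is used rather than str.format so that any other
    # braces in the template are left untouched.
    return template.replace("{year}", str(year))


file_path = "data/statistics/{year}/datasets/Top 10 source markets/data.json"

# One resolved path per year listed in the top-level "years" array.
for year in [2020, 2021, 2023, 2024]:
    print(expand_year(file_path, year))
```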

Validations

Validations define data quality rules to check during the data integrity phase.

Validation Schema

{
  "validations": [
    {
      "name": "string",
      "type": "string",
      "column": "string",
      "value": "any"
    }
  ]
}

Validation Types

min_rows

Verify that the dataset has at least the number of rows given in value.

{
  "name": "minimum_rows",
  "type": "min_rows",
  "value": 10
}

value_not_empty

Verify a column has no empty values.

{
  "name": "country_not_empty",
  "type": "value_not_empty",
  "column": "Country"
}

numeric_column

Verify a column contains numeric values.

{
  "name": "arrivals_numeric",
  "type": "numeric_column",
  "column": "Arrivals"
}

Complete Validations Example

{
  "validations": [
    {
      "name": "minimum_rows",
      "type": "min_rows",
      "value": 10
    },
    {
      "name": "country_not_empty",
      "type": "value_not_empty",
      "column": "Country"
    },
    {
      "name": "arrivals_numeric",
      "type": "numeric_column",
      "column": "Arrivals"
    }
  ]
}
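
To make the three rule types concrete, the following Python sketch evaluates a validations list against rows represented as dictionaries. It is an illustration only: the function names are hypothetical, and the exact semantics (for example, treating numeric strings as numeric, or treating None and "" as empty) are assumptions rather than the framework's documented behavior.

```python
def _is_numeric(value) -> bool:
    """True for ints/floats (not bools) and for strings that parse as a float."""
    if isinstance(value, bool):
        return False
    if isinstance(value, (int, float)):
        return True
    if isinstance(value, str):
        try:
            float(value)
            return True
        except ValueError:
            return False
    return False


def run_validations(rows: list[dict], validations: list[dict]) -> dict[str, bool]:
    """Map each rule name to True (passed) or False (failed)."""
    results = {}
    for rule in validations:
        kind, column = rule["type"], rule.get("column")
        if kind == "min_rows":
            passed = len(rows) >= rule["value"]
        elif kind == "value_not_empty":
            passed = all(row.get(column) not in (None, "") for row in rows)
        elif kind == "numeric_column":
            passed = all(_is_numeric(row.get(column)) for row in rows)
        else:
            raise ValueError(f"Unknown validation type: {kind}")
        results[rule["name"]] = passed
    return results


rows = [
    {"Country": "India", "Arrivals": 416974, "Share": 20.3},
    {"Country": "Russia", "Arrivals": 201920, "Share": 9.8},
]
rules = [
    {"name": "minimum_rows", "type": "min_rows", "value": 10},
    {"name": "country_not_empty", "type": "value_not_empty", "column": "Country"},
    {"name": "arrivals_numeric", "type": "numeric_column", "column": "Arrivals"},
]
print(run_validations(rows, rules))
# {'minimum_rows': False, 'country_not_empty': True, 'arrivals_numeric': True}
```

With only two sample rows, the min_rows rule fails while the two column rules pass.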

Navigation

Navigation defines how to reach the dataset in the web UI using Selenium.

Navigation Schema

{
  "navigation": {
    "tree_path": ["string", "string", ...],
    "steps": [
      {
        "type": "click|wait|delay",
        "selector": "string",
        "by": "xpath|css",
        "description": "string",
        "timeout": 10,
        "delay": 2,
        "seconds": 2
      }
    ]
  }
}

Step Types

click

Click an element.

{
  "type": "click",
  "selector": "//*[text()='Tourism']",
  "by": "xpath",
  "description": "Click Tourism category",
  "delay": 2
}
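
| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| type | string | Yes | | Must be "click" |
| selector | string | Yes | | XPath or CSS selector |
| by | string | No | "xpath" | Selector type |
| description | string | No | | Human-readable description |
| delay | int | No | 2 | Seconds to wait after click |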

wait

Wait for an element to appear.

{
  "type": "wait",
  "selector": "//table",
  "by": "xpath",
  "timeout": 15,
  "description": "Wait for data table to load"
}
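
| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| type | string | Yes | | Must be "wait" |
| selector | string | Yes | | XPath or CSS selector |
| by | string | No | "xpath" | Selector type |
| timeout | int | No | 10 | Maximum seconds to wait |
| description | string | No | | Human-readable description |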

delay

Pause execution for a fixed time.

{
  "type": "delay",
  "seconds": 3,
  "description": "Wait for dynamic content"
}

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| type | string | Yes | | Must be "delay" |
| seconds | int | No | 2 | Seconds to pause |
| description | string | No | | Human-readable description |
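
The defaults listed for the three step types can be summarized in a short Python sketch that fills in omitted optional fields before a step is executed. The names `STEP_DEFAULTS` and `with_defaults` are hypothetical; this shows how the documented defaults combine with a step, not how the framework implements them.

```python
# Defaults per step type, copied from the step tables above.
STEP_DEFAULTS = {
    "click": {"by": "xpath", "delay": 2},
    "wait": {"by": "xpath", "timeout": 10},
    "delay": {"seconds": 2},
}


def with_defaults(step: dict) -> dict:
    """Return a copy of step with documented defaults applied."""
    step_type = step.get("type")
    if step_type not in STEP_DEFAULTS:
        raise ValueError(f"Unknown step type: {step_type!r}")
    # "selector" is required for click and wait steps.
    if step_type in ("click", "wait") and "selector" not in step:
        raise ValueError(f"{step_type} step requires a selector")
    # Values given in the step override the defaults.
    return {**STEP_DEFAULTS[step_type], **step}


print(with_defaults({"type": "wait", "selector": "//table"}))
# {'by': 'xpath', 'timeout': 10, 'type': 'wait', 'selector': '//table'}
```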

Navigation Examples

Simple Navigation (2 levels)

{
  "navigation": {
    "tree_path": ["Tourism", "Top 10 Source Markets"],
    "steps": [
      {
        "type": "click",
        "selector": "//*[text()='Tourism']",
        "by": "xpath",
        "description": "Click Tourism category"
      },
      {
        "type": "click",
        "selector": "//*[contains(text(), 'Top 10 Source Markets')]",
        "by": "xpath",
        "description": "Click Top 10 Source Markets"
      },
      {
        "type": "click",
        "selector": "//p[contains(text(), 'Top 10 Source Markets')]",
        "by": "xpath",
        "description": "Click dataset card"
      },
      {
        "type": "wait",
        "selector": "//table",
        "by": "xpath",
        "timeout": 15,
        "description": "Wait for data table to load"
      }
    ]
  }
}

Deep Tree Navigation (4 levels)

{
  "navigation": {
    "tree_path": ["Tourism", "Arrivals", "By Country", "Tourist Arrivals By Country"],
    "steps": [
      {
        "type": "click",
        "selector": "//*[text()='Tourism']",
        "by": "xpath",
        "description": "Click Tourism"
      },
      {
        "type": "click",
        "selector": "//*[text()='Arrivals']",
        "by": "xpath",
        "description": "Click Arrivals"
      },
      {
        "type": "click",
        "selector": "//*[text()='By Country']",
        "by": "xpath",
        "description": "Click By Country"
      },
      {
        "type": "click",
        "selector": "//*[contains(text(), 'Tourist Arrivals By Country')]",
        "by": "xpath",
        "description": "Click Tourist Arrivals By Country"
      },
      {
        "type": "wait",
        "selector": "//table",
        "by": "xpath",
        "timeout": 15,
        "description": "Wait for data table to load"
      }
    ]
  }
}
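
The deep-tree example follows a regular pattern: an exact-text click for each intermediate level, a contains-text click for the final dataset label, then a wait for the table. As a rough sketch, a hypothetical helper (`steps_from_tree_path`, not part of the framework) could draft such steps from tree_path for later hand-editing. It assumes labels contain no single quotes, since those would need XPath escaping.

```python
def steps_from_tree_path(tree_path: list[str], table_timeout: int = 15) -> list[dict]:
    """Draft click steps for each tree level, followed by a wait for the table."""
    steps = []
    for index, label in enumerate(tree_path):
        if "'" in label:
            raise ValueError(f"Label needs manual XPath escaping: {label!r}")
        is_last = index == len(tree_path) - 1
        # Intermediate levels use an exact match; the final label uses contains().
        selector = (
            f"//*[contains(text(), '{label}')]" if is_last else f"//*[text()='{label}']"
        )
        steps.append({
            "type": "click",
            "selector": selector,
            "by": "xpath",
            "description": f"Click {label}",
        })
    steps.append({
        "type": "wait",
        "selector": "//table",
        "by": "xpath",
        "timeout": table_timeout,
        "description": "Wait for data table to load",
    })
    return steps


path = ["Tourism", "Arrivals", "By Country", "Tourist Arrivals By Country"]
for step in steps_from_tree_path(path):
    print(step["type"], step["selector"])
```

Generated steps are only a starting point. The simple example above needs an extra click on the dataset card, which a generator like this cannot infer from tree_path alone.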

Selectors

Selectors define elements to check after navigation completes.

Selectors Schema

{
  "selectors": {
    "element_name": {
      "selector": "string",
      "by": "xpath|css",
      "extract_text": true|false,
      "extract_list": true|false
    }
  }
}

Selector Options

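| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| selector | string | Yes | | XPath or CSS selector |
| by | string | No | "xpath" | Selector type |
| extract_text | bool | No | false | Extract text content |
| extract_list | bool | No | false | Extract multiple elements |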

Example

{
  "selectors": {
    "data_table": {
      "selector": "//table",
      "by": "xpath"
    },
    "table_rows": {
      "selector": "//table//tr",
      "by": "xpath",
      "extract_list": true
    }
  }
}

Data Formats

The framework supports two JSON data formats:

Columnar Format

{
  "columns": ["Country", "Arrivals", "Share"],
  "rows": [
    ["India", 416974, 20.3],
    ["Russia", 201920, 9.8]
  ]
}

Object Array Format

[
  {"Country": "India", "Arrivals": 416974, "Share": 20.3},
  {"Country": "Russia", "Arrivals": 201920, "Share": 9.8}
]

The framework automatically detects and handles both formats.
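
One way such detection could work is sketched below in Python: both formats are normalized into a list of row dictionaries. The function name `to_records` and the detection rule (a top-level list versus an object with a "columns" key) are assumptions for illustration; the framework's actual logic is not shown in this guide.

```python
def to_records(payload, data_path: str = "rows") -> list[dict]:
    """Normalize columnar or object-array JSON into a list of row dicts."""
    # Object array format: already a list of {column: value} objects.
    if isinstance(payload, list):
        return [dict(row) for row in payload]
    # Columnar format: pair each row's values with the column names.
    if isinstance(payload, dict) and "columns" in payload:
        columns = payload["columns"]
        return [dict(zip(columns, row)) for row in payload[data_path]]
    raise ValueError("Unrecognized data format")


columnar = {
    "columns": ["Country", "Arrivals", "Share"],
    "rows": [["India", 416974, 20.3], ["Russia", 201920, 9.8]],
}
object_array = [
    {"Country": "India", "Arrivals": 416974, "Share": 20.3},
    {"Country": "Russia", "Arrivals": 201920, "Share": 9.8},
]

print(to_records(columnar) == to_records(object_array))  # True
```

Here `data_path` plays the role of the dataset's data_path field, naming the key that holds the rows in the columnar format.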

XPath Selector Tips

Text Matching

//*[text()='Exact Text']           # Exact match
//*[contains(text(), 'Partial')]   # Contains
//*[starts-with(text(), 'Start')]  # Starts with

Element Types

//button[text()='Click']           # Button
//a[text()='Link']                 # Anchor
//p[contains(text(), 'Para')]      # Paragraph
//table                            # Table
//table//tr                        # Table rows
//table//th                        # Table headers
//table//td                        # Table cells

Attributes

//*[@id='myId']                    # By ID
//*[@class='myClass']              # By exact class attribute value
//*[contains(@class, 'partial')]   # Class attribute contains substring
//input[@type='text']              # By attribute

Combining Conditions

//button[text()='Submit' and @type='submit']
//div[@class='card' and contains(text(), 'Tourism')]

Discovering Selectors

Use the explore command to discover available selectors:

# Explore page structure
python main.py explore "https://openginxplore.opendata.lk/data"

# Test navigation with extract-table
python main.py extract-table "https://openginxplore.opendata.lk/data" \
  -k "//*[text()='Tourism']" \
  --wait 3 \
  --no-headless

Watch the browser to verify each click reaches the expected element.