Quick Start

Get up and running with OpenGIN Tracer in minutes.

Installation

To install the latest version of OpenGIN Tracer, use pip:

pip install opengin-ingestion

Or install from source if you have the repository cloned:

pip install -e .
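
To confirm the install succeeded before moving on, a quick check like the following may help. It is a minimal sketch that assumes only that the top-level package is importable as `opengin` (the name used by the import in the extraction script in this guide). The helper name `opengin_installed` is made up for illustration and is not part of OpenGIN.

```python
import importlib.util


def opengin_installed() -> bool:
    """Return True if a top-level package named 'opengin' can be found."""
    # find_spec returns None when the top-level package is not installed.
    return importlib.util.find_spec("opengin") is not None


if __name__ == "__main__":
    if opengin_installed():
        print("opengin is importable.")
    else:
        print("opengin was not found; re-run the pip install step.")
```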

Hands-on Example

Important: Gemini API Key Required

Set the GOOGLE_API_KEY environment variable before running the tracer. Without it, the system runs in Mock Mode and returns the same static dummy data for every page, so the output looks like duplicate tables.

export GOOGLE_API_KEY="your_api_key_here"
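
Because Mock Mode output can be mistaken for a real extraction, it may be worth checking for the key at the top of your own script. The sketch below only reads the GOOGLE_API_KEY variable named above. The helper `gemini_key_configured` is a hypothetical name, not an OpenGIN function.

```python
import os


def gemini_key_configured(environ=os.environ) -> bool:
    """Return True if GOOGLE_API_KEY is set to a non-empty value."""
    # A blank or whitespace-only value is treated as "not set".
    return bool(environ.get("GOOGLE_API_KEY", "").strip())


if __name__ == "__main__":
    if not gemini_key_configured():
        print("Warning: GOOGLE_API_KEY is not set; expect Mock Mode output.")
```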

In this guide, we will use a generated sample PDF to demonstrate how to extract tables using the Python API.

1. The Scenario

We have a 5-page PDF file (data/quickstart_sample.pdf) where each page contains a table with a title.

Goal: Extract the table from each page into a separate CSV file.
Output: 5 CSV files (or JSONs) containing the structured data.

(If you don't have the sample PDF, you can generate it by running python scripts/generate_sample_pdf.py)

2. Create Metadata Schema (Optional)

Create a file named metadata.yml to define what extra information you want to extract about the data:

fields:
  - name: table_category
    description: The category of the table (e.g., Financial, Inventory, Staff)
    type: string
  - name: confidence_score
    description: A score from 0 to 1 indicating confidence in extraction
    type: float
  - name: row_count
    description: Number of rows in the table
    type: integer
  - name: column_count
    description: Number of columns in the table
    type: integer
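
A typo in the schema file is easy to miss, so a small sanity check before running the pipeline can save a wasted run. The sketch below operates on the dict that `yaml.safe_load` would return for the file above. It assumes that every field needs `name`, `description`, and `type`, and that the valid types are exactly the three used in this example. Neither assumption is confirmed here as an OpenGIN requirement, and `find_schema_problems` is a hypothetical helper.

```python
# Assumption: only the types that appear in the example schema are valid.
ALLOWED_TYPES = {"string", "float", "integer"}
# Assumption: every field must define these three keys.
REQUIRED_KEYS = ("name", "description", "type")


def find_schema_problems(schema):
    """Return a list of human-readable problems; an empty list means OK."""
    fields = schema.get("fields") if isinstance(schema, dict) else None
    if not isinstance(fields, list) or not fields:
        return ["schema needs a non-empty 'fields' list"]

    problems = []
    seen_names = set()
    for index, field in enumerate(fields):
        if not isinstance(field, dict):
            problems.append(f"field {index}: expected a mapping")
            continue
        for key in REQUIRED_KEYS:
            if not field.get(key):
                problems.append(f"field {index}: missing '{key}'")
        name = field.get("name")
        if name in seen_names:
            problems.append(f"field {index}: duplicate name '{name}'")
        elif name:
            seen_names.add(name)
        field_type = field.get("type")
        if field_type and field_type not in ALLOWED_TYPES:
            problems.append(f"field {index}: unknown type '{field_type}'")
    return problems


# Two of the fields from metadata.yml, written as the equivalent dict.
example_schema = {
    "fields": [
        {"name": "table_category",
         "description": "The category of the table",
         "type": "string"},
        {"name": "row_count",
         "description": "Number of rows in the table",
         "type": "integer"},
    ]
}

if __name__ == "__main__":
    print(find_schema_problems(example_schema) or "Schema looks fine.")
```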


### 3. Create the Extraction Script

Create a file named `quickstart_extract.py` with the following content:

```python
import os
import yaml
from opengin.tracer.agents.orchestrator import Agent0

# Define what we want to extract
EXTRACTION_PROMPT = """
    **Objective:** Extract the table from the current page.
    
    **Instructions:**
    1. **Identify**: Locate the table and its title/heading.
    2. **Naming**: Use the table title as the table name. Convert it to snake_case.
    3. **Extract**: Extract all rows and columns accurately.
    4. **Metadata**: Extract the metadata as defined in the schema, one metadata per table.
    5. **Separate**: If multiple tables exist, extract them as separate entities, each with its own name.
"""

def main():
    # 1. Setup paths
    input_pdf = "data/quickstart_sample.pdf"
    pipeline_name = "quickstart_run"
    metadata_file = "metadata.yml"
    
    if not os.path.exists(input_pdf):
        print(f"Error: {input_pdf} not found. Please run scripts/generate_sample_pdf.py first.")
        return

    # 2. Load Metadata Schema
    metadata_schema = None
    if os.path.exists(metadata_file):
        with open(metadata_file, "r") as f:
            metadata_schema = yaml.safe_load(f)
        print("Loaded metadata schema.")

    # 3. Initialize the Orchestrator
    agent0 = Agent0()
    
    print(f"Initializing pipeline for: {input_pdf}")
    run_id, metadata = agent0.create_pipeline(
        pipeline_name,
        input_pdf,
        os.path.basename(input_pdf)
    )

    # 4. Run the Pipeline
    print(f"Running extraction (Run ID: {run_id})...")
    agent0.run_pipeline(pipeline_name, run_id, EXTRACTION_PROMPT, metadata_schema=metadata_schema)
    
    # 5. Success Message
    print(f"Extraction complete! Check results in: pipelines/{pipeline_name}/{run_id}/output/")

if __name__ == "__main__":
    main()
```

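Instruction 2 in the prompt asks the model to name each table after its title, converted to snake_case. The naming itself is done by the model, not by your script, but the following sketch shows the kind of result to expect. `to_snake_case` is an illustrative helper, not an OpenGIN function, and it does not split camelCase words.

```python
import re


def to_snake_case(title: str) -> str:
    """Lowercase a title and join its words with single underscores."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    joined = re.sub(r"[^0-9A-Za-z]+", "_", title.strip())
    # Drop underscores left over from leading/trailing punctuation.
    return joined.strip("_").lower()


if __name__ == "__main__":
    print(to_snake_case("Sample Data Table"))
```

A title such as "Sample Data Table" maps to `sample_data_table`, which matches the output file names listed later in this guide.
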
4. Run the Script

Execute the script in your terminal:

python quickstart_extract.py

5. Check Results with CLI

OpenGIN comes with a CLI to manage and inspect your pipeline runs.

List Runs: See all your pipeline executions.

opengin tracer list-runs

Output:

+----------------+--------------------------------------+-----------+-------+
| Pipeline       | Run ID                               | Status    | Pages |
+================+======================================+===========+=======+
| quickstart_run | 9ce866d1-58f0-46e1-a369-0f8ca26e8c54 | COMPLETED |     5 |
+----------------+--------------------------------------+-----------+-------+

Inspect Run: Get detailed information about a specific run using its pipeline name and run ID.

opengin tracer info quickstart_run <YOUR_RUN_ID>

Output:

{
  "pipeline_name": "quickstart_run",
  "run_id": "9ce866d1-58f0-46e1-a369-0f8ca26e8c54",
  "created_at": "2025-12-26 07:22:13.297861",
  "status": "COMPLETED",
  "page_count": 5,
  "current_stage": "EXPORTING",
  "input_file": "pipelines/quickstart_run/9ce866d1-58f0-46e1-a369-0f8ca26e8c54/input/quickstart_sample.pdf"
}

Output Files:
 - sample_data_table.csv
 - sample_data_table_metadata.json

View Data: Navigate to the output directory shown in the info command to see your CSVs and JSONs.
```
ls pipelines/quickstart_run/<YOUR_RUN_ID>/output/
```
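
If you prefer to inspect results from Python rather than the shell, the sketch below builds the output path from the fields in the info JSON and previews the first rows of each CSV. It relies only on the `pipelines/<pipeline_name>/<run_id>/output/` layout printed by the extraction script. The helpers `output_dir_for` and `preview_outputs` are hypothetical names, and treating any status other than COMPLETED as an error is an assumption made for this example.

```python
import csv
from itertools import islice
from pathlib import Path


def output_dir_for(info: dict) -> Path:
    """Build the output directory path from the run's info fields."""
    # Assumption: only COMPLETED runs have a full set of output files.
    if info.get("status") != "COMPLETED":
        raise ValueError(f"run is not complete: {info.get('status')}")
    return Path("pipelines") / info["pipeline_name"] / info["run_id"] / "output"


def preview_outputs(output_dir, max_rows=3):
    """Map each CSV file name in output_dir to its first max_rows rows."""
    previews = {}
    for csv_path in sorted(Path(output_dir).glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            previews[csv_path.name] = list(islice(csv.reader(handle), max_rows))
    return previews


if __name__ == "__main__":
    # Values copied from the sample info output above; use your own run ID.
    info = {
        "pipeline_name": "quickstart_run",
        "run_id": "9ce866d1-58f0-46e1-a369-0f8ca26e8c54",
        "status": "COMPLETED",
    }
    out_dir = output_dir_for(info)
    print(out_dir)
    if out_dir.is_dir():
        for name, rows in preview_outputs(out_dir).items():
            print(name, rows)
```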

Next Steps

Explore the Architecture to understand how the orchestrator works.
Check out the Tutorials for more complex examples.
