Integrations¶

LDA integrates with version control, cloud storage, databases, Jupyter, IDEs, CI/CD systems, workflow schedulers, notification services, and data science tools. This guide covers the built-in integrations and shows how to connect LDA to your existing toolchain.

Version Control Systems¶

Git Integration¶

LDA works alongside Git: Git versions your code, while LDA tracks data provenance separately.

Automatic Git Commits¶

Configure automatic commits after tracking:

# lda_config.yaml
integrations:
  git:
    auto_commit: true
    commit_message: "LDA: {message}"
    include_manifest: true

Git Hooks¶

Install LDA git hooks:

# Install pre-commit hook
lda git install-hooks

# Manual hook setup
cat > .git/hooks/pre-commit << 'EOF'
#!/bin/bash
lda validate --strict
EOF
chmod +x .git/hooks/pre-commit
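
The manual hook above is a shell script. When the hook needs more logic, the same check can be written in Python. The following is a minimal sketch, assuming only that `lda validate --strict` exits with a non-zero status on failure (Git aborts a commit whenever the pre-commit hook returns non-zero). The helper name `run_check` is hypothetical and not part of LDA.

```python
#!/usr/bin/env python3
"""Sketch of a .git/hooks/pre-commit script written in Python."""
import subprocess
import sys

# Command taken from the shell hook; assumed to exit non-zero on failure.
VALIDATE_CMD = ["lda", "validate", "--strict"]


def run_check(cmd):
    """Run cmd and return its exit status (127 if the program is missing)."""
    try:
        return subprocess.run(cmd).returncode
    except FileNotFoundError:
        print(f"pre-commit: {cmd[0]} not found on PATH", file=sys.stderr)
        return 127


# In the hook file itself, finish with:
#     sys.exit(run_check(VALIDATE_CMD))
# so that a failed validation makes Git abort the commit.
```

Save the file as `.git/hooks/pre-commit` and mark it executable, as in the shell version.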

`.gitignore` Configuration¶

LDA automatically creates an appropriate .gitignore file:

# LDA generated
*.tmp
*.log
.lda_cache/
*_sandbox/

# Large data files (tracked by LDA)
data/*.csv
data/*.parquet
outputs/*.pkl

# But track manifests
!*/manifest.json
!lda_manifest.csv

GitHub Integration¶

GitHub Actions¶

# .github/workflows/lda-validation.yml
name: LDA Validation

on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Install LDA
        run: |
          pip install lda-tool

      - name: Validate project
        run: |
          lda validate --strict
          lda test --all

      - name: Check tracking
        run: |
          lda changes --check

GitHub Issues Integration¶

Link LDA sections to GitHub issues:

# lda_config.yaml
integrations:
  github:
    repo: "owner/repo"
    issue_tracking:
      enabled: true
      section_prefix: "LDA:"

Cloud Storage¶

AWS S3¶

Store large files and backups in S3:

# lda_config.yaml
integrations:
  s3:
    bucket: "my-lda-bucket"
    prefix: "projects/{project_code}"
    backup:
      enabled: true
      schedule: "daily"
    large_files:
      threshold: "100MB"
      auto_upload: true
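
To make the `threshold` and `prefix` settings concrete, the sketch below shows one way such values could be interpreted. It is an illustration only: the function names are hypothetical, and both the decimal units (1 MB = 10^6 bytes) and the inclusive comparison are assumptions, since this guide does not state how LDA parses them.

```python
import re

# Decimal multipliers (1 MB = 10**6 bytes). Whether LDA uses decimal or
# binary units is not stated in this guide; this is an assumption.
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_size(text):
    """Convert a string such as '100MB' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?B)\s*", text.upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit])


def is_large_file(size_bytes, threshold="100MB"):
    """Return True when a file is at or above the threshold (assumed inclusive)."""
    return size_bytes >= parse_size(threshold)


def expand_prefix(template, project_code):
    """Fill the {project_code} placeholder in the configured prefix."""
    return template.format(project_code=project_code)
```

With these assumptions, a 250 MB Parquet file would qualify for upload, and project `PROJ001` would be stored under `projects/PROJ001`.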

S3 Commands¶

# Upload project to S3
lda s3 upload --bucket my-bucket

# Download from S3
lda s3 download --bucket my-bucket --project PROJ001

# Sync with S3
lda s3 sync

Google Cloud Storage¶

integrations:
  gcs:
    bucket: "my-lda-bucket"
    project: "my-gcp-project"
    credentials: "${GOOGLE_APPLICATION_CREDENTIALS}"

Azure Blob Storage¶

integrations:
  azure:
    container: "lda-projects"
    account: "myaccount"
    key: "${AZURE_STORAGE_KEY}"

Database Integration¶

PostgreSQL¶

Store metadata and tracking in PostgreSQL:

integrations:
  postgres:
    host: "${DB_HOST}"
    port: 5432
    database: "lda_tracking"
    user: "${DB_USER}"
    password: "${DB_PASSWORD}"
    schema: "lda"
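
The `${...}` placeholders in these configuration examples keep credentials out of the file by referring to environment variables. How LDA resolves them internally is not documented here; the following sketch shows one plausible approach, with a hypothetical `expand_env` helper that fails loudly when a variable is missing instead of silently inserting an empty string.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env=None):
    """Replace each ${NAME} in value with env[NAME]; raise if NAME is unset."""
    env = os.environ if env is None else env

    def substitute(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(substitute, value)
```

Passing an explicit `env` mapping makes the helper easy to test without touching the real process environment.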

Database Schema¶

-- LDA tracking schema
CREATE SCHEMA IF NOT EXISTS lda;

CREATE TABLE lda.projects (
    id SERIAL PRIMARY KEY,
    code VARCHAR(50) UNIQUE NOT NULL,
    name VARCHAR(200),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE lda.file_tracking (
    id SERIAL PRIMARY KEY,
    project_id INTEGER REFERENCES lda.projects(id),
    file_path VARCHAR(500),
    hash VARCHAR(64),
    size BIGINT,
    modified_at TIMESTAMP,
    analyst VARCHAR(100)
);
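
The `hash VARCHAR(64)` column is wide enough for a hex-encoded SHA-256 digest, which is 64 characters long. The sketch below builds the values for one `lda.file_tracking` row on that basis. Treat the choice of SHA-256 as an assumption, because this guide does not name the algorithm LDA uses; `file_record` is a hypothetical helper.

```python
import hashlib
import os


def file_record(path, analyst, chunk_size=1 << 20):
    """Build a dict with the file_path, hash, size and analyst columns."""
    digest = hashlib.sha256()  # assumed algorithm; not confirmed by this guide
    with open(path, "rb") as handle:
        # Read in chunks so large data files are not loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "file_path": path,
        "hash": digest.hexdigest(),  # 64 hexadecimal characters
        "size": os.path.getsize(path),
        "analyst": analyst,
    }
```

The resulting dict maps directly onto a parameterized `INSERT` statement for the table above.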

MongoDB¶

For document-based tracking:

integrations:
  mongodb:
    uri: "${MONGODB_URI}"
    database: "lda"
    collection: "tracking"

Jupyter Integration¶

Notebook Tracking¶

Track Jupyter notebooks with LDA:

# In Jupyter notebook
import lda.jupyter

# Enable automatic tracking
lda.jupyter.enable_tracking()

# Manual checkpoint
lda.jupyter.checkpoint("Completed data preprocessing")

Magic Commands¶

# Load LDA magic commands
%load_ext lda.jupyter

# In a separate cell: track cell output
# (a %% cell magic must be the first line of its cell)
%%lda track
import pandas as pd
df = pd.read_csv("data.csv")
df = df.dropna()
df.to_csv("cleaned_data.csv")

# Show project status
%lda status

IDE Integration¶

VS Code Extension¶

Install the LDA VS Code extension:

code --install-extension lda-tool.vscode-lda

Features:

- Syntax highlighting for lda_config.yaml
- Command palette integration
- Status bar tracking indicator
- Inline validation

PyCharm Plugin¶

Configure PyCharm for LDA:

1. Install the LDA plugin from the marketplace
2. Configure the project interpreter
3. Set up file watchers:

<!-- .idea/watcherTasks.xml -->
<TaskOptions>
  <option name="command" value="lda" />
  <option name="arguments" value="track --auto" />
  <option name="checkSyntaxErrors" value="false" />
</TaskOptions>

CI/CD Integration¶

Jenkins¶

// Jenkinsfile
pipeline {
    agent any

    stages {
        stage('Setup') {
            steps {
                sh 'pip install lda-tool'
            }
        }

        stage('Validate') {
            steps {
                sh 'lda validate --strict'
            }
        }

        stage('Track') {
            steps {
                sh 'lda track --all'
            }
        }

        stage('Export') {
            steps {
                sh 'lda export manifest --output manifest.json'
                archiveArtifacts artifacts: 'manifest.json'
            }
        }
    }
}

GitLab CI¶

# .gitlab-ci.yml
stages:
  - validate
  - track
  - report

lda-validate:
  stage: validate
  script:
    - pip install lda-tool
    - lda validate --strict

lda-track:
  stage: track
  script:
    - lda track --all
    - lda changes --check

lda-report:
  stage: report
  script:
    - lda export report --format html --output report.html
  artifacts:
    paths:
      - report.html

Workflow Automation¶

Apache Airflow¶

Create LDA DAGs:

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x path; bash_operator is deprecated
from datetime import datetime, timedelta

default_args = {
    'owner': 'lda',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'retries': 1,
}

dag = DAG(
    'lda_workflow',
    default_args=default_args,
    schedule_interval='@daily',
)

validate = BashOperator(
    task_id='validate',
    bash_command='lda validate --strict',
    dag=dag,
)

track = BashOperator(
    task_id='track',
    bash_command='lda track --all',
    dag=dag,
)

export = BashOperator(
    task_id='export',
    bash_command='lda export manifest --output /data/manifest.json',
    dag=dag,
)

validate >> track >> export

Make Integration¶

# Makefile
# Recipe lines must begin with a tab character, not spaces.
.PHONY: init validate track report clean workflow

init:
	lda init --template research

validate:
	lda validate --strict

track:
	lda track --all --message "$(MSG)"

report:
	lda export report --format html --output report.html

clean:
	rm -rf lda_sandbox/
	rm -f *.log

workflow: validate track report

Notification Systems¶

Slack Integration¶

integrations:
  slack:
    webhook_url: "${SLACK_WEBHOOK}"
    notifications:
      on_track: true
      on_error: true
      on_validate: false
    channel: "#lda-updates"
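
For context, a Slack incoming webhook accepts an HTTP POST with a JSON body whose `text` field holds the message. The sketch below shows that request using only the standard library. It is not LDA's implementation: the function names are hypothetical, and the message wording LDA actually sends is not specified in this guide.

```python
import json
import urllib.request


def build_slack_request(webhook_url, text):
    """Create a POST request carrying a minimal incoming-webhook payload."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_message(webhook_url, text, timeout=10):
    """Send the message and return the HTTP status code."""
    request = build_slack_request(webhook_url, text)
    # The timeout keeps a slow or unreachable Slack endpoint from blocking LDA.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Separating request construction from sending lets you inspect or test the payload without network access.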

Email Notifications¶

integrations:
  email:
    smtp_server: "smtp.gmail.com"
    smtp_port: 587
    username: "${EMAIL_USER}"
    password: "${EMAIL_PASS}"
    recipients:
      - "team@example.com"
    notifications:
      daily_summary: true
      error_alerts: true
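
Port 587 is the SMTP submission port and normally requires STARTTLS before login. As a rough illustration of what an error alert could look like with these settings, here is a standard-library sketch. The helper names, subject line, and body text are hypothetical; LDA's own message format is not documented here.

```python
import smtplib
from email.message import EmailMessage


def build_alert(sender, recipients, subject, body):
    """Assemble a plain-text email addressed to all recipients."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_alert(msg, server, port, username, password, timeout=10):
    """Send msg over SMTP, upgrading the connection with STARTTLS first."""
    with smtplib.SMTP(server, port, timeout=timeout) as smtp:
        smtp.starttls()
        smtp.login(username, password)
        smtp.send_message(msg)
```

Read `username` and `password` from the environment, as the configuration above does, instead of hard-coding them.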

Data Science Tools¶

Pandas Integration¶

import pandas as pd
import lda.pandas

# Enable LDA tracking for pandas
lda.pandas.enable_tracking()

# Read with tracking
df = pd.read_csv("data.csv")  # Automatically tracked

# Save with tracking
df.to_csv("output.csv")  # Automatically tracked

# Manual tracking
with lda.pandas.track_operation("data_cleaning"):
    df = df.dropna()
    df = df[df['value'] > 0]

MLflow Integration¶

import mlflow
import lda.mlflow

# Configure MLflow with LDA
lda.mlflow.configure()

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("alpha", 0.01)

    # Train model
    model = train_model()

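    # Log the model through a flavor module (scikit-learn assumed here;
    # MLflow has no top-level mlflow.log_model), then track it with LDA
    mlflow.sklearn.log_model(model, "model")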
    lda.track("models/model.pkl", "Trained model v1")

DVC Integration¶

# dvc.yaml
stages:
  preprocess:
    cmd: python preprocess.py
    deps:
      - data/raw.csv
    outs:
      - data/processed.csv
    meta:
      lda_track: true
      lda_section: "sec01_preprocessing"

Custom Integrations¶

Creating Integration Plugins¶

# my_integration.py
from lda.integrations import Integration

class MyServiceIntegration(Integration):
    """Custom service integration."""

    def __init__(self, config):
        super().__init__(config)
        self.api_key = config.get("api_key")
        self.endpoint = config.get("endpoint")

    def connect(self):
        """Establish connection."""
        # MyServiceClient is a placeholder for your service's SDK client class
        self.client = MyServiceClient(
            self.api_key,
            self.endpoint
        )

    def on_track(self, files):
        """Handle file tracking."""
        for file in files:
            self.client.upload(file)

    def on_validate(self, results):
        """Handle validation results."""
        if results.errors:
            self.client.alert(results.errors)

Integration Configuration¶

integrations:
  custom:
    my_service:
      class: "my_integration.MyServiceIntegration"
      api_key: "${MY_SERVICE_KEY}"
      endpoint: "https://api.myservice.com"
      events:
        - track
        - validate
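
The `class` value is a dotted path of the form `module.ClassName`. How LDA loads it is not described in this guide, but a common way to resolve such a path in Python is `importlib`, sketched below with a hypothetical `load_class` helper. For the path to resolve, the module (`my_integration` in the example) must be importable, for instance by being on `PYTHONPATH`.

```python
import importlib


def load_class(dotted_path):
    """Resolve 'package.module.ClassName' to the class object."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

Under this scheme, a misspelled module raises `ModuleNotFoundError` and a misspelled class name raises `AttributeError`, which helps when debugging a plugin that does not load.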

Best Practices¶

1. Security¶

- Never commit credentials to version control
- Use environment variables for sensitive data
- Encrypt data in transit and at rest
- Implement proper access controls

2. Performance¶

- Use async operations for external services
- Implement caching for frequently accessed data
- Batch operations when possible
- Set appropriate timeouts
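
As an illustration of batching, a custom integration can group files into fixed-size chunks and make one call per chunk instead of one call per file. The `batched` helper below is a generic sketch, not an LDA API.

```python
from itertools import islice


def batched(items, size):
    """Yield lists of at most size items, preserving the original order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Inside an `on_track` handler, for example, you could loop over `batched(files, 50)` and upload each chunk in a single request, provided the target service supports bulk uploads.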

3. Reliability¶

- Implement retry logic with backoff
- Handle service outages gracefully
- Provide fallback mechanisms
- Log integration events
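
Retry with backoff can be sketched in a few lines. The helper below is a generic example rather than part of LDA: it doubles the wait after each failure, caps it at `max_delay`, adds a small random jitter so that many clients do not retry in lockstep, and re-raises the last error once the attempts are used up. All parameter names and defaults are illustrative.

```python
import random
import time


def retry_with_backoff(func, attempts=4, base_delay=0.5, max_delay=30.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**n (plus jitter) and retry."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter spreads out retries from concurrent clients.
            sleep(delay + random.uniform(0, delay / 10))
```

Narrow `retry_on` to transient errors such as `ConnectionError` or `TimeoutError` so that configuration mistakes fail immediately instead of being retried.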

4. Monitoring¶

integrations:
  monitoring:
    prometheus:
      enabled: true
      port: 9090
    metrics:
      - tracking_operations
      - validation_errors
      - integration_failures