JSON Format Reference

Quick Navigation

Output Formats
Top-Level Structure
Common Fields
Record Types
Entity Coordinates
Field Optionality
Data Type Conventions

Introduction

This document describes the JSON output format generated by the transcription system. The JSON format provides structured data extracted from historical documents, including transcriptions, translations, and parsed record information.

This technical documentation section will be extended in the future to include additional technical details about the system's data formats, processing pipelines, and integration specifications.

Output Formats

The system generates JSON output in two primary formats:

Single Record JSON

The download format for an individual transcription result. Each JSON object represents a single processed image or document.

Batch JSON

Array format containing multiple records from an upload. Used when downloading all transcription results for an entire upload at once.

Top-Level Structure

The base JSON structure for a single transcription result:

{
  "image_file": "string (filename of the source image)",
  "transcription_original": "string (original language transcription)",
  "translation_en": "string (English translation)",
  "detected_language": "string (e.g., 'Polish', 'Latin', 'German', 'Russian', 'other')",
  "parsed_records": [ /* array of parsed record objects; omitted if parsing was not performed */ ],
  "metadata": {
    "processed_at": "string (ISO 8601 date-time)"
  }
}

Field Descriptions

image_file: The filename of the source image that was transcribed
transcription_original: Full transcription of the document in its original language
translation_en: Complete English translation of the transcribed text
detected_language: Language detected in the source document
parsed_records: Array of parsed record objects (see Common Fields and Record Type-Specific Fields below). Omitted if parsing was not performed
metadata.processed_at: Timestamp when the transcription was completed
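
As a rough illustration of the field descriptions above, the following Python sketch reads a single result and treats parsed_records as optional. The function name summarize_result and the sample values are hypothetical and not part of the system's output or API.

```python
import json


def summarize_result(result: dict) -> dict:
    """Collect a few top-level fields, tolerating an omitted parsed_records."""
    # parsed_records may be missing when parsing was not performed,
    # so fall back to an empty list instead of assuming the key exists.
    records = result.get("parsed_records") or []
    metadata = result.get("metadata") or {}
    return {
        "image_file": result.get("image_file"),
        "language": result.get("detected_language"),
        "processed_at": metadata.get("processed_at"),
        "record_count": len(records),
    }


# Hypothetical sample data, for illustration only.
sample = json.loads("""
{
  "image_file": "page_001.jpg",
  "transcription_original": "...",
  "translation_en": "...",
  "detected_language": "Polish",
  "metadata": {"processed_at": "2024-01-15T10:30:00Z"}
}
""")
summary = summarize_result(sample)
```

Using .get() with a fallback keeps the reader code working whether parsed_records is absent, null, or an empty array.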

Batch Format

When downloading multiple records, the output is an array of the above structure:

[
  {
    "image_file": "...",
    "transcription_original": "...",
    "translation_en": "...",
    "detected_language": "...",
    "parsed_records": [...],
    "metadata": { "processed_at": "..." }
  },
  ...
]
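
Because a download may be either a single object or an array, a consumer might normalize both shapes to a list before processing. This is a minimal sketch; load_results is a hypothetical helper name and the sample filenames are invented.

```python
import json


def load_results(text: str) -> list:
    """Parse a download and always return a list of result objects."""
    data = json.loads(text)
    if isinstance(data, list):
        # Batch JSON: already an array of results.
        return data
    if isinstance(data, dict):
        # Single Record JSON: wrap it so callers can always iterate.
        return [data]
    raise ValueError("expected a JSON object or array at the top level")


single = load_results('{"image_file": "a.jpg"}')
batch = load_results('[{"image_file": "a.jpg"}, {"image_file": "b.jpg"}]')
```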

Common Fields (All Record Types)

All parsed records include these common fields:

{
  "record_type": "birth | marriage | death | notary | land_record | court_record | index | other",
  "record_number": "string (omit if null)",
  "language_detected": "Polish | Latin | German | Russian | other",
  "script": "Latin",
  "jurisdiction": {
    "parish_or_office": "string (omit if null)",
    "village_or_town": "string (omit if null)",
    "gmina": "string (omit if null)",
    "powiat": "string (omit if null)",
    "gubernia_or_wojewodztwo": "string (omit if null)"
  },
  "dates": {
    "record_date": "YYYY-MM-DD | YYYY-MM | YYYY (omit if null)",
    "event_date": "YYYY-MM-DD | YYYY-MM | YYYY (omit if null)",
    "date_precision": "day | month | year | unknown"
  },
  "event_place": {
    "place_name": "string",
    "house_number": "string",
    "parish_church": "string"
  },
  "religion": "Roman Catholic | Greek Catholic | Jewish | Lutheran | other | unknown",
  "signatures_or_marks": "string (omit if null)",
  "source_excerpt_diplomatic": "string (omit if null)",
  "summary_for_indexing": "string (3-6 sentences in English)",
  "translation_en_modern": "string (full translation)",
  "quality": {
    "confidence": 0.0-1.0,
    "issues": ["string array"]
  },
  "notes": {
    "missing": ["string array"],
    "inference": [
      {
        "field": "string",
        "reason": "string"
      }
    ]
  }
}

Common Field Notes

record_type: Determines which type-specific fields are included
record_number: Omitted if not present in the source document
dates: Uses ISO 8601 format with varying precision levels
quality.confidence: Float value between 0.0 and 1.0 indicating extraction confidence
notes.inference: Array documenting fields that were inferred rather than explicitly stated
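
The quality and notes fields can be used to decide which records deserve a manual check. The sketch below is one possible approach; the helper names and the 0.8 threshold are assumptions chosen for illustration, not values defined by the system.

```python
def needs_review(record: dict, threshold: float = 0.8) -> bool:
    """Flag a record whose confidence is low, missing, or has listed issues."""
    quality = record.get("quality") or {}
    confidence = quality.get("confidence")
    if confidence is None or confidence < threshold:
        return True
    # Any entry in quality.issues is also treated as a reason to review.
    return bool(quality.get("issues"))


def inferred_fields(record: dict) -> list:
    """Return the names of fields listed under notes.inference."""
    notes = record.get("notes") or {}
    return [item["field"] for item in notes.get("inference", []) if "field" in item]
```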

Record Type-Specific Fields

Each record type includes additional fields specific to that type. The structure for each record type is shown below:

Birth Records (record_type: "birth")

{
  "participants": {
    "child": {
      "given_names": "string (original from record)",
      "given_names_local": "string (Polish translation)",
      "surname": "string",
      "sex": "male | female | unknown",
      "legitimacy": "legitimate | illegitimate | unknown",
      "birth_order": "integer (omit if null)"
    },
    "parents_of_child": {
      "father": {
        "given_names": "string",
        "given_names_local": "string",
        "surname": "string",
        "age_years": "integer",
        "occupation_or_status": "string",
        "residence": "string"
      },
      "mother": {
        "given_names": "string",
        "given_names_local": "string",
        "surname": "string",
        "maiden_name": "string",
        "age_years": "integer",
        "residence": "string"
      }
    },
    "godparents": [
      {
        "given_names": "string",
        "given_names_local": "string",
        "surname": "string",
        "residence": "string"
      }
    ],
    "witnesses": [
      {
        "given_names": "string",
        "given_names_local": "string",
        "surname": "string",
        "age_years": "integer",
        "occupation_or_status": "string",
        "residence": "string"
      }
    ]
  }
}

Note: Birth records never include groom, bride, deceased, marriage_specific, notary, property, or financial sections.
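
Since optional fields such as residence or godparents may be omitted, code that reads a birth record should not assume any nested key exists. The following sketch shows one defensive way to pull names out of the participants section; full_name and describe_birth are hypothetical helpers.

```python
from typing import Optional


def full_name(person: Optional[dict]) -> str:
    """Join given_names and surname, skipping whichever is missing."""
    if not person:
        return ""
    parts = [person.get("given_names"), person.get("surname")]
    return " ".join(p for p in parts if p)


def describe_birth(record: dict) -> dict:
    """Extract the main names from a birth record's participants section."""
    participants = record.get("participants") or {}
    parents = participants.get("parents_of_child") or {}
    return {
        "child": full_name(participants.get("child")),
        "father": full_name(parents.get("father")),
        "mother": full_name(parents.get("mother")),
        # An omitted godparents array is treated the same as an empty one.
        "godparents": [full_name(g) for g in participants.get("godparents", [])],
    }
```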

Marriage Records (record_type: "marriage")

{
  "participants": {
    "groom": {
      "given_names": "string (original from record)",
      "given_names_local": "string (Polish translation)",
      "surname": "string",
      "age": { "years": "integer", "approximate": false },
      "marital_status": "bachelor | widower | unknown",
      "occupation_or_status": "string",
      "residence": "string",
      "parents": {
        "father": {
          "given_names": "string",
          "given_names_local": "string",
          "surname": "string",
          "status": "alive | deceased | unknown"
        },
        "mother": {
          "given_names": "string",
          "given_names_local": "string",
          "surname": "string",
          "maiden_name": "string",
          "status": "alive | deceased | unknown"
        }
      }
    },
    "bride": {
      "given_names": "string (original from record)",
      "given_names_local": "string (Polish translation)",
      "surname": "string",
      "maiden_name": "string",
      "age": { "years": "integer", "approximate": false },
      "marital_status": "single | widow | unknown",
      "occupation_or_status": "string",
      "residence": "string",
      "parents": { /* same structure as groom.parents */ }
    },
    "witnesses": [ /* same structure as birth witnesses */ ]
  },
  "marriage_specific": {
    "banns_dates": ["YYYY-MM-DD"],
    "consents": "string",
    "previous_spouses": "string",
    "church_or_civil": "church | civil | unknown"
  }
}

Note: Marriage records never include child, parents_of_child, godparents, deceased, notary, property, or financial sections.

Death Records (record_type: "death")

{
  "participants": {
    "deceased": {
      "given_names": "string (original from record)",
      "given_names_local": "string (Polish translation)",
      "surname": "string",
      "sex": "male | female | unknown",
      "age": { "years": "integer", "approximate": false },
      "occupation_or_status": "string",
      "residence": "string",
      "birthplace": "string",
      "parents_or_spouse": "string"
    },
    "witnesses": [ /* same structure as birth witnesses */ ]
  }
}

Note: Death records never include child, parents_of_child, godparents, groom, bride, marriage_specific, notary, property, or financial sections.

Notary Records (record_type: "notary")

Notary records use a different structure for legal documents (contracts, sales, deeds, testaments, powers of attorney):

{
  "record_type": "notary",
  "document_number": "string (omit if null)",
  "document_type": "sale | contract | deed | testament | power_of_attorney | lease | mortgage | other",
  "notary": {
    "given_names": "string",
    "surname": "string",
    "title": "string",
    "office_location": "string"
  },
  "parties": [
    {
      "role": "seller | buyer | grantor | grantee | testator | beneficiary | lessor | lessee | mortgagor | mortgagee | other",
      "given_names": "string",
      "given_names_local": "string",
      "surname": "string",
      "residence": "string",
      "occupation_or_status": "string"
    }
  ],
  "property": {
    "description": "string",
    "location": "string",
    "boundaries": "string",
    "area_or_size": "string",
    "parcel_number": "string"
  },
  "financial": {
    "transaction_value": "string",
    "currency": "string",
    "payment_terms": "string",
    "fees_or_taxes": "string"
  }
}

Note: Notary records never include vital record sections (participants.child, participants.groom, participants.bride, participants.deceased, marriage_specific).

Land Records (record_type: "land_record")

Land records document property transactions (transfers, surveys, mortgages, leases):

{
  "record_type": "land_record",
  "transaction_type": "sale | inheritance | mortgage | lease | survey | partition | exchange | donation | other",
  "parties": [ /* same structure as notary parties */ ],
  "property": {
    "description": "string",
    "location": "string",
    "parcel_number": "string",
    "boundaries": "string",
    "area": "string",
    "improvements": "string",
    "land_use": "string"
  },
  "financial": {
    "value": "string",
    "currency": "string",
    "payment_terms": "string",
    "encumbrances": "string"
  }
}

Note: Land records never include vital record sections or the notary object.

Court Records (record_type: "court_record")

Court records document legal proceedings (judgments, petitions, guardianship):

{
  "record_type": "court_record",
  "case_number": "string (omit if null)",
  "court_name": "string",
  "case_type": "civil | criminal | inheritance | guardianship | bankruptcy | appeal | other",
  "parties": [
    {
      "role": "plaintiff | defendant | petitioner | respondent | appellant | judge | witness | guardian | ward | executor | heir | creditor | debtor | other",
      "given_names": "string",
      "given_names_local": "string",
      "surname": "string",
      "residence": "string",
      "occupation_or_status": "string"
    }
  ],
  "case_details": {
    "subject_matter": "string",
    "claims": "string",
    "evidence": "string"
  },
  "decision": {
    "outcome": "string",
    "terms": "string",
    "costs": "string"
  },
  "property_involved": { "description": "string", "location": "string", "value": "string" },
  "financial_amounts": {
    "amount_claimed": "string",
    "amount_awarded": "string",
    "currency": "string"
  }
}

Note: Court records never include vital record sections, the notary object, or the standard property/financial sections. They use property_involved and financial_amounts instead.

Index Records (record_type: "index")

Index records represent index pages listing names and record references:

{
  "record_type": "index",
  "index_record_type": "birth | marriage | death | mixed | unknown",
  "index_entries": [
    {
      "name": "string (full name as listed in index)",
      "record_number": "string | null",
      "page": "string | null"
    }
  ],
  "summary_for_indexing": "string"
}

Note: Index records use a minimal schema and never include participant, property, or financial sections.
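
An index record can be turned into a lookup from record number to names. This is only a sketch under the schema shown above; index_by_record_number is a hypothetical helper, and entries with a null record_number are skipped.

```python
def index_by_record_number(record: dict) -> dict:
    """Group index entry names by their record_number."""
    lookup = {}
    for entry in record.get("index_entries", []):
        number = entry.get("record_number")
        if number is None:
            # The schema allows null here, so such entries cannot be keyed.
            continue
        lookup.setdefault(number, []).append(entry.get("name"))
    return lookup
```

A list is kept per record number because an index page may list several names against the same entry.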

Entity Coordinates (Conditional)

Entity coordinates are only present when entity tagging is enabled during transcription. The entities field contains an array of detected entities with bounding box coordinates:

{
  "entities": [
    {
      "id": "string (unique identifier, e.g., 'E1', 'E2')",
      "text": "string (entity text as it appears in the document)",
      "type": "name | surname | place | parish | date | occupation | witness | godparent | relationship | entity",
      "box": {
        "x_min": 0-1000,
        "y_min": 0-1000,
        "x_max": 0-1000,
        "y_max": 0-1000
      }
    }
  ]
}

Important Notes

Coordinates use a normalized 0-1000 scale (not pixel values) for resolution independence
The id field matches entity identifiers embedded in the transcription text
Entity types correspond to semantic categories detected in the document
This field is omitted entirely if entity tagging was not enabled
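
To draw a bounding box on the source image, the normalized values have to be scaled to pixels. The sketch below assumes that x values scale with image width, y values with image height, and that the origin is the top-left corner; the document does not state these details, so treat them as assumptions. box_to_pixels is a hypothetical helper.

```python
def box_to_pixels(box: dict, image_width: int, image_height: int) -> tuple:
    """Scale a 0-1000 normalized box to pixel coordinates (left, top, right, bottom)."""
    # Assumption: x is relative to width and y to height, each on a 0-1000 scale.
    scale_x = image_width / 1000
    scale_y = image_height / 1000
    return (
        round(box["x_min"] * scale_x),
        round(box["y_min"] * scale_y),
        round(box["x_max"] * scale_x),
        round(box["y_max"] * scale_y),
    )
```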

Field Optionality Rules

Always Present Fields

These fields are always included in the output (may be null):

image_file, transcription_original, translation_en, detected_language
metadata.processed_at
parsed_records[].record_type, parsed_records[].language_detected
parsed_records[].summary_for_indexing, parsed_records[].translation_en_modern
parsed_records[].quality

Omitted When Null/Empty

These fields are completely omitted from the JSON output if they are null or empty:

parsed_records (entire field if parsing was not performed)
record_number, signatures_or_marks, source_excerpt_diplomatic (if not present)
Any jurisdiction sub-fields or date fields (if null)
birth_order, maiden_name, residence, occupation_or_status (if not present)
Array fields that are empty (e.g., an empty godparents array is omitted rather than written as [])

Conditional Based on Record Type

These sections are only included for their respective record types:

Birth: participants.child, participants.parents_of_child, participants.godparents
Marriage: participants.groom, participants.bride, marriage_specific
Death: participants.deceased
Notary: notary, property, financial
Land: property, financial
Court: case_details, decision, property_involved, financial_amounts
Index: index_entries
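
The mapping above can be expressed as a small table and used to spot sections that do not belong to a record's type. This is a sketch, not a validator shipped with the system; SECTIONS_BY_TYPE, has_path, and unexpected_sections are hypothetical names.

```python
# Type-specific sections, written as dotted paths into a parsed record.
SECTIONS_BY_TYPE = {
    "birth": ["participants.child", "participants.parents_of_child", "participants.godparents"],
    "marriage": ["participants.groom", "participants.bride", "marriage_specific"],
    "death": ["participants.deceased"],
    "notary": ["notary", "property", "financial"],
    "land_record": ["property", "financial"],
    "court_record": ["case_details", "decision", "property_involved", "financial_amounts"],
    "index": ["index_entries"],
}


def has_path(record: dict, dotted: str) -> bool:
    """Return True if every key along the dotted path is present."""
    node = record
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return True


def unexpected_sections(record: dict) -> list:
    """List type-specific sections present that belong to other record types."""
    # A record_type missing from the table (e.g. "other") has no sections of
    # its own, so every type-specific section found on it is reported.
    own = set(SECTIONS_BY_TYPE.get(record.get("record_type"), []))
    every = {path for paths in SECTIONS_BY_TYPE.values() for path in paths}
    return sorted(path for path in every - own if has_path(record, path))
```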

Data Type Conventions

Dates

Format: ISO 8601 (YYYY-MM-DD, YYYY-MM, or YYYY)
Precision: Indicated by date_precision field
Examples: "1847-10-23", "1847-10", "1847"
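
Because a date string may carry day, month, or year precision, a single strict date parser will reject two of the three forms. The sketch below accepts all three and reports the precision implied by the string, which could then be compared with the date_precision field. parse_partial_date is a hypothetical helper; filling missing parts with 1 is a choice made here for illustration.

```python
import re
from datetime import date

_DATE_RE = re.compile(r"^(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?$")


def parse_partial_date(value: str) -> tuple:
    """Return (date, precision) for YYYY-MM-DD, YYYY-MM, or YYYY strings."""
    match = _DATE_RE.match(value)
    if not match:
        raise ValueError(f"unsupported date format: {value!r}")
    year, month, day = match.groups()
    precision = "day" if day else "month" if month else "year"
    # Missing month or day defaults to 1 so a date object can still be built.
    return date(int(year), int(month or 1), int(day or 1)), precision
```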

Names

Dual field system:
• given_names: Original name as it appears in the record
• given_names_local: Polish equivalent or localized version
Surnames: Single field, may include gender-specific endings

Confidence Scores

Type: Float
Range: 0.0 to 1.0
Interpretation: Higher values indicate higher confidence in extraction accuracy

Coordinates

Type: Integer
Range: 0-1000 (normalized scale)
Purpose: Resolution-independent bounding box coordinates
Note: Not pixel values; normalized for different image resolutions

Enumerated Values

Many fields use specific enumerated values. Common enumerations:

record_type: birth, marriage, death, notary, land_record, court_record, index, other
language_detected: Polish, Latin, German, Russian, other
sex: male, female, unknown
religion: Roman Catholic, Greek Catholic, Jewish, Lutheran, other, unknown
marital_status: bachelor, widower, unknown (groom); single, widow, unknown (bride)
date_precision: day, month, year, unknown

Array Structure

The parsed_records field is always an array, even for single-record documents:

Single record: [{...}]
Multiple records: [{...}, {...}, {...}]

This ensures consistent parsing regardless of document content.
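
Since parsed_records is an array whenever it is present, the same loop handles single-record and multi-record documents. As a small sketch, the following counts record types across a list of results; count_record_types is a hypothetical helper, and counting a record without record_type as "other" is an assumption made here.

```python
from collections import Counter


def count_record_types(results: list) -> Counter:
    """Count parsed records by record_type across many results."""
    counts = Counter()
    for result in results:
        # parsed_records may be omitted entirely when parsing was not performed.
        for record in result.get("parsed_records") or []:
            counts[record.get("record_type", "other")] += 1
    return counts
```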