Within Duplicates

Why Big UFO Datasets Double Count Cases

Large combined datasets can preserve the same case under different dates, places, shapes or source labels unless matching is source-aware.

On this page

  • How merged catalogues inherit duplicates
  • Why near matches are hard to detect
  • Source identifiers and conservative matching rules
Preview for Why Big UFO Datasets Double Count Cases

Introduction

Large UFO catalogues often look larger than they really are because they merge records from different reporting systems without fully resolving duplicates. A single sighting may appear in a witness report, a local newspaper, an investigator’s file, a historical catalogue and a later online database. When those sources are combined, the same underlying event can survive as several separate rows. The result is not necessarily fraud or bad faith. It is often a consequence of preserving source material while avoiding accidental deletion of genuine cases.

Merged Data illustration 1 This problem becomes more visible as researchers combine older archives with modern reporting systems such as the National UFO Reporting Center (NUFORC), historical catalogues and privately maintained databases. The challenge is that UFO reports rarely arrive with a universal case identifier. Matching records therefore depends on imperfect clues such as dates, locations, descriptions and witness narratives, all of which can vary from source to source. [Center for UFO Studies]cufos.orgUFOCAT Codebook 2023Center for UFO StudiesUFOCAT 2023June 6, 2024 — A closely related principle established in the earlier versions of UFOCAT was that a give…Published: June 6, 2024

How merged catalogues inherit duplicates

Most large UFO databases are not built from a single stream of reports. Instead, they are assembled from multiple collections that were created for different purposes and under different standards.

A historical newspaper archive may record a sighting using the publication date rather than the observation date. A civilian reporting centre may store the witness’s own description. A researcher compiling a catalogue years later may summarise the event in a few sentences and assign a shape category. When those records are merged, software may see them as different incidents even when they refer to the same event.

The Center for UFO Studies’ UFOCAT database openly acknowledges this problem in its design. Its codebook explains that a record typically represents one witness or witness group reporting one event through one source, but it also notes that witnesses, events and sources are not always clearly separable. UFOCAT therefore preserves source-level records while providing mechanisms to group records that refer to the same underlying event. The database treats the earliest or most complete account as a primary record and links related source entries around it. [Center for UFO Studies]cufos.orgUFOCAT Codebook 2023Center for UFO StudiesUFOCAT 2023June 6, 2024 — A closely related principle established in the earlier versions of UFOCAT was that a give…Published: June 6, 2024

This approach highlights an important distinction:

  • A catalogue may count reports rather than events.
  • Different reports may describe the same event.
  • Preserving provenance often means retaining multiple versions.
  • Removing duplicates too aggressively risks erasing legitimate evidence.

For database builders, preserving source history is often considered safer than forcing uncertain merges.

Why the same sighting can look different in different datasets

The biggest obstacle to deduplication is that UFO reports are rarely standardised at the moment they are created.

Even basic details may shift between sources:

FieldCommon variationDateDifferent time zones, midnight crossings, reporting delaysLocationCity centre versus suburb, county versus municipalityShape“Disc”, “oval”, “light”, “sphere” or “unknown” for the same objectDurationEstimated differently by separate witnessesWitness countOne source lists a group, another lists individualsNarrativeExpanded, shortened or edited by investigators

A witness might tell a newspaper that an object was “circular”, while a later catalogue codes it as “disc”. Another database might classify the same event simply as a “light”. Automated matching systems often rely on these fields, so category differences can prevent records from being linked.

Location creates similar problems. A sighting near a town may be recorded under the nearest city in one database and under a rural district in another. Historical records are especially vulnerable because place names, administrative boundaries and map references can change over time.

These discrepancies become more significant when datasets contain hundreds of thousands of records. Even small inconsistencies can generate large numbers of apparent unique cases.

Why near-matches are hard to detect

Many readers assume duplicate detection is a simple matter of comparing dates and locations. In practice, the problem resembles record linkage challenges found in medicine, insurance claims and software defect tracking.

A strict matching rule creates false negatives. A loose matching rule creates false positives.

Consider three reports:

  1. A sighting on 14 June at 22:05 near Bristol.
  2. A sighting on 14 June at 22:15 in northern Bristol.
  3. A sighting on 15 June shortly after midnight in South Gloucestershire.

These may represent:

  • One event described differently.
  • Two separate events.
  • Three unrelated observations.

A computer cannot know which interpretation is correct without additional context.

The problem becomes even harder when databases contain historical records. Older reports may lack precise coordinates, exact times or witness names. Many archives include summaries written decades after the original event occurred. Matching therefore becomes probabilistic rather than certain.

Researchers in other fields face similar issues. Large pharmacovigilance systems that collect adverse-event reports for medicines treat duplicate identification as a major analytical challenge because separate reports can describe the same case while differing in dates, wording and metadata. Modern duplicate-detection systems rely on probabilistic matching rather than exact identity checks. [arXiv]arxiv.orgarXivA Scalable Predictive Modelling Approach to Identifying Duplicate Adverse Event Reports for Drugs and VaccinesMarch 31, 2025…Published: March 31, 2025

The same logic applies to UFO catalogues. Exact matching misses many duplicates, but flexible matching risks combining unrelated sightings.

Merged Data illustration 2

Source identifiers matter more than object descriptions

One of the most effective ways to manage duplicates is not to compare the reported object at all, but to track the source history of the report.

UFOCAT’s structure illustrates this principle. Rather than attempting to create a single perfectly cleaned event list, it preserves source-specific records and groups them through record relationships. The database can therefore show how several reports relate to one event without destroying information about where each version originated. [Center for UFO Studies]cufos.orgUFOCAT Codebook 2023Center for UFO StudiesUFOCAT 2023June 6, 2024 — A closely related principle established in the earlier versions of UFOCAT was that a give…Published: June 6, 2024

This source-aware approach addresses a common failure in merged catalogues.

Imagine that:

  • A newspaper story appears in 1978.
  • An investigator copies the story into a case file.
  • A UFO book summarises the investigator’s account.
  • An online database imports the book entry.

If the final database ignores source lineage, it may treat all four entries as independent reports. Yet they ultimately derive from the same original observation.

Tracking provenance allows researchers to distinguish:

  • Independent witnesses.
  • Independent investigations.
  • Re-publications of the same information.
  • Secondary summaries derived from earlier sources.

Without provenance data, duplicate detection becomes far more speculative.

Why conservative matching often leaves duplicates in place

Many database maintainers intentionally tolerate some duplication.

The reason is straightforward: deleting a genuine report is often considered a worse error than retaining an extra copy.

Suppose two records share:

  • The same date.
  • The same town.
  • Similar descriptions.
  • Similar durations.

They might still represent separate sightings occurring during a local flap, where multiple unusual lights were reported in the same area on the same night.

If an automated process merges them incorrectly, researchers lose information that may never be recoverable.

As a result, many catalogue builders adopt conservative rules:

  • Merge only when confidence is high.
  • Preserve uncertain matches as separate records.
  • Flag suspected duplicates rather than deleting them.
  • Retain source-specific entries for audit purposes.

This means that merged UFO datasets can remain partially duplicated even after extensive cleaning.

The goal is often traceability rather than numerical purity.

Merged Data illustration 3

Modern mega-datasets amplify the problem

Recent projects have attempted to combine several major UFO databases into single searchable systems containing hundreds of thousands of records. Public discussions around these efforts frequently emphasise cross-referencing and deduplication because the scale makes manual review impossible. [Reddit]reddit.com614505 ufo sighting records from 5 major databasesBest sources for alien reports. Most reported UFO sightings locations. UFO hotspots in the USA…

The larger the merged collection becomes, the more duplicate pathways emerge:

  • Multiple databases may already contain copies of the same historical case.
  • Independent scrapes of the same website may introduce repeated entries.
  • Archived versions and updated versions may both survive.
  • Corrections may appear as new records rather than replacements.

A merged catalogue can therefore inherit duplicates from every source it absorbs while simultaneously creating new duplicates during the integration process.

This is one reason why raw sighting totals should be interpreted cautiously. A database containing 300,000 reports does not automatically represent 300,000 distinct UFO events.

What inflated counts actually tell researchers

Duplicate records do not necessarily make a database useless. They simply change what the numbers mean.

A high report count may indicate:

  • Extensive public interest.
  • Multiple witness accounts of notable events.
  • Strong archival coverage.
  • Repeated circulation of historically famous cases.
  • Dataset merging without complete deduplication.

It does not automatically indicate a corresponding number of independent unexplained phenomena.

For researchers, the most informative question is often not “How many reports exist?” but “How many unique events remain after source-aware grouping?” UFOCAT’s grouping structure was designed partly to address exactly that distinction. [Center for UFO Studies]cufos.orgUFOCAT Codebook 2023Center for UFO StudiesUFOCAT 2023June 6, 2024 — A closely related principle established in the earlier versions of UFOCAT was that a give…Published: June 6, 2024

The persistence of duplicates in major UFO datasets is therefore less a sign of negligence than a reflection of a difficult trade-off. Catalogue builders must choose between preserving historical source material and producing a cleaner event count. Most large archives prioritise preservation, which means some level of duplication is likely to remain even in carefully maintained databases. [Center for UFO Studies]cufos.orgUFOCAT Codebook 2023Center for UFO StudiesUFOCAT 2023June 6, 2024 — A closely related principle established in the earlier versions of UFOCAT was that a give…Published: June 6, 2024 [Reddit]reddit.comMUFON/Open, Kaggle scrapes, and others). Each sighting has…Read more…

Amazon book picks

Further Reading

Books and field guides related to Why Big UFO Datasets Double Count Cases. Use these as the next step if you want deeper reading beyond the article.

BookCover for Calling Bullshit

Calling Bullshit

By Carl T. Bergstrom, Jevin Darwin West

Directly helps readers spot misleading merged datasets, duplication, and overconfident numerical claims.

eBay marketplace picks

Marketplace Samples

Example marketplace items related to this page. Use the search link to explore similar finds on eBay.

Endnotes

  1. Source: arxiv.org
    Link: https://arxiv.org/abs/2504.03729
    Source snippet

    arXivA Scalable Predictive Modelling Approach to Identifying Duplicate Adverse Event Reports for Drugs and VaccinesMarch 31, 2025...

    Published: March 31, 2025

  2. Source: reddit.com
    Title: 614505 ufo sighting records from 5 major databases
    Link: https://www.reddit.com/r/UFOB/comments/1sjoz1f/614505_ufo_sighting_records_from_5_major_databases/
    Source snippet

    Best sources for alien reports. Most reported UFO sightings locations. UFO hotspots in the USA...

  3. Source: reddit.com
    Link: https://www.reddit.com/r/UFOs/comments/1oz3i0h/im_releasing_a_cleaned_enriched_ufo_dataset_327k/
    Source snippet

    [MUFON]({{ 'mufon/' | relative_url }})/Open, Kaggle scrapes, and others). Each sighting has...Read more...

  4. Source: arxiv.org
    Link: https://arxiv.org/html/2411.02401v1
    Source snippet

    A Civilian Astronomer's Guide to UAP Research5 Nov 2024 — While a single UFO report might be linked to one event, thousands of reports ca...

  5. Source: cufos.org
    Title: UFOCAT Codebook 2023
    Link: https://cufos.org/PDFs/UFOCAT%20Codebook%202023.pdf
    Source snippet

    Center for UFO StudiesUFOCAT 2023June 6, 2024 — A closely related principle established in the earlier versions of UFOCAT was that a give...

    Published: June 6, 2024

Additional References

  1. Source: ada-nuforc-analysis.github.io
    Link: https://ada-nuforc-analysis.github.io/
    Source snippet

    NUFORC Report AnalysisThe NUFORC checks each reports for fakes or hoax and comments them accordingly. The reports are classified by their...

  2. Source: cdn.nationalarchives.gov.uk
    Link: https://cdn.nationalarchives.gov.uk/documents/the-ufo-files-extract.pdf
    Source snippet

    UFO FILESHe has a long-standing interest in UFOs and other aerial phenomena, and has worked with the [National Archives]({{ 'archives/' | relative_url }}) in promoting UFO m...

  3. Source: observablehq.com
    Link: https://observablehq.com/%407a596ea349ab5d1c/ufo-sightings

  4. Source: kaggle.com
    Title: UF O Sightings Analysis This dataset contains 80,332 reported UFO sightings from
    Link: https://www.kaggle.com/code/mcarre/ufo-sightings-analysis
    Source snippet

    UFO Sightings AnalysisThis dataset contains 80,332 reported UFO sightings from November 11, 1906 to May 08, 2014, each a potential glimps...

    Published: November 11, 1906

  5. Source: levelup.gitconnected.com
    Title: i used an llm to analyze 140 000 ufo reports the aliens are real 3d589ec4055d
    Link: https://levelup.gitconnected.com/i-used-an-llm-to-analyze-140-000-ufo-reports-the-aliens-are-real-3d589ec4055d
    Source snippet

    The Aliens...4 Mar 2026 — What happens when you use AI to analyze 140000 UFO reports? A humorous data-driven dive into a world of alien...

  6. Source: researchgate.net
    Link: https://www.researchgate.net/publication/385588760_Hyperconflation_Recommending_a_Relational_Alternative_to_the_Datacentric_Approach_to_UAP
    Source snippet

    Available on Amazon. This book examines the intersection of politics and religion as relates to UFO/UAP...Read more...

  7. Source: cs.ubc.ca
    Link: https://www.cs.ubc.ca/~tmm/courses/547-17F/projects/hayley-theodore/report.pdf
    Source snippet

    Want to Believe: A Visualization of UFO Siting Reportsby TSH Guillou — In this work we present an interactive visualization tool for exam...

  8. Source: facebook.com
    Title: this is a map of all reported ufo sightings 1906 2014image esri
    Link: https://www.facebook.com/OfficialQI/posts/this-is-a-map-of-all-reported-ufo-sightings-1906-2014image-esri/4712025862145246/
    Source snippet

    This is a map of all reported UFO sightings, 1906-2014....This is a map of all reported UFO sightings, 1906-2014. (Image: ESRI.)...

  9. Source: youtu.be
    Link: https://youtu.be/MR1VCguoeaE
    Source snippet

    "Spectral methods for time series analysis [https://youtu.be/onrdJFlXbMg](https://youtu.be/onrdJFlXbMg) [https://youtu.be/BokqiGJhqhA](https://youtu.be/BokqiGJhqhA) [https://youtu.be/frQ4m77OLGs..."](https://youtu.be/frQ4m77OLGs...")...

  10. Source: youtube.com
    Title: Governments Using AI To Decode Massive UFO Databases | WION Podcast
    Link: https://www.youtube.com/watch?v=adCsqd_-M94
    Source snippet

    AI Found Hidden Patterns in 150,000 UFO Reports | ft. Christian Stepien, National UFO Database CTO - YouTube AI Found Hidden Patterns in...

Topic Tree

Follow this branch

Parent topic

Duplicates How One UFO Sighting Becomes Many Records

Related pages 4