Deduplication in Web Profiles

Deduplication in Web Profiles involves finding and merging identical or very similar faculty records to improve data accuracy and efficiency for display on Web Profiles. This process helps ensure that a record appears only once across all university websites, whether at the university, department, or individual profile level.

Duplicate records may occur for several reasons, such as:

  • A user entering the same record more than once in FAR.
  • Coauthors from the same institution each entering the same publication.

The Web Profiles system automatically detects and merges these duplicates to present a single, accurate record for each scholarly activity.

 

 

How Deduplication Works

Data Scanning

The system scans the FAR database for potential duplicate records based on defined criteria to create small pools of potential duplicates. Only records set to Publicly Display are included.

Records in each pool share one or more of the following:

  1. The same title
  2. The same coauthor last name
  3. The same year

Deduplication checks are performed within the same scholarship type (for example, conference proceedings are only compared to other conference proceedings). Similar activities across different types, such as a conference presentation later published as a journal article, are not deduplicated because they represent distinct activities.
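The pooling step described above can be sketched roughly as follows. This is only an illustration, not the Web Profiles implementation: the record layout (a dict with `id`, `type`, `title`, `year`, `coauthor_last_names`, and `publicly_display` keys) and the function name `build_pools` are assumptions made for the example.

```python
from collections import defaultdict

def build_pools(records):
    """Group publicly displayed records into pools of potential duplicates.

    Assumption: every record is a dict with the hypothetical keys
    'id', 'type', 'title', 'year', 'coauthor_last_names', 'publicly_display'.
    """
    pools = defaultdict(set)
    for rec in records:
        if not rec.get("publicly_display"):
            continue  # only records set to Publicly Display are scanned
        kind = rec["type"]  # pools never mix scholarship types
        title = (rec.get("title") or "").strip().lower()
        if title:
            pools[(kind, "title", title)].add(rec["id"])
        if rec.get("year") is not None:
            pools[(kind, "year", rec["year"])].add(rec["id"])
        for name in rec.get("coauthor_last_names") or []:
            pools[(kind, "coauthor", name.strip().lower())].add(rec["id"])
    # A pool holding a single record has nothing to be compared against.
    return {key: ids for key, ids in pools.items() if len(ids) > 1}

example_records = [
    {"id": 1, "type": "journal article", "title": "Deep Learning",
     "year": 2020, "coauthor_last_names": ["Smith"], "publicly_display": True},
    {"id": 2, "type": "journal article", "title": "deep learning ",
     "year": 2020, "coauthor_last_names": ["Smith", "Lee"], "publicly_display": True},
    {"id": 3, "type": "conference proceeding", "title": "Deep Learning",
     "year": 2020, "coauthor_last_names": ["Smith"], "publicly_display": True},
    {"id": 4, "type": "journal article", "title": "Deep Learning",
     "year": 2020, "coauthor_last_names": ["Smith"], "publicly_display": False},
]
example_pools = build_pools(example_records)
```

In this example, records 1 and 2 land in the same pools. Record 3 is left out because it has a different scholarship type, and record 4 is left out because it is not set to Publicly Display.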

 
 
 

Duplicate Identification

Within each pool, potential duplicates are evaluated using a combination of unique identifiers and metadata field comparisons.

The system checks:

  • Unique identifiers: DOI, PMID, URL, Scopus publication ID, or patent number
  • Metadata fields: title, subtitle, volume, issue, journal, number of pages, publication date, ISBN, and host publication title

 

Unique Identifier Check:
If two records share the same Unique Identifier, they are considered duplicates.
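As a minimal sketch of the identifier check, assuming records are dicts with the hypothetical keys `doi`, `pmid`, `url`, `scopus_id`, and `patent_number`: the normalization shown here (trimming whitespace and lower-casing DOIs only) is an assumption for illustration, and the article does not describe how the system normalizes identifiers.

```python
# Hypothetical field names for the identifiers listed above.
ID_FIELDS = ("doi", "pmid", "url", "scopus_id", "patent_number")

def normalize_identifier(field, value):
    """Trim whitespace; lower-case DOIs, which are case-insensitive."""
    text = str(value).strip()
    return text.lower() if field == "doi" else text

def shares_unique_identifier(rec_a, rec_b):
    """Return True if both records carry the same value for any identifier."""
    for field in ID_FIELDS:
        value_a, value_b = rec_a.get(field), rec_b.get(field)
        if value_a and value_b:
            if normalize_identifier(field, value_a) == normalize_identifier(field, value_b):
                return True
    return False
```

A missing identifier on either record is skipped rather than treated as a mismatch, so a pair with no identifiers in common simply falls through to the field comparison.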

 

Field Comparison: If no Unique Identifier is present, the available metadata on the two records is compared. For each field that has data, a similarity score between 0 (completely different) and 1 (identical) is calculated, and all fields with a similarity score are used as input to a cosine similarity calculation.

For each content type (for example, book review, journal article, or creative production performance), a specific threshold is then applied to the cosine result. Pairs of potential duplicates that score above the threshold for their content type are considered duplicates.
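The field comparison can be illustrated with a short sketch. Several parts of it are assumptions, because the article does not specify them: the per-field score uses Python's `difflib.SequenceMatcher`, the cosine is taken between the vector of field scores and an "ideal" vector of all ones, the field names are hypothetical, and the threshold passed in is made up for the example.

```python
import math
from difflib import SequenceMatcher

# Hypothetical field names for the metadata fields listed above.
FIELDS = ("title", "subtitle", "volume", "issue", "journal", "number_of_pages",
          "publication_date", "isbn", "host_publication_title")

def field_similarity(a, b):
    """Score two values from 0 (completely different) to 1 (identical)."""
    return SequenceMatcher(None, str(a).strip().lower(), str(b).strip().lower()).ratio()

def field_scores(rec_a, rec_b):
    """Score only the fields that have data on both records."""
    return [field_similarity(rec_a[f], rec_b[f])
            for f in FIELDS if rec_a.get(f) and rec_b.get(f)]

def cosine_to_ideal(scores):
    """Cosine similarity between the scores and an all-ones vector (assumed reading)."""
    norm = math.sqrt(sum(s * s for s in scores))
    if norm == 0:
        return 0.0  # no comparable fields, or every field scored 0
    return sum(scores) / (norm * math.sqrt(len(scores)))

def is_duplicate(rec_a, rec_b, threshold):
    """Apply a content-type-specific threshold (value supplied by the caller)."""
    return cosine_to_ideal(field_scores(rec_a, rec_b)) > threshold
```

Under this reading, the cosine measures how evenly the compared fields agree rather than how strongly each one matches, so a pair with only one comparable field would score close to 1 whenever that field has any overlap. The article does not say which vectors the production system compares, so treat this as one possible interpretation.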

 
 

Record Merging

Duplicates are merged into a single master record through the following process:

Record selection: A "target" or "representative" record is chosen from the group based on which record has the most complete metadata or information.

Metadata selection: Once a target has been selected, metadata from the other records in the group is used to fill in any available fields on the target record. The graphic below provides a simplified overview of the metadata selection process.

Author listings: Author listings are created by tracking each author's position in every record in the duplicate group. The position in which an author appears most frequently is then used as that author's final position in the target record.
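The author-position rule can be sketched as a simple vote. This example assumes that author names already match exactly across records and that ties go to the earliest position; neither detail is stated in the article, and `merge_author_order` is a hypothetical name.

```python
from collections import Counter, defaultdict

def merge_author_order(author_lists):
    """Order authors by the position each one occupies most often.

    author_lists: one list of author names per record in the duplicate group.
    """
    positions = defaultdict(Counter)
    for authors in author_lists:
        for index, name in enumerate(authors, start=1):
            positions[name][index] += 1
    final = {
        # Most frequent position wins; ties go to the earliest position (assumption).
        name: min(counts, key=lambda pos: (-counts[pos], pos))
        for name, counts in positions.items()
    }
    return sorted(final, key=lambda name: (final[name], name))

example_order = merge_author_order([
    ["Smith", "Lee", "Garcia"],
    ["Smith", "Garcia", "Lee"],
    ["Smith", "Lee", "Garcia"],
])
```

Here Lee is listed second in two of the three records and Garcia third in two of the three, so the merged listing is Smith, Lee, Garcia.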

This merging process runs daily, so updates made in FAR are included in the next day's Web Profiles update.

Metadata Selection Process

[Figure: flow diagram of the record merging process]
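Record selection and metadata selection can be approximated with the sketch below. It assumes that "most complete" means the record with the most non-empty fields, that ties go to the first record in the group, and that each blank field on the target is filled with the first non-empty value found on another record. The real selection rules are shown in the graphic and may differ; `merge_group` is a hypothetical name.

```python
def completeness(record, fields):
    """Count how many of the given fields hold a non-empty value."""
    return sum(1 for field in fields if record.get(field))

def merge_group(group, fields):
    """Pick the most complete record as the target, then fill its blanks."""
    # max() returns the first record among equally complete ones.
    target = max(group, key=lambda rec: completeness(rec, fields))
    merged = dict(target)  # copy, so the original records stay untouched
    for field in fields:
        if merged.get(field):
            continue  # the target already has a value for this field
        for other in group:
            if other.get(field):
                merged[field] = other[field]  # first non-empty value (assumption)
                break
    return merged

example_group = [
    {"title": "A Study", "volume": "", "doi": None},
    {"title": "A Study", "volume": "3", "issue": "2"},
    {"title": "A Study", "doi": "10.1/x"},
]
example_merged = merge_group(example_group, ("title", "volume", "issue", "doi"))
```

In this example the second record becomes the target because it has three populated fields, and its missing DOI is taken from the third record.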
 
 

 

Visibility on Web Profiles

Once deduplication and merging are complete:

  • The target record appears on the Web Profiles of all internal authors who have set the record to Publicly Display = Yes.
  • If an internal author marks a record as Publicly Display = No, it will not appear on their individual profile, but it may still appear on:
    • The organization or department profile, and
    • The profiles of coauthors who have opted to display it publicly.

 

Integration with Scopus and Metadata Enrichment

For books, book chapters, journal articles, and conference proceedings that also exist in Scopus, the Scopus record takes precedence and is displayed on Web Profiles regardless of what exists in FAR.

If discrepancies appear between what is displayed and what is expected, faculty should contact the publisher to make updates in Scopus.

 

After deduplication and merging, the system attempts to match all remaining records to Scopus to enrich them with key metadata. This enrichment process can take up to a week. 

 

FAQ

Why are some records still appearing duplicated?

Despite deduplication, some records may still appear more than once due to:

  • Distinct scholarly activities that share similar titles (for example, a conference presentation and a later publication).
  • Differences in scholarship types that prevent records from being compared.
  • Records missing key metadata fields used to identify duplicates.
 
 

 
