DSpace Data Harvesters

Automatic ingestion of publications from global scholarly databases, directly into your DSpace repository

What is DSpace Data Harvesters?

For years, keeping a repository current has depended on manual submission by researchers or time-consuming data entry by administrative staff. DSpace Data Harvesters by PCG Academia change this fundamentally. Instead of waiting for publications to arrive, the harvester connects to major global scholarly databases – Web of Science, Scopus, OpenAlex, and Crossref – and pulls in records automatically. Institutional publications are matched using two complementary methods: the institution’s ROR identifier, and the affiliation information declared in the source database. Because each method catches records the other can miss, the two-track approach is designed to leave as few institutional publications unharvested as possible. Every harvested record enters a review workflow before publication, so institutions gain both automation and full editorial control.

Whether seeding a new repository from scratch or keeping an established one continuously up to date, Data Harvesters turn repository completeness from an ongoing effort into a natural outcome of how the system works.

Key benefits

Automatic retrieval of publications from Web of Science, Scopus, OpenAlex, and Crossref – no manual submission required

Precise institutional matching via ROR identifier or source-system affiliation data – two complementary methods that together keep overlooked publications to a minimum

Controlled review workflow – every harvested record is verified, enriched, and approved by staff before going live

Bulk approval for large volumes – practical handling of high-volume ingestion without turning every item into a separate task

Continuous scheduled harvesting – the repository stays current with the institution’s actual research output automatically

Implementation delivered by PCG Academia, a DSpace Platinum Service Provider – high service standards backed by years of experience in the academic environment

Key features

Multi-source harvesting

  • automatic retrieval of publication records from Web of Science, Scopus, OpenAlex, and Crossref in a single, unified workflow

Flexible institutional matching

  • publications are identified using the institution’s ROR identifier or the affiliation information held in the source database. The two methods complement each other, so a publication can be matched however its authors described their institutional affiliation in a given paper
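
The two matching methods can be illustrated with a minimal Python sketch. It is not the product's implementation: the `Record` and `Affiliation` classes, the `match_method` function, the placeholder ROR value, and the name variants are all hypothetical, chosen only to show how an identifier match and an affiliation-text match can complement each other.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Placeholder value for illustration only, not a real ROR identifier.
EXAMPLE_ROR = "https://ror.org/0example0"


@dataclass
class Affiliation:
    name: str                  # affiliation string as declared in the source
    ror: Optional[str] = None  # ROR identifier, when the source supplies one


@dataclass
class Record:
    title: str
    affiliations: List[Affiliation] = field(default_factory=list)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so name comparisons are less brittle."""
    return " ".join(text.lower().split())


def match_method(record: Record, ror_id: str,
                 name_variants: List[str]) -> Optional[str]:
    """Return "ror", "affiliation", or None if the record does not match."""
    # Method 1: exact match on the institution's ROR identifier.
    if any(aff.ror == ror_id for aff in record.affiliations):
        return "ror"
    # Method 2: fall back to the declared affiliation text.
    wanted = [normalize(v) for v in name_variants]
    for aff in record.affiliations:
        declared = normalize(aff.name)
        if any(variant in declared for variant in wanted):
            return "affiliation"
    return None


records = [
    Record("Paper A", [Affiliation("Example University", ror=EXAMPLE_ROR)]),
    Record("Paper B", [Affiliation("Dept. of Physics,  Example  Univ., Springfield")]),
    Record("Paper C", [Affiliation("Another Institute")]),
]
variants = ["Example University", "Example Univ."]
results = {r.title: match_method(r, EXAMPLE_ROR, variants) for r in records}
print(results)
```

In this sketch, Paper A matches on the identifier, Paper B carries no identifier and matches on the affiliation text, and Paper C matches neither. Plain substring comparison is a simplification; a real matcher would need more careful handling of name variants.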

Review and enrichment workflow

  • every harvested record enters a dedicated review queue where staff can verify metadata, add full-text files, correct errors, or approve records in bulk before they go live in the repository
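
As a rough illustration of the review step, the sketch below models a queue in which harvested items stay pending until staff approve them, individually or in bulk. It is a simplified assumption written for this page, not DSpace's workflow API: `ReviewQueue`, `HarvestedItem`, `Status`, and the item identifiers are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class HarvestedItem:
    identifier: str
    source: str
    status: Status = Status.PENDING  # nothing is live until a reviewer acts


class ReviewQueue:
    def __init__(self) -> None:
        self._items: Dict[str, HarvestedItem] = {}

    def add(self, item: HarvestedItem) -> None:
        self._items[item.identifier] = item

    def pending(self) -> List[HarvestedItem]:
        return [i for i in self._items.values() if i.status is Status.PENDING]

    def live(self) -> List[HarvestedItem]:
        return [i for i in self._items.values() if i.status is Status.APPROVED]

    def reject(self, identifier: str) -> None:
        self._items[identifier].status = Status.REJECTED

    def approve(self, identifiers: Iterable[str]) -> int:
        """Approve one or many pending items; return how many changed state."""
        changed = 0
        for ident in identifiers:
            item = self._items.get(ident)
            # Only pending items can be approved; rejected ones stay rejected.
            if item is not None and item.status is Status.PENDING:
                item.status = Status.APPROVED
                changed += 1
        return changed


queue = ReviewQueue()
for ident, source in [("item-1", "Crossref"), ("item-2", "OpenAlex"), ("item-3", "Scopus")]:
    queue.add(HarvestedItem(ident, source))

queue.reject("item-2")  # individual decision on one record
approved_count = queue.approve(i.identifier for i in queue.pending())  # bulk approval
print(approved_count, [i.identifier for i in queue.live()])
```

The same `approve` call covers both single-item review and bulk approval, which mirrors the idea that high-volume ingestion should not turn every item into a separate task.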

Scheduled and on-demand harvesting

  • harvesters can run on automated schedules to keep the repository continuously up to date, or be triggered manually to seed a new repository with a large volume of existing publications in a single operation

How does it fit into the university ecosystem?

Seamless DSpace integration

  • Data Harvesters are built natively for DSpace and DSpace CRIS. Harvested records land directly in the repository’s standard item workflow, with no additional middleware or manual data transfer.

Connection to global scholarly infrastructure

  • By drawing from the same databases that the rest of the scholarly world uses to track research – Web of Science, Scopus, OpenAlex, Crossref – the repository stays in sync with authoritative, globally recognised sources.

Human oversight at every step

  • Nothing harvested from an external source goes straight into the live repository. The review workflow ensures that staff remain in full control of what is published, with options for individual review or bulk approval depending on volume.

Flexible deployment

  • Data Harvesters can be deployed within existing DSpace on-premises or cloud environments, and can be configured for different harvesting sources, institutional identifiers, and scheduling requirements.

Proven approach, real results

A solution that turns repository completeness from a manual effort into an automatic, ongoing outcome – removing the dependency on individual researchers remembering to submit.

Built on DSpace, the most widely adopted open-source repository platform in academic institutions worldwide, and maintained by PCG Academia, a LYRASIS-certified Platinum DSpace Service Provider

A clear, controlled data flow: external database → harvesting → review → live repository – with staff in full control at every stage

Ready for:

  • new harvesting sources as they become available;
  • changes in institutional identifier standards (ROR, ORCID, local IDs);
  • growing repository volume without growing administrative burden

Next steps

👉 See how DSpace Data Harvesters can work for your institution

Process demo: harvesting from Web of Science, Scopus, and OpenAlex – with a walkthrough of the review and approval workflow

Architecture consultation: matching the harvesting configuration to your institution’s identifier setup and repository environment

Discussion of the implementation scenario and expected outcomes for your repository’s completeness and staff workload

Contact us to schedule a DEMO

Contact the Business Department

Łukasz Wawer

SCIENCE Business Line Manager
Fill out the contact form