IT Brief UK - Technology news for CIOs & IT decision-makers

New tool transforms presentation videos into searchable PDFs

Wed, 1st Jan 2025

A new software tool has been developed to convert presentation videos into structured, searchable PDF summaries, as introduced in a recent study published in the journal SoftwareX.

PV2DOC has been designed by researchers led by Professor Hyuk-Yoon Kwon at Seoul National University of Science and Technology in South Korea. The software integrates visual and audio data from presentation videos, transforming them into well-organised PDF documents. This process aims to simplify data analysis, reduce storage needs, and improve information accessibility.

The use of presentation videos has grown sharply, particularly since the COVID-19 pandemic. While engaging, these videos are difficult to search for specific information and impose significant storage demands because of their large file sizes.

PV2DOC distinguishes itself from other video-summarising tools, many of which require an accompanying transcript and struggle when only the video is available. By combining visual and audio data, PV2DOC offers a comprehensive solution for converting videos into documents.

The study outlining PV2DOC's development was published online on 11 October 2024 and subsequently appeared in Volume 28 of SoftwareX on 1 December 2024.

"For users who need to watch and study numerous videos, such as lectures or conference presentations, PV2DOC generates summarised reports that can be read within two minutes," explains Prof. Kwon.

"Additionally, PV2DOC manages figures and tables separately, connecting them to the summarised content so users can refer to them when needed," Kwon says.

To process image data, PV2DOC extracts frames at one-second intervals and uses the structural similarity index (SSIM) to discard near-duplicates, keeping only unique frames. Two object-detection models, Mask R-CNN and YOLOv5, then identify objects within these frames. The software applies a figure-merge technique to combine fragmented images and uses the Google Tesseract engine for optical character recognition, extracting and organising text in a structured manner.
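The frame-deduplication step can be sketched as follows. This is a minimal illustration rather than the PV2DOC implementation: it computes a simplified single-window SSIM in NumPy (a production pipeline would typically use a windowed SSIM such as scikit-image's `structural_similarity`), and it assumes frames have already been sampled at one-second intervals as grayscale arrays.

```python
import numpy as np

def global_ssim(a: np.ndarray, b: np.ndarray, data_range: float = 255.0) -> float:
    """Simplified single-window SSIM between two grayscale frames.

    Uses the standard SSIM formula with the whole frame as one window;
    real implementations average SSIM over small local windows.
    """
    c1 = (0.01 * data_range) ** 2  # stabilising constants from the SSIM paper
    c2 = (0.03 * data_range) ** 2
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )

def unique_frames(frames, threshold: float = 0.9):
    """Keep a frame only if it differs enough from the last kept frame."""
    kept = []
    for frame in frames:
        if not kept or global_ssim(kept[-1], frame) < threshold:
            kept.append(frame)
    return kept
```

Feeding in a sequence where consecutive frames are near-identical (as happens when a slide stays on screen) collapses each run into a single representative frame, which is then handed to the detection and OCR stages. The `threshold` value here is an illustrative assumption, not a figure from the study.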

On the audio front, PV2DOC uses the Whisper model, an open-source speech-to-text tool, to convert spoken content into text. The TextRank algorithm is then used to summarise the main points of the transcription. The final product is a Markdown document, which can be exported as a PDF, reflecting the organised content of the original video.
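The audio-side summarisation can be sketched as below. This is a hedged illustration under stated assumptions: the Whisper transcription step is stubbed out (the `transcript` string stands in for Whisper's output), and TextRank is reduced to a word-overlap sentence graph ranked by PageRank-style power iteration, which may differ from the exact configuration PV2DOC uses.

```python
import math
import re
from itertools import combinations

def _words(sentence: str) -> set:
    """Lowercased word tokens, punctuation stripped."""
    return set(re.findall(r"\w+", sentence.lower()))

def _similarity(s1: str, s2: str) -> float:
    """Classic TextRank sentence similarity: overlap scaled by sentence length."""
    w1, w2 = _words(s1), _words(s2)
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / (math.log(len(w1) + 1) + math.log(len(w2) + 1))

def textrank_summary(transcript: str, top_n: int = 2,
                     damping: float = 0.85, iters: int = 50):
    """Return the top_n highest-ranked sentences, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", transcript) if s.strip()]
    n = len(sentences)
    if n <= top_n:
        return sentences
    # Build a symmetric similarity graph over sentences.
    sim = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        sim[i][j] = sim[j][i] = _similarity(sentences[i], sentences[j])
    # PageRank-style power iteration over the weighted graph.
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if j == i or sim[j][i] == 0.0:
                    continue
                rank += sim[j][i] / sum(sim[j]) * scores[j]
            new.append((1 - damping) / n + damping * rank)
        scores = new
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(top)]
```

Sentences that share vocabulary with many others rank highly, while off-topic asides fall out of the summary; the selected sentences would then be written into the Markdown document alongside the extracted figures before PDF export.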

By structuring unorganised video data, PV2DOC enhances video accessibility and reduces the storage required for video content. "This software simplifies data storage and facilitates data analysis for presentation videos by transforming unstructured data into a structured format, thus offering significant potential from the perspectives of information accessibility and data management. It provides a foundation for more efficient utilisation of presentation videos," says Prof. Kwon.

The research team plans ongoing advancements in video content accessibility. The team's next aim is to train a large language model to provide a question-answering service, allowing users to receive accurate, contextually relevant answers based on video content.
