Narut Butploy
Department of Computer Technology, Faculty of Industrial Technology, Kamphaeng Phet Rajabhat University, Kamphaeng Phet, Thailand.
Kanokwan Khiewwan
Department of Computer Technology, Faculty of Industrial Technology, Kamphaeng Phet Rajabhat University, Kamphaeng Phet, Thailand.
Jaturong Thongchai
Department of Computer Technology, Faculty of Industrial Technology, Kamphaeng Phet Rajabhat University, Kamphaeng Phet, Thailand.
Sawet Somnugpong
Department of Computer Technology, Faculty of Industrial Technology, Kamphaeng Phet Rajabhat University, Kamphaeng Phet, Thailand.
Pakin Maneechot
Department of Smart Grid Engineering, Faculty of Industrial Technology, Kamphaeng Phet Rajabhat University, Kamphaeng Phet, Thailand.
Phrommate Verapan
Department of Information Technology, Faculty of Science and Technology, Kamphaeng Phet Rajabhat University, Kamphaeng Phet, Thailand.
Khumphicha Tantisantisom
Department of Information Technology, Faculty of Science and Technology, Kamphaeng Phet Rajabhat University, Kamphaeng Phet, Thailand.
Karthikeyan Velmurugan
Department of Technology Engineering, Faculty of Industrial Technology, Kamphaeng Phet Rajabhat University, Kamphaeng Phet, 62000, Thailand.
DOI https://doi.org/10.33889/IJMEMS.2025.10.6.086
Abstract
Transformer-based embedding models are widely used for similarity search because they capture semantic similarity reliably and efficiently. This study uses the all-MiniLM-L6-v2, paraphrase-MiniLM-L6-v2, and all-distilroberta-v1 transformer-based embedding models to perform similarity search over Wikipedia documents. The three transformer models are ensembled for enhanced semantic search, and Principal Component Analysis (PCA) is applied to align the differing output dimensionalities of the models before ensembling. To assess the strength of the proposed approach, 2,000 randomly selected Wikipedia documents were converted into vectors and stored in MongoDB. The proposed transformer-based models were evaluated against ground truth drawn from 996 TREC questions. The all-MiniLM-L6-v2 and paraphrase-MiniLM-L6-v2 models consume less memory than the all-distilroberta-v1 model. However, the ensemble process increased memory usage to 924.79 MB, higher than any individual model, and the average execution time per query rose to 0.1031 seconds. In return, the ensemble+PCA approach attained higher precision@10 and recall, yielding a higher F1 score with an average of 0.5094. Error analysis indicates that the ensemble+PCA approach significantly improved semantic search, returning results more relevant to the raised query. Accordingly, the ensemble-with-PCA method is recommended for handling large datasets and is suitable for real-time applications.
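The core pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: random vectors with the real model dimensionalities (384 for all-MiniLM-L6-v2 and paraphrase-MiniLM-L6-v2, 768 for all-distilroberta-v1) stand in for actual `SentenceTransformer(...).encode(...)` outputs, the target dimensionality of 128 is an assumed choice, and averaging the PCA-reduced vectors is one plausible ensembling rule the paper may realize differently.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, target_dim = 200, 128  # target_dim is an assumed choice

def pca_reduce(X, k):
    """Project X (n x d) onto its top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# Stand-in embeddings for the three models; in practice these would come
# from sentence_transformers.SentenceTransformer(name).encode(documents)
embeddings = [rng.standard_normal((n_docs, d)) for d in (384, 384, 768)]

# PCA brings every model to a common dimensionality, so the vectors can
# be combined despite the 384-vs-768 mismatch; here they are averaged
reduced = [pca_reduce(E, target_dim) for E in embeddings]
ensemble = np.mean(reduced, axis=0)

# L2-normalize so that a dot product equals cosine similarity,
# the usual metric for vector search (e.g. in a MongoDB vector index)
ensemble /= np.linalg.norm(ensemble, axis=1, keepdims=True)
print(ensemble.shape)  # (200, 128)
```

After this step, each document is represented by a single 128-dimensional unit vector that can be stored in the database and queried by cosine similarity against an identically processed query embedding.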
Keywords: Sentence transformer, Vector search, NoSQL databases, Ensemble with PCA, Semantic search.
Citation
Butploy, N., Khiewwan, K., Thongchai, J., Somnugpong, S., Maneechot, P., Verapan, P., Tantisantisom, K., & Velmurugan, K. (2025). Multi-Transformer-Based Ensemble Embedding Model for Enhanced Vector Search in NoSQL Database: A Comparative Statistical and Performance Analysis. International Journal of Mathematical, Engineering and Management Sciences, 10(6), 1860-1879. https://doi.org/10.33889/IJMEMS.2025.10.6.086.