Narut Butploy
Department of Computer Technology, Faculty of Industrial Technology, Kamphaeng Phet Rajabhat University, Kamphaeng Phet, Thailand.
Kanokwan Khiewwan
Department of Computer Technology, Faculty of Industrial Technology, Kamphaeng Phet Rajabhat University, Kamphaeng Phet, Thailand.
Jaturong Thongchai
Department of Computer Technology, Faculty of Industrial Technology, Kamphaeng Phet Rajabhat University, Kamphaeng Phet, Thailand.
Sawet Somnugpong
Department of Computer Technology, Faculty of Industrial Technology, Kamphaeng Phet Rajabhat University, Kamphaeng Phet, Thailand.
Pakin Maneechot
Department of Smart Grid Engineering, Faculty of Industrial Technology, Kamphaeng Phet Rajabhat University, Kamphaeng Phet, Thailand.
Phrommate Verapan
Department of Information Technology, Faculty of Science and Technology, Kamphaeng Phet Rajabhat University, Kamphaeng Phet, Thailand.
Khumphicha Tantisantisom
Department of Information Technology, Faculty of Science and Technology, Kamphaeng Phet Rajabhat University, Kamphaeng Phet, Thailand.
Karthikeyan Velmurugan
Department of Technology Engineering, Faculty of Industrial Technology, Kamphaeng Phet Rajabhat University, Kamphaeng Phet, 62000, Thailand.
DOI https://doi.org/10.33889/IJMEMS.2025.10.6.086
Abstract
Transformer-based embedding models are widely used for similarity search because they capture semantic similarity reliably and efficiently. This study uses the all-MiniLM-L6-v2, paraphrase-MiniLM-L6-v2, and all-distilroberta-v1 transformer-based embedding models to perform similarity search over Wikipedia documents. The three transformer models are ensembled for enhanced semantic search, and Principal Component Analysis (PCA) is applied to align the differing output dimensionalities of the models before ensembling. To assess the strength of the proposed approach, 2,000 randomly selected Wikipedia documents were converted into vectors and stored in MongoDB. The proposed transformer-based models were evaluated against ground truth drawn from 996 TREC questions. The all-MiniLM-L6-v2 and paraphrase-MiniLM-L6-v2 models consume less memory than the all-distilroberta-v1 model. However, the ensemble process increased memory usage to 924.79 MB, higher than any individual model, and the average execution time per query rose to 0.1031 seconds. In return, the ensemble+PCA approach attained higher precision@10 and recall, yielding a higher F1 score with an average of 0.5094. Error analysis indicates that the ensemble+PCA approach significantly improved semantic search, returning results more relevant to the raised query. Accordingly, the ensemble-with-PCA method is recommended for handling large datasets and is suitable for real-time applications.
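The core pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: random vectors with the real model dimensionalities (384 for all-MiniLM-L6-v2 and paraphrase-MiniLM-L6-v2, 768 for all-distilroberta-v1) stand in for actual `SentenceTransformer(...).encode(...)` outputs, the target dimensionality of 128 is an assumed choice, and averaging the PCA-reduced vectors is one plausible ensembling rule the paper may realize differently.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, target_dim = 200, 128  # target_dim is an assumed choice

def pca_reduce(X, k):
    """Project X (n x d) onto its top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# Stand-in embeddings for the three models; in practice these would come
# from sentence_transformers.SentenceTransformer(name).encode(documents)
embeddings = [rng.standard_normal((n_docs, d)) for d in (384, 384, 768)]

# PCA brings every model to a common dimensionality, so the vectors can
# be combined despite the 384-vs-768 mismatch; here they are averaged
reduced = [pca_reduce(E, target_dim) for E in embeddings]
ensemble = np.mean(reduced, axis=0)

# L2-normalize so that a dot product equals cosine similarity,
# the usual metric for vector search (e.g. in a MongoDB vector index)
ensemble /= np.linalg.norm(ensemble, axis=1, keepdims=True)
print(ensemble.shape)  # (200, 128)
```

After this step, each document is represented by a single 128-dimensional unit vector that can be stored in the database and queried by cosine similarity against an identically processed query embedding.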
Keywords: Sentence transformer, Vector search, NoSQL databases, Ensemble with PCA, Semantic search.
Citation
Butploy, N., Khiewwan, K., Thongchai, J., Somnugpong, S., Maneechot, P., Verapan, P., Tantisantisom, K., & Velmurugan, K. (2025). Multi-Transformer-Based Ensemble Embedding Model for Enhanced Vector Search in NoSQL Database: A Comparative Statistical and Performance Analysis. International Journal of Mathematical, Engineering and Management Sciences, 10(6), 1860-1879. https://doi.org/10.33889/IJMEMS.2025.10.6.086.