Together Yet Apart: Multimodal Representation Learning for Personalised Visual Art Recommendation
B. Yilma and L. Leiva. Proceedings of the 31st ACM Conference on User Modeling, Adaptation and Personalization, pages 204–214. ACM, June 2023
DOI: 10.1145/3565472.3592964
Abstract
With the advent of digital media, the availability of art content has greatly expanded, making it increasingly challenging for individuals to discover and curate works that align with their personal preferences and tastes. The task of providing accurate and personalized Visual Art (VA) recommendations is thus a complex one, requiring a deep understanding of the intricate interplay of multiple modalities such as images, textual descriptions, or other metadata. In this paper, we study the nuances of the modalities involved in the VA domain (image and text) and how they can be effectively harnessed to provide a truly personalized art experience to users. In particular, we develop four fusion-based multimodal VA recommendation pipelines and conduct a large-scale user-centric evaluation. Our results indicate that early fusion (i.e., joint multimodal learning of visual and textual features) is preferred over a late fusion of ranked paintings from unimodal models (state-of-the-art baselines), but only if the latent representation space of the multimodal painting embeddings is entangled. Our findings open a new perspective for better representation learning in the VA RecSys domain.
%0 Conference Paper
%1 Yilma_2023
%A Yilma, Bereket A.
%A Leiva, Luis A.
%B Proceedings of the 31st ACM Conference on User Modeling, Adaptation and Personalization
%D 2023
%I ACM
%K art-recommender recommender umap2023
%P 204-214
%R 10.1145/3565472.3592964
%T Together Yet Apart: Multimodal Representation Learning for Personalised Visual Art Recommendation
%U https://doi.org/10.1145/3565472.3592964
%X With the advent of digital media, the availability of art content has greatly expanded, making it increasingly challenging for individuals to discover and curate works that align with their personal preferences and tastes. The task of providing accurate and personalized Visual Art (VA) recommendations is thus a complex one, requiring a deep understanding of the intricate interplay of multiple modalities such as images, textual descriptions, or other metadata. In this paper, we study the nuances of the modalities involved in the VA domain (image and text) and how they can be effectively harnessed to provide a truly personalized art experience to users. In particular, we develop four fusion-based multimodal VA recommendation pipelines and conduct a large-scale user-centric evaluation. Our results indicate that early fusion (i.e., joint multimodal learning of visual and textual features) is preferred over a late fusion of ranked paintings from unimodal models (state-of-the-art baselines), but only if the latent representation space of the multimodal painting embeddings is entangled. Our findings open a new perspective for better representation learning in the VA RecSys domain.
@inproceedings{Yilma_2023,
abstract = {With the advent of digital media, the availability of art content has greatly expanded, making it increasingly challenging for individuals to discover and curate works that align with their personal preferences and tastes. The task of providing accurate and personalized Visual Art (VA) recommendations is thus a complex one, requiring a deep understanding of the intricate interplay of multiple modalities such as images, textual descriptions, or other metadata. In this paper, we study the nuances of the modalities involved in the VA domain (image and text) and how they can be effectively harnessed to provide a truly personalized art experience to users. In particular, we develop four fusion-based multimodal VA recommendation pipelines and conduct a large-scale user-centric evaluation. Our results indicate that early fusion (i.e., joint multimodal learning of visual and textual features) is preferred over a late fusion of ranked paintings from unimodal models (state-of-the-art baselines), but only if the latent representation space of the multimodal painting embeddings is entangled. Our findings open a new perspective for better representation learning in the VA RecSys domain.},
author = {Yilma, Bereket A. and Leiva, Luis A.},
booktitle = {Proceedings of the 31st {ACM} Conference on User Modeling, Adaptation and Personalization},
doi = {10.1145/3565472.3592964},
keywords = {art-recommender recommender umap2023},
month = jun,
pages = {204--214},
publisher = {{ACM}},
title = {Together Yet Apart: Multimodal Representation Learning for Personalised Visual Art Recommendation},
url = {https://doi.org/10.1145/3565472.3592964},
year = 2023
}