Long-term effectiveness of immersive VR simulations in undergraduate science learning: lessons from a media-comparison study

Our main goal was to investigate whether and how using multiple immersive virtual reality (iVR) simulations, and their video playback, in a science course affects student learning over time. We conducted a longitudinal study, in ecological settings, at an undergraduate field-course on three topics in environmental biology. Twenty-eight undergraduates were randomly assigned to either an iVR-interaction group or a video-viewing group. During the field-course, the iVR group interacted with a head-mounted device-based iVR simulation related to each topic (i.e. three interventions in total), while the video group watched a pre-recorded video of the respective simulation on a laptop. Cognitive and affective data were collected at the following checkpoints: a pre-test before the first intervention, a topic-specific post-test immediately after each intervention, a final post-test towards the end of the course, and a longitudinal post-test deployed approximately 2 months after the course. A descriptive analysis showed that student performance on the knowledge tests increased considerably over time for the iVR group but remained unchanged for the video group. While no within- or between-group differences were noted for the intrinsic motivation and self-efficacy measures, students in the iVR group enjoyed all the simulations and perceived themselves to benefit from them.


Introduction
Head-mounted device-based (i.e. HMD-based) immersive virtual reality (hereafter, iVR) simulation interfaces are powerful learning and teaching media in science, technology, engineering and mathematics (STEM). They afford learners' engagement and interaction with abstract or inaccessible scientific entities, phenomena and concepts in innovative ways (Pande 2020; Pande and Chandrasekharan 2017). With the advent of diverse iVR interfaces and their potential applications in higher education settings, educational technologists are increasingly focusing on how these interfaces could be utilised to support student learning of scientific content (e.g. laboratory procedures, concepts) as well as affective aspects of student engagement with that content (Kafai and Dede 2014). However, despite its promise, and the enthusiasm among educational technology communities, iVR is still not considered a mainstream technology in higher education (Cochrane 2016; Gutierrez-Maldonado, Andres-Pueyo, and Talarn-Caparros 2015). The actual implementation of iVR as an educational technology involves logistical, technical and economic challenges (Khor et al. 2016; Liagkou, Salmas, and Stylios 2019). Importantly, there exist several gaps in current research on the instructional effectiveness of iVR in STEM education. Firstly, research on the learning-teaching effectiveness of iVR, particularly in university-level STEM education, is limited in volume (Jensen and Konradsen 2018; Radianti et al. 2020). Secondly, the existing research fails to provide a coherent and conclusive picture of whether using iVR in university STEM teaching improves students' learning of the content (Checa and Bustillo 2020; Concannon, Esmail, and Roberts 2019). Thirdly, and perhaps intricately related to the second, there are limited process analyses of how learning happens in/with iVR (Concannon, Esmail, and Roberts 2019; Schott and Marshall 2018), particularly in ecological settings (Hew and Cheung 2010).
Finally, research fails to provide insights into the long-term effectiveness and usefulness of iVR-based instruction (Sánchez, Lumberas, and Silva 1997).
This paper attempts to fill the gaps in the recent and still limited literature on the usefulness and effectiveness of iVR-based instruction in higher STEM education.

Recent research on iVR in STEM education: Gaps and limitations
In a systematic review, Concannon, Esmail and Roberts (2019) found that most studies reporting implementation of iVR-based instruction were situated within STEM education or allied fields (e.g. health sciences, psychology education). Most iVR studies in STEM education involved controlled experiments (e.g. iVR instruction vs. no instruction) and media comparison experiments evaluating the instructional effectiveness of iVR in comparison to other modes of instruction (e.g. traditional lecture-based, video-based, desktop-based). These studies typically investigate the effects of iVR instruction on various cognitive (e.g. procedural learning, retention and recall, conceptual learning) and non-cognitive (e.g. motivation, enjoyment) aspects of learning (Hew and Cheung 2010; Kavanagh et al. 2017; Parong and Mayer 2018). For instance, Lamb et al. (2018) recently showed that iVR not only leads to better learning outcomes (i.e. gain in test scores) on molecular biology topics, but also triggers significantly higher cognitive engagement and processing (e.g. neural activation) as compared to video lectures. The authors also reported that learning science through iVR simulations is cognitively (e.g. in terms of activity in certain brain areas) equivalent to engaging in serious educational games and hands-on activities (Lamb et al. 2018). On the other hand, a study comparing the learning and presence effects of two levels of immersion demonstrated that iVR instruction overloaded the students cognitively (Makransky, Terkildsen, and Mayer 2019). In this study, iVR did lead to a higher feeling of presence and liking among participants as opposed to a desktop. However, it yielded lower knowledge-related learning outcomes compared to a desktop. This indicates that liking an intervention does not necessarily lead to better learning outcomes.

Another study (Makransky, Borre-Gude, and Mayer 2019) compared the effects of an iVR simulation with those of an identical desktop simulation and text-based instruction in training undergraduate students for general lab safety. They found that students in the iVR condition significantly outperformed the other two conditions on practical lab tests and feelings of self-efficacy. However, iVR did not add any advantage to knowledge retention, intrinsic motivation, and perceived enjoyment compared to the other two conditions (Makransky, Borre-Gude, and Mayer 2019). Several other studies have demonstrated positive effects of iVR on most affective aspects of learning regardless of: iVR's success in supporting actual content or conceptual learning (Garcia-Bonete, Jensen, and Katona 2019; Madden et al. 2020; Parong and Mayer 2018; Tan and Waugh 2013), the implementation-related logistical limitations of iVR (Ba et al. 2019; Makransky, Terkildsen, and Mayer 2019) and the cyber-sickness that iVR may induce (e.g. Moro, Štromberga, and Stirling 2017). These effects have been demonstrated in isolation and also in comparison to other media of instruction (e.g. desktop simulations, video demonstrations, text-based instruction). While research on iVR instruction fails to provide conclusive results on its effectiveness, even in comparison with other traditional as well as new instructional modes, studies examining iVR simulations in combination with other interventions, such as generative learning strategies (Fiorella and Mayer 2016), report positive effects on various cognitive and affective aspects of learning. In a media-comparison study, asking students to summarise parts of a recently experienced iVR lesson increased their knowledge-related learning outcomes in contrast to students who received instruction through traditional slide show-based lessons (Parong and Mayer 2018). In a media (iVR vs. video) and methods (pre-training vs. no pre-training) comparison experiment, students who received pre-training about certain concepts in biology (e.g. the cell) before receiving iVR instruction were found to perform significantly better on post-tests related to knowledge retention, transfer and self-efficacy, as compared to students who did not receive any pre-training, or those who received pre-training in combination with video-based instruction (Meyer, Omdahl, and Makransky 2019). In a similar media (iVR vs. video) and methods (enactment vs. no-enactment) comparison experiment, Andreasen et al. (2019) showed that performing an enactment of scientific procedures (e.g. pipetting) immediately after interacting with an iVR simulation of those procedures (i.e. iVR + enactment) leads to significantly better learning outcomes in terms of retention and transfer as compared to only interacting with an iVR simulation, watching its video, or even watching a video and thereafter performing enactment (Andreasen et al. 2019; Makransky et al. 2020). They also found positive effects of the iVR + enactment intervention on intrinsic motivation and self-efficacy. However, these positive results may not identify and/or justify the specific role iVR plays in the learning process.
Most research on the instructional effectiveness of iVR in STEM education involves participants experiencing only a single exposure to the iVR environment (Concannon, Esmail, and Roberts 2019; Makransky, Borre-Gude, and Mayer 2019; Makransky, Terkildsen, and Mayer 2019; Southgate 2020; Vergara et al. 2019). As a result, it could be argued that the positive affective learning outcomes associated with iVR interventions reported by many of these studies could be a temporary consequence of the well-known 'novelty effect' (Moos and Marroquin 2010). The novelty effect is typically manifested as, or characterised by, higher ratings reported by participants on self-report affective measurement scales administered to them immediately after an intervention. Anecdotally, this effect may result from participants' increased excitement around the new technology. iVR is often an overwhelming experience for students, particularly for those with little or no prior exposure to immersive virtual environments. Hence, such students are more likely to rate themselves highly positively across the various affective measurement items immediately after such an experience (Chen et al. 2016). This could also explain the lack of conclusive success of iVR interventions in comparison to interventions based on other media, particularly in relation to knowledge retention, conceptual understanding, and other cognitive-procedural aspects of learning (Moro, Štromberga, and Stirling 2017; Radianti et al. 2020). Moreover, the promising affective influences of iVR on students may not be long-term, as it has been found that students' increased interest in technology-based activities may not result in an extended interest in the corresponding academic content (Torff and Tirotta 2010). It is thus critical to the progress of research and development in the educational technology domain to understand the long-term instructional effectiveness of iVR-based interventions (Madani et al. 2016; Southgate 2020).
Finally, and most importantly, there is a lack of longitudinal studies focusing on the long-term usefulness and/or effectiveness of iVR and iVR-based interventions in STEM education (Chittaro and Buttussi 2015; Wu, Yu, and Gu 2020).

Study objective and research question
To contribute to the current body of research on the instructional effectiveness of iVR in higher STEM education, and to address the above-mentioned issues related to iVR's long-term as well as ecological validity, a cross-sectional media-comparison quasi-experiment was conceptualised at a 6-day undergraduate science course. This experiment sought answers to the following multi-faceted research questions:
• How does student interaction with multiple iVR simulations on different topics in environmental biology affect their (1) learning of the content related to those topics, (2) intrinsic motivation, (3) self-efficacy, and (4) perceived learning and enjoyment over time? (Cross-sectional/longitudinal patterns)
• How does the effect of iVR simulations on the above-mentioned learning outcomes compare with video-viewing (where students watch videos of the respective simulations)? (Media comparison)
Students participating in this course learned about three topics in environmental biology: photosynthesis, biodiversity, and food webs. For each topic, a student interacted with either an iVR simulation (iVR condition) or a pre-recorded video of the respective simulation (video condition). In addition, all students attended common lectures, field trips, and laboratory work related to each topic. Through a pre-test, we collected data on students' prior knowledge of all three topics (i.e. established a knowledge baseline), as well as self-report affective measures related to students' engagement with those topics (e.g. self-efficacy, enjoyment). To understand possible changes in knowledge-related and affective learning outcomes over time, post-intervention data were collected as follows: one post-test immediately after each intervention (i.e. three topic-specific post-tests in total: 1, 2 and 3); a final post-test on all three topics on the 5th day of the course (final post-test); and a follow-up post-test (same as the pre-test and the final post-test) approximately 2 months after the last intervention (longitudinal post-test).

Pedagogical rationale behind the study
One major pedagogical motivation behind our iVR intervention, as well as the study, was the aim to facilitate students' experiential learning of the relationships between experimental procedures and the scientific concepts involved in those procedures (Kolb 1984; Schott and Marshall 2018). University STEM students have often been found to struggle with a mismatch between their theoretical knowledge about a topic (e.g. photosynthesis and the subcellular locations of the different reactions occurring during photosynthesis) and the practical or procedural knowledge related to that topic (e.g. extraction of photosynthetic pigments from algae; personal observation; OECD 2018). Due to limited time and resources, it is difficult for most university teachers to support students by, for instance, providing them more than one opportunity to perform experiments and experience the procedures and the theoretical constructs behind them. The pedagogical interventions based on multiple iVR simulations tested in this study were conceptualised to help undergraduate students develop a sense of the connections between the theoretical material presented in the environmental biology course and the practical, hands-on elements of that course (e.g. how the theory behind sampling or analysis techniques connects with the choice of a specific sampling technique in the field). We hoped that the three simulations, as well as their videos, could help students experience the experiments related to the respective topics in environmental biology before they proceeded to perform them in the real laboratory.
In the next section, we discuss the study in detail in terms of the pedagogical and research-related steps taken.

Sample
Twenty-eight fourth-semester undergraduate students (14 female) from a major university in Denmark participated in a 6-day residential environmental biology field-course at a pre-planned off-campus site. Half the students were randomly assigned to the iVR condition (eight female), while the other half were given the video condition (six female).
The study was conducted according to the Declaration of Helsinki, and written consent was gathered from all participants prior to the study, in accordance with the European Union General Data Protection Regulation (GDPR) guidelines, as well as the university's local regulations.

Materials

iVR simulations and videos
The iVR simulations and their respective videos for the three topics in environmental biology were provided by the company Labster. The simulations used in the study were: Pigment Extraction (Labster 2019a), Food Web Structure (Labster 2019b), and Biodiversity (Labster 2019c). Figure 1 shows snapshots of some important moments from each simulation. For a detailed description of each simulation, please refer to Appendix 1. Each student in the iVR group was provided with a Lenovo Mirage Solo HMD, which ran the respective Labster simulation through the Labster applet supported by the Google Daydream platform. Each simulation covered different types of activities related to the respective topic (e.g. laboratory work, fieldwork, reading of theory in text, problem-solving). Though reviewing the designs of interactive iVR learning environments is beyond the scope of this paper for practical reasons, it is important to comment briefly on the general design of the three simulations used in this study, primarily because this has implications for our experimental choices (e.g. the comparison condition). Firstly, the learning activities in the simulations are linear and their sequence is predetermined: a student is first placed in a context (usually in the form of a story, e.g. encountering complex life forms while visiting an exoplanet), and is then posed a task, which requires the student to visit a laboratory (e.g. sampling and analysing biodiversity). Inside the simulation, the student is orally instructed by a drone about each step the student should take in order to proceed through the simulation (e.g. wear a lab-coat, start or run equipment by clicking a button, teleport to another desk, point-and-click on a device to use it, read theory on the pad if necessary). The instruction is also delivered in text through a virtual text-pad.
Secondly, the interactivity in these simulation designs is minimal: all the student can do is turn their head around to see elements in the 360-degree environment and point-and-click to activate an element (in comparison, many other iVR learning simulations use full-body or gesture-based interaction with or without haptic feedback). The point-and-click interaction happens through a laser pointer-like controller; it is not hugely different from the mouse-based interactions one may have in a desktop environment. This is primarily why, in the media-comparison experiment, we did not choose to compare the iVR condition with a desktop condition. A comprehensive screen recording (video + audio) of each simulation is made available professionally by Labster. Each video is approximately 30 min long (roughly equivalent to the average time taken to complete the respective simulation). In terms of content coverage, the videos are identical to their respective simulations. As the sequence of events in most Labster simulations is fixed, the video recordings do not differ much from the simulations in content presentation.
Each student watched the respective video on a laptop screen and listened to the audio through headphones. The student could play, pause, review or forward the video at will.
Considering all the design aspects of the simulations (and the hardware facilitating them), the main difference between the two conditions was that the iVR students experienced immersion and a higher sense of agency/control.

Data gathering tools (tests and questionnaires)
The knowledge-related pre- and post-tests consisted of nine multiple-choice questions for each of the three topics/interventions (Appendix 2).

Experimental protocol
The quasi-experiment began on the 1st day with briefing the students about the details of the course, the teachers involved, and the general schedule of events. The students were also briefed about the schedule of the experimental elements of the course (e.g. pre-test >>> division into treatment groups >>> interaction with the respective virtual learning tool, and so on). Each student then read and signed a consent form. Through the consent form, each student was also randomly assigned a code that indicated their treatment condition (iVR or video); each student would use this code throughout the study while responding to the data-collection tools. After that, the students were administered a pre-test comprising knowledge-related questions on all three topics (i.e. 27 multiple-choice questions in total), followed by the affective questionnaire on intrinsic motivation and self-efficacy. On completion of the pre-test, the iVR group was taken to another room equipped with ready-to-use HMDs (with earphones), one for each student. The iVR group received a brief pre-training on how to use the equipment (e.g. wearing the HMD, functions of the controller buttons). As the iVR group interacted with the 1st simulation (Pigment Extraction) in this room, the students in the video group watched its respective video on a laptop. The video was made available via a link. Two researchers in each room monitored the students. Each student, on completion of the iVR or video intervention, answered post-test 1, containing questions on pigment extraction, and affective measures (intrinsic motivation, self-efficacy, perceived learning and enjoyment). A similar protocol was followed on the 2nd and the 4th days in relation to the food webs and biodiversity topics/simulations (and the corresponding post-tests 2 and 3), respectively, except that there was no pre-test on these days.
In addition, the students also took part in common lectures, field trips and laboratory work related to the respective topic on that day. On the 5th day, students took the final post-test (all the 27 questions related to the three topics combined).
Finally, the students were administered the longitudinal post-test, approximately 2 months after the course, to measure if there were any long-term effects of the interventions.
A close collaboration with the course teacher (last author) ensured a smooth and cohesive integration of the experimental material and protocol with the regular course activities.

Analysis
Due to a technical failure during the final (biodiversity) iVR simulation, all the data related to the third intervention/topic were omitted from the analysis. One student from the iVR group did not participate in the study after the first intervention due to cyber-sickness, and three students from the video group failed to participate in at least one of the interventions. Hence, our effective sample for the data analysis included 24 participants (13 in the iVR group and 11 in the video group). Further, the longitudinal post-test unfortunately recorded a low response rate, with only eight iVR participants and nine video participants responding to all the questions.
We were interested in understanding if and how interaction with multiple iVR simulations changes the cognitive and affective aspects of student learning over time, in comparison to watching videos of those simulations. Given our limited effective sample size, our use of statistics to report the results is for descriptive purposes only; we do not present any significance-related claims.
The graphical data presentations discussed in the results section were made using GraphPad Prism 8.3.1 (San Diego, USA). Data on knowledge trends are presented as means ± standard deviation (SD). Data on affective endpoints (self-efficacy, intrinsic motivation, perceived learning and enjoyment) are reported on a 7-point Likert scale. Frequency is presented as a cumulative percentage of students reporting the individual score, calculated as: cumulative percent of students giving score x = (frequency of students reporting score x / total number of students) × 100%, where x is a Likert score from 1 to 7.
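The calculation above can be sketched in code. The following is an illustrative Python sketch (not the authors' analysis code, which used GraphPad Prism, and the ratings shown are hypothetical): it computes, from a list of Likert ratings, the percentage of students reporting each score and the cumulative percentage reporting a given score or above, as in the 'ratings of 5 and above' proportions reported in the results.

```python
# Illustrative sketch: per-score and cumulative percentages of Likert ratings.
from collections import Counter

def likert_percentages(ratings, scale=range(1, 8)):
    """Return (percent, cumulative): percent of respondents per Likert score,
    and cumulative percent of respondents at or above each score."""
    counts = Counter(ratings)
    n = len(ratings)
    percent = {x: 100 * counts.get(x, 0) / n for x in scale}
    cumulative = {x: sum(percent[y] for y in scale if y >= x) for x in scale}
    return percent, cumulative

# Hypothetical enjoyment ratings for a group of 8 students
percent, cumulative = likert_percentages([7, 5, 4, 6, 3, 5, 2, 7])
# cumulative[5] gives the share of students rating 5 or above
```

By convention here, 'cumulative' is taken from the top of the scale downwards, matching the 'score x and above' proportions in the results; summing from the bottom instead would only require flipping the comparison in the dictionary comprehension.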

Theoretical expectations
Based on the literature, particularly in relation to the novelty effect (e.g. Chen et al. 2016; Moos and Marroquin 2010), the following trends were expected in knowledge-related and affective outcomes among our participants over time.

Knowledge trends
We expected both within- and between-group differences to emerge as the study progressed.

Within-group trends through the checkpoints
In comparison to their respective baseline scores, the scores for both groups were expected to improve in post-test 1, perhaps due to sensitisation, priming and/or learning (Lana 2009; Willson and Putnam 1982). The scores for the iVR group were then expected to improve further through to the final post-test, but to drop for the video group after post-test 1 due to the lack of interaction in the video-viewing process (Ryan and Deci 2000). Finally, the scores in the longitudinal post-test were expected to remain unchanged for the iVR group but to drop for the video group (this is clarified further in the 'Between-group differences' section).

Between-group differences
The gain after the first intervention was expected to be similar for both groups (possibly due to priming and/or learning). However, after the second intervention, the scores for the iVR group were expected to remain higher than those of the video group, and increasingly so, as the novelty effect among iVR students, if any, would fade away due to increasing familiarity with the immersive environment (Chen et al. 2016). In parallel, the video group would gradually lose interest due to a lack of sense of agency in learning (Ryan and Deci 2000).

Affective trends
The affective expectations are also discussed as within- as well as between-group comparisons.

Within-group trends through the checkpoints
A gain in the affective outcomes was expected for both groups, attaining a peak right after the first intervention (i.e. in post-test 1). We expected a drop in student ratings across all the affective measurement scales after the second intervention (i.e. in post-test 2) for both groups.
Hidi and Renninger's four-phase model of interest development seems particularly relevant in this context (Hidi and Renninger 2006; Renninger and Hidi 2015). It predicts that students' interest is initially triggered through interaction by extrinsic elements in learning environments (in this case, video or iVR). With continued exposure to the triggering environments, students enter a second phase of maintained situational interest, which may lead to an emergence of individual interest (phase 3) and/or well-developed individual interest (phase 4). While examining student interest was not an explicit aim of this study, we use Schiefele's definition of interest as content-specific intrinsic motivation (Schiefele 1991). Further, considering the strong links with intrinsic motivation, similar trends were expected for self-efficacy, perceived learning and enjoyment for both groups (Bandura 1997; Wigfield and Eccles 2002).

Between-group comparison
Due to the novelty effect, gains in the affective measures for the iVR group were expected to be larger in magnitude than those for the video group after the first intervention. We did not expect students to experience the novelty effect while watching videos of simulations.
Further, a bigger drop in student ratings across all the affective measures was expected for the video group, as these students would lack an experience or sense of agency (e.g. making choices and experiencing relatedness) as opposed to students in the iVR group (Ryan and Deci 2000).

Results

Knowledge trends
In terms of the cognitive aspects (i.e. knowledge performance), the pre-test served as a proxy for students' baseline understanding of the topics and ensured that pre-existing knowledge did not influence the results of the study. The video group's baseline knowledge performance (mean percent = 70.2% ± 16.7%) was higher than that of the iVR group (61.1% ± 16.3%). Figure 2 shows performance trends in the knowledge-related tests captured at the different assessment checkpoints during the study. As expected for the knowledge-related within-group trend, the scores for both groups improved after the first intervention (as shown by the difference between pre-test and post-test 1 scores). However, the knowledge gain immediately after the first intervention was higher for the iVR group (percent means: post-test 1 = 77.8% ± 13.6%, post-test 2 = 72.7% ± 19.0%, final post-test = 74.3% ± 13.9%) than for the video group (percent means: post-test 1 = 76.8% ± 11.6%, post-test 2 = 72.7% ± 16.7%, final post-test = 68.7% ± 18.1%), for which the interventions had a negligible effect on test performance. Scores for the iVR group, on the other hand, remained stable after the first intervention, maintaining the overall gain relative to the baseline performance. Further, performance in the longitudinal post-test (deployed 2 months after the intervention) suggested a long-term retention of this knowledge gain on average by participants in the iVR group (mean score = 81.2% ± 23.7%). In comparison, the longitudinal post-test scores for the video group revealed no conclusive pattern of performance (mean score = 76.4% ± 14.5%). In summary, there was an increase in knowledge performance for both groups, especially for the iVR group (nearly 30% increase for the iVR group vs. less than 10% for the video group).

Figure 2. Performance trends in the multiple-choice questions related to topic knowledge (mean percent accuracy). There was an increase in knowledge retention for both groups; however, the increase for the iVR group (about 30%) was larger than that for the video group (less than 10%).
In summary, self-efficacy remained high for both groups throughout the study, with a slight but comparable increase for both groups in the proportion of students scoring 7 on the Likert scale.

Figure 4. Trends in self-efficacy (in cumulative percent of participants choosing a particular rating on the 7-point Likert scale). Self-efficacy remained high for both groups throughout the study, with a slight increase in the proportion of students scoring 7 on the Likert scale.

Perceived learning
The average perception of having learnt through the intervention increased considerably and steadily among students in both groups as the course progressed (means iVR: post-test 1 = 4.7 ± 1.6, post-test 2 = 5.2 ± 1.4, final post-test = 6.0 ± 1.0).
In summary, perceived learning was similar at the beginning and at the end of the study (longitudinally). However, there were changes during the study, with an increase in perceived learning immediately after the interventions (final post-test) for both groups (i.e. about a 50% increase in the number of students scoring 7 on the Likert scale).

Enjoyment
Finally, while the change in enjoyment of the respective intervention was negligible for both groups over time, students in the iVR group enjoyed the interventions more (means: post-test 1 = 4.7 ± 1.9, post-test 2 = 4.2 ± 1.5, final post-test = 4.1 ± 1.9, longitudinal post-test = 4.5 ± 2.1). In contrast, students in the video group largely reported boredom, as indicated by their overall below-4 ratings (means: post-test 1 = 3.8 ± 1.6, post-test 2 = 3.4 ± 1.5, final post-test = 3.2 ± 1.5, longitudinal post-test = 4.6 ± 2.5). Interestingly, the number of students scoring 7 on the Likert scale increased dramatically for the video group in the final long-term test. However, the proportion of students reporting high enjoyment (ratings of 5 and above on the 7-point Likert scale) often remained 50% or less for the iVR group (Figure 6), including in the longitudinal post-test (iVR: post-test 1 = 61%, post-test 2 = 49%, final post-test = 49%, longitudinal post-test = 52%). For the video group, the proportion of students reporting ratings of 5 and above decreased (post-test 1 = 49%, post-test 2 = 34%, final post-test = 25%, longitudinal post-test = 55%); inversely, the proportion of students reporting boredom (i.e. ratings of 4 or less) increased considerably through the different checkpoints during the course before dropping again after the 2-month delay (video: post-test 1 = 51%, post-test 2 = 66%, final post-test = 75%, longitudinal post-test = 45%). In summary, though there was a considerable increase in the number of students in the video group scoring 7 on the Likert scale, the mean score remained largely unchanged throughout the study for both groups.

Summary and discussion
The primary objective of this study was to investigate if and how interaction with multiple iVR field and laboratory simulations, and their video playbacks, affects student learning over time. Overall trends in student performance on the multiple-choice knowledge tests indicate that multiple exposures to iVR simulations result in a considerable gain in topic knowledge over time, compared to watching videos of those simulations. Student performance in the longitudinal post-test indicates that interacting with iVR simulations is more effective in supporting long-term retention of knowledge (Figure 1). This is among the first pieces of evidence indicating positive longitudinal knowledge-related learning outcomes of multiple exposures to iVR (Mikropoulos and Natsis 2011; Southgate 2020). These results are consistent with much of the cross-sectional and longitudinal work investigating how virtual interventions (e.g. virtual simulations, games, e-learning) affect learning in higher education settings over time (Gaupp, Drazic, and Körner 2019; Hanus and Fox 2015; Madani et al. 2014, 2016).

Although the affective results do not show any conclusive cross-sectional or longitudinal between-group differences for intrinsic motivation and self-efficacy, students in both groups were highly motivated and self-confident about working on the topics (larger proportions of overall ratings above 5 on a 7-point Likert scale; Figures 2 and 3). This indicates that both interventions were successful in triggering student interest (Makransky et al. 2020). The gradual increase in the proportion of students in both groups who chose higher ratings on the respective scales further indicates that multiple exposures to the respective interventions helped maintain this interest among students over time (Hidi and Renninger 2006; Renninger and Hidi 2015; Wigfield and Eccles 2002).
Concerning perceived learning, the proportion of students who expressed a positive feeling about learning the topics through the simulations increased during the course for both groups. However, there was a drop in this proportion longitudinally (Figure 4), suggesting that the perceived learning benefits of both interventions faded eventually. This result is easily explained for the video group, where students experienced little control over (and consequently, little active involvement in) the various aspects of the intervention (e.g. due to passive viewing). These students were more likely to reflect on the video-viewing process as a not-so-beneficial activity (Lee, Wong, and Fung 2010; Makransky and Lilleholt 2018). However, it is not clear why even students in the iVR group did not perceive the intervention to be beneficial in the long term. It is possible that their responses to this scale were negatively influenced (knowingly or unknowingly) by the perceived difficulty of the knowledge-related questions in the longitudinal test (e.g. cognitive benefit; Makransky and Lilleholt 2018). In other words, if students perceived the knowledge-related questions to be more difficult in the longitudinal test than during the course due to the delay in testing, they were more likely to choose low ratings on the perceived learning scale in the longitudinal test. In every test, the affective questionnaires were presented after the knowledge-related multiple-choice questions. Students thus did have some opportunity to reflect on their performance on the knowledge test before recording their affective responses. A decline in the feeling of having experienced a novel intervention (novelty effect) over time could also explain why there was a longitudinal drop in the perceived benefits of iVR instruction (Tokunaga 2013).
Interestingly, the data trends on the affective outcomes did not provide any direct evidence of the novelty effect among students in the iVR group (we neither expected nor found this effect for the video intervention). This could be because students in our iVR group were relatively familiar with and/or accustomed to immersive technologies (e.g. through gaming, museum visits or previous exposure in other courses; Renninger and Hidi 2015). However, given that immersive interfaces are becoming increasingly accessible due to their low cost and availability for individual use in different forms (Radianti et al. 2020), future studies may not need to consider the novelty factor, depending on the population/sample demographics.
Further, iVR was relatively more enjoyable than the (non-interactive/passive) video-viewing intervention, which was rated as 'boring' by an increasing proportion of participants during the course (Bailenson et al. 2008). However, even for iVR, most students rated their enjoyment of the immersive experience below 5 on a 7-point Likert scale throughout the different checkpoints (Figure 5). This indicates that they only marginally enjoyed the iVR simulations and/or did not like some aspects of them.
Lessons from educational technology design studies strongly suggest that various cognitive and affective learning outcomes are determined by, or at least depend largely upon, the nature of iVR interface design (Pekrun 2006). There is great diversity in the designs of iVR learning environments, as well as in the design principles and theories inspiring them (Wu, Yu, and Gu 2020). While the iVR simulations used in our study are interactive, a user experiences, for instance, limited control over the interface elements and the sequence of interaction with those elements. These simulations have a pre-determined and strict sequence of events (e.g. a science laboratory/experimental protocol) over which a user has no control. Moreover, navigation in the simulations is based on a system utilising only three degrees-of-freedom. This is actualised as a mouse-like point-and-click interaction in a rather detailed and immersive virtual environment. These aspects could explain the observed affective outcome trends for the iVR group in our study. Several structural equation modelling studies have suggested that the extent to which a learner experiences control over various elements of an interface predicts some important affective learning outcomes, such as perceived learning and enjoyment (Pekrun and Stephens 2010; Plass and Kaplan 2016).

Conclusion and implications
Our study presents promising results on the cross-sectional and longitudinal effects of multiple exposures to different iVR simulations for learning different topics in environmental biology. The study also indicated positive effects of multiple iVR interventions on perceived learning and enjoyment. These learning effects of iVR were presented in comparison with the learning effects of desktop-based video playbacks. The strengths of our study and overall methodological approach are: (1) the ecological implementation of the experiment, involving a teacher-supported systematic integration of multiple iVR simulations and their videos in a regular field course without disrupting other course elements (Merchant et al. 2014), and (2) the longitudinal nature of the design, comprising multiple immediate as well as delayed data-collection checkpoints.
Incorporating these strong dimensions in our approach, however, acted as a double-edged sword. For instance, to maintain ecological validity and fidelity of implementation, it was necessary to sacrifice experimental control during the study (e.g. control for diffusion of treatments). Further, the use of multiple data-collection checkpoints, and our inability to capture if and how this influenced the data, also turned out to be a limitation of our study. Multiple testing checkpoints involved asking students the same questions several times. While verbal data on student experiences were not collected in this study, exhaustion and boredom were observed among many students, who also felt increasingly 'annoyed', particularly during the later tests (i.e. post-tests deployed towards the end of the course). In addition, some students perceived the integration of technology and the accompanying multiple tests in the course pedagogy as 'additional work', together indicating a growing disengagement among students from the interventions over time.
Though such cross-sectional/longitudinal experimental designs have a clear advantage in capturing micro as well as long-term changes in student cognitive outcomes, future studies are strongly recommended to consider alternative ways of capturing/assessing these changes. It may help to advise students/participants in such studies that repeated testing might increase their understanding of the subject matter and eventually improve their overall performance in formal university assessments. Finally, for future ecological studies as well as iVR-integrated pedagogies to be more successful, advance orientation sessions for students may be needed to better prepare them for such technology-enriched experiences alongside other pedagogical elements.
Nevertheless, our study is one of the first and few works investigating the longitudinal effects of using iVR simulations on various cognitive and affective aspects of learning compared to watching pre-recorded videos of those simulations.
Despite our low sample size and weak but indicative results on longitudinal learning-outcome trends, our study demonstrates the necessity of such investigative approaches to better understand if and how innovative iVR simulations for higher education could be designed and implemented in practice. To better test the novelty-effect hypothesis, it may help to include the Focused Attention and Presence scale in future investigations.

Competing interests
All the authors declare that they have no competing interests with regard to this manuscript.

Notes on contributors
All the authors (alphabetically arranged initials -AES, AT, BM, MEM, PMJ and PP) collectively designed and conducted the experiment at the site. PMJ was the coordinating teacher for the environmental biology course, where the experiment was conducted. AES, AT, BM, MEM and PP carried out data analysis independently as well as collaboratively. PP wrote the manuscript. All the other authors critically reviewed and edited the manuscript. PP co-supervised the study.