OATutor-GenAI
- Other, including part of a larger organization (please explain below; may include individuals or small teams affiliated with a university)
We are the OATutor team, which is part of the Computational Approaches to Human Learning (CAHL) lab at UC Berkeley (School of Education). In addition to our core OATutor team in the lab (which consists of the equivalent of three full-time employees (FTE)), we benefit from the shared personnel resources of the university, which adds valuable expertise and support to our efforts.
Open Adaptive Tutor (OATutor) is the first full-featured, open-source adaptive tutoring system based on Intelligent Tutoring System (ITS) principles. It combines its codebase (MIT license) with five full-semester textbooks' worth of adaptive learning material made available under a CC BY 4.0 license.
Being the first such open project, OATutor affords the research community an unprecedented opportunity for iterative improvement through learning engineering, and for learning sciences advancement through nimble experimentation and exact replication of peer study settings. In this project, we propose innovations to the tutor content authoring process that have the potential to radically decrease the time and cost of creating ITS-like adaptive tutors. We chose to build upon the OATutor project because it provides an open-source base for adaptive tutoring and fits our goal of open-sourcing all materials produced by this effort for reuse by institutions and others in the research community. We will create a generative AI version of this system and evaluate it against the non-generative version in higher education environments, while ensuring that Pre-K-8 learners and their educators can adapt the system to their needs. Deployment in these settings will help us fine-tune OATutor and report on the learner feedback configurations that were most effective. The mechanism by which the base system has led to learning is its use of mastery learning to constantly reassess and remediate along granular skill dimensions. The modifications proposed under this award would increase the system's efficacy by using generative AI to tailor both items and help to each student's unique academic experiences. While we will evaluate the efficacy of the modifications on standard STEM syllabi used in community college settings, the ability to create a tutor in an arbitrary STEM area will also be of particular utility to Pre-K-8 learners, educators, and others who need flexibility in educational settings.
Through our open-sourcing and Creative Commons licensing of all materials produced by this effort, we believe we can drive down the cost of ITS systems from $30 per student to a level close to $0 (all of our offerings will be free and open-source). Our partners in California and New York, as well as any other institution, can use, operate, and own the tool after the study. Thus, industry, schools, and the academic research community can also contribute to, fork, or modify our project to converge on more effective tutoring paradigms, iteratively improving over time, with the engineering behind the project made available to future MIT Solve teams. We will ensure LTI capabilities, integration with a wide variety of LMSs, and easy reproducibility without the need for extensive technical understanding. Currently, there is no codebase like OATutor through which best practices of learning engineering are shared.
Current tutoring systems are largely static in content and learning objective orientation, taking substantial effort to pivot to different topics, contexts, learners, and teaching approaches. Additionally, schools often serve large numbers of students with limited per-student resources, including teaching staff. As a result, teachers do not have enough time to effectively redesign teaching and learning tasks and to provide timely, personalized feedback for each student, which is critical to student academic success (Maier & Klotz, 2022). To assist teachers in this task, and to improve students' conditions for learning, the use of OATutor and the implications of our A/B testing experiments will allow for: 1) broader access and dissemination of course learning materials, and 2) the provisioning of immediate feedback to students. The completely adaptive tutoring system intervention developed under this grant will be made available for researchers and practitioners to experiment and innovate with individual components of the system (e.g., user interface, content modification) without the burden of needing to design everything from scratch (e.g., content management, adaptive item selection, and LMS integration; Pardos et al., 2023).
Our endeavor is critical for several reasons. First, it provides the opportunity for students to fill knowledge gaps in their key subjects. Second, and more importantly, it can be seen as an instrument to promote equity. Simply put, we will be able to understand which feedback approaches are most effective for student performance and eventually provide students who have not had the opportunity to get proficient in necessary topics (due to socioeconomic background, regional differences in schools’ STEM educational practices, and lack of role models) with targeted feedback techniques. While our partnerships and collaborations have focused on college students in STEM education, the platform's flexible design also allows Pre-K-8 educators to adapt the system to their specific needs.
In the long run, the project's results will make STEM education accessible to more students from various socioeconomic backgrounds, thus promoting diversity, inclusion, and access in STEM education. The successful integration of OATutor will also reduce the cost of teaching certain programs and improve learning performance. The findings will also generalize: the adaptation techniques can be applied to other education contexts offering similar programs and courses, as well as to the school context. The results will further show how the emerging affordances of generative AI can be effectively harnessed for STEM educational transformation. The integration of OATutor through our A/B tests has the potential to enable future delivery at scale, creating a positive and lasting impact on student academic success in STEM education. All extensions of OATutor will be made available under a permissive open-source license. The openness and transparency of this project, we believe, will allow the impact of the work to be amplified by follow-up efforts pursued with and independent of the PI.
Our OATutor system is the only ITS that has been open-sourced and is the first system to evaluate the efficacy of an initial, basic form of GenAI hints (i.e., worked solutions; Pardos & Bhandari, 2023) using gold-standard pre-post test learning gains, as well as GenAI questions (Bhandari, Liu, & Pardos, 2023). In 2023, our educational experiments and learning engineering work were published at high-quality academic venues including NeurIPS, Science, The Internet and Higher Education, and CHI. The PI of this project (https://bse.berkeley.edu/zachary-pardos) has 17 years of intelligent tutoring systems research experience, including a postdoc at MIT CSAIL studying edX MOOCs in 2013 following a PhD in CS from WPI. The UC Berkeley proposal team consists of the OATutor lead developer (EECS undergraduate), three Education PhDs in tutor pedagogy, HCI, and psychometric measurement, an Education MA in Educational Data Science, and an EECS undergraduate research developer. We have shown a commitment to transparency with our application of an MIT License to the codebase and a Creative Commons BY 4.0 license applied to all content produced in association with the project, found at OATutor.io.
- Providing continuous feedback that is more personalized to learners and teachers, while highlighting both strengths and areas for growth based on individual learner profiles
- Other
- Grades Pre-Kindergarten-Kindergarten - ages 3-6
- Grades 1-2 - ages 6-8
- Grades 3-5 - ages 8-11
- Grades 6-8 - ages 11-14
- Other
- Pilot
We selected “Pilot” because the base version of OATutor is built, rigorously tested, and licensed under CC and MIT licenses. It's the first system to assess the efficacy of generative AI hints using gold-standard pre- and post-test learning gains, with foundational work on GenAI hints (Pardos & Bhandari, 2023) and questions (Bhandari, Liu, & Pardos, 2023). OATutor currently supports 4,000 active college students. We already have commitments from several outside teams (in California and New York) to contribute to OATutor and incorporate the GenAI research innovations proposed here.
- United States
- Yes
The base version of OATutor currently supports 4,000 active college students (across the US and internationally). OATutor-GenAI is in progress with certain aspects already published at high quality academic venues. For example, we evaluated the efficacy of an initial, basic form of GenAI hints (i.e., worked solutions; Pardos & Bhandari, 2023) using gold standard pre-post test learning gains and GenAI questions (Bhandari, Liu, & Pardos, 2023). We already have commitments from several outside teams (in California and New York) to contribute to OATutor and incorporate the GenAI research innovations proposed here.
Our solution is innovative in its integration of generative AI with intelligent tutoring systems (ITS) to address the high cost and time involved in creating quality assessment items and tutorial content. Traditional ITS projects, such as ASSISTments and the base version of OATutor, often require extensive training for authors and significant time investments for problem creation. By leveraging generative AI, we aim to significantly reduce these barriers while also personalizing content to learners.
In our OATutor project, we emphasize a collaborative design approach involving stakeholders at every stage. Our methodology, detailed in the CHI 2023 publication (Pardos et al., 2023), incorporates insights from a broad range of participants, including content authors, students, and researchers, all of whom contribute to the iterative design process. We plan to continue this stakeholder-oriented ethos in our enhancements to the system, incorporating generative AI. Teachers will play a pivotal role in providing feedback during the content production process, ensuring the system effectively meets their needs.
In addition, we believe a strong distinguishing characteristic of our system is that it aligns with open-source values, which our team has instilled since we began development of the base tutor four years ago. Our solution is also distinguished from products like Khanmigo in that 1) we are open-source and free, 2) our approach to implementing a tutor, including prompting, is completely transparent, 3) our system gives teachers control over the tutor, including its tone, tenor, and subject matter, 4) our tutor communicates students' progress out to the learning management system for the teacher to see, and 5) we conduct rigorous learning gain evaluations of our tutor at various stages of development to measure whether the GenAI augmentations are contributing to any degradation in learning compared to the human-produced tutor control.
The iterative enhancement of OATutor-GenAI will be guided by data collected at various stages of the project. The first stage involves refining our prompt engineering for item generation based on psychometric evaluation. In the second stage, we will leverage crowdsourcing platforms to quickly evaluate learning effects in various personalized learning configurations. Finally, we will collect learning gain data from classroom implementations, using A/B testing to determine which system variants perform best. This data-driven iterative process will allow us to optimize OATutor-GenAI's performance effectively. While the value of shared datasets has become well understood, the value of shared code is less appreciated. Shared data allows many groups to benefit from the labor of a single group or organization; shared code allows for the same. We believe the value proposition of sharing 100% of the best practices learned in this process will "force multiply" future efforts to an even greater extent than the sharing of data.
Our solution harnesses the power of AI, specifically Natural Language Processing (NLP) and Machine Learning (ML). While the base version of OATutor uses Knowledge Tracing, which applies a Hidden Markov Model to estimate student mastery of specific skills, the following approaches will be added to the current version of OATutor.io to create OATutor-GenAI:
CHATGPT FEEDBACK-GENERATION: Our proposed solution involves utilizing ChatGPT to automate the generation of hints for any subject. Nascent research out of our lab, Computational Approaches to Human Learning, has begun to tackle this thread, starting with work that suggests that LLM-based hint generation in math is already on par with human-generated hints and questions (Pardos & Bhandari, 2023). We have laid the foundation for the community to partake in this work by implementing a dynamic hinting framework within OATutor, by which an API endpoint can be specified to query an LLM using contextual information (e.g., student answer, question text, prompt template) passed to the API by OATutor. We have integrated ChatGPT to understand problem contexts and generate contextually relevant hints.
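As an illustration, the contextual information passed through the dynamic hinting framework could be assembled into an LLM prompt roughly as follows (a minimal sketch; the template wording and function name are ours, not OATutor's actual implementation):

```python
# Illustrative template for a tutoring hint request; the exact wording a
# teacher configures in OATutor could differ.
HINT_TEMPLATE = (
    "You are a math tutor. The student was asked: {question}\n"
    "The student answered: {answer}\n"
    "Give one short hint toward the next step without revealing the solution."
)

def build_hint_prompt(question_text: str, student_answer: str,
                      template: str = HINT_TEMPLATE) -> str:
    """Fill the teacher-configurable template with the contextual
    information (question text, student answer) that the tutor would
    pass to a specified LLM API endpoint."""
    return template.format(question=question_text, answer=student_answer)
```

The resulting string would then be sent to whatever endpoint is configured, keeping the prompting logic transparent and swappable.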
CHATGPT QUESTION-GENERATION: High-quality question authoring is particularly labor intensive. The process of authoring items for a summative test involves several stages: testing the items with a sample of respondents, measuring their psychometric properties, refining them, and re-testing until an item pool with sufficient psychometric properties is reached. Using Item Response Theory (IRT), we have found ChatGPT-generated algebra questions to be on par with or better than OpenStax (an open-source textbook) items, with higher discriminating power (Bhandari, Liu, & Pardos, 2023). We will continue research with ChatGPT to create questions, also analyzed through an IRT-based experiment comparing the psychometric properties of ChatGPT-generated items with gold-standard textbook items.
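For reference, item discrimination in this kind of IRT comparison is commonly modeled with a two-parameter logistic (2PL) item response function; a minimal sketch of that standard model follows (our own illustration, not the analysis code from the cited study):

```python
import math

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """2PL IRT model: probability that a student with ability theta
    answers correctly an item with discrimination a and difficulty b.
    Higher a means the item separates students near ability b more sharply,
    which is the 'discriminating power' compared across item sources."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

Under this model, an item with higher discrimination yields a steeper response curve around its difficulty, making it more informative about whether a student's ability sits above or below that point.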
For ChatGPT feedback and question generation, we will develop both teacher and student personalized hints/questions. Teacher personalized hints/questions are ChatGPT-generated hints/questions personalized to a description of the class and preferred teaching style expressed by the teacher. Student personalized hints will have students input information they feel is important for their learning and embed this biography in our ChatGPT prompting for question and hint generation. We will also experiment with personalizing to a student’s transcript data (if available).
CHATGPT SKILL-TAGGING: We will use ChatGPT to tag each question to skills from an educational taxonomy (e.g., US Common Core), fine-tuning the model using few-shot examples. We also intend to evaluate its ability to tag content in different languages and work on approaches to optimizing this multilingual skill tagging performance. Performance on this sub-task will be evaluated using cross-validation of existing problem text and skill label datasets in English, Korean, and Japanese.
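A few-shot tagging prompt of the kind described might be assembled as follows (a sketch; the example questions, skill codes, and function name are illustrative, not our production prompt):

```python
# Hypothetical few-shot exemplars pairing question text with a
# Common Core-style skill code.
FEW_SHOT_EXAMPLES = [
    ("Solve for x: 3x + 5 = 20", "CCSS.MATH.CONTENT.7.EE.B.4"),
    ("Find the slope of the line through (1, 2) and (3, 6)",
     "CCSS.MATH.CONTENT.8.EE.B.6"),
]

def build_tagging_prompt(question_text, examples=FEW_SHOT_EXAMPLES):
    """Assemble a few-shot prompt asking an LLM to map a new question
    to a single skill code from the taxonomy."""
    lines = ["Tag each question with one US Common Core skill code."]
    for q, tag in examples:
        lines.append(f"Question: {q}\nSkill: {tag}")
    lines.append(f"Question: {question_text}\nSkill:")
    return "\n".join(lines)
```

The same scaffold generalizes to multilingual tagging by swapping in exemplars written in the target language while keeping the skill codes fixed.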
ML-ASSISTED CONTENT REVIEW: Our solution employs ML algorithms to assist in reviewing educational content, ensuring high quality and relevance.
Finally, we will compare the learning gain of the completely GenAI-produced tutor to the human expert-produced ITS to establish the current SOTA in fully GenAI tutoring performance. All materials will be open-sourced, allowing for further advancement of this SOTA by other research groups and organizations.
Nascent research out of our lab shows that LLM-based hint generation in math is already on par with human-generated hints (Pardos & Bhandari, 2023). Further, our experiments using Item Response Theory (IRT) revealed that ChatGPT-generated algebra questions performed as well as, or better than, open-source textbook items from OpenStax in terms of their ability to discern students' understanding (Bhandari, Liu, & Pardos, 2023). These results highlight the potential of AI-driven question generation to enhance assessment quality.
Additionally, recognizing that large language models (LLMs) sometimes produce hallucinations (Shuster et al., 2021), we are developing advanced techniques to mitigate this issue and enable accurate dynamic item and hint generation. Building on our previous success with a self-consistency technique (Wang et al., 2022) that reduced hallucinations in algebra worked solutions to near 0% (Pardos & Bhandari, in press), we will automate this process and explore additional methods detailed in Tonmoy et al. (2024).
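The self-consistency idea can be sketched in a few lines: sample several candidate worked solutions for the same problem, extract each one's final answer, and keep the majority answer together with an agreement rate that can gate whether a solution is shown (a simplified illustration; function names are ours):

```python
from collections import Counter

def self_consistent_answer(sampled_solutions, extract_answer):
    """Self-consistency filter in the spirit of Wang et al. (2022):
    given several independently sampled solutions, return the majority
    final answer and the fraction of samples that agree with it.
    A low agreement rate flags a likely hallucinated solution."""
    answers = [extract_answer(s) for s in sampled_solutions]
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / len(answers)
```

Automating this step would mean sampling N solutions per item at generation time and only surfacing those whose agreement rate clears a chosen threshold.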
We plan to build on this progress by combining these approaches into a cohesive framework. In previous research, we used crowdsourcing to evaluate the effectiveness of AI-generated hints and questions. Now, we will transition to in-classroom deployments for real-world validation.
Our approach to ensuring equity and combating bias in AI centers around continuous research, practical application, and adherence to best practices in ethical AI. Through our forthcoming research, we aim to address these challenges directly, making sure our implementations align with the highest standards of fairness and inclusivity.
Optimizing for Intersectional Fairness:
Our forthcoming work (Mangal & Pardos, in-press) provides a practical guide for implementing equitable and intersectionality-aware ML in education. We intend to incorporate these strategies into our project to ensure that our AI algorithms identify and mitigate biases that could adversely affect different intersectional groups. By understanding and addressing these nuances, we aim to improve the accuracy and fairness of our AI models across diverse populations.
Bridging Language Disparities:
In our other upcoming work (Kwak & Pardos, in-press), we explore skill tagging for multilingual educational content using large language models (LLMs). By leveraging these insights, our project aims to ensure that all users, regardless of their primary language, receive equitable support from the AI system.
Our team will integrate these research findings into the development of OATutor-GenAI. This proactive and research-driven approach to bias mitigation guides our continuous improvement efforts and ensures that our AI system delivers fair and effective educational assistance to all users.
Our solution team comprises the equivalent of three full-time employees (FTE). In addition to our core team, we also benefit from the shared personnel resources of the university, which adds valuable expertise and support to our efforts.
Experimental Goals:
June 2024 - December 2024 - 1) Content curation through ML-assisted review, 2) Planning with collaborators
January 2025 - May 2025 - 1) A/B testing (OATutor vs. OATutor-GenAI), 2) A/B testing analysis
August 2025 - December 2025 - 1) A/B testing (OATutor-GenAI teacher personalized vs learner personalized vs both), 2) A/B testing analysis, 3) Aggregate findings across contexts and evaluate if there was a single most effective variant of OATutor-GenAI across contexts, 4) Hold workshops on using OATutor-GenAI and prep to pilot in classrooms with priority learners (with piloting immediately after).
Research Goals:
June 2024 - May 2025 - 1) ML-assisted review of content, 2) Integrating LLMs into the content curation process, 3) Hallucination mitigation techniques literature review, 4) Implementation of basic hallucination mitigation techniques (self-consistency), 5) Real-time personalized item generation with LLMs
August 2025 - December 2025 - 1) Real-time personalized item generation with LLMs continued, 2) Background literature review on learning sciences (e.g., best ways to prompt teachers and students, ChatGPT prompt engineering), 3) Implementation of advanced hallucination mitigation techniques (based on Tonmoy et al., 2024), 4) Disseminate research results
We’ve already evaluated the efficacy of an initial, basic form of GenAI hints (i.e., worked solutions; Pardos & Bhandari, 2023) using gold-standard pre-post test learning gains, as well as GenAI questions (Bhandari, Liu, & Pardos, 2023). We are also working on using ChatGPT to tag each question to skills from an educational taxonomy (e.g., US Common Core), fine-tuning the model using few-shot examples. We will apply learnings from these research threads to create our solution.
Our researcher user base is international, and with the proposal activities we plan to broaden the system’s impacts by leveraging our international ties and collaborating on proposals for similar deployments internationally. In the past, the OATutor team has introduced the system to researchers through workshops in the field in Japan, Sweden, Denmark, Portugal, China, and the US. We have generated a healthy mailing list from these interactions and will reach out to researchers and engage them in the new GenAI version of the work. Using our single codebase, industry and the academic research community can contribute to, fork, or modify our project to converge on more effective tutoring paradigms. The open-source nature of the project allows for iterative improvement over time.
Besides external researchers, another scalability mechanism involves leveraging relationships with our partner institutions. For example, they will help us identify partner sites as well as promote the use of OATutor-GenAI as an innovative Open Educational Resource to their audiences of school systems. OATutor-GenAI will become one of the resources curated by our partners, who will teach other programs how to adapt it to their use cases.
Financial Support: We require funding to expand OATutor's capabilities in autonomously generating content and personalizing learner assistance. The award would support essential research and development efforts, allowing us to rapidly experiment with different pedagogical features at our partner sites. This financial backing will also enable us to conduct critical research into domain expert-involved prompt engineering and strategies to mitigate LLM hallucinations, ensuring reliable and accurate content generation.
Guidance and Scale: Access to the challenge’s network will amplify our reach by connecting us with new partnerships and markets. With guidance from Solve and the Bill & Melinda Gates Foundation, we aim to establish relationships with educators, policymakers, and institutions. This network will help us navigate market challenges, enhance our solution’s scalability, and maximize our impact on diverse learner populations.
- Financial (e.g. accounting practices, pitching to investors)
- Product / Service Distribution (e.g. collecting/using data, measuring impact)