research-article

Open access

Co-ML: Collaborative Machine Learning Model Building for Developing Dataset Design Practices

Authors:

Tiffany Tseng,

Matt J. Davidson,

Luis Morales-Navarro,

R. Benjamin ShapiroAuthors Info & Claims

ACM Transactions on Computing Education, Volume 24, Issue 2

Article No.: 25, Pages 1 - 37

https://doi.org/10.1145/3641552

Published: 16 April 2024 Publication History

PDF eReader

Abstract

Machine learning (ML) models are fundamentally shaped by data, and building inclusive ML systems requires significant considerations around how to design representative datasets. Yet, few novice-oriented ML modeling tools are designed to foster hands-on learning of dataset design practices, including how to design for data diversity and inspect for data quality.

To this end, we outline a set of four data design practices (DDPs) for designing inclusive ML models and share how we designed a tablet-based application called Co-ML to foster learning of DDPs through a collaborative ML model building experience. With Co-ML, beginners can build image classifiers through a distributed experience where data is synchronized across multiple devices, enabling multiple users to iteratively refine ML datasets in discussion and coordination with their peers.

We deployed Co-ML in a 2-week-long educational AIML Summer Camp, where youth ages 13–18 worked in groups to build custom ML-powered mobile applications. Our analysis reveals how multi-user model building with Co-ML, in the context of student-driven projects created during the summer camp, supported development of DDPs including incorporating data diversity, evaluating model performance, and inspecting for data quality. Additionally, we found that students’ attempts to improve model performance often prioritized learnability over class balance. Through this work, we highlight how the combination of collaboration, model testing interfaces, and student-driven projects can empower learners to actively engage in exploring the role of data in ML systems.

1 Introduction

Machine learning (ML) has dramatically impacted a wide-range of domains including healthcare, entertainment, and communication. As ML systems play an increasing role in our daily lives, there is a growing need for ML education to support people to engage purposefully with ML technologies. Improving ML literacy not only strengthens technical understanding of ML, but also promotes civic engagement, interest in ML, and skill-building for future creators of and with ML technologies.

Several frameworks have been proposed for introducing ML in primary and secondary education, as well as for non-experts more broadly [44, 70, 83]. These frameworks propose foundational concepts people should learn about ML, including that data plays a central role in how ML models make decisions, and that humans play a role in shaping the ways data is selected. Yet these frameworks are initial guidelines, and research is needed to inform both teaching and tool design that can effectively support non-experts in learning these ideas. Even within higher education, prior work has stressed the importance of engaging with dataset considerations essential to model robustness and generalizability, which are often overlooked in favor of educating future ML practitioners about model architecture and the use of off-the-shelf datasets [61]. Because many harmful problems with ML applications today result from a lack of representative data [12, 25], educational efforts to increase understanding about the role of data in ML are especially needed.

A growing number of tools have been created for beginners to build ML models using their own data [1, 13, 67, 72, 85]. However, existing tools have limited support for learning to design datasets for inclusive models because they are,

(1)

optimized for data collection and testing from a single user, increasing the likelihood of imbalanced or biased data that may not generalize to other use cases

(2)

center on ephemeral live classification, where model results are evaluated in real time, lacking support for reviewing and debugging misclassifications over time and monitoring how model performance might change in response to refining data.

To address these gaps, we developed a novel ML model-building tool called Co-ML for multi-user data collection and model testing on tablets. We made collaboration a central feature of the Co-ML experience because we envisioned that working with others to gather data, analyze data, and test models can surface multiple points of view beyond those an individual learner might consider on their own. Through a collective experience for reviewing, discussing, and debugging ML models and datasets, we expected that learners would consider, address, and enact key dataset design practices (DDPs) such as incorporating dataset diversity and inspecting for data quality as they built their models with Co-ML.

In this article, we describe how features of Co-ML were designed with these DDPs in mind. We then share our evaluation of Co-ML through a 2-week long ML Summer Camp pilot for high school girls and non-binary youth in partnership with the non-profit Kode with Klossy. In the camp, youth worked in groups to create ML models with Co-ML and build their own mobile applications applying, these models to personally relevant topics of their choice such as healthy eating and sustainable fashion. Throughout the camp, our research team studied how students worked together to build their ML applications. Our mixed methods analytical approach incorporates data sources such as observation notes and audio recordings of student interactions, in-app logs of how students navigated the app, design journals students updated while building their final projects, and post-camp semi-structured interviews to learn about challenges novices faced when designing and debugging ML models.

Through our analysis, we examine the following research question: How does collaboration supported by Co-ML shape the learning of four DDPs: (1) incorporating dataset diversity, (2) evaluating model performance and its relationship to data, (3) balancing datasets, and (4) inspecting for data quality?

Our contributions are

(1)

A framework of DDPs for learning about the role of data in ML systems.

(2)

A description of how features in Co-ML, a novel, collaborative ML modeling app, along with an accompanying learning experience, were designed to foster the development of DDPs.

(3)

A discussion of how collaboration enabled by Co-ML shaped student understanding of each dataset design practice through our deployment and study of Co-ML in 2-week summer camps for teenagers.

2 Related Work

We begin with an overview of professional ML data work to provide context about data considerations in ML practice, followed by a summary of ML education efforts for youth and research on collaborative learning. Finally, we summarize the gaps in existing novice-oriented ML tools for fostering DDPs, which leads into our description of the Co-ML application and how we designed a collaborative learning experience to address these gaps.

Our work focuses specifically on ML, a subset of AI in which computers detect patterns in data to make useful predictions about new data. In particular, we align with efforts to empower more people to contribute to ML (even without deep expertise in ML algorithms and architecture) by building supports for machine teaching [54, 65], which emphasizes human teachers and how they interact with data to build ML systems. In doing so, our goal is to enable beginners to reason about the role data plays in ML models by directly engaging in building datasets and models.

2.1 The Role of Data in ML

Data is foundational to building ML models and largely determines model performance, robustness, and generalizability [61]. Preparing data is an essential part of the ML modeling pipeline, estimated to take up 80% of a data scientist’s time [55]. Data preparation involves many considerations across the model-building process, from planning and prioritizing what data to initially collect, to iterating on data in response to model performance issues or shifting project goals [33]. Failing to properly design datasets fundamentally jeopardizes the usefulness of a model, as the classic adage “garbage in, garbage out” advises [6], and can lead to data cascades or compounding negative downstream effects as a result of data issues [61]. Using unrepresentative data, or data that do not accurately reflect a model’s intended users and use cases, can result in problematic, biased models that discriminate on a range of attributes including race and gender [12, 16, 50].

In preparing data, model designers need to consider multiple dimensions of data quality that are context dependent [68]. These dimensions include how representative the data is, or how comprehensively the data reflects characteristics important for the model’s ultimate use; how clean the data is (that the data is properly labeled and does not contain duplicates, for example); how balanced a dataset is, or that the distribution of samples across classes is equitable (the opposite of this is considered class imbalance); and how diverse the data is, or that the data reflects variability in the populations using a model and the contexts in which the model is used [27, 34]. While ML practitioners can leverage different techniques to measure the diversity of their dataset [49], they also need to take into account learnability, as some classes may be inherently harder for a model to learn and thus require more data than others [34].

In an effort to improve their models, practitioners often iterate on their data. Typically ML models include training, testing, and validation datasets, where training data is used to train a model, validation data is used to assess model performance when tuning model parameters, and testing data is for evaluating performance of a trained model. Iteration often includes adding, removing, or modifying samples from any of these datasets. Adding data can involve expanding the dataset by collecting additional samples from a random population, or more targeted efforts to address underrepresentation or class imbalance for a specific label; removing data happens in response to identifying noisy or erroneous data [33]. Prior work examining practitioners’ sensemaking practices when reviewing data found that collaboration, where multiple people interact and discuss data, can help support decision making about and inform understanding of data [40].

Despite the importance of data work in professional ML practice, ML education efforts, as well as ML research, have historically focused on model architecture rather than data design practices. Undergraduate and graduate courses in AI and ML often use existing toy datasets from Kaggle or similar platforms, limiting opportunities for students to actively learn about data preparation practices and data quality issues firsthand [61]. As we consider how AI and ML education might expand in service of even younger audiences in K-12, we see the need and opportunity for foundational data design practices to be incorporated into tools and learning activities so that learners can authentically encounter and engage with data as they learn how ML models work.

2.2 ML Education for Youth

Efforts to build out educational resources for learning ML are motivated by the growing role ML technologies play in our daily lives, alongside issues like a lack of transparency and public understanding of how these systems work. While efforts to integrate AI education into school took place as early as the 1970s, there has been a marked growth in instructional resources over the past half decade [48], with a shift toward ML more specifically [37]. The importance of ML education is further recognized by this journal specifically, with a special issue of TOCE from 2019 highlighting opportunities for creative applications in arts and design [26, 64].

Over the last decade, a number of AI Literacy frameworks have been proposed, outlining educational opportunities for teaching how AI and ML systems work. These frameworks include the AI4K12 5 Big Ideas in Artificial Intelligence [71], AI Competencies [44], and design frameworks for K-12 AI Education [83]. Within these frameworks, there are several core considerations around data in ML models, including that computers learn from data (Big Idea 3 from Reference [71] and Competency 12 from Reference [44]), and that training datasets are often constructed and edited by humans and affects how models perform (Competency 13 [44]). These frameworks also stress the importance of accounting for bias that can result when populations are underrepresented in training data, and that humans play a role in dataset construction, cleaning, and verification that can minimize these biases. Arastoopour Irgens and colleagues, for example, References [4, 5] investigated how children (aged 9–13) used Google Teachable Machine to build and critique models, guided by curriculum that engaged them in considering the impact of biased datasets as well as the social, ethical, and political implications of ML technologies. Because ML systems have fundamental differences compared with classical notional machines, Tedre et al. have argued that computational thinking frameworks need to be expanded to accommodate ML practices and ideas, such as inductive problem solving techniques and trial and error approaches to debugging [69].

The AI4K12 guidelines around Learning (Big Ideas 3) underscore the need for big data, describing how “Large amounts of training data are required“ along with “thousands to millions of trial and error experiments to solve narrowly defined problems.“ However, transfer learning is enabling people to build functioning models on smaller datasets by modifying pre-trained models [77, 80]. Young learners can effectively use transfer learning to iteratively improve models by analyzing misclassifications and revising datasets quickly in the context of small projects [31, 75]. Transfer learning presents opportunities for novices to use their own data to build and evaluate ML models rather than relying on the use of existing large datasets; prior work has also found that learners using personally-relevant data were able to reason about the mechanisms of an ML-system and self-advocate against potentially harmful results [56].

While proposed AI Literacy frameworks put forth ideas for age-appropriate ML concepts, research is needed to validate the appropriateness of these frameworks and to determine practices, tools, and approaches that can best support learning of these ideas. A growing number curricular tools and resources have been developed for age-appropriate introductions to AI [22, 48]. Existing efforts aimed at youth at the secondary education level (aged 11–18) often forefront ethics [2, 42, 43, 52, 79], centering on the use and exploration of existing ML models. In contrast to using off-the-shelf ML models, our work focuses on the design and implementation of ML tools that allow learners to create ML models and projects using data they collect themselves.

Several novice-oriented tools for ML model building extend popular blocks-based programming environments to accommodate ML features, including extensions to Scratch [79], MIT App Inventor [67, 74], and Snap! [37]. These examples typically combine off-the-shelf pre-trained models with interfaces for beginners to add their own data and train a model [20, 41], supporting novices building applications with features like sentiment analysis and image detection. Other novice-oriented ML tools include production applications like Google Teachable Machine [13] and prototype research tools like PlushPal [72] and AlpacaML [84, 85], where beginners can create image, gesture, or sound recognition models. Typically, novice-oriented ML modeling tools incorporate live classification interfaces for evaluating a model; for example, for an image classifier, the interface may show in real time what the model’s top prediction is based on live streaming camera input. Evaluations of these tools suggest that when students use ML tools to create personally meaningful projects, they are able to demonstrate increased understandings of ML concepts [2, 20, 23, 24, 36, 37, 67, 75, 85].

Despite these affordances, we identified several gaps in existing ML modeling tools. First, these tools center around individual use, where a single person trains and tests a model using only their own data. Having data from a single user limits opportunities to work with diverse data representing use cases and perspectives beyond what an individual may consider on their own; important dataset considerations like dataset diversity and model generalization may thus be more difficult to encounter and resolve.

Second, because existing tools center model evaluation on real-time live classification, they lack support for systematic review of misclassified examples and assessment of model improvement in response to changes (such as adding different training data or tuning model parameters over repeated iterations). Typically, in professional ML practice, models are assessed with regard to test datasets, a portion of collected data specially reserved for evaluation; yet, beginner-friendly ML tools today lack support for test datasets, making it more difficult for users to assess whether or not their model is improving.

Finally, existing tools are largely web-based and are not supported on mobile devices. Mobile-supported data collection could more flexibly accommodate learners collecting data in the wild (as compared with front-facing laptop webcams for images and video, for example), which may further support opportunities for expanding dataset diversity.

2.3 Collaborative Learning

Collaboration with peers has the potential to positively impact student learning in appropriately designed activities [9, 58]. Collaborative work can activate socio-cognitive processes for learning: asking questions, explaining one’s thinking, providing a critique, and resolving differing perspectives, all of which are difficult or impossible to accomplish when working alone [7]. The benefits of peer-to-peer collaboration are well-documented in computer science education [32, 59], with prior work highlighting how, through collaboration, students can learn how to share responsibility while completing a task [46], participate in and learn from group discussions [53], and pair program with increased confidence [8, 78]. Students engaging in peer collaboration are also more likely to persist in the discipline [11, 60]. Furthermore, prior work has found active discussion in peer-to-peer problem solving [7] to correlate with novelty of student-created designs [17].

Standard pair programming models typically involve students working alongside one another on a single machine, with one person “driving” the experience by having control over a code editor. Prior work on pair programming has explored how collaborative dialogue between students may positively shape how students debug and learn together [35, 57], with a range of types of dialogue that may be more or less supportive [18].

In contrast to this notion of pair programming, our work focuses on synchronous co-editing, where multiple people can edit a single project simultaneously on multiple devices. While synchronous editing has been examined in the context of collaborative blocks-based editors [46, 63], multi-user experiences in the context of data science and ML has focused almost exclusively on professionals. For example, recent work has contributed computational notebooks that support multi-user synchronous editing [29, 76] for professional data scientists, finding that synchronous editors can increase group exploration and reduce communication costs compared with developers working in individual notebooks [76]. More broadly, an important feature of collaborative software tools is improving group awareness so that team members can better coordinate actions and develop a shared mental model for their work [19].

In the context of ML education, recent work has highlighted a need for research to examine the role collaboration can play in student learning [62]. Studies conducted with families (adults and children) have shown affordances for group learning and reasoning about ML [21, 45] by supporting parent-child dialogue and distinct parental facilitation roles. While some related work in ML education for youth described in Section 2.2 involves students working in collaborative learning settings like such as classrooms [75], multi-day camps [73], workshops [85], and online interventions [79], to the best of our knowledge, only two previous studies have specifically studied the nature and affordances of peer-to-peer student collaboration when working on ML projects.

Prior work examining how youth use Google Teachable Machine described an activity where students built models individually using workshop-supplied objects, then swapped models with others for testing. This helped students realize situations where a model might not work well for new users [23]. Kaspersen and colleagues [38] developed VotestratesML, a web-based tool designed for students (aged 17–20) in social studies classes to explore ML in small groups. Using VostratratesML with pre-existing datasets, students collaboratively made decisions about tuning model parameters and making revisions based on choices of features, algorithms, and model output. Because groups’ results were projected publicly, students were more engaged in discussions about their results.

An area understudied in prior ML education research is how learners can collaboratively design models using their own constructed datasets, rather than supplied materials or existing toy datasets. Our work with Co-ML explores the potential of this type of collaborative experience in the context of students building models addressing topics of personal relevance.

2.4 Research Opportunities for Fostering Dataset Design Practices for Novices

Our review of current ML education efforts identified that existing novice-oriented ML tools center around a single-user experience where an individual collects only their own data, providing limited opportunities for learners to create diverse, balanced datasets that meet the needs of more than a single individual or use case. Additionally, these tools lack robust model evaluation metrics and features that enable users to engage with key skills in professional ML practice, such as building test datasets and evaluating their model performance across multiple iterations of their datasets.

We argue that a collaborative modeling experience may help mitigate these issue because (1) larger datasets can be built more quickly by multiple users compared with a single user, potentially providing more opportunities for issues like imbalanced data to arise; (2) individual differences in data collection strategies may become visible through a multi-user experience, since people may have different points of view on what is considered to be representative data—and this may, in turn, lead to more diverse data; and (3) a collaborative experience provides opportunities for collective discussion of model issues, which may help learners deepen their insights about how ML models work through active discussion with their peers. Our hope is that a collaborative ML modeling tool can help learners create diverse datasets and link the design of their datasets with the performance of a ML system.

Next, we summarize the specific DDPs we designed Co-ML to support, followed by a description of the design of the Co-ML experience.

3 Dataset Design Practices

Summarizing the considerations and practices of ML practitioners for collecting data (described in 2.1 and by [27, 33, 34, 61]), we propose a set of DDPs for fostering learning about the role of data in ML modeling. Specifically, this set of DDPs is intended to foster student learning as they collect and construct their own datasets in the context of building models for their own use.

We will refer to Table 1 as we describe how Co-ML was designed to support these DDPs, particularly in the context of novices building supervised ML image classifiers. Furthermore, we present the results of our evaluation of Co-ML with respect to how students engaged with these DDPs as they designed custom ML models.

Table 1.

	Dataset Design Practice	Description
DDP1	Incorporating dataset diversity	Ensuring that data is representative and accounts for the diverse characteristics of a label and the variety of use cases where a model might be used
DDP2	Evaluating model performance and its relationship to data	Understanding how well a model is performing Identifying gaps or confounding factors in data that might impact model performance Assessing whether a model has improved after dataset revisions and model retraining
DDP3	Balancing datasets	Designing datasets that have roughly equal distribution of samples across labels and ensuring model performance is consistent across labels
DDP4	Inspecting for data quality	Checking that data is properly labeled and of sufficient quality (e.g., that image data is not blurry)

Table 1. Dataset Design Practices

4 Co-ML System and Companion Starter App

To support beginners with thoughtful exploration of the role of data in ML model performance, we designed a collaborative modeling mobile app called Co-ML. Co-ML is a tablet-based app that supports a multi-user experience for collecting image data, training an ML image classifier, and testing the model’s performance. Multiple people can collect image data using the camera on their individual tablet, and the data is synchronized across devices so that everyone is working with the same shared dataset as they iterate on their models.

In this section, we share our design goals in creating Co-ML along with a description of the app. Because our intention is to support users with integrating ML models into custom applications, we also describe a companion starter app we designed to be used alongside Co-ML to create mobile apps integrating ML models. While analysis of students’ use of the starter app is out of scope for our study, we share its design to provide context for interpreting our descriptions of what students built.

4.1 Co-ML Design Goals

We had two primary design goals for the Co-ML application to enable: (1) diverse perspectives in the model-building process, and (2) iterative model testing for monitoring how model performance changes in response to dataset revisions. In this section, we describe how these two goals drive the design of the features of Co-ML and how they relate to the DDPs (Section 3) we intend to foster.

For our first design goal of enabling diverse perspectives, we imagined that individuals may differ in the ways they collect data, including how they take photographs and the contexts in which they take them. We expected these differences to emerge when multiple people contribute to a dataset, especially with a mobile interface that allows flexible data collection in a variety of settings. As a result, a core consideration in the design of Co-ML was how the data is synchronized and displayed throughout the modeling experience so that users are more likely to encounter other people’s perspectives and practices as they build their models. We designed Co-ML to support learners in incorporating those differences by diversifying their dataset (DDP1), considering class balance as it relates to sample distribution (DDP3), and inspecting and assessing data quality from multiple individuals (DDP4).

For our second design goal of enabling iterative model testing, we designed Co-ML to provide feedback about how well the model is working to best support learners exploring the relationship between data and model performance (DDP 2). To this end, the app provides actionable, beginner-friendly feedback to let users know how well their model works at the moment and whether the model has improved after retraining. We also recognize that iteration is easier when there is minimal friction in the model training process, so we aimed to reduce model training time and latency when synchronizing data across devices.

4.2 Co-ML System Description

Co-ML has a data-focused ML model building experience that incorporates the following user flow: (1) defining an ontology of labels, (2) collecting data, (3) training a model, (4) evaluating model performance, and (5) iterating on a model by modifying the label ontology and or datasets. After creating a project, users can generate a shareable link (sent via e-mail or text message, for example) and invite collaborators to contribute. Here, we walk through an example scenario involving multiple people collaborating on a project to identify various fruits, with each person using their own tablet to collect data.

4.2.1 Define Ontology of Labels in a Classifier.

Users provide the names of labels that their classifier will be able to predict. Any user can add, rename, or delete labels in a project.

4.2.2 Collect Data.

Selecting a label from their list of labels opens up the tablet camera, where image data for that label can be captured. When adding images, a stream of recently added images for a given label (consolidated from all users within the project) are displayed alongside the camera feed to bring awareness to what images have already been added and to encourage users to inspect different ways others might be capturing their training data. This is shown in Figure 1. Furthermore, the number of images per label is displayed next to the label name so that users can monitor how their dataset is being updated and potentially consider labels that might need more attention due to a lack of data or imbalance (DDP3).

Fig. 1.

At any point, users can tap the View All Data button to view the Training Data Dashboard, which displays training data collected across all users, organized by label. The dashboard consists of a grid of training images, with 25 images visible at a time on a 11” tablet screen to encourage users to look for patterns or gaps in their dataset and inspect for any issues around data quality (DDP4). To support users inspecting lots of data at once, we designed Co-ML for tablets as opposed to mobile phones.

Co-ML enables a distributed data collection experience, with users having agency to choose how to take photos with their own tablet and position the camera and objects they are photographing. We imagined that this distributed experience could support the emergence of individual differences in data collection strategies, which may ultimately support groups diversifying their dataset (DDP1).

Image data collected in Co-ML are stored and synchronized using private cloud-based data storage, with image data accessible only to users within a shared project.

4.2.3 Train a Model.

Models are trained on device using the Create ML API [3], with model training taking approximately 5–10 seconds for datasets of a couple of thousand images (though larger datasets are supported). We minimized training time in an effort to reduce friction for users iterating on their model and datasets. The trained model is stored locally and not synced across devices, as we imagined, users might want to test the model in different states, such as by comparing model performance before and after modifying a dataset by adding more images.

4.2.4 Evaluate Model Performance.

Immediately after a model is trained, a camera interface appears where users can test the model on new data. They can take a photograph and view model results (Camera Classification Mode), or turn on real time classification (Live Classification Mode) where a live bar chart of confidence levels displays how the model is interpreting camera data in real-time similar to Google Teachable Machine. When users capture images in Camera Classification Mode, they can indicate whether the model was correct in its prediction and provide the correct label if the data was misclassified, as shown in Figure 2.

Fig. 2.

All images captured in the Camera Classification Mode are automatically saved as test data that can be reviewed in a Testing Data Dashboard (similar to the Training Data Dashboard). The Testing Dashboard was designed to support student interpretation of model performance (DDP2). As shown in Figure 3, the latest model classification results are shown, with green checkmarks indicating if the image was correctly classified and red x marks indicating that an image was misclassified. The Testing Dashboard sorts misclassifications before correctly classified samples, encouraging users to identify differences in data that might lead to misclassification. Furthermore, if users revise their data (by adding or removing images), the testing dashboard always displays the latest result after a model is retrained—in this way, learners can check whether the total number of misclassifications for a given label changes over time.

Fig. 3.

We also created an in-app game as another way to support testing in a playful, structured format, as displayed in Figure 4. The game consists of multiple rounds in which a target label is given, and the player must show that item to the camera. They score points in each round based on the confidence level with which the model identifies the object (for example, an apple classified with a confidence level of 75% would get 7.5 points). The game consists of multiple rounds that the user completes within a 90 second time limit, designed so that users would be testing each label in their project multiple times. At the end of the game, the user can see their cumulative score, a high score that serves as a proxy for monitoring whether a model has improved or not after iterating on the data, and details of the individual rounds to review misclassified items (DDP2).

Fig. 4.

4.2.5 Iterating on a Model.

At any point, users can revise their training and testing data by adding or removing images or adding or removing labels. After making changes, they can retrain their model and test its performance to see if it has improved. The app uses a flat permission structure, with all users having the same abilities to add and delete data throughout the model-building process.

4.3 Starter App

To support users applying models they create in Co-ML, we developed a companion mobile starter app designed for beginners to program their own ML-powered apps. Users first export their ML model from Co-ML, and then edit and write code in the starter app (using the Swift programming language) to build a real-time classification experience using camera input on mobile devices.

With the starter app, users design and specify what information should appear when an item is classified by the camera. The underlying implementation (including the consumption of an ML model, instantiation of the camera, and real-time inference from live camera input) are handled under the hood to reduce development time and knowledge required to build an ML-powered app.

The default experience of the starter app is displayed in Figure 5, where users can customize a Launch Screen describing the purpose of the app and the labels the classifier can identify. The app then exposes a camera interface for live classification, and users can specify the information they want to appear in the UI for the top classification result (in the example below, that a tomato requires 3 gallons of water to mature). The specific properties to be surfaced in the camera classification interface is handled by a user-edited JSON file.

Fig. 5.

The source code for the starter app was provided for students to customize the design of their own ML-powered mobile applications.

5 Methods

Our research team collaborated with the non-profit Kode with Klossy on the design and implementation of 2-week ML Summer Camps for young women and gender-expansive youth at the secondary education level. During these camps, students used Co-ML in a variety of hands-on model-building activities, working in groups to develop final projects involving the design and development of custom ML models and apps.

In this section, we first provide an overview of Kode with Klossy and the creation of the AIML Summer Camp design and structure. We then describe Kode with Klossy’s recruitment strategy and how our research team recruited camp participants for our research study. As not all camp participants were part of our research study, we used student to refer to anyone enrolled in the camp, and participant to refer specifically to students that consented to participate in the research. Finally, we describe our data collection and analytical approach to understand how participants used Co-ML to support their understanding of ML data design practices.

5.1 Kode with Klossy and Creation of AIML Summer Camp

Kode with Klossy is a non-profit that offers in-person and virtual summer camps for girls and gender expansive youth ages 13–18 in web development, mobile development, and data science in over a dozen cities across the United States, with 10,000 alumni as of Summer 2022. Camps are offered free of cost to all participants and do not require any prior computer science experience. KWK’s central mission is to empower voices that are typically underrepresented in technology and so the organization makes special effort to recruit diverse youth, with 82% of program alumni identifying as people of color and 50% qualifying for free and reduced lunch at school. (In the United States, free and reduced lunch qualification is calculated by multiplying Federal income poverty guidelines by 1.30 and 1.85, respectively [51]; free and reduced lunch is commonly reported as a measure for socio-economic status).

In surveys with alumni, AI and ML were the top request for future camp offerings. Since KWK did not have in-house expertise in AI or ML, KWK partnered with our research team and others to design and pilot two in-person ML summer camps in Summer 2022: the first taking place in Seattle, and the second was held in New York City. Each camp was led by two KWK instructors with prior experience teaching existing KWK web development camps. In advance of the camp, KWK’s instructional team met regularly with our research team over several months to learn how to use Co-ML and co-develop curricular activities.

5.2 AIML Summer Camp Structure and Activities

During the AIML Summer Camp, students met Monday through Friday from 9 am to 5 pm over two weeks for a total of 80 hours. In the first week, instruction was provided on a variety of AI/ML topics, covering the basics of what AI is, various types of ML systems and model types (lightly adapted from the MIT DAIly curriculum [30]), and the dangers of bias in ML systems, all alongside team bonding activities and discussions around diversity in tech. Additionally, there were introductory lessons on the Swift programming language and using the Xcode developer environment to create iOS apps. Each student was provided with a laptop and a tablet to use for the entirety of the camp.

The Co-ML app was introduced in two structured modules during the first week, each lasting between 3 and 4 hours and involving students working in groups of 3–4 people. In the first module, students built fruit classifiers using apples, oranges, grapefruits, and mangoes. In the second module, students built classifiers with items of their choice that they brought from home. For both modules, facilitators guided students through using Co-ML’s training and testing interfaces and the in-app game. Colored and patterned cloth swatches were provided for students to use as backdrops for their items so they could both capture data and test their models using diverse backgrounds. During these activities, the instructors emphasized that it was important to try to capture objects from a variety of different angles and perspectives to ensure that the model had more data about the appearance of objects in their classifier. Project groups for these two modules were randomly assigned.

The introductory Co-ML modules were followed by a culminating final project, where students worked in new teams to build classifiers and apps on topics of their choice.

5.2.1 Final Projects.

The design prompt for the AIML Summer Camp final project was to “Build an image classifier (Co-ML) embedded in an app (Xcode and Swift) that addresses a topic of your choice.” This open-ended project brief was designed so students had the space to tackle a problem of personal interest, curiosity, or relevance.

Final project teams were formed by the instructional staff by the end of the first week of camp. The instructors grouped students based on their shared interests, which students indicated through a survey asking which topics they were most passionate about. Topics were selected from a list KWK provided, including Animal Rights, Climate Change, and Creativity/Arts. Students were not bound to choosing a final project based on their initial shared interest; rather, these topics were used as a starting point for students to ideate on personally relevant projects themes.

The last week of camp was primarily unstructured work time for students to develop their projects. Groups pitched their final project ideas to the class and guest panelists on Monday of the second week, received feedback and suggestions, and then developed their ML models and custom over the next three days, with 9.5 hours total working time. Students built their models in Co-ML by photographing items they brought from home to use during the camp. They then incorporated these models into custom iOS apps for creating a real-time image classifier using camera input (editing the starter app described in Section 4.3). Each team was assigned a Project Manager, a member of the KWK instructional staff that monitored the team’s progress and provided feedback and guidance as needed.

Students documented their design process in digital design journals edited in Google Slides, with one project journal per group. The digital design journals provided templates for documenting model issues and revisions to their dataset, with space for adding screenshots of their Co-ML projects and text descriptions of their process. We provided three different journal templates that students could duplicate when documenting their projects: Issues, Wins, and Changes. The Issues template invited participants to “share any issues (unexpected outcomes or bugs) you identified when testing your model or app,” providing space for describing the issue identified, ideas for what might be causing the issue, and what they tried in response to the issue (including whether or not their changes worked). The Wins template was a space to “celebrate breakthroughs and accomplishments your team made in developing your project”, with space for describing the accomplishment and sharing any insights that led to the breakthrough. Finally, the Changes template could be used to “capture if you decided to change directions in your final project” and was intended to be used for major changes to models as opposed to smaller bug fixes. In the Changes template, space was provided for describing the change and their rationale for why the team made the change: “What led to this decision, and how do you think it will affect the user experience of your app”?

On the final day of camp, all teams presented their final project to an audience of invited family and friends.

5.3 Camp and Research Study Recruitment

In the context of the analysis described in this article, we focus specifically on the second AIML Summer Camp because for this camp, our team instrumented Co-ML with event-based logging to enable more nuanced analysis of students’ collaborative modeling practices (described more fully in Section 5.4). In this section, we describe KWK’s recruitment strategy for the NYC camp and how our research team invited students to participate in our research study.

KWK had a two-prong recruitment strategy for recruiting students to the NYC camp. First, they individually e-mailed alumni that had exhausted all 3 existing KWK camps in NYC and invited them to attend the AIML Summer Camp pilot by answering three open-ended questions about their interest (including “Why are you interested in AI or ML?”); of the 19 students that were invited, 10 applied and all were accepted into the program. Their second approach was to recruit from the pool of students that applied to advertised KWK camps (data science, mobile development, and web development); applications were scored by a member of KWK based on several metrics (curiosity, community-minded, and motivation), and students were selected based on a variety of factors, including application scores, free lunch eligibility, and prior experience in KWK camps.

On the first day of camp, a member of the research team introduced themselves and invited students to participate in our research study, outlining that participation was voluntary and required no additional work—all students completed the same assignments as part of their participation in the camp regardless of study enrollment. Students who consented to participate returned signed consent forms from themselves and a parent or guardian (if the participant was under the age of 18), and study participants were grouped together throughout both weeks of camps.

5.3.1 Study Participants.

A total of 26 female and non-binary identifying students participated in the NYC AIML Summer Camp, and 18 participants consented to take part in our research study (69%). Study participants were between the ages of 15 and 18 years old and came from five different states from the East Coast and Midwest United States (with the highest representation from New Jersey and New York). Eleven (61%) participants qualified for free or reduced lunch at school. Fourteen of the 18 participants self-reported their race and ethnicity (with 4 identifying as multiracial); 13 participants identified as Asian (Asian Indian, Chinese, Korean, or Other Asian), 1 identified as White, and 1 identified as American Indian. A majority of participants (78%) were alumni of other KWK summer camps, with 44% having completed all three existing KWK camps in web development, data science, and mobile development.

5.4 Data Collection

During the camps, 3–4 members of our research team were embedded in the classroom and captured participant experiences through classroom observations, notes, and audio recordings of participant dialogue; additionally, one member of the research team captured photos and videos of participants’ interactions. During the final week, our research team shadowed three project teams to observe their end-to-end process of coming up with a project idea, training and iterating on their model with Co-ML, and building their ML apps. After initial project pitches on Monday of the second week, the research team selected these three teams in an effort to represent a diversity of project topics and collaborative styles (based on observations we had of individual participant’s working styles from the first week, with one team consisting of quieter participants and another with more vocal participants, for example).

To understand how individuals and teams collaboratively built models in Co-ML, we instrumented the Co-ML app to capture logs of actions taken on each tablet. These logs included each time an image was added to the model, whether that image was for training or testing the model, the label the image was associated with, and the raw image data for saved images. We also collected logs when the model was (re)trained, including the number of testing images in each label that were classified correctly, and when a user started the live classification interface or the game. Each of these events was recorded with an associated timestamp and an ID for the tablet being used. Along with log data, we captured screenshots of each Co-ML project’s training and testing dataset dashboards twice a day on Tuesday, Wednesday, and Thursday of the final week, at noon and at the end of the day. This enabled us to assess how the number of training and testing images might have changed, as well as to qualitatively analyze features of deleted images (as we did not retain deleted image data in our logs to preserve user privacy).

Each project team had the option to grant the research team access to their Co-ML projects, which included all images they added to their dataset, and all study participants granted permission. The image data captured in Co-ML was stored in private cloud-based datastores and were only accessible to members of the research team and the participants within each project team. We also collected copies of participant artifacts (including design journals, presentation decks, and Xcode projects). Participant presentations were both video and audio recorded for our analysis.

Throughout both weeks of the camp, participants filled out daily surveys about their experience, including a question designed to measure their confidence engaging in DDPs, which asked participants to rate how strongly they agreed with the statement “I can demonstrate how data can influence model performance” on a scale from 1 (strongly disagree) to 5 (strongly agree). The instructors allocated 10 minutes at the end of each day of camp for students to fill out the daily surveys. Daily surveys are part of the KWK camp experience, regardless of participation in the research study; as most participants in the AIML camps had prior experience with KWK summer camps, they were likely accustomed to the daily survey format.

At the conclusion of the camp, we invited all study participants to 2-hour virtual debrief sessions using a semi-structured interview protocol. We asked participants for feedback about their overall camp experience and their use of Co-ML specifically. All 18 study participants attended the post-camp debriefs, which were held 2 weeks after the conclusion of the camp, and received a $50 giftcard for their time. To accommodate all participants’ availability, we held two separate debrief sessions using the same format and facilitated by 3–4 members of our research team.

Each debrief session began with a group discussion and reflection about the overall camp experience, followed by smaller breakout rooms with 3–4 participants each, where participants reflected on their final projects. Breakout groups were composed of participants from different final project teams in an effort to (1) encourage participants to compare and contrast approaches taken on by other teams, and (2) enable our research team to investigate descriptions of design process from each participant in a team to learn how individuals contributed to group efforts. We captured video and audio recordings of these virtual debriefs for our analysis of participants’ reflections on their experiences.

5.5 Analysis

We take a constructivist stance toward studying and designing for learning; that is, we theoretically conceptualize learning as knowledge building. Learning happens through activities, wherein people activate and draw upon prior understandings to make meanings of information around them, and then build new understandings when interactions with their environment challenge the completeness or utility of prior understandings [10, pp. 8–14][47]. This theoretical stance necessitates attention to the processes through which people learn, including what they attend to, and how they make sense of their observations, explain their understandings and consequent actions, and change strategies as their understandings change over time. We attend to process through documentation and analysis of the modeling actions that participants take as they use Co-ML, especially moments when they make sense of model performance, relate that performance to characteristics of training and testing dataset compositions, and coordinate their activity with one another, which includes justifying proposed actions with their interpretations of their observations.

Because we take a constructivist stance, the best evidence of learning is in participant-driven ideas and implementation of DDPs; as a result, our analysis focuses on participants’ final projects, where teams chose what apps to develop, what data to use, and how to improve their models over 9.5 hours of working time across 3 days. Our research team anticipated that we would see the most variation in debugging scenarios and approaches during this time as a result of teams working on distinct project ideas.

In this section, we describe our approach to analyzing data collected during the AIML Summer Camp. Our research team met regularly over the course of 10 months to discuss the data and reach consensus on our interpretations of how DDPs were represented in participants’ modeling practices.

5.5.1 Data Cleaning and Preparation.

Before our team began analysis, we cleaned and prepared the log data. We reviewed all participant-collected image data in Co-ML and removed images containing any personally-identifiable information to preserve participant and non-participant privacy. Criteria to remove images were created in a discussion of four authors, and then applied to a random sample of images. After another round of discussion, the criteria were refined and then a single author flagged images for removal. Flagged images were reviewed by all four authors who developed the criteria. In total, of the 6,756 total images students captured in their final projects, 4.2% of images were removed. Removed images included images in which non-study participants were inadvertently captured or personally identifiable information like full names were visible.

In addition, we cleaned the log data so that logs only represented actions completed in Co-ML by each study participant (as opposed to actions in Co-ML taken by the research team during our data collection throughout the camp).

We then created visualizations from our Co-ML log data to construct timelines of team activity. Using our visualization tool, we could trace (1) an individual’s model-building process and what features of the app they were using (e.g., if they added training images of one label, trained the model, and then tested the model), and (2) what their team members were doing in parallel (e.g., we could see if two team members were adding data to the same label simultaneously). Figure 6 shows a screenshot of this tool on data from one of the project teams in our study, displaying how our research team could inspect data added by individuals over time. Hovering over any image displays a tooltip indicating whether the image was added to the training or testing dataset; for testing images, we could also inspect whether the image was classified correctly and what the top predicted label was from the model.

Fig. 6.

Finally, we transcribed audio from three different sources: recordings of conversations between team members while they worked on their final projects, the final presentation each team gave, and the post-camp debrief interviews.

5.5.2 Description of Modeling Actions.

One of our first analysis priorities was to understand how frequently students used various features of Co-ML. Let us see how participants and teams actually constructed the models in their final projects. To do this, we primarily looked to the log data. For each app feature, we calculated the frequency of use for both participants and teams. For example, we calculated how many times members of a particular team trained their model and how many times they played the game. We also calculated the size of training and testing datasets for each group. Finally, we analyzed screenshots of Co-ML that students incorporated into their project journals and final presentations to provide evidence about model performance. While these screenshots are not exhaustive of all features they used in Co-ML, they provide an indication of which parts of the Co-ML interface students felt supported their descriptions of how well their model was working.

5.5.3 Identifying Data Diversity Strategies from Co-ML Datasets.

To answer our question of how participants considered dataset diversity (DDP1), we examined the image data from all team Co-ML projects using a grounded-theory approach [14, 28, 66]. We began with inductive open coding of the corpus of images from final project Co-ML datasets to categorize different strategies for incorporating data diversity (such as if teams took photographs of objects from different angles, with different backgrounds, or captured multiple potential form factors of an object). The open coding was performed by the lead author, then collectively discussed with three other members of the research team to identify additional codes and to cluster codes based on similarity. Once our group was in agreement on all four overarching data diversity strategies identified in the data, we then reviewed the image data again to deductively code each team’s dataset, leading to frequency counts of how many teams utilized each data diversity strategy in constructing their datasets.

5.5.4 Identification of Collaborative Modeling Moments.

To understand how participants engaged with each DDP while collaboratively building models with Co-ML, our research team identified model debugging moments from each team through a multi-step and multi-source process. Analysis of these debugging scenarios ultimately helped us construct in-depth case studies of how Co-ML and peer-to-peer interactions helped participants enact DDPs in refining their models.

First, we reviewed transcripts of team final presentations and final presentation slide decks, where participants provided descriptions of challenges they worked through when implementing their models and mobile applications. The presentation decks often included screenshots from Co-ML that complemented their verbal descriptions; for example, when describing how they resolved a misclassification scenario, some teams included screenshots of the testing dashboard in Co-ML showing how many of their samples were classified incorrectly before and after revising their datasets.

Second, we reviewed transcripts of post-camp debriefs, where students described these debugging scenarios in more detail in conversation with our facilitators. As participants were placed in mixed breakout groups containing participants other than their final project teammates, we compiled per-team documents containing statements from each team member to more fully characterize each individual’s role in the debugging scenario.

Third, we examined how participants described model issues in their project journals, including what screenshots of the Co-ML app they used to supplement descriptions of model issues, ideas of what they thought might be causing the issues, and what they tried in response to their ideas (and whether or not those changes improved the model).

Fourth, we reviewed our visualization of the log data to reconstruct participants’ actions in Co-ML during these debugging moments. To review the logs, we had to first determine approximately when these debugging scenarios took place. If a member of our research team observed the team directly, we reviewed dated observation notes to determine approximately when the scenario occurred. If needed, we also examined screenshots of Co-ML from project journals or final presentation decks to determine, using our logs, when a particular image sample was created. For example, if a team captured an example of a misclassified image in their design journal, we could find the image in our logs and determine when it was captured. We then selectively reviewed our visualizations around the identified time when the debugging moment occurred to corroborate participants’ reflections on their process from our interviews and their design journals.

Finally, our fifth step was to supplement log timelines with select transcription of participant dialogue from our audio recordings captured around the time of the debugging scenario. This let us triangulate using discourse between participants while debugging, logged actions they performed in the app, and reflections participants had on their experience.

We then deductively coded the debugging moments based on the DDPs we identified from the literature (Section 3). While each debugging moment might have encompassed multiple DDPs, we ultimately chose one in-depth case study for each DDP to best illustrate what the practice looked like as participants used Co-ML.

6 Results

We begin by providing an overview of the final projects participants built, including the problems they designed for and summary statistics about their datasets and model iteration. Next, we illustrate the ways collaboration shaped team understanding of the role of data in ML by considering learning with respect to each of the four DDPs from Section 3.

6.1 Final Projects Overview

Participants worked in teams of three to create custom ML-powered mobile apps centered on topics of their choice, ultimately addressing substantive issues around racial inclusion, consumer responsibility, and sustainability across diverse domains like food, fashion, and health. Table 2 shares a summary of the projects created by the six teams along with each of their models’ corresponding labels. The diversity of ideas and applications participants built suggests that Co-ML and the companion starter app flexibly supported a range of personally-relevant projects, ultimately inspiring participants to consider the multitude of ways ML can impact their lives. As one participant from Team Fashion (Fashion-P3) described: “All of our projects were about an important issue in society... We learned what AI and ML is and what it does, and more importantly, we learned how to take those skills and apply it to address important issues in the world.”

Table 2.

Team Name	Project Description	Labels
Plants	Classifying house plants and providing information on how to take care of them	cactus, succulents, pothos, orchids, monstera, snake plant
Fashion	Identifying fashion labels from clothing tags and providing information about the brand’s environmental impact	CHNGE, F21, H & M, Patagonia, Zara
Donations	Identifying categorizes of items that can be donated and where to donate them	Technology, stationary, books
Nutrition	Providing nutritional information about food items	Orange, cucumber, mint tea, mayonnaise, poptarts, juice
Foodwaste	Recommending recipes that use common leftover ingredients	Avocado, onion, orange, apple, tomato, sliced avocado, sliced tomato, sliced onion
Makeup	Revealing how inclusive different makeup brands are based on the range of skintones they support	Glossier, Covergirl, Clinique, Fenty, Neutrogena

Table 2. Summary of Team Final Projects

Figure 7 presents an example final project from Team Foodwaste, who created an app centered around reducing food waste by recommending how to use leftover produce. To create this app, they constructed a classifier in Co-ML to identify produce like tomatoes and onions and programmed their starter app to suggest recipes for identified ingredients.

Fig. 7.

Each project had between 3–7 labels, and their model’s data consisted of participant-collected images of objects they photographed during camp. Objects were brought from home (such as houseplants and pieces of clothing) or available at the camp (like laptops, books, plants, and snacks). Additionally, half of the teams (Teams Plants, Fashion, and Makeup) incorporated photos of image search results from the Internet to supplement their datasets.

Teams had an average of 1,000 images in their training data, with a range of 399–1,609 images (SD = 475). In post-camp debriefs, participants noted that this dataset size was possible because of the distributed data collection experience in Co-ML, where multiple people could contribute data simultaneously. Student Foodwaste-P1 shared, “I think the process of getting data and taking the photos became so much more efficient—the fact that we were on different iPads. If it was just one iPad that we were just passing around, it would probably take us so long just to get all of those photos together.” Projects had on average 220 testing images (SD = 53). While these datasets are smaller than production-level ML models (which might have hundreds of thousands to millions of images), the underlying architecture of Co-ML leveraged transfer learning to enable users to train models on smaller datasets that were still on average 80% accurate to their test datasets as shown in Table 3. Accuracy was calculated by weighting each label’s accuracy by the number of images in the testing set, then summing those weighted accuracies to get an overall picture of how well the model was performed on the testing data.

Table 3.

Team	Training Images	Testing Images	Accuracy
Plants	657	174	0.93
Fashion	399	231	0.51
Donations	985	203	0.89
Nutrition	1,521	302	0.82
Foodwaste	1,187	157	0.89
Makeup	1,609	253	0.86

Table 3. Descriptive Statistics of Final Project Models and their Datasets

In the following sections, we describe the ways in which each dataset design practice was enacted, with participants collaboratively designing their datasets and interpreting model evaluation results to refine their data and corresponding models.

6.2 DDP1: Incorporating Dataset Diversity

Participants considered dataset diversity as it related to their envisioned user experiences for their apps. Our analysis of Co-ML datasets identified four types of dataset diversity strategies: perspective, contexts, and states related to diversity on a single object, while types involved variation on a class of objects. Table 4 defines each of these strategies and provides examples from team projects.

Table 4.

Strategy	Definition	Examples
Perspectives	Varying viewpoints on a single object	Rotated object
		Crop and size of object
Contexts	Varying the background around an object	Color or pattern of background
		Presence of occluding objects (e.g., hand holding item)
States	Varying the form factor or condition of an object	Items opened or closed
		Ingredients slice or whole Condition of an object (e.g., health or sick plants)
Types	Varying the types of objects within a class	Distinct form factors (e.g., over-the-ear-headphones versus in-ear earbuds)

Table 4. Data Diversity Strategies Represented in Team Projects

Perspectives on an object included capturing multiple angles or camera zoom levels on a single object and was represented in all projects. Contexts on an object (also represented in all projects) involved photographing objects on different colored backgrounds or having occluding objects like hands represented in part of the data. States of an object encompassed multiple form factors or conditions, such as whether a fruit was cut or whole or if a plant was health or dying; states were represented in 4 of the 6 projects. Types incorporated different kinds objects to represent a single label and was a strategy used by just two teams due to the specific use case for their apps. For Team Makeup, variation within a class involved capturing both 2D product images and physical 3D products to account for shoppers who might use their model in a physical retail shopping experience or online. For Team Donations, each label covered multiple types of objects; for example, under the label “Technology,” they took photos of phones, laptops, and earbuds. For all teams, multiple strategies were used in combination to diversify their datasets.

While both training and testing data were largely captured in the same classroom, participants took measures to mimic different contexts by using patterned or colored fabrics (provided by the instructional team) as backdrops for their objects, or by testing with images from the Internet (which half the teams did). For example, Team Fashion wanted their model to identify brand labels on pieces of clothing; since they had a limited amount of clothing they could bring to the camp, they decided to expand their dataset by taking photos of web search results of clothing of different colors. Our debriefs revealed how participants had even more ideas for how they would like to add diversity to their project’s data if they had the opportunity to work in a different environments, with participant Foodwaste-P3 describing how “The space that we had [classroom] was very well lit, and I think that was something else we were trying to do get—darker lighting and more diversity in our dataset.”

Some data diversity strategies were discussed a priori within teams at the onset of their data collection process. Since Team Plants recognized that plants could be in many different rooms in a house, they focused on photographing plants in different settings such as on a ledge, in front of a window, or on a table (an example of a context data diversity strategy).

Other strategies arose through the use of Co-ML as participants reviewed their collective datasets and noticed gaps in how objects were represented. This process was aligned with our design goals for Co-ML to help surface differences in how individuals capture data and test models, as well as enable learners to discuss and interpret those differences. We illustrate this process using a vignette from Team Foodwaste in which the team expanded their data collection strategies based on monitoring differences between how individual photographed their objects.

6.2.1 Vignette: Noticing Differences to Diversify Data.

Team Foodwaste’s project centered around identifying produce items, and their image data incorporated multiple data diversity strategies: perspectives, contexts, and states. At the start, the team discussed capturing the produce items on different backgrounds to simulate kitchen countertops of varying colors so the model would be able to work well regardless of the surface an ingredient was on (an example of contexts); one participant captured images of the ingredients on a black coffee table, another on a wooden surface, and the third on a white table.

In contrast, other data collection strategies emerged through the modeling process. In her initial set of training images, participant Foodwaste-P3 captured photos of a whole avocado, with some training images showing her hand holding the avocado, and some capturing the avocado by itself. At the same time, her two teammates were capturing photos of onions and tomatoes, respectively, without any hands visible as shown in Figure 8.

Fig. 8.

After reviewing their collected images using the training data dashboard in Co-ML, Foodwaste-P3 asked her teammates if they were putting hands in their photos, at which point the other two team members realized that they did not. As a result, they began to add images of both the isolated ingredient and the ingredients held by their hands. This was important as they wanted their model to work regardless of how the user presented the object to the camera. In our post-camp debriefs, Foodwaste-P3 reflected on this experience, sharing that while each team member was adding training data, she would periodically review the training dataset to monitor differences in their strategies:

I was going through photos that my other teammates had taken. And then I was trying to make sure my data was similar to theirs, or had the same diversity those did. And so that was when we started to realize this [not having hands in the training images] was an issue, but it was not something we had discussed. So, I think it was helpful that we could see the kind of training data that other people were taking. And then talk about it and improve our model that way.

This same participant then went on to describe how perspective-taking worked in both directions—she brought attention to differences in data collection strategies to help her teammates, and she also learned from her teammates’ perspectives:

Personally, when I was taking photos, I was very limited in the kind of photos I was taking. I was taking very close up and certainly specific angles, but my other teammates were taking them farther away or with other distracting objects in the background. Which is, good, I think to diversify our dataset.

Thus, we observed that this team was able to discover ways of diversifying their dataset by examining differences in individual data collection strategies with Co-ML, aligned with our goal of enabling multiple perspectives to emerge from a distributed data collection experience.

6.2.2 Diversity in Response to Model Failure.

Incorporating dataset diversity was also driven by identifying failure cases of the model. Team Nutrition found that because only one team member initially collected training data in which a hand was visible for the orange label, any item held by hands was incorrectly classified as an orange, leading the team to add more images of hands holding objects to each of their labels. Team Makeup reflected on their strategies of adding diverse perspectives and contexts, as described by student Makeup-P1: “We did that [resolved misclassifications] by adding more training data of misclassified products while being more inclusive with the images that were inputed. So we took pictures from different angles or different lighting and we were especially conscious of the backgrounds of the new images. By incorporating new backgrounds, we were diversifying our dataset.” Dataset diversity thus was seen as a way both to account for different intended use cases and contexts, but also as a way to improve classification accuracy by accounting for dataset gaps, which we describe more fully in the next section.

6.3 DDP2: Evaluating Model Performance and its Relationship to Data

One of the affordances of the Co-ML software is that models train in a few seconds on tablets, allowing teams to quickly experiment with their models and the data they were trained on. All teams took advantage of this affordance by retraining their models between 17 and 75 times, with groups retraining their models an average of 40 times (SD = 20), and individuals retraining on average 13 times. (Note that with Co-ML, each person needs to retrain on their iPad to have the most up-to-date model). The high frequency of retraining indicates that participants were able to iteratively test and continually refine their datasets.

Participants leveraged several different features of Co-ML when evaluating model performance between model retrainings. Our logs revealed that all teams utilized the testing dashboard for reviewing test images, as well as the live classification feature for viewing model results in real time, the latter of which was invoked an average of 11 times for each team (SD = 5).

Four of the six teams (all except Teams Nutrition and Donations) used the game to test their models. Teams Plants and Fashion played the game one time each, while Team Makeup played 4 times and Team Foodwaste played 11 times. Participants described how the game enabled them to monitor their progress. For example, participant Fashion-P1 shared, “I really enjoyed the testing game because it allowed us to see our progress and see how we’ve improved and what we added that improved the model’s accuracy. I feel like we could test out some piece of data and be able to see how the [model] fared.” We observed that Team Foodwaste developed a system for continually refining their model using the game; periodically, after their team had collected more data, each team member would play the game and review the Game Over screen to assess which labels were performing better and in what instances the model failed. Based on these results, they would decide which labels to add or delete data from. Participant Foodwaste-P1 described how their goal was to keep improving their model until all members of her team were able to get “100% prediction scores” in the game.

Our analysis of team project documentation revealed which features of the Co-ML app participants documented most frequently to support descriptions about how they iterated on their models. Figure 9 displays examples of screenshots of Co-ML from team project documentation, which fell into one of four categories: (1) a sample classification result (showing the result of a single test image); (2) the testing dashboard (displaying a set of testing images and their classification results); (3) the live classification interface (where model results are shown in real time); and (4) the average confidence scores achieved while playing the game. The majority of teams (five out of six) incorporated screenshots of classification results for a single sample, while half of the teams included images of the testing dashboard. The live classification interface and game average confidence scores were incorporated into the design journal of a single team (Team Foodwaste).

Fig. 9.

The combination of frequent model retrainings alongside repeated use of the testing dashboard suggests the value of integrating test datasets into a novice ML experience. And Co-ML further enabled this exploration through a collaborative experience where learners could debate interpretations of model failures and decide how to iterate on their data collectively. To illustrate the role of collaboration in evaluating models, we provide a vignette of project Team Plants, who wrestled with a misclassification for their houseplant classifier.

6.3.1 Vignette: Collaborative Model Testing.

Team Plants developed a final project centered around identifying houseplants and providing information on how to take care of them. When testing their model, each team member individually discovered that their Pothos plant was being misclassified as a Succulent, ultimately realizing that their training dataset lacked images of the Pothos plant shot from above as displayed in their project journal in Figure 10.

Fig. 10.

In the dialogue below, two members of Team Plants discussed the misclassification together:

—

Plants-P3: When I’m doing it [testing the model] at this angle, [the model’s] saying the Pothos is a Succulent.

—

Plants-P2: Ok, got it. I was doing the same thing [testing the model on Pothos]. Like, from the top angle, it [the model] kept getting [misclassifying the Pothos as] the succulent. I’m going to try to take some pictures of it.

—

[Both participants begin adding more training images of the Pothos plant].

Through discussing the misclassified Pothos images, Plants-P2 and Plants-P3 decided to work together to add more images of the plant and discussed ideas for what might be causing the model failure:

—

Plants-P2: I’m trying to take more photos of the [Pothos] stem because I guess that’s mainly near the top. Or do you think it’s...because the succulent is so close to the soil, and that one [the misclassified pothos] is close to the soil maybe? If that makes sense?

—

Plants-P3: Yeah. Like, the soil [of both the Pothos and Succulent plants] looks similar.

—

Plants-P2: And maybe it’s [the model’s] using that to identify it, rather than the actual plant? Cause this one’s [the Pothos], the soil’s kind of covered with the stem [of the plant], so I do not know if it’s looking at that.

—

Plants-P3: Oh that’s true, yeah.

Through their discussion, both participants decided that the soil may be a confounding factor-–that the model is attending to the soil to incorrectly classify any images with soil as Succulent. Once they took additional photos of the Pothos plant with soil visible, they retrained their model and tested it on new data.

—

[Plants-P3 tests the retrained model on an image of the Pothos. The model classifies the Pothos plant correctly. Next, she takes a photograph of a succulent, but the succulent is now misclassified as Pothos].

—

Plants-P3: I think I fixed it, but now it’s saying the Succulent is the Pothos.

—

Plants-P1: Fix one thing, break the other [laughs].

While their refined model appeared to work better at identifying their previously misclassified Pothos plant, its performance for identifying Succulent was negatively impacted, leading the participants to recognize that the model needed to be more comprehensively tested. This is a common issue in ML practice, in which edits to a single label might affect model performance for other labels.

In this example, collaboration contributed to group awareness of the model failure, discussion and ideas for the cause of the misclassification, and coordinated data collection efforts, positively contributing to participants’ discovery and reasoning about gaps in their dataset.

6.4 DDP3: Balancing Datasets

We designed Co-ML to encourage the creation of balanced datasets by displaying sample counts throughout the interface, such as when users add images via the camera or review their datasets in the training dashboard. Our intention was for learners to use this information to identify if there were fewer samples for particular labels, and to ultimately create datasets with a similar number of samples for each label. Instead, we observed that while some teams initially set targets for equally distributed data collection across labels, teams ultimately took a more reactive approach, attending to specific labels based on model evaluation results. In fact, some groups intentionally created skewed datasets to improve performance for underperforming labels.

Figure 11 reports the percent distribution of training and testing samples for each label within a project. First, these plots reveal differences in sampling among training data; for example, Team Plants had fewer Cactus (7.5%) and Orchid (8.7%) images relative to other labels in their project. Additionally, we observed differences between training and testing efforts; while training samples were more equally distributed for Team Makeup, the distribution was less equal for their testing data, with Glossier and Neutrogena accounting for a smaller amount of their overall testing data (9.2% and 5%, respectively). In this section, we describe why these differences emerged as a result of decisions learners made about which labels to add data to.

Fig. 11.

An example of shifting away from equally distributed training data was observed with Team Makeup. While the team originally decided to add 100 images for each label, they soon realized this was a sub-optimal strategy because the model had more difficulty identifying some labels versus others. In response, their team only add data to labels their model did not work as well for:

We first decided to have a set number of photos to add for each label—100 for each label just as a base point. But then afterward, we realized that that was not the best way to go about it because [for] some objects, the model had more difficulties identifying than others. And, so we ended up just taking more photos for the ones that were difficult to classify.

The strategy to add mostly to labels with lower classification accuracy was also taken on by Teams Plants and Fashion, who both described this strategy as attending to “equity over equality.” Participant Plants-P-1 shared that, “My team really value[s] the mindset of equity over equality when it came to training labels. So once we realize that one label was working most of the time, or even working 100% of the time, we wanted to focus on the labels that maybe were not working as much and that really helped us.” They determined they needed more data for leafy green plants compared with their only flower label, a white and purple orchid plant; their orchid label represented 8.7% of their total training data, compared with their Pothos label, which had 29.7%. Because the orchid plant was visually distinct from the other labels, and the model was consistently able to classify it correctly, the team members turned their attention to labels that were not performing as well, as described by Plants-P2: “Our orchid label had a lot less data than anything else because it was so different than all the other plants, and so the model was able to recognize it with a lot less data than the other plants that were all kind of green and similar shapes.” Likewise, Team Fashion described how more data was needed for brand logos that looked visually similar to one another compared with those that were unique and easier for the model to identify.

Thus, by attending to model performance, participants wrestled with the machine learnability of individual classes, ultimately prioritizing adding more data to underperforming labels over having more evenly distributed class balance.

6.5 DDP4: Inspecting for Data Quality

Participants discovered and interrogated data quality issues by reasoning about misclassifications that surfaced in the testing interface of Co-ML. Misclassifications were largely project-specific, stemming from the unique qualities of the objects used in their data. To illustrate how collaboration played a role in how teams resolved data quality issues, we provide an in-depth description of a debugging scenario faced by Team Fashion, followed by a comparison of how this group’s approach was reflected across other teams.

6.5.1 Vignette: Resolving Data Quality Issues.

Team Fashion designed an application to help consumers learn about the sustainability of different clothing brands, creating a model that could identify five common brands and display a sustainability rating from 1 to 5 (using data from Good On You [82]), along with suggestions for second-hand shopping or more sustainable alternatives. Their data incorporated photos of physical garments they owned as well as images from the Internet of different colored clothing.

Initially, members of Team Fashion captured images where each article of clothing was fully visible, but they quickly transitioned to taking more close-up shots of the brand logos themselves. This change in strategy came about because of the model’s low accuracy when testing, leading the team to look for issues in their dataset using the training and testing dashboards in Co-ML. Inspecting their test data led them to discover a pattern of reflective artifacts in images of Internet search results they photographed from a laptop screen (Figure 12). This problem was first brought up by participant Fashion-P2 based on a testing result—in response, the team examined the training data for similar patterns, hypothesizing that, “Maybe it [the model] was associating whatever has reflections with that brand that had that [reflections] in the training data.” After identifying and discussing the issue together, they then removed approximately 20 images from their training data, across three of their labels, as shown in Figure 12.

Fig. 12.

In our post-camp debriefs, Fashion-P1 described their team’s debugging process as follows:

I think we each kind of gave different insights. We all kind of contributed to it. We were just looking at it [the training data] on our individual iPad’s. But then as we kept studying the failure cases, I think each of us came up with, like, “Oh, this could be a reflection,” or, “Oh, this could be like this color”... We came up with a couple more theories...[and] the two main ones that we all kind of came to a conclusion on [were] after seeing our individual testing data and confirming that...this seems to be like a pattern in all of them.

Notably, it was through a combination of reviewing their training and testing data in Co-ML along with group discussion about patterns in misclassified images that the team decided to clean their dataset. Describing this process during their team’s final presentation, Fashion-C shared: “We spent a lot of time analyzing our failure cases and looking for problematic spurious relationships. This allowed us to identify which pieces of data had to be removed and which ones had to be increased.”

Analyzing failed test samples as a group also provided space for each team member’s perspective to be valued, as shared by Fashion-P1 “I feel like all of us just approached the situation in different ways, and we thought of creative ways that we could test our models. So it’s definitely so much better than working alone because you can bring in so many new perspectives and kind of be more efficient overall.” Through discussing the model’s data, the team was able to share multiple theories about qualities that may contribute to misclassification, and ultimately act on those theories by removing lower quality images from their dataset.

6.5.2 Data Quality Issues Across Projects.

While Team Fashion’s data cleaning process resulted from collaborative discussion, Co-ML also supported asynchronous data cleaning by supporting ambient awareness of the dataset. Describing this process, Nutrition-P1 shared, “I think we would look at each other’s to see if they were like 100% correct. And, if the data was accurate—to make sure there were not any really blurry photos…or maybe the wrong item...I feel like [we] just looked over each other’s work just to make sure it was correct.” Team Makeup echoed this idea, with participant Makeup-P2 describing a scenario where she had accidentally collected images under the wrong label, and her teammates helped her delete the mislabeled data, stating that this saved her group time and helped them work efficiently. These examples show how debugging related to data quality was facilitated by having multiple eyes on the data.

These scenarios point to how a multi-user modeling experience with Co-ML enabled learners to monitor data quality by reviewing each other’s images, discussing and debating theories about what might lead to model failures, and collectively sharing in the responsibility to delete lower quality images. Throughout this process, the testing and training dashboards in Co-ML anchored their discussions as they continually revised their datasets.

6.6 Increased Confidence in Dataset Design

Throughout the camp, we asked participants daily to rate their ability to contribute to conversations about “how data can influence model performance”, “the design and improvement of ML systems”, and “examples of ML to problem solve”. These ratings were on a scale from 1 (strongly disagree) to 5 (strongly agree), and we compared the mean of each participant’s response to these questions at the beginning and end of the camp. This showed an increase across the camp of about half a point (0.53), from 3.9 (SD = 0.67) on the first day to 4.8 (SD = 0.43) on the final day. This increase was statistically significant, t(18) = 5.48, p < 0.001. The effect size of this difference, using Cohen’s d, was 1.26, which is considered a “large” effect [15]. This shows that participants’ confidence in dataset design had increased, because understanding how data can influence model performance and the design and improvement of ML systems are both fundamental to dataset design.

Participants reflected that their ability to modify datasets directly, and see how those changes affected model performance, was made possible through their hands-on experience using Co-ML. Participant Fashion-P1 expressed this idea by sharing, “Working with an image classifier was really cool in that we could modify our objects any way that we wanted in order to train the model and make it more accurate.” Similarly, participant Foodwaste-P3 described how Co-ML supported her experimentation with data: “When we had an issue, we were able to look at our training data and [see if] “It did not work against this background because we did not have it in our training data.” The issues we did have, there were solutions to it—just add more training data or delete some data.” In turn, participants were able to take on an active role in building models, which student Foodwaste-P2 contrasted with only being on the receiving end of ML systems: “It was really cool to be able to be on the developer’s side to train and test models that we had only been users for before.”

7 Discussion

Our analysis of group ML model building using Co-ML, situated in the designed context of the AIML Summer Camp, revealed how collaboration positively shaped the enactment of core DDPs. These results demonstrate that the ways participants explored the design of ML datasets during the camp were compatible with our initial design goals for the Co-ML app.

7.1 The Role of Collaboration in Designing and Debugging Data

Collaboration influenced how participants engaged in each of the DDPs, playing a key role throughout their model-building process by helping participants (1) collectively inspect data to identify differences between misclassified and correctly classified test data; (2) discuss ideas about the cause of misclassifications; and (3) coordinate how to enact those ideas by modifying data and retraining their models together.

We observed that, through continuous discourse, individuals encountered ways to capture data other than what they had individually considered, which helped them create more diverse training data-sets (DDP1). We saw individual differences in data collection arise for Teams Foodwaste and Nutrition, who both discovered that team members differed in whether they photographed their objects with hands visible. Discovery of these differences emerged through both active verbal discussion and ambient progress monitoring of their shared training and testing dashboards. By noticing and discussing the “hands versus no hands” conditions, participants were able to propose, test, and revise ideas about how and to what extent human hands impacted their model’s performance (DDP2).

Conversations between participants also deepened their understanding of data quality and its relationship to model performance, similar to prior findings that peer-to-peer dialogue can support students in generating more robust, diverse ideas when they use ML tools [38] and that acknowledging and discussing proposed ideas can lead to positive uptake of new solutions [7]. As exemplified by members of Team Fashion (as described in 6.5), Co-ML surfaced discussion about the root causes of misclassified test images. By working together, participants brainstormed reasons why certain images might be misclassified and the specific data quality issues that may have contributed to erroneous predictions. They concluded that unintentional image artifacts led to poor data quality (DDP4), deleted these images from their dataset, and retrained their model to improve its performance (DDP2). Co-ML’s data editing features supported their revisions by enabling data deletion, an atypical practice for novice ML-users [81].

Our results show how enabling collaboration in a novice ML modeling experience positively contribute to learning about ML and data diversity, quality, and model performance.

7.2 The Value of Testing Interfaces for Encouraging Model Iteration

All teams iteratively refined their models, retraining them on average 40 times. Model iteration was supported by a key feature of the Co-ML experience that is missing in other novice-oriented ML tools: a testing interface for collecting and reviewing test datasets. Existing tools (as described in 2.2) largely rely on ephemeral live classification, where users monitor model results in real-time, providing little opportunity for users to revisit individual misclassified samples, compare misclassifications with correctly classified data, and determine whether evaluation results improve in response to model retraining. In contrast, we observed that the testing interface in Co-ML was commonly used by students to monitor their model’s progress, encouraging them to revisit and revise their datasets, an iterative practice common among ML professionals but overlooked by ML novices [81].

Because participants iteratively refined their datasets and retrained their models, they needed to manage multiple dataset considerations in parallel, just as ML practitioners do when dealing with real-world, messy data [61]. For example, when groups attended to underperforming labels, they described how models may not necessarily need the same amount of data per label (one strategy for class balance). Specifically, they reasoned that more data might be needed for labels that are harder for a model to learn, referring to the learnability of a label. In this way, participants were authentically engaging with DDPs because they could quickly update their data and test what impact changes to their data had on model performance (DDP2). As described by participant Fashion-P1,

It can be really hard to build a ML model because of many factors that play into it... But using Co-ML allowed us to really accurately visualize what that would actually look like, because we were able kind of see how bad data could affect the model and how to remove that.

With Co-ML, we show how testing interfaces supported iterative data refinement and model retraining. Because model iteration is such a foundational part of ML model design [33], we suggest that other ML tools and educational efforts would benefit from incorporating user experiences around refining and reviewing test data to encourage more model iteration.

7.3 Learner-Driven Projects for Authentic ML Modeling

Because participants chose their own projects and used their own data, each group encountered unique considerations around model use that shaped their thinking about what representative data means. Iteratively refining their datasets and assessing how that affected model performance (DDP 2), alongside the use of personally-designed data during the ML camp, enabled participants to explore aspects of dataset diversity (DDP1) that were not possible in prior work because learners either used the same underlying materials [23] or their project ideas were selected for them based on feasibility [75].

While previous work [23] described learners capturing data from multiple backgrounds and angles (the perspectives and contexts described in our dataset diversity strategies in Section 6.3), we saw two additional strategies around states and types, where participants considered multiple form factors or conditions for a given object, as well as the ways a class of objects might be represented. We believe that the use of personal data may provide opportunities for encountering domain-specific edge cases for model use, such as the use of healthy and sick plants for the Team Plants. This can enrich learner understanding and appreciation for how datasets should accurately reflect a variety of user scenarios.

Furthermore, because each team applied ML to distinct problems, participants were able to expand their notion of how ML can applied to range of issues in the world. One participant described this by stating, “I love how we can utilize Co-ML and AI and ML in general across a variety and spectrum of topics.” The cultural relevancy of the projects and data that teams designed is especially important given prior work suggesting that engaging in socially-relevant projects can increase participation of female-identifying students in CS [39]. AIML Summer Camp participants not only built personally-relevant models, but embedded these models into custom apps, allowing them to realize how models can be integrated into user experiences. This end-to-end model and application design experience helped participants not only develop an understanding of data in ML, but also see their own role in shaping ML systems in the future. Reflecting on her experience, participant Foodwaste-P2 shared,

In the lessons, before we started using Co-ML, we heard about everyday uses of AI and different models like, Instagram [and] Snapchat filters. I thought it was really cool to be able to use our own objects and then actually train and test and go through all of that stuff that developers have to go through to create the things that we use.

Through using Co-ML in the ML camps, learners gained confidence and saw themselves as developers of future ML technologies with social impact. This was demonstrated by responses to the daily surveys, which showed a statistically significant increase in confidence in contributing to conversations about the design and improvement of ML systems with a large effect size from the first to the last day of the camp.

8 Limitations and Future Work

The use of Co-ML we describe in this article was necessarily shaped by the AIML Summer Camp context—an out-of-school enrichment experience designed primarily to support young women and gender nonconforming youth. The AIML Summer Camp centered activities and discussions at the intersection of technology and social equity, and this lens likely played a role in participants picking socially-minded final projects. Further, as a supportive environment for youth underrepresented in technology, AIML Summer Camp may have especially fostered a safe place for participants to explore topics of personal relevance such as sustainable fashion and racial inclusion in the beauty industry. While these topics highlight the importance of creating personally relevant applications, in this article, we did not investigate student identity development in relation to gender and computing. Given the lack of representation in the technology industry today, we believe further research on AIML Summer Camp is needed to better understand how the intentional design of a learning environment for female and gender nonconforming youth supported students’ engagement with Co-ML. We also acknowledge that future work is required to determine the transferability of our results to other contexts such as in-school learning experiences and learners of mixed gender and cultural identities.

As a high percentage (78%) of the camp participants were alumni of other KWK summer camps, participants were likely to have more experience with computing than the average KWK student, as well as more experience working in groups to develop projects (as all KWK camps involve students working in groups to develop final projects). We note that participants in our study did have a higher rate of free or reduced lunch (65% for study participants compared with 50% for KWK participants in all camps), and 93% of the study participants identified as people of color (compared with 82% for KWK participants in all camps).

The majority of images that participants captured included items they could reasonably bring to the camp themselves, as they were unable to take their iPads home and thus use Co-ML outside of the camp. While we did see participants diversify their datasets to align with their imagined user scenarios (such as capturing images from the web to expand their dataset), we imagine that leveraging the mobile quality of iPads to collect data in multiple different contexts, in and outside a classroom, may further learning and experimentation with data diversity.

While decisions made by students related to data diversity (DDP1), model evaluation (DDP2), and data quality (DDP4) have ethical implications [49], we acknowledge that our analyses did not focus on these. The ethical and critical issues that students encounter when using Co-ML warrant a comprehensive exploration and analysis. Future research should explore how students’ ideas of justice and ethical stances influence their decisions when engaging with data design practices, and how instructional activities may support critical engagement with Co-ML [2, 5].

Our research team was only able to closely shadow three of the six final project teams. As a result, for the other three teams, we were reliant on participants’ reflections on their experience, either through design journals, they maintained during camp or post-camp interviews. To the best of our abilities, we tried to corroborate the issues they self-described using our log visualizations, but for teams that were not directly observed, there may have been debugging moments that were not recollected in participant reflections and thus not represented in our analyses.

The AIML Summer Camp camp was our first test of our in-app logging system, and we ran into technical difficulties that precluded our logging of when users deleted individual images in their datasets. Missing logs of image deletions gave us less insight into how individuals may have taken on data cleaning responsibilities throughout final projects. While our team captured supplemental snapshots of students’ Co-ML projects and corresponding training and testing data twice a day when teams were building their final projects, this information only provides context about deleted images at the group-level and at specific moments in time; we are unable to attribute deletion of specific images to particular individuals moment-to-moment. This makes it difficult for us to definitively identify what information in the app (such as a misclassified test sample) might have motivated users to delete images in their dataset. Furthermore, we had limited capability to analyze qualitative factors of deleted images because we do not retain those images to protect user privacy. We mention these challenges because handling of deleted participant-collected data is especially pertinent to ML education, where considerations about data ownership, privacy, and ethics are important. Screen recordings or videos of participant interactions during the model-building process may be especially valuable to understanding how learners’ collaboratively cleaned data, neither of which our team was able to collect in this pilot study.

Finally, we see an opportunity in future work with Co-ML to analyze and categorize different types of collaboration and its impact on learning of ML. Our initial analysis of log data revealed two basic collaborative strategies for data collection. Some teams took a “divide and conquer” approach, where data collection responsibilities were split between team members, with members focused solely on a subset of labels. Other teams had a “cross-label” approach where group members had contributed to most labels by the end of the project. In the design of Co-ML, we envisioned that reviewing and contributing across most or all labels of a project would provide more opportunities for participants to discuss data that others had added, and to realize different perspectives for diversifying their dataset. However, further work is needed to fully relate our log data with participant and group motivations for working more narrowly or expansively across labels; we believe this future work may be able to identify specific collaborative behaviors and strategies that support learning about the relationship between data and model performance.

Acknowledgments

We would like to thank all AIML Summer Camp participants who enrolled in our research study for sharing their time and feedback with us. Many members of the Kode with Klossy team played an instrumental role in organizing and implementing the AIML Summer Camp, including Tara Tran, Laura Angelich, Hallie Smith, Dorothy Chang, Hannah Kim, and Allie Feldman. At Apple, we would like to thank Stuart Ralston, Richard Lombardo, Casandra Sisneros, Mike Mead, Adriana Hilliard, Paris Garrett, and Emmanuel Adepoju for all of their support in facilitating the Kode with Klossy partnership, educator training, engineering of Co-ML, and student support through the camps.

Supplementary Material

toce-final (toce-final.mp4)

Supplementary video

Download
59.34 MB

References

[1]

Adam Agassi, Hadas Erel, Iddo Yehoshua Wald, and Oren Zuckerman. 2019. Scratch nodes ML: A playful system for children to create gesture recognition classifiers. In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems. 1–6.

Abstract

1 Introduction

2 Related Work

2.1 The Role of Data in ML

2.2 ML Education for Youth

2.3 Collaborative Learning

2.4 Research Opportunities for Fostering Dataset Design Practices for Novices

3 Dataset Design Practices

4 Co-ML System and Companion Starter App

4.1 Co-ML Design Goals

4.2 Co-ML System Description

4.2.1 Define Ontology of Labels in a Classifier.

4.2.2 Collect Data.

4.2.3 Train a Model.

4.2.4 Evaluate Model Performance.

4.2.5 Iterating on a Model.

4.3 Starter App

5 Methods

5.1 Kode with Klossy and Creation of AIML Summer Camp

5.2 AIML Summer Camp Structure and Activities

5.2.1 Final Projects.

5.3 Camp and Research Study Recruitment

5.3.1 Study Participants.

5.4 Data Collection

5.5 Analysis

5.5.1 Data Cleaning and Preparation.

5.5.2 Description of Modeling Actions.

5.5.3 Identifying Data Diversity Strategies from Co-ML Datasets.

5.5.4 Identification of Collaborative Modeling Moments.

6 Results

6.1 Final Projects Overview

6.2 DDP1: Incorporating Dataset Diversity

6.2.1 Vignette: Noticing Differences to Diversify Data.

6.2.2 Diversity in Response to Model Failure.

6.3 DDP2: Evaluating Model Performance and its Relationship to Data

6.3.1 Vignette: Collaborative Model Testing.

6.4 DDP3: Balancing Datasets

6.5 DDP4: Inspecting for Data Quality

6.5.1 Vignette: Resolving Data Quality Issues.

6.5.2 Data Quality Issues Across Projects.

6.6 Increased Confidence in Dataset Design

7 Discussion

7.1 The Role of Collaboration in Designing and Debugging Data

7.2 The Value of Testing Interfaces for Encouraging Model Iteration

7.3 Learner-Driven Projects for Authentic ML Modeling

8 Limitations and Future Work

Acknowledgments

Supplementary Material

References

Cited By

Index Terms

Recommendations

Collaborative Machine Learning Model Building with Families Using Co-ML

Experiential Learning in Data Science: Developing an Interdisciplinary, Client-Sponsored Capstone Program

A Meta-Design Model for Creative Distributed Collaborative Design

Comments

Information

Published In

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

View options

PDF

eReader

Get Access

Login options

Full Access

Figures

Other

Share

Share this Publication link