D. Piorkowski, I. Vejsbjerg, O. Cornec, E. M. Daly and O. Alkan. "AIMEE: An Exploratory Study of How Rules Support AI Developers to Explain and Edit Models," ACM Conf. on Computer-Supported Cooperative Work and Social Computing (CSCW), 24 pages, 2023 (to appear) [view abstract]
Abstract: In real-world applications when deploying Machine Learning (ML) models, initial model development includes close analysis of the model results and behavior by a data scientist. Once trained, however, models may need to be retrained with new data or updated to adhere to new rules or regulations. This presents two challenges. First, how to communicate how a model is making its decisions before and after retraining, and second how to support model editing to take into account new requirements. To address these needs, we built AIMEE (AI Model Explorer and Editor), a tool created to address these challenges by providing interactive methods to explain, visualize, and modify model decision boundaries using rules. Rules should benefit model builders by providing a layer of abstraction for understanding and manipulating the model and reduces the need to modify individual rows of data directly. To evaluate if this was the case, we conducted a pair of user studies totaling 23 participants to evaluate AIMEE's rules-based approach for model explainability and editing. We found that participants correctly interpreted rules and report on their perspectives of how rules are beneficial (and not), ways that rules could support collaboration, and provide a usability evaluation of the tool.
B. Dominique, K. El Maghraoui, D. Piorkowski and L. Herger. "FactSheets for Hardware-Aware AI Models: A Case Study of Analog In Memory Computing AI Models," IEEE International Conference on Software Services Engineering (SSE), 10 pages, 2023 (to appear) [view abstract]
Abstract: In the last few years, documenting and tracking the lineage of AI models has emerged as a important research area that can help to improve the transparency, traceability and overall effectiveness of a model when it is used or deployed by an entity that did not create it. This is a crucial step towards responsible AI in the services computing paradigm especially as AI-enabled software service engineering is becoming more prevalent and mainstream. Multiple documentation methods have been proposed and their adoption has slowly begun, but these methods tend to focus on the data science aspects of the model creation, such as the datasets used to design and train the model, the neural network structure of the model, the F1 score, the modal bias, etc. When adapted to the emerging AI hardware accelerators field of analog in-memory computing (IMC), additional documentation requirements need to be considered. Analog IMC accelerators offer increased area and power efficiency, which are paramount in IOT and edge resource-constrained environments. We use the AI FactSheets (FS) 360 documentation methodology to understand and evaluate the documentation needs in this emerging domain. To do so, we interviewed 12 participants who represent various roles throughout the lifecycle of designing, training, evaluating, deploying and consuming an analog-aware AI model. From these interviews we capture these roles' documentation and collaborative needs, develop FactSheets to meet those needs, and evaluate the quality of completed FactSheets. We show that the FactSheets methodology can be applied to Analog AI models to successfully create meaningful documentation that is suitable across multiple roles and a key step towards responsible AI models.
J. He, D. Piorkowski, M. Muller, K. Brimijoin, S. Houde and J. Weisz. "Rebalancing Worker Initiative and AI Initiative in Future Work: Four Task Dimensions," ACM Sym. on Computer-Human Interaction for Work (CHIWORK), 18 pages, 2023 (to appear) [view abstract]
Abstract: Organizations have recently begun to deploy conversational task assistants that collaborate with knowledge workers to partially automate their work tasks. These assistants evolved out of business robotic process automation (RPA) tools and are becoming more intelligent: users can initiate task sequences through natural language, and the system can orchestrate those tasks if they have not previously been defined. As these tools become more automated, system designers tend to optimize overall process efficiency, but at the cost of shifting agency away from workers. Particularly in high stakes work environments, this shift raises questions of how to re-delegate agency such that workers feel sufficiently in control of automated tasks. We explored this through two studies comprised of interviews and co-design activities with knowledge workers and identified four task dimensions along which their automation and interaction preferences vary: process consequence, social consequence, task familiarity, and task complexity. These dimensions are useful for understanding when, why, and how to delegate agency between workers and conversational task assistants.
A. Danielescu and D. Piorkowski. "Iterative Design of Gestures During Elicitation: Understanding the Role of Increased Production," ACM Proc. Int'l Conf. Human Factors in Computing Systems (CHI), 14 pages, 2022 [view abstract]
Abstract: Previous gesture elicitation studies have found that user proposals are influenced by legacy bias which may inhibit users from proposing gestures that are most appropriate for an interaction. Increasing production during elicitation studies has shown promise moving users beyond legacy gestures. However, variety decreases as more symbols are produced. While several studies have used increased production since its introduction, little research has focused on understanding the effect on the proposed gesture quality, on why variety decreases, and on whether increased production should be limited. In this paper, we present a gesture elicitation study aimed at understanding the impact of increased production. We show that users refine the most promising gestures and that how long it takes to find promising gestures varies by participant. We also show that gestural refinements provide insight into the gestural features that matter for users to assign semantic meaning and discuss implications for training gesture classifiers.
J. Richards, D. Piorkowski, M. Hind, S. Houde, A. Mojsilovic and K. R. Varshney. "A Human-Centered Methodology for Creating AI FactSheets," Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 12 pages, 2021 [view abstract]
Abstract: As artificial intelligence (AI) models and services are used in a growing number of high-stakes areas, a consensus is forming around the need for a clearer record of how these models and services are developed to increase trust. Several proposals for higher quality and more consistent AI documentation have emerged to address ethical and legal concerns and general social impacts of such systems. However, there is little published work on how to create this documentation. In this paper we describe a methodology for creating the form of AI documentation we call FactSheets. This paper describes the methodology and shares the insights we have gathered while creating nearly two dozen FactSheets. Within each step of the methodology, we describe the issues to consider and the questions to explore with the relevant people in an organization who will be creating and consuming AI facts. This methodology may help foster the creation of transparent AI documentation.
D. Piorkowski, S. Park, A. Y. Wang, D. Wang, M. Muller and F. Portnoy. "How AI Developers Overcome Communication Challenges in a Multidisciplinary Team: A Case Study," ACM Conf. on Computer-Supported Cooperative Work and Social Computing (CSCW), 23 pages, 2021 [view abstract]
Abstract: The development of AI applications is a multidisciplinary effort, involving multiple roles collaborating with the AI developers, an umbrella term we use to include data scientists and other AI-adjacent roles on the same team. During these collaborations, there is a knowledge mismatch between AI developers, who are skilled in data science, and external stakeholders who are typically not. This difference leads to communication gaps, and the onus falls on AI developers to explain data science concepts to their collaborators. In this paper, we report on a study including analyses of both interviews with AI developers and artifacts they produced for communication. Using the analytic lens of shared mental models, we report on the types of communication gaps that AI developers face, how AI developers communicate across disciplinary and organizational boundaries, and how they simultaneously manage issues regarding trust and expectations.
S. Park, A. Wang, B. Kawas, Q. V. Liao, D. Piorkowski and M. Danilevsky. "Facilitating Knowledge Sharing from Domain Experts to Data Scientists for Building NLP Models," ACM Conf. on Intelligent User Interfaces (IUI), 18 pages, 2021 [view abstract]
Abstract: Data scientists face a steep learning curve in understanding a new domain for which they want to build machine learning (ML) models. While input from domain experts could offer valuable help, such input is often limited, expensive, and generally not in a form readily consumable by a model development pipeline. In this paper, we propose Ziva, a framework to guide domain experts in sharing essential domain knowledge to data scientists for building NLP models. With Ziva, experts are able to distill and share their domain knowledge using domain concept extractors and five types of label justification over a representative data sample. The design of Ziva is informed by preliminary interviews with data scientists, in order to understand current practices of domain knowledge acquisition process for ML development projects. To assess our design, we run a mix-method case-study to evaluate how Ziva can facilitate interaction of domain experts and data scientists. Our results highlight that (1) domain experts are able to use Ziva to provide rich domain knowledge, while maintaining low mental load and stress levels; and (2) data scientists find Ziva's output helpful for learning essential information about the domain, offering scalability of information, and lowering the burden on domain experts to share knowledge. We conclude this work by experimenting with building NLP models using the Ziva output by our case study.
S. K. Kuttal, J. Myers, S. Gurka, D. Magar, D. Piorkowski and R. Bellamy. "Towards Designing Conversational Agents for Pair Programming: Accounting for Creativity Strategies and Conversational Styles," 2020 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 1-11, 2020 [view abstract]
Abstract: Established research on pair programming reveals benefits, including increasing communication, creativity, self-efficacy, and promoting gender inclusivity. However, research has reported limitations such as finding a compatible partner, scheduling sessions between partners, and resistance to pairing. Further, pairings can be affected by predispositions to negative stereotypes. These problems can be addressed by replacing one human member of the pair with a conversational agent. To investigate the design space of such a conversational agent, we conducted a controlled remote pair programming study. Our analysis found various creative problem-solving strategies and differences in conversational styles. We further analyzed the transferable strategies from human-human collaboration to human-agent collaboration by conducting a Wizard of Oz study. The findings from the two studies helped us gain insights regarding design of a programmer conversational agent. We make recommendations for researchers and practitioners for designing pair programming conversational agent tools.
M. Muller, I. Lange, D. Wang, D. Piorkowski, J. Tsay, Q. V. Liao, C. Dugan and T. Erickson. "How Data Science Workers Work with Data: Discovery, Capture, Curation, Design, Creation," ACM Proc. Int'l Conf. Human Factors in Computing Systems (CHI), 14 pages, 2019 [view abstract]
Abstract: With the rise of big data, there has been an increasing need for practitioners in this space and an increasing opportunity for researchers to understand their workflows and design new tools to improve it. Data science is often described as data-driven, comprising unambiguous data and proceeding through regularized steps of analysis. However, this view focuses more on abstract processes, pipelines, and workflows, and less on how data science workers engage with the data. In this paper, we build on the work of other CSCW and HCI researchers in describing the ways that scientists, scholars, engineers, and others work with their data, through analyses of interviews with 21 data science professionals. We set five approaches to data along a dimension of interventions: Data as given; as captured; as curated; as designed; and as created. Data science workers develop an intuitive sense of their data and processes, and actively shape their data. We propose new ways to apply these interventions analytically, to make sense of the complex activities around data practices.
T. Sandbank, M. Shmueli-Scheuer, D. Konopnicki, J. Herzig, J. Richards and D. Piorkowski. "Detecting Egregious Conversations between Customers and Virtual Agents," North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT), 2018 [view abstract]
Abstract: Virtual agents are becoming a prominent channel of interaction in customer service. Not all customer interactions are smooth, however, and some can become almost comically bad. In such instances, a human agent might need to step in and salvage the conversation. Detecting bad conversations is important since disappointing customer service may threaten customer loyalty and impact revenue. In this paper, we outline an approach to detecting such egregious conversations, using behavioral cues from the user, patterns in agent responses, and useragent interaction. Using logs of two commercial systems, we show that using these features improves the detection F1-score by around 20% over using textual features alone. In addition, we show that those features are common across two quite different domains and, arguably, universal.
D. Piorkowski, S. Penney, A. Henley, M. Pistoia, M. Burnett, O. Tripp and P. Ferrara. "Foraging Goes Mobile: Foraging While Debugging on Mobile Devices," 2017 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 9-17, 2017 [view abstract]
Abstract: Although Information Foraging Theory (IFT) research for desktop environments has provided important insights into numerous information foraging tasks, we have been unable to locate IFT research for mobile environments. Despite the limits of mobile platforms, mobile apps are increasingly serving functions that were once exclusively the territory of desktops—and as the complexity of mobile apps increases, so does the need for foraging. In this paper we investigate, through a theory-based, dual replication study, whether and how foraging results from a desktop IDE generalize to a functionally similar mobile IDE. Our results show ways prior foraging research results from desktop IDEs generalize to mobile IDEs and ways they do not, and point to challenging open research questions for foraging on mobile environments.
S. Srinivasa Ragavan, B. Pandya, D. Piorkowski, C. Hill, S. K. Kuttal, A. Sarma and M. Burnett. "PFIS-V: Modeling Foraging Behavior in the Presence of Variants," ACM Proc. Int'l Conf. Human Factors in Computing Systems (CHI), pp. 6232-6244, 2017 [view abstract]
Abstract: Foraging among similar variants of the same artifact is a common activity, but computational models of Information Foraging Theory (IFT) have not been developed to take such variants into account. Without being able to computationally predict people's foraging behavior with variants, our ability to harness the theory in practical ways—such as building and systematically assessing tools for people who forage different variants of an artifact—is limited. Therefore, in this paper, we introduce a new predictive model, PFIS-V, that builds upon PFIS3, the most recent of the PFIS family of modeling IFT in programming situations. Our empirical results show that PFIS-V is up to 25% more accurate than PFIS3 in predicting where a forager will navigate in a variationed information space.
A. Aydin, D. Piorkowski, O. Tripp, P. Ferrara and M. Pistoia. "Visual Configuration of Mobile Privacy Policies," Int'l Conf. on Fundamental Approaches to Software Engineering (FASE), pp. 338-355, 2017 [view abstract]
Abstract: Mobile applications often require access to private user information, such as the user or device ID, the location or the contact list. Usage of such data varies across different applications. A notable example is advertising. For contextual advertising, some applications release precise data, such as the user's exact address, while other applications release only the user's country. Another dimension is the user. Some users are more privacy demanding than others. Existing solutions for privacy enforcement are neither app- nor user- sensitive, instead performing general tracking of private data into release points like the Internet. The main contribution of this paper is in refining privacy enforcement by letting the user configure privacy preferences through a visual interface that captures the application's screens enriched with privacy-relevant information. We demonstrate the efficacy of our approach w.r.t. advertising and analytics, which are the main (third-party) consumers of private user information. We have implemented our approach for Android as the VisiDroid system. We demonstrate VisiDroid's efficacy via both quantitative and qualitative experiments involving top-popular Google Play apps. Our experiments include objective metrics, such as the average number of configuration actions per app, as well as a user study to validate the usability of VisiDroid.
D. Piorkowski, A. Z. Henley, T. Nabi, S. D. Fleming, C. Scaffidi and M. Burnett. "Foraging and Navigations, Fundamentally: Developers' Predictions of Value and Cost," ACM Proc. Int'l Symposium on the Foundations of Software Engineering (FSE), pp. 97-108, 2016 [view abstract]
Abstract: Empirical studies have revealed that software developers spend 35%–50% of their time navigating through source code during development activities, yet fundamental questions remain: Are these percentages too high, or simply inherent in the nature of software development? Are there factors that somehow determine a lower bound on how effectively developers can navigate a given information space? Answering questions like these requires a theory that captures the core of developers' navigation decisions. Therefore, we use the central proposition of Information Foraging Theory to investigate developers' ability to predict the value and cost of their navigation decisions. Our results showed that over 50% of developers' navigation choices produced less value than they had predicted and nearly 40% cost more than they had predicted. We used those results to guide a literature analysis, to investigate the extent to which these challenges are met by current research efforts, revealing a new area of inquiry with a rich and crosscutting set of research challenges and open problems.
S. Srinivasa Ragavan, S. K. Kuttal, C. Hill, A. Sarma, D. Piorkowski and M. Burnett. "Foraging among an Overabundance of Similar Variants," ACM Proc. Int'l Conf. Human Factors in Computing Systems (CHI), pp. 3509-3521, 2016 [view abstract]
Abstract: Foraging among too many variants of the same artifact can be problematic when many of these variants are similar. This situation, which is largely overlooked in the literature, is commonplace in several types of creative tasks, one of which is exploratory programming. In this paper, we investigate how novice programmers forage through similar variants. Based on our results, we propose a refinement to Information Foraging Theory (IFT) to include constructs about variation foraging behavior, and propose refinements to computational models of IFT to better account for foraging among variants.
D. Piorkowski, S. D. Fleming, C. Scaffidi, M. Burnett, I. Kwan, A. Z. Henley, J. Macbeth, C. Hill and A. Horvath. "To Fix or to Learn? How Production Bias Affects Developers' Information Foraging during Debugging," IEEE Int'l Conf. on Software Maintenance and Evolution (ICSME), pp. 11-20, 2015 [view abstract]
Abstract: Developers performing maintenance activities must balance their efforts to learn the code vs. their efforts to actually change it. This balancing act is consistent with the "production bias" that, according to Carroll's minimalist learning theory, generally affects software users during everyday tasks. This suggests that developers' focus on efficiency should have marked effects on how they forage for the information they think they need to fix bugs. To investigate how developers balance fixing versus learning during debugging, we conducted the first empirical investigation of the interplay between production bias and information foraging. Our theory-based study involved 11 participants: half tasked with fixing a bug, and half tasked with learning enough to help someone else fix it. Despite the subtlety of difference between their tasks, participants foraged remarkably differently—making foraging decisions from different types of "patches," with different types of information, and succeeding with different foraging tactics.
D. Piorkowski, S. D. Fleming, I. Kwan, M. Burnett, C. Scaffidi, R. Bellamy and J. Jordahl.. "The Whats and Hows of Programmers' Foraging Diets," ACM Proc. Int'l Conf. Human Factors in Computing Systems (CHI), pp. 3063-3072, 2013 [view abstract]
Abstract: One of the least studied areas of Information Foraging Theory is diet: the information foragers choose to seek. For example, do foragers choose solely based on cost, or do they stubbornly pursue certain diets regardless of cost? Do their debugging strategies vary with their diets? To investigate "what" and "how" questions like these for the domain of software debugging, we qualitatively analyzed 9 professional developers' foraging goals, goal patterns, and strategies. Participants spent 50% of their time foraging. Of their foraging, 58% fell into distinct dietary patterns—mostly in patterns not previously discussed in the literature. In general, programmers' foraging strategies leaned more heavily toward enrichment than we expected, but different strategies aligned with different goal types. These and our other findings help fill the gap as to what programmers' dietary goals are and how their strategies relate to those goals.
D. Piorkowski, S. D. Fleming, C. Scaffidi, C. Bogart, M. Burnett, B. E. John, R. Bellamy and C. Swart. "Reactive Information Foraging: An Empirical Investigation of Theory-Based Recommender Systems for Programmers," ACM Proc. Int'l Conf. Human Factors in Computing Systems (CHI), pp. 1471-1480, 2012 [view abstract]
Abstract: Information Foraging Theory (IFT) has established itself as an important theory to explain how people seek information, but most work has focused more on the theory itself than on how best to apply it. In this paper, we investigate how to apply a reactive variant of IFT (Reactive IFT) to design IFT-based tools, with a special focus on such tools for ill-structured problems. Toward this end, we designed and implemented a variety of recommender algorithms to empirically investigate how to help people with the ill-structured problem of finding where to look for information while debugging source code. We varied the algorithms based on scent type supported (words alone vs. words + code structure), and based on use of foraging momentum to estimate rapidity of foragers' goal changes. Our empirical results showed that (1) using both words and code structure significantly improved the ability of the algorithms to recommend where software developers should look for information; (2) participants used recommendations to discover new places in the code and also as shortcuts to navigate to known places; and (3) low-momentum recommendations were significantly more useful than high-momentum recommendations, suggesting rapid and numerous goal changes in this type of setting. Overall, our contributions include two new recommendation algorithms, empirical evidence about when and why participants found IFT-based recommendations useful, and implications for the design of tools based on Reactive IFT.
D. Piorkowski, S. D. Fleming, C. Scaffidi, L. John, C. Bogart, B. E. John, M. Burnett and R. Bellamy. "Modeling programmer navigation: A head-to-head empirical evaluation of predictive models," 2011 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 109-116, 2011 [view abstract]
Most Influential Paper (10 years)
Abstract: Software developers frequently need to perform code maintenance tasks, but doing so requires time-consuming navigation through code. A variety of tools are aimed at easing this navigation by using models to identify places in the code that a developer might want to visit, and then providing shortcuts so that the developer can quickly navigate to those locations. To date, however, only a few of these models have been compared head-to-head to assess their predictive accuracy. In particular, we do not know which models are most accurate overall, which are accurate only in certain circumstances, and whether combining models could enhance accuracy. Therefore, we have conducted an empirical study to evaluate the accuracy of a broad range of models for predicting many different kinds of code navigations in sample maintenance tasks. Overall, we found that models tended to perform best if they took into account how recently a developer has viewed pieces of the code, and if models took into account the spatial proximity of methods within the code. We also found that the accuracy of single-factor models can be improved by combining factors, using a spreading-activation based approach, to produce multi-factor models. Based on these results, we offer concrete guidance about how these models could be used to provide enhanced software development tools that ease the difficulty of navigating through code.
C. Bogart, M. Burnett, S. Douglass, D. Piorkowski and A. Shinsel. "Does My Model Work? Evaluation Abstractions of Cognitive Modelers," 2010 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 49-56, 2010 [view abstract]
Abstract: Are the abstractions that scientific modelers use to build their models in a modeling language the same abstractions they use to evaluate the correctness of their models? The extent to which such differences exist seems likely to correspond to additional effort of modelers in determining whether their models work as intended. In this paper, we therefore investigate the distinction between "programming abstractions" and "evaluation abstractions". As the basis of our investigation, we conducted a case study on cognitive modeling. We report modelers' evaluation abstractions, and the lengths they went to in evaluating their models. From these results, we derive design implications for several categories of persistent, first-class evaluation abstractions in future debugging tools for modelers.