David Piorkowski's Homepage

E. Miehling, M. Nagireddy, P. Sattigeri, E. M. Daly, D. Piorkowski and J. T. Richards. "Language Models in Dialogue: Conversational Maxims for Human-AI Interactions," Conf. on Empirical Methods in Natural Language Processing (EMNLP), 12 pages, 2024

Abstract: Modern language models, while sophisticated, exhibit some inherent shortcomings, particularly in conversational settings. We claim that many of the observed shortcomings can be attributed to violation of one or more conversational principles. By drawing upon extensive research from both the social science and AI communities, we propose a set of maxims (quantity, quality, relevance, manner, benevolence, and transparency) for describing effective human-AI conversation. We first justify the applicability of the first four maxims (from Grice) in the context of human-AI interactions. We then argue that two new maxims, benevolence (concerning the generation of, and engagement with, harmful content) and transparency (concerning recognition of one's knowledge boundaries, operational constraints, and intents), are necessary for addressing behavior unique to modern human-AI interactions. We evaluate the degree to which various language models are able to understand these maxims and find that models possess an internal prioritization of principles that can significantly impact how accurately they interpret the maxims.

D. Piorkowski, R. Ostrand, K. Brimijoin, J. He, E. Albert and S. Houde. "Towards Interactive Guidance for Writing Training Utterances for Conversational Agents," ACM Conversational User Interfaces Conf. (CUI), 20 pages, 2024
Honorable Mention

Abstract: Improving conversational agents that are trained with supervised learning requires iteratively refining example intent training utterances based on chat log data. The difficulty of this process hinges on the quality of the initial example utterances used to train the intent before it was first deployed. Creating new intents from scratch, when conversation logs are not yet available, has many challenges. We interviewed experienced conversational agent intent trainers to better understand challenges they face when creating new intents, and their best practices for writing high-quality training utterances. Using these findings and related literature, we developed an intent training tool that provided interactive guidance via either language feedback or sample utterances. Language feedback notified the user when training utterances could be linguistically improved, while sample utterances were crowdsourced and provided examples of end-user language prior to deploying an intent. We compared these two types of guidance in a 187-participant between-subjects study. We found that participants in the language feedback condition reported limited creativity and higher mental load and spent more time on the task, but were more thoughtful in crafting utterances that adhered to best practices. In contrast, sample utterance participants leveraged the samples to either quickly select examples or use them as a springboard to develop new utterance ideas. We report on differences in user experience, in the strategies that participants took, and in their preferences for or against the different types of guidance.

R. Ostrand, K. Brimijoin, D. Piorkowski, J. He, E. Albert and S. Houde. "Say What? Real-time Linguistic Guidance Supports Novices in Writing Utterances for Conversational Agent Training," ACM Conversational User Interfaces Conf. (CUI), 15 pages, 2024

Abstract: Writing utterances to train conversational agents can be a challenging and time-consuming task, and usually requires substantial expertise, meaning that novices face a steep learning curve. We investigated whether novices could be guided to produce utterances that adhere to best practices via an intervention of real-time linguistic feedback. We conducted a user study in which participants were tasked with writing training utterances for a particular topic (an intent) for a conversational agent. Participants received one of two types of linguistic guidance in real-time to shape their utterance-writing: (1) feedback on the lexical and syntactic properties and variety of each utterance as it was written, or (2) sample utterances written by other users to choose from or inspire the writing of new utterances. All participants also completed a control condition, in which they wrote utterances for a different intent, without receiving any guidance. We investigated whether linguistic properties of the utterances written differed as a function of whether the participant had received guidance, and if so, which type. Results showed that participants wrote better quality utterances in both guidance conditions compared to when they received no guidance, in that they were longer and had greater lexical and syntactic diversity. These results demonstrate that giving novices explicit linguistic guidance can improve the quality of the training utterances they write, suggesting that this could be an effective way of getting new utterance writers started with minimal training.

D. Piorkowski, I. Vejsbjerg, O. Cornec, E. M. Daly and Ö. Alkan. "AIMEE: An Exploratory Study of How Rules Support AI Developers to Explain and Edit Models," ACM Conf. on Computer-Supported Cooperative Work and Social Computing (CSCW), 24 pages, 2023

Abstract: In real-world applications of Machine Learning (ML) models, initial model development includes close analysis of the model results and behavior by a data scientist. Once trained, however, models may need to be retrained with new data or updated to adhere to new rules or regulations. This presents two challenges: first, how to communicate how a model makes its decisions before and after retraining; and second, how to support model editing to take into account new requirements. To address these needs, we built AIMEE (AI Model Explorer and Editor), a tool that provides interactive methods to explain, visualize, and modify model decision boundaries using rules. Rules should benefit model builders by providing a layer of abstraction for understanding and manipulating the model and by reducing the need to modify individual rows of data directly. To evaluate whether this was the case, we conducted a pair of user studies totaling 23 participants to evaluate AIMEE's rules-based approach for model explainability and editing. We found that participants correctly interpreted rules. We also report on participants' perspectives on how rules are (and are not) beneficial and on ways that rules could support collaboration, and we provide a usability evaluation of the tool.

B. Dominique, K. El Maghraoui, D. Piorkowski and L. Herger. "FactSheets for Hardware-Aware AI Models: A Case Study of Analog In Memory Computing AI Models," IEEE International Conference on Software Services Engineering (SSE), 10 pages, 2023
Best Student Paper

Abstract: In the last few years, documenting and tracking the lineage of AI models has emerged as an important research area that can help to improve the transparency, traceability and overall effectiveness of a model when it is used or deployed by an entity that did not create it. This is a crucial step towards responsible AI in the services computing paradigm, especially as AI-enabled software service engineering becomes more prevalent and mainstream. Multiple documentation methods have been proposed and their adoption has slowly begun, but these methods tend to focus on the data science aspects of model creation, such as the datasets used to design and train the model, the neural network structure of the model, the F1 score, the model bias, etc. When these methods are adapted to the emerging field of analog in-memory computing (IMC) AI hardware accelerators, additional documentation requirements need to be considered. Analog IMC accelerators offer increased area and power efficiency, which are paramount in IoT and edge resource-constrained environments. We use the AI FactSheets (FS) 360 documentation methodology to understand and evaluate the documentation needs in this emerging domain. To do so, we interviewed 12 participants who represent various roles throughout the lifecycle of designing, training, evaluating, deploying and consuming an analog-aware AI model. From these interviews we capture these roles' documentation and collaborative needs, develop FactSheets to meet those needs, and evaluate the quality of completed FactSheets. We show that the FactSheets methodology can be applied to Analog AI models to successfully create meaningful documentation that is suitable across multiple roles and is a key step towards responsible AI models.

J. He, D. Piorkowski, M. Muller, K. Brimijoin, S. Houde and J. Weisz. "Rebalancing Worker Initiative and AI Initiative in Future Work: Four Task Dimensions," ACM Sym. on Computer-Human Interaction for Work (CHIWORK), 18 pages, 2023

Abstract: Organizations have recently begun to deploy conversational task assistants that collaborate with knowledge workers to partially automate their work tasks. These assistants evolved out of business robotic process automation (RPA) tools and are becoming more intelligent: users can initiate task sequences through natural language, and the system can orchestrate those tasks if they have not previously been defined. As these tools become more automated, system designers tend to optimize overall process efficiency, but at the cost of shifting agency away from workers. Particularly in high stakes work environments, this shift raises questions of how to re-delegate agency such that workers feel sufficiently in control of automated tasks. We explored this through two studies comprised of interviews and co-design activities with knowledge workers and identified four task dimensions along which their automation and interaction preferences vary: process consequence, social consequence, task familiarity, and task complexity. These dimensions are useful for understanding when, why, and how to delegate agency between workers and conversational task assistants.

A. Danielescu and D. Piorkowski. "Iterative Design of Gestures During Elicitation: Understanding the Role of Increased Production," ACM Proc. Int'l Conf. Human Factors in Computing Systems (CHI), 14 pages, 2022
Honorable Mention

Abstract: Previous gesture elicitation studies have found that user proposals are influenced by legacy bias, which may inhibit users from proposing gestures that are most appropriate for an interaction. Increasing production during elicitation studies has shown promise in moving users beyond legacy gestures. However, variety decreases as more symbols are produced. While several studies have used increased production since its introduction, little research has focused on understanding its effect on the quality of proposed gestures, on why variety decreases, and on whether increased production should be limited. In this paper, we present a gesture elicitation study aimed at understanding the impact of increased production. We show that users refine the most promising gestures and that how long it takes to find promising gestures varies by participant. We also show that gestural refinements provide insight into the gestural features that matter for users to assign semantic meaning and discuss implications for training gesture classifiers.

J. Richards, D. Piorkowski, M. Hind, S. Houde, A. Mojsilovic and K. R. Varshney. "A Human-Centered Methodology for Creating AI FactSheets," Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 12 pages, 2021

Abstract: As artificial intelligence (AI) models and services are used in a growing number of high-stakes areas, a consensus is forming around the need for a clearer record of how these models and services are developed to increase trust. Several proposals for higher quality and more consistent AI documentation have emerged to address ethical and legal concerns and general social impacts of such systems. However, there is little published work on how to create this documentation. In this paper we describe a methodology for creating the form of AI documentation we call FactSheets, and share the insights we have gathered while creating nearly two dozen FactSheets. Within each step of the methodology, we describe the issues to consider and the questions to explore with the relevant people in an organization who will be creating and consuming AI facts. This methodology may help foster the creation of transparent AI documentation.

D. Piorkowski, S. Park, A. Y. Wang, D. Wang, M. Muller and F. Portnoy. "How AI Developers Overcome Communication Challenges in a Multidisciplinary Team: A Case Study," ACM Conf. on Computer-Supported Cooperative Work and Social Computing (CSCW), 23 pages, 2021

Abstract: The development of AI applications is a multidisciplinary effort, involving multiple roles collaborating with the AI developers, an umbrella term we use to include data scientists and other AI-adjacent roles on the same team. During these collaborations, there is a knowledge mismatch between AI developers, who are skilled in data science, and external stakeholders who are typically not. This difference leads to communication gaps, and the onus falls on AI developers to explain data science concepts to their collaborators. In this paper, we report on a study including analyses of both interviews with AI developers and artifacts they produced for communication. Using the analytic lens of shared mental models, we report on the types of communication gaps that AI developers face, how AI developers communicate across disciplinary and organizational boundaries, and how they simultaneously manage issues regarding trust and expectations.

S. Park, A. Wang, B. Kawas, Q. V. Liao, D. Piorkowski and M. Danilevsky. "Facilitating Knowledge Sharing from Domain Experts to Data Scientists for Building NLP Models," ACM Conf. on Intelligent User Interfaces (IUI), 18 pages, 2021

Abstract: Data scientists face a steep learning curve in understanding a new domain for which they want to build machine learning (ML) models. While input from domain experts could offer valuable help, such input is often limited, expensive, and generally not in a form readily consumable by a model development pipeline. In this paper, we propose Ziva, a framework to guide domain experts in sharing essential domain knowledge with data scientists for building NLP models. With Ziva, experts are able to distill and share their domain knowledge using domain concept extractors and five types of label justification over a representative data sample. The design of Ziva is informed by preliminary interviews with data scientists, conducted to understand current practices for acquiring domain knowledge in ML development projects. To assess our design, we ran a mixed-method case study to evaluate how Ziva can facilitate interaction between domain experts and data scientists. Our results highlight that (1) domain experts are able to use Ziva to provide rich domain knowledge, while maintaining low mental load and stress levels; and (2) data scientists find Ziva's output helpful for learning essential information about the domain, offering scalability of information, and lowering the burden on domain experts to share knowledge. We conclude this work by experimenting with building NLP models using the Ziva output produced in our case study.

S. K. Kuttal, J. Myers, S. Gurka, D. Magar, D. Piorkowski and R. Bellamy. "Towards Designing Conversational Agents for Pair Programming: Accounting for Creativity Strategies and Conversational Styles," 2020 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 1-11, 2020

Abstract: Established research on pair programming reveals benefits, including increasing communication, creativity, self-efficacy, and promoting gender inclusivity. However, research has reported limitations such as finding a compatible partner, scheduling sessions between partners, and resistance to pairing. Further, pairings can be affected by predispositions to negative stereotypes. These problems can be addressed by replacing one human member of the pair with a conversational agent. To investigate the design space of such a conversational agent, we conducted a controlled remote pair programming study. Our analysis found various creative problem-solving strategies and differences in conversational styles. We further analyzed the transferable strategies from human-human collaboration to human-agent collaboration by conducting a Wizard of Oz study. The findings from the two studies helped us gain insights regarding design of a programmer conversational agent. We make recommendations for researchers and practitioners for designing pair programming conversational agent tools.

M. Muller, I. Lange, D. Wang, D. Piorkowski, J. Tsay, Q. V. Liao, C. Dugan and T. Erickson. "How Data Science Workers Work with Data: Discovery, Capture, Curation, Design, Creation," ACM Proc. Int'l Conf. Human Factors in Computing Systems (CHI), 14 pages, 2019

Abstract: With the rise of big data, there has been an increasing need for practitioners in this space and an increasing opportunity for researchers to understand their workflows and design new tools to improve them. Data science is often described as data-driven, comprising unambiguous data and proceeding through regularized steps of analysis. However, this view focuses more on abstract processes, pipelines, and workflows, and less on how data science workers engage with the data. In this paper, we build on the work of other CSCW and HCI researchers in describing the ways that scientists, scholars, engineers, and others work with their data, through analyses of interviews with 21 data science professionals. We set five approaches to data along a dimension of interventions: Data as given; as captured; as curated; as designed; and as created. Data science workers develop an intuitive sense of their data and processes, and actively shape their data. We propose new ways to apply these interventions analytically, to make sense of the complex activities around data practices.

T. Sandbank, M. Shmueli-Scheuer, D. Konopnicki, J. Herzig, J. Richards and D. Piorkowski. "Detecting Egregious Conversations between Customers and Virtual Agents," North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL:HLT), 2018

Abstract: Virtual agents are becoming a prominent channel of interaction in customer service. Not all customer interactions are smooth, however, and some can become almost comically bad. In such instances, a human agent might need to step in and salvage the conversation. Detecting bad conversations is important since disappointing customer service may threaten customer loyalty and impact revenue. In this paper, we outline an approach to detecting such egregious conversations, using behavioral cues from the user, patterns in agent responses, and user-agent interaction. Using logs of two commercial systems, we show that using these features improves the detection F1-score by around 20% over using textual features alone. In addition, we show that those features are common across two quite different domains and, arguably, universal.

D. Piorkowski, S. Penney, A. Henley, M. Pistoia, M. Burnett, O. Tripp and P. Ferrara. "Foraging Goes Mobile: Foraging While Debugging on Mobile Devices," 2017 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 9-17, 2017
Honorable Mention

Abstract: Although Information Foraging Theory (IFT) research for desktop environments has provided important insights into numerous information foraging tasks, we have been unable to locate IFT research for mobile environments. Despite the limits of mobile platforms, mobile apps are increasingly serving functions that were once exclusively the territory of desktops—and as the complexity of mobile apps increases, so does the need for foraging. In this paper we investigate, through a theory-based, dual replication study, whether and how foraging results from a desktop IDE generalize to a functionally similar mobile IDE. Our results show ways prior foraging research results from desktop IDEs generalize to mobile IDEs and ways they do not, and point to challenging open research questions for foraging on mobile environments.

S. Srinivasa Ragavan, B. Pandya, D. Piorkowski, C. Hill, S. K. Kuttal, A. Sarma and M. Burnett. "PFIS-V: Modeling Foraging Behavior in the Presence of Variants," ACM Proc. Int'l Conf. Human Factors in Computing Systems (CHI), pp. 6232-6244, 2017

Abstract: Foraging among similar variants of the same artifact is a common activity, but computational models of Information Foraging Theory (IFT) have not been developed to take such variants into account. Without being able to computationally predict people's foraging behavior with variants, our ability to harness the theory in practical ways—such as building and systematically assessing tools for people who forage different variants of an artifact—is limited. Therefore, in this paper, we introduce a new predictive model, PFIS-V, that builds upon PFIS3, the most recent of the PFIS family of models of IFT in programming situations. Our empirical results show that PFIS-V is up to 25% more accurate than PFIS3 in predicting where a forager will navigate in an information space containing variants.

A. Aydin, D. Piorkowski, O. Tripp, P. Ferrara and M. Pistoia. "Visual Configuration of Mobile Privacy Policies," Int'l Conf. on Fundamental Approaches to Software Engineering (FASE), pp. 338-355, 2017

Abstract: Mobile applications often require access to private user information, such as the user or device ID, the location or the contact list. Usage of such data varies across different applications. A notable example is advertising. For contextual advertising, some applications release precise data, such as the user's exact address, while other applications release only the user's country. Privacy expectations also vary by user: some users are more privacy-demanding than others. Existing solutions for privacy enforcement are neither app- nor user-sensitive, instead performing general tracking of private data into release points like the Internet. The main contribution of this paper is in refining privacy enforcement by letting the user configure privacy preferences through a visual interface that captures the application's screens enriched with privacy-relevant information. We demonstrate the efficacy of our approach w.r.t. advertising and analytics, which are the main (third-party) consumers of private user information. We have implemented our approach for Android as the VisiDroid system. We demonstrate VisiDroid's efficacy via both quantitative and qualitative experiments involving highly popular Google Play apps. Our experiments include objective metrics, such as the average number of configuration actions per app, as well as a user study to validate the usability of VisiDroid.

D. Piorkowski, A. Z. Henley, T. Nabi, S. D. Fleming, C. Scaffidi and M. Burnett. "Foraging and Navigations, Fundamentally: Developers' Predictions of Value and Cost," ACM Proc. Int'l Symposium on the Foundations of Software Engineering (FSE), pp. 97-108, 2016
Distinguished Paper

Abstract: Empirical studies have revealed that software developers spend 35%–50% of their time navigating through source code during development activities, yet fundamental questions remain: Are these percentages too high, or simply inherent in the nature of software development? Are there factors that somehow determine a lower bound on how effectively developers can navigate a given information space? Answering questions like these requires a theory that captures the core of developers' navigation decisions. Therefore, we use the central proposition of Information Foraging Theory to investigate developers' ability to predict the value and cost of their navigation decisions. Our results showed that over 50% of developers' navigation choices produced less value than they had predicted and nearly 40% cost more than they had predicted. We used those results to guide a literature analysis, to investigate the extent to which these challenges are met by current research efforts, revealing a new area of inquiry with a rich and crosscutting set of research challenges and open problems.

S. Srinivasa Ragavan, S. K. Kuttal, C. Hill, A. Sarma, D. Piorkowski and M. Burnett. "Foraging among an Overabundance of Similar Variants," ACM Proc. Int'l Conf. Human Factors in Computing Systems (CHI), pp. 3509-3521, 2016
Best Paper

Abstract: Foraging among too many variants of the same artifact can be problematic when many of these variants are similar. This situation, which is largely overlooked in the literature, is commonplace in several types of creative tasks, one of which is exploratory programming. In this paper, we investigate how novice programmers forage through similar variants. Based on our results, we propose a refinement to Information Foraging Theory (IFT) to include constructs about variation foraging behavior, and propose refinements to computational models of IFT to better account for foraging among variants.

D. Piorkowski, S. D. Fleming, C. Scaffidi, M. Burnett, I. Kwan, A. Z. Henley, J. Macbeth, C. Hill and A. Horvath. "To Fix or to Learn? How Production Bias Affects Developers' Information Foraging during Debugging," IEEE Int'l Conf. on Software Maintenance and Evolution (ICSME), pp. 11-20, 2015

Abstract: Developers performing maintenance activities must balance their efforts to learn the code vs. their efforts to actually change it. This balancing act is consistent with the "production bias" that, according to Carroll's minimalist learning theory, generally affects software users during everyday tasks. This suggests that developers' focus on efficiency should have marked effects on how they forage for the information they think they need to fix bugs. To investigate how developers balance fixing versus learning during debugging, we conducted the first empirical investigation of the interplay between production bias and information foraging. Our theory-based study involved 11 participants: half tasked with fixing a bug, and half tasked with learning enough to help someone else fix it. Despite the subtlety of difference between their tasks, participants foraged remarkably differently—making foraging decisions from different types of "patches," with different types of information, and succeeding with different foraging tactics.

D. Piorkowski, S. D. Fleming, I. Kwan, M. Burnett, C. Scaffidi, R. Bellamy and J. Jordahl. "The Whats and Hows of Programmers' Foraging Diets," ACM Proc. Int'l Conf. Human Factors in Computing Systems (CHI), pp. 3063-3072, 2013

Abstract: One of the least studied areas of Information Foraging Theory is diet: the information foragers choose to seek. For example, do foragers choose solely based on cost, or do they stubbornly pursue certain diets regardless of cost? Do their debugging strategies vary with their diets? To investigate "what" and "how" questions like these for the domain of software debugging, we qualitatively analyzed 9 professional developers' foraging goals, goal patterns, and strategies. Participants spent 50% of their time foraging. Of their foraging, 58% fell into distinct dietary patterns—mostly in patterns not previously discussed in the literature. In general, programmers' foraging strategies leaned more heavily toward enrichment than we expected, but different strategies aligned with different goal types. These and our other findings help fill the gap as to what programmers' dietary goals are and how their strategies relate to those goals.

D. Piorkowski, S. D. Fleming, C. Scaffidi, C. Bogart, M. Burnett, B. E. John, R. Bellamy and C. Swart. "Reactive Information Foraging: An Empirical Investigation of Theory-Based Recommender Systems for Programmers," ACM Proc. Int'l Conf. Human Factors in Computing Systems (CHI), pp. 1471-1480, 2012

Abstract: Information Foraging Theory (IFT) has established itself as an important theory to explain how people seek information, but most work has focused more on the theory itself than on how best to apply it. In this paper, we investigate how to apply a reactive variant of IFT (Reactive IFT) to design IFT-based tools, with a special focus on such tools for ill-structured problems. Toward this end, we designed and implemented a variety of recommender algorithms to empirically investigate how to help people with the ill-structured problem of finding where to look for information while debugging source code. We varied the algorithms based on scent type supported (words alone vs. words + code structure), and based on use of foraging momentum to estimate rapidity of foragers' goal changes. Our empirical results showed that (1) using both words and code structure significantly improved the ability of the algorithms to recommend where software developers should look for information; (2) participants used recommendations to discover new places in the code and also as shortcuts to navigate to known places; and (3) low-momentum recommendations were significantly more useful than high-momentum recommendations, suggesting rapid and numerous goal changes in this type of setting. Overall, our contributions include two new recommendation algorithms, empirical evidence about when and why participants found IFT-based recommendations useful, and implications for the design of tools based on Reactive IFT.

D. Piorkowski, S. D. Fleming, C. Scaffidi, L. John, C. Bogart, B. E. John, M. Burnett and R. Bellamy. "Modeling programmer navigation: A head-to-head empirical evaluation of predictive models," 2011 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 109-116, 2011
Most Influential Paper (10 years)

Abstract: Software developers frequently need to perform code maintenance tasks, but doing so requires time-consuming navigation through code. A variety of tools are aimed at easing this navigation by using models to identify places in the code that a developer might want to visit, and then providing shortcuts so that the developer can quickly navigate to those locations. To date, however, only a few of these models have been compared head-to-head to assess their predictive accuracy. In particular, we do not know which models are most accurate overall, which are accurate only in certain circumstances, and whether combining models could enhance accuracy. Therefore, we have conducted an empirical study to evaluate the accuracy of a broad range of models for predicting many different kinds of code navigations in sample maintenance tasks. Overall, we found that models tended to perform best if they took into account how recently a developer has viewed pieces of the code, and if models took into account the spatial proximity of methods within the code. We also found that the accuracy of single-factor models can be improved by combining factors, using a spreading-activation based approach, to produce multi-factor models. Based on these results, we offer concrete guidance about how these models could be used to provide enhanced software development tools that ease the difficulty of navigating through code.

C. Bogart, M. Burnett, S. Douglass, D. Piorkowski and A. Shinsel. "Does My Model Work? Evaluation Abstractions of Cognitive Modelers," 2010 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 49-56, 2010

Abstract: Are the abstractions that scientific modelers use to build their models in a modeling language the same abstractions they use to evaluate the correctness of their models? The extent to which such differences exist seems likely to correspond to additional effort of modelers in determining whether their models work as intended. In this paper, we therefore investigate the distinction between "programming abstractions" and "evaluation abstractions". As the basis of our investigation, we conducted a case study on cognitive modeling. We report modelers' evaluation abstractions, and the lengths they went to in evaluating their models. From these results, we derive design implications for several categories of persistent, first-class evaluation abstractions in future debugging tools for modelers.

S. Srinivasa Ragavan, M. Codoban, D. Piorkowski, D. Dig and M. Burnett. "Version Control Systems: An Information Foraging Perspective," IEEE Transactions on Software Engineering (TSE), 11 pages, 2021

Abstract: Version Control Systems (VCS) are an important source of information for developers. This calls for a principled understanding of developers' information seeking in VCS, both for improving existing tools and for understanding requirements for new tools. Our prior work investigated empirically how and why developers seek information in VCS: in this paper, we complement and enrich our prior findings by reanalyzing the data via a theory's lens. Using the lens of Information Foraging Theory (IFT), we present new insights not revealed by the prior empirical work. First, while looking for specific information, participants' foraging behaviors were consistent with other foraging situations in SE; therefore, prior research on IFT-based SE tool design can be leveraged for VCS. Second, in change awareness foraging, participants consumed similar diets, but in subtly different ways than in other situations; this calls for further investigations into change awareness foraging. Third, while committing changes, participants attempted to enable future foragers, but the competing needs of different foraging situations led to tensions that participants failed to balance: this opens up a new avenue for research at the intersection of IFT and SE, namely, creating forageable information. Finally, the results of using an IFT lens on these data provide some evidence as to IFT's scoping and utility for the version control domain.

M. Arnold, R. K. E. Bellamy, M. Hind, S. Houde, S. Mehta, A. Mojsilovic, R. Nair, K. Natesan Ramamurthy, D. Reimer, A. Olteanu, D. Piorkowski, J. Tsay and K. R. Varshney. "FactSheets: Increasing Trust in AI Services through Supplier's Declarations of Conformity," IBM Journal of Research and Development, pp. 6:1-6:13, 2019

Abstract: Accuracy is an important concern for suppliers of artificial intelligence (AI) services, but considerations beyond accuracy, such as safety (which includes fairness and explainability), security, and provenance, are also critical elements to engender consumers’ trust in a service. Many industries use transparent, standardized, but often not legally required documents called supplier's declarations of conformity (SDoCs) to describe the lineage of a product along with the safety and performance testing it has undergone. SDoCs may be considered multidimensional fact sheets that capture and quantify various aspects of the product and its development to make it worthy of consumers’ trust. In this article, inspired by this practice, we propose FactSheets to help increase trust in AI services. We envision such documents to contain purpose, performance, safety, security, and provenance information to be completed by AI service providers for examination by consumers. We suggest a comprehensive set of declaration items tailored to AI in the Appendix of this article.

C. Scaffidi, S. D. Fleming, D. Piorkowski, M. Burnett, R. Bellamy, J. Lawrance and I. Kwan. "An Information Foraging Theory Perspective on Tools for Debugging, Refactoring, and Reuse Tasks," ACM Transactions on Software Engineering and Methodology (TOSEM), 41 pages, 2013

Abstract: Theories of human behavior are an important but largely untapped resource for software engineering research. They facilitate understanding of human developers’ needs and activities, and thus can serve as a valuable resource to researchers designing software engineering tools. Furthermore, theories abstract beyond specific methods and tools to fundamental principles that can be applied to new situations. Toward filling this gap, we investigate the applicability and utility of Information Foraging Theory (IFT) for understanding information-intensive software engineering tasks, drawing upon literature in three areas: debugging, refactoring, and reuse. In particular, we focus on software engineering tools that aim to support information-intensive activities, that is, activities in which developers spend time seeking information. Regarding applicability, we consider whether and how the mathematical equations within IFT can be used to explain why certain existing tools have proven empirically successful at helping software engineers. Regarding utility, we applied an IFT perspective to identify recurring design patterns in these successful tools, and consider what opportunities for future research are revealed by our IFT perspective.

R. Ostrand, V. Ferreira and D. Piorkowski. "Rapid Lexical Alignment to a Conversational Agent," Proc. INTERSPEECH 2023, pp. 2653-2657, 2023

Abstract: Conversational partners modify their language to be more similar to each other during interactions. This phenomenon, known as alignment, has been shown in human-human interactions, but there is little work on lexical alignment in human-computer interactions. We investigate whether people lexically align to a conversational agent, and whether the degree of alignment depends on feedback from the agent. This study compared three feedback conditions for how the agent responded to users' word choice: (1) the agent only understood the specific words that it produced itself; (2) the agent understood the words that it produced as well as more appropriate synonyms; (3) the agent's understanding of words that it did not produce was random. Participants significantly aligned to the agent in all conditions, and aligned more when they learned that the agent's comprehension was contingent on their alignment. Thus, inducing lexical alignment may be an effective way to increase dialogue success.

M. Hind, S. Houde, J. Martino, A. Mojsilovic, D. Piorkowski, J. Richards and K. R. Varshney. "Experiences with improving the transparency of AI models and services," Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems (CHI), 8 pages, 2020

Abstract: AI models and services are used in a growing number of high-stakes areas, resulting in a need for increased transparency. Consistent with this, several proposals for higher quality and more consistent documentation of AI data, models, and systems have emerged. Little is known, however, about the needs of those who would produce or consume these new forms of documentation. Through semi-structured developer interviews, and two document-creation exercises, we have assembled a clearer picture of these needs and the various challenges faced in creating accurate and useful AI documentation. Based on the observations from this work, supplemented by feedback received during multiple design explorations and stakeholder conversations, we make recommendations for easing the collection and flexible presentation of AI facts to promote transparency.

T. Nabi, K. M. D. Sweeney, S. Lichlyter, D. Piorkowski, C. Scaffidi, M. Burnett and S. D. Fleming. "Putting Information Foraging Theory to Work: Community-based Design Patterns for Programming Tools," 2016 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 129-133, 2016

Abstract: The design of programming tools is slow and costly. To ease this process, we developed a design pattern catalog aimed at providing guidance for tool designers. This catalog is grounded in Information Foraging Theory (IFT), which empirical studies have shown to be useful for understanding how developers look for information during development tasks. New design patterns, authored by members of the research community for the catalog, concretely explain how to apply IFT in tool design. In our evaluation, qualitative analyses revealed the community-written design patterns compared well in quality to patterns that we had ourselves published in a smaller, peer-reviewed catalog.

O. Benjelloun, L. Kaffee, S. Longpre, D. Piorkowski, E. Simperl and S. Worth. "Tutorial: AI Data Transparency: The Past, The Present, and Beyond," AAAI Conference on Artificial Intelligence (AAAI), 2 pages, 2025

B. Knowles, J. Richards and D. Piorkowski. "Practice Tutorial: Documenting AI's Environmental Impact," ACM Conf. on Fairness, Accountability, and Transparency (FAccT), 2 pages, 2024

B. Dominique, D. Piorkowski, M. Nagireddy and I. Baldini. "Prompt Templates: A Methodology for Improving Manual Red Teaming Performance," First Workshop on Human-Centered Evaluation and Auditing of Language Models (HEAL@CHI), 6 pages, 2024

Abstract: Large language models (LLMs) may output content that is undesired or outright harmful. One method for auditing this unwanted model output is a process called manual red teaming, in which a human creates prompts to probe the LLM's behavior. Successful red teaming requires experience and expertise. To better support humans in manual red teaming, we tested prompt templates to guide novices toward more effective red teaming results. We evaluated the prompt templates in a user study of 29 participants who were tasked with red teaming an LLM to identify biased output based on societal stigmas. We found that using prompt templates led to increased success and performance in this task, with multiple effective strategies being used while doing so.

J. He, D. Piorkowski, M. Muller, K. Brimijoin, S. Houde and J. Weisz. "Understanding How Task Dimensions Impact Automation Preferences with a Conversational Task Assistant," Intervening, Teaming, Delegating, Creating Engaging Automation Experiences Workshop at CHI (CHI), 6 pages, 2023

Abstract: Organizations have recently begun to deploy conversational task assistants that collaborate with business users to partially automate their work tasks. These assistants are becoming more intelligent: users initiate automated task support through natural language, and the system can dynamically orchestrate new task sequences accordingly. As these tools become more intelligent and automated, they sometimes shift control away from users to increase process efficiency at the cost of consequences for users' preferences and productivity. Particularly in high stakes work environments, this shift raises questions of when automation is suitable and how to delegate agency such that users feel sufficiently in control of their tasks. We explored these questions through two studies consisting of interviews and co-design activities with business users and identified four task dimensions along which their automation and interaction preferences vary: process consequence, social consequence, task familiarity, and task complexity. These dimensions are useful for understanding when, why, and how to delegate control between users and conversational task assistants.

D. Piorkowski, R. Ostrand, Y. Rizk, V. Isahagian, V. Muthusamy and J. D. Weisz. "Accuracy Is Not All You Need," Workshop on Human-Centered AI Workshop at NeurIPS (NeurIPS), 2 pages, 2022

Abstract: Improving the performance of human-AI (artificial intelligence) collaborations tends to be narrowly scoped, with better prediction performance often considered the only metric of improvement. As a result, work on improving the collaboration usually focuses on improving the AI's accuracy. Here, we argue that such a focus is myopic, and instead, practitioners should take a more holistic view of measuring the performance of AI models, and human-AI collaboration more specifically. In particular, we argue that although some use cases merit optimizing for classification accuracy, for others, accuracy is less important and improvement on human-centered metrics should be valued instead.

D. A. González Rueda, D. Piorkowski and D. Mendonça. "Behavioral Measures of Trust in Human-autonomy Teams," CHI 2022 Workshop on Trust and Reliance in Human-AI Teams (CHI), 11 pages, 2022

Abstract: Trust has long been acknowledged as a crucial aspect of teamwork, whether in all-human or in mixed human/autonomy teams. However, typical approaches to the measurement of trust rely chiefly on psychometric approaches that are not well suited to capturing data on trust among non-human members of a team and can constitute interference in the workflow of all-human teams. This paper explores prospects for conceptualizing and measuring trust at the team level through the measurement of observable behaviors associated with trust. Here, three aspects of trust are considered (competence, predictability, and integrity), and existing behavioral measures of trust are examined in relation to them, using as criteria the reliability, validity, and extensibility of each measure. This paper concludes with a summative assessment of the current state of behavioral measures of trust in teams, as well as recommendations for future work. Further research along these lines will be critical for understanding the role of trust in human/autonomy teams in general, but particularly when the proportion of non-human members on a team is large, or when "autonomies" participate in vital activities in the team's workflow.

M. Muller, A. Wang, S. Ross, J. Weisz, M. Agarwal, K. Talamadupula, S. Houde, F. Martinez, J. Richards, J. Drozdal, X. Liu, D. Piorkowski and D. Wang. "How Data Scientists Improve Generated Code Documentation in Jupyter Notebooks," IUI 2021 Workshop on Human-AI Co-Creation with Generative Models (IUI), 13 pages, 2021

Abstract: Generative AI models are capable of creating high-fidelity outputs, sometimes indistinguishable from what could be produced by human effort. However, some domains possess an objective bar of quality, and the probabilistic nature of generative models suggests that there may be imperfections or flaws in their output. In software engineering, for example, code produced by a generative model may not compile, or it may contain bugs or logical errors. Various models of human-AI interaction, such as mixed-initiative user interfaces, suggest that human effort ought to be applied to a generative model's outputs in order to improve its quality. We report results from a controlled experiment in which data scientists used multiple models (including a GNN-based generative model) to generate and subsequently edit documentation for data science code within Jupyter notebooks. In analyzing their edit-patterns, we discovered various ways that humans made improvements to the generated documentation, and speculate that such edit data could be used to train generative models to not only identify which parts of their output might require human attention, but also how those parts could be improved.

S. Houde, V. Liao, J. Martino, M. Muller, D. Piorkowski, J. Richards, J. Weisz and Y. Zhang. "Business (Mis)use Cases of Generative AI," IUI 2020 Workshop on Human-AI Co-Creation with Generative Models (IUI), 6 pages, 2020

Abstract: Generative AI is a class of machine learning technology that learns to generate new data from training data. While deep fakes and media- and art-related generative AI breakthroughs have recently caught people's attention and imagination, the overall area is in its infancy for business use. Further, little is known about generative AI's potential for malicious misuse at large scale. Using co-creation design fictions with AI engineers, we explore the plausibility and severity of business misuse cases.

I. Kwan, S. D. Fleming and D. Piorkowski. "Information Foraging Theory for Collaborative Software Development," Proceedings of the CSCW 2012 Workshop on The Future of Collaborative Software Development (Future CSD), 3 pages, 2012

Abstract: Information foraging theory describes how people gather information based on a cost-benefit model. This theory has been successfully applied to the web domain and to software engineering tools. However, little work has been done on how information foraging theory can be applied to information-seeking behavior in a collaborative software engineering setting. This paper discusses how the theory might apply to information-seeking within collaborative software development teams, and how constructs of the theory might help aid in the design of tools and processes.

Y. Rizk, V. Isahagian, V. Muthusamy and D. J. Piorkowski. "Dynamic Selection of AI Computer Models to Reduce Costs and Maximize User Experience," (pending). Filed on 2023-09-21.

J. T. Richards, M. A. Bhide, M. Hind, A. Mojsilovic, J. Martino and D. J. Piorkowski. "Governing Usage of an Artificial Intelligence Technology," (pending). Filed on 2023-09-15.

A. Goldsteen, M. Hind, J. Martino, D. J. Piorkowski, O. Raz, J. T. Richards, M. Singh and M. Zalmanovici. "Providing and Comparing Customized Risk Scores for Artificial Intelligence Models," (pending). Filed on 2023-04-28.

J. T. Richards, T. Hampp-Bahnmueller, M. Hind and D. J. Piorkowski. "Dynamic fact contextualization in support of artificial intelligence (AI) model development," (pending). Filed on 2023-03-30.

J. T. Richards, D. J. Piorkowski, S. Houde, Y. Zhang, Q. Liao and R. K. E. Bellamy. "Increasing trust formation and reduce oversight costs for autonomous agents," US Patent No. 11,741,192. Filed on 2021-01-29. Granted on 2023-08-29.

A. Chaudhary, D. Wang, D. J. Piorkowski, D. M. Gruen, C. Gan, P. D. Kirchner, G. Bramble, B. Chen, A. Valente, C. M. Spina, J. T. Richards and A. Bhandwaldar. "Automated analysis generation for machine learning system," (pending). Filed on 2020-09-14.

J. T. Richards, R. K. E. Bellamy, R. G. Farrell, Q. Liao and D. J. Piorkowski. "Agent to bot transfer," US Patent No. 11,316,980. Filed on 2019-11-26. Granted on 2022-04-26.

M. R. Arnold, R. K. E. Bellamy, K. El Maghraoui, M. Hind, S. Houde, K. Kannan, S. Mehta, A. Mojsilovic, R. Raghavendra, D. C. Reimer, J. T. Richards, D. J. Piorkowski, J. Tsay, K. R. Varshney and M. Kesarwani. "Generation and management of an artificial intelligence (AI) model documentation throughout its life cycle," US Patent No. 11,263,188. Filed on 2019-11-01. Granted on 2022-03-01.

M. Hirzel, H. L. Ossher, D. J. Piorkowski and P. Tarr. "Conversational optimization of cognitive models," US Patent No. 10,810,994. Filed on 2018-07-19. Granted on 2020-10-20.

E. Duesterwald, A. Z. Henley, D. J. Piorkowski and J. T. Richards. "Framework of proactive and/or reactive strategies for improving labeling consistency and efficiency," US Patent No. 12,079,648. Filed on 2017-12-28. Granted on 2024-09-03.

G. A. Baudart, J. T. Dolby, E. Duesterwald and D. J. Piorkowski. "Cognitive virtual detector," US Patent No. 11,206,228. Filed on 2017-10-18. Granted on 2021-12-21.

E. Duesterwald, Y. Chen, M. Desmond, H. L. Ossher and D. J. Piorkowski. "Filter for harmful training samples in online learning systems," US Patent No. 10,977,562. Filed on 2017-08-07. Granted on 2021-04-13.

David Piorkowski, Ph.D.

Staff Research Scientist at IBM
Interested in Human-Centered Artificial Intelligence (HCAI)
Working on HCI of AI Governance

This site was last updated on April 14, 2025.

© 2014-2025 David Piorkowski, all rights reserved. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of my employer or sponsors.