Jared Bauer summarizes results of a study I suggested this spring. 202 developers were randomly assigned GitHub Copilot, while the remaining participants were instructed not to use AI tools. The participants were asked to complete a coding task. Developers with GitHub Copilot were 56% more likely to pass all unit tests. Separately, other developers evaluated the submitted code to assess quality and readability. Code from developers with GitHub Copilot was rated better on readability, maintainability, and conciseness. All these differences were statistically significant.
The Effect of Microsoft Copilot in a Multi-lingual Context with Donald Ngwe
We tested Microsoft Copilot in multilingual contexts, examining how Copilot can facilitate collaboration between colleagues with different native languages.
First, we asked 77 native Japanese speakers to review a meeting recorded in English. Half the participants had to watch and listen to the video. The other half could use Copilot Meeting Recap, which gave them an AI meeting summary as well as a chatbot to answer questions about the meeting.
Then, we asked 83 other native Japanese speakers to review a similar meeting, following the same script, but this time held in Japanese by native Japanese speakers. Again, half of participants had access to Copilot.
For the meeting in English, participants with Copilot answered 16.4% more multiple-choice questions about the meeting correctly, and they were more than twice as likely to get a perfect score. Moreover, in comparing accuracy between the two scenarios, people listening to a meeting in English with Copilot achieved 97.5% accuracy, slightly more accurate than people listening to a meeting in their native Japanese using standard tools (94.8%). This is a statistically significant difference (p<0.05). The changes are small in percentage point terms because the baseline accuracy is so high, but Copilot closed 38.5% of the gap to perfect accuracy for those working in their native language (p<0.10) and closed 84.6% of the gap for those working in (non-native) English (p<0.05).
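To make the gap-closure arithmetic concrete, here is a minimal sketch in Python. The roughly 83.8% no-Copilot English baseline and roughly 96.8% with-Copilot Japanese accuracy are back-solved from the figures above, not reported directly.

```python
# Gap closure: the share of the distance to perfect (100%) accuracy
# that Copilot closed. The 83.8% and 96.8% values below are implied
# by the reported numbers rather than stated in the summary.

def gap_closed(baseline, treated):
    """Fraction of the gap to 100% accuracy closed by the treatment."""
    return (treated - baseline) / (100 - baseline)

# English meeting: 97.5% accuracy with Copilot, 84.6% of the gap closed,
# implying a no-Copilot baseline near 83.8%.
english_baseline = (97.5 - 84.6) / (1 - 0.846)             # ~83.8
print(round(gap_closed(english_baseline, 97.5), 3))        # ~0.846
print(round(97.5 / english_baseline - 1, 3))               # ~0.164, the reported 16.4% gain

# Japanese meeting: 94.8% accuracy without Copilot, 38.5% of the gap
# closed, implying roughly 96.8% with Copilot.
japanese_with_copilot = 94.8 + 0.385 * (100 - 94.8)        # ~96.8
print(round(gap_closed(94.8, japanese_with_copilot), 3))   # 0.385
```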
Summary from Jaffe et al., Generative AI in Real-World Workplaces, July 2024.
Impact of M365 Copilot on Legal Work at Microsoft
Teams at Microsoft often reflect on how Copilot helps. I try to help these teams both by measuring Copilot usage in the field (as they do their ordinary work) and by running lab experiments (idealized versions of their tasks in environments where I can better isolate cause and effect). This month I ran an experiment with CELA, Microsoft’s in-house legal department. Hossein Nowbar, Chief Legal Officer and Corporate Vice President, summarized the findings in a post on LinkedIn:
Recently, we ran a controlled experiment with Microsoft’s Office of the Chief Economist, and the results are groundbreaking. In this experiment, we asked legal professional volunteers on our team to complete three realistic legal tasks and randomly granted Copilot to some participants. Individuals with Copilot completed the tasks 32% faster and with 20.3% greater accuracy!
Copilot isn’t just a tool; it’s a game-changer, empowering our team to focus on what truly matters by enhancing productivity, elevating work quality, and, most importantly, reclaiming time.
All findings statistically significant at p<0.05.
Early LLM-based Tools for Enterprise Information Workers Likely Provide Meaningful Boosts to Productivity
Early LLM-based Tools for Enterprise Information Workers Likely Provide Meaningful Boosts to Productivity. Microsoft Research Report – AI and Productivity Team. With Alexia Cambon, Brent Hecht, Donald Ngwe, Sonia Jaffe, Amy Heger, Mihaela Vorvoreanu, Sida Peng, Jake Hofman, Alex Farach, Margarita Bermejo-Cano, Eric Knudsen, James Bono, Hardik Sanghavi, Sofia Spatharioti, David Rothschild, Daniel G. Goldstein, Eirini Kalliamvakou, Peter Cihon, Mert Demirer, Michael Schwarz, and Jaime Teevan.
This report presents the initial findings of Microsoft’s research initiative on “AI and Productivity”, which seeks to measure and accelerate the productivity gains created by LLM-powered productivity tools like Microsoft’s Copilot. The many studies summarized in this report, the initiative’s first, focus on common enterprise information worker tasks for which LLMs are most likely to provide significant value. Results from the studies support the hypothesis that the first versions of Copilot tools substantially increase productivity on these tasks. This productivity boost usually appeared in the studies as a meaningful increase in speed of execution without a significant decrease in quality. Furthermore, we observed that the willingness-to-pay for LLM-based tools is higher for people who have used the tools than those who have not, suggesting that the tools provide value above initial expectations. The report also highlights future directions for the AI and Productivity initiative, including an emphasis on approaches that capture a wider range of tasks and roles.
Studies I led that are included within this report:
Randomized Controlled Trials for Microsoft Copilot for Security with James Bono, Sida Peng, Roberto Rodriguez, and Sandra Ho. Updated March 29, 2024.
Randomized Controlled Trials for Microsoft Copilot for Security. SSRN Working Paper 4648700. With James Bono, Sida Peng, Roberto Rodriguez, and Sandra Ho.
We conducted randomized controlled trials (RCTs) to measure the efficiency gains from using Security Copilot, including speed and quality improvements. External experimental subjects logged into an M365 Defender instance created for this experiment and performed four tasks: Incident Summarization, Script Analyzer, Incident Report, and Guided Response. We found that Security Copilot delivered large improvements in both speed and accuracy, for novices and security professionals alike.
(Also summarized in What Can Copilot’s Earliest Users Teach Us About Generative AI at Work? at “Role-specific pain points and opportunities: Security.” Also summarized in AI and Productivity Report at “M365 Defender Security Copilot study.”)
Sound Like Me: Findings from a Randomized Experiment with Donald Ngwe
Sound Like Me: Findings from a Randomized Experiment. SSRN Working Paper 4648689. With Donald Ngwe.
A new version of Copilot for Microsoft 365 includes a feature to let Outlook draft messages that “Sound Like Me” (SLM) based on training from messages in a user’s Sent Items folder. We sought to evaluate whether SLM lives up to its name. We find that it does, and more. Users widely and systematically praise SLM-generated messages as being more clear, more concise, and more “couldn’t have said it better myself”. When presented with a human-written message versus an SLM rewrite, users say they’d rather receive the SLM rewrite. All these findings are statistically significant. Furthermore, when presented with human and SLM messages, users struggle to tell the difference, in one specification doing worse than random.
(Also summarized in What Can Copilot’s Earliest Users Teach Us About Generative AI at Work? at “Email effectiveness.” Also summarized in AI and Productivity Report at “Outlook Email Study.”)
Measuring the Impact of AI on Information Worker Productivity with Donald Ngwe and Sida Peng
Measuring the Impact of AI on Information Worker Productivity. SSRN Working Paper 4648686. With Donald Ngwe and Sida Peng.
This paper reports the results of two randomized controlled trials evaluating the performance and user satisfaction of a new AI product in the context of common information worker tasks. We designed workplace scenarios to test common information worker tasks: retrieving information from files, emails, and calendar; catching up after a missed online meeting; and drafting prose. We assigned these tasks to 310 subjects, who were asked to find relevant information, answer multiple-choice questions about what they found, and write marketing content. In both studies, users with the AI tool were statistically significantly faster, a difference that holds both on its own and when controlling for accuracy/quality. Furthermore, users who tried the AI tool reported higher willingness to pay relative to users who merely heard about it but didn’t get to try it, indicating that the product exceeded expectations.
(Also summarized in What Can Copilot’s Earliest Users Teach Us About Generative AI at Work? at “A day in the life” and “The strain of searching.” Also summarized in AI and Productivity Report at “Copilot Common Tasks Study” and “Copilot Information Retrieval Study.”)
An Introduction to the Competition Law and Economics of “Free” with Damien Geradin
Benjamin Edelman and Damien Geradin. An Introduction to the Competition Law and Economics of ‘Free’. Antitrust Chronicle, Competition Policy International. August 2018.
Many of the largest and most successful businesses today rely on providing services at no charge to at least a portion of their users. Consider companies as diverse as Dropbox, Facebook, Google, LinkedIn, The Guardian, Wikipedia, and the Yellow Pages.
For consumers, it is easy to celebrate free service. At least in the short term, free services are often high quality, and users find a zero price virtually irresistible.
But long-term assessments could differ, particularly if the free service reduces quality and consumer choice. In this short paper, we examine these concerns. Some highlights:
First, “free” service tends to be free only in terms of currency. Consumers typically pay in other ways, such as seeing advertising and providing data, though these payments tend to be more difficult to measure.
Second, free service sometimes exacerbates market concentration. Most notably, free service impedes a natural strategy for entrants: offer a similar product or service at a lower price. Entrants usually can’t pay users to accept their service. (That would tend to attract undesirable users who might even discard the product without trying it.) With prices stuck at zero, entry becomes more difficult, effectively shielding incumbents from competition.
In this short paper, we examine the competition economics of “free” — how competition works in affected markets, what role competition policy might have and what approach it should take, and finally how competitors and prospective competitors can compete with “free.” Our bottom line: While free service has undeniable appeal for consumers, it can also impede competition, and especially entry. Competition authorities should be correspondingly attuned to allegations arising out of “free” service and should, at least, enforce existing doctrines strictly in affected markets.
Updated Research on Discrimination at Airbnb with Jessica Min
In December 2015, Mike Luca, Dan Svirsky, and I posted the results of an experiment in which we created test Airbnb guest accounts, some with black names and some with white names, finding that the latter got favorable responses from hosts more often than the former. Black users widely reported similar problems — Twitter #AirbnbWhileBlack — and in September 2016 Airbnb responded with a report discussing the problem and Airbnb’s plans for response.
I promptly posted a critique of Airbnb’s plans, broadly arguing that Airbnb’s commitments were minimal and that the company had ignored a simpler and more effective alternative. But ultimately the proof is in the results. Do minority guests still have trouble booking rooms at Airbnb? Available evidence indicates that they do.
Below is a table based on the work of Jessica Min (Harvard College ’18) as part of her undergraduate thesis measuring discrimination against Muslim guests. The table summarizes eight studies, with data collected as early as July 2015 (mine) and as late as November-December 2017 (hers), the latter postdating Airbnb’s report by more than a year. Each study finds minority users at a disadvantage, statistically significantly so. (A minimal sketch of the response-rate comparison underlying such audit studies appears after the table.)
| Author, title, place and year of publication | Dates of data collection | Sample size | Summary of findings | Noteworthy secondary findings |
| --- | --- | --- | --- | --- |
| Edelman, Benjamin, Michael Luca, and Dan Svirsky. Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment. American Economic Journal: Applied Economics, 2017. | July 2015 | 6,400 listings across five U.S. cities | Guests with distinctively black names received positive responses 42% of the time, compared to 50% for white guests. | Results were persistent across types of hosts (i.e., race, gender, experience level, type and neighborhood of listing). Discrimination was concentrated among hosts with no African American guests in their review history. Hosts lost $65 to $100 of revenue for each black guest rejected. |
| Ameri, Mason, Sean Rogers, Lisa Schur, and Douglas Kruse. No Room At The Inn? Disability Access in The New Sharing Economy. Working paper, 2017. | June to November 2016 | 3,847 listings across 48 U.S. states | Guests with disabilities received positive responses less often. Hosts preapproved 75% of guests without disabilities, but only 61% of guests with dwarfism, 50% of blind guests, 43% of guests with cerebral palsy, and 25% of guests with spinal cord injury. | Airbnb’s non-discrimination policy, which took effect midway through data collection, did not have a significant effect on host responses to guests with disabilities. |
| Ahuja, Rishi, and Ronan C. Lyons. The Silent Treatment: LGBT Discrimination in the Sharing Economy. Working paper, 2017. | June – July 2016 | 794 listings in Dublin, Ireland | Guests in male same-sex relationships were approximately 25 percentage points less likely to be accepted than identical guests in heterosexual relationships or female same-sex relationships. | The difference was driven by non-responses from hosts, not outright rejection. The difference persisted across a variety of host and location characteristics. Male hosts and hosts with many listings were less likely to discriminate. |
| Cui, Ruomeng, Li, Jun, and Zhang, Dennis J. Working paper, 2016. Three audit studies; results summarized as to guests without prior reviews. | September 2016 | 598 listings in Chicago, Boston, and Seattle | Guests with distinctively black names received positive responses 29% of the time, compared to 48% for white guests. | The authors assess hosts’ apparent reasons for discrimination, including whether hosts were engaged in statistical discrimination and whether reviews reduce the problem of discrimination. |
| Cui, Li, and Zhang (second audit study) | October – November 2016 | 250 listings in Boston and Seattle | Guests with distinctively black names received positive responses 41% of the time, compared to 63% for white guests. | |
| Cui, Li, and Zhang (third audit study) | July – August 2017 | 660 listings in Boston, Seattle, and Austin | Guests with distinctively black names received positive responses 42% of the time, compared to 53% for white guests. | |
| Sveriges Radio’s Kaliber show, Sweden | October 2016 | 200 listings in Stockholm, Gothenburg, and Malmö | For hosts who said no to guests with black-sounding names, a second inquiry was then sent from a guest with a white-sounding name. Of hosts who had previously declined the black guest, many told the white guest that the listing was available. | Methodology follows longstanding testing for discrimination in US housing markets, sending a white applicant after a landlord declines a black prospective tenant. |
| Min, Jessica. No Room for Muhammad: Evidence of Discrimination from a Field Experiment over Airbnb in Australia. Undergraduate honors thesis, 2018. | November – December 2017 | 813 listings in Sydney, Australia | Guests with distinctively Middle Eastern names received positive responses 13.5 percentage points less often, compared to identical guests with white-sounding names. | Results were persistent across all hosts, including hosts with shared properties and those with expensive listings. Discrimination was most prominent for hosts with highly sought-after listings, where hosts can reject disfavored guests with confidence of finding replacements. |
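As promised above, here is a minimal sketch of the two-proportion comparison underlying audit studies like these, using the 42% vs. 50% response rates from the first row. The even 3,200/3,200 split of the 6,400 inquiries is an assumption for illustration; the actual design is detailed in the paper.

```python
from math import sqrt
from statistics import NormalDist

# Two-proportion z-test, sketched with the Edelman, Luca, and Svirsky
# response rates. The even split of inquiries is assumed, not reported.
n_black, n_white = 3200, 3200
p_black, p_white = 0.42, 0.50

# Pooled response rate under the null hypothesis of no discrimination.
pooled = (p_black * n_black + p_white * n_white) / (n_black + n_white)
se = sqrt(pooled * (1 - pooled) * (1 / n_black + 1 / n_white))
z = (p_white - p_black) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided

print(f"z = {z:.2f}, p = {p_value:.2g}")  # an 8-point gap at this scale is far beyond chance
```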
My bottom line remains as I remarked in fall 2016: Airbnb’s proposed responses are unlikely to solve the problem and indeed have not done so. Truly fixing discrimination at Airbnb will require more far-reaching efforts, likely including preventing hosts from seeing guests’ faces before a booking is confirmed. Anything less is a distraction, demonstrably insufficient to solve this important and long-festering problem.
On Uber Selling Southeast Asia Business to Grab
Uber and Grab provide much the same service — ride-hailing that lets casual drivers, using their personal cars, transport passengers in on-demand service. In the markets where both operate, in Southeast Asia, they’ve been locked in a price war. Grab has local expertise and, in many countries, useful product customizations to suit local needs. Uber is an international powerhouse. It hasn’t been obvious which would win, and both firms have spent freely to attract drivers and passengers. Today the companies announced that Uber would sell its Southeast Asia assets to Grab.
It’s clear why both companies like the deal. They’d end costly competition with each other — saving billions on incentives to both drivers and passengers. Dividing the world market — with Grab dominating Southeast Asia, Didi in China (per a 2016 transaction), and Uber most everywhere else — they can improve their income statements and begin to profit.
But for every dollar of benefit to Grab and Uber, there is a corresponding cost to drivers and passengers. Free of competition from each other, neither company will see a need to pay bonuses to drivers who complete a target number of rides at target quality. Nor will they see a reason to offer discounts to passengers who direct their business to the one company rather than the other. And with drivers and passengers increasingly dependent on a single intermediary to connect them, Grab will be able to charge a higher markup — a price increase that harms both sides.
Some will protest that aggrieved passengers can take taxis, buses, bikes, or private cars, or just walk. Indeed. But there’s always a substitute. If Coca-Cola and Pepsi merged, customers could still drink water. Antitrust law is, prudently, not so narrow-minded. The relevant question under law is the SSNIP test, assessing customer response to a small but significant and non-transitory increase in price. Facing such an increase, would passengers truly go elsewhere? In my travels in Southeast Asia, I’ve often found Grab and Uber to be 30% cheaper than taxis. That leaves plenty of room for them to raise prices without me, or other similarly situated passengers, finding it worthwhile to switch to taxis. Under the relevant test, then, Grab and Uber are in a separate market from taxis, and they cannot seek shelter in their (perhaps) small market share relative to taxis and other forms of transportation.
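As a toy illustration of how that market-definition logic can be quantified, here is a critical-loss sketch. The 10% rise matches the usual SSNIP convention; the margin figure is purely hypothetical, chosen only to make the arithmetic concrete.

```python
# Critical-loss arithmetic for a SSNIP. All numbers are hypothetical
# illustrations; none come from Grab's or Uber's actual financials.

price_rise = 0.10  # the conventional 10% SSNIP
margin = 0.30      # assumed price-cost margin of the hypothetical monopolist

# Critical loss: the share of sales the monopolist could lose before
# the price rise stops being profitable.
critical_loss = price_rise / (price_rise + margin)
print(f"critical loss: {critical_loss:.0%}")  # 25%

# If ride-hail fares start roughly 30% below taxis, a 10% rise still
# leaves ride-hailing cheaper. Actual switching plausibly falls short of
# this threshold, so the SSNIP is profitable and ride-hailing stands as
# its own relevant market.
```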
Separately, it’s not apparent what alternative is available to Grab and Uber drivers. Facing higher fees from Uber, what exactly are they supposed to do? They certainly can’t become taxi drivers (requiring special licenses, special vehicles, and more). There’s no obvious easy alternative for them. For drivers, ride-hailing is plainly distinct from other forms of transportation and other work.
The short of it is, ride-hailing is different from alternatives. Grab, Uber, passengers, and regulators know this instinctively, and extended economic and legal analysis will confirm it. With Grab and Uber in a distinct market, they jointly have near-complete market share in the markets where both operate. Under antitrust law, they should not and cannot be permitted to merge. No one would seriously contemplate a merger of Lyft and Uber in the US, and sophisticated competition regulators in Southeast Asia should be equally strict.
Additional concerns arise from the special role of SoftBank, the Japanese investment firm that held shares in both Grab and Uber. Owning portions of both companies, SoftBank cared little which one prevailed in the markets where both operated. But more than that, SoftBank specifically sought to broker peace between Grab and Uber: when investing in Uber in December 2017, SoftBank sought a discount exactly because it could influence Uber’s competitors across Asia. Such overlapping ownership — intended to reduce competition — raises particularly clear concerns under competition law. A Grab spokesman tried to allay these concerns by claiming the transaction was “a very independent decision by both companies [Grab and Uber]” — yet in the next sentence noted that “Masa [SoftBank CEO Masayoshi Son] was highly supportive of the” transaction (emphasis added).
The Grab-Uber transaction follows Uber’s summer 2016 agreement to cede China to Didi, which led that firm to an unchallenged position in that market. News reports indicate higher prices and inferior service after the Didi-Uber transaction — the same results likely to arise in the Southeast Asia markets where Uber and Grab propose to combine.