This white paper was written close to a year ago as of this posting, so some of its contents may have lost accuracy. The analysis framework itself, I believe, remains relevant.
Preliminary note: The viewpoints expressed in this document are the opinions of the author and do not necessarily reflect those of any organization.
Overview
Purpose
This document attempts to provide a framework to analyze the benefit provided to organizational software development efforts by generative AI, with specific attention to augmentation of developer efficiency.
An obvious question at the outset is what benefit a new analysis can provide at this juncture. Many analyses have been performed in this domain; several by the creators of leading generative AI tools. The aim of this document is to satisfy a niche of organizations that find themselves of two minds concerning generative AI tools for software development. In doing so it attempts secondarily to provide a conclusion, but primarily to establish an analysis that acts as a decision-making framework.
Details of this analysis will be elaborated below. Specific attention is paid to GitHub’s CoPilot AI, which by most accounts leads as the mainstay of AI-augmented development. A number of other tools are also considered and compared, as well as the trivial null or “no change” option of declining to engage generative AI for this purpose.
All alternative tools will be measured by the expectations set by CoPilot. At the conclusion, some overall considerations relevant to the decision to implement generative AI for software development will be examined and discussed.
It is understood that the state of the field of generative AI unfolds rapidly, and no report can comprehensively assess an organic, growing medium. That being said, there remains distinct value in periodically gauging the current state of affairs. A sufficiently careful approach from a broader viewpoint may remain pertinent for a reasonable timeframe, certainly long enough to inform relevant choices for developers and for the decision-makers charged with managing software development.
It should be noted that for the remainder of this document, when the term “CoPilot” is used, this relates to the code generation tool CoPilot, and not to the more general companion implementation of the same name as can be found in Office365, etc.
Value Proposition of this Document
As the field of AI-augmented development has rapidly grown, consideration of this technology, its utility, benefits, and downsides has been broadly assessed by many interested groups.
Each group, organization, and individual considering the use of AI must bring its own distinct needs and viewpoint to this decision. The expectation of this document is that there is added utility in providing a means to assess AI-augmented development in a way that enables the reader to reach whatever conclusions are deemed pertinent to their own context. In other words, this document is less interested in determining whether CoPilot, and generative AI coding in general, is “good” or “useful,” and more interested in establishing a set of heuristics capable of generating results applicable and unique to each independent context.
In pursuit of this goal, the use of CoPilot will be assessed along a number of axes, listed below. These terms will be more concretely established by definitions later on.
Axes of assessment:
- Technical vitality
- Novelty of form
- Stability and reliability
- Security and privacy
- Augmentation of development efficiency
This assessment will be accompanied by two further steps:
- To broadly assess the general state of affairs of AI-augmented development with focus on CoPilot in particular, and
- To understand its benefits, or lack thereof, for software development, inasmuch as software development aims to meet high standards of quality, safety, reusability, intelligibility, and efficiency.
In sum, this document will attempt to cover the following main items of discussion:
- A review of the purposes and uses of CoPilot, as compared to other code assistance tools, or no code assistance tool
- The current effects of CoPilot use in practice
- An evaluation of CoPilot as a tool as assessed across a number of parameters – this with an eye towards subjective assessment of organizational use case rather than strictly objective metrics-based benefits
- A discussion of the results of the above analyses.
Key Findings
As mentioned, the findings of this document are not strictly its main feature. However, noting these findings helps to explain the framework of assessment, so they are noted here and elaborated in context further below.
Prefatory notes to findings:
- The evaluations in this document focus primarily on CoPilot with a Business license, and a small selection of comparable alternative tools. The market space of demonstrably high-quality augmentation tools remains minimal despite a number of startups attempting comparable efforts.
- The differences between CoPilot Business and CoPilot Individual / CoPilot Enterprise, while not insignificant in some regards, remain minimal enough to be largely disregarded in this analysis.
- As noted earlier, the purpose of this document is to support a subjective determination of the utility of CoPilot in 2024, attempting to frame this subjective conclusion with factors as close to objective as possible. Naturally, specifics of what these findings mean, and how they may be used in determining the viability of CoPilot as a business solution, will be elaborated somewhat further in the discussion at the end of this document. The purpose of listing them here is only to provide a synopsis of how CoPilot may prove a means to an end.
Finding 1: CoPilot’s primary distinction when compared to other code generation products appears in the form of additional tools offered. As of this writing, qualitative code differences appear to be minimal, with some alternative tools sometimes subjectively presenting “usable” code more often than CoPilot, perhaps due to curation. For the most part, however, CoPilot can continue to be used as the representative standard of augmented code generation. For the purposes of this section, we will consider CoPilot to be archetypal of all specific code generative AI tools.
Finding 2: For reasonably experienced software developers, CoPilot introduces a speed boost for the developer’s work, taken on average. Inexperienced developers or those working in unfamiliar territory (new frameworks, etc) often incur additional work to interpret and analyze CoPilot’s output. Additionally, for any developer, benefit to speed comes only once they have had time to become familiar with CoPilot and integrate it intuitively into their daily development workflow.
Finding 3: CoPilot may be viewed more accurately as an assistant than a guide. This role may change with time. For now this limitation is a natural outcome of a fairly universal (if diminishing) limitation of generative AI, to which CoPilot is no exception. Developers report a need to keep a close eye on CoPilot outputs, and understanding these outputs and their contextual fit stands as the key to using CoPilot effectively – a result which is not surprising.
Finding 4: As a corollary to Finding 1 and Finding 3, the benefits provided by CoPilot scale according to how readily the user can assess the quality of generated code and the validity of its applicability to the solution. This is not only a logical conclusion, it is also borne out in anecdotal surveys.
General Introduction
Definition of Terms
Throughout this document some descriptive terms will be used with narrower meanings than in general use. These terms are listed below so that when referenced, they can be understood in the specific meaning intended.
Additionally, some further terms particular to the assessment process will be defined below in the Definition of Parameters section. Each of these terms is provided a fuller description as they are the focus of evaluation and functionally constitute titles rather than descriptive terms.
Terms:
- Code output: Any generated code as created by an AI tool. This includes code completion, and code generation from scratch, but always with the input of the user, via prompt.
- Augment / Augmentation: The use of AI as companion; generative AI in its generation of outputs prompted by a human in the creative process.
- Prompt: This term is used generally in adherence to its meaning within generative AI; that is, the user input upon which a generated output is based. In our particular case this will represent preliminary code, context, and description as used to provide the input for an AI-generated code output, in one of several forms as will be surveyed later on.
Definition of Parameters
This section provides a breakdown of the evaluation parameters that will be used throughout the document as described above. CoPilot’s variance of usage and relative subjectivity of outcome relegate standard code metrics to a backseat – “good” code is theoretically objective but in reality often context-dependent.
Therefore to provide some ability to rank or evaluate CoPilot (and any similar tool), a methodology of evaluation must necessarily be determined. The definitions below attempt to satisfy this need.
No expectation exists that these terms are complete, or the parameters exhaustive. Nonetheless the intent is to satisfy the majority of practical and theoretical considerations that might be pertinent to an organization considering CoPilot against its alternatives, or against no code generation tool at all.
The parameters of evaluation, and their summary definition, are as follows. Full definition will be found below.
- Technical vitality: Strict code quality moderated by contextual pertinence
- Novelty of form: Originality or “humanness” of code output
- Stability & reliability: Robustness and resistance to environmental changes
- Security & privacy: Technical security and proper data management, treated broadly
- Augmentation of efficiency: Subjective improvement of efficiency, contextually defined
Technical vitality
This parameter of evaluation assesses code quality when taken as a whole, combined with a sense of pertinence to the subject matter and the broader context of the code, as technically viable code may be off-target in its implementation choices.
Since this assessment is necessarily subjective and experiential, the term vitality will represent the sense of “reasonable code” that a developer might form when reviewing the code as if it were written by a coworker. Terms such as quality, functionality, or applicability each capture only a portion of the subject in question, which is instead treated here as one overarching parameter; other related factors such as efficiency will form separate axes of evaluation, as they are distinguishable and separable from pertinence.
Novelty of form
Despite the formality inherent to code, a great deal of expressiveness tends to emerge as a codebase grows. This expressiveness may be pragmatically defined as the intrinsic, intuitive appearance of “creativity” in the code, of the kind that reflects not the arbitrary use of imagination but rather the application of exploratory thought in solving a problem. In practice this is the sense a developer might get, drawing from experience, when code they read fits tightly to the context and shows evidence of having been crafted carefully rather than written by rote, leveraging a code pattern blindly, or copying and pasting a solution. In other words, the code being applied (whether a single line or function, or a complete feature) is developed with an eye towards the unique vector of the solution.1
In many ways this parameter is least important for evaluation, as the large majority of development use cases require code to be high quality and repeatable rather than creative or sharp, to employ a different term. Interesting code tends to be poor code, or at best, code that ought to be relegated to academic settings for research purposes until standardized.
However, the utility of ranking CoPilot outcomes by novelty of form lies not in this being an important outcome in itself, but rather in its representing an innate quality of the model generating the code. A display of novelty will show that CoPilot is capable of flexible, pertinent solutions. The extent to which this can exist at all in current generative models is of course severely limited, as all output will be generated almost entirely, if not entirely, as a weighted representation of the data upon which it was trained – the form of the solution cannot deviate from that of the input data.
In practice, some models (especially those augmented by hard-coded modifications) may display more “innovation” than others, in the sense of adapting their outputs with some degree of apparent enthusiasm for solving the problem creatively. Sometimes this can be adjusted by the user via controls like the temperature of the generation, but even so the output will be dependent on the tuning of the model. Therefore it seems worthwhile to assess outputs and outcomes on this parameter.2
Stability & reliability
Stable code may be defined variably in context. Often it is defined loosely, interpreted as representing a lack of bugs in the code, or code that “seems” to work well. For the purposes of this evaluation stability will be defined as code that is not just performant, but robust. Code which must be tweaked for every change of the surrounding code context is unstable (e.g. ordered function parameters when a parametrized dictionary could be applied, or alteration of input format causing a fatal error). Stable code is least subject to the “whimsy” of its surrounding application context; i.e. the state of the program and its environment.
Relatedly, code can be considered reliable when it continues to run as expected beyond ideal circumstances; code that often fails to work as expected when the environment changes can be considered unreliable (e.g. for lack of checks for the existence of a property).
Taken together, these two ideas, stability and reliability, can best be assessed in this context as one evaluation parameter, gauging the general robustness of a code generation.
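To make the distinction concrete, consider the following minimal sketch (the function and field names are hypothetical illustrations, not drawn from any generated output): the first variant breaks whenever the caller’s argument order or input shape changes, while the second tolerates both.

# Fragile: positional parameters must be supplied in exactly this order,
# and a missing "email" key elsewhere would surface as an unhandled error.
def format_contact_fragile(name, email, phone):
    return f"{name} <{email}> ({phone})"

# More robust: accepts a mapping, tolerates missing fields, and rejects
# malformed input explicitly rather than failing mid-execution.
def format_contact_robust(contact):
    if not isinstance(contact, dict):
        raise TypeError("contact must be a dict of contact fields")
    name = contact.get("name", "unknown")
    email = contact.get("email", "no email on file")
    phone = contact.get("phone", "no phone on file")
    return f"{name} <{email}> ({phone})"

if __name__ == "__main__":
    print(format_contact_fragile("Ada", "ada@example.com", "555-0100"))
    # The robust version keeps working when the input shape changes.
    print(format_contact_robust({"name": "Ada", "email": "ada@example.com"}))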
Security & privacy
While security in code is a broad topic with many expectations, the use of AI code augmentation presents a somewhat smaller basis.3 Here, we can be concerned solely with how well a given code output maintains the security of the system, either by respecting the efforts of surrounding code to implement security features (if present), or by properly managing data (whether user input or from other areas of code). In practice, this may range from sanitizing inputs to leveraging authorization functions to avoiding the use of certain tools or libraries where insufficiently standardized. Privacy effectively constitutes a subsection of this domain, and is mentioned by name in order to ensure that this key area is not glossed over during assessment.
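As a small hedged illustration of this narrower concern (hypothetical helper names, not actual CoPilot output), the first function below undermines whatever escaping the surrounding template code performs by embedding user input verbatim, while the second sanitizes the input and so respects the surrounding security posture.

import html

# Risky: user-supplied text is dropped straight into markup,
# allowing script injection if the surrounding code does not escape it.
def render_comment_unsafe(comment_text):
    return f"<div class='comment'>{comment_text}</div>"

# Safer: the input is escaped before being embedded,
# preserving the security posture of the surrounding template code.
def render_comment_safe(comment_text):
    return f"<div class='comment'>{html.escape(comment_text)}</div>"

if __name__ == "__main__":
    hostile = "<script>alert('hijacked')</script>"
    print(render_comment_unsafe(hostile))  # script tag passes through verbatim
    print(render_comment_safe(hostile))    # rendered inert as &lt;script&gt;...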
Augmentation of efficiency
This parameter can be assessed metrically, but may nonetheless prove difficult to pin down effectively in many contexts. Necessarily, efficiency is dependent on the expectations placed on the individual and the team where development work is involved, and as such will prove highly contextually variable. Ranking along this axis is best defined by the expected and demonstrated relative improvement attributable to code generation. Of course, this aspect presupposes all other axes to be reasonable; the assumption when assessing this parameter must be that vitality, novelty, robustness, and safety all hold to roughly consistent standards.
Measurable outcomes such as time to completion of task, lines of code generated (a questionable metric at the best of times), and number of goals or requirements met in development all are known to mask targets such as those defined above; a completed feature may easily generate bugs that far exceed the initial development time, and ticketed work closed may hide questions that should have been asked during the implementation (which would have caused it to take longer, but with a possibly more robust or accurate result).
Observable, trackable metrics should not be dismissed, but should be taken in context.
Technical Overview
Summary of CoPilot and related tools
The reader of this document is assumed to be familiar with CoPilot at a conceptual level. Therefore a synopsis of the general features of generative AI as coding tool will be omitted. However it is worth briefly highlighting some specific points of how CoPilot operates in order to set expectations and provide a basis against which to measure.
GitHub provides a definition of the general operation of CoPilot:
[T]he Copilot extension begins by examining the code in your editor—focusing on the lines just before and after your cursor, but also information including other files open in your editor and the URLs of repositories or file paths to identify relevant context.4
Effectively, CoPilot is a code augmentation tool, as discussed above, which uses generative AI to create text completion for prompts in the form of fragments of code. In this way it compares to any other code generation tool fundamentally (see more discussion of this later on). Taken in context of the broader CoPilot code system, which includes tools for managing the development pipeline, CoPilot may naturally be considered the forerunner in the code augmentation space.
The distinctions between different levels of license for CoPilot at the time of writing are, on the whole, minimal, and deal primarily with organizational limitations, IP space, and so on. A few key functional differences do exist which must be recognized when evaluating the utility of CoPilot for a given set of needs:
- Enterprise license CoPilot can be used to index the codebase on which it operates, and thereby provide more catered generations (this is particularly relevant to code generation from minimal prompts, as an un-indexed generation will necessarily trend toward a more “generic” solution while an indexed one will have a greater degree of compatibility with existing code)1.
- In a related restriction, users of Business and Enterprise licenses will not have their code used to train CoPilot models – effectively, the data remains private, though it does necessarily contact the server.5
Baseline Expectations of CoPilot
Before assessing CoPilot from an external point of view, it may be useful to reference as a basis GitHub’s own research analysis of CoPilot’s technical preview.6 GitHub’s primary report was released in September 2022 and therefore should be considered dated, but many of its findings continue to apply, and the internal viewpoint offered serves as a necessary and useful basis from which to assess CoPilot.
This report produced a number of results, with some key findings summarized here:
- Preface 1: Developer productivity is difficult to measure, making impact to productivity difficult to measure as well.
- Preface 2: Satisfaction and conservation of mental energy count for a lot – CoPilot is a “developer’s tool.”
- Finding 1: GitHub Copilot can shoulder the boring and repetitive work of development, and reduce cognitive load.
- Finding 2: Developers reported they complete tasks faster when using GitHub Copilot.
In a subsequent piece of research in late 2023, GitHub found that “85% of developers felt more confident in their code quality when authoring code with GitHub Copilot and Copilot Chat.”7
A number of questions can be asked about this conclusion, including the potential mapping from confidence to measurable use, but it may be most useful to take this finding at a surface level for the moment.
Additional key findings from other published articles include conclusions that add context to the conversation:8,9
- Junior or less experienced developers felt they benefit more from GitHub Copilot.
- Users accept nearly 30% of code suggestions from GitHub Copilot.
- GitHub Copilot is already writing 46% of code (apparently measured by commits, at time of writing).
The implications of these conclusions should be evaluated and interpreted given the understanding that they represent obvious selling points for CoPilot as a business. However, this does not exclude the potential for these points to remain relevant. Understanding the metrics and conclusions of these reports even under the pessimistic assumption that they are cherry-picked from the evidence still provides insight into the trends in benefits that CoPilot provides – and more specifically and critically to the assessment, the value that GitHub intends to provide with CoPilot.
Most recently, GitHub has released CoPilot Workspace, an integrated development environment apparently aimed at further aggregating the development process into one arena, and providing more comprehensive guidance throughout.10,11
Taken generally, the single most visible advantage of tools like CoPilot is the ability to power through large amounts of code in a shorter period of time. This applies especially where code is reiterative, tedious, or “boilerplate” and needs to be written and rewritten in only subtly different ways throughout a file or codebase.
The second most visible advantage is the elimination, or reduction, of the expenditure of brainpower on solutions that have been resolved many times.12 An experienced developer possesses a mental dictionary of common problems and their solution patterns, but no developer can memorize all patterns relevant to even a single industry. Additionally, copy-paste-modify techniques are employed by many developers where knowledge of design patterns is lacking, which introduces potential new issues, bugs, or even vulnerabilities.
AI tools are one attempt to minimize this issue by the “natural” process of the AI tool recognizing a nascent pattern and generating satisfactory code without the danger introduced by copy-pasted or haphazard code (although AI may introduce dangerous code through its own indirect method of copying and pasting; more on this later) and without the developer needing to think through the solution beyond verifying that the code meets the necessary requirement – which can easily be checked by visual examination, test sequences, and execution of code.

CoPilot suggesting a line of code in an example function to generate a hashed auth sequence. Note the assumption of a library.
CoPilot provides the option to select among alternatives when generating a code completion. This adds a level of flexibility, but also makes it more important that the developer successfully recognize and select the best code output, as the default choice may not always be optimal.
A prompt of only very minimal size is necessary for CoPilot to start generating suggestions (in fact, CoPilot can be asked to generate code with no existing code, discussed later). In some cases the file name alone can be sufficient to give CoPilot a sense of direction. As with all generative AI, the more targeted the prompt, the more targeted the output is likely to be, and here, the existing code acts as that prompt vector.

CoPilot generating a series of code suggestions using the file name and the single prompt `function`.
However, this form of generation from scratch leads to a point of comparison between using CoPilot to generate code completion, and using any AI service to generate code from scratch: With some limited exceptions, CoPilot and similar tools do not engineer for the developer, but augment the developer, which requires the existing code to be of a reasonable quality, as might be contrasted to using an English prompt to get Meta, ChatGPT, etc, to generate a section of code in its entirety.
The question of how all this will lead to low-code or no-code development in the future remains academic for now. Most workplaces will continue to require developer-led efforts that can be only supplemented by AI tools.
In the meantime, the more practical aspect of this assessment will ask how the AI tool will aid the developer and whether that aid is worth any trade-offs that may be identified.
Applications of CoPilot
At first assessment, CoPilot appears to best serve two use cases, noted previously but restated for pertinence: 1) Assisting the novice developer, so that not every line must be recalled from memory, and the structure of the code can be provided when needed. 2) For a developer of any level, acting as an augment to the speed of development, especially when churning through more rote or boilerplate-style code.
There can also be, depending on the needs of the organization, a further use in providing optionality (in the form of solutions not otherwise considered), ease of development (in the form of allowing narrower developer focus), and most especially effectiveness (in the form of allowing the knowledge of the developer to guide the use of CoPilot’s more accurate “memory” in constructing full solutions with context borne in mind). This last item in particular has the most opportunity for success, but also the highest rate of failure, as will be discussed.
Regardless of the actual viability of using CoPilot for development needs, which will be addressed after the evaluation in the subsequent sections, CoPilot, and similar tools, have taken a spotlight in the industry specifically because they do provide such an obvious supposed benefit in speed, efficiency, and comfort in the development process. Whether such a benefit is real will be a matter of personal evaluation, which the next section hopes to guide.
However, before examining these parameters, there may be some utility in understanding how CoPilot has already been viewed in practical application and day to day use.
Following are presented a handful of opinions on CoPilot gleaned from a variety of sources. None should be taken as individually significant, but each represents facets of the CoPilot experience. How its viability as evaluated by various practical parameters plays into these interpretations will be an exercise in reflecting the subsequent sections back onto this one. Some resolution of this question will be addressed in the conclusion of this document, but the end conclusion must be the reader’s to decide.
The points that follow prove most relevant to our discussion. These comments can be found reiterated across the web in various formats written by various users:
- Many developers note that CoPilot has helped them learn or overcome obstacles rather than generate code with more precision or effectiveness.13,14,15
- Relatedly, CoPilot output might be best viewed as “suggestions” rather than “completions.”16,17
- CoPilot is therefore best suited to experienced developers who can “hold their own” and recognize the best moments to prompt for completion, and to accept CoPilot suggestions.11,18,19
It should not be assumed that these opinions reflect the majority of CoPilot users; outside of GitHub’s own metrics, this would be a difficult assessment to quantify. However, the amount of anecdotal evidence suggests that these opinions are relatively commonly held by long-term users of CoPilot (a quick search in StackOverflow, Twitter, Reddit, or a search engine will replicate similar discussion points).
One additional factor worth noting after reviewing such comments is that the use of CoPilot charges ahead despite any of these considerations: According to one of GitHub’s own reports, 40% of the code Copilot users check in to GitHub is AI-generated and unmodified.20 In other words, and as can be further seen in the details of this and similar articles published by GitHub, CoPilot is used by developers not only in situations where it has proven effective, but more or less anytime opportunity provides. This approach, used as a default, as noted in one quoted article, may produce a “downward effect” on code quality.
However, at this point we risk moving ahead of our assessment. The key point in this particular note is that the true use and “application” of CoPilot may be somewhat self-defining. When evaluating CoPilot and other similar tools for organizational use, this finding can be reckoned however seems most appropriate to the reader.
Technical Evaluation
Preamble in this section is intentionally kept to a minimum. The results of analysis stand as the focus here, and discussion will follow after. In the assessment below, objective trials were applied wherever plausible, with some sections requiring a more subjective approach. In these latter sections the conclusions will be explained and justification attempted, so the reader may come to their own conclusions.
Evaluation of Parameters
Technical Vitality
CoPilot’s ranking on this parameter is effectively the same as the conceptual advancement of the field of LLMs. That is, the current state of this sort of coding accuracy is hit or miss, with the bulk of outcomes slowly listing towards “hit” as the model advances. This outcome can be established by the interested reader by comparing earlier outcomes against current outcomes, as well as viewing earlier reports and discussion around CoPilot trials and generative AI more broadly.
Whether this trend of improvement will continue remains a matter of speculation. Out of a series of ten generations for a given code completion (as measured somewhat arbitrarily – again, the reader should repeat this trial for their own edification), on average nine will be plausibly applicable, one option will be completely inapplicable or erroneous, and one of the plausible nine will be, at the least, reasonably fitting if not “excellent.” This was borne out by user experience across approximately five hundred generations during the course of trials.
What is most commonly missing from generated output, in a subjective sense, is the sense of “exceptional” code. This evaluation falls primarily under the parameter of “Form” and will be discussed below.
Novelty of Form
As noted in the definition of parameters, Form may be the most difficult parameter to define, and the most difficult to emulate in an output. The wide variety of material upon which CoPilot was trained means the outputs it provides in its generations are equally varied. In turn this produces a state such that in a given set of completion options, there may exist one or multiple solutions which may be considered reasonably “novel.”
As of the time of writing, the controls to manage this level of “creativity” are limited. CoPilot is intended most often to be used in as simple a manner as possible, as expected for an augmentation tool – the addition of further options for tuning may not be considered a benefit in most contexts if this makes its use less intuitive.
Therefore in practice there is no (direct) control over temperature within the development environment, no (direct) guidance of output; changes must be created by generation of new outputs or alteration of the code prompt.
In order to alter the outputs offered, not only the preceding code, but additional tabs, filenames, and other aspects of the codebase that CoPilot takes into account in the generation must be manipulated in total as a kind of unified prompt. The extent to which these forms of prompt can be used in practice to vary the “novelty” of the solution is really an exercise for the developer and has no practical limitation.
Few if any CoPilot completions fall within what we might informally call the output of a “rock star developer.” This is true for two reasons: By its nature, CoPilot is trained on a large bulk of code. Statistically, little of this code is likely to be written by an arbitrarily defined “top tier” of developers, or even in scenarios where this level of novelty in coding can easily be applied (much of it originates in settings, such as legacy enterprise software systems, where it cannot), and so the output is less likely to represent this kind of code.
Secondly, “rock star” code by definition requires a high level of understanding of the context of the code. Code completion, and all LLM-based generative AIs, produce subsequent outputs based on statistical achievement, which includes only a limited virtual understanding of the context. While advanced models and techniques can employ larger contexts of evaluation when generating a completion, this remains fundamentally different than a developer gaining a subtle understanding of what is being asked by the requirement and then writing code to suit.
Here we can observe one of the fundamental divisions of CoPilot stated by GitHub’s own research: CoPilot is intended to, and successfully does, supplement the developer in producing code faster, and aid the novice developer in taking the code being written in the correct overall direction. As such, we can say that CoPilot is successful – even significantly so. But we must rank its novelty of form relatively low. As such a subjective measure is extremely difficult to quantify, we must settle for giving it a rating of “adequate.”
Stability & Reliability
This axis requires more rigorous analysis to properly gauge. In order to assess reliability at its most basic level, we will use CoPilot to generate a series of code snippets, test them, and determine how correctly they perform their tasks (assuming they do run at all).
The tests below will be broken out between stability and reliability as two different factors, with the conclusions of each discussed separately.
Stability – Testing
Following are ten definitions for small pieces of functionality. Each functionality’s goal can be defined by a basic prompt, eliminating for the moment the concern of whether CoPilot is taking the context into account.
Each definition will be generated by writing the prompt in a blank document named and suffixed to encourage CoPilot in the proper direction, with the result then assessed for completion and accuracy. Where successful, the outcome will not be elaborated further.
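For clarity, the setup for each trial looked roughly like the following sketch (shown here for Test 1); only the file name and the comment prompt are written by hand, and everything beneath the comment is left for CoPilot to complete.

# count_syllables.py
#
# Open the file named "contents" in the current directory, and count the
# number of syllables across the whole file.
#
# (Nothing else is written manually; the lines below this comment are left
# for CoPilot to generate.)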
Test 1: Read file and process (1)
Language: Python
Prompt: Open the file named “contents” in the current directory, and count the number of syllables across the whole file.
File name: count_syllables.py
Result: Success.
Assessment: Code generated and ran correctly.
Test 2: Scrape live data and output (1)
Language: JS (Node)
Prompt: Scrape the day’s financials from Yahoo Finance, and output them in a formatted PDF file named “financials.pdf”
File name: scrape_financials.js
Result: Failure
Assessment: Code did not generate completely. When the remainder of the code was added manually, the code failed to run.
Test 3: Read multiple files and process
Language: C
Prompt: Open every text file in the specified directory and sum all the numbers in them. Each number should end with a newline character. The program should print the sum of all numbers in all files.
File name: sum_numbers.c
Result: Failure
Assessment: The generated code compiled, and ran, but it failed to properly sum the numbers.
Test 4: Scrape live data and output (2)
Language: Python
Prompt: Scrape the day’s financials from Yahoo Finance, and output them in a formatted PDF file named “financials.pdf”
File name: scrape_financials.py
Result: Failure
Assessment: The generated code runs, and generates a PDF, but the content is empty, sans header (i.e. does not scrape the target).
Test 5: Read file and process (2)
Language: C
Prompt: Open the file named “contents” in the current directory, and count the number of syllables across the whole file.
File name: count_syllables.c
Result: Success
Assessment: Code generated and ran correctly.
Test 6: Run basic HTTP server (1)
Language: Ruby
Prompt: Serve the current working directory over HTTP on port 8000.
File name: serve_folder.rb
Result: Success
Assessment: Code generated and ran correctly.
Test 7: Run basic HTTP server (2)
Language: JS (Node)
Prompt: Serve the current working directory over HTTP on port 8000.
File name: serve_folder.js
Result: Success
Assessment: Code generated and ran correctly, with the minor exception that some excess code was generated and had to be removed in order for the code to run.
Test 8: Read file and sort (1)
Language: Ruby
Prompt: Extract 100 words randomly from the dictionary file, then sort them efficiently. Print three of them to the screen.
File name: random_words.rb
Result: Mixed
Assessment: The code fails, but only due to making an erroneous assumption on what file to open. As this is highly system dependent, it is not reasonable to consider this a “failure”; when the file name is corrected, the generated code operates, and prints. However, only one word printed, rather than three.
Test 9: Read file and sort (2)
Language: Go
Prompt: Extract 100 words randomly from the dictionary file, then sort them efficiently. Print three of them to the screen.
File name: random_words.go
Result: Success
Assessment: As in Test 8, the code initially fails due to an erroneous assumption about which file to open. As this is highly system dependent, it is not reasonable to consider this a “failure”; when the file name is corrected, the generated code operates, and prints correctly.
Test 10: Create login form
Language: JS (Angular)
Prompt: A small self-contained login form using Angular
File name: login_form.js
Result: Failure
Assessment: The component was not fully formed. Even under the assumption that dependencies are properly managed, the generated code would not be able to be used in an Angular project.
Stability – Takeaways
- All of the examples provided are extremely limited in scope. Full codebases remain unrealistic to generate this way, via direct prompt, and even a more complex standalone script would not easily be generated by a single prompt without some amount of adjustment.
- Interestingly, raw code generation remains one area where the use of a standard LLM interface would be more useful than CoPilot and related tools: Asking ChatGPT, for example, to generate code by providing a full descriptive example, then iterating step by step, will be faster, and usually more complete and correct, than using CoPilot for similar purposes. Even CoPilot’s CLI Chat feature will not fully meet this need. But as an auxiliary co-developer, CoPilot certainly proves preferable. Additionally, CoPilot’s more recent addition of its Workspace environment attempts to close this gap by providing opportunities for CoPilot to more completely generate code from scratch. It is reasonable to expect that progress will continue in this direction. However, the principal nature of CoPilot as auxiliary coder, rather than as an interpreter of the intent of the prompter, will remain for the foreseeable future.
- Individual methods, functions, processes, and so on, often all follow similar generation methods one to the other, and the use of small scripts generated in a single sequence of completion allows a glimpse into the fundamental eligibility of generated code.
- The simplicity or complexity of a given language appears to influence the quality of generated code. This makes intuitive sense. For instance, the use of Ruby’s built-in sort eliminates the need to import and use a library, or employ a custom sort. The more self-contained the code, the more likely it seems to work correctly – which seems reasonable at least on the surface, as the addition of external context is where any NLP-based model is most likely to demonstrate a weakness.
- It should also be noted that in some of the test cases above, multiple attempts were needed in order to generate usable code. Repeated attempts were applied only when an iteration failed to prompt for autocompletion (i.e. nothing at all was returned for a given line / set of lines). A prompt that generated code which then failed to run or ran incorrectly was not re-run, and was considered a failure.
- In some cases, the code which was generated operated extremely inefficiently. One prime example of this was test case 8, in which the code CoPilot generated included the following:
# Read the dictionary file
dictionary = File.readlines("dictionary.txt")

# Extract 100 words randomly
words = dictionary.sample(100)
This code functions. Ruby’s internal balancing act even makes the output to the end user appear reasonable. However, reading the entire dictionary file into memory in order to sample 100 words represents an extremely poor choice of implementation.
- Interestingly, in almost all cases tested (those listed above and others), .js files incurred assumptions of Node when no other framework was specified. Several potential reasons might explain this; no speculation will be given here.
- One additional persistent issue was demonstrated in a curious concatenation of code and documentation.
Some of this output came in the form of extraneous comments, but much of it appeared to be sourced from README or other markdown files, and it renders the code unusable in the large majority of cases. Often, but not always, this appears to be the result of CoPilot attempting to add output where the code sample is effectively complete, which is reasonable – CoPilot’s purpose is to suggest the next step, which in a sense, after the full generation of code, is the generation of documentation. Since this generation carries a potential for error, it is one more thing the developer must watch for: cases where CoPilot generates extraneous material. In a realistic context there is no guarantee that CoPilot will not generate this sort of documentation within the code, wherever a given context appears to be “complete” but should in no way include documentation verbatim.
- It seems clear that while generated code often works, and is occasionally even performant, it is almost never robust. Code from CoPilot is almost guaranteed to be fragile. This is also not surprising as, frankly, most publicly available code is written to highly specific and often inflexible needs (say, a custom-built tool of a hobbyist, uploaded to GitHub).
Reliability – Testing
So much for stability. What about reliability? Can the code that CoPilot generates be ported and reused without significant adaptation? To assess this, we will need a different sort of test than those used above.
In order to provide the narrowest of vectors for comparison (thereby aiming to minimize the opportunity for confounding) we ask CoPilot to generate functions, or methods, in a similar comment-prompt manner as before, and then attempt to reuse those outputs in different contextual environments.
We will attempt two different use cases for this, one in a minor, self-contained context, and one in a more significant codebase representing an established project (since such codebases represent much more code by content than greenfield projects, despite the need for CoPilot possibly being higher at the outset of a project when generating boilerplate, etc).
Test 1: Data concatenation tool
One such example outlines the beginning of a simple model data concatenation tool in Python:
# Open the CSV file named "content.csv" and read the data as follows:
# Read column "intro" into a list named "intro_list"
# Read column "main" into a list named "main_list"
# Read column "conclusion" into a list named "conclusion_list"
import csv

intro_list = []
main_list = []
conclusion_list = []

with open('content.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        intro_list.append(row['intro'])
        main_list.append(row['main'])
        conclusion_list.append(row['conclusion'])

# Method to build the synopsis out of all provided components
def build_synopsis(intro, main, conclusion):
    return intro + " " + main + " " + conclusion
All code in this example was generated automatically based on the comment prompts. Currently the “synopsis” is defined by a combination of three columns. However the prompt provided for the concatenation method expects those three parts only. If we were to add new fields or change the method of concatenation, this method would fail to work and would require refactoring.
Of course this example is trivially satisfied and does not represent most real code. But we can inductively understand that this lack of “foresight” manifesting as flexible, reliable code may have noticeable impact if CoPilot is used across the board.
There is nothing wrong with this code. It runs and functions correctly. It is not even unreasonable, given specific requirements, to expect to ingest three, and only three, named columns. But a general principle of reusable code would expect a more adaptable implementation of this method, which after all would require only very minor adjustments to be significantly more reliable (accept arbitrary parameters, read in a dict structure, etc), as sketched below.
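As a hedged sketch of the kind of minor adjustment meant here (illustrative code, not a CoPilot generation), the method could accept an arbitrary set of named components rather than exactly three:

# A more adaptable variant: accepts any number of named components and joins
# whichever are present, so adding a new CSV column does not force a rewrite.
def build_synopsis(**components):
    return " ".join(value for value in components.values() if value)

# Usage with the three original columns, and with a hypothetical fourth:
print(build_synopsis(intro="Once", main="upon a time", conclusion="the end."))
print(build_synopsis(intro="Once", main="upon a time",
                     conclusion="the end.", epilogue="(to be continued)"))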
This result trivially highlights another fundamental limitation of CoPilot – as discussed earlier, it is not principally a code generation tool. It is a coding easement tool. The fundamental requirement for the developer to recognize and adapt the code to the goal remains, nearly no matter how advanced CoPilot’s output becomes. This is a point which will be addressed again.
Test 2: Extending HTMX library
As noted, a fully realistic test of CoPilot includes adaptations and extensions based on a complete existing codebase – in other words, adding new features to code. This is perhaps the primary and most common use case of CoPilot in an organizational setting, as distinguished from private use for smaller, more isolated use cases.
The example chosen for this is htmx, a JavaScript library that adds dynamic features to HTML elements.21
Htmx is a modern library of reasonable size, but straightforward enough that its core workings ought to be digestible by CoPilot so that the context is complete.
One of the core features of htmx is the ability for an element to make a call and swap its contents with the result.
A realistic extension of this, which will be used for this testing example, might be to add a basic stack memory for an element, which will allow it to track its own history (theoretically with a mind to being able to swap back to previous states easily, though for this example we will not attempt to create the full functionality).
Working in the main JS file which manages the state of elements in htmx, CoPilot was provided with a prompt to create a tracking function for adding the current state to the stack any time the swap was called. This would be applied universally, leveraging htmx’s core “swap“ function.
The prompt that was applied was written as a comment preceding the helper function, as follows:
// Helper function to generically track the elements swapped out over time, using the swapList.
The state variable swapList was initialized manually.
The results of this test were less than promising. Even given the full context of the open file, which contains the core functions used, CoPilot appeared to draw more from pre-existing training than from context, starting generation of a function along a reasonable, but inappropriate, direction:
function trackSwappedElements(swapList, target, fragment) {
    if (swapList) {
        var swapInfo = {target: target, fragment: fragment};
        swapList.push(swapInfo);
        // If the swapList is too long, remove the oldest element.
        if (swapList.length > htmx.config.historyCacheSize) {
            var oldestSwap = swapList.shift();
            cleanUpElement(oldestSwap.target);
            // If the oldest element was a form, remove the form from the form list.
            if (oldestSwap.target.tagName === "FORM") {
                var formList = getInternalData(getDocument().body).formList;
                if (formList) {
                    var formIndex = formList.indexOf(oldestSwap.target);
                    if (formIndex >= 0) {
                        formList.splice(formIndex, 1);
                    }
                }
            }
        }
    }
}
In addition this code contains some highly questionable implementation choices, such as its reliance on global state referenced from inside the function rather than passed in. This does not speak to reliability, strictly, but should be noted.
Regardless, with guidance, this generated code can be made to work, more or less. It is easy to see how this function could be developed with minimal effort into something that meets the suggested need (interestingly, CoPilot did successfully leverage the existing cleanUpElement function).
The level of reliability will mostly depend on the remainder of the implementation. As written, the function relies on several unconfirmed properties, and while plausibly functional, is not particularly reliable.
Reliability – Takeaways
The two demonstrations of CoPilot generations in a larger context seem to bear out the standard that might be expected: straightforward code that makes a number of assumptions – or rather, acts blindly, and in doing so produces assumptions, such as taking for granted the existence of properties that would normally be checked, or skipping the processing of parameters that would smooth their handling.
Following on from the assessment of stability, this outcome is not surprising. It is worth establishing, however, as it more or less firms up the idea noted earlier: that CoPilot’s outputs are, effectively, trivially direct in form even when functional. Robustness comes from accurate application to context, and from assessment of the state and expected interactions of the system, which for the time being still requires a human hand for guidance. It is plausible that expertise in prompting would reduce this need, but it is unlikely to eliminate it.
Security and Privacy
Surprisingly, there is relatively little to note regarding security and privacy in terms of how they are reflected in code output. This is because, perhaps unsurprisingly, the code offered by CoPilot for completions reflects existing code standards across the board about as well as it reflects any other current standard – that is to say, when the majority of publicly available code represents a certain standard, protocol, or interface, the majority of the completions will reflect this.
Practically speaking, this turns out to mean that the average output of a code generation will usually include correct practices such as encryption of data, sanitization of input, etc. Demonstrably, this is not always the case – run through a dozen generations for implementing a login form and you will likely find at least one that breaks one or more common security practices.
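As an illustrative sketch of the kind of lapse meant here (hypothetical code, not an observed generation), a completion that assembles a login query through string interpolation invites SQL injection, while a parameterized query follows the more common, safer practice.

import sqlite3

def check_login_unsafe(conn, username, password):
    # Vulnerable: user input is interpolated directly into the SQL text,
    # so an input like "' OR '1'='1" bypasses the check entirely.
    query = f"SELECT id FROM users WHERE name = '{username}' AND password = '{password}'"
    return conn.execute(query).fetchone() is not None

def check_login_safer(conn, username, password):
    # Parameterized query: the driver handles quoting, defeating injection.
    # (Real code would also store salted password hashes, not plaintext.)
    query = "SELECT id FROM users WHERE name = ? AND password = ?"
    return conn.execute(query, (username, password)).fetchone() is not None

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, password TEXT)")
    conn.execute("INSERT INTO users (name, password) VALUES ('alice', 'hunter2')")
    # Injection succeeds against the unsafe version, fails against the safer one.
    print(check_login_unsafe(conn, "alice", "' OR '1'='1"))   # True
    print(check_login_safer(conn, "alice", "' OR '1'='1"))    # False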
Under the umbrella of “privacy” the concept of transmission of data – i.e. telemetry – can also be considered. That definition of the term has not been the focus here, largely because it is an implicit aspect of any generative AI system that is not self-hosted or air-gapped. As noted earlier in this document, the more detailed telemetry CoPilot may collect can to a degree be negated depending on license, and can be further controlled by the manner of engagement. For more details, reference CoPilot’s Privacy FAQ and Trust Center.22,23
Comparison with Alternatives
Many comparable tools have launched in this space in recent years, from competitors large and small. It would be interesting, but of limited utility, to assess them all. The list includes such names as Magic, CodeGen, JetBrains, Gemini, Augment, and others. A few of the key and newer players are described and assessed here.
Since this document’s intent is primarily to examine CoPilot, rather than to provide a full-breadth examination of all code generation tools, only a minimal examination of each tool’s functionality will be run, producing a sufficient basis for basic comparison against CoPilot’s capabilities. It will be up to the reader to determine whether a given tool warrants further investigation and might be an avenue for better productivity than CoPilot.
Codeium
Codeium represents itself as “a code acceleration toolkit” that includes features for autocomplete and repo search. Currently, Codeium attempts to distinguish itself via its focus on privacy, though it notes that it has a “grand vision” for the eventual evolution of the coding process.24
In practice, much of the current experience in using Codeium proves comparable to that of CoPilot. The sources that Codeium uses are somewhat more restrictive (“Codeium’s underlying model was trained on publicly available natural language and source code data”) but appears to reflect similar quality to CoPilot (note that for the purposes of this document, this comparison is anecdotal; no explicit comparison tests were performed).
In terms of selecting a coding support option for an organization, one main consideration may be the longevity of Codeium as compared to a tool offered by a more established company (though this itself is no guarantee of longevity). Codeium has recently raised a relatively sizeable Series B; its mid-term business model seems to orient around enterprise licenses.25,26 Whether this proves sustainable is an open question.
Cursor Copilot++
This augmentation tool promotes itself as a code editor with AI built in from the get-go. A relative newcomer in the generative AI coding space, Copilot++ intends to “actually edit your code, not just predict insertions”.27 Features include direct prompt insertion (effectively merging some chat prompt features into the coding process), error resolution, and pseudocode expansion.
For privacy, Copilot++ provides a selection of two options during setup: storing prompt data, or storing nothing.
Copilot++ appears to work partially through prompts to OpenAI services, so whatever data is persisted by OpenAI is outside of the control of this service; this may be a relevant point for some organizations. In terms of security, this places control one step further removed, given the reliance on OpenAI.
The distinction from using OpenAI directly is two-fold: One, Copilot++ leverages a custom model which is claimed to be “…trained on sequences of small diffs, and can see the edits you have made in the last few minutes at inference-time.”27,28 Two, the user may elect to leverage custom keys to interact with other services and models including Google and Azure.29
Performance across shared features with GitHub CoPilot appears similar, with some possible benefits:
- Running test 1 from the stability section above (“count syllables”) produces a similarly effective result, but with noticeable different code (at least at first run).
- Running test 10 (“login form”) is much more effective than the CoPilot counterpart, producing a generally usable form immediately during this test. Notably, when not prompted otherwise, it defaults to TypeScript.
A more complete suite of feature testing would be advisable if considering Copilot++ for use, but it appears to offer a reasonable starting point and passes a minimal reasonableness check.
ChatGPT
The comparison of a chat app against a code augmentation tool might seem pointless, but both tools stem from the same fundamental algorithmic process, and both can be used to generate code. The differences here can be stated simply: ChatGPT is a text completion tool. It is intended as a question-and-answer generative AI. It can be used to complete code, especially if used via the API, but it is better suited to a low-code or no-code situation, where the developer is attempting to minimize the code they themselves write, and instead use ChatGPT to do this for them. In that sense, ChatGPT is highly successful. The code it outputs in recent model versions has a high success rate, and also generally provides explanatory notes with the output, easing its use.
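A hedged sketch of that API-driven, iterate-step-by-step usage is shown below, assuming the OpenAI Python SDK’s chat completions interface; the model name is illustrative and the exact interface may have shifted since this writing.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Describe the desired code in full, then iterate with follow-up messages.
messages = [
    {"role": "system", "content": "You are a senior Python developer. Return only code."},
    {"role": "user", "content": "Write a script that serves the current working "
                                "directory over HTTP on port 8000."},
]

response = client.chat.completions.create(model="gpt-4o", messages=messages)
generated_code = response.choices[0].message.content
print(generated_code)

# A follow-up turn refines the result step by step, as described above.
messages.append({"role": "assistant", "content": generated_code})
messages.append({"role": "user", "content": "Add basic logging of each request path."})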
In comparison, as already noted, CoPilot, while based on the same fundamental principle, has been trained and developed to serve as augment. The code it generates can be used from scratch, but as is evident, is more ideally placed when suggesting, finishing, or tweaking existing code.
Conclusion
Overview
As noted earlier, the primary findings of this document’s analysis include 1) CoPilot leads in its ecosystem, but not necessarily in the quality of its code generation; 2) CoPilot provides assistance in speed to developers with sufficient experience to readily distinguish quality code; and 3) CoPilot is viewed by many experienced users as an assistant more so than a guide or independent participant in the generation of code.
As a tool with many users who have differing needs and differing skill levels, no single user experience can summarize the quality and utility of CoPilot. To speak generally, CoPilot offers a useful advantage to a very particular niche – reasonably experienced and knowledgeable developers who strive to “shortcut” the time taken to write common implementation patterns, or supplement some of the typing needed to finish a line or a statement.
CoPilot may effectively serve other niches, such as a more novice developer seeking to experiment and develop hobby projects in a limited timeframe, but these may lead more often to poor outcomes, as questionable code generations and lack of contextual awareness lead the developer astray.
It should be stressed that CoPilot is more than a tool to copy and paste code – CoPilot does positively offer a powerful set of features that can alter – and to an extent has already altered – the way the development process is viewed by product owners, management, and developers themselves. Writing code is already viewed as less of an obstacle. This is a freedom and a pitfall: It is that much easier to pull together a working application and generate a feature set, but the stability and reliability of the resulting implementation is dependent on the user.
Issues in code may lie dormant, appear to work as expected, and then produce magnificently disastrous results when the wrong toggle is triggered. This happens already, and generative AI supercharges this potential for trouble. The more that working, usable code is produced by AI, the more its users – and the development community generally – will assume that all of its code is quality code – and, more dangerously, will assume that code which works right now is good quality, which is a dangerous assumption and one that has tanked many a worthy project. Code cannot be trusted. The human behind it can, sometimes. That is where the strength of development has always lain, and this constant has not changed with the introduction of advanced generative AI.
Recommendation
The goal of this document is to provide an analysis of effectiveness, and a survey of many of the key factors in understanding the benefits and downsides of CoPilot and related tools. A recommendation is therefore strictly speaking outside the scope of this document, as the decision will be necessarily dependent on the needs of the organization and the context in which it operates.
That being said, some evident points do seem to rise uniformly, and may be worth noting:
- Developer time is expensive, especially in higher tiers.
- License costs for a chosen tool may be weighed against the time currently spent writing and reviewing the kind of code that is ripe for auto-generation (a back-of-envelope sketch follows this list). In other words, how much of a developer’s time is taken up producing code rather than solving the problem?
- The Mythical Man Month remains a relevant factor no matter how much of a developer’s code can be generated by AI.30
- Confusion and distraction are considered by many to be the worst stumbling blocks in a developer’s work day. Does CoPilot aid the developer or does it require additional effort from them to manage?
- “Efficiency” is an ambiguous term. Defining it well is the first building block toward proper use of generative AI.
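The following back-of-envelope sketch illustrates the license-cost weighing mentioned in the list above. Every figure in it (seat price, loaded hourly cost, minutes saved) is a placeholder to be replaced with an organization’s own numbers, not a claim about actual pricing or measured savings.

```python
# Back-of-envelope: does the time saved outweigh the seat cost? All inputs are placeholders.
license_cost_per_dev_per_month = 19.0   # hypothetical seat price, USD
loaded_hourly_cost = 75.0               # hypothetical fully loaded developer cost, USD/hour
minutes_saved_per_day = 10              # hypothetical net savings after review overhead
working_days_per_month = 21

monthly_value = (minutes_saved_per_day / 60) * loaded_hourly_cost * working_days_per_month
print(f"Estimated monthly value per developer: ${monthly_value:.2f}")
print(f"License cost per developer:            ${license_cost_per_dev_per_month:.2f}")
```

With these placeholder numbers the estimated value (roughly $262 per month) comfortably exceeds the seat cost, but modest changes to the minutes-saved assumption swing the result; the point is the structure of the comparison, not the specific figures.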
For the average small to midsize organization that employs multiple developers, CoPilot is a tool worth considering. Its ubiquity and ease of use make it useful – sometimes – for cutting down gruntwork.
For extremely small organizations, an alternative might be preferable. The cost of a license may simply not pan out against development time, except possibly for green-field projects, and even in those cases other tools may be more useful. Cursor Copilot++, for example, which focuses (at the moment) on a smaller feature set and may feel more intuitive to some developers, may be worth adopting over GitHub CoPilot.
For enterprise organizations, adopting CoPilot may prove a reasonable cost even if its use is intermittent: the ability for a team, or a single developer, to dip into that well as needed can save significant time against a license cost the organization can readily absorb.
No matter the choice, knowing how to use CoPilot remains more important than the choice of whether or not to purchase it.
Footnotes, sources, and references:
- It is possible to create concrete, rigorous definitions for this intuitive sense of creativity in code. A trivial example of this would include a mapping of a language’s reserved words to a point structure that represents the time and space efficiency of a solution composed in total. This level of exacting definition is bypassed here in the interests of readability. It should also be noted that in order to create a framework pertinent to a broad range of organizations, there is utility in retaining flexibility over accuracy to a certain degree, recognizing the variability of real-life solutions, which do not always accord with technical optimals.
- See https://www.allaboutai.com/ai-glossary/temperature for a practical definition of this term.
- For instance, a broader concept of security would involve factors such as authentication, authorization, encryption, and so on, as well as both high-level design choices and implementation details for all of these. It is not immediately reasonable to expect suggestions of this nature from CoPilot. If, in writing an authorization module (which often would leverage if not be entirely handled by a framework or library), the user accepted code completions in the context, then we might reasonably expect those outputs to take into consideration such factors as proper encryption, logged-in detection, etc. But it would be less reasonable to expect an output to include a suggestion of how to set up the user database securely. Some tools such as ChatGPT (and products leveraging it) might be more appropriate for this kind of broad suggestion.
- https://github.com/features/copilot
- https://github.com/features/copilot/#faq-privacy-copilot-for-business
- https://github.blog/2022-09-07-research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/
- https://github.blog/2023-10-10-research-quantifying-github-copilots-impact-on-code-quality/
- https://github.blog/2023-06-27-the-economic-impact-of-the-ai-powered-developer-lifecycle-and-lessons-from-github-copilot/
- https://github.blog/2023-02-14-github-copilot-for-business-is-now-available/
- At time of writing, May 2024
- https://github.blog/2024-04-29-github-copilot-workspace/
- The landmark book Design Patterns by Gamma et al, released in 1994, remains a staple of many developers’ bookshelves despite its age. Modern design patterns vary and the concept of design patterns has been applied in niches across software development (see e.g. https://en.wikipedia.org/wiki/Software_design_pattern, https://www.redhat.com/en/blog/14-software-architecture-patterns, https://refactoring.guru/ and many more).
- https://www.reddit.com/r/Unity3D/comments/16ww6rd/comment/k2zpp66/
- https://www.reddit.com/r/GithubCopilot/comments/15axyot/comment/jtvv3ep/
- https://news.ycombinator.com/item?id=32755206
- https://www.scalablepath.com/full-stack/ai-pair-programming-github-copilot-review
- https://ntietz.com/blog/changing-my-relationship-with-github-copilot/
- https://news.ycombinator.com/item?id=39169889
- https://trace.yshui.dev/2024-05-copilot.html#did-github-copilot-really-increase-my-productivity
- https://www.microsoft.com/en-us/Investor/events/FY-2023/Morgan-Stanley-TMT-Conference
- https://github.com/bigskysoftware/htmx
- https://github.com/features/copilot/#faq-privacy-copilot-for-business
- https://resources.github.com/copilot-trust-center/
- https://codeium.com/faq
- https://finance.yahoo.com/news/codeium-raises-65-million-bring-140000708.html
- https://codeium.com/blog/how-is-codeium-free
- https://www.cursor.com/cpp
- https://openai.com/form/custom-models/
- https://docs.cursor.com/miscellaneous/api-keys
- https://openlibrary.org/books/OL1110870M/The_Mythical_Man-Month
Final addendum: At the date of posting, which was long delayed because I could not figure out what to do with this mostly pointless document, many new and more advanced tools have been created and promoted (Google AI Studio, Codex, and so on), including powerful utilities and even full environments aimed specifically at producing code from scratch. Re-reading this white paper, it seems that most of the assessments of how the actual code is produced and presented remain comparable. Perceived quality differs from user to user, case to case, and tool to tool, but I think the roadblocks that existed a year ago remain in similar form, even if at a different scale and in different use cases.