
Changes to Anaconda’s Anonymous Usage Data Collection

Michael Grant

Anaconda strives to continually improve the user experience for our customers and communities. We aim to be the premier provider of secure access to thousands of Python and R repositories, packages, and libraries, while also supporting the open-source community that powers those packages. These dual goals are reflected in Anaconda’s products and in our open-source contributions.

A deeper understanding of repository usage patterns enhances our ability to serve both our free and paid users. With that in mind, we are expanding the anonymous usage data that the conda package manager delivers when used alongside these Anaconda products: Anaconda Distribution and Anaconda Navigator.

You will not be affected by this change if you rely exclusively on community channels like conda-forge and installers like Miniforge. Mindful of our commitment to the community, we will not submit this change to the conda project, the conda package itself, or the Miniconda installer. We have also made it simple to disable this additional data collection, if you choose.

In this article, we detail what data is being collected, what additional data will be collected, where and when this update is happening, why we are now collecting this additional data, how this data will be used by Anaconda, and how it will benefit our customers and communities. The article finishes with a deep dive into how user data is managed in Anaconda products.

What data is being collected?

When a conda client requests a package or an index from a repository, it uses an industry-standard mechanism to include generic identification information in the request:

  • the versions of the conda and requests Python packages;
  • the variant and version of the Python interpreter; and
  • the variant and version of the host operating system.

With our change in place, three randomly generated tokens are added to each request:

  • a client token unique to each distinct conda client;
  • an environment token unique to each conda environment; and
  • a session token unique to each individual conda transaction.

The details of how this works are given in the “Deeper dive” section below. But as indicated above, the tokens are random: they contain no personally identifiable information—not even the name of your conda environment. They do, however, help us draw better statistical conclusions about usage by allowing us to more precisely distinguish between distinct users, environments, and transactions in our access logs.

Where and when is this change occurring?

This update is accomplished using a new conda plugin implemented in the new anaconda-anon-usage conda package. This package will be added to Anaconda products in phases to help ensure a smooth rollout:

  • Initially, it was added to the dependencies for Anaconda Navigator, starting with version 2.4.3, released on September 7, 2023. This release has undergone a full QA process to ensure that the inclusion of anaconda-anon-usage does not impact operation.
  • It will be incorporated into the Anaconda Distribution 2023.09 installers, scheduled for release at the end of September 2023. 
  • At a later date to be determined, the package will be incorporated into future versions of the following additional products offered on the Anaconda channels:
    • anaconda-client, our tool for managing community package channels on Anaconda.org; and
    • anaconda-cloud-auth, a package that facilitates authenticated connections to a variety of Anaconda cloud services, present and future.

As explained in the introduction, this package is not being added to conda itself. Our intent is to collect data associated specifically with users who engage with Anaconda products, and not the larger open-source community, whose members may prefer to rely entirely on community-driven resources. 

Why collect extended usage data?

This additional data will help Anaconda serve both our community and our customers better, in a variety of ways.

On the community side, we are always looking to improve our ability to understand usage patterns in lightweight, privacy-preserving ways. This additional information allows us to perform analyses with much more accuracy by disaggregating, or separating, the raw usage data across users, transactions, and environments.

Here are just a few examples of questions we will be able to answer better:

  • How many individual conda installations sit behind a single IP address?
  • Is a given conda transaction coming from a long-lived desktop environment, managed by a human being, or a temporary environment created by a CI job?
  • Is a particular package installation occurring in a base environment or a child environment? And is it a new installation, or is it an update to an existing package—the latter indicating a stronger sense of active use?
  • How many separate installations of a package affected by a particular vulnerability are we seeing? And how many of them have been remediated with updates?
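
To make the first of these questions concrete, here is a minimal Python sketch. It is not Anaconda's actual analysis pipeline: it assumes access log entries have already been reduced to hypothetical (IP address, client token) pairs, and the sample records and the function name clients_per_ip are invented purely for illustration.

```python
from collections import defaultdict

# Hypothetical, already-parsed log records: (ip_address, client_token).
# The addresses and token values below are made up for this example.
records = [
    ("203.0.113.7", "clientA"),
    ("203.0.113.7", "clientA"),  # same installation, second request
    ("203.0.113.7", "clientB"),  # a different installation, same IP
    ("198.51.100.2", "clientC"),
]


def clients_per_ip(records):
    """Count the distinct client tokens seen behind each IP address."""
    seen = defaultdict(set)
    for ip_address, client_token in records:
        seen[ip_address].add(client_token)
    return {ip: len(tokens) for ip, tokens in seen.items()}


print(clients_per_ip(records))
```

Without the client token, the three requests from the first address would be indistinguishable; with it, they separate into two installations.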

It is our commitment to our community to find ways to share these insights with you. Specifically, we are looking at ways to provide this information to Anaconda channel owners and package developers, likely with an expansion of our existing condastats project.

Of course, our ability to invest significant resources into the conda community rests on the success of our revenue-generating products as well. To that end, we acknowledge that there are several ways that this data will help us improve our business. For instance:

  • Better usage information allows us to improve our prioritization of package builds and CVE curations, increasing the value of our secure software supply chain.
  • Customers who provide us with an IP address list can obtain accurate counts of their user base, to properly size licenses to our commercial offerings. 
  • Anaconda Professional customers will have access to anonymous summaries of their users’ exposures to new vulnerabilities.
  • Anaconda Business customers can opt in to even more telemetry that allows us to track vulnerabilities down to specific users and hosts.

For more information on how Anaconda generates revenue to support our business operations and continue our community investments, see our recent blog post, “Is conda free?” 

Disabling the extended usage data

We certainly hope that you will consider leaving the random tokens in place. But if you do wish to disable them, you can do so by running this command:

conda config --set anaconda_anon_usage off

You may also manually edit your conda configuration file and add the line:

anaconda_anon_usage: false

To re-enable the additional usage data, run this command,

conda config --set anaconda_anon_usage on

or remove the anaconda_anon_usage entry from your configuration file. No matter what you choose, your choice will remain in effect—even if you uninstall and reinstall Anaconda, as long as you do not delete your conda configuration file when doing so.

Thank you!

We’re grateful for the trust our users have placed in us. As always, with every change we make to our solutions, we are working to serve you better and continue our work to provide centralized, secure access to thousands of Python and R repositories, packages, and libraries. We will continue to champion the data science community and steward open-source projects that make it easier for you to innovate, build, and deploy effective solutions in your field.

Deeper Dive: User Data in conda and anaconda-anon-usage

In this section, we offer a more technical introduction to the mechanism by which both conda and anaconda-anon-usage determine and transmit the user data discussed above.

The key mechanism for transmitting this user data is the industry-standard HTTP user agent string. Web browsers transmit such a string with every request they make to a website. It generally contains information about the computer, operating system, and browser; for example:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36

Conda uses the same protocols to make requests of package repositories, so it, too, generates user agent strings. Here is a typical example:

conda/23.7.3 requests/2.31.0 CPython/3.10.12 Darwin/22.6.0 OSX/13.5.1

As you can see, this string contains information about the versions of various components of your operating system and conda environment. You can see the precise string that your conda environment uses by running the command conda info and examining the user agent line. Alternative conda clients such as mamba and pixi use very similar user agents as well. HTTPS encryption ensures that the content of these headers is protected from snooping.

When anaconda-anon-usage is installed, it uses the conda plugin mechanism to augment the conda user agent string. Specifically, it adds a new version token and three additional, randomly generated tokens. This longer user agent string might look like this:

conda/23.7.3 requests/2.31.0 CPython/3.10.12 Darwin/22.6.0 OSX/13.5.1 aau/0.3.0 c/16lUJyi7R8u-Co33mZJElQ s/YYFCctOeTjyDnXLazjLy_A e/rVB0_HxgRXKPLzKt9sKcVA

Here is what each of these tokens means:

  • aau/: the version of the anaconda-anon-usage package itself.
  • c/: a client token unique to your conda installation. Every time conda is run, this same token will be delivered.
  • s/: a session token unique to a single run of conda. For instance, a command like conda install pandas might make five separate requests to the package repository; each one of these requests would share the same session token.
  • e/: an environment token unique to each conda environment, including both the base and child environments.

So for instance, if you run the commands ‘conda install -n base pandas’ and ‘conda install -n child1 panel’ on the same machine, the user agent strings would have the same client token, but different session and environment tokens.
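
To show how these fields can be read back out of a user agent string, here is a small Python sketch. It is an illustration only, not code from conda or anaconda-anon-usage: the helper name parse_user_agent is hypothetical, and the sample string is the one shown above.

```python
def parse_user_agent(user_agent: str) -> dict[str, str]:
    """Split a conda-style user agent string into {name: value} pairs."""
    fields = {}
    for item in user_agent.split():
        # Each item looks like "name/value"; split on the first slash only.
        name, _, value = item.partition("/")
        fields[name] = value
    return fields


sample = (
    "conda/23.7.3 requests/2.31.0 CPython/3.10.12 Darwin/22.6.0 OSX/13.5.1 "
    "aau/0.3.0 c/16lUJyi7R8u-Co33mZJElQ s/YYFCctOeTjyDnXLazjLy_A "
    "e/rVB0_HxgRXKPLzKt9sKcVA"
)
fields = parse_user_agent(sample)
print(fields["aau"], fields["c"], fields["s"], fields["e"])
```

Two runs of conda in different environments on the same machine would produce the same value under c but different values under s and e.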

Each token is generated in the same way:

  • 16 bytes of cryptographic-quality random data are generated using os.urandom.
  • The data is encoded by base64.urlsafe_b64encode to produce a 22-character text-friendly representation.
  • The session token is generated afresh every time conda is called.
  • The client and environment tokens are saved to disk so that they can be reused:
    • macOS/Linux: ~/.conda/aau_token (client), $CONDA_PREFIX/etc/aau_token (environment)
    • Windows: %USERPROFILE%\.conda\aau_token (client), %CONDA_PREFIX%\etc\aau_token (environment)

If you delete the saved tokens, a new set of values will be regenerated automatically during the next conda transaction—unless you disable this telemetry altogether.
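
The recipe above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the package's actual source: the function names random_token and saved_token are hypothetical, the sketch assumes the base64 padding is stripped to reach 22 characters, and the demo writes to a temporary directory rather than to the real token locations.

```python
import base64
import os
import tempfile
from pathlib import Path


def random_token() -> str:
    """Return 16 random bytes as 22 URL-safe base64 characters."""
    raw = os.urandom(16)
    # 16 bytes encode to 24 base64 characters ending in "=="; stripping
    # that padding (an assumption here) leaves the 22-character form.
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def saved_token(path: Path) -> str:
    """Reuse the token stored at `path`, creating it if it is missing."""
    if path.exists():
        return path.read_text().strip()
    token = random_token()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(token)
    return token


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = Path(tmp) / "etc" / "aau_token"  # stand-in location
        print("environment token:", saved_token(demo_path))
        print("reused on next run:", saved_token(demo_path))
        print("fresh session token:", random_token())
```

Deleting the file and calling saved_token again yields a new value, which mirrors the regeneration behavior described above.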

The anaconda-anon-usage package is, of course, open source, with a standard BSD license. The source code is available publicly if you seek an even deeper dive!
