Understanding CircleCI Workspaces: How Data Flows Across Jobs

When working with CircleCI, one powerful feature for managing data between jobs is workspaces. However, how data flows through a workspace is not immediately obvious. This guide will help you grasp how data persistence in workspaces works.

Today I struggled for a few hours trying to understand why I was seeing the entire repo and the node_modules folder in my deploy job while using CircleCI workspaces. After some investigation, I found the root cause and moved past the issue. To save others those hours, I decided to publish the post I wish had existed before I invested so much time in this mystery.

What Are CircleCI Workspaces?

CircleCI workspaces provide a mechanism to persist data from one job and make it available to subsequent jobs. This feature is particularly useful for sharing build artifacts, dependencies, and other important files between jobs in a workflow.
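
A minimal pair of jobs shows the mechanism; the job names, image, and paths here are illustrative, not taken from the pipeline discussed below:

jobs:
  producer:
    docker:
      - image: cimg/base:stable
    steps:
      # Create a file, then upload it as a workspace layer.
      - run: mkdir -p shared && echo "hello" > shared/greeting.txt
      - persist_to_workspace:
          root: .          # the paths below are relative to this directory
          paths:
            - shared       # only this directory is persisted

  consumer:
    docker:
      - image: cimg/base:stable
    steps:
      # Download the workspace into the current directory.
      - attach_workspace:
          at: .
      - run: cat shared/greeting.txt

workflows:
  example:
    jobs:
      - producer
      - consumer:
          requires:
            - producer     # ensures producer's layer exists before attaching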

How Workspaces Handle Data

A key aspect of CircleCI workspaces is that they compound data from multiple jobs. This means that each time you persist data to the workspace, it adds to the existing workspace content rather than replacing it. This compounding behavior can lead to unexpected results if not managed carefully.

The Compounding Effect: A Step-by-Step Example

Let's illustrate how this compounding effect works with a practical example.

Step 1: Initial Job - Checkout and Install Dependencies

In the first job, we check out the repository and install dependencies. We then persist these to the workspace.

jobs:
  checkout_and_install_deps:
    docker:
      - image: cimg/node:18.14.0
    working_directory: ~/project
    steps:
      - checkout
      - run:
          name: Install dependencies
          command: npm install
      - persist_to_workspace:
          root: .
          paths:
            - .

Step 2: Build Job - Attach Workspace and Build Project

In the second job, we attach the workspace, which now includes the repository and node_modules, then build the project and persist the build directory.

jobs:
  build:
    docker:
      - image: cimg/node:18.14.0
    working_directory: ~/project
    steps:
      - attach_workspace:
          at: .
      - run:
          name: Build project
          command: npm run build
      - persist_to_workspace:
          root: .
          paths:
            - build

Step 3: Deploy Job - Attach Workspace and Deploy Artifacts

In the third job, we attach the workspace again. Now, the workspace contains the repository, node_modules, and the build directory.

jobs:
  deploy_qa:
    docker:
      - image: cimg/node:18.14.0
    working_directory: ~/project
    steps:
      - attach_workspace:
          at: .
      - run:
          name: List contents of the build folder
          command: ls -al build
      - run:
          name: Deploy to S3
          command: aws s3 cp ./build s3://my-s3-bucket-url/ --recursive
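
For completeness, the three jobs above have to be chained in a workflow so that each persist completes before the next attach; a sketch of the wiring (the workflow name is made up):

workflows:
  build_and_deploy:
    jobs:
      - checkout_and_install_deps
      - build:
          requires:
            - checkout_and_install_deps
      - deploy_qa:
          requires:
            - build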

Understanding the Result

When you list the contents of the root directory in the deploy job, you might expect to see only the build directory. Instead, you will see the entire repository, node_modules, and the build directory, because each persist_to_workspace step adds to the workspace, and that content compounds across jobs.

Key Points to Remember

  1. Workspaces Accumulate Data: Each persist_to_workspace step adds to the existing data in the workspace. It does not replace it.
  2. Explicit Management: Be explicit about what you persist and attach to avoid unnecessary data transfer.
  3. Verify and Debug: Use debugging steps, such as listing directory contents, to verify what is being persisted and attached; see the sketch below.
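
For point 3, a throwaway step like this can be dropped into any job right after attach_workspace to see exactly what arrived (illustrative):

      - attach_workspace:
          at: .
      - run:
          name: Inspect attached workspace (debug step)
          command: ls -al .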

Example: Explicitly Managing Workspace Contents

If you want only specific directories available where the deploy steps run, you can attach the workspace to a separate directory and copy across just what you need:

jobs:
  deploy_qa:
    docker:
      - image: cimg/node:18.14.0
    working_directory: ~/project
    steps:
      - attach_workspace:
          at: /home/circleci/workspace
      - run:
          name: Copy build folder to project root
          command: cp -r /home/circleci/workspace/build /home/circleci/project/
      - run:
          name: List contents of the build folder
          command: ls -al /home/circleci/project/build
      - run:
          name: Deploy to S3
          command: aws s3 cp ./build s3://my-s3-bucket-url/ --recursive
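
Attaching to a separate directory keeps the project root clean, but note that the full workspace archive is still downloaded. If transfer size matters, you can also be narrower at the source: the first job could persist only the paths later jobs actually need instead of the whole checkout. The path list below is an assumption about a typical project layout:

      - persist_to_workspace:
          root: .
          paths:
            - src                 # application sources (assumed layout)
            - package.json
            - package-lock.json
            - node_modules        # installed dependencies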

Alternative Method: Using save_cache

Before I realized that the workspace was compounding the files it received, I had reached out to my Principal Engineer for help. I managed to figure out the issue on my own before he replied, so I asked him to ignore my request. When I shared my findings, he congratulated me and mentioned that he wasn't familiar with this method of sharing files across jobs. Instead, he introduced me to an alternative he often uses: save_cache. Caching not only persists data between jobs but can also carry data across different workflow runs. One caveat to keep in mind: caches are immutable, so saving to a key that already exists is a no-op, which makes choosing cache keys carefully important. Here's how you can use caching as an alternative:

jobs:
  checkout_and_install_deps:
    docker:
      - image: cimg/node:18.14.0
    working_directory: ~/project
    steps:
      - checkout
      - run:
          name: Install dependencies
          command: npm install
      - save_cache:
          key: v1-dependencies-{{ checksum "package-lock.json" }}
          paths:
            - node_modules

  build:
    docker:
      - image: cimg/node:18.14.0
    working_directory: ~/project
    steps:
      - checkout
      - restore_cache:
          key: v1-dependencies-{{ checksum "package-lock.json" }}
      - run:
          name: Build project
          command: npm run build
      - save_cache:
          # Key on the commit SHA: caches are immutable, so a key based only on
          # package-lock.json would keep serving a stale build after source changes.
          key: v1-build-{{ .Revision }}
          paths:
            - build

  deploy_qa:
    docker:
      - image: cimg/node:18.14.0
    working_directory: ~/project
    steps:
      - checkout
      - restore_cache:
          key: v1-build-{{ .Revision }}
      - run:
          name: List contents of the build folder
          command: ls -al build
      - run:
          name: Deploy to S3
          command: aws s3 cp ./build s3://my-s3-bucket-url/ --recursive
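
As a usage note, restore_cache also accepts a keys list that is tried in order, so a job can fall back to a partial match when the exact key misses. A common pattern for the dependency cache looks like this:

      - restore_cache:
          keys:
            - v1-dependencies-{{ checksum "package-lock.json" }}
            # falls back to the most recent cache whose key starts with this prefix
            - v1-dependencies-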

Official CircleCI Documentation

For more detailed information on these methods, refer to the official CircleCI documentation:

  Workspaces: https://circleci.com/docs/workspaces/
  Caching: https://circleci.com/docs/caching/

Conclusion

Understanding how CircleCI workspaces compound data is crucial for managing your CI/CD pipeline effectively. By being explicit about what you persist and attach, and by verifying the contents at each step, you can avoid unexpected results and ensure your pipeline runs smoothly. Additionally, exploring other methods like caching can provide alternative solutions that might better fit your workflow needs.

If you’ve ever been puzzled by seeing more files than expected in your workspace, I hope this guide helps you understand the compounding nature of CircleCI workspaces. Thank you for reading, and Happy Hacking!