We’ve been working to make Earthly easier for new users. I recently tackled the self-sign-up feature. We wanted to make signing up simple, to draw in more developers and grow our community.
The Streamlined Onboarding for Earthly gets us closer to our goal: making CI/CD simple so developers can focus on their work. But was it working? I needed to find out, but getting to the bottom of that involved mastering Segment and Snowflake, learning what DBT was, and 20 other things I’ve probably forgotten by now.
After we launched self-sign-up, we used what data we had to watch how new users used our platform. Did they like what Earthly offered? We tracked their path from the first click to sign up all the way to buy-in.
But we needed to dig deeper into our customer journey funnel and track the journey from sign-up to paying customer. Our goal was to make onboarding smoother and get people to a satisfying experience with Earthly faster.
The problem was tracking and analyzing users to see where they got stuck in our onboarding process. Earthly has a bit of a steep onramp at first. How can we make it smoother? How can we measure the drop-off? Traditional tools fell short, and custom solutions seemed too heavy and too costly. So, we turned to specialized tools like FunnelStory to really understand people’s onboarding journeys.
Diving into funnel analysis, I hit some snags. It wasn’t just about new tools or dashboards. It was about understanding users and tackling tricky data.
I faced a big challenge: messy data. Sometimes, what I needed was just not there, or our numbers didn’t add up. It was a wake-up call for me. I had to take a second look at the assumptions behind the data and sometimes add new instrumentation. And getting the data into the right format was another challenge. SQL queries and analytics tools – each with its own quirks.
As I dug into how customers interact and whether they stick around, creating a funnel visualization was another challenge. Standard tools were either too vague or demanded too much custom work. Not scalable, not efficient. That’s when we decided to try FunnelStory.
It showed us, plain and simple, the path people took from signing up to being active users and maybe even paying us. It was a fairly easy setup, and once I had that in shape, it meshed well with our data.
I turned to Hex and other data tools to build out custom usage models beyond the funnel. With Hex, I crafted custom data models and made visualizations that showed us custom data specific to our customers’ usage.
We mixed FunnelStory’s funnel tracking with custom models built in Hex, and together, they made a powerful toolkit. It let us see what users do right away and where they bail.
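To make that concrete, here’s a minimal sketch of the stage-by-stage drop-off math involved, using a hypothetical table with one row per account and the furthest funnel stage it reached. The stage names and layout are illustrative, not our actual schema:

# Hypothetical funnel data: one row per account, furthest stage reached.
import pandas as pd

accounts = pd.DataFrame({
    "account": ["a", "b", "c", "d", "e"],
    "stage":   ["signed_up", "signed_up", "active", "active", "paying"],
})

# Ordered funnel stages; reaching a later stage implies passing earlier ones.
stages = ["signed_up", "active", "paying"]
rank = {s: i for i, s in enumerate(stages)}
accounts["rank"] = accounts["stage"].map(rank)

previous = None
for i, stage in enumerate(stages):
    reached = (accounts["rank"] >= i).sum()
    if previous:
        print(f"{stage}: {reached} accounts ({reached / previous:.0%} of previous)")
    else:
        print(f"{stage}: {reached} accounts")
    previous = reached

Tools like FunnelStory do this bookkeeping for you, at the account level and against real event streams.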
Using data helps you make smart choices, better your product, and keep users happy. It’s not just about collecting numbers. It’s about understanding them and using them to keep getting better and to innovate.
Diving into analytics, I hit some tough spots. After dropping my daughter off at pre-school, I often work out of a cafe in Brooklyn. If you saw me there, laptop open, confusion on my face, wrestling with SQL variants and Snowflake gotchas, trying to make sense of huge piles of data, I might have looked stressed. I was trying to assess accuracy and get feedback from others on the numbers, but I was doing much of this solo. I had to double-check my work, with no formal review or second pair of eyes. We are a start-up, and I needed to take the lead on this and power through any impediments.
If I were giving myself advice for making an effort like this again, here’s what I’d want to know.
In short, use FunnelStory for funnel analysis, then add tools like Hex for custom usage analysis to complete the picture. These tools will show you where to tweak your product and how to improve onboarding. You’ll see exactly how accounts move through your product and where you can improve.
Using data helps you make smart choices that matter for your product and the people using it. Our data analysis taught us a lot. We tracked how accounts behaved from sign-up to using our platform’s features. We saw what they liked and didn’t, and where we were losing them. This helped us figure out what to build next and how to keep them coming back.
That’s where we’re focusing now. And it’s paying off – we’re not just getting users, we’re getting paying customers.
ARM processor architecture is a family of instruction set architectures (ISAs) for central processing units (CPUs). An ISA helps applications talk to the hardware by specifying the processor’s capabilities and outlining how user instructions are executed based on those capabilities.
ARM processors adopt a simplified set of instructions using the reduced instruction set computer (RISC) model, which contributes to their efficiency, compactness, and lightness. This architectural choice has fueled the popularity of ARM-based devices, like the affordable and adaptable Raspberry Pi.
In addition to IoT and server devices, ARM architecture is gaining popularity in the personal computing industry due to its efficiency and cost-effectiveness. Apple’s M1 and M2 chips are examples of this shift towards ARM architecture.
The rapid growth of ARM-based devices, combined with the growing popularity of containerization, means that application developers need to publish container images for multiple platforms while ensuring compatibility across ARM and x86-64 architectures as well as Linux and Windows environments. This multi-platform approach helps ensure that applications can run smoothly on a variety of devices, ranging from high-powered servers and desktops to low-end mobiles and ubiquitous IoT devices.
In this article, you’ll learn how to automate the creation and deployment of Docker images specifically designed for ARM architecture using GitHub Actions.
With the rising popularity of ARM devices, more and more applications need to be able to run on them. However, since ARM devices aren’t typically powerful enough to run heavy development jobs, the development world is mainly x86-based. This means developers usually create Docker images on non-ARM devices, and these images can’t run on ARM devices.
You can, in theory, use an emulator such as QEMU on ARM devices to run Docker images built on non-ARM devices, but that emulation is painfully slow and can reduce productivity. That’s why it’s recommended that you build an ARM-based Docker image so that you can run it directly on ARM devices.
For old Docker versions, you needed to have an ARM device or use QEMU to build ARM-based images. However, with the advent of buildx, it’s possible to easily build ARM-based images on non-ARM devices.
This tutorial uses a simple Python application that prints out basic information about the system it’s running on. All the source code for this tutorial is available in this GitHub repository.
Before you can continue, you’ll need:
- buildx, a Docker CLI plugin that extends the build capabilities of Docker. It enables Docker to build images for multiple platforms. You can also use it to build Docker images in parallel, which can significantly reduce build time. The latest Docker Engine requires you to install buildx separately, as provided in the installation instructions.

Once you’ve completed these prerequisites, you’re ready to create your demo application and set up GitHub Actions to automate the Docker image generation process.
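With buildx installed, you can optionally sanity-check the setup. The following command lists your builders and the platforms each one can target; the exact output depends on your Docker version and configuration:

docker buildx ls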
In the first section of this tutorial, you’ll build and run a Docker image on an ARM device natively. For this reason, this section is executed on an ARM device.
First, create a new directory, add a file named main.py within the directory, and copy and paste the following code into the file:
import platform
print("This program is running on " + platform.machine())
print("Platform system: " + platform.system())
print("Platform version: " + platform.version())
print("Platform node: " + platform.node())
print("Platform architecture: " + str(platform.architecture()))
This script uses the platform module to access the system information and prints it to the console. It prints the machine type, operating system, version, node, and architecture. The following is an example output of the script running on a Raspberry Pi with an ARM processor:
This program is running on armv7l
Platform system: Linux
Platform version: #1559 SMP Wed Jun 1 13:24:16 BST 2022
Platform node: raspberrypi
Platform architecture: ('32bit', 'ELF')
You can also verify this output by running uname -m. This command prints the architecture of the system it’s running on:

$ uname -m
armv7l
Now it’s time to write the Dockerfile. Create a new file named Dockerfile in the same directory as the Python script, and copy and paste the following code into it:
# Use an official Python runtime as a parent image
FROM python:3.9-slim-buster
# Set the working directory in the container to /app
WORKDIR /app
# Add the Python script into the container at /app
ADD main.py /app
# Run the command to execute your Python script
CMD ["python", "./main.py"]
This Dockerfile uses the official Python image as the base image. Then, it sets the working directory to /app and adds the Python script to the container. Finally, it executes the Python script using the CMD instruction.
To build the Docker image from the Dockerfile on your local machine, run the following command in the same directory as the Dockerfile (replace <your-registry-username> with the username of your container registry account):
docker build -t <your-registry-username>/python-app .
This command builds the Docker image using the Dockerfile and tags it with the name python-app. Once the build is complete, you can run the Docker image using the following command:
docker run <your-registry-username>/python-app
This command creates a container from the Docker image and executes it to print the host platform information to the console. The output generated from the command should be similar to the previous one.
It’s not always possible to use an ARM machine to build ARM-based Docker images. If your development or CI machines are predominantly non-ARM, it makes sense to build the Docker image on non-ARM devices. However, as you’ll see, images built on a non-ARM device will not run on an ARM device by default.
To demonstrate, on your non-ARM device, either copy and paste the same code or clone this GitHub repo. Then, run the following command to build the Docker image and push it to the registry:
docker build --push -t <your-registry-username>/python-app .
Note: Make sure you are logged in to the registry by running the docker login command beforehand.
Go back to your ARM device and run the newly built image:
docker pull <your-registry-username>/python-app
docker run --rm <your-registry-username>/python-app
You’ll face an error message like this:
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm/v7) and no specific platform was requested
exec /usr/local/bin/python: exec format error
This simply means that since the image was created on a non-ARM device, it won’t run on an ARM device.
To fix this, you’ll need to use buildx to build for the ARM platform:
docker buildx build --platform linux/arm/v7 --push -t \
<your-registry-username>/python-app .
Note: You might need to use something else instead of linux/arm/v7, depending on your ARM device. You can figure out what value to use by looking at the “host platform” value in the error message in the previous step.
Go back to your ARM device and run the image again:
docker pull <your-registry-username>/python-app
docker run --rm <your-registry-username>/python-app
This time, the image will execute without an error.
At this point, you’ve created a Docker image for your Python application and tested it on your system. Now it’s time to automate the build and image push process using GitHub Actions. In the following section, you’ll set up GitHub Actions to build and push the Docker image to a Docker registry every time you push updates to your code.
To set up GitHub Actions, you have to create a GitHub repository for your application and push the code to it. Once you’ve pushed your code to the repository, create a new file named main.yml in the .github/workflows directory. This file contains the workflow definition for your GitHub Action.
Copy and paste the following code into the file:
name: Build ARM Docker Image

on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Check out code
        uses: actions/checkout@v2

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v1

      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v1
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push
        uses: docker/build-push-action@v2
        with:
          context: .
          push: true
          tags: ghcr.io/${{ github.repository }}:latest
          platforms: linux/arm/v7
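As an aside, buildx can build for several platforms in one pass. If you also wanted x86-64 and 64-bit ARM variants of the image, the platforms field in the final step could plausibly be extended like this (a sketch; pick only the platforms you actually need):

platforms: linux/amd64,linux/arm64,linux/arm/v7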
Here’s the breakdown of the workflow definition:

- The name field defines the name of the workflow (Build ARM Docker Image).
- The on field defines the event that triggers the workflow (the push event on the main branch).
- The jobs field defines the jobs that are executed as part of the workflow. In this case, there is only one job named build.
- The runs-on field defines the operating system on which the job is executed (ubuntu-latest).
- The steps field defines the steps that are executed as part of the job. In this case, there are four steps: the first checks out the code, the second sets up buildx to build Docker images for the ARM platform, the third logs in to the GitHub Container Registry, and the fourth builds and pushes the image. The final step uses the context field to specify the directory containing the Dockerfile. Then, it uses the push field to specify that the image should be pushed to the registry. It also uses the tags field to specify the name of the image. Finally, it uses the platforms field to specify the platforms for which the image should be built. In this case, it is linux/arm/v7, which means that the image will be built for the ARM architecture.

To trigger the workflow, you need to commit and push the workflow definition to the repository. Once you’ve committed the changes, the workflow is triggered automatically. You can view the status of the workflow by going to the Actions tab in your repository.
Once the workflow is complete, you can view the Docker image in the GitHub Container Registry by clicking the generated package displayed on the Code tab in your repository. Click the package to view the details of the Docker image, including the tags, platforms, and the command you can use to pull the image.
You’ll notice that you’re using Docker’s platform emulation via buildx to build the ARM-based image. This is because GitHub runners are x86-based. However, ARM-based runners are in private beta, and once they’re available to the public, you can build ARM-based Docker images natively. Meanwhile, you can use a self-hosted ARM runner if you want to build ARM images natively.
Now that you’ve created the Docker image for your Python application, you can run it on your ARM machine. Simply pull the image from the GitHub Container Registry and run it:
docker run ghcr.io/<user-name>/<repository-name>:latest
This command creates a container from the Docker image and executes it. The output should be similar to the one shown in the previous section.
In this article, you learned how to automate the creation and deployment of Docker images for ARM architecture using GitHub Actions. You learned how to set up a Python application, create a Dockerfile, and configure GitHub Actions to automate the build and image push process. This automation ensures that your users have access to the latest version of your application, regardless of their platform.
Based on my previous post, you’d correctly assume that I would want an expressive language aimed at experts. And I do. But there’s a big problem when things scale up in a flexible language. Too many coding styles, too many ways to program. You end up needing style-guides to nail down the right way to do things.
Which subset of C++ or Kotlin are you using? Are you using pyproject.toml or requirements.txt? Your language now has gradual typing with type annotations. Do you want to adopt those or not? Are you going to use multi-threading, Tokio, or async-std for concurrency?
The more expressive the language, the harder this is. This is where Go shines. It’s not just about gofmt, but also its standard library and the consistent way of doing things. In Kotlin, you’re left wondering: exceptions or Result for errors? But with Go, you know the drill. Look for err. Sure, it’s wordy, but it’s predictable.
Expressive languages are great, but they can be messy. You can have a language that’s rich and complex without a million ways to do the same thing. That’s what I want to show you. How do we keep the power but ditch the clutter? How do we avoid having 500 subdialects of the language? But before we dive into solutions, let’s talk Scala.
To highlight the problem, take Scala. A language I absolutely love, by the way. But it’s got this one big issue. There’s no idiomatic Scala. It’s way too flexible.
I can write a single-file calculator class and start with a Java style:

// Returns, braces and semi-colons
def getResult(): Double = {
  return result;
}

def multiply(number: Double): Calculator = {
  if (number == 0) {
    println("Multiplication skipped: number is 0");
  } else {
    result = result * abs(number);
  }
  return this;
}
Same class, same file, I can switch to a pseudo-Python style:

// significant whitespace, no returns, no semi-colons
def add(number: Double): Calculator =
  result += abs(number)
  this

def subtract(number: Double): Calculator =
  result -= abs(number)
  this
And then, when I call the whole thing, I can use a no-braces, no-dots style. The Ruby DSL style:

val calc = new Calculator add -5 subtract -3 multiply -2
Hopefully, nobody’s stuck with code like this. You pick your dialect of Scala and you stick to it. But as code grows, it’s like Montreal. Every part of the city is different. On a long enough timescale, every quirk possible in your programming language will show up in your code.
Eventually, someone copies and pastes code in a different style. Maybe they like it better. Or in a new service, they do things their way. Or a junior mimics a library’s style from the docs. And style divergence starts.
(Every Scala thread on HN has a comment from someone who inherited a Scala codebase that they are struggling to make sense of, in part because it’s in a foreign style.)
C++20 had a lot of good ideas, but lots of code predates that standard. And so drift occurs. You either don’t adopt the new way or you end up with a code base with more than one style. If you do the latter, you end up with the Montreal Problem. If you’re working in the old-Montreal section of the code, it’s like a different dialect. You now need to know multiple dialects of the language and when and where to apply each one.
So, how do you evolve a language without splitting it apart? This gets trickier with a whole community involved. Big open-source projects often have their own style. The natural tendency towards divergence means it’s hard just to jump into existing codebases we aren’t familiar with because they’ve got their own style. And with that, the community fractures.
Style Guides, especially if they can be machine-enforced, can help a lot at the level of a large code base. I think it’s great when languages can experiment with things. Maybe we don’t know if it totally makes sense to use types in Python everywhere yet or how much you should use generics in Go or whatever.
But for a specific project, we can set rules. Say, ‘this Python needs types’ or ‘Use generics in Go when you can.’ We set standards for testing libraries and build tools. And we attempt to enforce these rules with tooling where we can. But I think we can do even better at a language community level.
Codebase-specific style guides aren’t enough.
When Scala 2.0 launched in 2006, internal DSLs were all the rage, and with Ruby on Rails leading the charge, Scala embraced a more fluid writing style. But times change and that style is no longer idiomatic Scala.
But how would you know that if you’re not in the right circles? Big ORM frameworks still use that style in their guides. The tricks for writing modern idiomatic Scala are trapped in the minds of the community leaders. That’s not great.
We need a Style Czar. Someone in the language community who can say that this is idiomatic Scala 2.1:
def ABSOrSeven(maybeNumber: Option[Int]): Int = {
  if (maybeNumber.isDefined) Math.abs(maybeNumber.get)
  else 7
}
But in Scala 3.1, this is preferred:
def ABSOrSeven(maybeNumber: Option[Int]): Int = {
  maybeNumber.map(Math.abs).getOrElse(7)
}
What I’m suggesting is that every release of a language should come with a style-guide. No one has to follow it; companies and projects may diverge, and the standard might be highly contested; but it should exist and be written down somewhere.
Python folks love their Pythonic code, the zen of Python, and PEP 8. And that is what I’m thinking about, but evolving over time, and with a larger scope. I think the language creators need to not just create the language but be shepherds for emerging standards in how programming is done in the language. They need to talk to us, tell us: do this, not that. And it should be a conversation, with debates, with tools to help us follow the rules.
This standard will keep changing, right? It will evolve as the language does. Maybe type annotations, when they first roll out in Python, are considered experimental. But once everyone is comfortable with them and thinks they are a good idea, Python should take a stance and say type annotations are required in pythonic code. Or say that they are not. But please have an opinion. As the language grows and the community starts to diverge in various ways, the scope of the style document should expand as well.
Here’s an example: Python has to pick a lane with package managers and virtual environments. I think Poetry and pyproject.toml should be the way forward, and others have requirements.txt-for-life tattoos. But any solution would be better than what we have now.
We need someone in charge to step up. “Okay, everyone, we’re standardizing on Hatch for Python packaging. If you’ve got issues with Hatch, speak up. We’ll look into them. But just so you know, we’re aiming to make Hatch the go-to for Python 3.16.”
This goes for testing frameworks, standard libraries, and even how we handle concurrency. Language communities love to experiment and explore. But after the exploring, we need to come together. And that’s where the language creators step in. They’re the ones who can really make it happen.
Ok, so how does this tie to being expressive? Well, if your language is on the ‘expert readability’ train, you probably have a bunch of features and aren’t afraid to add new ones. The problem is you slowly end up in a world where each codebase is written in its own subset of the language. So, deprecate things. Sure, keep old stuff for compatibility, but let’s nudge everyone towards a common standard.
The fact that you are evolving the language implies you have an opinion about what great code looks like. Tell us! Write it down. Talk it out with the community. Using macros is not idiomatic in C++20. Using ifs to check the types of a returned object is not idiomatic Kotlin 1.17. Don’t use explicit returns in Scala. And so on.
This way, there’s always a target of what great code looks like. Even if that target moves, every sane code base is just at a specific point along that journey to the latest and grandest code style.
In other words, you can end up in a world where all idiomatic Java 20 code is as uniform as Go, but to get there, you have to take a stance on when it’s appropriate to use the streams API and when not. I mean, maybe you never quite get there: one person’s clear stream processing one-liner is another man’s spaghetti code, but I really think we could do better at converging on what we want the language to look like.
This brings up a bunch of questions. How much of this can be tool-enforced? When should a popular library be canonized? How much style guidance is too much and stifles innovation? I don’t really know. Just start with something, like a version of Python PEP 8, and evolve and expand it over time.
If something is a community norm, write it down. If the community is fighting over whether to eat toast butter side up or butter side down, flip a coin, make a call, and move on. The community will be better for it.
Show us the way, Style Czar!
This is just an easy-to-show example. If I were the Style Czar, then yes, “Don’t use explicit returns or semicolons” would be a rule. But so would “Don’t pattern match if you can use fold or map,” “Don’t use fold if you can use getOrElse,” “Don’t do manual recursion,” “Don’t use actors unless you have a really good reason,” and “Don’t write custom operators. Just don’t. Really.” And so on and on. Lots of features are only valuable in specific circumstances.
As a software developer, you’re probably familiar with the slow nature of Docker builds. These local builds can consume anywhere from an hour to several hours of your day. The slow pace of these builds not only delays project schedules but also hinders your ability to iterate quickly, forcing you to find a faster, more streamlined build process.
Enter Docker Build Cloud, a solution designed to transform how you build Docker images. In this article, you’ll learn all about Docker Build Cloud and how it works. By the end of the article, you’ll be well-equipped to save valuable time in your development cycle and enhance your overall productivity and efficiency.
Docker Build Cloud is a groundbreaking service aimed at speeding up Docker builds by up to 39 times compared to conventional local builds.
Build Cloud works similarly to local BuildKit instances, but with a key distinction in how it executes: when you initiate a build with Build Cloud, the build data is securely transmitted to a remote builder using end-to-end encryption. After the remote builder finishes the build tasks, it sends the output back to your chosen destination, be it your local Docker image store or an online image registry.
This approach focuses on utilizing on-demand cloud resources. When a build job is received, Docker Build Cloud dynamically allocates cloud-based BuildKit instances to handle the build tasks.
A key feature of Docker Build Cloud’s architecture is team-wide caching. This innovative caching solution allows all team members to share cached build layers across projects. That means that when one team member builds an image, the resulting layers are cached on the cloud, and subsequent builds by any team member can reuse these cached layers if the build context hasn’t changed. This dramatically reduces build times. Additionally, since Build Cloud natively supports multiplatform builds, this advantage extends to any image type, regardless of the underlying platform.
Another advantage of Docker Build Cloud is that it supports building images for different platforms, such as AMD64 and ARM64, and you don’t need multiple native builders or slow emulators. Moreover, builds run on managed infrastructure, ensuring that each build operates in isolation on a dedicated Amazon EC2 instance with a dedicated EBS volume for the build cache. This setup guarantees that there are no shared processes or data between cloud builders, maintaining strict end-to-end encryption and security.
In essence, Docker Build Cloud is not just a tool but a transformational shift in how Docker image builds are approached. It elevates development productivity by streamlining the build process and leveraging cloud resources to vastly reduce build times. Docker Build Cloud also fosters collaboration, significantly reducing duplicate efforts and ensuring that everyone is working in the latest build environment so that builds are both fast and consistent.
Now that you understand the benefits of Docker Build Cloud, it’s time to dive into its implementation. This section shows you how to set up Docker Build Cloud and build images. It also introduces some tips for optimizing your Docker Build Cloud setup so you get the most out of its capabilities.
To start using Docker Build Cloud, you need to link a payment method to your Docker account, even if you plan to only use the free tier.
If you visit https://build.docker.com/, you’ll see a screen asking which profile you want to use for Build Cloud:
After selecting a profile, you’ll be shown the available plans. For this guide, the free Starter plan is sufficient, but you should choose the plan that best fits your needs.
After choosing a plan, you’ll see a pop-up notification telling you that you need to add a valid credit card for account verification:
Once you add your card, you’ll be directed to the Docker Build Cloud dashboard:
From the main dashboard, you can check the remaining build minutes in your plan, upgrade your Docker Build Cloud plan, and create cloud builders. To create a builder, simply click the Create a Cloud Builder button. A pop-up window will appear, asking you to give the builder a name:
Name your builder (here, it’s named mastodon) and click Create.
After creation, you’ll be directed to the Cloud Builders screen, where you can view all your available builders:
Select the builder you just created, and instructions for installing a cloud build driver on your local machine and integrating it with CI/CD processes will appear:
Begin by executing the first two steps. Follow the on-screen commands, which are already populated with your Docker organization and builder name for ease of use. After completing these steps, you can use Docker Build Cloud from your CLI.
Alternatively, because Docker Desktop includes Build Cloud as a preinstalled feature, when you log in to Docker Desktop with your user or organization credentials, you can directly access Build Cloud from the Builders tab.
Then, since you’ve created a cloud builder, you can find it under the Available builders section. You’ll see instructions for connecting the builder via the CLI, but you can skip this step since you’ve already completed it. Simply click the Connect to builder button to start using Build Cloud through Docker Desktop:
Once connected, you can use the menu to use the builder, stop it, or disconnect from it:
The final step in the setup process is to integrate Build Cloud with your existing CI/CD pipelines and tools. Choosing your builder on https://build.docker.com/ only provides detailed instructions for GitHub Actions and CircleCI, but additional integration guidance is available for GitLab, Buildkite, and Jenkins.
Now that you have everything you need to start using Build Cloud from the CLI or from Docker Desktop, it’s time to start leveraging its power by building some images.
This section compares Build Cloud with the traditional docker build command that you’re probably very familiar with. This demonstration uses a basic Flask application:
from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello():
    return 'Hello, World!'
This app is written in Python and creates a basic web application that responds with “Hello, World!” when accessed. To create the Docker image for this app, you can use the following Dockerfile:
FROM python:3.13.0a3-bookworm
WORKDIR /app
RUN pip install flask==2.3
COPY . /app
ENV FLASK_APP=app.py
CMD ["flask", "run", "--host=0.0.0.0", "--port=5000"]
Selecting the base image python:3.13.0a3-bookworm over a lighter alternative, such as one based on Alpine, is a deliberate decision to increase the workload for the Docker build process. Once you save both files, you can start building Docker images.
Building a local image establishes a baseline that allows you to compare your traditional workflow with Docker Build Cloud. For that reason, run the following command to build the image without using Build Cloud:
docker buildx build -t <YOUR_DOCKER_USERNAME>/sample-flask-app:local .
This command builds a Docker image from the Dockerfile and tags it as local to indicate that it’s a local version of the sample-flask-app image under the <YOUR_DOCKER_USERNAME> repository.
Now, switch to Docker Build Cloud to build the same image using the following code:
docker buildx build --builder cloud-<YOUR_DOCKER_USERNAME>-<YOUR_BUILDER_NAME> --tag <YOUR_DOCKER_USERNAME>/sample-flask-app:cloud .
This command initiates the build and is essentially the same as the previous one. The main difference is where and how the build is processed, which is determined with the --builder flag.
The only reason you don’t specify the --builder flag when building the image locally is because you’re using the default builder. If you want to set Docker Build Cloud as the default builder (and save yourself from typing --builder every time), you can use the following command:
docker buildx use cloud-<ORG>-<BUILDER_NAME> --global
Keep in mind that if you do so, each time you build an image, your build will be processed on the cloud. This might impact your billing.
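If you later want to go back to building locally by default, you can switch to the default local builder again (assuming it still has its standard name):

docker buildx use default --global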
In this comparison, the container image took 5.3 seconds to build using the default (local) BuildKit instance and 1.6 seconds using Docker Build Cloud. In other words, the image was built 3.3 times faster.
Before drawing conclusions, remember that your results may vary depending on your hardware and/or internet connection. Regardless, for such a simple image, the results are promising, especially considering that no optimization has been implemented yet.
As the complexity of Docker images increases, they require more computing resources to build. That’s why adopting best practices and effective strategies to enhance Docker build performance is so important.
Consider using .dockerignore files to exclude unnecessary files from your build context, choosing slim base images to reduce final image size, leveraging multistage builds to minimize redundancy and speed up builds, and fetching files directly in your build from remote locations rather than including them in your build context.
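To illustrate the multistage idea with the Flask app from earlier, here’s a minimal sketch that installs dependencies in a builder stage and copies only the resulting virtual environment into a slim runtime stage. The base image tags and stage layout are illustrative, not a prescription:

# Build stage: install dependencies into a self-contained virtual environment
FROM python:3.12-slim AS builder
RUN python -m venv /venv && /venv/bin/pip install flask==2.3

# Runtime stage: copy only the virtual environment and the app code
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /venv /venv
COPY . /app
ENV PATH="/venv/bin:$PATH" FLASK_APP=app.py
CMD ["flask", "run", "--host=0.0.0.0", "--port=5000"]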
If the results of Build Cloud do not meet your expectations, consider upgrading your plan, as doing so gives you access to instances with more CPU power, RAM, cache storage, and parallel builds.
That said, none of these optimizations address the main limitation of Docker Build Cloud, which is that it cannot be leveraged to build artifacts beyond Dockerfiles.
Earthly is a CI/CD framework that expands Docker’s capabilities and makes remote builds super simple. It offers Earthly satellites, which provide an innovative approach to build optimization.
Earthly satellites are remote runner instances that facilitate remote caching, allow for a simplified syntax in build scripts, enable parallel execution of build stages, and manage build artifacts more effectively than Docker alone. A noteworthy application of this approach is seen in the case of ExpressVPN, which dramatically cut CI build times by utilizing Earthly’s caching features and remote runners. This example highlights the potential for significant efficiency gains in CI processes through thoughtful optimization and the adoption of advanced building tools like Earthly.
In this article, you learned how Docker Build Cloud—with its cloud-based infrastructure, native multiplatform support, and team-wide caching—can help you achieve significantly lower build times. However, you also learned that Docker Build Cloud cannot be used to build artifacts beyond Dockerfiles. That’s where Earthly satellites can help.
Earthly satellites can help you unlock the full potential of your CI/CD pipeline with remote runner instances designed for efficiency and scalability. Start optimizing your build processes today and experience groundbreaking speed and reliability in your deployments.
Containers are the Gutenberg press of the IT world: an innovation that profoundly improves the way applications get deployed and managed during runtime.
The primary purpose of a container is to provide applications with a well-defined, replicable, and isolated environment. If an application expects to run on an Alpine Linux system, a container can provide the app with the appropriate Alpine Linux libraries, commands, and system services. This container can run on a Debian or Arch Linux host, and the application inside can still use an Alpine Linux system. Moreover, the application sees restricted versions of the process ID space, the file system, the network, and other OS resources.
These restrictions help isolate containerized apps from each other, which is an important aspect not only from a security perspective but also when it comes to avoiding resource conflicts. (Imagine two processes trying to reserve the same network port.)
To isolate apps, Docker uses Linux namespaces. In short, Linux namespaces divide global system resources into distinct compartments. For instance, a process that is created with a dedicated network namespace can access any network port without coming into conflict with processes in other network namespaces.
In this article, you’ll learn more about Linux namespaces and how Docker uses them to achieve process isolation.
Processes need access to resources and contexts provided by the operating system, including the file system, network ports, user context, and shared memory. These resources and contexts are globally available to all processes, which causes two problems:

- Processes can conflict over the same global resource, such as two processes trying to reserve the same network port.
- Every process can see, and potentially interfere with, resources that belong to other processes, which undermines security and isolation.

Linux namespaces solve both problems. A namespace restricts access to resources and virtualizes these resources. For example, when a process runs inside a network namespace, it can always access port 80, as if that port were exclusively reserved for that process. However, this port is only virtual. The Linux kernel maps this virtual port to a different port at the OS level. In the same manner, a process can get a restricted view of the file system, running processes, and more.
The idea of namespaces is not new. Before Linux, the operating system Plan 9 from Bell Labs had a concept of namespaces that controlled access to the file system. Since all system resources in Plan 9 are mapped into the file system, Plan 9 namespaces effectively control access to any resource, not just files and folders.
Another early approach to namespaces is the Jails concept from FreeBSD and other BSD derivatives. Jails virtualize access to the file system, the set of users, and the network, effectively partitioning a BSD system into several independent virtual systems.
Linux utilizes various types of namespaces for controlling access to specific resource types, including the following:

- mnt for file system mount points
- pid for process IDs
- net for network devices, ports, and routing tables
- ipc for interprocess communication resources
- uts for the host name and domain name
- user for user and group IDs

Closely related to namespaces is the chroot command (and system call) that moves the root directory for the current process to a chosen directory within the file system.

Running a process in a separate namespace is easy to achieve. The unshare command runs a process and “unshares,” or separates, one or more namespaces from the parent.
For example, the following command runs a shell in a separate UTS namespace:
sudo unshare --uts bash
If you want to test this, the following Bash commands read the host name and spawn a shell with a separate UTS namespace, rename the host inside the shell, and read the host name inside and outside the shell:
alice@earthly $ hostname
earthly
alice@earthly $ sudo unshare --uts bash
$ hostname
earthly
$ hostname martian
$ hostname
martian
$ exit
alice@earthly $ hostname
earthly
While the host name inside the shell is successfully changed, the parent shell does not see the change.
In the same way, you can isolate a process from the host’s PID space, network devices, file mounts, users and groups, or interprocess communication.
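For example, the following commands give a shell its own PID namespace (--fork and --mount-proc ensure the new shell becomes PID 1 and that ps reads the namespace’s own /proc):

sudo unshare --pid --fork --mount-proc bash
ps aux

Inside that shell, ps lists only bash and ps itself, with bash running as PID 1.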
To manipulate namespaces programmatically, an application can use three syscalls: clone, unshare, and setns.
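As a rough sketch of what this looks like in practice, a Python process could call unshare(2) through libc to move itself into a new UTS namespace. The flag value below comes from the kernel headers, and the script needs to run as root:

# Minimal sketch: call unshare(2) via ctypes to enter a new UTS namespace.
import ctypes
import os

CLONE_NEWUTS = 0x04000000  # from <linux/sched.h>

libc = ctypes.CDLL("libc.so.6", use_errno=True)
if libc.unshare(CLONE_NEWUTS) != 0:
    errno = ctypes.get_errno()
    raise OSError(errno, os.strerror(errno))

# Hostname changes now stay inside the new namespace; the rest of
# the system keeps its original host name.
os.system("hostname sandbox && hostname")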
When it comes to isolating containers from each other, Docker doesn’t reinvent the wheel. Because it’s a Linux-native technology, Docker uses Linux namespaces to achieve isolation and resource access control.
Using the PID namespace, Docker hides the global process list from processes inside a container. The first process spawned inside a container gets assigned a PID of 1. If it spawns further processes, they get assigned subsequent PIDs.
Containerized applications do not need to access all the mounted file systems on the host to do the tasks they were built for. Separate mnt namespaces prevent containers from accessing mounts that belong to another container or the host.
Together with the chroot syscall, the mnt namespace can change the visible file system for a process. The chroot syscall changes the root directory that is visible to a process to a given directory of the host system. The chrooted process can only see that directory and its subdirectories.
If two processes communicate with each other through Linux interprocess communication, they have to share the same ipc namespace. In general, containerized applications do not talk to applications in other containers (or, if they do, they typically use a REST API or a message queue system). That means that for isolation purposes, each container receives its own ipc namespace.
Sometimes it’s unavoidable that applications inside containers must run as root. To avoid security breaches, Docker can remap the root user inside a container to an unprivileged user ID on the host, effectively preventing privilege escalation attacks.
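This remapping is a daemon-level setting. A minimal sketch of /etc/docker/daemon.json that turns it on (the special value "default" tells Docker to create and use a dockremap user) looks like this:

{
  "userns-remap": "default"
}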
With a network namespace, a container gets a unique, virtual network stack. This includes network interfaces, IP addresses, ports, and routing tables. When each container has its own IP address, virtual networks between containers can be established that isolate network traffic between containers from traffic to and from the host or the host’s networks.
Linux namespaces address three requirements around containerization: security, resource management, and container management.
By limiting access to system resources and contexts, namespaces help improve container security. Intruders that break into a running container are unable to see other containers or host processes. The root user inside the container maps to a user without privileges on the host. That means intruding into a system through a Docker container is much harder than intruding into a system where all processes run inside the host’s global namespaces.
Namespaces also provide the means for managing the resources for a process. If a process does not initiate network connections outside the host, the container can be set up to allow only network communication with the host. Additionally, a reduced list of mounted file systems ensures that a process does not interfere with other processes’ file resources.
Finally, managing containers is simpler if they do not interfere with each other. Imagine that you start two containers that both run a web server listening on port 80. With separate network namespaces in place, each container’s port can be remapped to a different available port on the host system.
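For example, with the stock nginx image, both containers listen on their own port 80, while the host maps them to different ports:

docker run -d -p 8080:80 --name web1 nginx
docker run -d -p 8081:80 --name web2 nginx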
How does Docker use Linux namespaces in practice? Let’s take a look at three typical requirements for containers and how Docker meets these requirements using namespaces.
Containers provide isolated environments for processes, but they still need to access global resources on the host. To avoid conflicts, Docker differentiates containers by remapping the IDs of resources, such as network ports, process IDs (PIDs), user IDs (UIDs), or group IDs (GIDs).
For instance, a containerized process can have different PIDs inside the container and on the host. If it is the first process spawned inside a container, it sees itself as the process with ID 1. Unless this process spawns subprocesses, it does not see any other processes on the system.
To examine this behavior, you can run a container and inspect the output of ps inside and outside the container.
As a prerequisite, you need Docker installed on your system. The host used in the following examples is a Linux system. If you use a different OS, Docker runs inside a Linux VM. In this case, ssh into the VM to follow the steps of the examples. It’s also assumed that Docker is set up for use without root.
The following command creates a new container based on an Alpine Linux image and runs the sh shell interactively:
docker run --rm -it alpine sh -C
Please note: Alpine does not come with bash. Alpine Linux is used here because the image is small compared to other Linux distros.

The -C flag only serves to make finding the sh command in the host’s ps output easier.
You should now see Alpine’s sh prompt.

Type ps to see the PID of the shell:
/ # ps
PID USER TIME COMMAND
1 root 0:00 sh -C
7 root 0:00 ps
/ #
You can see that sh has PID 1. Child processes get assigned subsequent PIDs. (Run ps again to see its PID increase.)
Now open a new shell on the host and find the sh process using ps and grep:

$ ps ax | grep "sh -C"
 3150 pts/1    Sl+    0:00 docker run --rm -it alpine sh -C
 3186 pts/0    Ss+    0:00 sh -C
 3208 pts/2    S+     0:00 grep --color=auto sh -C
On the host, the PID of the same sh process is entirely different—3186 in this example.
If you start a second Docker container in the same way, the shell inside the container will also have a PID of 1. But as each container has its own PID namespace, the PIDs inside the two containers will not conflict with each other or with the PIDs on the host.
Similarly, Docker can remap other resources, such as network ports or user IDs, to further isolate containers from each other and from the host.
Container security is a complex topic, and many layers of security technology are involved in making containers secure, but everything starts at the Linux namespaces layer.
Linux namespaces are at the core of container security. All other security layers and techniques are stacked on top of namespaces.
In the Linux kernel, these layers include:

- Control groups (cgroups), which limit the CPU, memory, and I/O resources a container can consume
- Linux capabilities, which grant processes only a subset of root’s privileges
- seccomp profiles, which restrict the system calls a process may make
- Mandatory access control systems such as SELinux and AppArmor
The Docker platform provides additional security techniques, such as Docker trusted content and Docker secrets, but these don’t apply to the level of single-container instances.
Although namespaces are the lowest layer of these security measures, they are indispensable for container security in two ways: they help prevent security breaches and resource hijacking, and they help achieve data privacy in multitenant applications.
Imagine criminals gaining access to a containerized process. In this scenario, thanks to a separate mnt namespace and chroot, they can only see a small part of the host’s file system. A user namespace and UID/GID remapping cause the root user inside the container to be an unprivileged user outside the container. If the intruders manage to break out of the container, they’ll have no root privileges outside.
Due to a separate PID namespace, the intruders can only see the processes that run inside the container. They have no way of determining which processes run on the host or in other containers. If the intruders try to scan or block all network ports, a separate network namespace confines their malicious activity to container-local ports, keeping the host network out of reach.
Because namespaces compartmentalize essential resources, they’re ideally suited for multitenant scenarios where isolating data is crucial. The same concepts and ideas for preventing security breaches and resource hijacking apply here, too. By isolating processes, file systems, mount points, users, and networks of different tenants, users can confidently run their applications without worrying about data leaking into other tenants’ containers.
After talking so much about achieving isolation, it’s time to try running two containers in the same namespace.
In this scenario, let’s assume that two containers can run in the same PID namespace so that they can see each other’s processes. Let’s also assume a given process has the same PID in both containers.
You’ll try this in two different ways: through the Docker CLI and with Kubernetes pods.
Docker’s run command has a --pid flag that assigns a container to the PID namespace of another container.
To test this, open two shells. In one shell, type the following to start a container named waldorf:

$ docker run --rm -it --name waldorf alpine sh
In the other shell, start a second container named statler with the --pid flag:

$ docker run --rm -it --name statler --pid=container:waldorf alpine sh
The --pid=container:waldorf flag tells Docker to use waldorf’s PID namespace for the statler container as well.
Now run ps in both shells. You’ll see two sh processes in each output instead of one, and the PIDs of the shell processes are identical in both containers.
Shell one looks like this:
/ # ps
PID USER TIME COMMAND
1 root 0:00 sh
7 root 0:00 sh
13 root 0:00 ps
And shell two looks like this:
/ # ps
PID USER TIME COMMAND
1 root 0:00 sh
7 root 0:00 sh
14 root 0:00 ps
Be aware that the two containers can influence each other. For instance, they could send signals to the other container’s processes, including a kill signal. That means you need to use shared namespaces with caution and only if there is no other more secure way of making two containers work together.
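You can see this firsthand: in the example above, running the following from the waldorf shell terminates statler’s shell (PID 7 in the sample output; check your own ps listing first, as the PID may differ):

/ # kill -9 7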
Containers that run in the same Kubernetes pod can share PIDs as well. It takes nothing more than a single line in the pod configuration.
To run the following steps locally, you need a running Kubernetes node, such as the standalone Kubernetes setup provided by Docker Desktop.
As an example, the configuration file below starts two containers running a sleep 1000 command (you’re not using interactive shells here):
apiVersion: v1
kind: Pod
metadata:
  name: shared-pid-pod
spec:
  shareProcessNamespace: true
  containers:
    - name: waldorf
      image: alpine
      command: ["sleep"]
      args: ["1000"]
    - name: statler
      image: alpine
      command: ["sleep"]
      args: ["1000"]
Note the following line:
shareProcessNamespace: true
This line shares the PID namespace between the containers running in this pod.
Start the pod by calling the following:
$ kubectl apply -f shared-pid-pod.yaml
Then, check the status of the pod until the status is Running:

$ kubectl get pods
NAME             READY   STATUS    RESTARTS   AGE
shared-pid-pod   2/2     Running   0          9s
Now start an interactive shell in one of the containers and run ps to see the running processes:

$ kubectl exec -it shared-pid-pod -c waldorf -- sh
/ # ps
PID USER TIME COMMAND
1 65535 0:00 /pause
7 root 0:00 sleep 1000
13 root 0:00 sleep 1000
19 root 0:00 sh
25 root 0:00 ps
/ #
There are two sleep 1000 processes visible, one from each container, and they both share the same PID namespace.
Sharing namespaces between processes lowers the level of isolation between the involved containers and should therefore be used sparingly and only if absolutely necessary. That being said, which of the above approaches—Docker CLI or Kubernetes pods—would be more suitable for production use?
It might be tempting to say that it doesn’t really matter because the benefits and risks of sharing namespaces stay the same. However, there are a few arguments that lean towards using Kubernetes pods rather than the Docker CLI for production use:
- With Kubernetes, sharing the PID namespace takes nothing more than setting shareProcessNamespace: true in the pod spec. This is less error-prone and much easier to maintain than having to manually specify the right container names and IDs when using docker run.

In general, starting containers through CLI commands is more of an ad hoc approach that does not scale well. Productive environments should be equipped with a software orchestration solution like Kubernetes.
Linux namespaces are at the core of container isolation in Docker. Namespaces compartmentalize global kernel resources, such as network interfaces, processes, file mounts, and users and groups. With namespaces and chroot, a process inside a container may find itself running as the only process on a Linux system, equipped with root privileges, while in fact it’s only isolated from other processes through separate namespaces and a virtual file system root.
By leveraging namespaces, Docker provides a robust platform for deploying diverse applications on a shared system with a high degree of control and separation.
Service Organization Control 2 (SOC 2) is important for assessing and verifying the controls around the security, availability, processing integrity, confidentiality, and privacy of systems that process user data. This is particularly significant for SaaS providers like Earthly.
There are two types of SOC 2 compliance:

- SOC 2 Type 1 evaluates the design of an organization’s controls at a single point in time.
- SOC 2 Type 2 evaluates both the design and the operating effectiveness of those controls over an extended review period.

Our SOC 2 Type 2 compliance doesn’t impact the functionality of Earthly Cloud (or Earthly Satellites). Both will continue to work exactly as they always have and as our users expect.
From a non-functional standpoint, SOC 2 Type 2 compliance helps assure our users that they can rely on Earthly Cloud (and Earthly Satellites) for continuous business operations and that downtime and potential disruptions are minimized. It also helps assure users that risks associated with data breaches, unauthorized access, and data loss are mitigated. Also, many industries and companies have stringent regulatory requirements that require SOC 2 Type 2 compliance from SaaS vendors. For those users, it means that Earthly meets their security needs and is a valid option for them to consider and choose.
To request a copy of our SOC 2 Type 2 report, contact security@earthly.dev.
Developer platforms centralize the internal tools and processes that developers use to build and deliver software. These platforms improve DevOps outcomes by providing automated mechanisms for developers to achieve their tasks.
More teams and organizations are launching their own internal developer platforms (IDPs) to reduce friction in their software delivery. Establishing an IDP takes time but offers substantial benefits, including improved productivity, easier collaboration, and less risky development. This article will explore these advantages and how they apply when building or buying an IDP solution.
DevOps enhances software delivery by tightening feedback loops, encouraging cross-discipline communication, and automating key processes around testing, quality, and deployment. However, DevOps doesn’t solve every challenge, nor does it tell you how to implement its concepts. Teams often struggle to understand where to start automating their work, or they run into difficulties when apps, team members, environments, and tools change because it’s hard to enforce which technologies devs are using.
Developer platforms are a holistic solution for achieving DevOps ideals. Consolidating your toolchain around a shared internal platform makes it easier to integrate different processes and standardize them across developers. An IDP gives devs automated access to the resources they need, such as the ability to create a new test environment on demand or request access to inspect what’s running in production.
You can do DevOps without an IDP, but it’ll probably be much less efficient. IDPs remove the responsibility for managing infrastructure processes from individual devs, letting them concentrate on delivering new software. Engineers don’t need to worry about how complex processes like deployments happen as they’re implemented within the IDP. The platform’s capabilities are usually maintained by a dedicated platform engineering team that exists to support DevOps requirements.
Developer platforms provide many benefits that collectively enable DevOps teams to quickly deliver impactful work while experiencing fewer problems. This section covers some of the ways in which adopting an IDP will improve your team’s engineering output.
Developer platforms can have a transformational effect on software development velocity. They shorten the software development lifecycle (SDLC), tighten feedback loops, and remove work from developers by automating previously time-consuming processes that depend on multiple tools.
IDPs achieve this by providing functions that support developer autonomy and efficiency. Instead of having to wait for separate teams to apply actions, developers are empowered to achieve their aims themselves using self-service options available within the platform.
For example, creating a new staging environment is often a complex procedure with many steps involved. Traditionally, developers would have contacted an infrastructure team to request that new resources be provisioned, then asked ops for help in deploying an environment, and finally consulted a senior developer to learn how to seed the deployment with some test data. This sequence is long-winded, dependent on multiple stakeholders, and a potential source of inconsistencies between environments.
Making the process available in an IDP solves these challenges. Developers would be able to use the platform to automatically launch a new environment on demand, such as by clicking a button, running a CLI command, or even using an extension added to an IDE or chat app. Moreover, as everyone deploys using the same automated action, the IDP standardizes the workflow and prevents configuration discrepancies from occurring. This helps you scale as more apps and developers are added to your organization.
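To make this concrete, here's a sketch of what such a self-service action might look like from a developer's terminal. The idp tool and every flag shown are purely illustrative, not a real CLI:
# Hypothetical platform CLI (the "idp" command and its flags are invented
# for illustration): create a staging environment from the standard template,
# seed it with shared demo data, and tear it down automatically after 48 hours.
idp environment create --template staging --seed-data demo --ttl 48h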
The throughput increase afforded by an IDP was observed by Mercado Libre after it implemented its own FURY platform. Introducing FURY unlocked exponential growth in the developer count, while the number of microservices seamlessly grew to over 24,000. The organization found that the “decrease in cognitive load and the gain in efficiency … have undoubtedly made every investment in a platform worthwhile.”
Accelerating the SDLC results in developer productivity improvements too. Increasing the use of automation removes friction from processes, helping developers stay focused on meaningful work instead of being distracted by infrastructure management tasks.
In turn, this contributes to improved developer satisfaction. Devs are happiest when they have the tools they need to do their work without having to wait for others. A good platform supports developers so they can be more efficient, making them feel valued within the organization.
IDP automation and self-service capabilities are key to this effect as they promote autonomy. Devs can trigger tasks when they need them, even if they don’t understand exactly how they work or which tools are being used. This also helps to democratize development by allowing each engineer to safely apply changes to any area of a project, including those that might fall outside their existing skills.
The actual productivity increase created by an IDP will naturally vary by organization. Some of the effects are quantifiable by looking at metrics such as the number of issues closed, pull requests approved, and deployments created; a successful IDP implementation should deliver increases to these values as developer time is freed up.
Other aspects are purely qualitative enhancements to your developer experience (DX): developers should feel less pressured, more empowered, and better equipped to obtain visibility into operations across the SDLC. These benefits don’t always show up directly in your metrics but will have a positive impact on morale, motivation, and team retention, which further helps sustain development velocity.
One of the best examples of the productivity benefits of IDPs comes from Spotify, the original creator of the popular Backstage developer portal system. After creating and adopting Backstage internally across 280 teams and more than 2,000 backend services, Spotify observed a 55 percent decrease in new developer onboarding time, highlighting how platforms support developers to deliver more value faster.
Beyond productivity improvements at the individual level, developer platforms also help foster better collaboration within and between DevOps teams. They make it easy to align the efforts of different teams and working groups, ensuring consistent results even where complex multidiscipline processes are used.
Additionally, centralizing tools, processes, and documentation into a unified platform means that everyone’s working with the same resources. All developers can access the assets they require and understand how neighboring teams are tackling problems, thus preventing information from becoming siloed.
IDPs also make it easy to integrate changes into workflows and then roll them out consistently across the whole organization. This reduces the number of operating procedures, security policies, and compliance rules that need to be directly communicated to each team, limiting the places where knowledge can fall between the cracks.
In some cases, IDPs don’t even need to contain communication-specific functionality, as centralized docs and specs can be sufficient to achieve process improvements. During the development of its internal platform, Zalando spent time writing standards for how different teams should work together using common practices, including authoring API guidelines and identity management expectations. Once a behavior is defined as part of your platform, everyone has a common spec to work against. You can then implement automated tooling later on to detect and prevent spec noncompliance.
Of course, IDPs can’t do everything; indeed, this surfaces one of their potential weaknesses. For an IDP to be successful, all teams need to use it to achieve their aims, even where they may have previously reached for or built their own solutions. Obtaining acceptance from different teams is therefore a priority for any IDP implementation. This prevents fragmentation from occurring if some groups start creating their own tools that exist separately from the platform.
Delivering software using an IDP reduces the costs involved in the SDLC. Instead of maintaining multiple environments and resources for developers to use, you can focus on supporting a single platform with optimized utilization and provisioning. The removal of manual process steps frees up developers to stay focused on feature development, reducing overall sprint durations and minimizing the cost to your organization.
Additionally, improving developer productivity and increasing throughput allow extra value to be delivered to customers, making you more competitive and reducing time to market. Cutting the hours spent manually connecting tools or waiting for processes to complete ensures all developers are occupied with meaningful work.
Standardizing development activities around automated platform-centric tasks also reduces the risk in the SDLC. Replacing manual workflows with automated ones provides stability and helps eliminate many common errors, such as inadvertently skipping a step or supplying an incorrect input to a command.
Moreover, IDPs make it easier to govern the entire SDLC using standard policies and frameworks. If you’re subject to specific security, compliance, or regulatory requirements, you can implement guardrails in your platform that ensure continual enforcement across all teams and projects. This defends against accidental compliance breaches when developers use unapproved tools or accidentally push insecure code.
Developer platforms also make you more resilient to other types of risks, such as the productivity threats posed by staff absences. Centralizing processes into self-service platform actions means work can continue even when senior developers or adjacent teams are unavailable, making it less likely that you’ll miss critical deadlines.
Teams that have adopted internal platforms can deploy up to four times faster than those without, according to analysis by Humanitec. Additionally, their change failure rate falls simultaneously—down to 4 percent from 15 percent for teams without a platform. When failures do occur, the mean time to recovery is just 1.3 hours with an IDP instead of six hours without, making it much less likely that SLAs will be breached. Ultimately, IDPs protect you from costly incidents and make it easier to deploy more frequently, giving you a competitive edge.
There’s no one-size-fits-all developer platform solution. Some teams assemble their own platforms from scratch, which incurs a high initial cost but enables a high degree of customization. Others favor prebuilt commercial solutions that enable immediate adoption, but these may be less adaptable to future change. Between these two approaches, open-source platforms such as Backstage provide a robust foundation for your own tools, giving you some of the benefits of both DIY and off-the-shelf approaches.
Here are some aspects to consider when choosing whether to build or buy an IDP.
The following are a few advantages of building your own IDP:
- You can model your exact workflows and accommodate unique requirements
- You retain full control over customization and future changes
- You avoid vendor lock-in
The following are a few disadvantages of building your own developer platform:
- High initial cost and implementation complexity
- Ongoing maintenance that requires access to skilled platform engineers
If you're interested in buying a prebuilt IDP, the following are a few of the advantages:
- Immediate adoption with reduced integration time
- Minimal platform engineering investment
- Established best practices that are defined for you
While the benefits of buying a prebuilt IDP are tempting, you also need to consider the following disadvantages:
- Less adaptability to future change and unique requirements
- The risk of vendor lock-in
The decision to build or buy should be based on the organizational context in which your IDP will be received. Building your own platform guarantees you can implement your exact workflow requirements and specific customization needs, but you need to be prepared for the complexity involved. If platform adoption is urgent, then selecting a prebuilt solution is a pragmatic way to reduce integration time and obtain immediate DevOps improvements.
Nonetheless, your planning should also account for your long-term strategy and development vision, as either type of IDP can affect your ability to execute in the future. A self-built platform is infinitely flexible but requires ongoing access to skilled internal teams; purchased options allow you to consistently focus on development work with minimal platform engineering investment but pose a constant threat of vendor lock-in.
Selecting a popular open-source platform like Backstage, seen by many as the original tool for building internal developer portals, is a third approach to consider. Not only can open-source mitigate or even eliminate cost and lock-in concerns, it also leaves you free to build your own extensions and bespoke components atop the platform.
You can evaluate prospective solutions across the following four priorities:
Priority | Build or Buy?
---|---
Organizational maturity and technical expertise | Build: Mature software organizations with access to skilled platform engineers will be able to exactly model their processes with minimal risk of being encumbered by the platform.
Specific requirements and customization needs | Build: Building your own platform allows you to accommodate your precise requirements, including unique aspects that are unlikely to be supported in prebuilt solutions.
Time to market and urgency of implementation | Buy: Prebuilt platforms are invariably faster to integrate and require fewer resources to get started.
Long-term strategy and vision for the platform | Build: Consider building if the platform will be essential to your future strategy and you're prepared to spend time maintaining it. Buy: Consider buying if you want the platform to be transparent, lack the resources to manually maintain it, or are unsure about your future commitment to DevOps and want to follow established best practices that are defined for you. Alternatively, use an open-source platform to get off the ground quickly, while retaining the option to fork the project and contribute custom functionality in the future.
In this article, you learned how developer platforms can address the challenges of modern DevOps by offering a centralized platform that automates key processes in the SDLC. By giving devs self-service access to the tools and procedures they need for their work, IDPs can accelerate throughput, enhance developer productivity and satisfaction, and prevent you from becoming dependent on risky manual tasks.
Whether you choose to build or buy a solution, launching an IDP requires an upfront investment to integrate it with your processes and train your developers. You’ll also need to budget for the platform’s ongoing maintenance to support the changing requirements of your DevOps teams. However, the long-term efficiency improvements enabled by successful adoption can easily offset these costs, making an IDP one of the most valuable assets for high-performing software organizations.
In this roundup, you’ll learn about five popular platforms—Backstage, Qovery, Clutch, OpsLevel, and Appvia—by analyzing their core features, integration capabilities, user experience, and scalability, as well as the quality of their support and documentation. By the end of the article, you’ll have a better idea of which platform may be right for you.
Backstage, originally built by Spotify, is an open source Kubernetes-based developer platform that’s scalable, flexible, and easy to use.
Let’s take a look at a few of the reasons Backstage is a popular developer platform.
One of the main reasons for Backstage’s popularity is its modular architecture. All base functionalities (Software Catalog, Kubernetes, Software Templates, Backstage Search, and TechDocs) are part of the Backstage Core, while additional features can be added later (more on that shortly).
The Software Catalog acts as the inventory for all your services, enabling efficient organization and governance. Backstage’s Kubernetes feature is a monitoring dashboard that allows developers to assess the status of their services effortlessly, regardless of the deployment location, be it local or across multiple remote production clusters.
Software Templates automate the scaffolding of new projects, ensuring consistency and best practices, while Backstage Search offers a centralized search across all your documentation and resources, enhancing discoverability. Last but not least, TechDocs integrates documentation directly into the developer portal, allowing for seamless access and maintenance of technical content, streamlining workflows, and boosting productivity within the development lifecycle.
Backstage boasts notable flexibility through an extensive ecosystem of open source integrations and plugins. It supports the seamless incorporation of existing tools and services, enabling customization to suit unique workflow requirements. Developers can leverage an array of community-contributed plugins or create proprietary ones to extend Backstage’s functionality, ranging from CI/CD, monitoring, and cloud services to security scanning and incident management. Additionally, Spotify recently launched the Spotify Marketplace for Backstage, which offers enterprise-level and trusted third-party plugins. These paid plugins further improve the aspects of Backstage related to visibility, collaboration, and security.
This adaptability ensures that Backstage can evolve with your tech stack and maintain its role as a comprehensive yet friendly developer interface that provides an incredible developer experience.
Backstage is user-friendly and offers a developer-centric experience that simplifies navigation through a well-organized interface you can explore in the demo portal. Its intuitive design allows you to speed up onboarding, while the standardized setup across tools reduces complexity.
The platform’s consistent developer environment and the ability to access various services from a single portal enhance productivity. With features like Software Templates and TechDocs, Backstage empowers developers to focus on innovation rather than getting bogged down by processes and systems.
Backstage is engineered for scalability and caters to both small startups and large enterprises. As discussed, its modular architecture allows you to start small and expand as your needs grow without compromising performance. Additionally, since Backstage runs on Kubernetes, the platform manages an increasing number of services, plugins, and users with ease, maintaining a smooth experience. This, combined with Backstage’s ability to integrate with a vast range of tools and services, ensures that as your team and tech stack expand, Backstage can scale with you, facilitating continuous growth and development.
Backstage’s robust documentation is a cornerstone of its appeal, guiding users through setup, customization, and development with comprehensive material. Moreover, the Backstage Community Hub makes it easy to stay tuned to the latest developments and news regarding the platform.
However, for organizations seeking commercial support, Backstage might not be the best fit since neither Spotify nor the Cloud Native Computing Foundation (CNCF), which is currently in charge of the project, offers a managed service. Generally, the lack of direct commercial support could be seen as a disadvantage, placing the onus on in-house teams to deploy, maintain, and troubleshoot the platform. That said, the number of Backstage adopters continues to grow, including companies as large as American Airlines, Netflix, and Splunk.
Clutch, born from Lyft’s engineering challenges, is a resilient open source platform for infrastructure tooling. Its customizable workflow engine uniquely supports diverse operational tasks, setting it apart with flexibility in managing infrastructure.
Clutch’s origin story and adaptability make it a noteworthy solution for dynamic infrastructure needs.
Clutch is known for its highly secure environment, offering fine-grained authentication and authorization control for resources and comprehensive security auditing for transparency. However, Clutch’s real strength lies in its modularity, or—as the Clutch team prefers to call it—the workflows and components approach.
The frontend is made up of different workflow packages, while the backend uses components named according to their task: services, modules, resolvers, and middleware. As you’ll learn soon, this provides great flexibility at the cost of some convenience. Nevertheless, workflows and components allow for seamless integration without the need for messy hacks, supporting both public and private extensions.
Additionally, thanks to its modularity, Clutch simplifies infrastructure management by serving as a single access point to various tech stacks, making complex processes simple.
Clutch is designed for easy customization and adaptability, thanks to its open and modular architecture. Teams can craft custom workflow packages with React and backend components using Go. For custom feature development, you create API definitions in Google’s proto3 format. Keep in mind, though, that while Clutch’s components are straightforward to tweak and enhance, it lacks a marketplace for plug-and-play integrations. This means that if the current features don’t meet your needs, you’ll have to actively develop your own solutions.
As is often the case with platforms that favor flexibility and extensibility, Clutch is configured through code rather than a GUI: its configuration is written as Protobuf definitions. Similarly, at build time, your team must decide which workflows to install and register them using Clutch's command line scaffolding tool. This means that your comfort with command line tools (as opposed to graphical interfaces) will influence how user-friendly you find Clutch.
At the moment, Clutch does not distribute prebuilt binaries, so you have two options: build Clutch as a Docker container or run it locally using Go and Node.js. That means that if your goal is to ensure scalability, you can use Clutch’s supplied Dockerfile to build a container with all its core components, or you can build the container from scratch using only the components you want.
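For instance, here's roughly what the container route might look like, assuming you're building from the Dockerfile in Clutch's repository; the image tag and published port are our own choices:
# Build Clutch from the Dockerfile shipped in its repository
git clone https://github.com/lyft/clutch.git
cd clutch
docker build -t clutch:local .
# Run the container; adjust the published port to match your configuration
docker run --rm -p 8080:8080 clutch:local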
Keep in mind that this platform differs significantly from others in this roundup, as it doesn’t run natively on Kubernetes. Whether this is a benefit or drawback depends on your specific needs and how you plan to deploy your IDP.
Clutch, like Backstage, is a free, open source platform with a vibrant community. However, unlike Backstage, Clutch lacks commercial support. If you experience issues, you’ll have to rely on community channels such as GitHub or Slack, which may not meet the support needs of many organizations.
Qovery is a powerful platform designed to help developers and platform engineers accelerate deployment and streamline cloud infrastructure management. Originating as a solution to common DevOps challenges, it distinguishes itself with its strong governance and security capabilities, seamless integration with major cloud providers, and cost optimization for Amazon Web Services (AWS) and Kubernetes environments.
Qovery streamlines cloud infrastructure management by providing developers with self-service control over their infrastructure through automated environment provisioning. This powerful feature enables developers to create ready-to-run production and staging environments, even ephemeral environments, for quick tests. This ultimately accelerates application deployment.
Regarding governance and security, Qovery was designed from the ground up to comply with the General Data Protection Regulation (GDPR) as well as System and Organization Controls 2 (SOC2) security best practices. The proof of this lies in its powerful backup and restore feature, its encryption for data in transit as well as data storage and secrets, and its multifactor authentication and fine-grained access controls. On top of all that, Qovery recently released a public beta that provides access to detailed audit logs that facilitate debugging complex issues.
Qovery also offers cost optimization features, such as auto-start and stop environments and automatic deployment rules, which help reduce cloud expenses. This combination of automation, integration, governance, and cost efficiency positions Qovery as a robust solution for developers and platform engineers aiming for accelerated and controlled cloud-native development.
Qovery shines when it comes to flexibility, supporting over 100 integrations and plugins that cater to various DevOps tools and services. Additionally, Qovery provides interfaces for all tastes, including a CLI, a REST API, and a user-friendly web UI. Thanks to these diverse interfaces, Qovery manages to connect seamlessly with popular version control systems, container registries, and monitoring solutions to create a cohesive development ecosystem.
The platform’s flexibility also allows for the addition of custom functionalities via webhooks and the API, fostering a highly adaptable and efficient development process.
Qovery was crafted with a focus on simplicity and developer experience. Its intuitive interface and seamless integration with familiar developer tools (GitHub, GitLab, and Bitbucket, as well as Helm repositories) enable you to deploy applications with a minimal learning curve.
In addition, the platform makes it easy to deal with complex processes, like managing CI/CD with GitLab CI, CircleCI, GitHub Actions, and Jenkins, freeing up developers to concentrate on coding rather than infrastructure management. The result is a streamlined workflow that minimizes setup time and accelerates development cycles.
Another aspect in which Qovery excels is scalability. This is possible because the platform runs on top of a Kubernetes cluster, providing developers with inherent advantages to Kubernetes, such as horizontal and vertical autoscaling. Additionally, the Qovery Control Plane and Qovery Engine facilitate smooth environment provisioning, advanced app deployment rules, and resource adjustment in response to application demands, ensuring optimal performance and availability.
Overall, thanks to its autoscaling features, Qovery effortlessly handles traffic surges and load variations to maintain system efficiency. This scalability is crucial for modern applications that need to adapt to changing workloads swiftly, making Qovery an ideal choice for growth-oriented development.
A notable advantage of Qovery over Backstage is that it offers both a free tier and paid plans. In addition to its robust community support, Qovery also provides dedicated support for Team and Enterprise plans, including priority issue handling, to ensure minimal downtime and rapid resolution.
In addition to the official documentation, Qovery provides developers with detailed blog posts, guides, case studies, and tutorials that can help users through every feature and process, while also providing best practices and troubleshooting tips.
OpsLevel is a modern internal development platform that facilitates developer-centric operations and streamlines the complexity of modern software delivery by placing a huge focus on service ownership, maturity, and standardization.
OpsLevel seeks to address the pain points of engineering teams and developers alike by providing a variety of features in three key areas: a software catalog, service maturity, and self-service.
The software catalog enhances visibility with an up-to-date repository where developers can track services, systems, domains, infrastructure, and service dependencies, thus centralizing crucial information. A novelty that differentiates OpsLevel from other IDPs is that it can automatically populate service descriptions using generative AI, making it easier for team members to understand what each service is for.
Service maturity, standardization, and quality of code are achieved through automated service checks and maturity reports where you can define rubrics and scorecards that help you understand your services’ health and status, as well as campaigns that help visualize progress in the adoption of new standards or initiatives like upgrading libraries or addressing tech debt. Additionally, service quality and standardization are also encouraged through predefined action templates and integrated tech and API documentation, ensuring consistency across the board.
The platform’s self-service capability empowers developers with custom actions that enable them to execute workflows autonomously, thus accelerating task completion and fostering a culture of ownership and rapid innovation.
Flexibility is at the forefront of OpsLevel, with extensive integrations and plugins that enable teams to seamlessly connect with a vast array of tools and services in their ecosystem, such as Slack, New Relic, Kubernetes resources, Jenkins, and Grafana.
Additionally, OpsLevel provides out-of-the-box integrations for single sign-on (SSO) authentication with Okta, Auth0, Google, and other providers. It also offers integrations for automatic user provisioning using Okta, GitHub teams, and other providers supporting SCIM.
By accommodating custom plugins and supporting a wide range of third-party applications and services, the platform ensures that teams can tailor their workflows to their specific needs. This adaptability allows for a more cohesive and efficient development process that aligns with various tech stacks and operational strategies.
OpsLevel prioritizes a seamless developer experience, streamlining operations with a fully customizable internal developer portal from which your team can manage, view, or control services, groups, systems, domains, and more. Additionally, developers can interact with the OpsLevel API via the CLI or using the config-as-code paradigm by editing opslevel.yml and pushing changes to the corresponding repository. Among other advantages, this versatility facilitates the quick onboarding of new team members.
All in all, OpsLevel’s ease of use and excellent developer experience not only boost efficiency but also keep developers focused on high-impact work, elevating overall satisfaction and output.
Like other IDPs in this roundup, OpsLevel’s scalability is anchored in its deployment on Kubernetes, whether as a software-as-a-service or self-hosted solution. Kubernetes ensures OpsLevel can efficiently handle growth in services, teams, and workloads. This flexibility allows organizations to scale their operations seamlessly, adapting to increased demands without compromising on performance or reliability and maintaining a consistent, responsive experience across the platform as their engineering ecosystem evolves.
Unlike Qovery, OpsLevel does not have a free tier, only a fourteen-day free trial, after which you can select a custom plan tailored to your organization’s needs. Support is provided via email, a dedicated Slack channel, and in-app chat.
One-on-one support is complemented by a variety of resources, including detailed docs, blog posts, guides, podcasts, and tech talks that cover the full spectrum of OpsLevel features, from setup and configuration to advanced usage. This extensive knowledge base is designed to facilitate self-guided learning and troubleshooting, allowing teams to leverage the platform’s full potential and streamline their operations with confidence and minimal external support.
Appvia Wayfinder is an IDP that originally addressed the UK Home Office’s complex tech challenges. It provides self-service cloud infrastructure for developers and platform teams with robust security, valuable cost management features, and a developer-centric approach.
Appvia’s ecosystem emphasizes a Kubernetes-centric approach to infrastructure management. This means that it aligns closely with Kubernetes methodologies and leverages its capabilities for container orchestration. In this regard, the Appvia ecosystem is akin to Rancher’s in that it provides Kubernetes management at an enterprise level, simplifying Kubernetes as a service.
Appvia’s key offerings include Wayfinder for centralized Kubernetes management and Cloud Landing Zones for establishing secure, compliant, and scalable cloud foundations.
You can think of Wayfinder as a developer self-service platform that simplifies deploying and managing Kubernetes infrastructure and applications. Wayfinder features are provided through Kubernetes custom resource definitions (CRDs). This can be positive or negative, depending on whether or not you want to build a developer platform following a Kubernetes-centric path. In any case, keep in mind that when installing Wayfinder, it only comes with a handful of CRDs that provide functionalities for cloud access, networking, cost optimization, DNS management, app management, security, and RBAC policies. To add extra functionality, you must also use CRDs (more on that later).
On the other hand, Appvia’s Cloud Landing Zones provide a secure, scalable foundation for cloud operations, incorporating governance, cost controls, and workload isolation to facilitate compliant and efficient cloud adoption and management. You can think of them as opinionated frameworks that provide out-of-the-box access management policies and governance controls, central audits, and compliance. Appvia provides on-request cloud landing zones for major cloud providers like Amazon Web Services (AWS) and Microsoft Azure. You can learn more about Cloud Landing Zones and their architecture in this blog post.
As mentioned, Wayfinder revolves around a Kubernetes-first approach; for this reason, its flexibility is not as impressive as the other IDPs in this roundup. However, you can expand functionality by creating your own CRD or following the Kubernetes operator pattern. For example, you could install the Jenkins Operator for continuous integration or simply browse Kubetools to find a tool suitable for your use case. All in all, don’t expect an Appvia marketplace for Wayfinder, which can be a drawback for some.
Once you install Wayfinder on AWS, Azure, or GCP, your team can interact with it either through its kubectl-like CLI, its API, or the Wayfinder portal. This is an advantage since Wayfinder fits with different workflows. Regardless of whether you prefer the UI or the command console, you can manage Wayfinder using abstraction layers and objects similar to those native to Kubernetes.
For instance, Wayfinder uses an abstraction layer called workspaces, which groups users and cloud infrastructure so they can be managed independently of other workspaces. This way, you can create users in a given workspace and assign them groups, roles, and access policies according to the needs of your organization.
Users with sufficient permissions can create clusters within their workspace and deploy applications that run on it. Users can also define individually deployable parts of applications, called components. Similar to Kubernetes namespaces, Appvia uses environments to isolate groups of resources within a cluster.
Overall, Wayfinder’s approach, based mostly on Kubernetes-like concepts rather than services that run on Kubernetes as other IDPs do, is something to keep in mind, given that it can impact the developer experience. That means developers familiar with Kubernetes will feel at home with Wayfinder, while developers with no prior experience may prefer more user-friendly platforms like Backstage, Qovery, or OpsLevel.
There’s not much to add about Wayfinder’s scalability. Since it runs on Kubernetes, it shares the scalability of other Kubernetes-based IDPs in the roundup. That said, its scalability in terms of functionality and adaptability to your organization’s needs is debatable. Wayfinder’s backbone is robust thanks to Kubernetes, but the lack of ready-to-use plugins and integrations could be a deal-breaker.
Appvia has numerous resources to help developers. For instance, Wayfinder’s official docs cover how to install Wayfinder, how to install the CLI, how to access the GUI, how to configure SSO and set cloud access, and how to manage DNS. Additionally, the documentation has an API reference and a CRD reference, which can be useful for developing your own solutions. Other resources available include Appvia’s blog and YouTube channel, as well as e-books and the Cloud Unplugged podcast.
Regarding commercial support, Wayfinder and Appvia Cloud Landing Zones are treated as independent products. Wayfinder has a free trial, after which you can choose between the Standard plan (support from 9 a.m. to 5 p.m. weekdays) and the Premium plan (support 24/7). Likewise, Cloud Landing Zones have a Standard plan (9 a.m. to 5 p.m., Monday to Friday support) and a Premium plan (one-hour response service-level agreement).
In this article, you learned that Backstage offers versatile yet generalized capabilities that are ideal for organizations looking for a user-friendly and widely known platform. However, maintaining the platform could be a challenge given the lack of commercial support.
Both Qovery and Appvia Wayfinder prioritize governance and security. However, Qovery edges out with superior flexibility and developer experience. OpsLevel, for its part, distinguishes itself by advocating for service ownership, maturity, and standardization alongside an impressive developer experience.
Which developer platform is the best? The answer revolves around what the specific needs of your organization are.
Medium to large companies typically use several different components and services for operations, including apps developed in-house, third-party services such as Google Cloud Platform (GCP) or Amazon Web Services (AWS), and third-party APIs. Managing all these services can become a complex task, which is why Backstage, an open source project, helps companies create developer portals that consolidate all their services, configurations, and secrets into one place. With Backstage, the portal that you create gives you a place to document each of your services and provides an overview of the services in use, their locations, and interdependencies.
In this article, you’ll learn how to create a developer portal with Backstage.
Picture this: you’re a developer at a growing company with a popular app. One day, you get a complaint from a customer that they’re not receiving emails from the app. You put your debugging glasses on and realize that the app uses a third-party service for sending emails. However, you have no idea what service is being used. You ask around, and it turns out that the developer who set up this integration has left the company. After an entire week of brainstorming, you finally succeed in replacing the component responsible for emails with a shiny new component. But now you realize that some other crucial functionality of the app is broken because you had no idea it depended on the old mailing system.
This is a familiar story in many organizations. As you add more and more software, services, and APIs to your tech stack, keeping track of them becomes increasingly challenging.
Backstage was created because Spotify ran into this same issue. Spotify needed a place to collect and document all its services, and Backstage allowed it to do just that, helping it grow into the company it is today.
In 2020, Spotify donated Backstage to the Cloud Native Computing Foundation (CNCF), where it received the love and support of the open source community and was developed into a community-driven effort aimed at simplifying development.
At its core, Backstage is a developer portal. A developer portal works as the heart of development and allows developers to quickly find what they need.
The following are a few of the reasons Backstage is a great developer portal:
- Software Catalog: a centralized inventory of all your services, APIs, and their dependencies
- Software Templates: scaffolding for new projects that bakes in your organization's best practices
- TechDocs: documentation that lives alongside your code and surfaces directly in the portal
- Backstage Search: a single search across all your services and documentation
Additionally, as previously stated, Backstage is open source, which means you can modify it to your heart’s content and host it in your own architecture.
If some of these features interest you, read on! In the next section, you’ll get a quick hands-on tutorial on how to set up Backstage and use its Software Catalog.
Before you begin this tutorial, you'll need the latest version of Node.js installed, as well as Yarn Classic installed and set up.
Note that you can upgrade to the latest version of Yarn later on, but to create the Backstage instance, you’ll need Yarn Classic.
You’ll also need to install and set up PostgreSQL on your computer.
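You can verify the prerequisites from a terminal before starting; the expected outputs in the comments are only indicative:
node --version   # a recent, supported Node.js release
yarn --version   # should print 1.x for Yarn Classic
psql --version   # confirms the PostgreSQL client tools are available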
To begin, you need to create a Backstage instance by running the following command:
npx @backstage/create-app@latest
When you're prompted, enter the name of the directory where you want to set up the instance (e.g., backstage).
Once the setup is complete, your output will look like this:
// Some output omitted
Moving to final location:
moving backstage ✔
Installing dependencies:
init git repository ✔
determining yarn version ✔
executing yarn install ✔
executing yarn tsc ✔
🥇 Successfully created backstage
All set! Now you might want to:
Run the app: cd backstage && yarn dev
Set up the software catalog: https://backstage.io/docs/features/software-catalog/configuration
Add authentication: https://backstage.io/docs/auth/
Navigate to the backstage directory and install the dependencies by running yarn install.
Then, start the Backstage server with the following command:
yarn dev
This launches a Backstage instance at http://localhost:3000 and opens a browser window where you'll be greeted with the default Backstage instance:
Before you proceed with the rest of the article, you’ll need to configure Backstage to use PostgreSQL as the database. By default, Backstage works with an in-memory database, which means any changes you make will be lost when you restart the server. To prevent that, a database such as PostgreSQL is needed where Backstage can store the data.
To configure PostgreSQL, install the pg library by running the following command:
yarn --cwd packages/backend add pg
Open the app-config.yaml file where the configuration for Backstage is stored. You'll find a database key that looks like this:
database:
  client: better-sqlite3
  connection: ':memory:'
By default, this sets Backstage up to use an in-memory database. You could edit this file to set up the PostgreSQL connection, but that's not a secure approach because the configuration contains sensitive information, such as the database URL and password. The app-config.yaml file is checked into version control, which means anyone who has access to your company's version control system can read it.
It's better to use app-config.local.yaml for sensitive configurations. This file is not checked into version control, and any configuration in this file overrides the same keys from app-config.yaml. So, open app-config.local.yaml and add the following:
# Backstage override configuration for your local development environment
backend:
  database:
    client: pg
    connection:
      host: 127.0.0.1
      port: 5432
      user: USER
      password: PASSWORD
Replace USER with your PostgreSQL user and PASSWORD with its password.
Restart the Backstage server and look for a line like this:
Performing database migration
This means the database connection is successful.
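If you want to confirm from the database side as well, you can list the databases that Backstage created. The backstage_plugin_* names below reflect Backstage's default naming and may differ if you've customized it:
# List databases and filter for the ones created by Backstage's plugins
psql -h 127.0.0.1 -U USER -l | grep backstage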
The default Backstage instance doesn’t perform any authentication, and anyone who can access the URL can access the Backstage instance. It’s a good idea to set up authentication for security purposes. With just a few lines of code, you can set up a GitHub login page using OAuth.
First, go to https://github.com/settings/applications/new and create a new OAuth app. Enter http://localhost:3000 in the Homepage URL field and http://localhost:7007/api/auth/github/handler/frame as the Authorization callback URL. Give the app a name and click Register application:
On the next page, you’ll be shown a client ID that you’ll need to copy. Click the Generate a new client secret button and copy the generated client secret:
Open the app-config.local.yaml file and paste the following YAML code into it:
auth:
  # See https://backstage.io/docs/auth/ to learn about auth providers
  environment: development
  providers:
    github:
      development:
        clientId: YOUR_CLIENT_ID
        clientSecret: YOUR_CLIENT_SECRET
Replace YOUR_CLIENT_ID and YOUR_CLIENT_SECRET with the client ID and client secret you acquired.
Open packages/app/src/App.tsx and add the following imports:
import { githubAuthApiRef } from '@backstage/core-plugin-api';
import { SignInPage } from '@backstage/core-components';
Search for const app = createApp({, and below apis, add the following:
components: {
  SignInPage: props => (
    <SignInPage
      {...props}
      auto
      provider={{
        id: 'github-auth-provider',
        title: 'GitHub',
        message: 'Sign in using GitHub',
        apiRef: githubAuthApiRef,
      }}
    />
  ),
},
Restart the server, and you’ll be prompted with a login screen:
Once you log in with GitHub, navigate to Settings and verify that your name and email have been taken from GitHub:
Note: By default, Backstage comes with a guest sign-in resolver. With this resolver, all users share a single “guest” identity. You can read more about configuring user identities in the official docs.
To better understand some of the benefits of Backstage’s Software Catalog system, let’s imagine a scenario where your company has a product with a Node.js backend and a Python frontend client. You want to add them to Backstage’s Software Catalog.
The apps have already been created and hosted on GitHub to make your life easier. Here's the Node.js backend and the Python client. You'll be modifying the repos, so you must fork them to your GitHub account and clone them to your local computer. If you prefer to simply see the end result, both repos have a branch named final that contains the final code.
Note: It isn't necessary to run the apps to register them in Backstage, but if you want to run them, you can find instructions in the README files in the repos.
To register components in Backstage, each component must have a catalog-info.yaml file. This file contains metadata about the project and acts as the single source of truth for the components. Here, you'll add the catalog-info.yaml files to both repos.
To start, create a file named catalog-info.yaml in the root of the Node.js app with the following code:
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: Blog-App
  description: This is a dummy Node.js app
  tags:
    - node
spec:
  type: service
  lifecycle: experimental
  owner: user:guest
In this code, the kind key sets the kind of entity. In this case, Component denotes that it's a piece of software that you're registering to Backstage. The metadata key records information related to the component, such as its name, description, and annotations. Finally, in the spec key, you set the type, lifecycle, and owner of the component.
Commit this file and push the changes to your GitHub repo.
In the Backstage dashboard, click Create, and then click REGISTER EXISTING COMPONENT:
In the URL field, enter the URL to your repo's catalog-info.yaml file. It should be something like https://github.com/<YOUR_GITHUB_USERNAME>/<YOUR_GITHUB_REPO_NAME>/blob/<BRANCH_NAME>/catalog-info.yaml. Replace <YOUR_GITHUB_USERNAME> with your GitHub username, <YOUR_GITHUB_REPO_NAME> with the name of the repo in your account, and <BRANCH_NAME> with the name of the branch you're working on:
When you click ANALYZE, Backstage fetches the catalog-info.yaml file and extracts the entities from it:
Click IMPORT, and after the component is registered, click View Component. You’ll be taken to the entity page:
You can see that the name and description have been fetched from the catalog-info.yaml file. On the left side, you'll find a VIEW SOURCE button that takes you straight to the GitHub repo of the app. On the right-hand side, you'll find a relations graph that shows the relations of this entity with other entities.
Right now, you can see that the user:guest entity has an ownerOf/ownedBy relationship with this component.
APIs are at the center of modern software development. Almost every piece of software either exposes an API for other software to communicate with or consumes an API to communicate with other software. The Blog-App component also exposes an API. With Backstage, you can also catalog the API in the Software Catalog.
Open the catalog-info.yaml file and add the following in the spec key:
providesApis:
  - blog-api
This tells Backstage that the Blog-App component provides an API named blog-api.
Now, let's define the API. Add the code below at the end of the catalog-info.yaml file. As before, replace the GitHub-related parts with information specific to your repo:
---
apiVersion: backstage.io/v1alpha1
kind: API
metadata:
  name: blog-api
  description: The Blog API
spec:
  type: openapi
  lifecycle: experimental
  owner: user:guest
  definition:
    $text: https://github.com/<YOUR_GITHUB_USERNAME>/<YOUR_GITHUB_REPO_NAME>/blob/<BRANCH_NAME>/api/swagger.yaml
Notice that kind is set to API because this is an API entity, and the definition key refers to the api/swagger.yaml file in the repo. Backstage can generate a Swagger UI using the Swagger file.
The full catalog-info.yaml file looks like this:
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: Blog-App
  description: This is a dummy Node.js app
  annotations:
    backstage.io/managed-by-location: https://github.com/<YOUR_GITHUB_USERNAME>/<YOUR_GITHUB_REPO_NAME>/blob/<BRANCH_NAME>/catalog-info.yaml
  tags:
    - node
spec:
  type: service
  lifecycle: experimental
  owner: user:guest
  providesApis:
    - blog-api
---
apiVersion: backstage.io/v1alpha1
kind: API
metadata:
  name: blog-api
  description: The Blog API
spec:
  type: openapi
  lifecycle: experimental
  owner: user:guest
  definition:
    $text: https://github.com/<YOUR_GITHUB_USERNAME>/<YOUR_GITHUB_REPO_NAME>/blob/<BRANCH_NAME>/api/swagger.yaml
Commit and push the changes.
Once Backstage registers a component, it periodically refreshes the catalog-info.yaml file by re-fetching it. That means once you push your changes, you should see them reflected automatically after a few minutes.
Once the component is refreshed, you'll see a new entity has been added to the relations graph. The api:blog-api entity has an apiProvidedBy/providesApi relationship with the Blog-App component:
If you go to the API tab, you'll see blog-api listed under Provided APIs:
Click blog-api, and you'll see that a Swagger UI has been generated from the Swagger file:
In a real-world project, components often depend on other components. With Backstage’s robust Software Catalog, you can record the dependencies between components.
In this section, you'll focus on the Python client, which depends on the Blog-App component and consumes the blog-api API.
Add a catalog-info.yaml file in the repo for the Python client:
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: Python-Client
  description: This is a dummy Python app
  tags:
    - python
    - web
spec:
  type: website
  lifecycle: experimental
  owner: user:guest
  consumesApis:
    - api:blog-api
  dependsOn:
    - component:Blog-App
This is very similar to the metadata for Blog-App. The consumesApis key tells Backstage that this component consumes the blog-api API, and the dependsOn key tells Backstage that this component depends on the Blog-App component.
As before, commit and push this code and register the component. You'll notice that the relations graph shows how the Python-Client component is related to the blog-api API and the Blog-App component:
If you go to the DEPENDENCIES tab, you'll see that the Blog-App component shows up as a dependency:
Click Blog-App to be taken to its overview page, where you'll see the relations graph has been updated to include the new Python-Client component:
Congratulations! Now you not only have a catalog of your components, but you also have a clear understanding of how they’re related to each other.
Developer portals are a must for any company that uses a lot of services and components in its ecosystem. With a developer portal, you can consolidate all your components in one place, which increases developer productivity and provides a bird’s-eye view of the entire ecosystem.
Backstage helps you build a powerful developer portal with features like a Software Catalog, TechDocs, and Templates. In this article, you learned all about these features as well as how to set up a Backstage instance and register components in the Software Catalog. You also learned how to add APIs and dependencies among components.
Docker is a popular containerization solution for packaging, distributing, and running applications in lightweight environments. However, with growing container density and workload variety comes increased pressure to control container performance. Thankfully, Linux offers powerful tools, including namespaces and control groups (cgroups), that enable fine-grained resource allocation and guarantee the optimal performance of each container.
In this article, you’ll learn more about namespaces and cgroups and how to use them to control Docker performance.
Linux namespaces provide a mechanism for isolating system resources, enabling processes within a namespace to have their own view of the system, such as process IDs, network interfaces, and file systems. Docker uses namespaces to create isolated containers, each with its own set of resources. This ensures application separation and security.
The following are a few different types of namespaces:
- Mount (mnt) namespace: isolates the set of filesystem mount points visible to a process
- Process ID (pid) namespace: gives processes an independent set of process IDs
- Network (net) namespace: isolates network interfaces, routing tables, and ports
- Interprocess communication (ipc) namespace: isolates System V IPC objects and POSIX message queues
- UTS namespace: isolates the hostname and domain name
- User namespace: remaps user and group IDs between the host and the namespace
- cgroup namespace: virtualizes a process's view of the cgroup hierarchy
Namespaces in Linux provide a way to isolate and virtualize system resources, thus enhancing security by preventing processes in one namespace from directly interacting with processes in another namespace.
Namespaces increase security by providing a level of isolation that prevents unintended interactions between processes. This isolation is particularly valuable in containerization and virtualization scenarios, where multiple applications or services share the same host system but must be kept separate for security reasons.
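You can see this isolation firsthand with the standard lsns and unshare utilities; the hostname used below is arbitrary:
# List the namespaces the current shell belongs to
lsns -p $$

# Start a shell in fresh UTS and PID namespaces (requires root)
sudo unshare --uts --pid --fork --mount-proc /bin/sh
hostname isolated-demo   # change the hostname inside the new namespace
hostname                 # prints isolated-demo; the host's hostname is untouched
exit                     # leave the namespaced shell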
cgroups
cgroups are a Linux kernel feature that enables the management and partitioning of system resources by controlling the resources available to a collection of processes. Administrators can use cgroups to allocate resources, set limits, and prioritize processes. Docker utilizes cgroups to control and limit the resources available to containers.
Different types of available cgroups include CPU cgroup, memory cgroup, block I/O cgroup, and device cgroup.
While cgroups are not explicitly designed for security, they play a crucial role in controlling and monitoring the resource usage of processes.
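To get a feel for cgroups outside of Docker, here's a minimal sketch using the cgroup v2 filesystem interface. It assumes cgroup v2 is mounted at /sys/fs/cgroup and must be run as root:
# Make sure the cpu controller is available to child cgroups
echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control
# Create a cgroup and cap its processes at 50% of one CPU (quota/period in microseconds)
mkdir /sys/fs/cgroup/demo
echo "50000 100000" > /sys/fs/cgroup/demo/cpu.max
# Move the current shell into the cgroup; everything it spawns inherits the cap
echo $$ > /sys/fs/cgroup/demo/cgroup.procs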
Although namespaces and cgroups may appear similar in definition, they are fundamentally different and serve different purposes. Namespaces perform isolation by creating separate environments for processes that prevent one process from accessing or affecting other processes and/or the system. In contrast, cgroups distribute and limit resources like CPU, memory, and I/O among groups of processes. Often, namespaces and cgroups are used together for process isolation and resource management.
Now that you know more about namespaces and cgroups, it’s time to learn how to use them to control Docker performance.
All the code for this article is available in this GitHub repo.
To decrease a container's attack surface, Docker offers the --user option. The default user inside containers is the root user. With the --user flag, you can specify a non-root user and group to run the first process in containers. This lets you limit the potential impact of any security vulnerabilities.
Using the --user option may hinder application functionality, since some applications require elevated privileges to operate. When this occurs, your container or app configuration may need to be altered to grant the required privileges.
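As a quick illustration, the following runs a throwaway container as an arbitrary non-root user; 1000:1000 is just an example UID:GID pair:
# Run the container's first process as UID 1000 / GID 1000 instead of root
docker run --rm --user 1000:1000 alpine id
# Typical output: uid=1000 gid=1000 (no root privileges inside the container)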
This is where a user namespace can help. By default, Docker runs containers with identical user and group IDs as their host system, which means that if an attacker gains entry through any one container, they could potentially escalate their privileges. User namespaces help mitigate such attacks by remapping the user IDs (UIDs) and group IDs (GIDs) used inside the container to a different, unprivileged range on the host.
To better understand this concept, let’s use the user namespace to isolate the containers for security purposes.
Using namespaces with Docker
In this scenario, you'll learn about some of the advantages of namespaces in Docker. Run the following command to create a file on the host machine and make it readable only by the root user:
echo "This is a super sensitive file" | sudo tee /secret
sudo chmod 600 /secret
Let's pretend that the file /secret is a very sensitive file that should never be accessed by anyone other than root. But what happens when you mount the host filesystem in a Docker container? Let's find out. Run the following command to start a busybox container and mount the host filesystem into it:
sudo docker run -it --rm -v /:/host busybox /bin/sh
Then, try to access the sensitive file:
cat /host/secret
This is a super sensitive file
As you can see, the root user in the Docker container has UID 0. When this user tries to access a file on the host filesystem owned by the host's root user (who also has UID 0), the system happily obliges. This is a huge security risk, as the Docker container can freely read or modify any file on the host machine.
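You can confirm this from inside the same busybox shell with the id command:
id
# uid=0(root) gid=0(root) ...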
To prevent this, you need to make use of namespaces.
First, make sure that your Linux kernel supports user namespaces. To do so, find the configuration file for your kernel and use grep to search for CONFIG_USER_NS. The following command is an example for the Ubuntu kernel:
grep -E '^CONFIG_USER_NS=' /boot/config-$(uname -r)
The following output confirms that your kernel supports user namespaces:
CONFIG_USER_NS=y
Next, open or create the Docker daemon configuration file (usually located at /etc/docker/daemon.json) and add the following configuration that sets up user namespaces with default remapping:
{
  "userns-remap": "default"
}
Restart the Docker daemon to apply the changes:
sudo systemctl restart docker
Run the container again:
sudo docker run -it --rm -v /:/host busybox /bin/sh
Try to read the file:
cat /host/secret
This time, you'll be faced with a Permission denied error.
To better understand what is happening behind the scenes, execute the following command in interactive mode to run a Docker container:
docker run -it nginx sleep 300
Your output will look like this:
latest: Pulling from library/nginx
af107e978371: Pull complete
336ba1f05c3e: Pull complete
8c37d2ff6efa: Pull complete
51d6357098de: Pull complete
782f1ecce57d: Pull complete
5e99d351b073: Pull complete
7b73345df136: Pull complete
Digest: sha256:2bdc49f2f8ae8d8dc50ed00f2ee56d00385c6f8bc8a8b320d0a294d9e3b49026
Status: Downloaded newer image for nginx:latest
After running the container, check to see if the container is up and running:
docker ps
In this scenario, you’re running an Nginx container and executing the sleep command inside it.
Run the following command to list all currently running processes and filter the results to only show lines containing the word “sleep”:
ps -aux | grep sleep
The ps command provides information about running processes.
Your output will look like this:
osboxes 7216 0.0 0.4 1329172 24576 pts/0 Sl+ 13:39 0:00 docker run -it nginx sleep 300
231072 7279 0.0 0.0 2484 1280 pts/0 Ss+ 13:39 0:00 sleep 300
osboxes 7329 0.0 0.0 17732 2560 pts/1 S+ 13:40 0:00 grep --color=auto sleep
Observe that one sleep 300 process shows up in the list of processes. It is owned by the user with UID 231072 (this UID may be different for you). Where is this user coming from? Use the following command to look at the file /etc/subuid:
cat /etc/subuid
Your output should look like this:
osboxes:100000:65536
ansible:165536:65536
dockremap:231072:65536
What this tells us is that Docker creates a default user named dockremap, whose subordinate UID range on the host starts at 231072. UID 0 inside the container is mapped to UID 231072 on the host, so any process started by root in the container is owned by the unprivileged UID 231072 on the host, protecting the host from privilege escalation.
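You can inspect this mapping directly through the kernel’s /proc interface. As a sketch, using the PID of the sleep process from the ps output above (7279 in this example; yours will differ):

# Show the UID mapping applied to the containerized process
cat /proc/7279/uid_map
# Columns: UID inside the namespace, UID on the host, size of the range
# e.g. 0 231072 65536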
Next, use the docker info command to verify that user namespace support is enabled correctly:
sudo docker info
Your output should look like this:
Client:
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.10.4
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.17.3
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 2
Running: 0
Paused: 0
Stopped: 2
Images: 1
Server Version: 23.0.6
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 3dce8eb055cbb6872793272b4f20ed16117344f8
runc version: v1.1.7-0-g860f061
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
userns
cgroupns
Kernel Version: 6.2.0-26-generic
Operating System: Ubuntu 22.04 LTS
OSType: linux
Architecture: x86_64
CPUs: 5
Total Memory: 5.744GiB
Name: osboxes
ID: 80a2b682-9225-423a-bd2e-a0a3c61e8cf0
Docker Root Dir: /var/lib/docker/231072.231072
Debug Mode: false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
The numbers at the end of the Docker Root Dir line indicate that the daemon runs inside a user namespace. The numbers should match the subordinate user ID of the dockremap user as defined in the /etc/subuid file.
Now that you know how to use namespaces to increase security, let’s use cgroups to configure resource limits. In this example, you’ll run a Docker container with CPU limits.
Let’s take a look at what cgroups are set up when you run a container. But before that, you need to know whether your system is using cgroup v1 or v2. The easiest way to find out is to look for the file /sys/fs/cgroup/cgroup.controllers. If it exists, you’re using cgroup v2; otherwise, you’re using cgroup v1.
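As a quick sketch, you can script this check:

# Report which cgroup version this system is using
if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
  echo "cgroup v2"
else
  echo "cgroup v1"
fi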
Run the nginx container and note the container ID from the output:
docker run -d nginx
To find the cgroups for this container, you’ll need to look in one of the following locations, depending on the cgroup version and the cgroup driver:

- /sys/fs/cgroup/memory/docker/<container_id>/ on cgroup v1 with the cgroupfs driver (default)
- /sys/fs/cgroup/memory/system.slice/docker-<container_id>.scope/ on cgroup v1 with the systemd driver
- /sys/fs/cgroup/docker/<container_id>/ on cgroup v2 with the cgroupfs driver
- /sys/fs/cgroup/system.slice/docker-<container_id>.scope/ on cgroup v2 with the systemd driver (default)

Here, you can find different metrics for the container. For example, you can read the maximum CPU allocated to this container by reading the cpu.max file in this directory:
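For instance, on a cgroup v2 system with the systemd driver, the check would look something like this (substitute your container’s full ID, and adjust the path to match one of the variants listed above):

# Read the CPU quota and period for the container's cgroup
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/cpu.max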
max 100000
This implies that the container is allowed to consume the maximum available CPU on the host. Let’s limit it to half of one CPU. First, kill the container and recreate it with the --cpus option:
docker run --cpus 0.5 -d nginx
The cpu.max file should now show the following:
50000 100000
This shows that the container is only allowed 0.5 CPUs: the first number is the CPU quota in microseconds (50,000) and the second is the period (100,000), so the container may use at most 50ms of CPU time in every 100ms window. You can adjust the --cpus value according to your desired CPU utilization.
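If you’d rather not recreate the container every time, you can also adjust the limit on a running container with docker update (a sketch; replace <container_id> with your container’s name or ID):

# Raise the CPU limit of a running container to one full CPU
docker update --cpus 1 <container_id>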
If you want to run a Docker container with memory limits using cgroups, you can use the --memory option with the docker run command. The following example uses the official Ubuntu image from Docker Hub:
docker run -d --name new-container --memory=256M ubuntu sleep infinity
In this command, -d makes the container run in the background, and --name new-container assigns a name to the container. --memory=256M limits the container to a maximum of 256 mebibytes of memory, and ubuntu is the image that’s being used. sleep infinity is the command run inside the container; it does nothing, but it keeps the container alive indefinitely.
Now, run docker ps. You should see that a container named “new-container” is up and running:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d67613c74bd6 ubuntu "sleep infinity" 26 seconds ago Up 25 seconds new-container
To verify that the memory limits are applied, run the following docker stats command:
docker stats d67613c74bd6
Your output will look like this:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
d67613c74bd6 new-container 0.00% 388KiB / 256MiB 0.15% 3.42kB / 0B 0B / 0B 1
You can see the memory limit for this container is now set to 256 MiB.
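You can also confirm the limit at the cgroup level. As a sketch, on a cgroup v2 system with the systemd driver (again, substitute the container’s full ID and adjust the path for your cgroup version and driver):

# Read the memory limit from the container's cgroup
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/memory.max
# 268435456 bytes = 256 MiB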
To get real-time statistics for the running container, including CPU and memory usage, run the following command:
docker stats new-container --no-stream --format " {{ json . }}" \
| python3 -m json.tool
Your output will look like this:
{
"BlockIO": "0B / 0B",
"CPUPerc": "0.00%",
"Container": "new-container",
"ID": "d67613c74bd6",
"MemPerc": "0.15%",
"MemUsage": "388KiB / 256MiB",
"Name": "new-container",
"NetIO": "3.6kB / 0B",
"PIDs": "1"
}
In this output, you can see in the "MemUsage" field that Docker has applied a memory limit of 256 MiB.
Mastering Linux namespaces and cgroups is essential for optimizing Docker performance. With the help of these features, administrators can fine-tune resource allocation, enhance security, and ensure the smooth operation of containerized applications.
As the landscape of containerization continues to evolve, a solid grasp of Linux namespaces and cgroups empowers users to harness the full potential of Docker and deliver high-performance, scalable applications. In this article, you learned how namespaces and cgroups are used by Docker, and how you can utilize namespaces for container isolation and cgroups for limiting and monitoring resource usage.
Earthly Cloud: Consistent, Fast Builds, Any CI
Consistent, repeatable builds across all environments. Advanced caching for faster builds. Easy integration with any CI. 6,000 build minutes per month included.