April 19th, 2024

At Carriyo we currently generate 10k PDFs per day for shipment labels and other documents that are needed for the shipping process. We design our documents in HTML / CSS and use Puppeteer to convert them to PDF (and also ZPL for labels).

Puppeteer uses the Chromium browser to generate the PDF, which, as you can guess, is very heavy on resources. Despite its heavy resource usage we continue using Puppeteer, as most PDF solutions fail badly at rendering complex RTL scripts like Arabic. Also, the design flexibility of CSS layouts is unmatched. There are close to zero tools out there that have an awesome layout engine like CSS flexbox and that also handle text wrapping of mixed English-Arabic bidirectional text with custom fonts! Just imagine the amount of code and complexity that goes into building that!

With a lack of other options, the only way forward for us was to “brute force” our way with puppeteer PDF generation. So buckle up, as I take you through a whole bunch of optimizations and error handling considerations to scale this beast up.

Optimization 1: More cores! More tabs!

“* slaps surface of a laptop * This bad boy can fit so many browser tabs in it”

– Probably a laptop salesman to a potential customer

Puppeteer uses the Chromium browser, and it has tabs, just like your regular browser. Puppeteer calls them “pages” in its API. There are two observations that you need to consider when dealing with tabs:

1. await page.pdf() is quick to respond, but takes 100% of a CPU core for that short moment.
2. await browser.newPage() takes 20-30 ms on my laptop (could be slower or faster on a server, but overall it isn’t that fast).

Let’s see what constraints these bring. Let’s begin with the 100% CPU issue. If you have a single-core machine and your node.js process begins two concurrent page.pdf() calls, the total response time would only be marginally better than generating the two PDFs serially, one after the other. There isn’t much point trying to create more concurrent PDFs than the number of cores of the machine, as each page consumes a full core. For this reason, when I run Puppeteer on a non-serverless setup, I use a queue (either in memory or Redis or something) to track in-flight PDF generation requests and limit the concurrency to the number of cores - 1 (-1 so that the node.js process doesn’t get affected).
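To make the queue idea concrete, here is a minimal in-memory sketch. It is not our production code: createLimiter, limitPdf and pdfConcurrency are names I made up for illustration, and a Redis-backed queue would look different.

```javascript
const os = require('node:os');

// Hypothetical helper: runs at most `limit` tasks at a time and queues the rest in memory.
function createLimiter(limit) {
  let active = 0;
  const waiting = [];

  const next = () => {
    if (active >= limit || waiting.length === 0) return;
    active += 1;
    const { task, resolve, reject } = waiting.shift();
    // Promise.resolve().then(task) also turns a synchronous throw into a rejection.
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        active -= 1;
        next();
      });
  };

  return function run(task) {
    return new Promise((resolve, reject) => {
      waiting.push({ task, resolve, reject });
      next();
    });
  };
}

// Leave one core for the node.js process itself, but never go below 1.
const pdfConcurrency = Math.max(1, os.cpus().length - 1);
const limitPdf = createLimiter(pdfConcurrency);
```

Usage would be something like const pdf = await limitPdf(() => page.pdf({ format: 'A4' })); so that extra requests simply wait their turn instead of fighting over the same cores.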

However what this indicates is that you need more CPUs when you have lots of concurrent requests. Serverless platforms like AWS Lambda and Google Cloud Run can scale to 100s (or 1000s even) of CPUs in a few seconds. For high volumes, it could be better to run PDF generation in a separate service on these platforms.

We use the Sparticuz/chromium project on AWS Lambda, and a small percentage of requests experience cold starts. A cold start (i.e. the very first request an instance serves when Lambda scales up) takes 5 seconds! So we built another service that keeps pinging the PDF service to keep instances warm. With this, only 0.1% of requests take the full 5 second cold start time, and p95 (with our HTML, CSS, code and everything) is 365ms (response latency, not round trip time). However, after we built the warmer, AWS built a kind of warmer itself called proactive initialization, which we still haven’t optimized for. For deployment, we use a Lambda layer to avoid having to upload the same Chromium binary on every deployment.
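A warmer can be very small. The sketch below is an assumption about its shape, not our actual service: warmOnce, startWarmer and the invoke callback are hypothetical names, and invoke stands for whatever sends one lightweight request to the PDF service. The important bit is that the requests overlap in time, since a Lambda instance serves only one request at a time, so overlapping requests land on different instances.

```javascript
// Hypothetical warmer. `invoke(i)` sends one lightweight request to the PDF service
// and returns a promise (for example an HTTP call); it is not part of any library.
async function warmOnce(invoke, concurrency = 5) {
  // Start every request before awaiting any of them, so they are in flight together.
  const results = await Promise.allSettled(
    Array.from({ length: concurrency }, (_, i) => (async () => invoke(i))()),
  );
  return {
    ok: results.filter((r) => r.status === 'fulfilled').length,
    failed: results.filter((r) => r.status === 'rejected').length,
  };
}

function startWarmer(invoke, { concurrency = 5, intervalMs = 5 * 60 * 1000 } = {}) {
  const timer = setInterval(() => {
    // A failed warm-up round is not fatal; the next round will try again.
    warmOnce(invoke, concurrency).catch(() => {});
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for the warmer
  return () => clearInterval(timer);
}
```

The defaults (5 concurrent requests every 5 minutes) are the numbers we use, as mentioned at the end of this post; tune them to your own traffic.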

About #2, slow page / tab creation: it is better not to close a tab once it is created, and to re-use it for future requests. There is a bit of an issue in never closing tabs, though. They consume more and more memory as you keep re-using them, and they don’t free memory on their own. So at some point you need to close the tab and create a new one, especially in a non-serverless setup.
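One way to get both benefits is to re-use a tab but recycle it after a fixed number of uses. The following is only a sketch under assumptions: createPageRecycler is a made-up name, the limit of 100 uses is an arbitrary number (not something we measured), and it hands out a single page, so it fits the one-request-at-a-time case rather than concurrent callers.

```javascript
// Hypothetical helper: returns a function that hands out one reusable page and
// replaces it after `maxUses` requests. `browser` is expected to be a Puppeteer
// Browser; anything with an async newPage() works for this sketch.
function createPageRecycler(browser, maxUses = 100) {
  let page = null;
  let uses = 0;

  return async function getPage() {
    if (page && uses >= maxUses) {
      const old = page;
      page = null;
      // close() can fail (or hang, see the error handling section); ignore failures here.
      await old.close().catch(() => {});
    }
    if (!page) {
      page = await browser.newPage();
      uses = 0;
    }
    uses += 1;
    return page;
  };
}
```

Call getPage() at the start of every request instead of browser.newPage(). Most calls return the existing tab immediately, and only every maxUses-th call pays the tab creation cost.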

Optimization 2: Serve files from disk or memory

HTML needs static files such as CSS, fonts and images, all of which cause network requests. For a long time we had an HTTP server running that we pointed Puppeteer / Chromium to. However, this is slow and is another source of unreliability, as occasionally the server could become unresponsive under high load. And this setup seemed complex when the aim is just to serve HTML authored by my team! There is a better way. Puppeteer has network interception APIs, which one can use to serve static files from the local filesystem. Using this API also prevents CORS issues.

const fs = require('node:fs');
const nodePath = require('node:path');

const serverOrigin = 'http://localhost:8080';
const staticDir = nodePath.join(__dirname, '../static');
const staticURLPath = '/static';

const mimeTypes = {
  html: 'text/html',
  js: 'text/javascript',
  css: 'text/css',
  png: 'image/png',
  jpg: 'image/jpeg',
  jpeg: 'image/jpeg',
  gif: 'image/gif',
  svg: 'image/svg+xml',
  ttf: 'font/ttf',
  otf: 'font/otf',
  woff: 'font/woff',
  woff2: 'font/woff2',
};

function getContentTypeFromPath(filePath) {
  const extname = nodePath.extname(filePath).slice(1);
  return mimeTypes[extname];
}

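// Assumes `await page.setRequestInterception(true)` was called on this page beforehand;
// without it, request.respond() and request.continue() throw.
page.on('request', (request) => {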
  if (request.isInterceptResolutionHandled()) return;
  const requestUrl = new URL(request.url());
  const requestPath = requestUrl.pathname;
  const requestOrigin = requestUrl.origin;
  if (requestOrigin === serverOrigin && requestPath.startsWith(staticURLPath)) {
    const filePath = nodePath.join(staticDir, requestPath.slice(staticURLPath.length + 1));
    const contentType = getContentTypeFromPath(filePath);
    let fileContent;
    if (contentType) {
      try {
        fileContent = fs.readFileSync(filePath);
      } catch {
        // ignore
      }
    }
    if (fileContent) {
      request.respond({
        status: 200,
        contentType,
        headers: {
          'Cache-Control': 'max-age=600, stale-while-revalidate=300',
        },
        body: fileContent,
      });
    } else {
      request.respond({
        status: 404,
        contentType: 'text/html',
        body: 'File not found',
      });
    }
  } else {
    // We have external files as well that we should not intercept.
    request.continue();
  }
});

I am not sure if the Cache-Control header is necessary, as we are serving the file from disk. But what I can say is that for real network requests, Cache-Control is better than using base64 encoded data URI images / fonts, as Chromium seems to re-parse the base64 URI every time without caching, whereas with a cache-controlled file, from the second request onwards the file is fetched from the memory cache (likely already parsed and ready).

Optimization 3: Error handling and retries

Puppeteer can fail in all sorts of ways. page.goto() can time out, and page.close() can hang (be unresponsive). When a page can’t be closed, browser.close() could be used, but that could hang too. You need timeouts and retries everywhere. page.goto() has a timeout parameter, but page.close() doesn’t, so you need to use Promise.race() and setTimeout to force a timeout.

Retries get trickier when trying to serve concurrent requests from a single machine, as closing the browser means in-flight requests need to be retried as well. AWS Lambda, which only sends one request to a VM (execution context) at a time, simplifies error handling a bit, as one has to deal with only one page / tab.
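For the single page / tab case, a retry wrapper can stay quite simple. This is a minimal sketch, not our exact code: withRetries, attempt and reset are hypothetical names, and the default of 2 retries is an assumption.

```javascript
// Hypothetical retry helper. `attempt(i)` does the work (e.g. goto + pdf);
// `reset(error)` cleans up between attempts (e.g. closes the page or browser
// so that the next attempt starts from a fresh state).
async function withRetries(attempt, { retries = 2, reset = () => {} } = {}) {
  let lastError;
  for (let i = 0; i <= retries; i += 1) {
    try {
      return await attempt(i);
    } catch (error) {
      lastError = error;
      if (i < retries) {
        // A failing cleanup should not hide the original error or block the next attempt.
        await Promise.resolve()
          .then(() => reset(error))
          .catch(() => {});
      }
    }
  }
  throw lastError;
}
```

On a machine that serves concurrent requests, reset would also have to fail or re-queue the other in-flight requests that shared the same browser, which is exactly the part that gets tricky.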

The following is a snippet that just closes the browser.

// Utility to run a function within a timeout duration and throw or run a fallback function on timeout.
async function timeBoundExec(func, timeoutMs = 300, fallback, logError) {
  let timeout = null;
  await Promise.race([
    new Promise((resolve, reject) => {
      timeout = setTimeout(() => {
        timeout = null;
        if (fallback) {
          resolve(fallback());
        } else {
          reject(new Error('Timed out'));
        }
      }, timeoutMs);
    }),
    func()
      .then(() => clearTimeout(timeout))
      .catch((error) => {
        if (logError) logError(error);
        if (timeout) {
          clearTimeout(timeout);
          timeout = null;
          if (fallback) {
            return fallback();
          } else {
            throw error;
          }
        }
        // else case: Timeout already resolved the promise. We logged error already, nothing more to do.
      }),
  ]);
}

async function closeBrowser(browser) {
  // browser.close() hangs sometimes
  // the other way is to kill the process
  await timeBoundExec(
    () => browser.close(),
    300,
    () => browser.process()?.kill?.(9),
  );
}

More optimizations?

Apart from this, apply any regular web dev optimizations you can to the HTML, CSS, etc. Speed up your data fetching and HTML generation. Reduce your image size, and reduce JS size and execution time. I had an issue with a client’s font file that was 8 MB in size, which slowed down page load a lot. I had to trim it down (AKA font subsetting) to include only the languages that end users actually use.

With all these optimizations applied, we were able to generate 10k PDFs per day at a p95 of 365ms on AWS Lambda (response latency, not round trip time), at an AWS Lambda cost of $7.68 for the month of April 2024, for 430k invocations (remember that the invocations are higher than real requests, because we have a warmer that tries to keep 5 lambdas warm by sending 5 concurrent requests every 5 minutes). This cost goes up and down based on the actual number of requests per month. I should be adding API Gateway costs here as well, but it is hard to separate that cost from the rest of our requests (any ideas?). You can almost eliminate the API Gateway cost by using function URLs though. Use OAuth or HMAC tokens for authorization to prevent abuse of the function URL.
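If you go the HMAC route, the check can be done with node’s built-in crypto module. The token format below (a timestamp, a dot, then a hex HMAC of the timestamp and the request body) is something I made up for this sketch, as are the names signRequest and verifyRequest and the 5 minute validity window; it is not a standard and not necessarily what we run.

```javascript
const crypto = require('node:crypto');

// Hypothetical token format: "<unix-seconds>.<hex HMAC-SHA256 of '<unix-seconds>.<body>'>".
function signRequest(secret, body, nowSeconds = Math.floor(Date.now() / 1000)) {
  const mac = crypto
    .createHmac('sha256', secret)
    .update(`${nowSeconds}.${body}`)
    .digest('hex');
  return `${nowSeconds}.${mac}`;
}

function verifyRequest(
  secret,
  body,
  token,
  { maxAgeSeconds = 300, nowSeconds = Math.floor(Date.now() / 1000) } = {},
) {
  const [timestamp, mac] = String(token).split('.');
  // Reject malformed tokens early; a valid SHA-256 MAC is exactly 64 hex characters.
  if (!/^\d+$/.test(timestamp || '') || !/^[0-9a-f]{64}$/.test(mac || '')) return false;
  // Reject old (or far-future) tokens to limit replay of a captured request.
  if (Math.abs(nowSeconds - Number(timestamp)) > maxAgeSeconds) return false;
  const expected = crypto
    .createHmac('sha256', secret)
    .update(`${timestamp}.${body}`)
    .digest();
  // Constant-time comparison; both buffers are 32 bytes thanks to the check above.
  return crypto.timingSafeEqual(expected, Buffer.from(mac, 'hex'));
}
```

The caller sends the token in a header of its choice along with the request, and the function calls verifyRequest with the raw body before launching Chromium, so that unauthorized requests are rejected cheaply.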

That’s all for this post. You tell me what other optimizations you have applied in the comments below.
